Commit Graph

753 Commits

Author SHA1 Message Date
Martin Kroeker
c9174ae8d7 Disable gcc's tree-vectorizer pass on all operating systems 2023-04-19 23:45:44 +02:00
Martin Kroeker
c2fe9cb91f Disable gcc's tree-vectorizer pass on all operating systems 2023-04-19 23:45:14 +02:00
Martin Kroeker
66b39b835c Disable gcc's tree-vectorizer pass on all operating systems 2023-04-19 23:44:45 +02:00
Martin Kroeker
bb6d6735bf Disable gcc's tree-vectorizer pass on all operating systems 2023-04-19 23:44:15 +02:00
Martin Kroeker
d18efaed20 Disable gcc's tree-vectorizer pass on all operating systems 2023-04-19 23:43:43 +02:00
Martin Kroeker
99f6d31ed5 Disable gcc's tree-vectorizer pass on all operating systems 2023-04-19 23:42:55 +02:00
Martin Kroeker
7de9335c56 Disable gcc's tree-vectorizer pass on all operating systems 2023-04-19 23:42:09 +02:00
Bart Oldeman
60e49b851c Fix typo in clobber list, should be xmm14 instead of ymm14. 2022-12-06 16:30:46 -05:00
Bart Oldeman
4afe1439a1 Fix skylake fallback kernel name for old compilers. 2022-12-06 16:09:54 -05:00
Bart Oldeman
5ceca1a4d8 Add sscal.c + microkernels for Haswell, Zen, Skylake and newer.
Unlike [dcz]scal, sscal still used the original GotoBLAS SSE code from scal_sse.S.
This code follows dscal as closely as possible, except for the inc_x > 1 code
for which a plain C loop is used much like the one in cscal.c, instead of an
adaptation of the SSE2 asm code of dscal.c (I tried but the performance wasn't
better than the plain C loop).
2022-12-06 14:05:49 -05:00
Bart Oldeman
5c3169ecd8 dscal: use ymm registers in Haswell microkernel
Using 256-bit registers in dscal makes this microkernel consistent with
cscal and zscal, and generally doubles performance if the vector fits in
L1 cache.
2022-12-01 07:48:05 -05:00
Martin Kroeker
f73cfb7e2c change line endings from CRLF to LF 2022-11-17 09:39:56 +01:00
Martin Kroeker
1688c7da43 change line endings from CRLF to LF 2022-11-16 22:24:01 +01:00
Bart Oldeman
6c1043eb41 Add [cz]scal microkernels for SKYLAKEX
These are as similar to dscal_microk_skylakex-2.c as possible
for consistency.

Note that before this change SKYLAKEX+ uses generic C functions for
cscal/zscal via commit 2271c350 from #2610 (which is masked by
commit 086d87a30). However now #3799 disables FMAs (in turn enabled
by `-march=skylake-avx512`) in the plain C code which fixes excessive
LAPACK test failures more nicely.
2022-11-09 08:57:03 -05:00
Bart Oldeman
e7e3aa2948 x86_64: prevent GCC and Clang from generating FMAs in cscal/zscal.
If e.g. -march=haswell is set in CFLAGS, GCC generates FMAs by default, which
is inconsistent with the microkernels, none of which use FMAs. These
inconsistencies cause a few failures in the LAPACK testcases, where
eigenvalue results with/without eigenvectors are compared.

Moreover using FMAs for multiplication of complex numbers can give surprising
results, see 22aa81f for more information.

This uses the same syntax as used in 22aa81f for zarch (s390x).
2022-10-27 18:16:43 -04:00
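A minimal sketch of the FMA suppression that commit e7e3aa2948 describes, under stated assumptions: the pragma spellings are the ones GCC and Clang document for disabling floating-point contraction, and they are assumed (not copied from the commit) to match the syntax it refers to. The helper `cscal_ref` and its interleaved real/imaginary layout are hypothetical, for illustration only.

```c
#include <stddef.h>

/* Ask the compiler not to contract a*b + c into a fused multiply-add,
 * even when CFLAGS such as -march=haswell make FMA instructions available.
 * Assumption: these are the standard GCC/Clang spellings, not necessarily
 * the exact lines used in the commit. */
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC optimize ("fp-contract=off")
#endif
#if defined(__clang__)
#pragma clang fp contract(off)
#endif

/* Hypothetical reference routine: scale n complex floats, stored as
 * interleaved (real, imaginary) pairs, by alpha_r + i*alpha_i. */
static void cscal_ref(size_t n, float alpha_r, float alpha_i, float *x)
{
    for (size_t i = 0; i < n; i++) {
        float re = x[2 * i];
        float im = x[2 * i + 1];
        /* Each product is rounded separately before the add/subtract. */
        x[2 * i]     = alpha_r * re - alpha_i * im;
        x[2 * i + 1] = alpha_r * im + alpha_i * re;
    }
}
```

With contraction off, the plain C path rounds every multiplication on its own, which is what keeps it consistent with microkernels that do not use FMAs.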
Martin Kroeker
101a2c77c3 Fix warnings 2022-09-15 09:19:19 +02:00
Martin Kroeker
739c3c44a7 Work around windows/osx gcc12 x86_64 tree-optimizer problem and add an osx/gcc12 build to Azure CI (#3745)
Add pragma to disable the gcc tree-optimizer for some x86_64 S and Z kernels with gcc12 on OSX or Windows
2022-09-03 15:01:22 +02:00
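The pragma-based workaround mentioned in commit 739c3c44a7 can be sketched roughly as follows. This is an illustration only: the version guard and the routine `sdot_ref` are assumptions, and the commit itself additionally restricts the pragma to OSX and Windows builds, which is not reproduced here.

```c
#include <stddef.h>

/* Turn off GCC's tree-vectorizer for the functions that follow.
 * Assumption: the guard on GCC 12 or newer mirrors the intent of the
 * commit; the real kernels may use a different condition. */
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ > 11
#pragma GCC optimize ("no-tree-vectorize")
#endif

/* Hypothetical scalar kernel that the compiler would otherwise be
 * free to auto-vectorize. */
static float sdot_ref(size_t n, const float *x, const float *y)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}
```

The pragma only affects functions defined after it in the same translation unit, so it can be limited to the kernels that trigger the problem.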
Martin Kroeker
dc49edd4e6 Revert "roll back DGEMM kernel ... for DYNAMIC_ARCH" 2022-05-20 11:23:30 +02:00
Caroline Newcombe
5cc1111383 fix unsafe read of Y in assembly kernel 2022-03-11 11:56:33 -06:00
Wangyang Guo
225683218c Small Matrix: use proper inline asm input constraint for AVX512 mask 2022-02-28 03:22:31 +00:00
Martin Kroeker
9c626e466e really fix definition of SHUFFLE_MAGIC_NO 2022-02-25 15:36:02 +01:00
Martin Kroeker
9d7429406f Declare SHUFFLE_MAGIC_NO as const to placate clang 2022-02-25 10:05:36 +01:00
Martin Kroeker
522f809825 Merge pull request #3542 from martin-frbg/issue3540
Fix compilation for CooperLake on Windows/clang
2022-02-24 00:00:00 +01:00
Mosè Giordano
abbc947edb Fix compilation of Skylake AVX512 kernels with GCC 6 2022-02-23 22:51:59 +00:00
Martin Kroeker
c62f8e2c01 Prevent compiler attempts to use k0 as mask register 2022-02-23 20:12:20 +01:00
Martin Kroeker
80eb581c83 Fix non-portable u_int64_t 2022-02-23 20:10:59 +01:00
Martin Kroeker
73ffabe6ba Guard uses of _mm512_reduce_add_p? 2022-02-23 20:06:14 +01:00
Martin Kroeker
7b146e590c fix function typecast 2021-12-24 20:01:52 +01:00
Martin Kroeker
e9a0e52201 fix function typecast 2021-12-24 20:00:50 +01:00
Martin Kroeker
d1ee6ff73f fix function typecasts 2021-12-21 18:45:28 +01:00
Martin Kroeker
5378046abd roll back DGEMM kernels to 4x8 when compiling for DYNAMIC_ARCH 2021-12-06 19:43:54 +01:00
Caroline Newcombe
feeb8283a5 Fix unsafe read during final iteration of zsymv_L_sse2.S 2021-11-19 14:29:32 -06:00
Wangyang Guo
63a103ba6e sbgemm: spr: disable small matrix path by default 2021-10-17 19:08:03 -07:00
Wangyang Guo
82194ea9d2 sbgemm: spr: implement otcopy_16 2021-10-17 19:08:03 -07:00
Wangyang Guo
8632380a96 sbgemm: spr: reuse ncopy_16 from cooperlake as incopy 2021-10-17 19:08:03 -07:00
Wangyang Guo
6bc8204ce5 sbgemm: spr: optimization for tmp_c buffer 2021-10-17 19:08:03 -07:00
Wangyang Guo
f018aa342a sbgemm: spr: kernel handle alpha != 1.0 2021-10-17 19:08:03 -07:00
Wangyang Guo
a52456b168 sbgemm: spr: oncopy: use tile load/store instead 2021-10-17 19:08:03 -07:00
Wangyang Guo
f2485352a6 sbgemm: spr: only load A once in tail_k handling 2021-10-17 19:08:03 -07:00
Wangyang Guo
9ab33228bb sbgemm: spr: process k2 and odd k at the same time 2021-10-17 19:08:03 -07:00
Wangyang Guo
10d52646e2 sbgemm: spr: oncopy: avoid handling too many pointers at a time 2021-10-17 19:08:03 -07:00
Wangyang Guo
88154ed02d sbgemm: spr: reduce tile conf loading by separate tail k handling 2021-10-17 19:08:03 -07:00
Wangyang Guo
a70bfb52d5 sbgemm: spr: kernel works for NN case when alpha is 1.0 2021-10-17 19:08:03 -07:00
Wangyang Guo
6051c86741 sbgemm: spr: kernel works for m32 in NN case 2021-10-17 19:08:03 -07:00
Wangyang Guo
d0b253ac6e sbgemm: spr: implement oncopy_16 2021-10-17 19:08:03 -07:00
Wangyang Guo
1d48b7cb16 sbgemm: spr: add dummy source files 2021-10-17 19:08:03 -07:00
Wangyang Guo
3dc6052c7e initial support for Sapphire Rapids platform 2021-10-12 01:30:40 -07:00
Wangyang Guo
ee5ca8a328 x86_64: BFLOAT16: fix build warning 2021-09-28 18:30:06 +08:00
Martin Kroeker
8dfa61a61c Initialize abs_mask1 with itself to silence a gcc warning 2021-09-15 22:11:35 +02:00
Martin Kroeker
99aa10b3ff Initialize abs_mask1 with itself to silence a gcc warning
actual initialization is via the _mm_cmpeq_epi8, which I've seen claimed to be the fastest way to set an xmm register to all 1s
2021-09-15 22:10:43 +02:00
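The all-ones trick referred to in commit 99aa10b3ff can be sketched with SSE2 intrinsics as below. It assumes an x86 target with SSE2. The helper names `all_ones` and `abs_ps` are hypothetical, and this sketch starts from a zeroed register rather than the self-initialized variable the commit message describes.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: build a 128-bit vector with every bit set.
 * Comparing a register with itself for byte equality yields 0xFF
 * in every byte, whatever the register held before. */
static __m128i all_ones(void)
{
    __m128i v = _mm_setzero_si128();
    return _mm_cmpeq_epi8(v, v);
}

/* Hypothetical helper: absolute value of four packed floats, using a
 * mask derived from the all-ones vector. */
static __m128 abs_ps(__m128 x)
{
    /* Shift each 32-bit lane right by one bit: 0x7FFFFFFF per lane. */
    __m128i mask = _mm_srli_epi32(all_ones(), 1);
    /* Clearing the top bit of each lane removes the sign. */
    return _mm_and_ps(x, _mm_castsi128_ps(mask));
}
```

Deriving the mask in registers avoids loading a constant from memory, which is presumably why the idiom is considered fast.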