Martin Kroeker
9ea30f3788
Replace ISMIN and ISAMIN kernels on all x86_64 platforms ( #2125 )
...
* Mark iamax_sse.S as unsuitable for MIN due to issue #2116
* Use iamax.S rather than iamax_sse.S for ISMIN/ISAMIN on all x86_64 as workaround for #2116
2019-05-09 14:42:36 +02:00
Martin Kroeker
b1561ecc68
Disable DGEMMINCOPY as well for now
...
#1955
2019-05-05 15:52:01 +02:00
Martin Kroeker
7ed8431527
Disable the SkyLakeX DGEMMITCOPY kernel as well
...
as a stopgap measure for https://github.com/numpy/numpy/issues/13401 as mentioned in #1955
2019-05-04 22:54:41 +02:00
Martin Kroeker
c04a729081
Add ?sum definitions for generic kernel
2019-03-31 13:55:49 +02:00
Martin Kroeker
9d717cb5ee
Add x86_64 implementation of ?sum
...
as trivial copy of ?asum with the fabs calls removed
2019-03-30 22:27:04 +01:00
Martin Kroeker
32c7063cb0
Merge pull request #2061 from martin-frbg/martin-frbg-patch-1
...
Disable the AVX512 DGEMM kernel (again)
2019-03-30 21:21:38 +01:00
Martin Kroeker
e608d4f7fe
Disable the AVX512 DGEMM kernel (again)
...
Due to as yet unresolved errors seen in #1955 and #2029
2019-03-13 22:10:28 +01:00
Celelibi
b7f59da42d
Fix crash in sgemm SSE/nano kernel on x86_64
...
Fix bug #2047 .
Signed-off-by: Celelibi <celelibi@gmail.com>
2019-03-07 16:55:13 +01:00
Andrew
6eee1beac5
move fix to right place
2019-02-24 20:41:02 +02:00
Martin Kroeker
e12cdf58ef
Merge pull request #2024 from martin-frbg/gcc9fixes4
...
Fix inline assembly constraints in Bulldozer TRSM kernels
2019-02-17 11:49:15 +01:00
Martin Kroeker
1860c9456d
Merge pull request #2023 from martin-frbg/gcc9fixes3
...
Fix inline assembly constraints in various x86_64 GEMVN kernels
2019-02-17 11:48:57 +01:00
Martin Kroeker
f9bb76d29a
Fix inline assembly constraints in Bulldozer TRSM kernels
...
rework indices to allow marking i,as and bs as both input and output (marked operand n1 as well for simplicity). For #2009
2019-02-16 20:06:48 +01:00
Martin Kroeker
efb9038f72
Fix inline assembly constraints
2019-02-16 18:46:17 +01:00
Martin Kroeker
e976557d29
Fix inline assembly constraints
...
rework indices to allow marking argument lda as input and output.
2019-02-16 18:36:39 +01:00
Martin Kroeker
9d8be15789
Fix inline assembly constraints
...
rework indices to allow marking argument lda4 as input and output. For #2009
2019-02-16 18:24:11 +01:00
Martin Kroeker
d752799a0f
Merge pull request #2021 from martin-frbg/gcc9fixes2
...
Fix wrong constraints in inline assembly of Haswell DTRSM kernel
2019-02-16 18:05:40 +01:00
Martin Kroeker
c26c0b77a7
Fix wrong constraints in inline assembly
...
for #2009
2019-02-15 15:08:16 +01:00
Martin Kroeker
1c6da2d03c
Merge pull request #2019 from martin-frbg/gcc9fixes
...
Fix unannounced modification of input operand 8 (lda4) in Haswell GEMVN microkernel
2019-02-15 15:02:54 +01:00
Martin Kroeker
4255a58cd2
Rename operands to put lda on the input/output constraint list
2019-02-15 10:10:04 +01:00
Martin Kroeker
46e415b140
Save and restore input argument 8 (lda4)
...
Fixes miscompilation with gcc9 -ftree-vectorize (related to issue #2009 )
2019-02-14 22:43:18 +01:00
Bart Oldeman
69a97ca7b9
dgemv_kernel_4x4(Haswell): add missing clobbers for xmm0,xmm1,xmm2,xmm3
...
This fixes a crash in dblat2 when OpenBLAS is compiled using
-march=znver1 -ftree-vectorize -O2
See also:
https://github.com/easybuilders/easybuild-easyconfigs/issues/7180
2019-02-14 16:27:58 +00:00
Martin Kroeker
ab1630f9fa
Fix declaration of arguments in inline assembly
...
Argument 0 is modified so should be input and output
2019-02-12 16:14:02 +01:00
Martin Kroeker
b824fa70eb
Fix declaration of assembly arguments in SSYMV and DSYMV microkernels
...
Arguments 0 and 1 are both input and output
2019-02-12 16:00:18 +01:00
Martin Kroeker
91481a3e4e
Fix declaration of input arguments in inline assembly
...
Argument 0 is modified as it doubles as a counter
2019-02-12 15:51:43 +01:00
Martin Kroeker
dc6ac9eab0
Fix declaration of input arguments in the x86_64 s/dGEMV_T and s/dGEMV_N kernels
...
Arguments 0 and 1 need to be tagged as both input and output
2019-02-12 15:33:48 +01:00
Martin Kroeker
32b0f1168e
Fix declaration of input arguments in the Sandybridge GER microkernels ( #1967 )
...
* Tag arguments 0 and 1 as both input and output
2019-01-18 08:11:39 +01:00
Martin Kroeker
b495e54310
Fix declaration of input arguments in the x86_64 SCAL microkernels ( #1966 )
...
* Tag arguments 0 and 1 as both input and output (see #1964 )
2019-01-18 08:11:07 +01:00
Martin Kroeker
d5e6940253
Fix declaration of input arguments in the x86_64 microkernels for DOT and AXPY ( #1965 )
...
* Tag operands 0 and 1 as both input and output
For #1964 (basically a continuation of coding problems first seen in #1292 )
2019-01-17 23:20:32 +01:00
Arjan van de Ven
795285c587
Fix thinko in skylake beta handling
...
casting ints is cheaper but it has a rounding, not memory casing effect, resulting in
invalid outcome
2018-12-24 18:49:50 +00:00
Arjan van de Ven
d321448a63
dgemm: use dgemm_ncopy_8_skylakex.c also for Haswell
...
The dgemm_ncopy_8_skylakex.c code is not avx512 specific and gives
a nice performance boost for medium sized matrices
2018-12-16 23:09:22 +00:00
Arjan van de Ven
c43331ad0a
dgemm: Use the skylakex beta function also for haswell
...
it's more efficient for certain tall/skinny matrices
2018-12-16 23:09:17 +00:00
Arjan van de Ven
69d206440a
Make the skylakex/haswell sgemm code compile and run even with compilers without avx2 support
2018-12-16 00:19:41 +00:00
Arjan van de Ven
0586899a10
Use sgemm_ncopy_4_skylakex.c also for Haswell
...
sgemm_ncopy_4_skylakex.c uses SSE transpose operations where the
real perf win happens; this also works great for Haswell.
This gives double digit percentage gains on small and skinny matrices
2018-12-15 13:49:19 +00:00
Arjan van de Ven
00dc09ad19
Use the skylake sgemm beta code also for haswell
...
with a few small changes it's possible to use the skylake sgemm code
also for haswell, this gives a modest gain (10% range) for smallish
matrixes but does wonders for very skinny matrixes
2018-12-15 13:49:13 +00:00
Arjan van de Ven
cdc668d82b
Add a "sgemm direct" mode for small matrixes
...
OpenBLAS has a fancy algorithm for copying the input data while laying
it out in a more CPU friendly memory layout.
This is great for large matrixes; the cost of the copy is easily
ammortized by the gains from the better memory layout.
But for small matrixes (on CPUs that can do efficient unaligned loads) this
copy can be a net loss.
This patch adds (for SKYLAKEX initially) a "sgemm direct" mode, that bypasses
the whole copy machinary for ALPHA=1/BETA=0/... standard arguments,
for small matrixes only.
What is small? For the non-threaded case this has been measured to be
in the M*N*K = 28 * 512 * 512 range, while in the threaded case it's
less, around M*N*K = 1 * 512 * 512
2018-12-13 13:47:31 +00:00
Martin Kroeker
701ea88347
Use p2align instead of align for OSX compatibility
...
fixes #1902
2018-12-03 13:06:43 +01:00
Andrew
19c4bdd8b3
Add return value so that freebsd system clang does not err out
2018-11-25 21:35:01 +01:00
Arjan van de Ven
dcc5d6291e
skylakex: Make the sgemm/dgemm beta code robust for a N=0 or M=0 case
...
in the threading code there are cases where N or M can become 0,
and the optimized beta code did not handle this well, leading
to a crash
during the audit for the crash a few edge conditions on the if statements
were found and fixed as well
2018-11-01 01:42:09 +00:00
Arjan van de Ven
55b244ca0d
enable the SGEMM/SKX C based kernel
...
In QA the final bug was found so now the sklyakex sgemm C based kernel can
be activated....
2018-10-12 09:30:35 +00:00
Arjan van de Ven
d4bad73834
Add a C+intrinsics version of the SGEMM/skylakex kernel
...
for most sizes this is 1.2x to 1.4x faster than the current code
2018-10-10 01:49:22 +00:00
Arjan van de Ven
582c589727
dgemm/skylakex: replace discrete mul/add with fma
...
very minor gains since it's not super hot code, but general principles
2018-10-06 23:13:26 +00:00
Arjan van de Ven
adbf6afa25
Add vector optimizations for ncopy as well for dgemm/skylakex
2018-10-06 21:18:12 +00:00
Arjan van de Ven
32bec8afbb
add a skylakex optimized dgemm beta function
2018-10-06 16:36:26 +00:00
Arjan van de Ven
20c5d668fe
dgemm/avx512 simplify and speed up the 4x4 kernel
2018-10-06 14:12:32 +00:00
Arjan van de Ven
6d43c51ccf
undo slow dgemm/skylake microoptimization
...
the compare is more costly than the work
2018-10-06 14:00:37 +00:00
Arjan van de Ven
d74dc39b0f
Add optimized *copy versions for skylakex
...
Add optimized n/t copy versions for skylakex; in the patch the
tcopy is also rewritten using intrinsics; the ncopy file
will be worked on in a future commit
2018-10-06 13:51:44 +00:00
Arjan van de Ven
66b43affbc
Add a 24x8 kernel to the skylakex dgemm implementation
...
Minor gains for small matrixes, but at 512x512 and above the gain
gets more significant.
2018-10-05 13:22:21 +00:00
Arjan van de Ven
1938819c25
skylake dgemm: Add a 16x8 kernel
...
The next step for the avx512 dgemm code is adding a 16x8 kernel.
In the 8x8 kernel, each FMA has a matching load (the broadcast);
in the 16x8 kernel we can reuse this load for 2 FMAs, which
in turn reduces pressure on the load ports of the CPU and gives
a nice performance boost (in the 25% range).
2018-10-05 13:11:35 +00:00
Martin Kroeker
b7496c3638
Function name needs to be CNAME, set from outside to allow suffixing for dynamic_arch
2018-10-04 19:14:59 +02:00
Arjan van de Ven
45fe8cb0c5
Create a AVX512 enabled version of DGEMM
...
This patch adds dgemm_kernel_4x8_skylakex.c which is
* dgemm_kernel_4x8_haswell.s converted to C + intrinsics
* 8x8 support added
* 8x8 kernel implemented using AVX512
Performance is a work in progress, but already shows a 10% - 20%
increase for a wide range of matrix sizes.
2018-10-03 14:45:25 +00:00