wjc404
1b980001dd
Update zgemm_kernel_4x2_haswell.c
2020-02-26 18:38:12 +08:00
wjc404
2515e1152f
Update cgemm_kernel_8x2_haswell.c
2020-02-26 18:36:54 +08:00
wjc404
903854c168
Add files via upload
2020-02-22 23:40:02 +08:00
wjc404
a2ff577a30
Update KERNEL.ZEN
2020-02-22 23:39:43 +08:00
wjc404
97a32cb0a5
Update KERNEL.HASWELL
2020-02-22 23:39:20 +08:00
Martin Liska
aeea14ee40
Come up with LOAD_AND_COMPARE_TO_MXX macro in iamax_sse.S.
2020-02-17 09:01:53 +01:00
Martin Liska
18bcc36a69
Fix implementation of iamax_sse.S as reported in #2116 .
...
The was a typo in iamax_sse.S where one of the comparison
was cmpeqps instead of cmpeqss. That misdetected index
for sequences where the minimum value was 0.
2020-02-17 09:01:53 +01:00
wjc404
f566787e6e
Update KERNEL.SKYLAKEX
2020-02-16 22:58:44 +08:00
wjc404
e3368cbf18
AVX512 STRMM kernel
2020-02-16 22:58:00 +08:00
Bart Oldeman
7ea5e07d1c
Fix inline asm in dscal: mark x, x1 as clobbered. Fixes #2408
...
The leaq instructions in dscal_kernel_inc_8 modify x and x1 so they
must be declared as input/output constraints, otherwise the compiler
may assume the corresponding registers are not modified.
2020-02-12 14:11:44 +00:00
wjc404
3447d04eaf
Update dgemm_kernel_16x2_skylakex.c
2020-02-06 02:14:10 +00:00
wjc404
8b5cdcc64c
Update sgemm_kernel_8x4_haswell.c
2020-02-06 01:47:46 +00:00
wjc404
4e00d96a78
Update dgemm_kernel_16x2_skylakex.c
2020-02-06 01:46:36 +00:00
wjc404
096da2f51a
Update dgemm_kernel_16x2_skylakex.c
2020-02-05 13:36:57 +08:00
wjc404
081b188529
Update KERNEL.SKYLAKEX
2020-02-03 21:38:08 +08:00
wjc404
8019e70211
AVX512 16x2 DGEMM kernel
2020-02-03 21:32:56 +08:00
wjc404
e5dcdeb550
Update sgemm_direct_skylakex.c
2020-01-13 16:59:23 +08:00
wjc404
952cc2ba38
Update sgemm_kernel_16x4_skylakex_2.c
2020-01-13 16:58:54 +08:00
wjc404
feaafbedd3
make skylakex sgemm code more friendly for readers
...
BTW some kernels were adjusted to improve performance
2020-01-13 16:28:41 +08:00
wjc404
3a100b2797
Update KERNEL.SKYLAKEX
2020-01-09 13:48:41 +08:00
wjc404
bd4c032f52
Update sgemm_kernel_8x4_haswell.c
2020-01-07 11:22:46 +08:00
wjc404
9dc9b7b95e
Update sgemm_kernel_8x4_haswell.c
2020-01-06 20:11:36 +08:00
wjc404
92b10212de
optimize AVX2 SGEMM
2020-01-06 12:11:21 +08:00
wjc404
b73bf01378
optimize AVX2 SGEMM
2020-01-06 12:09:14 +08:00
wjc404
eb3c9f1db9
optimize AVX2 SGEMM
2020-01-06 12:07:02 +08:00
wjc404
a0f0a802fc
Update zgemm3m_kernel_4x4_haswell.c
2019-12-30 17:33:42 +08:00
wjc404
700fe5b5ee
Add files via upload
2019-12-30 17:18:59 +08:00
wjc404
f60840c420
Update KERNEL.ZEN
2019-12-30 16:04:23 +08:00
wjc404
109e18cd96
Update KERNEL.HASWELL
2019-12-30 16:03:24 +08:00
wjc404
ae1579be13
Create zgemm3m_kernel_4x4_haswell.c
2019-12-30 16:02:51 +08:00
wjc404
cd765f094b
Update cgemm3m_kernel_8x4_haswell.c
2019-12-27 18:23:29 +08:00
wjc404
3a66c8cac1
Update KERNEL.ZEN
2019-12-27 18:04:08 +08:00
wjc404
ed9af2f7da
Update KERNEL.HASWELL
2019-12-27 18:01:38 +08:00
wjc404
5fd1edead9
Create cgemm3m_kernel_8x4_haswell.c
2019-12-27 18:00:55 +08:00
wjc404
eeecd623d8
Update cgemm_kernel_8x2_haswell.c
2019-12-24 00:40:16 +08:00
wjc404
2cd9306bb5
Update KERNEL.ZEN
2019-12-23 23:42:30 +08:00
wjc404
c418c81224
Update KERNEL.HASWELL
2019-12-23 23:41:44 +08:00
wjc404
025741f16a
Fast Haswell CGEMM kernel
2019-12-23 23:40:03 +08:00
wjc404
f41d52665d
Fast Haswell ZGEMM kernel
2019-12-21 14:37:06 +08:00
wjc404
d573d24de7
Fast Haswell ZGEMM kernel
2019-12-21 14:35:15 +08:00
Isuru Fernando
b863b32ac5
Workaround an ICE in clang 9.0.0
...
This bug is not there in 8.x nor in the 9.0 daily snapshot.
2019-12-01 12:59:46 -06:00
wjc404
934e601e93
Update dgemm_kernel_4x8_skylakex_2.c
2019-11-28 19:56:35 +08:00
wjc404
eb1e9c8c92
some optimizations
2019-11-26 14:12:20 +08:00
Wang, Long
bfb5fbdb4d
revised fix windows compatible for #2313
...
Signed-off-by: Wang, Long <long1.wang@intel.com>
2019-11-21 10:22:58 +08:00
Wang, Long
1191db1a49
For the sake of windows compatible, used "unsigned long long" to ensure 64-bit length
...
Signed-off-by: Wang, Long <long1.wang@intel.com>
2019-11-20 21:30:47 +08:00
Wang, Long
0caf1434c9
Fix the integer overflow issue for large matrix size
...
For large matrix, e.g. M=N=K, and M>1290, int mnk=M*N*K will overflow.
This will lead to wrong branching to single-threading. The performance
is downgraded significantly.
Signed-off-by: Wang, Long <long1.wang@intel.com>
2019-11-20 14:11:17 +08:00
wjc404
819e852ae7
AVX512 CGEMM & ZGEMM kernels
...
96-99% 1-thread performance of MKL2018
2019-11-11 20:04:52 +08:00
wjc404
836c414e22
optimizations of software prefetching
2019-11-05 13:36:56 +08:00
wjc404
430c11e135
Add files via upload
2019-11-04 20:10:12 +08:00
wjc404
fbacd2605d
optimizations via software prefetches
2019-11-04 19:37:19 +08:00
wjc404
1df9a2013d
new sgemm kernel for skylakex
2019-11-02 00:00:48 +08:00
wjc404
6ff013bae0
native support for icopy_4
...
90% MKL 1-thread performance.
2019-10-19 03:54:44 +08:00
wjc404
0d669e04bb
Update dgemm_kernel_8x8_skylakex.c
2019-10-18 15:00:17 +08:00
wjc404
17cdd9f9e1
some correction
2019-10-18 14:58:07 +08:00
wjc404
6bcb06fcb1
make further changes to icopy_8 easier
2019-10-18 10:47:31 +08:00
wjc404
b7315f8401
Add files via upload
2019-10-16 19:23:36 +08:00
wjc404
9b19e9e1b0
Update dgemm_kernel_8x8_skylakex.c
2019-10-16 10:14:51 +08:00
wjc404
6bd67ddbab
Update dgemm_kernel_8x8_skylakex.c
2019-10-16 03:20:08 +08:00
wjc404
844629af57
Add files via upload
2019-10-16 02:00:34 +08:00
Martin Kroeker
11c59acfb1
Keep both PGI/SUN and default code paths to avoid breaking Clang/WIndows
2019-08-28 18:07:44 +02:00
Martin Kroeker
3a55dca2dc
Make x86_64 zdot compile with PGI and Sun C again
...
broken by #2222 as CREAL,CIMAG do not expand to a valid lvalue with these compilers
2019-08-28 11:35:31 +02:00
Martin Kroeker
9ef96b32a6
Add multithreading support to the x86_64 zdot kernel ( #2222 )
...
* Add multithreading support
copied from the ThunderX2T99 kernel. For #2221
2019-08-15 22:09:12 +02:00
Martin Kroeker
dccff2e785
Merge pull request #2206 from martin-frbg/zen-dtrmm
...
Replace vpermpd with vpermilpd in the Haswell DTRMM kernel
2019-08-09 07:55:20 +02:00
Martin Kroeker
5c3458a6e7
Merge pull request #2199 from martin-frbg/zen-dtrsm
...
Replace most vpermpd calls in the Haswell DTRSM_RN kernel
2019-08-09 07:55:02 +02:00
Martin Kroeker
acf6002ab2
Replace most vpermpd calls in the Haswell DTRSM_RN kernel
2019-08-03 12:40:13 +02:00
Martin Kroeker
2dfb804cb9
Replace vpermpd with vpermilpd in the Haswell DTRMM kernel
...
to improve performance on AMD Zen (#2180 ) applying wjc404's improvement of the DGEMM kernel from #2186
2019-07-28 23:17:28 +02:00
Martin Kroeker
4c153ec9da
Merge pull request #2196 from wjc404/develop
...
Add vbroadcastsd kernel to dgemm_kernel_4x8_haswell.S
2019-07-28 23:11:40 +02:00
wjc404
7eecd8e39c
Add files via upload
2019-07-28 07:39:09 +08:00
Martin Kroeker
7b0b7c11d2
Merge pull request #2190 from martin-frbg/zdot-zen
...
Replace vpermpd with vpermilpd in the Haswell/Zen zdot microkernel
2019-07-23 16:15:08 +02:00
Martin Kroeker
28e96458e5
Replace vpermpd with vpermilpd
...
to improve performance on Zen/Zen2 (as demonstrated by wjc404 in #2180 )
2019-07-22 08:28:16 +02:00
wjc404
95fb98f556
Update dgemm_kernel_4x8_haswell.S
2019-07-21 01:10:32 +08:00
wjc404
4801c6d36b
Update dgemm_kernel_4x8_haswell.S
2019-07-21 00:47:45 +08:00
wjc404
9440fa607d
Add files via upload
2019-07-20 22:08:22 +08:00
wjc404
94db259e5b
Add files via upload
2019-07-20 22:04:41 +08:00
wjc404
f49f8047ac
Add files via upload
2019-07-20 14:33:37 +08:00
wjc404
825777faab
Update dgemm_kernel_4x8_haswell.S
2019-07-19 23:58:24 +08:00
wjc404
9c89757562
Add files via upload
2019-07-19 23:47:58 +08:00
wjc404
9b04baeaee
Update dgemm_kernel_4x8_haswell.S
2019-07-17 23:50:03 +08:00
wjc404
8a074b3965
Update dgemm_kernel_4x8_haswell.S
2019-07-17 23:47:30 +08:00
wjc404
211ab03b14
Update dgemm_kernel_4x8_haswell.S
2019-07-17 22:39:15 +08:00
wjc404
1733f927e6
Update dgemm_kernel_4x8_haswell.S
2019-07-17 21:27:41 +08:00
wjc404
182b06d6ad
Update dgemm_kernel_4x8_haswell.S
2019-07-17 17:02:35 +08:00
wjc404
7a9050d681
Update dgemm_kernel_4x8_haswell.S
2019-07-17 00:55:06 +08:00
wjc404
0ba29fd262
Update dgemm_kernel_4x8_haswell.S for zen2
...
replaced a bunch of vpermpd instructions with vpermilpd and vperm2f128
2019-07-17 00:46:51 +08:00
Martin Kroeker
9ea30f3788
Replace ISMIN and ISAMIN kernels on all x86_64 platforms ( #2125 )
...
* Mark iamax_sse.S as unsuitable for MIN due to issue #2116
* Use iamax.S rather than iamax_sse.S for ISMIN/ISAMIN on all x86_64 as workaround for #2116
2019-05-09 14:42:36 +02:00
Martin Kroeker
b1561ecc68
Disable DGEMMINCOPY as well for now
...
#1955
2019-05-05 15:52:01 +02:00
Martin Kroeker
7ed8431527
Disable the SkyLakeX DGEMMITCOPY kernel as well
...
as a stopgap measure for https://github.com/numpy/numpy/issues/13401 as mentioned in #1955
2019-05-04 22:54:41 +02:00
Martin Kroeker
c04a729081
Add ?sum definitions for generic kernel
2019-03-31 13:55:49 +02:00
Martin Kroeker
9d717cb5ee
Add x86_64 implementation of ?sum
...
as trivial copy of ?asum with the fabs calls removed
2019-03-30 22:27:04 +01:00
Martin Kroeker
32c7063cb0
Merge pull request #2061 from martin-frbg/martin-frbg-patch-1
...
Disable the AVX512 DGEMM kernel (again)
2019-03-30 21:21:38 +01:00
Martin Kroeker
e608d4f7fe
Disable the AVX512 DGEMM kernel (again)
...
Due to as yet unresolved errors seen in #1955 and #2029
2019-03-13 22:10:28 +01:00
Celelibi
b7f59da42d
Fix crash in sgemm SSE/nano kernel on x86_64
...
Fix bug #2047 .
Signed-off-by: Celelibi <celelibi@gmail.com>
2019-03-07 16:55:13 +01:00
Andrew
6eee1beac5
move fix to right place
2019-02-24 20:41:02 +02:00
Martin Kroeker
e12cdf58ef
Merge pull request #2024 from martin-frbg/gcc9fixes4
...
Fix inline assembly constraints in Bulldozer TRSM kernels
2019-02-17 11:49:15 +01:00
Martin Kroeker
1860c9456d
Merge pull request #2023 from martin-frbg/gcc9fixes3
...
Fix inline assembly constraints in various x86_64 GEMVN kernels
2019-02-17 11:48:57 +01:00
Martin Kroeker
f9bb76d29a
Fix inline assembly constraints in Bulldozer TRSM kernels
...
rework indices to allow marking i,as and bs as both input and output (marked operand n1 as well for simplicity). For #2009
2019-02-16 20:06:48 +01:00
Martin Kroeker
efb9038f72
Fix inline assembly constraints
2019-02-16 18:46:17 +01:00
Martin Kroeker
e976557d29
Fix inline assembly constraints
...
rework indices to allow marking argument lda as input and output.
2019-02-16 18:36:39 +01:00
Martin Kroeker
9d8be15789
Fix inline assembly constraints
...
rework indices to allow marking argument lda4 as input and output. For #2009
2019-02-16 18:24:11 +01:00
Martin Kroeker
d752799a0f
Merge pull request #2021 from martin-frbg/gcc9fixes2
...
Fix wrong constraints in inline assembly of Haswell DTRSM kernel
2019-02-16 18:05:40 +01:00
Martin Kroeker
c26c0b77a7
Fix wrong constraints in inline assembly
...
for #2009
2019-02-15 15:08:16 +01:00
Martin Kroeker
1c6da2d03c
Merge pull request #2019 from martin-frbg/gcc9fixes
...
Fix unannounced modification of input operand 8 (lda4) in Haswell GEMVN microkernel
2019-02-15 15:02:54 +01:00
Martin Kroeker
4255a58cd2
Rename operands to put lda on the input/output constraint list
2019-02-15 10:10:04 +01:00
Martin Kroeker
46e415b140
Save and restore input argument 8 (lda4)
...
Fixes miscompilation with gcc9 -ftree-vectorize (related to issue #2009 )
2019-02-14 22:43:18 +01:00
Bart Oldeman
69a97ca7b9
dgemv_kernel_4x4(Haswell): add missing clobbers for xmm0,xmm1,xmm2,xmm3
...
This fixes a crash in dblat2 when OpenBLAS is compiled using
-march=znver1 -ftree-vectorize -O2
See also:
https://github.com/easybuilders/easybuild-easyconfigs/issues/7180
2019-02-14 16:27:58 +00:00
Martin Kroeker
ab1630f9fa
Fix declaration of arguments in inline assembly
...
Argument 0 is modified so should be input and output
2019-02-12 16:14:02 +01:00
Martin Kroeker
b824fa70eb
Fix declaration of assembly arguments in SSYMV and DSYMV microkernels
...
Arguments 0 and 1 are both input and output
2019-02-12 16:00:18 +01:00
Martin Kroeker
91481a3e4e
Fix declaration of input arguments in inline assembly
...
Argument 0 is modified as it doubles as a counter
2019-02-12 15:51:43 +01:00
Martin Kroeker
dc6ac9eab0
Fix declaration of input arguments in the x86_64 s/dGEMV_T and s/dGEMV_N kernels
...
Arguments 0 and 1 need to be tagged as both input and output
2019-02-12 15:33:48 +01:00
Martin Kroeker
32b0f1168e
Fix declaration of input arguments in the Sandybridge GER microkernels ( #1967 )
...
* Tag arguments 0 and 1 as both input and output
2019-01-18 08:11:39 +01:00
Martin Kroeker
b495e54310
Fix declaration of input arguments in the x86_64 SCAL microkernels ( #1966 )
...
* Tag arguments 0 and 1 as both input and output (see #1964 )
2019-01-18 08:11:07 +01:00
Martin Kroeker
d5e6940253
Fix declaration of input arguments in the x86_64 microkernels for DOT and AXPY ( #1965 )
...
* Tag operands 0 and 1 as both input and output
For #1964 (basically a continuation of coding problems first seen in #1292 )
2019-01-17 23:20:32 +01:00
Arjan van de Ven
795285c587
Fix thinko in skylake beta handling
...
casting ints is cheaper but it has a rounding, not memory casing effect, resulting in
invalid outcome
2018-12-24 18:49:50 +00:00
Arjan van de Ven
d321448a63
dgemm: use dgemm_ncopy_8_skylakex.c also for Haswell
...
The dgemm_ncopy_8_skylakex.c code is not avx512 specific and gives
a nice performance boost for medium sized matrices
2018-12-16 23:09:22 +00:00
Arjan van de Ven
c43331ad0a
dgemm: Use the skylakex beta function also for haswell
...
it's more efficient for certain tall/skinny matrices
2018-12-16 23:09:17 +00:00
Arjan van de Ven
69d206440a
Make the skylakex/haswell sgemm code compile and run even with compilers without avx2 support
2018-12-16 00:19:41 +00:00
Arjan van de Ven
0586899a10
Use sgemm_ncopy_4_skylakex.c also for Haswell
...
sgemm_ncopy_4_skylakex.c uses SSE transpose operations where the
real perf win happens; this also works great for Haswell.
This gives double digit percentage gains on small and skinny matrices
2018-12-15 13:49:19 +00:00
Arjan van de Ven
00dc09ad19
Use the skylake sgemm beta code also for haswell
...
with a few small changes it's possible to use the skylake sgemm code
also for haswell, this gives a modest gain (10% range) for smallish
matrixes but does wonders for very skinny matrixes
2018-12-15 13:49:13 +00:00
Arjan van de Ven
cdc668d82b
Add a "sgemm direct" mode for small matrixes
...
OpenBLAS has a fancy algorithm for copying the input data while laying
it out in a more CPU friendly memory layout.
This is great for large matrixes; the cost of the copy is easily
ammortized by the gains from the better memory layout.
But for small matrixes (on CPUs that can do efficient unaligned loads) this
copy can be a net loss.
This patch adds (for SKYLAKEX initially) a "sgemm direct" mode, that bypasses
the whole copy machinary for ALPHA=1/BETA=0/... standard arguments,
for small matrixes only.
What is small? For the non-threaded case this has been measured to be
in the M*N*K = 28 * 512 * 512 range, while in the threaded case it's
less, around M*N*K = 1 * 512 * 512
2018-12-13 13:47:31 +00:00
Martin Kroeker
701ea88347
Use p2align instead of align for OSX compatibility
...
fixes #1902
2018-12-03 13:06:43 +01:00
Andrew
19c4bdd8b3
Add return value so that freebsd system clang does not err out
2018-11-25 21:35:01 +01:00
Arjan van de Ven
dcc5d6291e
skylakex: Make the sgemm/dgemm beta code robust for a N=0 or M=0 case
...
in the threading code there are cases where N or M can become 0,
and the optimized beta code did not handle this well, leading
to a crash
during the audit for the crash a few edge conditions on the if statements
were found and fixed as well
2018-11-01 01:42:09 +00:00
Arjan van de Ven
55b244ca0d
enable the SGEMM/SKX C based kernel
...
In QA the final bug was found so now the sklyakex sgemm C based kernel can
be activated....
2018-10-12 09:30:35 +00:00
Arjan van de Ven
d4bad73834
Add a C+intrinsics version of the SGEMM/skylakex kernel
...
for most sizes this is 1.2x to 1.4x faster than the current code
2018-10-10 01:49:22 +00:00
Arjan van de Ven
582c589727
dgemm/skylakex: replace discrete mul/add with fma
...
very minor gains since it's not super hot code, but general principles
2018-10-06 23:13:26 +00:00
Arjan van de Ven
adbf6afa25
Add vector optimizations for ncopy as well for dgemm/skylakex
2018-10-06 21:18:12 +00:00
Arjan van de Ven
32bec8afbb
add a skylakex optimized dgemm beta function
2018-10-06 16:36:26 +00:00
Arjan van de Ven
20c5d668fe
dgemm/avx512 simplify and speed up the 4x4 kernel
2018-10-06 14:12:32 +00:00
Arjan van de Ven
6d43c51ccf
undo slow dgemm/skylake microoptimization
...
the compare is more costly than the work
2018-10-06 14:00:37 +00:00
Arjan van de Ven
d74dc39b0f
Add optimized *copy versions for skylakex
...
Add optimized n/t copy versions for skylakex; in the patch the
tcopy is also rewritten using intrinsics; the ncopy file
will be worked on in a future commit
2018-10-06 13:51:44 +00:00
Arjan van de Ven
66b43affbc
Add a 24x8 kernel to the skylakex dgemm implementation
...
Minor gains for small matrixes, but at 512x512 and above the gain
gets more significant.
2018-10-05 13:22:21 +00:00
Arjan van de Ven
1938819c25
skylake dgemm: Add a 16x8 kernel
...
The next step for the avx512 dgemm code is adding a 16x8 kernel.
In the 8x8 kernel, each FMA has a matching load (the broadcast);
in the 16x8 kernel we can reuse this load for 2 FMAs, which
in turn reduces pressure on the load ports of the CPU and gives
a nice performance boost (in the 25% range).
2018-10-05 13:11:35 +00:00
Martin Kroeker
b7496c3638
Function name needs to be CNAME, set from outside to allow suffixing for dynamic_arch
2018-10-04 19:14:59 +02:00
Arjan van de Ven
45fe8cb0c5
Create a AVX512 enabled version of DGEMM
...
This patch adds dgemm_kernel_4x8_skylakex.c which is
* dgemm_kernel_4x8_haswell.s converted to C + intrinsics
* 8x8 support added
* 8x8 kernel implemented using AVX512
Performance is a work in progress, but already shows a 10% - 20%
increase for a wide range of matrix sizes.
2018-10-03 14:45:25 +00:00
Martin Kroeker
375dff54fc
Merge pull request #1733 from fenrus75/dsymv
...
Add an AVX512 enabled DSYMV (L) function
2018-08-12 18:18:36 +02:00
Martin Kroeker
a5f165275a
Merge pull request #1732 from fenrus75/dgemv
...
Add an AVX512 enabled DGEMV (n) function
2018-08-12 18:17:42 +02:00
Martin Kroeker
8c13aa495a
Merge pull request #1730 from fenrus75/fix-sdot
...
Fix typo in sdot function
2018-08-12 18:17:01 +02:00
Arjan van de Ven
9bec34cb67
Add an AVX512 enabled DSYMV (L) function
...
written in C intrinsics for best readability.
(the same C code works for Haswell as well)
For logistical reasons the code falls back to the existing
haswell AVX2 implementation if the GCC or LLVM compiler is not new enough
2018-08-11 17:46:24 +00:00
Arjan van de Ven
87bebdbd8a
Add an AVX512 enabled DGEMV (n) function
...
written in C intrinsics for best readability.
(the same C code works for Haswell as well)
For logistical reasons the code falls back to the existing
haswell AVX2 implementation if the GCC or LLVM compiler is not new enough
2018-08-11 17:38:12 +00:00
Arjan van de Ven
36add7570a
Fix typo in sdot function
...
it looks like my previous pull request was short the final commit;
fix a typo in sdot
2018-08-11 17:16:45 +00:00
Arjan van de Ven
cacacc8007
Add an AVX512 enabled DSCAL function
...
written in C intrinsics for best readability.
(the same C code works for Haswell as well)
For logistical reasons the code falls back to the existing
haswell AVX2 implementation if the GCC or LLVM compiler is not new enough
2018-08-11 17:14:57 +00:00
Martin Kroeker
1a00ef3d27
Merge pull request #1725 from fenrus75/axpy
...
Add a AVX512 enabled SAXPY/DAXPY functions
2018-08-11 11:01:20 +02:00
Arjan van de Ven
2e99873ff7
Add a AVX512 enabled SAXPY/DAXPY functions
...
written in C intrinsics for best readability.
(the same C code works for Haswell as well)
For logistical reasons the code falls back to the existing
haswell AVX2 implementation if the GCC or LLVM compiler is not new enough
2018-08-10 02:58:32 +00:00
Arjan van de Ven
00abaa865b
Add an AVX512 enabled SDOT function
...
written in C intrinsics for best readability.
(the same C code works for Haswell as well)
For logistical reasons the code falls back to the existing
haswell AVX2 implementation if the GCC or LLVM compiler is not new enough
2018-08-10 02:33:43 +00:00
Arjan van de Ven
7932ff3ea9
Add an AVX512 enabled DDOT function
...
written in C intrinsics for best readability.
(the same C code works for Haswell as well)
For logistical reasons the code falls back to the existing
haswell AVX2 implementation if the GCC or LLVM compiler is not new enough
2018-08-09 03:55:52 +00:00
Martin Kroeker
6e54b0a027
Disable the 16x2 DTRMM kernel on SkylakeX as well
2018-06-30 17:31:06 +02:00
Martin Kroeker
f0a8dc2eec
Disable the AVX512 DGEMM kernel for now
...
due to #1643
2018-06-30 11:34:48 +02:00
Craig Donner
c2545b0fd6
Fixed a few more unnecessary calls to num_cpu_avail.
...
I don't have as many benchmarks for these as for gemm, but it should still
make a difference for small matrices.
2018-06-11 10:17:16 +01:00
Arjan van de Ven
89372e0993
Use AVX512 also for DGEMM
...
this required switching to the generic gemm_beta code (which is faster anyway on SKX)
for both DGEMM and SGEMM
Performance for the not-retuned version is in the 30% range
2018-06-03 22:17:27 +00:00
Arjan van de Ven
99c7bba8e4
Initial support for SkylakeX / AVX512
...
This patch adds the basic infrastructure for adding the SkylakeX (Intel Skylake server)
target. The SkylakeX target will use the AVX512 (AVX512VL level) instruction set,
which brings 2 basic things:
1) 512 bit wide SIMD (2x width of AVX2)
2) 32 SIMD registers (2x the number on AVX2)
This initial patch only contains a trivial transofrmation of the Haswell SGEMM kernel
to AVX512VL; more will follow later but this patch aims to get the infrastructure
in place for this "later".
Full performance tuning has not been done yet; with more registers and wider SIMD
it's in theory possible to retune the kernels but even without that there's an
interesting enough performance increase (30-40% range) with just this change.
2018-06-03 07:58:52 +00:00
Martin Kroeker
840e01061f
Merge pull request #1491 from martin-frbg/ddot_mt
...
Add multithreading support for Haswell DDOT
2018-03-27 21:43:05 +02:00
Martin Kroeker
a55694dd5b
Declare dot_compute static to avoid conflicts in multiarch builds
2018-03-16 22:23:36 +01:00
Martin Kroeker
85a41e9cdb
Add multithreading support for Haswell DDOT
...
copied from ashwinyes' implementation in dot_thunderx2t99.c
2018-03-16 16:58:47 +01:00
Martin Kroeker
81215711a2
Re-enable DAXPY microkernels for x86_64
...
as the inaccuracies seen in the original testcase for #1332 appear to be due to an artefact that amplifies the very small rounding differences between FMA and discrete multiply+add
2018-03-04 19:37:03 +01:00
Martin Kroeker
497f0c3d8a
Replace .align with .p2align in the Nehalem microkernels
2018-02-26 20:58:33 +01:00
Martin Kroeker
ea37db828e
Convert .align to .p2align for OSX compatibility
2018-02-26 20:48:03 +01:00
Martin Kroeker
7c1925acec
Use .p2align instead of .align for compatibility on Sandybridge as well
2018-02-24 19:43:15 +01:00
Martin Kroeker
2359c7c1a9
Use .p2align instead of .align for portability
...
The OSX assembler apparently mishandles the argument to decimal .align, leading to a significant loss of performance
as observed in #730 , #901 and most recently #1470
2018-02-24 17:50:13 +01:00
Martin Kroeker
e388459a27
Merge pull request #1419 from brada4/develop
...
Initialize unitialized values for repeated calls
2018-01-31 23:48:34 +01:00
Andrew
4938faa822
core.IdenticalExpr clang501 checker
2018-01-19 23:15:58 +01:00
Martin Kroeker
42285d8e70
Merge pull request #1410 from brada4/develop
...
Address warnings #1357
2018-01-06 20:02:46 +01:00
Andrew
4d0b005e5b
Eliminate remaining unused results in kernels (clang5 analyzer)
2018-01-01 20:54:39 +01:00
Martin Kroeker
b81656936f
Merge pull request #1409 from martin-frbg/issue1292-2
...
Tag %1 and %2 as both input and output operands
2017-12-31 20:18:48 +01:00
Martin Kroeker
b973990df2
Tag %1 and %2 as both input and output operands
...
fix from #1292 extended to the other gemv microkernels
2017-12-31 18:03:36 +01:00
Martin Kroeker
1e31124eb0
Merge pull request #1406 from martin-frbg/issue1292
...
Tag %1 and %2 as both input and output
2017-12-30 14:52:03 +01:00
Martin Kroeker
723f396a20
Tag %1 and %2 as both input and output
...
The inline assembly modifies its input operands, so mark them as output to avoid surprises with optimization. Fixes #1292
2017-12-29 23:56:41 +01:00
Martin Kroeker
43c0622e7b
Retire Piledriver/Steamroller/Excavator daxpy microkernels as well
...
related to issue #1332
2017-12-13 18:40:39 +01:00
Martin Kroeker
0623636c98
Use Sandybridge daxpy kernel on Haswell and Zen for now
...
The testcase from #1332 exposes a problem in daxpy_microk_haswell-2.c that is not seen with
any of the other Intel x86_64 microkernels.
2017-12-10 19:24:31 +01:00
Andrew
281a2b952f
warning cleanup ( #1380 )
...
* dead increments in driver/level2
* dead increments in kernel/generic
* part dead increments in kernel/x86_64
2017-12-05 19:54:10 +01:00
Martin Kroeker
6c77b5f267
Merge pull request #1369 from martin-frbg/dsdot
...
Add optimized dsdot to all other x86_64 kernels that use sdot.c
2017-11-28 18:15:31 +01:00
Martin Kroeker
c92cd6d162
Add trivially optimized dsdot based on sdot
2017-11-24 20:05:27 +01:00
Martin Kroeker
cae5d9a20b
Add trivially optimized dsdot based on sdot
2017-11-24 20:04:29 +01:00
Martin Kroeker
3d891c3106
Add trivially optimized dsdot based on sdot
2017-11-24 20:03:40 +01:00
Martin Kroeker
4fbdcfa823
Add trivially optimized dsdot based on sdot
2017-11-24 20:02:28 +01:00
Martin Kroeker
1bb6a96ebc
Add trivially optimized dsdot based on sdot
2017-11-24 20:01:42 +01:00
Martin Kroeker
6bd163f37a
Add trivially optimized dsdot based on sdot
2017-11-24 20:00:23 +01:00
Martin Kroeker
f0333333d1
Add trivially optimized dsdot based on sdot
2017-11-24 19:59:28 +01:00
Andrew
e89b979b2c
fix spurious compiler warning fix (no code change)
2017-11-24 18:39:04 +01:00
Andrew
7e9b29b9b8
fix spurious compiler warning (no code change)
2017-11-24 18:36:37 +01:00
Martin Kroeker
6157d0902a
Merge pull request #1358 from martin-frbg/unused_vars
...
Clean up spurious unused variables in the kernels
2017-11-15 11:31:43 +01:00
Martin Kroeker
3fea849bbf
Remove unused variables from Haswell dtrmm and Bulldozer dtrsm
2017-11-14 23:35:10 +01:00
Martin Kroeker
8f177621bc
Remove unused variables at0...at3 from ?symv_U
2017-11-14 23:32:25 +01:00
Martin Kroeker
5f402b7759
Remove unused (loop?) variable j from the gemv_n_4 implementations
2017-11-14 23:29:42 +01:00
Martin Kroeker
a07807caac
Eliminate loop code when called as/from dsdot
2017-10-25 16:45:41 +02:00
Martin Kroeker
5e3e91d0fc
Split the microkernel workload into chunks of 32 floats for dsdot mode to limit loss of precision
2017-10-22 18:18:51 +02:00
Martin Kroeker
28c3fa8950
Add dsdot
2017-10-16 23:29:03 +02:00
Martin Kroeker
8ac87c1cb6
Implement DSDOT with unchanged sdot microkernels
2017-10-16 23:27:51 +02:00
Isuru Fernando
505b218829
Merge remote-tracking branch 'upstream/develop' into dyn
2017-08-06 19:07:00 +05:30
Isuru Fernando
1d1854032b
Add missing EXCAVATOR
2017-08-02 19:03:04 +05:30
Isuru Fernando
2c51a990ac
Fix extra whitespaces. CMake parser macro fails with it
...
TODO: Fix the parser macro to strip trailing whitespaces
2017-08-02 18:26:57 +05:30
Isuru Fernando
ca17b4b75c
Fix complex support for MSVC headers
2017-07-28 11:50:29 +05:30
Denis Steckelmacher
c9ff735da6
Add ZEN support (tested for auto-detected static backend)
2017-03-19 15:32:50 +01:00
Martin Kroeker
a6efabf155
Replace gnu _real_ , _imag_ extensions in initializers
2017-03-13 00:38:37 +01:00
Martin Kroeker
dc34a0da96
Merge pull request #915 from mdong/small_fix_for_icc
...
remove input from clobbered list
2017-02-23 20:00:22 +01:00
Martin Kroeker
4998e19869
Change file comments to work around clang 3.9 assembler bug
2016-10-13 16:51:08 +02:00
Martin Kroeker
16446d1d23
Remove explicit include of complex.h
2016-09-29 23:45:56 +02:00
mdong
098d8ec5d6
remove input from clobbered list
2016-06-24 16:37:58 -04:00
Werner Saar
298b13bba4
updated some kernel files for EXCAVATOR
2016-04-25 10:36:23 +02:00
Zhang Xianyi
f24d5307cf
Refs #834 . Fix zgemv config bug on Steamroller.
2016-04-12 22:26:11 +08:00
Zhang Xianyi
d4380c1fe4
Refs xianyi/OpenBLAS-CI#10 , Fix sdot for scipy test_iterative.test_convergence test failure on AMD bulldozer and piledriver.
2016-04-07 01:44:18 +08:00
Werner Saar
faa5e2e5e3
FIX: forgot the add the files cgemv_n_4.c and cgemv_t_4.c
2016-03-10 11:10:38 +01:00
Werner Saar
fdf291be30
Added optimized cgemv_n and cgemv_t kernels for bulldozer, piledriver and steamroller
2016-03-10 09:42:07 +01:00
Werner Saar
c99cc41cbd
Added optimized zgemv_n kernel for bulldozer, piledriver and steamroller
2016-03-09 14:02:03 +01:00
Werner Saar
acdff55a6a
Bugfix for ztrmv
2016-03-07 09:39:34 +01:00
Zhang Xianyi
7d6b68eb4a
Refs #786 . Revert to default assembly kernel.
2016-03-07 11:34:58 +08:00
Zhang Xianyi
8f758eeff9
Refs #786 . avoid old assembly c/zgemv kernels.
2016-03-05 08:32:03 +08:00
Zhang Xianyi
efa4f5c936
Refs #695 #783 . Replace default x86_64 cgemv_t
...
asm kernel by C kernel.
2016-03-01 11:18:56 +08:00
Zhang Xianyi
6e7be06e07
Refs JuliaLang/julia#5728 . Fix gemv performance bug on Haswell Mac OSX.
...
On Mac OS X, it should use .align 4 (equal to .align 16 on Linux).
I didn't get the performance benefit from .align. Thus, I deleted it.
2016-02-19 17:56:07 -05:00
Zhang Xianyi
962376664d
Refs #768 . Swap the result of zdot x87 fp kernel.
2016-02-02 09:15:02 +08:00
Zhang Xianyi
c44ff4d648
Refs #714 . avoid compiling warnings.
2016-01-28 04:38:07 +08:00
Werner Saar
c8f2c5d636
added optimized trsm_kernels
2016-01-05 13:05:05 +01:00
Zhang Xianyi
69363622a8
Fix DYNAMIC_ARCH=1 bug.
2015-10-27 05:10:40 +08:00
Zhang Xianyi
f874465bb8
Use cmake to build OpenBLAS GENERIC Target on MSVC x86 64-bit.
...
Disable CBLAS and LAPACK.
2015-08-10 14:10:44 -05:00
Zhang Xianyi
ab0a0a75fc
Merge branch 'develop' into cmake
2015-08-03 23:59:01 -05:00
Zhang Xianyi
1cf2b10224
Use pure C generic target on x86 and x86_64.
...
make TARGET=GENERIC
?gemm3m is unimplemented on generic target.
2015-08-03 23:55:56 -05:00
Zhang Xianyi
7ac7e147d4
Fixed cmake building bugs on Linux. Disable LAPACK by default.
2015-08-04 04:37:05 +08:00
Werner Saar
e7c969e164
added optimized dtrmm_kernel for haswell
2015-06-13 16:16:29 +02:00
Werner Saar
9bd962f655
modified haswell parameter dgemm_unroll_n
2015-06-13 10:28:27 +02:00
Werner Saar
24f58c8bb1
added optimized cscal and zscal kernels for steamroller
2015-05-18 12:40:07 +02:00
Werner Saar
95b1faf667
added optimized cscal and zscal kernels for steamroller and piledriver
2015-05-18 10:50:57 +02:00
Werner Saar
2d9e406050
added optimized cscal kernel for sandybridge
2015-05-18 08:46:06 +02:00
Werner Saar
59083e3ce1
added optimized cscal kernel for bulldozer
2015-05-18 07:33:52 +02:00
wernsaar
685be40339
Merge pull request #571 from wernsaar/develop
...
added optimized cscal and zscal functions
2015-05-17 14:09:14 +02:00
Werner Saar
31c9e399e9
added optimized cscal kernel for haswell
2015-05-17 13:44:09 +02:00
Werner Saar
7de6bb9889
added optimized zscal kernel for bulldozer
2015-05-17 11:45:19 +02:00
Werner Saar
d63034303b
added optimized zscal kernel for haswell
2015-05-16 16:41:45 +02:00
Zhang Xianyi
51ff17d46e
Add AMD Excavator target.
2015-05-13 16:16:30 -05:00
Werner Saar
18e90ee2e3
bugfix: added static to functions
2015-05-13 13:31:26 +02:00
Werner Saar
e00cccc41e
added optimized dscal kernel for piledriver
2015-05-13 13:05:35 +02:00
Werner Saar
73f09bf64f
optimized dscal kernel for increment != 1
2015-05-13 12:14:39 +02:00
Werner Saar
02e772c7e4
added optimized dscal kernel for haswell
2015-05-12 17:19:58 +02:00
Werner Saar
7aee913991
added optimized dscal kernel for sandybridge
2015-05-12 16:27:43 +02:00
Werner Saar
e50a933037
added optimized dscal kernel for bulldozer
2015-05-12 12:28:44 +02:00
Werner Saar
133c11a156
updated dgemv_n kernel for nehalem
2015-04-30 14:38:06 +02:00
Werner Saar
30f52d53df
optimized dgemv_n kernel for haswell
2015-04-30 12:11:39 +02:00
Werner Saar
5e83d80725
optimized dger kernel for sandybridge
2015-04-28 16:58:11 +02:00
Werner Saar
b2e1797dc6
added optimized sger kernel for sandybridge
2015-04-28 15:33:38 +02:00
Werner Saar
e216f686cb
optimized saxpy and daxpy for sandybridge
2015-04-28 10:18:32 +02:00
Werner Saar
fc0e0391f3
bugfixes: replaced int with BLASLONG
2015-04-24 14:30:44 +02:00
Werner Saar
c22068c406
optimized sdot.c for increments != 1
2015-04-24 13:13:20 +02:00
Werner Saar
dee100d0e4
optimized saxpy.c for increments != 1
2015-04-24 11:52:59 +02:00
Werner Saar
0273966abb
optimized daxpy kernel for increments != 1
2015-04-24 11:39:17 +02:00
Werner Saar
3a67daa954
optimized ddot.c for increments != 1
2015-04-24 10:56:55 +02:00
Werner Saar
b4f2153dcd
added optimized ssymv kernels for sandybridge
2015-04-23 12:19:24 +02:00
Werner Saar
1c4b0eeae3
added optimized ssymv kernels for haswell
2015-04-23 10:23:13 +02:00
Werner Saar
1bec9abb9a
added optimized dsymv kernels for sandybridge
2015-04-22 12:09:43 +02:00
Werner Saar
3814bf60d3
added optimized dsymv kernels for haswell
2015-04-22 10:42:50 +02:00
Werner Saar
6d0db0151f
added optimized zaxpy-kernels
2015-04-16 11:19:37 +02:00
Zhang Xianyi
37b9033c90
Merge pull request #543 from jeromerobert/develop
...
Fix a buffer overflow with MAX_STACK_ALLOC size in dgemv_t
2015-04-15 11:18:14 -05:00
Werner Saar
13889515b3
added optimized caxpy-kernel for sandybridge
2015-04-15 16:29:25 +02:00