Martin Kroeker
5b0093b5fe
Convert aligned moves to unaligned
...
Should have no performance impact on reasonably modern CPUs, and fixes occasional crashes in actual user code.
2020-04-13 14:58:52 +02:00
Martin Kroeker
e9bfa2291a
Fix parameter overflow
2020-04-12 19:47:02 +02:00
gxw
8d07cf9b67
Fix compilation problem on loongson platform
...
Using "make TARGET=GENERIC" on loongson platform will get the following
error messages:
"make[1]: *** No rule to make target 'sgemm_incopy.o', needed by 'libs'"
Add kernel/mips64/KERNEL.generic to solve the problem.
2020-04-09 19:28:15 +08:00
Martin Kroeker
806f89166e
Make ARMV7 compile with xcode and add a CI job for it (#2537)
...
* Add an ARMV7 iOS build on Travis
* thread_local appears to be unavailable on ARMV7 iOS
* Add no-thumb option for ARMV7 IOS build to get it to accept DMB ISH
* Make local labels in macros of nrm2_vfpv3.S compatible with the xcode assembler
2020-04-02 10:30:37 +02:00
Martin Kroeker
c6af9bbb32
Merge pull request #2534 from martin-frbg/issue2496
...
Fix zero initialization for beta=0 case
2020-03-31 20:53:13 +02:00
Martin Kroeker
144be81ca1
fix initialization to zero in the NEON SGEMM_BETA kernel as well
2020-03-31 16:53:56 +02:00
Martin Kroeker
07cdd5d05c
Fix zero initialization for beta=0 case
...
use immediate initialization instead of multiplication in case register content is a NaN
2020-03-31 00:21:02 +02:00
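A minimal C sketch of the reasoning behind the commit above: under IEEE 754 arithmetic, NaN multiplied by zero is still NaN, so computing C := beta * C with beta = 0 does not clear a C that already holds NaNs. The function names below are hypothetical scalar stand-ins chosen for illustration; they are not the actual assembly kernel.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical scalar stand-in for a GEMM beta kernel: C := beta * C.
 * Multiplying propagates NaN even when beta is zero. */
void scale_by_multiply(float *c, size_t n, float beta) {
    for (size_t i = 0; i < n; i++)
        c[i] *= beta;
}

/* Same operation with an explicit beta == 0 branch that stores zero
 * immediately and never reads the old contents of C. */
void scale_with_zero_branch(float *c, size_t n, float beta) {
    if (beta == 0.0f) {
        for (size_t i = 0; i < n; i++)
            c[i] = 0.0f;
        return;
    }
    for (size_t i = 0; i < n; i++)
        c[i] *= beta;
}

/* Returns 1 if the multiply path leaves a NaN behind while the
 * zero-branch path clears it. */
int beta_zero_demo(void) {
    float a[2] = { NAN, 1.0f };
    float b[2] = { NAN, 1.0f };
    scale_by_multiply(a, 2, 0.0f);
    scale_with_zero_branch(b, 2, 0.0f);
    return isnan(a[0]) && b[0] == 0.0f && b[1] == 0.0f;
}
```

The zero branch also avoids loading C at all, which is the same idea the later "add beta == zero branch" commits in this log describe for arm64.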
Martin Kroeker
567d2760e6
Merge pull request #2520 from wjc404/develop
...
Fix avx512 sgemm performance bug when ldc is a multiple of 1024
2020-03-30 20:15:59 +02:00
wjc404
b8307768e2
Add files via upload
2020-03-21 05:42:10 +08:00
Martin Kroeker
af8a619e1f
Merge pull request #2517 from wjc404/develop
...
Temporary fix for SKX STRSM
2020-03-17 10:12:53 +01:00
wjc404
62b9608986
Update KERNEL.SKYLAKEX
2020-03-17 12:52:55 +08:00
Martin Kroeker
a1b181cea2
Merge pull request #2516 from wjc404/develop
...
AVX2 STRSM kernels
2020-03-16 21:58:34 +01:00
wjc404
cdc0e9011e
Update KERNEL.ZEN
2020-03-16 16:39:37 +00:00
wjc404
fa049d49c2
AVX2 STRSM kernel
2020-03-17 00:34:08 +08:00
s00548429
bec7923a0d
Fix the functional bugs for zamax.
2020-03-09 15:36:50 +08:00
Rajalakshmi Srinivasaraghavan
2afc074803
Fix DYNAMIC_ARCH build for POWER9
...
Setting DYNAMIC_ARCH=1 on POWER9 does not build POWER9 files due to some
compiler version checks. This patch fixes some of the macros that are used
to check compiler version. On fixing those checks, there are some new make
failures related to icamin, icamax, isamin, isamax and caxpy files on POWER9.
This patch fixes those failures as well.
2020-03-03 12:35:10 -06:00
Martin Kroeker
4f371b0fbf
Use POWER8 kernels on big-endian POWER9 for now
2020-03-01 23:45:58 +01:00
Martin Kroeker
ea8eec5d17
Merge pull request #2422 from wjc404/develop
...
Adjust SkylakeX GEMM3M parameters, add an AVX512 STRMM kernel and fix performance bugs in AVX2 s/c/z GEMM
2020-02-29 19:07:35 +01:00
Ali Saidi
c623a965f9
Add Neoverse-N1 core
...
The implementation is a hybrid of the ARMV8 one with some of the
improved TX2 routines along with specifying -march=v8.2-a
2020-02-29 03:22:04 +00:00
wjc404
dd22eb7621
Update cgemm_kernel_8x2_haswell.c
2020-02-27 22:26:15 +08:00
wjc404
2352331e60
Update zgemm_kernel_4x2_haswell.c
2020-02-27 22:25:19 +08:00
wjc404
1b980001dd
Update zgemm_kernel_4x2_haswell.c
2020-02-26 18:38:12 +08:00
wjc404
2515e1152f
Update cgemm_kernel_8x2_haswell.c
2020-02-26 18:36:54 +08:00
Martin Kroeker
ddcbed6690
Merge pull request #2437 from martin-frbg/issue2434
...
[WIP] Add support for Ampere EMAG8180 ARMV8 cpu
2020-02-25 18:42:52 +01:00
wjc404
903854c168
Add files via upload
2020-02-22 23:40:02 +08:00
wjc404
a2ff577a30
Update KERNEL.ZEN
2020-02-22 23:39:43 +08:00
wjc404
97a32cb0a5
Update KERNEL.HASWELL
2020-02-22 23:39:20 +08:00
Martin Kroeker
07454bf4d5
Add proper defaults for IxMIN/IxMAX kernels
...
the fallbacks from Makefile.L1 assume a combined source for absolute value and non-absolute (with ifdef USE_ABS) but here we have separate implementations
2020-02-21 11:58:15 +01:00
Martin Kroeker
4046985913
Add proper defaults for IxMIN/IxMAX kernels
...
the fallbacks from Makefile.L1 assume a combined source for absolute value and non-absolute (with ifdef USE_ABS) but here we have separate implementations
2020-02-21 11:55:52 +01:00
Martin Kroeker
e57b11acca
Add preliminary support for EMAG8180
2020-02-19 19:00:28 +01:00
Martin Kroeker
0b39cf95b0
Fix endianness conditionals
2020-02-19 18:09:54 +01:00
Martin Kroeker
9f39f0a2c3
Specify ismin/ismax assembly kernels for POWER8 directly
...
to fix utest failure in new ismin test - Makefile.L1 defaults look wrong
2020-02-17 19:55:39 +01:00
Martin Liska
aeea14ee40
Come up with LOAD_AND_COMPARE_TO_MXX macro in iamax_sse.S.
2020-02-17 09:01:53 +01:00
Martin Liska
18bcc36a69
Fix implementation of iamax_sse.S as reported in #2116.
...
There was a typo in iamax_sse.S where one of the comparisons
was cmpeqps instead of cmpeqss. That misdetected the index
for sequences where the minimum value was 0.
2020-02-17 09:01:53 +01:00
Martin Liska
0e7f43c898
Add missing USE_MIN in kernel/CMakeLists.txt.
2020-02-17 09:01:53 +01:00
wjc404
f566787e6e
Update KERNEL.SKYLAKEX
2020-02-16 22:58:44 +08:00
wjc404
e3368cbf18
AVX512 STRMM kernel
2020-02-16 22:58:00 +08:00
Martin Kroeker
cafdd999b8
Update caxpy_power8.S
2020-02-13 22:44:09 +01:00
Martin Kroeker
92ca92a46c
Update caxpy_power8.S
2020-02-13 21:24:54 +01:00
Martin Kroeker
486c35c5dc
Update icamin_power8.S
2020-02-13 18:38:43 +01:00
Martin Kroeker
5ba3699f41
Update isamin_power8.S
2020-02-13 00:00:32 +01:00
Martin Kroeker
8eefa530cd
Update isamax_power8.S
2020-02-12 23:59:50 +01:00
Martin Kroeker
de40d47edf
Update isamin_power8.S
2020-02-12 23:57:48 +01:00
Martin Kroeker
7c162b8a21
Update isamax_power8.S
2020-02-12 23:56:57 +01:00
Martin Kroeker
0544cbc806
Fix syntax of endianness conditional
2020-02-12 20:00:29 +01:00
Martin Kroeker
120d20731f
Fix syntax of endianness conditional
2020-02-12 19:58:42 +01:00
Martin Kroeker
dc345d84df
Fix syntax of endianness conditional and add gcc version check for workaround
2020-02-12 19:56:52 +01:00
Bart Oldeman
7ea5e07d1c
Fix inline asm in dscal: mark x, x1 as clobbered. Fixes #2408
...
The leaq instructions in dscal_kernel_inc_8 modify x and x1 so they
must be declared as input/output constraints, otherwise the compiler
may assume the corresponding registers are not modified.
2020-02-12 14:11:44 +00:00
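A small sketch of the constraint issue described in the commit above, assuming GCC or Clang extended asm on x86-64. The function `scale_advance` is hypothetical and much simpler than `dscal_kernel_inc_8`; it only shows that a pointer modified by `leaq` inside the asm has to be declared as a read-write operand (`"+r"`) rather than a plain input. On other targets the sketch falls back to plain C.

```c
#include <stddef.h>

/* Hypothetical example: scale n doubles by alpha, advancing the pointer
 * inside the asm statement itself. */
void scale_advance(double *x, size_t n, double alpha) {
    for (size_t i = 0; i < n; i++) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
        __asm__ volatile(
            "movsd (%0), %%xmm0 \n\t"  /* load *x                      */
            "mulsd %1, %%xmm0   \n\t"  /* xmm0 = *x * alpha            */
            "movsd %%xmm0, (%0) \n\t"  /* store the scaled value       */
            "leaq 8(%0), %0     \n\t"  /* x += 1: the asm modifies x   */
            : "+r"(x)                  /* read-write, not input-only   */
            : "x"(alpha)
            : "xmm0", "memory");
#else
        /* Portable fallback with the same effect. */
        *x *= alpha;
        x++;
#endif
    }
}
```

If `x` were listed only as an input, the compiler would be entitled to assume the register still holds the original pointer on the next iteration, which is the kind of miscompilation the commit message warns about.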
Martin Kroeker
7e5cbb6f35
Fix bad conditional syntax that caused spurious application of USE_TRMM
2020-02-10 21:17:39 +01:00
wjc404
3447d04eaf
Update dgemm_kernel_16x2_skylakex.c
2020-02-06 02:14:10 +00:00
wjc404
8b5cdcc64c
Update sgemm_kernel_8x4_haswell.c
2020-02-06 01:47:46 +00:00
wjc404
4e00d96a78
Update dgemm_kernel_16x2_skylakex.c
2020-02-06 01:46:36 +00:00
wjc404
096da2f51a
Update dgemm_kernel_16x2_skylakex.c
2020-02-05 13:36:57 +08:00
wjc404
081b188529
Update KERNEL.SKYLAKEX
2020-02-03 21:38:08 +08:00
wjc404
8019e70211
AVX512 16x2 DGEMM kernel
2020-02-03 21:32:56 +08:00
Qiyu8
ff42e68652
Optimize general GEMM beta
2020-01-20 11:49:42 +08:00
Martin Kroeker
70f45749b9
Merge pull request #2367 from wjc404/develop
...
Improve parallel SGEMM performance on SKYLAKEX CPUs
2020-01-15 21:13:43 +01:00
wjc404
e5dcdeb550
Update sgemm_direct_skylakex.c
2020-01-13 16:59:23 +08:00
wjc404
952cc2ba38
Update sgemm_kernel_16x4_skylakex_2.c
2020-01-13 16:58:54 +08:00
wjc404
feaafbedd3
make skylakex sgemm code more friendly for readers
...
BTW some kernels were adjusted to improve performance
2020-01-13 16:28:41 +08:00
Martin Kroeker
b36018be6d
Merge pull request #2365 from wjc404/develop
...
Fix SKYLAKEX STRMM issues
2020-01-09 23:23:09 +01:00
wjc404
3a100b2797
Update KERNEL.SKYLAKEX
2020-01-09 13:48:41 +08:00
Martin Kroeker
38742d5547
Merge pull request #2361 from wjc404/develop
...
Optimize AVX2 SGEMM & STRMM
2020-01-08 16:20:28 +01:00
wjc404
bd4c032f52
Update sgemm_kernel_8x4_haswell.c
2020-01-07 11:22:46 +08:00
wjc404
9dc9b7b95e
Update sgemm_kernel_8x4_haswell.c
2020-01-06 20:11:36 +08:00
wjc404
92b10212de
optimize AVX2 SGEMM
2020-01-06 12:11:21 +08:00
wjc404
b73bf01378
optimize AVX2 SGEMM
2020-01-06 12:09:14 +08:00
wjc404
eb3c9f1db9
optimize AVX2 SGEMM
2020-01-06 12:07:02 +08:00
Martin Kroeker
456ee2e1f0
Merge pull request #2357 from chenxuqiang/dgemm_beta_zero
...
kernel/arm64/dgemm_beta.S: add beta == zero branch
2020-01-02 22:28:36 +01:00
shengyang
80db5f11e1
update
2020-01-02 11:01:57 +08:00
chenxuqiang
52de4cc8fd
kernel/arm64/dgemm_beta.S: add beta == zero branch
...
Added a beta == zero branch, so there is no need to load the C matrix.
Signed by: Xuqiang Chen <chenxuqiang3@hisilicon.com>
2020-01-01 21:50:45 -05:00
Martin Kroeker
44028581cc
Merge pull request #2355 from Zeyiii/dev-zeyi2
...
Use arm neon instructions to optimize sgemm_beta operation
2020-01-01 22:14:16 +01:00
Martin Kroeker
86ab939936
Merge pull request #2354 from ZuoQ3/develop
...
[WIP] Use arm neon instructions to optimize tcopy operation
2020-01-01 22:13:37 +01:00
Martin Kroeker
6c85cb1869
Merge pull request #2352 from wjc404/develop
...
AVX2 ZGEMM3M kernel
2019-12-31 18:08:10 +01:00
Martin Kroeker
995768bbc5
Merge pull request #2351 from Zeyiii/develop
...
prefetching for dgemm_beta
2019-12-31 18:07:37 +01:00
int_13h
96ad579428
add in runtime cpu detection for zarch (#2349)
...
add in runtime cpu detection for zarch
2019-12-31 18:03:27 +01:00
shengyang
8d84403205
Use arm neon instructions to optimize ncopy operation
...
modified: KERNEL.ARMV8
modified: KERNEL.TSV110
new file: sgemm_ncopy_4.S
2019-12-31 17:06:35 +08:00
w00421467
0833a4846a
Use arm neon instructions to optimize sgemm_beta operation
2019-12-31 10:42:03 +08:00
zq
50f7fc1401
[WIP] Use arm neon instructions to optimize tcopy operation
2019-12-31 10:21:23 +08:00
w00421467
d1b53806be
Merge remote-tracking branch 'pub/develop' into develop
2019-12-31 10:13:24 +08:00
wjc404
a0f0a802fc
Update zgemm3m_kernel_4x4_haswell.c
2019-12-30 17:33:42 +08:00
wjc404
700fe5b5ee
Add files via upload
2019-12-30 17:18:59 +08:00
wjc404
f60840c420
Update KERNEL.ZEN
2019-12-30 16:04:23 +08:00
wjc404
109e18cd96
Update KERNEL.HASWELL
2019-12-30 16:03:24 +08:00
wjc404
ae1579be13
Create zgemm3m_kernel_4x4_haswell.c
2019-12-30 16:02:51 +08:00
w00421467
3ccf8885ac
prefetching for dgemm_beta
2019-12-30 11:45:49 +08:00
wjc404
cd765f094b
Update cgemm3m_kernel_8x4_haswell.c
2019-12-27 18:23:29 +08:00
wjc404
3a66c8cac1
Update KERNEL.ZEN
2019-12-27 18:04:08 +08:00
wjc404
ed9af2f7da
Update KERNEL.HASWELL
2019-12-27 18:01:38 +08:00
wjc404
5fd1edead9
Create cgemm3m_kernel_8x4_haswell.c
2019-12-27 18:00:55 +08:00
wjc404
eeecd623d8
Update cgemm_kernel_8x2_haswell.c
2019-12-24 00:40:16 +08:00
wjc404
2cd9306bb5
Update KERNEL.ZEN
2019-12-23 23:42:30 +08:00
wjc404
c418c81224
Update KERNEL.HASWELL
2019-12-23 23:41:44 +08:00
wjc404
025741f16a
Fast Haswell CGEMM kernel
2019-12-23 23:40:03 +08:00
wjc404
f41d52665d
Fast Haswell ZGEMM kernel
2019-12-21 14:37:06 +08:00
wjc404
d573d24de7
Fast Haswell ZGEMM kernel
2019-12-21 14:35:15 +08:00
w00421467
b7cc69ee62
declare DGEMM_BETA in KERNEL.ARMV8 rather than the generic KERNEL
2019-12-20 10:11:50 +08:00
w00421467
aeef942c4f
use arm neon instructions to optimize gemm beta operation
2019-12-17 10:00:13 +08:00
Martin Kroeker
1a6ea8ee6d
Merge pull request #2338 from kavanabhat/aix_mod
...
Changes to build on AIX in POWER8 mode
2019-12-09 17:54:49 +01:00
Kavana Bhat
6baa9b07d7
AIX changes for Power8
2019-12-06 04:33:32 -06:00
Kavana Bhat
3938e59569
AIX changes for Power8
2019-12-04 00:23:46 -06:00
Isuru Fernando
b863b32ac5
Workaround an ICE in clang 9.0.0
...
This bug is not there in 8.x nor in the 9.0 daily snapshot.
2019-12-01 12:59:46 -06:00
Martin Kroeker
dd04143d4a
Merge pull request #2328 from martin-frbg/ppc9
...
Fix precompiled kernels on POWER9 and make their use conditional on (old) gcc version
2019-11-30 12:23:57 +01:00
Martin Kroeker
f3a6164bff
Merge pull request #2324 from antonblanchard/power9_segv
...
Fix SEGV in cdot_power9
2019-11-30 00:03:42 +01:00
Martin Kroeker
dedd822d1a
Fix caxpy/caxpyc naming in localentry
2019-11-29 23:56:57 +01:00
Martin Kroeker
2181fb7047
Fix caxpy/caxpyc naming in localentry
2019-11-29 23:54:15 +01:00
Martin Kroeker
a9b62c03f8
Substitute precompiled gcc7 codes only when gcc is older than 9.x
2019-11-29 23:49:50 +01:00
Martin Kroeker
97762234f9
Add variable for gcc >=9 test
...
used in KERNEL.POWER9
2019-11-29 23:47:23 +01:00
wjc404
934e601e93
Update dgemm_kernel_4x8_skylakex_2.c
2019-11-28 19:56:35 +08:00
Anton Blanchard
cf2a8e410c
Fix SEGV in cdot_power9
...
We were corrupting r2 because the local entry wasn't being
set up correctly.
2019-11-26 21:55:04 -07:00
wjc404
eb1e9c8c92
some optimizations
2019-11-26 14:12:20 +08:00
Andreas Arnez
d117dfd505
Change bad usage of "asum" to "sum" in ZARCH versions of ?sum
...
The ZARCH implementations of ?sum contain a cut & paste-error: An inline
assembly argument is named "sum", but the assembly references "asum"
instead. The mismatch causes a build error. This is fixed.
2019-11-21 13:49:13 +01:00
Martin Kroeker
b09b5be0a4
Merge pull request #2315 from ewanglong/develop
...
Revised Windows compatibility fix for #2313
2019-11-21 05:06:44 +01:00
Wang, Long
bfb5fbdb4d
Revised Windows compatibility fix for #2313
...
Signed-off-by: Wang, Long <long1.wang@intel.com>
2019-11-21 10:22:58 +08:00
Martin Kroeker
08fa83aba2
Merge pull request #2312 from martin-frbg/power8be
...
Further Power8 big-endian corrections
2019-11-20 15:12:06 +01:00
Wang, Long
1191db1a49
For Windows compatibility, use "unsigned long long" to ensure 64-bit length
...
Signed-off-by: Wang, Long <long1.wang@intel.com>
2019-11-20 21:30:47 +08:00
Wang, Long
0caf1434c9
Fix the integer overflow issue for large matrix size
...
For large matrices, e.g. M=N=K with M>1290, int mnk=M*N*K will overflow.
This leads to wrongly branching to the single-threaded path, and performance
is degraded significantly.
Signed-off-by: Wang, Long <long1.wang@intel.com>
2019-11-20 14:11:17 +08:00
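The arithmetic in the commit above can be checked with a short C sketch, assuming a 32-bit `int` (INT_MAX = 2147483647): 1290^3 = 2146689000 still fits, while 1291^3 = 2151685171 does not. The function names and the threshold `SMALL_WORK_LIMIT` are hypothetical and chosen only for illustration; the sketch computes the product in `unsigned long long`, the type the follow-up Windows fix in this log names, and assumes non-negative dimensions.

```c
#include <limits.h>

/* Hypothetical threshold below which single-threading is preferred;
 * not taken from the library. */
#define SMALL_WORK_LIMIT 1000000ULL

/* Product of the three dimensions in a type that is at least 64 bits
 * wide on every platform, including 64-bit Windows where long is 32 bits.
 * Assumes m, n and k are non-negative. */
unsigned long long work_size(int m, int n, int k) {
    return (unsigned long long)m * (unsigned long long)n
         * (unsigned long long)k;
}

/* Returns 1 if m*n*k would still be representable as an int. */
int product_fits_int(int m, int n, int k) {
    return work_size(m, n, k) <= (unsigned long long)INT_MAX;
}

/* Returns 1 if the problem is small enough to run single-threaded.
 * With a wrapped int product, a huge problem could wrongly take
 * this branch. */
int use_single_thread(int m, int n, int k) {
    return work_size(m, n, k) <= SMALL_WORK_LIMIT;
}
```

Evaluating `int mnk = M*N*K` directly for M = N = K = 1291 would be signed overflow, which is undefined behaviour in C, so the sketch compares against INT_MAX instead of executing the overflowing expression.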
Martin Kroeker
cad0d150db
Define alternate kernels for big-endian POWER8
2019-11-17 23:12:10 +01:00
Martin Kroeker
eba0aeb7cd
Fix compilation for big-endian POWER8
2019-11-17 22:58:32 +01:00
Martin Kroeker
0c07c356c1
Define alternate kernels for big-endian PPC440
2019-11-17 19:25:08 +01:00
Martin Kroeker
3e67017ac8
Merge pull request #2309 from martin-frbg/ppc970-be
...
Fix PPC970 big-endian support
2019-11-17 18:22:24 +01:00
Martin Kroeker
b3ac6ee222
Define alternate kernels for big-endian PPC970
...
The altivec versions of SGEMM and CGEMM fail most tests in LAPACK-TESTING when compiled for big endian, STRSM/CTRSM even cause segfaults. The rot kernels either fail the corresponding utest or lead to failures in LAPACK-TESTING.
2019-11-17 15:19:39 +01:00
Martin Kroeker
71e96163db
Merge pull request #2305 from wjc404/develop
...
AVX512 CGEMM & ZGEMM kernels
2019-11-12 07:38:37 +01:00
wjc404
819e852ae7
AVX512 CGEMM & ZGEMM kernels
...
96-99% 1-thread performance of MKL2018
2019-11-11 20:04:52 +08:00
Martin Kroeker
4c6a457358
Merge pull request #2300 from wjc404/develop
...
Optimize SGEMM on SKYLAKEX CPUs
2019-11-06 07:27:33 +01:00
wjc404
836c414e22
optimizations of software prefetching
2019-11-05 13:36:56 +08:00
Martin Kroeker
3cd97f1a80
Merge pull request #2301 from martin-frbg/ppc8be
...
Disable IDAMIN/MAX and IZAMIN/MAX optimizations on big-endian POWER8
2019-11-04 22:54:28 +01:00
wjc404
430c11e135
Add files via upload
2019-11-04 20:10:12 +08:00
wjc404
fbacd2605d
optimizations via software prefetches
2019-11-04 19:37:19 +08:00
Martin Kroeker
68597002ea
The assembly microkernel is not safe to use on ELFv1
2019-11-03 22:42:46 +01:00
Martin Kroeker
d2a6285549
The assembly microkernel is not safe to use on ELFv1
2019-11-03 22:41:19 +01:00
Martin Kroeker
d999688d1a
The assembly microkernel is not safe to use on ELFv1
2019-11-03 22:39:06 +01:00
Martin Kroeker
928fe1b28e
The assembly microkernel is not safe to use on ELFv1
2019-11-03 22:37:27 +01:00
wjc404
1df9a2013d
new sgemm kernel for skylakex
2019-11-02 00:00:48 +08:00
Martin Kroeker
85ccdce8c4
Remove the IOS fallbacks to generic C kernels
2019-10-25 23:02:37 +02:00
wjc404
6ff013bae0
native support for icopy_4
...
90% MKL 1-thread performance.
2019-10-19 03:54:44 +08:00
wjc404
0d669e04bb
Update dgemm_kernel_8x8_skylakex.c
2019-10-18 15:00:17 +08:00
wjc404
17cdd9f9e1
some correction
2019-10-18 14:58:07 +08:00
wjc404
6bcb06fcb1
make further changes to icopy_8 easier
2019-10-18 10:47:31 +08:00
wjc404
b7315f8401
Add files via upload
2019-10-16 19:23:36 +08:00
wjc404
9b19e9e1b0
Update dgemm_kernel_8x8_skylakex.c
2019-10-16 10:14:51 +08:00
wjc404
6bd67ddbab
Update dgemm_kernel_8x8_skylakex.c
2019-10-16 03:20:08 +08:00
wjc404
844629af57
Add files via upload
2019-10-16 02:00:34 +08:00
Martin Kroeker
a448884a63
Remove automatic label postfixes from macro included only once
2019-10-08 08:37:50 +02:00
Martin Kroeker
3a2df19db6
Fix accidental duplication of jump instruction
2019-10-08 08:09:26 +02:00
Martin Kroeker
d2093a40d3
Merge pull request #2277 from martin-frbg/issue2275
...
Rewrite ARMV8 code to allow cross-compilation for IOS
2019-10-06 23:01:54 +02:00
Martin Kroeker
56837e9d92
Make local labels in macro compatible with the xcode assembler
...
... which does not perform the automatic numbering on instantiation that the _@ suffix signifies
2019-10-04 14:53:23 +02:00
Martin Kroeker
5e244d80f2
Merge pull request #2271 from quickwritereader/strmm_fix
...
Fixed bug in POWER9 STRMM. BLAS-TESTER passes
2019-09-29 13:53:45 +02:00
AbdelRauf
ede5efebab
trmm fix
2019-09-29 02:28:34 +00:00
Martin Kroeker
596a22325a
Fix prologue of power9 assembly cdot(c) kernel to provide cdotc
2019-09-27 00:47:18 +02:00