OpenBLAS/kernel/arm64
Chris Sidebottom 4f7b77e08a Remove unnecessary instructions from Advanced SIMD dot
The existing kernel was issuing extra instructions to organise the arguments into the same registers they would usually be in and similarly to put the result into the appropriate register.

This affects smaller dot products and seemed like a quick fix.
2022-11-25 16:19:03 +00:00
KERNEL declare DGEMM_BETA in KERNEL.ARMV8 rather than the generic KERNEL 2019-12-20 10:11:50 +08:00
KERNEL.A64FX add sve ztrsm 2022-01-15 22:27:25 +01:00
KERNEL.ARMV8 Add workaround for NVIDIA HPC mishandling of the asm DOT kernels 2021-01-12 16:38:51 +01:00
KERNEL.ARMV8SVE update armv8sve + contributors 2022-01-18 08:28:31 +01:00
KERNEL.CORTEXA53 optimize cgemm on ARM cortex A53 & cortex A55 2021-12-12 17:22:52 +08:00
KERNEL.CORTEXA55 optimize cgemm on ARM cortex A53 & cortex A55 2021-12-12 17:22:52 +08:00
KERNEL.CORTEXA57 Add workaround for NVIDIA HPC mishandling of the asm DOT kernels 2021-01-12 16:39:35 +01:00
KERNEL.CORTEXA72 Simplifying ARMv8 build parameters 2018-11-19 16:41:49 +00:00
KERNEL.CORTEXA73 Simplifying ARMv8 build parameters 2018-11-19 16:41:49 +00:00
KERNEL.CORTEXA510 Add initial support for Phytium FT2000 series and ARMV9 Cortex 510/710/X1/X2 2022-03-27 15:29:20 +02:00
KERNEL.CORTEXA710 Add initial support for Phytium FT2000 series and ARMV9 Cortex 510/710/X1/X2 2022-03-27 15:29:20 +02:00
KERNEL.CORTEXX1 CortexX1 is ARMV8 like A7x 2022-03-28 17:28:29 +02:00
KERNEL.CORTEXX2 Add initial support for Phytium FT2000 series and ARMV9 Cortex 510/710/X1/X2 2022-03-27 15:29:20 +02:00
KERNEL.EMAG8180 Add preliminary support for EMAG8180 2020-02-19 19:00:28 +01:00
KERNEL.FALKOR Simplifying ARMv8 build parameters 2018-11-19 16:41:49 +00:00
KERNEL.FT2000 Add initial support for Phytium FT2000 series and ARMV9 Cortex 510/710/X1/X2 2022-03-27 15:29:20 +02:00
KERNEL.NEOVERSEN1 arm64: Fix nrm2 for input vectors with Inf 2021-01-01 02:49:37 -08:00
KERNEL.NEOVERSEN2 Change file name to match the norm and delete useless code. 2022-10-28 17:09:39 +08:00
KERNEL.NEOVERSEV1 OpenBLAS: aarch64: Add neoverse-v1/n2 architecture specifics 2022-01-07 00:28:17 +00:00
KERNEL.THUNDERX Add workaround for NVIDIA HPC 2021-01-12 16:49:39 +01:00
KERNEL.THUNDERX2T99 arm64: Fix nrm2 for input vectors with Inf 2021-01-01 02:49:37 -08:00
KERNEL.THUNDERX3T110 arm64: Fix nrm2 for input vectors with Inf 2021-01-01 02:49:37 -08:00
KERNEL.TSV110 Add workaround for NVIDIA HPC 2021-01-12 16:51:35 +01:00
KERNEL.VORTEX Use Neoverse's current mix of ThunderX2 kernels for Vortex as well 2021-10-06 11:06:43 +02:00
KERNEL.generic Fix MSVC ARM64 build. Add generic kernel for ARM64 2022-06-02 16:53:54 +02:00
Makefile added experimental support for ARMV8 2013-11-24 15:47:00 +01:00
amax.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
asum.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
axpy.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
casum.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
casum_thunderx2t99.c Fixed a few more unnecessary calls to num_cpu_avail. 2018-06-11 10:17:16 +01:00
cgemm_kernel_4x4.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
cgemm_kernel_8x4.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
cgemm_kernel_8x4_cortexa53.c optimize cgemm on ARM cortex A53 & cortex A55 2021-12-12 17:22:52 +08:00
cgemm_kernel_8x4_thunderx2t99.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
cgemm_kernel_sve_v1x4.S add cgemm ctrmm sve kernels 2022-01-05 09:09:18 +01:00
cgemm_ncopy_sve_v1.c sve copy functions for cgemm chemm zsymm 2022-01-05 09:12:22 +01:00
cgemm_tcopy_sve_v1.c sve copy functions for cgemm chemm zsymm 2022-01-05 09:12:22 +01:00
copy.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
copy_thunderx2t99.c Fixed a few more unnecessary calls to num_cpu_avail. 2018-06-11 10:17:16 +01:00
csum.S Add ARM64 implementations of ?sum 2019-03-30 22:13:36 +01:00
ctrmm_kernel_4x4.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
ctrmm_kernel_8x4.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
ctrmm_kernel_sve_v1x4.S add cgemm ctrmm sve kernels 2022-01-05 09:09:18 +01:00
dasum_thunderx2t99.c Fixed a few more unnecessary calls to num_cpu_avail. 2018-06-11 10:17:16 +01:00
daxpy_thunderx.c aarch64 fix std=c18 compilation 2020-10-03 18:00:34 +03:00
daxpy_thunderx2t99.S ARM64: Improve DAXPY for ThunderX2 2020-05-07 09:22:50 -07:00
ddot_thunderx.c ARM64: Rename kernel files to have consistent naming 2017-01-24 14:53:34 +05:30
dgemm_beta.S Fix zero initialization for beta=0 case 2020-03-31 00:21:02 +02:00
dgemm_kernel_4x4.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
dgemm_kernel_4x4_cortexa53.c MOD: optimize normal DGEMM on ARMV8 cortex-A53 & cortex-A55 2021-11-18 21:14:43 +08:00
dgemm_kernel_4x8.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
dgemm_kernel_8x4.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
dgemm_kernel_8x4_thunderx2t99.S ARM64: Move parameters from parameter.c to param.h 2018-10-22 01:45:51 -07:00
dgemm_kernel_sve_v1x8.S some clean-up & commentary 2021-11-21 14:56:27 +01:00
dgemm_kernel_sve_v2x8.S some clean-up & commentary 2021-11-21 14:56:27 +01:00
dgemm_ncopy_4.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
dgemm_ncopy_8.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
dgemm_ncopy_sve_v1.c some clean-up & commentary 2021-11-21 14:56:27 +01:00
dgemm_tcopy_4.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
dgemm_tcopy_8.S Remove unused TEMP2 and reshuffle to leave x18 unused (reserved on OSX) 2021-09-17 09:18:25 +02:00
dgemm_tcopy_sve_v1.c some clean-up & commentary 2021-11-21 14:56:27 +01:00
dot.S ARM64: Fix utest dsdot errors 2018-02-27 10:47:55 +00:00
dot_thunderx.c ARM64: Rename kernel files to have consistent naming 2017-01-24 14:53:34 +05:30
dot_thunderx2t99.c Remove unnecessary instructions from Advanced SIMD dot 2022-11-25 16:19:03 +00:00
dtrmm_kernel_4x4.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
dtrmm_kernel_4x8.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
dtrmm_kernel_8x4.S Move temp to x21 to leave x18 unused (reserved on OSX) 2021-09-17 09:24:11 +02:00
dtrmm_kernel_sve_v1x8.S some clean-up & commentary 2021-11-21 14:56:27 +01:00
dznrm2_thunderx2t99.c workaround fault with ssq=inf,scale=0 2022-07-02 23:47:17 +02:00
dznrm2_thunderx2t99_fast.c Fixed a few more unnecessary calls to num_cpu_avail. 2018-06-11 10:17:16 +01:00
gemv_n.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
gemv_t.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
iamax.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
iamax_thunderx2t99.c Fixed a few more unnecessary calls to num_cpu_avail. 2018-06-11 10:17:16 +01:00
izamax.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
izamax_thunderx2t99.c Fixed a few more unnecessary calls to num_cpu_avail. 2018-06-11 10:17:16 +01:00
nrm2.S Fix accidental duplication of jump instruction 2019-10-08 08:09:26 +02:00
rot.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
sasum_thunderx2t99.c Fixed a few more unnecessary calls to num_cpu_avail. 2018-06-11 10:17:16 +01:00
sbgemm_beta_neoversen2.c neoverse n2 sbgemm: init file 2022-06-29 10:14:21 +08:00
sbgemm_kernel_8x4_neoversen2.c Change file name to match the norm and delete useless code. 2022-10-28 17:09:39 +08:00
sbgemm_kernel_8x4_neoversen2_impl.c Change file name to match the norm and delete useless code. 2022-10-28 17:09:39 +08:00
sbgemm_ncopy_4_neoversen2.c Change file name to match the norm and delete useless code. 2022-10-28 17:09:39 +08:00
sbgemm_tcopy_8_neoversen2.c Change file name to match the norm and delete useless code. 2022-10-28 17:09:39 +08:00
scal.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
scnrm2_thunderx2t99.c Fixed a few more unnecessary calls to num_cpu_avail. 2018-06-11 10:17:16 +01:00
sgemm_beta.S fix initialization to zero in the NEON SGEMM_BETA kernel as well 2020-03-31 16:53:56 +02:00
sgemm_kernel_4x4.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
sgemm_kernel_8x8.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
sgemm_kernel_8x8_cortexa53.S fix INIT8x4 2020-06-10 01:01:16 +08:00
sgemm_kernel_16x4.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
sgemm_kernel_16x4_thunderx2t99.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
sgemm_kernel_sve_v1x8.S add sgemm kernel and copy functions for sgemm and ssymm 2021-11-28 18:12:47 +01:00
sgemm_kernel_sve_v2x8.S sgemm v2x8 SVE kernel 2021-12-05 18:47:29 +01:00
sgemm_ncopy_4.S change line endings from CRLF to LF 2022-11-16 22:24:01 +01:00
sgemm_ncopy_8.S sgemm copy source init 2020-06-04 02:10:45 +08:00
sgemm_ncopy_sve_v1.c add sgemm kernel and copy functions for sgemm and ssymm 2021-11-28 18:12:47 +01:00
sgemm_tcopy_8.S sgemm copy source init 2020-06-04 02:10:45 +08:00
sgemm_tcopy_16.S change line endings from CRLF to LF 2022-11-16 22:24:01 +01:00
sgemm_tcopy_sve_v1.c add sgemm kernel and copy functions for sgemm and ssymm 2021-11-28 18:12:47 +01:00
strmm_kernel_4x4.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
strmm_kernel_8x8.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
strmm_kernel_8x8_cortexa53.S use general register to speedup 2020-05-20 22:26:58 +08:00
strmm_kernel_16x4.S Move temp to x21 to leave x18 unused (reserved on OSX) 2021-09-17 09:28:19 +02:00
strmm_kernel_sve_v1x8.S strmm sve v1x8 kernel 2021-12-05 14:03:08 +01:00
sum.S Add ARM64 implementations of ?sum 2019-03-30 22:13:36 +01:00
swap.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
swap_thunderx2t99.S THUNDERX2T99: Add optimized S/D/C/Z SWAP Implementations 2017-02-03 03:55:06 -08:00
symm_lcopy_sve.c add sgemm kernel and copy functions for sgemm and ssymm 2021-11-28 18:12:47 +01:00
symm_ucopy_sve.c add sgemm kernel and copy functions for sgemm and ssymm 2021-11-28 18:12:47 +01:00
trmm_lncopy_sve_v1.c trmm sve copy functions for single precision 2021-11-29 21:25:05 +01:00
trmm_ltcopy_sve_v1.c trmm sve copy functions for single precision 2021-11-29 21:25:05 +01:00
trmm_uncopy_sve_v1.c trmm sve copy functions for single precision 2021-11-29 21:25:05 +01:00
trmm_utcopy_sve_v1.c trmm sve copy functions for single precision 2021-11-29 21:25:05 +01:00
trsm_kernel_LN_sve.c add sve ztrsm 2022-01-15 22:27:25 +01:00
trsm_kernel_LT_sve.c add sve ztrsm 2022-01-15 22:27:25 +01:00
trsm_kernel_RN_sve.c add sve ztrsm 2022-01-15 22:27:25 +01:00
trsm_kernel_RT_sve.c add sve ztrsm 2022-01-15 22:27:25 +01:00
trsm_lncopy_sve.c add sve ztrsm 2022-01-15 22:27:25 +01:00
trsm_ltcopy_sve.c add sve ztrsm 2022-01-15 22:27:25 +01:00
trsm_uncopy_sve.c add sve ztrsm 2022-01-15 22:27:25 +01:00
trsm_utcopy_sve.c add sve ztrsm 2022-01-15 22:27:25 +01:00
zamax.S Fix the functional bugs for zamax. 2020-03-09 15:36:50 +08:00
zasum.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
zasum_thunderx2t99.c Fixed a few more unnecessary calls to num_cpu_avail. 2018-06-11 10:17:16 +01:00
zaxpy.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
zdot.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
zdot_thunderx2t99.c Eliminate uses of CREAL on left-hand side of assignments 2022-07-05 00:01:09 +02:00
zgemm_kernel_4x4.S move alpha to x19/x20 to leave x18 unused for OSX 2021-09-17 09:42:17 +02:00
zgemm_kernel_4x4_cortexa53.c MOD: add comments to a53 zgemm kernel 2021-11-25 22:48:48 +08:00
zgemm_kernel_4x4_thunderx2t99.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
zgemm_kernel_sve_v1x4.S fix zgemm kernel 2021-12-29 11:42:04 +01:00
zgemm_ncopy_sve_v1.c modify sve zgemmcopy kernels 2022-01-05 09:07:28 +01:00
zgemm_tcopy_sve_v1.c modify sve zgemmcopy kernels 2022-01-05 09:07:28 +01:00
zgemv_n.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
zgemv_t.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
zhemm_ltcopy_sve.c combine zchemm into single file 2022-01-05 14:42:37 +01:00
zhemm_utcopy_sve.c combine zchemm into single file 2022-01-05 14:42:37 +01:00
znrm2.S Remove automatic label postfixes from macro included only once 2019-10-08 08:37:50 +02:00
zrot.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
zscal.S ARM64: Convert all labels to local labels 2017-10-24 11:40:05 +00:00
zsum.S Add ARM64 implementations of ?sum 2019-03-30 22:13:36 +01:00
zsymm_lcopy_sve.c sve copy functions for cgemm chemm zsymm 2022-01-05 09:12:22 +01:00
zsymm_ucopy_sve.c sve copy functions for cgemm chemm zsymm 2022-01-05 09:12:22 +01:00
ztrmm_kernel_4x4.S Move alphaI to x22 to leave x18 unused (reserved on OSX) 2021-09-17 09:53:18 +02:00
ztrmm_kernel_sve_v1x4.S fix sve ztrmm kernel 2022-01-04 14:42:07 +01:00
ztrmm_lncopy_sve_v1.c ztrmm sve copy functions 2022-01-04 14:40:59 +01:00
ztrmm_ltcopy_sve_v1.c ztrmm sve copy functions 2022-01-04 14:40:59 +01:00
ztrmm_uncopy_sve_v1.c ztrmm sve copy functions 2022-01-04 14:40:59 +01:00
ztrmm_utcopy_sve_v1.c ztrmm sve copy functions 2022-01-04 14:40:59 +01:00
ztrsm_lncopy_sve.c add sve ztrsm 2022-01-15 22:27:25 +01:00
ztrsm_ltcopy_sve.c fix ztrsm lt/ut copy 2022-01-16 21:39:57 +01:00
ztrsm_uncopy_sve.c add sve ztrsm 2022-01-15 22:27:25 +01:00
ztrsm_utcopy_sve.c fix ztrsm lt/ut copy 2022-01-16 21:39:57 +01:00