Commit Graph

230 Commits

Author SHA1 Message Date
Bine Brank 39ab219704 sve copy functions for cgemm chemm zsymm 2022-01-05 09:12:22 +01:00
Bine Brank 18102ae8c3 add cgemm ctrmm sve kernels 2022-01-05 09:09:18 +01:00
Bine Brank 87537b8c55 modify sve zgemmcopy kernels 2022-01-05 09:07:28 +01:00
Bine Brank d30157d891 update configuration of kernels for A64FX and ARMV8SVE 2022-01-05 09:00:54 +01:00
Bine Brank 2e2c02b762 fix sve ztrmm kernel 2022-01-04 14:42:07 +01:00
Bine Brank 68c414d3a6 ztrmm sve copy functions 2022-01-04 14:40:59 +01:00
Bine Brank ce329ab686 add sve zhemm copy routines 2022-01-03 15:56:05 +01:00
Bine Brank 0140373802 add sve ztrmm 2022-01-02 19:15:33 +01:00
Bine Brank f7b6912868 ztrmm sve copy kernels 2021-12-30 21:00:16 +01:00
Bine Brank 40b14e4957 fix zgemm kernel 2021-12-29 11:42:04 +01:00
Bine Brank 6ec4aab875 zgemm sve copy routines 2021-12-26 17:05:46 +01:00
Bine Brank 878064f394 sve zgemm kernel 2021-12-26 08:44:05 +01:00
Bine Brank 683a7548bf added macros for sve zgemm kernels 2021-12-25 11:46:41 +01:00
Bine Brank e3c9947c0f prepare kernel for sve zgemm 2021-12-21 11:19:27 +01:00
Jia-Chen b610d2de37 optimize cgemm on ARM cortex A53 & cortex A55 2021-12-12 17:22:52 +08:00
Bine Brank a8f62a347b fix UNROLL_MN and add to targets for SVE 2021-12-11 16:37:23 +01:00
Bine Brank a1fea1fe2a sgemm v2x8 SVE kernel 2021-12-05 18:47:29 +01:00
Bine Brank abe1ce3434 strmm sve v1x8 kernel 2021-12-05 14:03:08 +01:00
Bine Brank 0de36f7b5c trmm sve copy fucntions for single precision 2021-11-29 21:25:05 +01:00
Bine Brank 86ae89bf33 add sgemm kernel and copy functions for sgemm and ssymm 2021-11-28 18:12:47 +01:00
Martin Kroeker 454edd741c
Merge pull request #3425 from binebrank/arm_sve_dgemm
Add dgemm kernel for arm64 SVE
2021-11-26 16:14:55 +01:00
Jia-Chen 5c1cd5e0c2 MOD: add comments to a53 zgemm kernel 2021-11-25 22:48:48 +08:00
Jia-Chen 9f59b19fcd MOD: optimize zgemm on cortex-A53/cortex-A55 2021-11-24 21:51:45 +08:00
Bine Brank 531a28b6a0 removed unused code (compiler warnings) 2021-11-22 10:12:34 +01:00
Bine Brank 9b9cb90bb1 modify Makefile for SVE copy 2021-11-22 09:54:20 +01:00
Bine Brank b58d4f31ab some clean-up & commentary 2021-11-21 14:56:27 +01:00
Bine Brank e6ed4be02e symm SVE copy rutines 2021-11-20 16:35:29 +01:00
Jia-Chen 302f22693a MOD: optimize normal DGEMM on ARMV8 cortex-A53 & cortex-A55 2021-11-18 21:14:43 +08:00
Bine Brank 3c7eed0e53 add remaining trmm copy rutines for SVE 2021-11-14 16:00:10 +01:00
Bine Brank 7d996b1c36 dtrmm_utcopy sve function 2021-11-13 18:48:53 +01:00
Bine Brank ab7917910d add v2x8 kernel + fix sve dtrmm 2021-11-07 20:37:51 +01:00
Bine Brank 7093372e32 add ARMV8SVE target 2021-11-01 22:53:21 +01:00
Bine Brank a8fbdbac34 fix sve dgemm kernel + sve dtrmm 2021-10-31 10:24:25 +01:00
Bine Brank 746b4f0f17 added SVE ncopy and tcopy 2021-10-30 12:11:44 +02:00
Bine Brank 1a10d3e09d add sve dgemm prototype 2021-10-27 16:37:18 +02:00
Martin Kroeker 22bf5c27ba
Add basic support for the Fujitsu A64FX (#3415)
* Add initial support for Fujitsu A64FX as generic ARMV8
2021-10-18 15:00:19 +02:00
Martin Kroeker 8c20ca345a
Use Neoverse's current mix of ThunderX2 kernels for Vortex as well 2021-10-06 11:06:43 +02:00
Martin Kroeker 90cc944625
Move alphaI to x22 to leave x18 unused (reserved on OSX) 2021-09-17 09:53:18 +02:00
Martin Kroeker 590fbff06e
move alpha to x19/x20 to leave x18 unused for OSX 2021-09-17 09:42:17 +02:00
Martin Kroeker 380940271b
Move temp to x21 to leave x18 unused (reserved on OSX) 2021-09-17 09:28:19 +02:00
Martin Kroeker 7d75177446
Move temp to x21 to leave x18 unused (reserved on OSX) 2021-09-17 09:24:11 +02:00
Martin Kroeker 0a4ac4b585
Use x21 for I to leave x18 unused (reserved on OSX) 2021-09-17 09:19:51 +02:00
Martin Kroeker 7d4a221579
Remove unused TEMP2 and reshuffle to leave x18 unused (reserved on OSX) 2021-09-17 09:18:25 +02:00
User User-User 39ef0880ae copy conf 2021-06-19 21:49:58 +02:00
Gilles Gouaillardet 9d292d37b2 arm64: add the missing d9 register to the clobber list
Refs. numpy/numpy#18422

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2021-06-14 17:01:28 +09:00
CodesWithWolves d2bda3b56a Remove Unnecessary/Erroneous Reads In sgemm_tcopy_16.S COPY1x8 Macro
There appears to have been some code leak when copying from the COPY2x8
macro above where we're reading 8 bytes into d4-d7 directly after
reading 4 bytes into s4-s7. These 32 bytes in d4-7 are unused and can
possibly overrun the boundary of allocated memory -- Valgrind detected
this which is what dragged my attention to it for a 128,1 copy.

Additionally, there is no need to update the addresses stored in A0-A7
as the only possible paths after running this macro will overwrite A0-7
if looping to the next 8 rows, or overwrite A0-3 if moving to 4 rows --
in which case A4-7 are unused.
2021-03-31 15:44:25 -04:00
Martin Kroeker b716c0ef01
Add workaround for NVIDIA HPC 2021-01-12 16:51:35 +01:00
Martin Kroeker 2efa3b70dc
Add workaround for NVIDIA HPC 2021-01-12 16:49:39 +01:00
Martin Kroeker 49959d4f1c
Add workaround for NVIDIA HPC 2021-01-12 16:47:15 +01:00
Martin Kroeker 0f27a03607
Add workaround for NVIDIA HPC mishandling of the asm DOT kernels 2021-01-12 16:39:35 +01:00
Martin Kroeker c2a8ebfe69
Add workaround for NVIDIA HPC mishandling of the asm DOT kernels 2021-01-12 16:38:51 +01:00
Ashwin Sekhar T K 1b2508362b arm64: Fix nrm2 for input vectors with Inf
Fix double precision nrm2 kernels returning NaN when the
input vectors contain Inf/-Inf.
2021-01-01 02:49:37 -08:00
Martin Kroeker 8631e2976a
Temporarily revert to the old nrm2 kernels 2020-12-21 07:45:13 +01:00
Martin Kroeker 2768bc1764
Temporarily revert to the old nrm2 kernels 2020-12-21 07:42:51 +01:00
Martin Kroeker 6f4698ee1f
Temporarily revert to the old nrm2 kernel 2020-12-21 07:41:18 +01:00
Martin Kroeker e1b7123bbe
Merge pull request #2867 from Qiyu8/usimd-floatdot
Optimize the performance of dot by using universal intrinsics in X86/ARM
2020-10-10 12:10:25 +02:00
User User-User d2333e7842 aarch64 fix std=c18 compilation 2020-10-03 18:00:34 +03:00
Qiyu8 60e6c68e38 Adapt ARM architect 2020-09-29 16:36:14 +08:00
Martin Kroeker 775a87242d
Rename KERNEL.SILICON to KERNEL.VORTEX 2020-09-03 08:44:20 +02:00
Martin Kroeker 80794fe8fd
Create KERNEL.SILICON 2020-09-02 22:56:58 +02:00
Ashwin Sekhar T K 4e1be0e481 ARM64: Add THUNDERX3T110 Target 2020-07-26 23:32:24 -07:00
ZhangDanfeng bc6fd20a40 fix INIT8x4
Signed-off-by: ZhangDanfeng <467688405@qq.com>
2020-06-10 01:01:16 +08:00
ZhangDanfeng 9b7877ccf1 sgemm copy source init
Signed-off-by: ZhangDanfeng <467688405@qq.com>
2020-06-04 02:10:45 +08:00
ZhangDanfeng f82fa802d1 Insert prefetch
Signed-off-by: ZhangDanfeng <467688405@qq.com>
2020-06-04 02:08:48 +08:00
张丹枫 9df79ae9a3 update sgemm and strmm kernel selecting strategy 2020-05-20 22:26:58 +08:00
张丹枫 a1fc6041cd use general register to speedup 2020-05-20 22:26:58 +08:00
张丹枫 edb423d772 align general register using to strmm_kernel_8x8 2020-05-20 22:26:58 +08:00
zhangdanfeng 0e6eb8c247 sgemm kernel use sgemm_kernel_8x8_cortexa53
Signed-off-by: zhangdanfeng <zhangdanfeng@cloudwalk.cn>
2020-05-20 22:26:58 +08:00
zhangdanfeng d475db29c6 optimized for cortex-a53
Signed-off-by: zhangdanfeng <zhangdanfeng@cloudwalk.cn>
2020-05-20 22:26:58 +08:00
Ashwin Sekhar T K 8353cb245a ARM64: Improve DAXPY for ThunderX2
Improve performance of DAXPY for ThunderX2
when the vector fits in L1 Cache.
2020-05-07 09:22:50 -07:00
Martin Kroeker 144be81ca1
fix initialization to zero in the NEON SGEMM_BETA kernel as well 2020-03-31 16:53:56 +02:00
Martin Kroeker 07cdd5d05c
Fix zero initialization for beta=0 case
use immediate initialization instead of multiplication in case register content is a NaN
2020-03-31 00:21:02 +02:00
s00548429 bec7923a0d Fix the functional bugs for zamax. 2020-03-09 15:36:50 +08:00
Ali Saidi c623a965f9 Add Neoverse-N1 core
The implementation is a hybird of the ARMV8 one with some of the
improved TX2 rountines along with specifying -march=v8.2-a
2020-02-29 03:22:04 +00:00
Martin Kroeker e57b11acca
Add preliminary support for EMAG8180 2020-02-19 19:00:28 +01:00
Martin Kroeker 456ee2e1f0
Merge pull request #2357 from chenxuqiang/dgemm_beta_zero
kernel/arm64/dgemm_beta.S: add beta == zero branch
2020-01-02 22:28:36 +01:00
shengyang 80db5f11e1 update 2020-01-02 11:01:57 +08:00
chenxuqiang 52de4cc8fd kernel/arm64/dgemm_beta.S: add beta == zero branch
added beta == zero branch, and no need to load C matrix.

Signed by: Xuqiang Chen <chenxuqiang3@hisilicon.com>
2020-01-01 21:50:45 -05:00
Martin Kroeker 44028581cc
Merge pull request #2355 from Zeyiii/dev-zeyi2
Use arm neon instructions to optimize sgemm_beta operation
2020-01-01 22:14:16 +01:00
Martin Kroeker 86ab939936
Merge pull request #2354 from ZuoQ3/develop
[WIP] Use arm neon instructions to optimize tcopy operation
2020-01-01 22:13:37 +01:00
shengyang 8d84403205 Use arm neon instructions to optimize ncopy operation
modified:   KERNEL.ARMV8
	modified:   KERNEL.TSV110
	new file:   sgemm_ncopy_4.S
2019-12-31 17:06:35 +08:00
w00421467 0833a4846a Use arm neon instructions to optimize sgemm_beta operation 2019-12-31 10:42:03 +08:00
zq 50f7fc1401 [WIP] Use arm neon instructions to optimize tcopy operation 2019-12-31 10:21:23 +08:00
w00421467 3ccf8885ac prefetching for dgemm_beta 2019-12-30 11:45:49 +08:00
w00421467 b7cc69ee62 declare DGEMM_BETA in KERNEL.ARMV8 rather than the generic KERNEL 2019-12-20 10:11:50 +08:00
w00421467 aeef942c4f use arm neon instructions to optimize gemm beta operation 2019-12-17 10:00:13 +08:00
Martin Kroeker 85ccdce8c4
Remove the IOS fallbacks to generic C kernels 2019-10-25 23:02:37 +02:00
Martin Kroeker a448884a63
Remove automatic label postfixes from macro included only once 2019-10-08 08:37:50 +02:00
Martin Kroeker 3a2df19db6
Fix accidental duplication of jump instruction 2019-10-08 08:09:26 +02:00
Martin Kroeker 56837e9d92
Make local labels in macro compatible with the xcode assembler
... which does not perform the automatic numbering on instantiation that the _@ suffix signifies
2019-10-04 14:53:23 +02:00
Martin Kroeker 3e3ccb9011
Add ARM64 implementations of ?sum
as trivial copies of the respective ?asum kernels with the fabs calls removed
2019-03-30 22:13:36 +01:00
maomao194313 783ba8058f
HiSilicon tsv110 CPUs optimization branch
add HiSilicon tsv110 CPUs  optimization branch
2019-03-04 16:30:50 +08:00
Martin Kroeker 7639f2e1f0
Rewrite the conditional for OSX to fix cmake parsing on others
The Makefile variable parser in utils.cmake currently does not handle conditionals. Having the definitions for non-OSX last will at least make cmake builds work again on non-OSX platforms.
2018-12-06 14:04:27 +01:00
Martin Kroeker 6ba30e270d
Fix typo that broke CNRM2 on ARMV8 since 0.3.0
must have happened in my #1449
2018-12-06 13:42:25 +01:00
Renato Golin 310ea55f29 Simplifying ARMv8 build parameters
ARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode
(which is not right because TX2 is ARMv8.1) as well as requiring a few
redundancies in the defines, making it harder to maintain and understand
what core has what. A few other minor issues were also fixed.

Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX,
ThunderX2, and XGene.

Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester.

A summary:
 * Removed TX2 code from ARMv8 build, to make sure it is compatible with
   all ARMv8 cores, not just v8.1. Also, the TX2 code has actually
   harmed performance on big cores.
 * Commoned up ARMv8 architectures' defines in params.h, to make sure
   that all will benefit from ARMv8 settings, in addition to their own.
 * Adding a few more cores, using ARMv8's include strategy, to benefit
   from compiler optimisations using mtune. Also updated cache
   information from the manuals, making sure we set good conservative
   values by default. Removed Vulcan, as it's an alias to TX2.
 * Auto-detecting most of those cores, but also updating the forced
   compilation in getarch.c, to make sure the parameters are the same
   whether compiled natively or forced arch.

Benefits:
 * ARMv8 build is now guaranteed to work on all ARMv8 cores
 * Improved performance for ARMv8 builds on some cores (A72, Falkor,
   ThunderX1 and 2: up to 11%) over current develop
 * Improved performance for *all* cores comparing to develop branch
   before TX2's patch (9% ~ 36%)
 * ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than
   current develop's branch and 8% faster than deveop before tx2 patches

Issues:
 * Regression from current develop branch for A53 (-12%) and A57 (-3%)
   with ARMv8 builds, but still faster than before TX2's commit (+15%
   and +24% respectively). This can be improved with a simplification of
   TX2's code, to be done in future patches. At least the code is
   guaranteed to be ARMv8.0 now.

Comments:
 * CortexA57 builds are unchanged on A57 hardware from develop's branch,
   which makes sense, as it's untouched.
 * CortexA72 builds improve over A57 on A72 hardware, even if they're
   using the same includes due to new compiler tunning in the makefile.
2018-11-19 16:41:49 +00:00
Ashwin Sekhar T K d5aeff636f ARM64: Enable DYNAMIC_ARCH
Enable DYNAMIC_ARCH feature on ARM64. This patch uses the cpuid
feature in linux kernel to detect the core type at runtime
(https://www.kernel.org/doc/Documentation/arm64/cpu-feature-registers.txt).

If this feature is missing in kernel, then the user should use the
OPENBLAS_CORETYPE env variable to select the desired core type.
2018-10-22 01:49:35 -07:00
Ashwin Sekhar T K d50abc8903 ARM64: Move parameters from parameter.c to param.h
Remove the runtime setting of P, Q, R parameters for
targets ARMV8, THUNDERX2T99. Instead set them as constants
in param.h at compile time.
2018-10-22 01:45:51 -07:00
Ashwin Sekhar T K 351a0c777c ARM64: Remove XGENE1 references
Remove XGENE1 target as the implementation for the
same is incomplete. Moreover whoever wishes to use
on XGENE1 can use the generic ARMV8 target as there
are no XGENE1 specific optimizations in OpenBLAS.
2018-10-22 01:45:51 -07:00
Ashwin Sekhar T K 21f46a1cf2 ARM64: Use THUNDERX2T99 Neon Kernels for ARMV8
Currently the generic ARMV8 target uses C implementations
for many routines. Replace these with the neon implementations
written for THUNDERX2T99 target which are upto 6x faster for
certain routines.
2018-10-17 10:44:37 -07:00
Ashwin Sekhar T K caf339412f ARM64: Remove dependency of THUNDERX2T99 Makefile on CORTEXA57 Makefile 2018-10-17 08:02:40 -07:00