Commit Graph

245 Commits

Author SHA1 Message Date
Martin Kroeker 7cfd433d0c
revert the C/Z NRM2 kernels to the base NEON kernel as well 2024-04-12 15:34:04 +02:00
Martin Kroeker 441c81026e
Add support for Cortex-A76 2024-04-02 19:41:44 +02:00
Martin Kroeker 9ead81bd39
Revert S/DNRM2 to the base NEON kernel to fix precision loss 2024-04-02 15:59:20 +02:00
Martin Kroeker 552c521353
remove another early exit for incx < 0 2024-03-12 18:49:27 +01:00
Martin Kroeker ed532dc75b
remove another early exit for incx < 0 2024-03-12 18:47:00 +01:00
Martin Kroeker e41d01bad9
remove early exit on negative inc_x 2024-03-11 22:53:54 +01:00
Martin Kroeker 02a025f9c1
remove early exit on negative inc_x 2024-03-11 22:52:18 +01:00
Martin Kroeker 7d506984fa
fix assignment of default CSUM kernel 2024-02-25 17:57:11 +01:00
Martin Kroeker 12787775d9
add csum/zsum kernels (trivially derived from the asum ones)s) 2024-02-25 17:55:36 +01:00
Martin Kroeker c9df62e883
Fix handling of NAN 2024-01-07 17:49:40 +01:00
Chris Sidebottom ecae1389df Reduce duplication in kernel definitions
These files are exactly the same, so I believe we can reduce these files
down. Other files require a slightly more complex unpicking.
2023-12-23 12:39:53 +00:00
Chris Sidebottom 60e66725e4 Use numeric labels to allow repeated inlining 2023-12-19 13:11:06 +00:00
Chris Sidebottom 7a4fef4f60 Tweak SVE dot kernel
This changes the SVE dot kernel to only predicate when necessary as well
as streamlining the assembly a bit. The benchmarks seem to indicate this
can improve performance by ~33%.
2023-12-19 12:08:54 +00:00
Martin Kroeker 3bfa4d4dcc
Fix outdated SVE kernel definitions for Cortex cpus by aliasing to ARMV8SVE 2023-11-03 14:55:31 +01:00
Martin Kroeker e7d05402e0
Fix up S/D GEMM copy function definitions after #4009 2023-10-12 14:24:53 +02:00
Martin Kroeker fc8894dd98
Workaround miscompilation by NVIDIA nvc 2023-08-26 00:30:17 +02:00
Martin Kroeker 5720fa02c5
Merge pull request #4168 from Mousius/sve-zgemm-cgemm
Use SVE zgemm/cgemm on Arm(R) Neoverse(TM) V1 core
2023-07-27 17:41:45 +02:00
Chris Sidebottom 84a268b6ca Use SVE zgemm/cgemm on Arm(R) Neoverse(TM) V1 core
This patch removes the prefetches from cgemm/zgemm which improves the performance similar to sgemm/dgemm did in #3868, this means I'm happy to enable this on any applicable cores.

I also replicated the unrolling the copies from sgemm and dgemm.
2023-07-27 14:12:20 +01:00
Chris Sidebottom 730ca04b48 Fix ZHEMM copy for SVE
Whilst disambiguating whilelt, I inadvertantly used the wrong datatype
for offsets, which can be negative. This rectifies that.
2023-07-27 13:27:28 +01:00
Martin Kroeker 849c8806b8
Merge pull request #4161 from Mousius/non-sve-kernels
Use latest non-SVE kernels in ARMV8SVE
2023-07-26 15:49:40 +02:00
Chris Sidebottom 24586bc4ff Disambiguate whilelt 2023-07-25 20:15:44 +01:00
Chris Sidebottom aea2a4622b Use latest non-SVE kernels in ARMV8SVE
These are generally better and, in some cases, include threading which helps in the cores we're targeting here.
2023-07-25 14:12:26 +01:00
martin-frbg 7976deff80 Fix file permissions (issue 4095) 2023-07-23 20:37:07 +02:00
Martin Kroeker 3d31191b0f
Work around Clang failing to disambiguate SVE intrinsics and add AppleClang crossbuild to MacOS/arm64 DYNAMIC_ARCH in AzureCI (#4140)
* Add AppleClang crossbuild to MacOS/arm64 DYNAMIC_ARCH

* add casts to disambiguate svwhilelt for clang
2023-07-14 11:06:48 +02:00
Martin Kroeker 72caceb324
Merge pull request #4009 from Mousius/sve-gemm
Use SVE kernel for SGEMM/DGEMM on Arm(R) Neoverse(TM) V1
2023-04-22 13:56:45 +02:00
Chris Sidebottom ec334e69dc Use SVE kernel for SGEMM/DGEMM on Arm(R) Neoverse(TM) V1
This re-spins #3869 with some additional copy unrolling which helps maintain SYRK performance.

After #3868, the SVE kernels represent a pretty good boost.

This re-uses ARMV8SVE as a base and I'm going to incrementally move everything to use ARMV8SVE in additional patches (as well as fix up anything that's not already in ARMV8SVE).
2023-04-17 17:38:42 +01:00
Martin Kroeker 44164e3a3d
revert "move alpha out of register 18" (out of PR scope, no SVE on Apple hw) 2023-04-17 14:23:13 +02:00
Martin Kroeker 8be68fa7f4
move declaration of sca to really keep the compiler from throwing it out (for now) 2023-04-15 12:02:39 +02:00
Martin Kroeker 3727672a74
Improve workaround and keep compilers from optimizing it out 2023-04-13 18:07:52 +02:00
Martin Kroeker 108a21e47a
Move ALPHA out of register 18 (reserved on OSX) 2023-04-13 18:05:14 +02:00
Martin Kroeker 0b1acb0ba3
Move ALPHA_I out of register 18 (reserved on OSX) 2023-04-13 18:03:35 +02:00
Martin Kroeker c7bbad09ad
Move ALPHA_I out of register 18 (reserved on OSX) 2023-04-13 18:00:47 +02:00
Martin Kroeker cda29633a3
move ALPHA_I out of register 18 (reserved on OSX) 2023-04-13 17:59:48 +02:00
Martin Kroeker 09ace3cf23
Merge pull request #3846 from lilh9598/sbgemm_opt
Improve the performance of sbgemm_tcopy on neoversen2
2023-03-26 19:04:57 +02:00
Chris Sidebottom 1361229291 Remove prefetches from SVE kernels
This is a precursor to enabling the SVE kernels for Arm(R) Neoverse(TM)
V1 which has 256-bit SVE. Testing revealed that the SVE kernel was
actually worse in some cases than the existing kernel which seemed odd -
removing these prefetches the underlying architecture seems to do a better job
😸
2022-12-16 14:43:09 +00:00
lilianhuang 729af6406f bugfix for sbgemm_ncopy_8_neoversen2 2022-12-05 05:10:18 -05:00
Chris Sidebottom eea006a688 Wrap SVE header with __has_include check 2022-12-01 12:07:55 +00:00
Chris Sidebottom fd4f52c797 Add SVE implementation for sdot/ddot
This adds an SVE implementation to sdot/ddot when available, falling back to the previous Advanced SIMD kernel where there's no SVE implementation for the kernel.

All the targets were essentially treating `dot_thunderx2t99.c` as the Advanced SIMD implementation so I've renamed it to better fit with the feature detection.
2022-12-01 12:07:50 +00:00
lilianhuang fdac8a97c1 Add sbgemm_ncopy_8 and sbgemm_tcopy_4 2022-11-29 04:46:14 -05:00
lilianhuang 135718eafc Improve the performance of sbgemm_tcopy on neoversen2 2022-11-28 04:17:54 -05:00
Chris Sidebottom 4f7b77e08a Remove unnecessary instructions from Advanced SIMD dot
The existing kernel was issuing extra instructions to organise the arguments into the same registers they would usually be in and similarly to put the result into the appropriate register.

This has an impact on smaller sized dots and seemed like a quick fix
2022-11-25 16:19:03 +00:00
Martin Kroeker 1688c7da43
change line endings from CRLF to LF 2022-11-16 22:24:01 +01:00
Honglin Zhu 79066b6bf3 Change file name to match the norm and delete useless code. 2022-10-28 17:09:39 +08:00
Honglin Zhu b00d5b9746 New sbgemm implementation for Neoverse N2
1. Use UZP instructions but not gather load and scatter store instructions to get lower latency.
    2. Padding k to a power of 4.
2022-10-26 15:09:41 +08:00
Martin Kroeker e12d474780
Eliminate uses of CREAL on left-hand side of assignments 2022-07-05 00:01:09 +02:00
Martin Kroeker 9e29598575
workaround fault with ssq=inf,scale=0 2022-07-02 23:47:17 +02:00
Honglin Zhu 123e0dfb62 Neoverse N2 sbgemm:
1. Modify the algorithm to resolve multithreading failures
    2. No memory allocation in sbgemm kernel
    3. Optimize when alpha == 1.0f
2022-06-29 10:14:21 +08:00
Honglin Zhu bc3728475f format code 2022-06-29 10:14:21 +08:00
Honglin Zhu 55d686d41e neoverse n2 sbgemm:
implement ncopy tcopy kernel_8x4
2022-06-29 10:14:21 +08:00
Honglin Zhu 04593bb27c neoverse n2 sbgemm: init file 2022-06-29 10:14:21 +08:00