Commit Graph

264 Commits

Author SHA1 Message Date
Chris Sidebottom ba2e989c67 Add accumulators to AArch64 GEMV Kernels
This helps to reduce values going missing as we accumulate.
2024-07-31 13:09:14 +01:00
Martin Kroeker fb7c53c5e5
Merge pull request #4807 from martin-frbg/scalfixes
[WIP]Make NAN handling in the SCAL kernels depend on the dummy2 parameter
2024-07-25 23:42:50 +02:00
Martin Kroeker a4e56e0452
Merge pull request #4806 from Mousius/small-gemm
Small GEMM for AArch64 with SVE
2024-07-25 21:50:04 +02:00
yamazaki-mitsufumi 88caf02f62 Fix ambiguous error on Mac OS 2024-07-25 22:43:13 +09:00
Chris Sidebottom ea4ab3b310 Better header guard around bridge 2024-07-20 14:39:57 +01:00
Chris Sidebottom 7311d93016 Unroll TT further 2024-07-19 17:51:20 +01:00
Chris Sidebottom a9edddb695 Unroll TN further 2024-07-18 20:04:15 +01:00
Chris Sidebottom 9984c5ce9d Clean up k2 removal more and unroll SGEMM more 2024-07-18 18:35:43 +01:00
Chris Sidebottom b1c9fafabb Remove k2 loop from DGEMM TN and use a more conservative heuristic for SGEMM 2024-07-18 17:37:18 +01:00
Martin Kroeker eb4879e04c
make NAN handling depend on the dummy2 parameter 2024-07-17 23:24:19 +02:00
iha fujitsu 0985fdc82b A64FX: Add support for SVE to SGEMV/DGEMV kernels. 2024-07-16 17:31:33 +09:00
Martin Kroeker 3677b3886c
Merge pull request #4702 from bashimao/detect-nv-grace
Correctly detect ARM Neoverse V2 CPUs.
2024-06-30 22:48:48 +02:00
Chris Sidebottom 8c472ef7e3 Further tweak small GEMM for AArch64 2024-06-24 10:47:47 +01:00
Martin Kroeker a2ee4b1966
Merge branch 'OpenMathLib:develop' into issue4728 2024-06-21 09:35:56 +02:00
Martin Kroeker 3ec59922b6
Add a clobber list to fix utest errors seen with gcc13 on Apple M 2024-06-20 16:19:32 +02:00
Martin Kroeker 3d8054fb16
add clobber list 2024-06-14 22:07:44 +02:00
Martin Kroeker c7cacd9b38
disable the shortcut for da=0 to ensure proper handling of INF and NAN 2024-06-07 13:48:56 +02:00
Matthias Langer 0050a9660b Correctly detect ARM Neoverse V2 CPUs. 2024-05-16 09:59:52 +00:00
Martin Kroeker 7cfd433d0c
revert the C/Z NRM2 kernels to the base NEON kernel as well 2024-04-12 15:34:04 +02:00
Martin Kroeker 441c81026e
Add support for Cortex-A76 2024-04-02 19:41:44 +02:00
Martin Kroeker 9ead81bd39
Revert S/DNRM2 to the base NEON kernel to fix precision loss 2024-04-02 15:59:20 +02:00
Martin Kroeker 552c521353
remove another early exit for incx < 0 2024-03-12 18:49:27 +01:00
Martin Kroeker ed532dc75b
remove another early exit for incx < 0 2024-03-12 18:47:00 +01:00
Martin Kroeker e41d01bad9
remove early exit on negative inc_x 2024-03-11 22:53:54 +01:00
Martin Kroeker 02a025f9c1
remove early exit on negative inc_x 2024-03-11 22:52:18 +01:00
Chris Sidebottom 7a6fa699f2 Small GEMM for AArch64
This is a fairly conservative addition of small matrix kernels using
SVE.
2024-03-04 15:48:47 +00:00
Martin Kroeker 7d506984fa
fix assignment of default CSUM kernel 2024-02-25 17:57:11 +01:00
Martin Kroeker 12787775d9
add csum/zsum kernels (trivially derived from the asum ones)s) 2024-02-25 17:55:36 +01:00
Martin Kroeker c9df62e883
Fix handling of NAN 2024-01-07 17:49:40 +01:00
Chris Sidebottom ecae1389df Reduce duplication in kernel definitions
These files are exactly the same, so I believe we can reduce these files
down. Other files require a slightly more complex unpicking.
2023-12-23 12:39:53 +00:00
Chris Sidebottom 60e66725e4 Use numeric labels to allow repeated inlining 2023-12-19 13:11:06 +00:00
Chris Sidebottom 7a4fef4f60 Tweak SVE dot kernel
This changes the SVE dot kernel to only predicate when necessary as well
as streamlining the assembly a bit. The benchmarks seem to indicate this
can improve performance by ~33%.
2023-12-19 12:08:54 +00:00
Martin Kroeker 3bfa4d4dcc
Fix outdated SVE kernel definitions for Cortex cpus by aliasing to ARMV8SVE 2023-11-03 14:55:31 +01:00
Martin Kroeker e7d05402e0
Fix up S/D GEMM copy function definitions after #4009 2023-10-12 14:24:53 +02:00
Martin Kroeker fc8894dd98
Workaround miscompilation by NVIDIA nvc 2023-08-26 00:30:17 +02:00
Martin Kroeker 5720fa02c5
Merge pull request #4168 from Mousius/sve-zgemm-cgemm
Use SVE zgemm/cgemm on Arm(R) Neoverse(TM) V1 core
2023-07-27 17:41:45 +02:00
Chris Sidebottom 84a268b6ca Use SVE zgemm/cgemm on Arm(R) Neoverse(TM) V1 core
This patch removes the prefetches from cgemm/zgemm which improves the performance similar to sgemm/dgemm did in #3868, this means I'm happy to enable this on any applicable cores.

I also replicated the unrolling the copies from sgemm and dgemm.
2023-07-27 14:12:20 +01:00
Chris Sidebottom 730ca04b48 Fix ZHEMM copy for SVE
Whilst disambiguating whilelt, I inadvertantly used the wrong datatype
for offsets, which can be negative. This rectifies that.
2023-07-27 13:27:28 +01:00
Martin Kroeker 849c8806b8
Merge pull request #4161 from Mousius/non-sve-kernels
Use latest non-SVE kernels in ARMV8SVE
2023-07-26 15:49:40 +02:00
Chris Sidebottom 24586bc4ff Disambiguate whilelt 2023-07-25 20:15:44 +01:00
Chris Sidebottom aea2a4622b Use latest non-SVE kernels in ARMV8SVE
These are generally better and, in some cases, include threading which helps in the cores we're targeting here.
2023-07-25 14:12:26 +01:00
martin-frbg 7976deff80 Fix file permissions (issue 4095) 2023-07-23 20:37:07 +02:00
Martin Kroeker 3d31191b0f
Work around Clang failing to disambiguate SVE intrinsics and add AppleClang crossbuild to MacOS/arm64 DYNAMIC_ARCH in AzureCI (#4140)
* Add AppleClang crossbuild to MacOS/arm64 DYNAMIC_ARCH

* add casts to disambiguate svwhilelt for clang
2023-07-14 11:06:48 +02:00
Martin Kroeker 72caceb324
Merge pull request #4009 from Mousius/sve-gemm
Use SVE kernel for SGEMM/DGEMM on Arm(R) Neoverse(TM) V1
2023-04-22 13:56:45 +02:00
Chris Sidebottom ec334e69dc Use SVE kernel for SGEMM/DGEMM on Arm(R) Neoverse(TM) V1
This re-spins #3869 with some additional copy unrolling which helps maintain SYRK performance.

After #3868, the SVE kernels represent a pretty good boost.

This re-uses ARMV8SVE as a base and I'm going to incrementally move everything to use ARMV8SVE in additional patches (as well as fix up anything that's not already in ARMV8SVE).
2023-04-17 17:38:42 +01:00
Martin Kroeker 44164e3a3d
revert "move alpha out of register 18" (out of PR scope, no SVE on Apple hw) 2023-04-17 14:23:13 +02:00
Martin Kroeker 8be68fa7f4
move declaration of sca to really keep the compiler from throwing it out (for now) 2023-04-15 12:02:39 +02:00
Martin Kroeker 3727672a74
Improve workaround and keep compilers from optimizing it out 2023-04-13 18:07:52 +02:00
Martin Kroeker 108a21e47a
Move ALPHA out of register 18 (reserved on OSX) 2023-04-13 18:05:14 +02:00
Martin Kroeker 0b1acb0ba3
Move ALPHA_I out of register 18 (reserved on OSX) 2023-04-13 18:03:35 +02:00