OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Martin Kroeker	7cfd433d0c	revert the C/Z NRM2 kernels to the base NEON kernel as well	2024-04-12 15:34:04 +02:00
Martin Kroeker	441c81026e	Add support for Cortex-A76	2024-04-02 19:41:44 +02:00
Martin Kroeker	9ead81bd39	Revert S/DNRM2 to the base NEON kernel to fix precision loss	2024-04-02 15:59:20 +02:00
Martin Kroeker	552c521353	remove another early exit for incx < 0	2024-03-12 18:49:27 +01:00
Martin Kroeker	ed532dc75b	remove another early exit for incx < 0	2024-03-12 18:47:00 +01:00
Martin Kroeker	e41d01bad9	remove early exit on negative inc_x	2024-03-11 22:53:54 +01:00
Martin Kroeker	02a025f9c1	remove early exit on negative inc_x	2024-03-11 22:52:18 +01:00
Martin Kroeker	7d506984fa	fix assignment of default CSUM kernel	2024-02-25 17:57:11 +01:00
Martin Kroeker	12787775d9	add csum/zsum kernels (trivially derived from the asum ones)	2024-02-25 17:55:36 +01:00
Martin Kroeker	c9df62e883	Fix handling of NAN	2024-01-07 17:49:40 +01:00
Chris Sidebottom	ecae1389df	Reduce duplication in kernel definitions These files are exactly the same, so I believe we can reduce these files down. Other files require a slightly more complex unpicking.	2023-12-23 12:39:53 +00:00
Chris Sidebottom	60e66725e4	Use numeric labels to allow repeated inlining	2023-12-19 13:11:06 +00:00
Chris Sidebottom	7a4fef4f60	Tweak SVE dot kernel This changes the SVE dot kernel to only predicate when necessary as well as streamlining the assembly a bit. The benchmarks seem to indicate this can improve performance by ~33%.	2023-12-19 12:08:54 +00:00
Martin Kroeker	3bfa4d4dcc	Fix outdated SVE kernel definitions for Cortex cpus by aliasing to ARMV8SVE	2023-11-03 14:55:31 +01:00
Martin Kroeker	e7d05402e0	Fix up S/D GEMM copy function definitions after #4009	2023-10-12 14:24:53 +02:00
Martin Kroeker	fc8894dd98	Workaround miscompilation by NVIDIA nvc	2023-08-26 00:30:17 +02:00
Martin Kroeker	5720fa02c5	Merge pull request #4168 from Mousius/sve-zgemm-cgemm Use SVE zgemm/cgemm on Arm(R) Neoverse(TM) V1 core	2023-07-27 17:41:45 +02:00
Chris Sidebottom	84a268b6ca	Use SVE zgemm/cgemm on Arm(R) Neoverse(TM) V1 core This patch removes the prefetches from cgemm/zgemm which improves the performance similar to sgemm/dgemm did in #3868, this means I'm happy to enable this on any applicable cores. I also replicated the unrolling the copies from sgemm and dgemm.	2023-07-27 14:12:20 +01:00
Chris Sidebottom	730ca04b48	Fix ZHEMM copy for SVE Whilst disambiguating whilelt, I inadvertently used the wrong datatype for offsets, which can be negative. This rectifies that.	2023-07-27 13:27:28 +01:00
Martin Kroeker	849c8806b8	Merge pull request #4161 from Mousius/non-sve-kernels Use latest non-SVE kernels in ARMV8SVE	2023-07-26 15:49:40 +02:00
Chris Sidebottom	24586bc4ff	Disambiguate whilelt	2023-07-25 20:15:44 +01:00
Chris Sidebottom	aea2a4622b	Use latest non-SVE kernels in ARMV8SVE These are generally better and, in some cases, include threading which helps in the cores we're targeting here.	2023-07-25 14:12:26 +01:00
martin-frbg	7976deff80	Fix file permissions (issue 4095)	2023-07-23 20:37:07 +02:00
Martin Kroeker	3d31191b0f	Work around Clang failing to disambiguate SVE intrinsics and add AppleClang crossbuild to MacOS/arm64 DYNAMIC_ARCH in AzureCI (#4140) * Add AppleClang crossbuild to MacOS/arm64 DYNAMIC_ARCH * add casts to disambiguate svwhilelt for clang	2023-07-14 11:06:48 +02:00
Martin Kroeker	72caceb324	Merge pull request #4009 from Mousius/sve-gemm Use SVE kernel for SGEMM/DGEMM on Arm(R) Neoverse(TM) V1	2023-04-22 13:56:45 +02:00
Chris Sidebottom	ec334e69dc	Use SVE kernel for SGEMM/DGEMM on Arm(R) Neoverse(TM) V1 This re-spins #3869 with some additional copy unrolling which helps maintain SYRK performance. After #3868, the SVE kernels represent a pretty good boost. This re-uses ARMV8SVE as a base and I'm going to incrementally move everything to use ARMV8SVE in additional patches (as well as fix up anything that's not already in ARMV8SVE).	2023-04-17 17:38:42 +01:00
Martin Kroeker	44164e3a3d	revert "move alpha out of register 18" (out of PR scope, no SVE on Apple hw)	2023-04-17 14:23:13 +02:00
Martin Kroeker	8be68fa7f4	move declaration of sca to really keep the compiler from throwing it out (for now)	2023-04-15 12:02:39 +02:00
Martin Kroeker	3727672a74	Improve workaround and keep compilers from optimizing it out	2023-04-13 18:07:52 +02:00
Martin Kroeker	108a21e47a	Move ALPHA out of register 18 (reserved on OSX)	2023-04-13 18:05:14 +02:00
Martin Kroeker	0b1acb0ba3	Move ALPHA_I out of register 18 (reserved on OSX)	2023-04-13 18:03:35 +02:00
Martin Kroeker	c7bbad09ad	Move ALPHA_I out of register 18 (reserved on OSX)	2023-04-13 18:00:47 +02:00
Martin Kroeker	cda29633a3	move ALPHA_I out of register 18 (reserved on OSX)	2023-04-13 17:59:48 +02:00
Martin Kroeker	09ace3cf23	Merge pull request #3846 from lilh9598/sbgemm_opt Improve the performance of sbgemm_tcopy on neoversen2	2023-03-26 19:04:57 +02:00
Chris Sidebottom	1361229291	Remove prefetches from SVE kernels This is a precursor to enabling the SVE kernels for Arm(R) Neoverse(TM) V1 which has 256-bit SVE. Testing revealed that the SVE kernel was actually worse in some cases than the existing kernel, which seemed odd; with these prefetches removed, the underlying architecture seems to do a better job 😸	2022-12-16 14:43:09 +00:00
lilianhuang	729af6406f	bugfix for sbgemm_ncopy_8_neoversen2	2022-12-05 05:10:18 -05:00
Chris Sidebottom	eea006a688	Wrap SVE header with __has_include check	2022-12-01 12:07:55 +00:00
Chris Sidebottom	fd4f52c797	Add SVE implementation for sdot/ddot This adds an SVE implementation to sdot/ddot when available, falling back to the previous Advanced SIMD kernel where there's no SVE implementation for the kernel. All the targets were essentially treating `dot_thunderx2t99.c` as the Advanced SIMD implementation so I've renamed it to better fit with the feature detection.	2022-12-01 12:07:50 +00:00
lilianhuang	fdac8a97c1	Add sbgemm_ncopy_8 and sbgemm_tcopy_4	2022-11-29 04:46:14 -05:00
lilianhuang	135718eafc	Improve the performance of sbgemm_tcopy on neoversen2	2022-11-28 04:17:54 -05:00
Chris Sidebottom	4f7b77e08a	Remove unnecessary instructions from Advanced SIMD dot The existing kernel was issuing extra instructions to organise the arguments into the same registers they would usually be in and similarly to put the result into the appropriate register. This has an impact on smaller sized dots and seemed like a quick fix	2022-11-25 16:19:03 +00:00
Martin Kroeker	1688c7da43	change line endings from CRLF to LF	2022-11-16 22:24:01 +01:00
Honglin Zhu	79066b6bf3	Change file name to match the norm and delete useless code.	2022-10-28 17:09:39 +08:00
Honglin Zhu	b00d5b9746	New sbgemm implementation for Neoverse N2 1. Use UZP instructions but not gather load and scatter store instructions to get lower latency. 2. Padding k to a power of 4.	2022-10-26 15:09:41 +08:00
Martin Kroeker	e12d474780	Eliminate uses of CREAL on left-hand side of assignments	2022-07-05 00:01:09 +02:00
Martin Kroeker	9e29598575	workaround fault with ssq=inf,scale=0	2022-07-02 23:47:17 +02:00
Honglin Zhu	123e0dfb62	Neoverse N2 sbgemm: 1. Modify the algorithm to resolve multithreading failures 2. No memory allocation in sbgemm kernel 3. Optimize when alpha == 1.0f	2022-06-29 10:14:21 +08:00
Honglin Zhu	bc3728475f	format code	2022-06-29 10:14:21 +08:00
Honglin Zhu	55d686d41e	neoverse n2 sbgemm: implement ncopy tcopy kernel_8x4	2022-06-29 10:14:21 +08:00
Honglin Zhu	04593bb27c	neoverse n2 sbgemm: init file	2022-06-29 10:14:21 +08:00

245 Commits