OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Martin Kroeker	e9a8d5b45f	Merge pull request #4015 from martin-frbg/issue4013-2 [WIP] Disable gcc's tree-vectorizer for x86_64 CGEMV	2023-04-23 18:51:12 +02:00
Martin Kroeker	72caceb324	Merge pull request #4009 from Mousius/sve-gemm Use SVE kernel for SGEMM/DGEMM on Arm(R) Neoverse(TM) V1	2023-04-22 13:56:45 +02:00
Martin Kroeker	84bcf6639f	Disable gcc's tree-vectorizer pass on all operating systems	2023-04-20 23:24:52 +02:00
Martin Kroeker	c9174ae8d7	Disable gcc's tree-vectorizer pass on all operating systems	2023-04-19 23:45:44 +02:00
Martin Kroeker	c2fe9cb91f	Disable gcc's tree-vectorizer pass on all operating systems	2023-04-19 23:45:14 +02:00
Martin Kroeker	66b39b835c	Disable gcc's tree-vectorizer pass on all operating systems	2023-04-19 23:44:45 +02:00
Martin Kroeker	bb6d6735bf	Disable gcc's tree-vectorizer pass on all operating systems	2023-04-19 23:44:15 +02:00
Martin Kroeker	d18efaed20	Disable gcc's tree-vectorizer pass on all operating systems	2023-04-19 23:43:43 +02:00
Martin Kroeker	99f6d31ed5	Disable gcc's tree-vectorizer pass on all operating systems	2023-04-19 23:42:55 +02:00
Martin Kroeker	7de9335c56	Disable gcc's tree-vectorizer pass on all operating systems	2023-04-19 23:42:09 +02:00
Martin Kroeker	437c0bf2b4	Merge pull request #3843 from Mousius/switch-ratio Propagate SWITCH_RATIO to DYNAMIC_ARCH builds	2023-04-19 11:51:54 +02:00
Chris Sidebottom	ec334e69dc	Use SVE kernel for SGEMM/DGEMM on Arm(R) Neoverse(TM) V1 This re-spins #3869 with some additional copy unrolling which helps maintain SYRK performance. After #3868, the SVE kernels represent a pretty good boost. This re-uses ARMV8SVE as a base and I'm going to incrementally move everything to use ARMV8SVE in additional patches (as well as fix up anything that's not already in ARMV8SVE).	2023-04-17 17:38:42 +01:00
Chris Sidebottom	32f2fafde7	Propagate SWITCH_RATIO to DYNAMIC_ARCH builds Previously dynamic builds were either using the default SWITCH_RATIO or one from the higher level architecture; this patch ensures the dynamic builds can use this parameter as well.	2023-04-17 15:34:12 +01:00
Martin Kroeker	44164e3a3d	revert "move alpha out of register 18" (out of PR scope, no SVE on Apple hw)	2023-04-17 14:23:13 +02:00
Martin Kroeker	8be68fa7f4	move declaration of sca to really keep the compiler from throwing it out (for now)	2023-04-15 12:02:39 +02:00
Martin Kroeker	3727672a74	Improve workaround and keep compilers from optimizing it out	2023-04-13 18:07:52 +02:00
Martin Kroeker	108a21e47a	Move ALPHA out of register 18 (reserved on OSX)	2023-04-13 18:05:14 +02:00
Martin Kroeker	0b1acb0ba3	Move ALPHA_I out of register 18 (reserved on OSX)	2023-04-13 18:03:35 +02:00
Martin Kroeker	c7bbad09ad	Move ALPHA_I out of register 18 (reserved on OSX)	2023-04-13 18:00:47 +02:00
Martin Kroeker	cda29633a3	move ALPHA_I out of register 18 (reserved on OSX)	2023-04-13 17:59:48 +02:00
Martin Kroeker	09ace3cf23	Merge pull request #3846 from lilh9598/sbgemm_opt Improve the performance of sbgemm_tcopy on neoversen2	2023-03-26 19:04:57 +02:00
Sergei Lewis	cb0a70e0e2	dot.c early bail fix	2023-03-02 09:51:10 +00:00
Martin Kroeker	38d6fb4225	Fix dependencies in builds with specified subsets of precision types	2023-02-23 23:12:06 +01:00
Martin Kroeker	e412bee313	fix GEMM kernel dependencies in builds that use only a subset of precisions	2023-02-22 00:37:14 +01:00
Martin Kroeker	d80adf253e	make SSYMV available to BUILD_DOUBLE-only builds	2023-02-22 00:30:20 +01:00
Martin Kroeker	5481c328e8	fix DYNAMIC_ARCH builds that use only a subset of precisions	2023-02-22 00:28:25 +01:00
Martin Kroeker	5a9cd87794	Merge pull request #3868 from Mousius/sve-prefetch Remove prefetches from SVE kernels	2022-12-24 10:52:29 +01:00
Chris Sidebottom	1361229291	Remove prefetches from SVE kernels This is a precursor to enabling the SVE kernels for Arm(R) Neoverse(TM) V1 which has 256-bit SVE. Testing revealed that the SVE kernel was actually worse in some cases than the existing kernel which seemed odd - removing these prefetches the underlying architecture seems to do a better job 😸	2022-12-16 14:43:09 +00:00
Bart Oldeman	60e49b851c	Fix typo in clobber list, should be xmm14 instead of ymm14.	2022-12-06 16:30:46 -05:00
Bart Oldeman	4afe1439a1	Fix skylake fallback kernel name for old compilers.	2022-12-06 16:09:54 -05:00
Bart Oldeman	5ceca1a4d8	Add sscal.c + microkernels for Haswell, Zen, Skylake and newer. Unlike [dcz]scal, sscal still used the original GotoBLAS SSE code from scal_sse.S. This code follows dscal as closely as possible, except for the inc_x > 1 code for which a plain C loop is used much like the one in cscal.c, instead of an adaptation of the SSE2 asm code of dscal.c (I tried but the performance wasn't better than the plain C loop).	2022-12-06 14:05:49 -05:00
lilianhuang	729af6406f	bugfix for sbgemm_ncopy_8_neoversen2	2022-12-05 05:10:18 -05:00
Martin Kroeker	042e3c0e7c	Merge pull request #3848 from bartoldeman/dscal-haswell-ymm dscal: use ymm registers in Haswell microkernel	2022-12-05 08:56:08 +01:00
Bart Oldeman	5c3169ecd8	dscal: use ymm registers in Haswell microkernel Using 256-bit registers in dscal makes this microkernel consistent with cscal and zscal, and generally doubles performance if the vector fits in L1 cache.	2022-12-01 07:48:05 -05:00
Chris Sidebottom	eea006a688	Wrap SVE header with __has_include check	2022-12-01 12:07:55 +00:00
Chris Sidebottom	fd4f52c797	Add SVE implementation for sdot/ddot This adds an SVE implementation to sdot/ddot when available, falling back to the previous Advanced SIMD kernel where there's no SVE implementation for the kernel. All the targets were essentially treating `dot_thunderx2t99.c` as the Advanced SIMD implementation so I've renamed it to better fit with the feature detection.	2022-12-01 12:07:50 +00:00
lilianhuang	fdac8a97c1	Add sbgemm_ncopy_8 and sbgemm_tcopy_4	2022-11-29 04:46:14 -05:00
lilianhuang	135718eafc	Improve the performance of sbgemm_tcopy on neoversen2	2022-11-28 04:17:54 -05:00
Chris Sidebottom	4f7b77e08a	Remove unnecessary instructions from Advanced SIMD dot The existing kernel was issuing extra instructions to organise the arguments into the same registers they would usually be in and similarly to put the result into the appropriate register. This has an impact on smaller sized dots and seemed like a quick fix	2022-11-25 16:19:03 +00:00
Martin Kroeker	f73cfb7e2c	change line endings from CRLF to LF	2022-11-17 09:39:56 +01:00
Martin Kroeker	1688c7da43	change line endings from CRLF to LF	2022-11-16 22:24:01 +01:00
Bart Oldeman	6c1043eb41	Add [cz]scal microkernels for SKYLAKEX These are as similar to dscal_microk_skylakex-2.c as possible for consistency. Note that before this change SKYLAKEX+ uses generic C functions for cscal/zscal via commit `2271c350` from #2610 (which is masked by commit `086d87a30`). However now #3799 disables FMAs (in turn enabled by `-march=skylake-avx512`) in the plain C code which fixes excessive LAPACK test failures more nicely.	2022-11-09 08:57:03 -05:00
Martin Kroeker	c9d78dc3b2	Remove excess initializer (leftover from rework of PR 3793)	2022-10-31 16:57:03 +01:00
Martin Kroeker	65338a9493	Merge pull request #3799 from bartoldeman/cscal-zscal-no-fma x86_64: prevent GCC and Clang from generating FMAs in cscal/zscal.	2022-10-30 18:56:10 +01:00
Honglin Zhu	79066b6bf3	Change file name to match the norm and delete useless code.	2022-10-28 17:09:39 +08:00
Bart Oldeman	e7e3aa2948	x86_64: prevent GCC and Clang from generating FMAs in cscal/zscal. If e.g. -march=haswell is set in CFLAGS, GCC generates FMAs by default, which is inconsistent with the microkernels, none of which use FMAs. These inconsistencies cause a few failures in the LAPACK testcases, where eigenvalue results with/without eigenvectors are compared. Moreover using FMAs for multiplication of complex numbers can give surprising results, see `22aa81f` for more information. This uses the same syntax as used in `22aa81f` for zarch (s390x).	2022-10-27 18:16:43 -04:00
Honglin Zhu	4989e039a5	Define SBGEMM_ALIGN_K for DYNAMIC_ARCH build	2022-10-27 14:10:26 +08:00
Honglin Zhu	843e9fd0b9	Fix typo error	2022-10-26 17:06:33 +08:00
Honglin Zhu	b00d5b9746	New sbgemm implementation for Neoverse N2 1. Use UZP instructions but not gather load and scatter store instructions to get lower latency. 2. Padding k to a power of 4.	2022-10-26 15:09:41 +08:00
Martin Kroeker	f6f35a4288	fix copyobj declarations to work with DYNAMIC_ARCH	2022-09-29 08:47:14 +02:00

1957 Commits