OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Martin Kroeker	8f8ef3492a	Add CSUM and ZSUM kernels (trivially derived from their existing ASUM counterparts)	2024-02-24 23:57:50 +01:00
Martin Kroeker	be5e18c6f9	Add kernel definitions for CSUM and ZSUM	2024-02-24 23:55:43 +01:00
gxw	969601a1dc	X86_64: Fixed bug in zscal. Fixed handling of NAN and INF arguments when inc is greater than 1.	2024-01-31 11:23:59 +08:00
Martin Kroeker	5f5b7c4f45	Merge pull request #4423 from martin-frbg/issue4422: Check compiler support for AVX512BF16 and base COL/SPR kernel choice on that	2024-01-12 16:30:50 +01:00
Martin Kroeker	995a990e24	Make AVX512 BFLOAT16 kernels conditional on compiler capability	2024-01-12 00:12:46 +01:00
Martin Kroeker	cf8b03ae8b	Use NAN rather than SNAN for portability	2024-01-07 23:09:57 +01:00
Martin Kroeker	def4996170	Fix handling of NAN and INF arguments	2024-01-07 15:29:42 +01:00
Martin Kroeker	f06b535566	Use C kernel for dgemv_t due to limitations of the old assembly one	2023-12-15 09:58:44 +01:00
Bart Oldeman	c34e2cf380	Use _mm_set1_epi{32,64x} to init mask in x86-64 [cz]asum for skylake kernels. This is the same method as used in [sd]asum. _mm_set1_epi64x was commented out for zasum, but has the advantage of avoiding possible undefined behaviour (using an uninitialized variable), optimized out by NVHPC and icx. The new code works fine with those compilers. For GCC 12.3 the generated code is identical; no matter what method you use, the compiler optimizes the code into a compile-time constant, so there is no performance benefit using _mm_cmpeq_epi8 since the corresponding instruction (VPCMPEQB) isn't actually generated!	2023-11-19 21:28:35 +00:00
Martin Kroeker	22aa401656	Temporarily disable the AVX512 CASUM/ZASUM microkernels for any version of NVIDIA HPC (#4327). Temporarily disable the C/ZASUM microkernels for any version of NVHPC	2023-11-19 00:04:31 +01:00
Bart Oldeman	f8ad5344c2	Fix casum fallback kernel. This kernel is only used on Skylake+ if the kernel with AVX512 intrinsics can't be used, but used the variable x1 incorrectly in the tail end of the loop, as it is still at the initial value instead of where x points to. This caused 55 "other error"s in the LAPACK tests (https://github.com/OpenMathLib/OpenBLAS/issues/4282). This change makes casum.c as similar as possible to zasum.c, because zasum.c does this correctly.	2023-11-17 23:53:56 +00:00
Martin Kroeker	9019bc4945	Use SkylakeX ?ASUM microkernel for Cooperlake/Sapphirerapids as well	2023-11-04 22:10:06 +01:00
Martin Kroeker	675cd551da	fix improper function prototypes (empty parentheses)	2023-09-30 12:56:38 +02:00
Martin Kroeker	2c3034ff7f	Disable the C/ZASUM AVX512 microkernels when compiling with LLVM17 as well	2023-08-25 17:22:51 +02:00
Martin Kroeker	34da1a067d	Allow negative INCX (API change from version 3.10 of the reference implementation)	2023-08-10 17:01:50 +02:00
Martin Kroeker	4664b57e6e	use shortcut only when both incx and incy are zero	2023-08-04 12:25:34 +02:00
Martin Kroeker	6a428b5629	Update casum_microk_skylakex-2.c	2023-07-29 12:24:30 +02:00
Martin Kroeker	ebb447e32e	Update zasum_microk_skylakex-2.c	2023-07-29 12:23:57 +02:00
Martin Kroeker	9f6847583a	nvc currently miscompiles this, hopefully fixed in release 23.09	2023-07-29 11:50:16 +02:00
Martin Kroeker	fe54ee3d15	nvc currently miscompiles this, hopefully fixed in release 23.09	2023-07-29 11:48:38 +02:00
Martin Kroeker	2a62d2df96	Enable use of AVX512 microkernels with NVIDIA HPC from version 22.3	2023-07-26 19:39:11 +02:00
Honglin Zhu	a76afdc047	Compatible with older version of GNU make	2023-05-20 13:58:23 +08:00
Honglin Zhu	0b83088887	spr dynamic arch support	2023-05-19 10:48:18 +08:00
Honglin Zhu	f249ccb741	Fix spr sbgemm error	2023-05-19 10:48:18 +08:00
Martin Kroeker	84bcf6639f	Disable gcc's tree-vectorizer pass on all operating systems	2023-04-20 23:24:52 +02:00
Martin Kroeker	c9174ae8d7	Disable gcc's tree-vectorizer pass on all operating systems	2023-04-19 23:45:44 +02:00
Martin Kroeker	c2fe9cb91f	Disable gcc's tree-vectorizer pass on all operating systems	2023-04-19 23:45:14 +02:00
Martin Kroeker	66b39b835c	Disable gcc's tree-vectorizer pass on all operating systems	2023-04-19 23:44:45 +02:00
Martin Kroeker	bb6d6735bf	Disable gcc's tree-vectorizer pass on all operating systems	2023-04-19 23:44:15 +02:00
Martin Kroeker	d18efaed20	Disable gcc's tree-vectorizer pass on all operating systems	2023-04-19 23:43:43 +02:00
Martin Kroeker	99f6d31ed5	Disable gcc's tree-vectorizer pass on all operating systems	2023-04-19 23:42:55 +02:00
Martin Kroeker	7de9335c56	Disable gcc's tree-vectorizer pass on all operating systems	2023-04-19 23:42:09 +02:00
Bart Oldeman	60e49b851c	Fix typo in clobber list, should be xmm14 instead of ymm14.	2022-12-06 16:30:46 -05:00
Bart Oldeman	4afe1439a1	Fix skylake fallback kernel name for old compilers.	2022-12-06 16:09:54 -05:00
Bart Oldeman	5ceca1a4d8	Add sscal.c + microkernels for Haswell, Zen, Skylake and newer. Unlike [dcz]scal, sscal still used the original GotoBLAS SSE code from scal_sse.S. This code follows dscal as closely as possible, except for the inc_x > 1 code for which a plain C loop is used much like the one in cscal.c, instead of an adaptation of the SSE2 asm code of dscal.c (I tried but the performance wasn't better than the plain C loop).	2022-12-06 14:05:49 -05:00
Bart Oldeman	5c3169ecd8	dscal: use ymm registers in Haswell microkernel. Using 256-bit registers in dscal makes this microkernel consistent with cscal and zscal, and generally doubles performance if the vector fits in L1 cache.	2022-12-01 07:48:05 -05:00
Martin Kroeker	f73cfb7e2c	change line endings from CRLF to LF	2022-11-17 09:39:56 +01:00
Martin Kroeker	1688c7da43	change line endings from CRLF to LF	2022-11-16 22:24:01 +01:00
Bart Oldeman	6c1043eb41	Add [cz]scal microkernels for SKYLAKEX. These are as similar to dscal_microk_skylakex-2.c as possible for consistency. Note that before this change SKYLAKEX+ uses generic C functions for cscal/zscal via commit `2271c350` from #2610 (which is masked by commit `086d87a30`). However now #3799 disables FMAs (in turn enabled by `-march=skylake-avx512`) in the plain C code which fixes excessive LAPACK test failures more nicely.	2022-11-09 08:57:03 -05:00
Bart Oldeman	e7e3aa2948	x86_64: prevent GCC and Clang from generating FMAs in cscal/zscal. If e.g. -march=haswell is set in CFLAGS, GCC generates FMAs by default, which is inconsistent with the microkernels, none of which use FMAs. These inconsistencies cause a few failures in the LAPACK testcases, where eigenvalue results with/without eigenvectors are compared. Moreover using FMAs for multiplication of complex numbers can give surprising results, see `22aa81f` for more information. This uses the same syntax as used in `22aa81f` for zarch (s390x).	2022-10-27 18:16:43 -04:00
Martin Kroeker	101a2c77c3	Fix warnings	2022-09-15 09:19:19 +02:00
Martin Kroeker	739c3c44a7	Work around windows/osx gcc12 x86_64 tree-optimizer problem and add an osx/gcc12 build to Azure CI (#3745). Add pragma to disable the gcc tree-optimizer for some x86_64 S and Z kernels with gcc12 on OSX or Windows	2022-09-03 15:01:22 +02:00
Martin Kroeker	dc49edd4e6	Revert "roll back DGEMM kernel ... for DYNAMIC_ARCH"	2022-05-20 11:23:30 +02:00
Caroline Newcombe	5cc1111383	fix unsafe read of Y in assembly kernel	2022-03-11 11:56:33 -06:00
Wangyang Guo	225683218c	Small Matrix: use proper inline asm input constraint for AVX512 mask	2022-02-28 03:22:31 +00:00
Martin Kroeker	9c626e466e	really fix definition of SHUFFLE_MAGIC_NO	2022-02-25 15:36:02 +01:00
Martin Kroeker	9d7429406f	Declare SHUFFLE_MAGIC_NO as const to placate clang	2022-02-25 10:05:36 +01:00
Martin Kroeker	522f809825	Merge pull request #3542 from martin-frbg/issue3540: Fix compilation for CooperLake on Windows/clang	2022-02-24 00:00:00 +01:00
Mosè Giordano	abbc947edb	Fix compilation of Skylake AVX512 kernels with GCC 6	2022-02-23 22:51:59 +00:00
Martin Kroeker	c62f8e2c01	Prevent compiler attempts to use k0 as mask register	2022-02-23 20:12:20 +01:00

778 Commits