OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Chris Sidebottom	60e66725e4	Use numeric labels to allow repeated inlining	2023-12-19 13:11:06 +00:00
Chris Sidebottom	7a4fef4f60	Tweak SVE dot kernel. This changes the SVE dot kernel to predicate only when necessary and streamlines the assembly a bit. The benchmarks seem to indicate this can improve performance by ~33%.	2023-12-19 12:08:54 +00:00
Martin Kroeker	f06b535566	Use C kernel for dgemv_t due to limitations of the old assembly one	2023-12-15 09:58:44 +01:00
barracuda156	d9653af018	KERNEL.PPC970, KERNEL.PPCG4: unbreak CMake parsing. Fixes: https://github.com/OpenMathLib/OpenBLAS/issues/4366	2023-12-14 12:00:11 +08:00
Chip-Kerchner	93747fb377	Merge remote-tracking branch 'origin/develop' into power10Copies	2023-12-12 09:32:49 -06:00
Chip-Kerchner	4e738e561a	Replace two vector loads with one vector pair load and fix endianness of stores.	2023-12-08 12:36:08 -06:00
yancheng	d32f38fb37	loongarch64: Add optimizations for nrm2.	2023-12-07 14:36:26 +08:00
yancheng	f9b468990e	loongarch64: Add optimizations for rot.	2023-12-07 14:36:26 +08:00
yancheng	c80e7e27d1	loongarch64: Add optimizations for sum and asum.	2023-12-07 14:36:26 +08:00
yancheng	d4c96a35a8	loongarch64: Add optimizations for axpy and axpby.	2023-12-07 14:36:26 +08:00
yancheng	360acc0a41	loongarch64: Add optimizations for swap.	2023-12-07 14:36:26 +08:00
yancheng	174c25766b	loongarch64: Add optimizations for copy.	2023-12-07 14:36:26 +08:00
yancheng	49829b2b7d	loongarch64: Add optimizations for iamin.	2023-12-07 14:36:07 +08:00
yancheng	be83f5e4e0	loongarch64: Add optimizations for iamax.	2023-12-07 14:36:07 +08:00
yancheng	e3fb2b5afa	loongarch64: Add optimizations for imin.	2023-12-07 14:36:07 +08:00
yancheng	e46b48e372	loongarch64: Add optimizations for imax.	2023-12-07 14:36:07 +08:00
yancheng	702fc1d56d	loongarch64: Add optimization for min.	2023-12-07 14:36:07 +08:00
yancheng	346b384d1c	loongarch64: Add optimization for max.	2023-12-07 14:36:07 +08:00
yancheng	ff2ecc6cda	loongarch64: Add optimization for amin.	2023-12-07 14:36:07 +08:00
yancheng	265b5f2e80	loongarch64: Add optimizations for amax.	2023-12-07 14:36:07 +08:00
yancheng	993ede7c70	loongarch64: Add optimizations for scal.	2023-12-07 14:36:07 +08:00
Martin Kroeker	39bf8ece20	Merge pull request #4340 from yinshiyou/la-dev Add some refinements and optimizations for LoongArch.	2023-11-29 08:22:25 +01:00
Shiyou Yin	9fe07d82fd	loongarch: Add LSX optimization for dot.	2023-11-28 20:24:18 +08:00
Shiyou Yin	13b8c44b44	loongarch: Add optimization for dsdot kernel.	2023-11-28 20:24:16 +08:00
Shiyou Yin	3def6a8143	loongarch: Add LASX optimization for dot.	2023-11-28 20:24:14 +08:00
Bart Oldeman	c34e2cf380	Use _mm_set1_epi{32,64x} to init mask in x86-64 [cz]asum for skylake kernels. This is the same method as used in [sd]asum. _mm_set1_epi64x was commented out for zasum, but has the advantage of avoiding possible undefined behaviour (using an uninitialized variable), optimized out by NVHPC and icx. The new code works fine with those compilers. For GCC 12.3 the generated code is identical; no matter what method you use, the compiler optimizes the code into a compile-time constant, so there is no performance benefit in using _mm_cmpeq_epi8 since the corresponding instruction (VPCMPEQB) isn't actually generated!	2023-11-19 21:28:35 +00:00
Martin Kroeker	22aa401656	Temporarily disable the AVX512 CASUM/ZASUM microkernels for any version of NVIDIA HPC (#4327)	2023-11-19 00:04:31 +01:00
Bart Oldeman	f8ad5344c2	Fix casum fallback kernel. This kernel is only used on Skylake+ if the kernel with AVX512 intrinsics can't be used, but it used the variable x1 incorrectly in the tail end of the loop, as it is still at the initial value instead of where x points to. This caused 55 "other error"s in the LAPACK tests (https://github.com/OpenMathLib/OpenBLAS/issues/4282). This change makes casum.c as similar as possible to zasum.c, because zasum.c does this correctly.	2023-11-17 23:53:56 +00:00
Martin Kroeker	04bc801999	(Re)apply fixes for supporting only a subset of precision types from PR 3915	2023-11-04 23:48:59 +01:00
Martin Kroeker	9019bc4945	Use SkylakeX ?ASUM microkernel for Cooperlake/Sapphirerapids as well	2023-11-04 22:10:06 +01:00
Martin Kroeker	3bfa4d4dcc	Fix outdated SVE kernel definitions for Cortex cpus by aliasing to ARMV8SVE	2023-11-03 14:55:31 +01:00
Rajalakshmi Srinivasaraghavan	980f702f72	POWER: AIX: Make use of power10 optimization. POWER10 optimizations are disabled when using the default AIX assembler. As many issues have been fixed recently, this enables the optimization path for the default assembler.	2023-10-19 18:48:19 -05:00
Rajalakshmi Srinivasaraghavan	9f42570e33	POWER: Increase macro size limit for AIX. This patch increases the macro size limit from 4096 to 16384 to allow compiling larger assembly files in AIX. Tested with GCC and IBM Open XL C.	2023-10-12 12:37:40 -05:00
Martin Kroeker	9f49aef91b	Merge pull request #4255 from RajalakshmiSR/AIX-P10 POWER10: Fix compilation issues with Open XL C	2023-10-12 18:59:17 +02:00
Martin Kroeker	e7d05402e0	Fix up S/D GEMM copy function definitions after #4009	2023-10-12 14:24:53 +02:00
Rajalakshmi Srinivasaraghavan	71d733e5f7	POWER: Avoid m4 conversions for C files. This patch removes the intermediate m4 conversions used in sbgemm compilation, as they are not needed for .c files. Tested on AIX with gcc and IBM Open XL C.	2023-10-11 17:18:42 -05:00
Rajalakshmi Srinivasaraghavan	82fc29a57a	POWER10: Fall back to POWER8 functions. As the cgemm and zgemm kernels are not optimized for big endian, fall back to the POWER8 versions. Tested on AIX using gcc and Open XL C.	2023-10-11 17:04:42 -05:00
Rajalakshmi Srinivasaraghavan	db0805906b	powerpc: Fix build errors with Open XL C. This patch fixes errors when using the Open XL C compiler on AIX. Tested with gcc/xlf and ibm-clang/xlf compiler combinations.	2023-10-04 14:04:03 -05:00
Martin Kroeker	675cd551da	fix improper function prototypes (empty parentheses)	2023-09-30 12:56:38 +02:00
gxw	d15e0a055c	LoongArch64: Fixed compilation issues when enabling DYNAMIC_ARCH	2023-09-27 10:05:27 +08:00
gxw	4670eb1462	LoongArch64: Add dtrsm kernel	2023-09-26 15:45:14 +08:00
gxw	f2cf929374	LoongArch64: Add sgemv kernel	2023-09-04 14:28:37 +08:00
Martin Kroeker	8e6d93359d	Merge pull request #4196 from TiborGY/obsolete_inlines Modernize obsolete inline order	2023-09-03 14:12:42 +02:00
gxw	394a1fd1bf	LoongArch64: Compatible with early internal toolchain. __loongarch_grlen and __loongarch_frlen were introduced in gcc version 8.3.0 (Loongnix 8.3.0-6.lnd.vec.31) internally within Loongson to standardize the general and floating-point register widths. However, previous versions did not have them, requiring additional checks to be added.	2023-08-31 16:55:29 +08:00
Martin Kroeker	9c4ae4d4fb	Merge pull request #4206 from martin-frbg/issue4201-2 Work around miscompilation of zdot_thunderx2t99 by the current NVIDIA HPC compiler	2023-08-26 10:17:27 +02:00
Martin Kroeker	88435104c8	Merge pull request #4204 from martin-frbg/llvm17-2 Work around LLVM17 miscompiling the AVX512 microkernels for CASUM/ZASUM	2023-08-26 00:32:18 +02:00
Martin Kroeker	fc8894dd98	Work around miscompilation by NVIDIA nvc	2023-08-26 00:30:17 +02:00
Martin Kroeker	7a6203ffa1	restore default Neoverse SVE build instructions for non-NVIDIA compilers	2023-08-25 18:25:51 +02:00
Martin Kroeker	2c3034ff7f	Disable the C/ZASUM AVX512 microkernels when compiling with LLVM17 as well	2023-08-25 17:22:51 +02:00
Martin Kroeker	8794544b43	Add support for compiling the Neoverse SVE kernels with the NVIDIA HPC compiler	2023-08-25 16:47:32 +02:00

2061 Commits