Chris Sidebottom
							
						 
						
							 
							
							
							
							
								
							
							
								ecae1389df 
								
							 
						 
						
							
							
								
								Reduce duplication in kernel definitions  
							
							 
							
							
							
							
							These files are exactly the same, so I believe they can be consolidated.
Other files require slightly more complex unpicking. 
							
						 
						
							2023-12-23 12:39:53 +00:00  
						
					 
				
					
						
							
							
								 
								Chris Sidebottom
							
						 
						
							 
							
							
							
							
								
							
							
								60e66725e4 
								
							 
						 
						
							
							
								
								Use numeric labels to allow repeated inlining  
							
							 
							
							
							
						 
						
							2023-12-19 13:11:06 +00:00  
						
					 
				
					
						
							
							
								 
								Chris Sidebottom
							
						 
						
							 
							
							
							
							
								
							
							
								7a4fef4f60 
								
							 
						 
						
							
							
								
								Tweak SVE dot kernel  
							
							 
							
							
							
							
							This changes the SVE dot kernel to predicate only when necessary and
streamlines the assembly a bit. The benchmarks seem to indicate this
can improve performance by ~33%. 
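
The idea of predicating only when necessary can be modelled in plain C. This is a sketch under stated assumptions, not the kernel's assembly: `VL` stands in for the hardware vector length, `dot_model` is a hypothetical name, and the `active` flag plays the role of an SVE predicate bit. Full blocks of `VL` elements run without any mask; only the final partial block pays for one.

```c
#include <assert.h>
#include <stddef.h>

#define VL 4  /* assumed vector length in elements; real SVE hardware decides this */

/* Hypothetical model of a dot product that masks only the tail block. */
double dot_model(const double *x, const double *y, size_t n)
{
    assert((x != NULL && y != NULL) || n == 0);

    double acc[VL] = {0.0};
    size_t i = 0;

    /* Main loop: every lane is known to be in range, so no predicate is needed. */
    for (; i + VL <= n; i += VL)
        for (size_t lane = 0; lane < VL; lane++)
            acc[lane] += x[i + lane] * y[i + lane];

    /* Tail: compute the predicate only for the last, partial block. */
    if (i < n) {
        for (size_t lane = 0; lane < VL; lane++) {
            int active = (i + lane) < n;  /* stands in for an SVE predicate bit */
            if (active)
                acc[lane] += x[i + lane] * y[i + lane];
        }
    }

    /* Horizontal reduction of the per-lane partial sums. */
    double sum = 0.0;
    for (size_t lane = 0; lane < VL; lane++)
        sum += acc[lane];
    return sum;
}
```

Whether this split pays off on real hardware depends on the vector length and the input sizes; the ~33% figure above comes from the commit author's benchmarks, not from this model.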
							
						 
						
							2023-12-19 12:08:54 +00:00  
						
					 
				
					
						
							
							
								 
								Martin Kroeker
							
						 
						
							 
							
							
								
								
							
							
							
								
							
							
								f06b535566 
								
							 
						 
						
							
							
								
								Use C kernel for dgemv_t due to limitations of the old assembly one  
							
							 
							
							
							
						 
						
							2023-12-15 09:58:44 +01:00  
						
					 
				
					
						
							
							
								 
								barracuda156
							
						 
						
							 
							
							
							
							
								
							
							
								d9653af018 
								
							 
						 
						
							
							
								
								KERNEL.PPC970, KERNEL.PPCG4: unbreak CMake parsing  
							
							 
							
							
							
							
							Fixes: https://github.com/OpenMathLib/OpenBLAS/issues/4366  
							
						 
						
							2023-12-14 12:00:11 +08:00  
						
					 
				
					
						
							
							
								 
								Chip-Kerchner
							
						 
						
							 
							
							
							
							
								
							
							
								93747fb377 
								
							 
						 
						
							
							
								
								Merge remote-tracking branch 'origin/develop' into power10Copies  
							
							 
							
							
							
						 
						
							2023-12-12 09:32:49 -06:00  
						
					 
				
					
						
							
							
								 
								Chip-Kerchner
							
						 
						
							 
							
							
							
							
								
							
							
								4e738e561a 
								
							 
						 
						
							
							
								
								Replace two vector loads with one vector pair load and fix endianness of stores.  
							
							 
							
							
							
						 
						
							2023-12-08 12:36:08 -06:00  
						
					 
				
					
						
							
							
								 
								yancheng
							
						 
						
							 
							
							
							
							
								
							
							
								d32f38fb37 
								
							 
						 
						
							
							
								
								loongarch64: Add optimizations for nrm2.  
							
							 
							
							
							
						 
						
							2023-12-07 14:36:26 +08:00  
						
					 
				
					
						
							
							
								 
								yancheng
							
						 
						
							 
							
							
							
							
								
							
							
								f9b468990e 
								
							 
						 
						
							
							
								
								loongarch64: Add optimizations for rot.  
							
							 
							
							
							
						 
						
							2023-12-07 14:36:26 +08:00  
						
					 
				
					
						
							
							
								 
								yancheng
							
						 
						
							 
							
							
							
							
								
							
							
								c80e7e27d1 
								
							 
						 
						
							
							
								
								loongarch64: Add optimizations for sum and asum.  
							
							 
							
							
							
						 
						
							2023-12-07 14:36:26 +08:00  
						
					 
				
					
						
							
							
								 
								yancheng
							
						 
						
							 
							
							
							
							
								
							
							
								d4c96a35a8 
								
							 
						 
						
							
							
								
								loongarch64: Add optimizations for axpy and axpby.  
							
							 
							
							
							
						 
						
							2023-12-07 14:36:26 +08:00  
						
					 
				
					
						
							
							
								 
								yancheng
							
						 
						
							 
							
							
							
							
								
							
							
								360acc0a41 
								
							 
						 
						
							
							
								
								loongarch64: Add optimizations for swap.  
							
							 
							
							
							
						 
						
							2023-12-07 14:36:26 +08:00  
						
					 
				
					
						
							
							
								 
								yancheng
							
						 
						
							 
							
							
							
							
								
							
							
								174c25766b 
								
							 
						 
						
							
							
								
								loongarch64: Add optimizations for copy.  
							
							 
							
							
							
						 
						
							2023-12-07 14:36:26 +08:00  
						
					 
				
					
						
							
							
								 
								yancheng
							
						 
						
							 
							
							
							
							
								
							
							
								49829b2b7d 
								
							 
						 
						
							
							
								
								loongarch64: Add optimizations for iamin.  
							
							 
							
							
							
						 
						
							2023-12-07 14:36:07 +08:00  
						
					 
				
					
						
							
							
								 
								yancheng
							
						 
						
							 
							
							
							
							
								
							
							
								be83f5e4e0 
								
							 
						 
						
							
							
								
								loongarch64: Add optimizations for iamax.  
							
							 
							
							
							
						 
						
							2023-12-07 14:36:07 +08:00  
						
					 
				
					
						
							
							
								 
								yancheng
							
						 
						
							 
							
							
							
							
								
							
							
								e3fb2b5afa 
								
							 
						 
						
							
							
								
								loongarch64: Add optimizations for imin.  
							
							 
							
							
							
						 
						
							2023-12-07 14:36:07 +08:00  
						
					 
				
					
						
							
							
								 
								yancheng
							
						 
						
							 
							
							
							
							
								
							
							
								e46b48e372 
								
							 
						 
						
							
							
								
								loongarch64: Add optimizations for imax.  
							
							 
							
							
							
						 
						
							2023-12-07 14:36:07 +08:00  
						
					 
				
					
						
							
							
								 
								yancheng
							
						 
						
							 
							
							
							
							
								
							
							
								702fc1d56d 
								
							 
						 
						
							
							
								
								loongarch64: Add optimization for min.  
							
							 
							
							
							
						 
						
							2023-12-07 14:36:07 +08:00  
						
					 
				
					
						
							
							
								 
								yancheng
							
						 
						
							 
							
							
							
							
								
							
							
								346b384d1c 
								
							 
						 
						
							
							
								
								loongarch64: Add optimization for max.  
							
							 
							
							
							
						 
						
							2023-12-07 14:36:07 +08:00  
						
					 
				
					
						
							
							
								 
								yancheng
							
						 
						
							 
							
							
							
							
								
							
							
								ff2ecc6cda 
								
							 
						 
						
							
							
								
								loongarch64: Add optimization for amin.  
							
							 
							
							
							
						 
						
							2023-12-07 14:36:07 +08:00  
						
					 
				
					
						
							
							
								 
								yancheng
							
						 
						
							 
							
							
							
							
								
							
							
								265b5f2e80 
								
							 
						 
						
							
							
								
								loongarch64: Add optimizations for amax.  
							
							 
							
							
							
						 
						
							2023-12-07 14:36:07 +08:00  
						
					 
				
					
						
							
							
								 
								yancheng
							
						 
						
							 
							
							
							
							
								
							
							
								993ede7c70 
								
							 
						 
						
							
							
								
								loongarch64: Add optimizations for scal.  
							
							 
							
							
							
						 
						
							2023-12-07 14:36:07 +08:00  
						
					 
				
					
						
							
							
								 
								Martin Kroeker
							
						 
						
							 
							
							
								
								
							
							
							
								
							
							
								39bf8ece20 
								
							 
						 
						
							
							
								
								Merge pull request #4340 from yinshiyou/la-dev  
							
							 
							
							
							
							
							Add some refinements and optimizations for LoongArch. 
							
						 
						
							2023-11-29 08:22:25 +01:00  
						
					 
				
					
						
							
							
								 
								Shiyou Yin
							
						 
						
							 
							
							
							
							
								
							
							
								9fe07d82fd 
								
							 
						 
						
							
							
								
								loongarch: Add LSX optimization for dot.  
							
							 
							
							
							
						 
						
							2023-11-28 20:24:18 +08:00  
						
					 
				
					
						
							
							
								 
								Shiyou Yin
							
						 
						
							 
							
							
							
							
								
							
							
								13b8c44b44 
								
							 
						 
						
							
							
								
								loongarch: Add optimization for dsdot kernel.  
							
							 
							
							
							
						 
						
							2023-11-28 20:24:16 +08:00  
						
					 
				
					
						
							
							
								 
								Shiyou Yin
							
						 
						
							 
							
							
							
							
								
							
							
								3def6a8143 
								
							 
						 
						
							
							
								
								loongarch: Add LASX optimization for dot.  
							
							 
							
							
							
						 
						
							2023-11-28 20:24:14 +08:00  
						
					 
				
					
						
							
							
								 
								Bart Oldeman
							
						 
						
							 
							
							
							
							
								
							
							
								c34e2cf380 
								
							 
						 
						
							
							
								
								Use _mm_set1_epi{32,64x} to init mask in x86-64 [cz]asum  
							
							 
							
							
							
							
							for skylake kernels. This is the same method as used in [sd]asum.
_mm_set1_epi64x was commented out for zasum, but has the advantage
of avoiding possible undefined behaviour (using an uninitialized
variable), which NVHPC and icx optimize out. The new code works
fine with those compilers.
For GCC 12.3 the generated code is identical: no matter what method
you use, the compiler optimizes the code into a compile-time
constant, so there is no performance benefit to using _mm_cmpeq_epi8,
since the corresponding instruction (VPCMPEQB) isn't actually
generated! 
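
A minimal sketch of the mask-initialization approach, assuming a simplified real-valued sum with SSE2 rather than the AVX512 complex kernels the commit touches. `abs_sum_f32` is a hypothetical helper; `_mm_set1_epi32` builds a fully defined constant with the sign bit clear in every lane, so no uninitialized variable is ever read. A scalar path is kept so the sketch also builds on targets without SSE2.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper, not the OpenBLAS kernel: returns the sum of |x[i]|. */
float abs_sum_f32(const float *x, size_t n)
{
    assert(x != NULL || n == 0);

    float total = 0.0f;
    size_t i = 0;

#if defined(__SSE2__)
    /* Every lane is 0x7fffffff: all bits set except the IEEE sign bit. */
    const __m128 abs_mask = _mm_castsi128_ps(_mm_set1_epi32(0x7fffffff));
    __m128 acc = _mm_setzero_ps();

    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(x + i);
        /* AND with the mask clears the sign bit, giving |v| per lane. */
        acc = _mm_add_ps(acc, _mm_and_ps(v, abs_mask));
    }

    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif

    /* Scalar tail (or the whole array when SSE2 is unavailable). */
    for (; i < n; i++)
        total += x[i] < 0.0f ? -x[i] : x[i];
    return total;
}
```

Because the mask is a plain constant, a compiler is free to fold it at compile time, which matches the observation above that GCC 12.3 generates identical code either way.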
							
						 
						
							2023-11-19 21:28:35 +00:00  
						
					 
				
					
						
							
							
								 
								Martin Kroeker
							
						 
						
							 
							
							
								
								
							
							
							
								
							
							
								22aa401656 
								
							 
						 
						
							
							
								
								Temporarily disable the AVX512 CASUM/ZASUM microkernels for any version of NVIDIA HPC (#4327)  
							
							 
							
							
							
							
							* Temporarily disable the C/ZASUM microkernels for any version of NVHPC 
							
						 
						
							2023-11-19 00:04:31 +01:00  
						
					 
				
					
						
							
							
								 
								Bart Oldeman
							
						 
						
							 
							
							
							
							
								
							
							
								f8ad5344c2 
								
							 
						 
						
							
							
								
								Fix casum fallback kernel.  
							
							 
							
							
							
							
							This kernel is only used on Skylake+ if the kernel with AVX512
intrinsics can't be used, but it used the variable x1 incorrectly
in the tail end of the loop, as x1 is still at the initial
value instead of where x points to.
This caused 55 "other error"s in the LAPACK tests
(https://github.com/OpenMathLib/OpenBLAS/issues/4282).
This change makes casum.c as similar as possible to zasum.c,
because zasum.c does this correctly. 
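
The class of bug described here can be illustrated with a short sketch. This is not the code from casum.c: `casum_sketch` and `abs_f32` are hypothetical names, and the block size of four complex values is an assumption. The point is that the tail must read through the pointer that was advanced by the main loop, not through a copy taken before the loop.

```c
#include <assert.h>
#include <stddef.h>

static float abs_f32(float v) { return v < 0.0f ? -v : v; }

/* Hypothetical fallback: sum of |re| + |im| over n interleaved complex floats. */
float casum_sketch(const float *x, size_t n)
{
    assert(x != NULL || n == 0);

    float sum = 0.0f;
    size_t blocks = n / 4;  /* main loop handles 4 complex values (8 floats) */
    size_t tail = n % 4;    /* complex values left over for the tail */

    for (size_t b = 0; b < blocks; b++) {
        for (size_t k = 0; k < 8; k++)
            sum += abs_f32(x[k]);
        x += 8;             /* x advances; a pointer saved before the loop would not */
    }

    /* Tail: index from x, which now points just past the processed blocks. */
    for (size_t k = 0; k < 2 * tail; k++)
        sum += abs_f32(x[k]);
    return sum;
}
```

With five complex inputs, a tail that indexed from the start of the array would re-add the first element instead of the fifth, which is the kind of silently wrong result that surfaces only in downstream tests.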
							
						 
						
							2023-11-17 23:53:56 +00:00  
						
					 
				
					
						
							
							
								 
								Martin Kroeker
							
						 
						
							 
							
							
								
								
							
							
							
								
							
							
								04bc801999 
								
							 
						 
						
							
							
								
								(Re)apply fixes for supporting only a subset of precision types from PR 3915  
							
							 
							
							
							
						 
						
							2023-11-04 23:48:59 +01:00  
						
					 
				
					
						
							
							
								 
								Martin Kroeker
							
						 
						
							 
							
							
								
								
							
							
							
								
							
							
								9019bc4945 
								
							 
						 
						
							
							
								
								Use SkylakeX ?ASUM microkernel for Cooperlake/Sapphirerapids as well  
							
							 
							
							
							
						 
						
							2023-11-04 22:10:06 +01:00  
						
					 
				
					
						
							
							
								 
								Martin Kroeker
							
						 
						
							 
							
							
								
								
							
							
							
								
							
							
								3bfa4d4dcc 
								
							 
						 
						
							
							
								
								Fix outdated SVE kernel definitions for Cortex cpus by aliasing to ARMV8SVE  
							
							 
							
							
							
						 
						
							2023-11-03 14:55:31 +01:00  
						
					 
				
					
						
							
							
								 
								Rajalakshmi Srinivasaraghavan
							
						 
						
							 
							
							
							
							
								
							
							
								980f702f72 
								
							 
						 
						
							
							
								
								POWER: AIX: Make use of power10 optimization  
							
							 
							
							
							
							
							POWER10 optimizations are disabled when using the default AIX assembler.
As many issues have been fixed recently, this enables the optimization
path for the default assembler. 
							
						 
						
							2023-10-19 18:48:19 -05:00  
						
					 
				
					
						
							
							
								 
								Rajalakshmi Srinivasaraghavan
							
						 
						
							 
							
							
							
							
								
							
							
								9f42570e33 
								
							 
						 
						
							
							
								
								POWER: Increase macro size limit for AIX  
							
							 
							
							
							
							
							This patch increases the macro size limit from 4096 to 16384 to
allow compiling larger assembly files on AIX.
Tested with GCC and IBM Open XL C. 
							
						 
						
							2023-10-12 12:37:40 -05:00  
						
					 
				
					
						
							
							
								 
								Martin Kroeker
							
						 
						
							 
							
							
								
								
							
							
							
								
							
							
								9f49aef91b 
								
							 
						 
						
							
							
								
								Merge pull request #4255 from RajalakshmiSR/AIX-P10  
							
							 
							
							
							
							
							POWER10: Fix compilation issues with Open XL C 
							
						 
						
							2023-10-12 18:59:17 +02:00  
						
					 
				
					
						
							
							
								 
								Martin Kroeker
							
						 
						
							 
							
							
								
								
							
							
							
								
							
							
								e7d05402e0 
								
							 
						 
						
							
							
								
								Fix up S/D GEMM copy function definitions after #4009  
							
							 
							
							
							
						 
						
							2023-10-12 14:24:53 +02:00  
						
					 
				
					
						
							
							
								 
								Rajalakshmi Srinivasaraghavan
							
						 
						
							 
							
							
							
							
								
							
							
								71d733e5f7 
								
							 
						 
						
							
							
								
								POWER: Avoid m4 conversions for C files  
							
							 
							
							
							
							
							This patch removes the intermediate m4 conversions used in sbgemm
compilation, as they are not needed for .c files.
Tested on AIX with gcc and IBM Open XL C. 
							
						 
						
							2023-10-11 17:18:42 -05:00  
						
					 
				
					
						
							
							
								 
								Rajalakshmi Srinivasaraghavan
							
						 
						
							 
							
							
							
							
								
							
							
								82fc29a57a 
								
							 
						 
						
							
							
								
								POWER10: Fallback to POWER8 functions  
							
							 
							
							
							
							
							As the cgemm and zgemm kernels are not optimized for big endian, fall
back to the POWER8 versions. Tested on AIX using gcc and Open XL C. 
							
						 
						
							2023-10-11 17:04:42 -05:00  
						
					 
				
					
						
							
							
								 
								Rajalakshmi Srinivasaraghavan
							
						 
						
							 
							
							
							
							
								
							
							
								db0805906b 
								
							 
						 
						
							
							
								
								powerpc: Fix build errors with Open XL C  
							
							 
							
							
							
							
							This patch fixes errors when using the Open XL C compiler on AIX.
Tested with gcc/xlf and ibm-clang/xlf compiler combinations. 
							
						 
						
							2023-10-04 14:04:03 -05:00  
						
					 
				
					
						
							
							
								 
								Martin Kroeker
							
						 
						
							 
							
							
								
								
							
							
							
								
							
							
								675cd551da 
								
							 
						 
						
							
							
								
								fix improper function prototypes (empty parentheses)  
							
							 
							
							
							
						 
						
							2023-09-30 12:56:38 +02:00  
						
					 
				
					
						
							
							
								 
								gxw
							
						 
						
							 
							
							
							
							
								
							
							
								d15e0a055c 
								
							 
						 
						
							
							
								
								LoongArch64: Fixed compilation issues when DYNAMIC_ARCH is enabled  
							
							 
							
							
							
						 
						
							2023-09-27 10:05:27 +08:00  
						
					 
				
					
						
							
							
								 
								gxw
							
						 
						
							 
							
							
							
							
								
							
							
								4670eb1462 
								
							 
						 
						
							
							
								
								LoongArch64: Add dtrsm kernel  
							
							 
							
							
							
						 
						
							2023-09-26 15:45:14 +08:00  
						
					 
				
					
						
							
							
								 
								gxw
							
						 
						
							 
							
							
							
							
								
							
							
								f2cf929374 
								
							 
						 
						
							
							
								
								LoongArch64: Add sgemv kernel  
							
							 
							
							
							
						 
						
							2023-09-04 14:28:37 +08:00  
						
					 
				
					
						
							
							
								 
								Martin Kroeker
							
						 
						
							 
							
							
								
								
							
							
							
								
							
							
								8e6d93359d 
								
							 
						 
						
							
							
								
								Merge pull request #4196 from TiborGY/obsolete_inlines  
							
							 
							
							
							
							
							Modernize obsolete inline order 
							
						 
						
							2023-09-03 14:12:42 +02:00  
						
					 
				
					
						
							
							
								 
								gxw
							
						 
						
							 
							
							
							
							
								
							
							
								394a1fd1bf 
								
							 
						 
						
							
							
								
								LoongArch64: Compatible with early internal toolchain  
							
							 
							
							
							
							
							__loongarch_grlen and __loongarch_frlen were introduced in gcc version 8.3.0
(Loongnix 8.3.0-6.lnd.vec.31) internally within Loongson to standardize the
general and floating-point register widths. However, previous versions did
not have them, requiring additional checks to be added. 
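
The kind of additional check described here might look like the following preprocessor sketch. It is not taken from the commit: the `SKETCH_*` names and the two accessor functions are hypothetical, and using the legacy `__loongarch64` macro as the fallback signal (with an implied width of 64) is an assumption made for illustration.

```c
/* Prefer the standardized macro; fall back for toolchains that predate it. */
#if defined(__loongarch_grlen)
#  define SKETCH_GRLEN __loongarch_grlen
#elif defined(__loongarch64)
#  define SKETCH_GRLEN 64   /* assumption: legacy macro implies 64-bit general registers */
#else
#  define SKETCH_GRLEN 0    /* not a LoongArch target, or width unknown */
#endif

#if defined(__loongarch_frlen)
#  define SKETCH_FRLEN __loongarch_frlen
#elif defined(__loongarch64)
#  define SKETCH_FRLEN 64   /* assumption: early toolchains target 64-bit FP registers */
#else
#  define SKETCH_FRLEN 0    /* not a LoongArch target, or width unknown */
#endif

/* Hypothetical accessors so the selected widths can be inspected at run time. */
int sketch_grlen(void) { return SKETCH_GRLEN; }
int sketch_frlen(void) { return SKETCH_FRLEN; }
```

On a non-LoongArch host both accessors return 0, which makes the fallback path easy to exercise without the target toolchain.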
							
						 
						
							2023-08-31 16:55:29 +08:00  
						
					 
				
					
						
							
							
								 
								Martin Kroeker
							
						 
						
							 
							
							
								
								
							
							
							
								
							
							
								9c4ae4d4fb 
								
							 
						 
						
							
							
								
								Merge pull request #4206 from martin-frbg/issue4201-2  
							
							 
							
							
							
							
							Work around miscompilation of zdot_thunderx2t99 by the current NVIDIA HPC compiler 
							
						 
						
							2023-08-26 10:17:27 +02:00  
						
					 
				
					
						
							
							
								 
								Martin Kroeker
							
						 
						
							 
							
							
								
								
							
							
							
								
							
							
								88435104c8 
								
							 
						 
						
							
							
								
								Merge pull request #4204 from martin-frbg/llvm17-2  
							
							 
							
							
							
							
							Work around LLVM17 miscompiling the AVX512 microkernels for CASUM/ZASUM 
							
						 
						
							2023-08-26 00:32:18 +02:00  
						
					 
				
					
						
							
							
								 
								Martin Kroeker
							
						 
						
							 
							
							
								
								
							
							
							
								
							
							
								fc8894dd98 
								
							 
						 
						
							
							
								
								Work around miscompilation by NVIDIA nvc  
							
							 
							
							
							
						 
						
							2023-08-26 00:30:17 +02:00  
						
					 
				
					
						
							
							
								 
								Martin Kroeker
							
						 
						
							 
							
							
								
								
							
							
							
								
							
							
								7a6203ffa1 
								
							 
						 
						
							
							
								
								restore default Neoverse SVE build instructions for non-NVIDIA compilers  
							
							 
							
							
							
						 
						
							2023-08-25 18:25:51 +02:00  
						
					 
				
					
						
							
							
								 
								Martin Kroeker
							
						 
						
							 
							
							
								
								
							
							
							
								
							
							
								2c3034ff7f 
								
							 
						 
						
							
							
								
								Disable the C/ZASUM AVX512 microkernels when compiling with LLVM17 as well  
							
							 
							
							
							
						 
						
							2023-08-25 17:22:51 +02:00