bdd6e3a153 
								
							 
						 
						
							
							
								
								Merge pull request  #3157  from martin-frbg/issue3020-final  
							
							
							
							
							Add workaround for LAPACK testsuite failures with the NVIDIA HPC compiler on PPC 
							
						 
						
							2021-03-19 15:23:12 +01:00  
				
					
						
							
							
								 
						
							
								7b8f580941 
								
							 
						 
						
							
							
								
								Merge pull request  #3156  from martin-frbg/omatcopy_d  
							
							
							
							
							Move x86_64 DOMATCOPY_RT back to the C implementation 
							
						 
						
							2021-03-19 15:22:48 +01:00  
				
					
						
							
							
								 
						
							
								86c5a0013f 
								
							 
						 
						
							
							
								
								Add workaround for LAPACK testsuite failures with the NVIDIA HPC compiler  
							
							
							
						 
						
							2021-03-19 11:47:58 +01:00  
				
					
						
							
							
								 
						
							
								ef85c22474 
								
							 
						 
						
							
							
								
								Add workaround for LAPACK test failures with the NVIDIA HPC compiler  
							
							
							
						 
						
							2021-03-19 11:46:25 +01:00  
				
					
						
							
							
								 
						
							
								d3555d2e50 
								
							 
						 
						
							
							
								
								Add workaround for LAPACK test failures with the NVIDIA HPC compiler  
							
							
							
						 
						
							2021-03-19 11:44:31 +01:00  
				
					
						
							
							
								 
						
							
								0f5e86a0d9 
								
							 
						 
						
							
							
								
								Remove premature entry for DOMATCOPY_RT  
							
							
							
						 
						
							2021-03-18 21:53:50 +01:00  
				
					
						
							
							
								 
						
							
								7b294a99fd 
								
							 
						 
						
							
							
								
								Move common.h back to the top of the file so that SKYLAKEX (from config.h) is defined in time  
							
							
							
						 
						
							2021-03-18 21:28:19 +01:00  
				
					
						
							
							
								 
						
							
								0934568d9c 
								
							 
						 
						
							
							
								
								Move includes under the ifdef for compilers w/o intrinsics support  
							
							
							
						 
						
							2021-03-12 12:42:05 +01:00  
				
					
						
							
							
								 
						
							
								09d47af2c0 
								
							 
						 
						
							
							
								
								Optimize zscal function for POWER10  
							
							
							
							
							This patch makes use of new POWER10 vector pair instructions for loads and stores. 
							
						 
						
							2021-03-10 17:15:33 -06:00  
				
					
						
							
							
								 
						
							
								ef0238ba2b 
								
							 
						 
						
							
							
								
								Merge pull request  #3130  from martin-frbg/issue3128  
							
							
							
							
							Replace spurious AVX512 requirement in the Haswell srot microkernel with an AVX2/FMA3 guard 
							
						 
						
							2021-03-06 19:15:53 +01:00  
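
A compile-time guard of the kind this merge describes can be sketched as follows. This is an illustration only, not the Haswell srot microkernel: the function name `srot_sketch` and the macro `HAVE_SROT_SIMD` are hypothetical, and the rotation formula is the standard plane rotation. The vector path is compiled only when the compiler advertises both AVX2 and FMA3; otherwise only the scalar loop remains.

```c
#include <stddef.h>

/* Hypothetical guard: enable the vector path only when the compiler
   defines both __AVX2__ and __FMA__ (no AVX512 required). */
#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
#define HAVE_SROT_SIMD 1
#endif

/* Apply a plane rotation: x' = c*x + s*y, y' = c*y - s*x. */
static void srot_sketch(size_t n, float *x, float *y, float c, float s)
{
    size_t i = 0;
#ifdef HAVE_SROT_SIMD
    __m256 vc = _mm256_set1_ps(c);
    __m256 vs = _mm256_set1_ps(s);
    /* Eight floats per iteration using fused multiply-add/subtract. */
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        __m256 nx = _mm256_fmadd_ps(vc, vx, _mm256_mul_ps(vs, vy));
        __m256 ny = _mm256_fmsub_ps(vc, vy, _mm256_mul_ps(vs, vx));
        _mm256_storeu_ps(x + i, nx);
        _mm256_storeu_ps(y + i, ny);
    }
#endif
    /* Scalar tail, and the whole loop when the guard is not satisfied. */
    for (; i < n; i++) {
        float tx = c * x[i] + s * y[i];
        float ty = c * y[i] - s * x[i];
        x[i] = tx;
        y[i] = ty;
    }
}
```

Built without `-mavx2 -mfma` (or an equivalent `-march` setting), the same source still compiles and runs through the scalar loop.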
				
					
						
							
							
								 
						
							
								a9f6f7ad39 
								
							 
						 
						
							
							
								
								Remove spurious AVX512 requirement and add AVX2/FMA3 guard  
							
							
							
						 
						
							2021-03-06 14:35:49 +01:00  
				
					
						
							
							
								 
						
							
								41646ed006 
								
							 
						 
						
							
							
								
								Optimize s/dasum function for POWER10  
							
							
							
							
							This patch makes use of new POWER10 vector pair instructions for loads and stores. 
							
						 
						
							2021-03-05 16:22:36 -06:00  
				
					
						
							
							
								 
						
							
								0571c3187b 
								
							 
						 
						
							
							
								
								POWER10: Rename mma builtins  
							
							
							
							
							The LLVM and GCC teams agreed to rename the __builtin_mma_assemble_pair and __builtin_mma_disassemble_pair built-ins to __builtin_vsx_assemble_pair and __builtin_vsx_disassemble_pair, respectively. This patch makes the corresponding changes in the dgemm kernel and also changes the inputs to those built-ins to avoid potential typecasting issues.
							Reference GCC commit: 77ef995c1fbcab76a2a69b9f4700bcfd005d8e62 
							
						 
						
							2021-02-26 20:56:34 -06:00  
				
					
						
							
							
								 
						
							
								292d1af1a0 
								
							 
						 
						
							
							
								
								Update omatcopy_rt.c  
							
							
							
						 
						
							2021-02-24 09:34:14 +01:00  
				
					
						
							
							
								 
						
							
								325b398e3c 
								
							 
						 
						
							
							
								
								Update omatcopy_rt.c  
							
							
							
						 
						
							2021-02-24 09:13:12 +01:00  
				
					
						
							
							
								 
						
							
								6f5667b4d4 
								
							 
						 
						
							
							
								
								Enable optimized S/D OMATCOPY_RT  
							
							
							
						 
						
							2021-02-24 09:03:41 +01:00  
				
					
						
							
							
								 
						
							
								cceeee7806 
								
							 
						 
						
							
							
								
								Add optimized omatcopy_rt  
							
							
							
						 
						
							2021-02-24 09:00:54 +01:00  
				
					
						
							
							
								 
						
							
								0a4546b742 
								
							 
						 
						
							
							
								
								Typo fix  
							
							
							
						 
						
							2021-02-23 13:14:35 +01:00  
				
					
						
							
							
								 
						
							
								b1eed27a54 
								
							 
						 
						
							
							
								
								Replace naive omatcopy_rt with 4x4 blocked implementation  
							
							
							
							
							as suggested by MigMuc in issue 2532 
							
						 
						
							2021-02-22 21:35:42 +01:00  
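
For readers unfamiliar with the technique, a 4x4 blocked out-of-place transpose with scaling might look roughly like the sketch below. It is not the code from this commit: the function name `omatcopy_rt_blocked`, its argument order, and the row-major convention are assumptions made for illustration. The idea is to walk the matrix in 4x4 tiles so that reads from A and writes to B both stay within a small working set.

```c
#include <stddef.h>

/* Hypothetical helper, not the OpenBLAS kernel.
   Computes B = alpha * A^T, where A is rows x cols (row-major, leading
   dimension lda) and B is cols x rows (row-major, leading dimension ldb). */
static void omatcopy_rt_blocked(size_t rows, size_t cols, double alpha,
                                const double *a, size_t lda,
                                double *b, size_t ldb)
{
    enum { BLK = 4 };  /* tile edge length */

    for (size_t i0 = 0; i0 < rows; i0 += BLK) {
        /* Clamp the tile so partial tiles at the edges are handled. */
        size_t i1 = (i0 + BLK < rows) ? i0 + BLK : rows;
        for (size_t j0 = 0; j0 < cols; j0 += BLK) {
            size_t j1 = (j0 + BLK < cols) ? j0 + BLK : cols;
            /* Transpose and scale one tile of at most 4x4 elements. */
            for (size_t i = i0; i < i1; i++)
                for (size_t j = j0; j < j1; j++)
                    b[j * ldb + i] = alpha * a[i * lda + j];
        }
    }
}
```

A production kernel would typically unroll the inner 4x4 tile by hand; the clamped loops here trade that for brevity.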
				
					
						
							
							
								 
						
							
								47691c031f 
								
							 
						 
						
							
							
								
								Use Haswell optimizations for Zen as well  
							
							
							
						 
						
							2021-02-11 09:26:15 +01:00  
				
					
						
							
							
								 
						
							
								ce7ddd8921 
								
							 
						 
						
							
							
								
								Use Haswell optimizations for Zen as well  
							
							
							
						 
						
							2021-02-11 09:25:36 +01:00  
				
					
						
							
							
								 
						
							
								950c047b49 
								
							 
						 
						
							
							
								
								Use Haswell optimizations for Zen as well  
							
							
							
						 
						
							2021-02-11 09:24:51 +01:00  
				
					
						
							
							
								 
						
							
								46509953a9 
								
							 
						 
						
							
							
								
								Use Haswell optimizations for Zen as well  
							
							
							
						 
						
							2021-02-11 09:24:16 +01:00  
				
					
						
							
							
								 
						
							
								db348dcff2 
								
							 
						 
						
							
							
								
								Enable optimized srot/drot kernels from Haswell  
							
							
							
						 
						
							2021-02-11 09:23:05 +01:00  
				
					
						
							
							
								 
						
							
								2056ffc227 
								
							 
						 
						
							
							
								
								Optimize cscal function for POWER10  
							
							
							
							
							This patch makes use of new POWER10 vector pair instructions for loads and stores. 
							
						 
						
							2021-01-29 13:51:43 -06:00  
				
					
						
							
							
								 
						
							
								3ede843d50 
								
							 
						 
						
							
							
								
								Optimize s/dscal function for POWER10  
							
							
							
							
							This patch makes use of new POWER10 vector pair instructions for loads and stores. 
							
						 
						
							2021-01-24 07:48:28 -06:00  
				
					
						
							
							
								 
						
							
								69a5558203 
								
							 
						 
						
							
							
								
								Merge pull request  #3059  from Guobing-Chen/BF16_gemm  
							
							
							
							
							Initial code for Cooperlake BF16 GEMM kernel 
							
						 
						
							2021-01-23 19:08:05 +01:00  
				
					
						
							
							
								 
						
							
								d6905403e3 
								
							 
						 
						
							
							
								
								Merge pull request  #3068  from alexhenrie/scan-build  
							
							
							
							
							scan-build fixes 
							
						 
						
							2021-01-23 19:06:29 +01:00  
				
					
						
							
							
								 
						
							
								439b93f6d2 
								
							 
						 
						
							
							
								
								Optimize s/drot function for POWER10  
							
							
							
							
							This patch makes use of new POWER10 vector pair instructions for loads and stores. 
							
						 
						
							2021-01-21 13:24:45 -06:00  
				
					
						
							
							
								 
						
							
								eff7c9166e 
								
							 
						 
						
							
							
								
								Optimize cdot function for POWER10  
							
							
							
							
							This patch makes use of new POWER10 vector pair instructions for loads and stores. 
							
						 
						
							2021-01-15 13:40:34 -06:00  
				
					
						
							
							
								 
						
							
								202fc9e8ed 
								
							 
						 
						
							
							
								
								Fix uninitialized argument value in dasum_k  
							
							
							
						 
						
							2021-01-14 19:40:31 -07:00  
				
					
						
							
							
								 
						
							
								e378b24487 
								
							 
						 
						
							
							
								
								Merge pull request  #3067  from albertziegenhagel/fix-generic-cmake  
							
							
							
							
							Fix building "generic" TRMM kernel with CMake 
							
						 
						
							2021-01-14 21:35:19 +01:00  
				
					
						
							
							
								 
						
							
								e3f4063683 
								
							 
						 
						
							
							
								
								Fix building "generic" TRMM kernel with CMake  
							
							
							
							
							The CMake "TARGET_CORE" variable stores the "generic" target name in all lowercase letters, but it is compared to an all-uppercase string, which results in the wrong TRMM kernel being selected.
							This commit converts TARGET_CORE to all uppercase before comparing its value, so that case mismatches can no longer cause this problem. 
							
						 
						
							2021-01-14 10:00:49 +01:00  
				
					
						
							
							
								 
						
							
								b716c0ef01 
								
							 
						 
						
							
							
								
								Add workaround for NVIDIA HPC  
							
							
							
						 
						
							2021-01-12 16:51:35 +01:00  
				
					
						
							
							
								 
						
							
								2efa3b70dc 
								
							 
						 
						
							
							
								
								Add workaround for NVIDIA HPC  
							
							
							
						 
						
							2021-01-12 16:49:39 +01:00  
				
					
						
							
							
								 
						
							
								49959d4f1c 
								
							 
						 
						
							
							
								
								Add workaround for NVIDIA HPC  
							
							
							
						 
						
							2021-01-12 16:47:15 +01:00  
				
					
						
							
							
								 
						
							
								0f27a03607 
								
							 
						 
						
							
							
								
								Add workaround for NVIDIA HPC mishandling of the asm DOT kernels  
							
							
							
						 
						
							2021-01-12 16:39:35 +01:00  
				
					
						
							
							
								 
						
							
								c2a8ebfe69 
								
							 
						 
						
							
							
								
								Add workaround for NVIDIA HPC mishandling of the asm DOT kernels  
							
							
							
						 
						
							2021-01-12 16:38:51 +01:00  
				
					
						
							
							
								 
						
							
								43aac5bacc 
								
							 
						 
						
							
							
								
								Support NVIDIA HPC compiler  
							
							
							
						 
						
							2021-01-12 16:36:12 +01:00  
				
					
						
							
							
								 
						
							
								b0beb0b1ca 
								
							 
						 
						
							
							
								
								Initial code for Cooperlake BF16 GEMM kernel  
							
							
							
						 
						
							2021-01-11 02:15:21 +08:00  
				
					
						
							
							
								 
						
							
								601b711c78 
								
							 
						 
						
							
							
								
								Optimize swap function for POWER10  
							
							
							
							
							This patch makes use of new POWER10 vector pair instructions for loads and stores. 
							
						 
						
							2021-01-08 08:01:36 -06:00  
				
					
						
							
							
								 
						
							
								1b2508362b 
								
							 
						 
						
							
							
								
								arm64: Fix nrm2 for input vectors with Inf  
							
							
							
							
							Fix double precision nrm2 kernels returning NaN when the input vectors contain Inf/-Inf. 
							
						 
						
							2021-01-01 02:49:37 -08:00  
				
					
						
							
							
								 
						
							
								3559c5d7a2 
								
							 
						 
						
							
							
								
								Merge pull request  #3048  from martin-frbg/issue2998  
							
							
							
							
							Temporarily revert to the old NRM2 kernels for ThunderX2/3 and NeoverseN1 
							
						 
						
							2020-12-21 13:30:08 +01:00  
				
					
						
							
							
								 
						
							
								8631e2976a 
								
							 
						 
						
							
							
								
								Temporarily revert to the old nrm2 kernels  
							
							
							
						 
						
							2020-12-21 07:45:13 +01:00  
				
					
						
							
							
								 
						
							
								2768bc1764 
								
							 
						 
						
							
							
								
								Temporarily revert to the old nrm2 kernels  
							
							
							
						 
						
							2020-12-21 07:42:51 +01:00  
				
					
						
							
							
								 
						
							
								6f4698ee1f 
								
							 
						 
						
							
							
								
								Temporarily revert to the old nrm2 kernel  
							
							
							
						 
						
							2020-12-21 07:41:18 +01:00  
				
					
						
							
							
								 
						
							
								114eb159a4 
								
							 
						 
						
							
							
								
								Disable FMA intrinsics in the srot kernel when the compiler is PGI/NVIDIA  
							
							
							
						 
						
							2020-12-19 22:15:58 +01:00  
				
					
						
							
							
								 
						
							
								005cce5507 
								
							 
						 
						
							
							
								
								Amend SkylakeX options to support the NVIDIA compiler  
							
							
							
						 
						
							2020-12-19 22:11:49 +01:00  
				
					
						
							
							
								 
						
							
								c73d8ee40d 
								
							 
						 
						
							
							
								
								Conditionally add -mfma to compiler options where needed  
							
							
							
						 
						
							2020-12-17 11:34:05 +01:00  
				
					
						
							
							
								 
						
							
								2fb11f873b 
								
							 
						 
						
							
							
								
								POWER10: Improve copy performance  
							
							
							
							
							This patch aligns the stores to a 32-byte boundary for scopy and dcopy before entering the vector pair loop. For ccopy, the store instructions are changed to stxv to improve performance in unaligned cases. 
							
						 
						
							2020-12-13 10:41:45 -06:00
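
The alignment step described above can be sketched in portable C as a scalar prologue followed by a main loop whose stores start on a 32-byte boundary. This is only an illustration of the general idea, not the POWER10 kernel: the function name `copy_aligned_stores` is hypothetical, and plain assignments stand in for the vector pair instructions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the POWER10 kernel. Copies n doubles from src
   to dst so that the main loop's stores begin on a 32-byte boundary. */
static void copy_aligned_stores(size_t n, const double *src, double *dst)
{
    size_t i = 0;

    /* Scalar prologue: advance until dst + i is 32-byte aligned. */
    while (i < n && ((uintptr_t)(dst + i) & 31u) != 0) {
        dst[i] = src[i];
        i++;
    }

    /* Main loop: 32 bytes (four doubles) per iteration. A real kernel
       would issue one vector pair store here instead of four scalars. */
    for (; i + 4 <= n; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }

    /* Scalar tail for the remaining elements. */
    for (; i < n; i++)
        dst[i] = src[i];
}
```

Only the destination is aligned in this sketch; loads from src may still be unaligned.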