OpenBLAS

Author	SHA1	Message	Date
Rajalakshmi Srinivasaraghavan	2379abaa5e	POWER10: Improve dgemm performance. This patch uses a vector pair pointer for the input load operations, which helps to generate POWER10 lxvp instructions.	2021-04-13 22:30:06 -05:00
Rajalakshmi Srinivasaraghavan	55bb9f639a	POWER10: Optimized zgemv. This patch makes use of the Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1 for zgemv_n and zgemv_t.	2021-04-10 19:00:24 -05:00
Martin Kroeker	2dfb24730d	Use "old" compute(24) function with clang due to register limitations	2021-04-06 19:58:32 +02:00
Martin Kroeker	147e0a75fd	Merge pull request #3170 from CodesWithWolves/sgemm_tcopy_16-invalid-read Remove Unnecessary/Erroneous Adds/Reads In sgemm_tcopy_16.S COPY1x8 Macro	2021-04-03 19:49:47 +02:00
Rajalakshmi Srinivasaraghavan	2dbcddd83d	POWER10: Adding check for little endian. This patch makes sure that recent POWER10 patches are used only for little endian.	2021-03-31 21:32:42 -05:00
CodesWithWolves	d2bda3b56a	Remove Unnecessary/Erroneous Reads In sgemm_tcopy_16.S COPY1x8 Macro. There appears to have been some code leak when copying from the COPY2x8 macro above, where we're reading 8 bytes into d4-d7 directly after reading 4 bytes into s4-s7. These 32 bytes in d4-d7 are unused and can possibly overrun the boundary of allocated memory. Valgrind detected this for a 128,1 copy, which is what drew my attention to it. Additionally, there is no need to update the addresses stored in A0-A7, as the only possible paths after running this macro will overwrite A0-A7 if looping to the next 8 rows, or overwrite A0-A3 if moving to 4 rows, in which case A4-A7 are unused.	2021-03-31 15:44:25 -04:00
Martin Kroeker	bdd6e3a153	Merge pull request #3157 from martin-frbg/issue3020-final Add workaround for LAPACK testsuite failures with the NVIDIA HPC compiler on PPC	2021-03-19 15:23:12 +01:00
Martin Kroeker	7b8f580941	Merge pull request #3156 from martin-frbg/omatcopy_d Move x86_64 DOMATCOPY_RT back to the C implementation	2021-03-19 15:22:48 +01:00
Martin Kroeker	86c5a0013f	Add workaround for LAPACK testsuite failures with the NVIDIA HPC compiler	2021-03-19 11:47:58 +01:00
Martin Kroeker	ef85c22474	Add workaround for LAPACK test failures with the NVIDIA HPC compiler	2021-03-19 11:46:25 +01:00
Martin Kroeker	d3555d2e50	Add workaround for LAPACK test failures with the NVIDIA HPC compiler	2021-03-19 11:44:31 +01:00
Martin Kroeker	0f5e86a0d9	Remove premature entry for DOMATCOPY_RT	2021-03-18 21:53:50 +01:00
Martin Kroeker	7b294a99fd	Move common.h back to the top of the file so that SKYLAKEX (from config.h) is defined in time	2021-03-18 21:28:19 +01:00
Martin Kroeker	0934568d9c	Move includes under the ifdef for compilers w/o intrinsics support	2021-03-12 12:42:05 +01:00
Rajalakshmi Srinivasaraghavan	09d47af2c0	Optimize zscal function for POWER10. This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-03-10 17:15:33 -06:00
Martin Kroeker	ef0238ba2b	Merge pull request #3130 from martin-frbg/issue3128 Replace spurious AVX512 requirement in the Haswell srot microkernel with an AVX2/FMA3 guard	2021-03-06 19:15:53 +01:00
Martin Kroeker	a9f6f7ad39	Remove spurious AVX512 requirement and add AVX2/FMA3 guard	2021-03-06 14:35:49 +01:00
Rajalakshmi Srinivasaraghavan	41646ed006	Optimize s/dasum function for POWER10. This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-03-05 16:22:36 -06:00
Rajalakshmi Srinivasaraghavan	0571c3187b	POWER10: Rename mma builtins. The LLVM and GCC teams agreed to rename the __builtin_mma_assemble_pair and __builtin_mma_disassemble_pair built-ins to __builtin_vsx_assemble_pair and __builtin_vsx_disassemble_pair respectively. This patch makes the corresponding changes in the dgemm kernel. It also changes the inputs to those builtins to avoid some potential typecasting issues. Reference GCC commit ID: 77ef995c1fbcab76a2a69b9f4700bcfd005d8e62	2021-02-26 20:56:34 -06:00
Martin Kroeker	292d1af1a0	Update omatcopy_rt.c	2021-02-24 09:34:14 +01:00
Martin Kroeker	325b398e3c	Update omatcopy_rt.c	2021-02-24 09:13:12 +01:00
Martin Kroeker	6f5667b4d4	Enable optimized S/D OMATCOPY_RT	2021-02-24 09:03:41 +01:00
Martin Kroeker	cceeee7806	Add optimized omatcopy_rt	2021-02-24 09:00:54 +01:00
Martin Kroeker	0a4546b742	Typo fix	2021-02-23 13:14:35 +01:00
Martin Kroeker	b1eed27a54	Replace naive omatcopy_rt with 4x4 blocked implementation as suggested by MigMuc in issue 2532	2021-02-22 21:35:42 +01:00
Martin Kroeker	47691c031f	Use Haswell optimizations for Zen as well	2021-02-11 09:26:15 +01:00
Martin Kroeker	ce7ddd8921	Use Haswell optimizations for Zen as well	2021-02-11 09:25:36 +01:00
Martin Kroeker	950c047b49	Use Haswell optimizations for Zen as well	2021-02-11 09:24:51 +01:00
Martin Kroeker	46509953a9	Use Haswell optimizations for Zen as well	2021-02-11 09:24:16 +01:00
Martin Kroeker	db348dcff2	Enable optimized srot/drot kernels from Haswell	2021-02-11 09:23:05 +01:00
Rajalakshmi Srinivasaraghavan	2056ffc227	Optimize cscal function for POWER10. This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-01-29 13:51:43 -06:00
Rajalakshmi Srinivasaraghavan	3ede843d50	Optimize s/dscal function for POWER10. This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-01-24 07:48:28 -06:00
Martin Kroeker	69a5558203	Merge pull request #3059 from Guobing-Chen/BF16_gemm Initial code for Cooperlake BF16 GEMM kernel	2021-01-23 19:08:05 +01:00
Martin Kroeker	d6905403e3	Merge pull request #3068 from alexhenrie/scan-build scan-build fixes	2021-01-23 19:06:29 +01:00
Rajalakshmi Srinivasaraghavan	439b93f6d2	Optimize s/drot function for POWER10. This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-01-21 13:24:45 -06:00
Rajalakshmi Srinivasaraghavan	eff7c9166e	Optimize cdot function for POWER10. This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-01-15 13:40:34 -06:00
Alex Henrie	202fc9e8ed	Fix uninitialized argument value in dasum_k	2021-01-14 19:40:31 -07:00
Martin Kroeker	e378b24487	Merge pull request #3067 from albertziegenhagel/fix-generic-cmake Fix building "generic" TRMM kernel with CMake	2021-01-14 21:35:19 +01:00
Albert Ziegenhagel	e3f4063683	Fix building "generic" TRMM kernel with CMake. The CMake "TARGET_CORE" variable stores the "generic" target name in all lowercase letters, but it is compared to an all-uppercase string, which results in the wrong TRMM kernel being selected. This commit converts TARGET_CORE to all uppercase before comparing its value, so that case mismatches are no longer an issue.	2021-01-14 10:00:49 +01:00
Martin Kroeker	b716c0ef01	Add workaround for NVIDIA HPC	2021-01-12 16:51:35 +01:00
Martin Kroeker	2efa3b70dc	Add workaround for NVIDIA HPC	2021-01-12 16:49:39 +01:00
Martin Kroeker	49959d4f1c	Add workaround for NVIDIA HPC	2021-01-12 16:47:15 +01:00
Martin Kroeker	0f27a03607	Add workaround for NVIDIA HPC mishandling of the asm DOT kernels	2021-01-12 16:39:35 +01:00
Martin Kroeker	c2a8ebfe69	Add workaround for NVIDIA HPC mishandling of the asm DOT kernels	2021-01-12 16:38:51 +01:00
Martin Kroeker	43aac5bacc	Support NVIDIA HPC compiler	2021-01-12 16:36:12 +01:00
Chen, Guobing	b0beb0b1ca	Initial code for Cooperlake BF16 GEMM kernel	2021-01-11 02:15:21 +08:00
Rajalakshmi Srinivasaraghavan	601b711c78	Optimize swap function for POWER10. This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-01-08 08:01:36 -06:00
Ashwin Sekhar T K	1b2508362b	arm64: Fix nrm2 for input vectors with Inf. Fix double precision nrm2 kernels returning NaN when the input vectors contain Inf/-Inf.	2021-01-01 02:49:37 -08:00
Martin Kroeker	3559c5d7a2	Merge pull request #3048 from martin-frbg/issue2998 Temporarily revert to the old NRM2 kernels for ThunderX2/3 and NeoverseN1	2020-12-21 13:30:08 +01:00
Martin Kroeker	8631e2976a	Temporarily revert to the old nrm2 kernels	2020-12-21 07:45:13 +01:00

1640 Commits