OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Rajalakshmi Srinivasaraghavan	cbb70438df	POWER10: Fixes for sbgemm kernel. While testing bfloat16 sbgemm kernel, there are some failures for odd value inputs due to array access beyond the boundary.	2021-06-09 12:20:09 -05:00
Zhaofeng Li	590be3fae3	riscv64: Add Makefile	2021-06-07 22:55:56 +00:00
Zhaofeng Li	3521cd48cb	RISCV64_GENERIC: Use generic kernel for DSDOT for better precision. The implementation in `riscv64/dot.c` fails the `test_dsdot` test, and the generic kernel seems to have better precision. Tested on SiFive FU740 (HiFive Unmatched) and QEMU. Also see #1469.	2021-06-07 22:50:23 +00:00
Zhaofeng Li	1e0192a5cc	riscv64/imin: Fix wrong comparison. Same as #1990.	2021-06-07 22:49:39 +00:00
Martin Kroeker	5f677e782e	Merge pull request #3196 from guowangy/skylakex-gemm-batch-k GEMM: skylake: improve the performance when m is small	2021-05-22 19:25:28 +02:00
Martin Kroeker	02087a62e7	Merge pull request #3205 from intelmy/sgemv_n_opt optimize on sgemv_n for small n	2021-05-17 17:49:01 +02:00
Martin Kroeker	4ecf631f95	Merge pull request #3228 from martin-frbg/issue3226 filter out -mavx flag on Sandybridge zgemm/ztrmm kernels	2021-05-15 09:06:12 +02:00
Martin Kroeker	310b76aad7	Merge pull request #3231 from martin-frbg/issue3227 Support compilation with pre-C99 versions of MSVC	2021-05-14 23:28:06 +02:00
Martin Kroeker	c4da892ba0	Only filter out -mavx on Sandybridge ZGEMM/ZTRMM kernels	2021-05-14 23:19:10 +02:00
Martin Kroeker	8b90e5f202	Drop redundant inclusion of complex.h	2021-05-14 15:06:44 +02:00
Martin Kroeker	bd60fb6ffc	filter out -mavx flag on zgemm kernels as it can cause problems with older gcc	2021-05-13 23:05:00 +02:00
Martin Kroeker	37ea8702ee	Merge pull request #3192 from damonyu1989/develop Update the intrinsic api to the official name.	2021-05-11 16:00:45 +02:00
Martin Kroeker	c0ca63ea46	Fix missing conditionals for non-SKX kernels	2021-05-05 14:55:36 +02:00
pnp	3d4ccd2a13	fix for build error	2021-04-30 12:25:33 -04:00
pnp	c59652f0ce	optimize on sgemv_n for small n	2021-04-30 12:14:58 -04:00
Wangyang Guo	aa7b3dc3db	GEMM: skylake: improve the performance when m is small	2021-04-28 13:56:06 +00:00
damonyu	ceb44bef14	update the intrinsic api to the official name.	2021-04-27 11:12:29 +08:00
Martin Kroeker	3d511f0e66	replace spurious avx512 requirement with fma check	2021-04-26 21:55:30 +02:00
Rajalakshmi Srinivasaraghavan	2379abaa5e	POWER10: Improve dgemm performance. This patch uses vector pair pointer for input load operation which helps to generate power10 lxvp instructions.	2021-04-13 22:30:06 -05:00
Rajalakshmi Srinivasaraghavan	55bb9f639a	POWER10: Optimized zgemv. This patch makes use of Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1 for zgemv_n and zgemv_t.	2021-04-10 19:00:24 -05:00
Martin Kroeker	2dfb24730d	Use "old" compute(24) function with clang due to register limitations	2021-04-06 19:58:32 +02:00
Martin Kroeker	147e0a75fd	Merge pull request #3170 from CodesWithWolves/sgemm_tcopy_16-invalid-read Remove Unnecessary/Erroneous Adds/Reads In sgemm_tcopy_16.S COPY1x8 Macro	2021-04-03 19:49:47 +02:00
Rajalakshmi Srinivasaraghavan	2dbcddd83d	POWER10: Adding check for little endian. This patch makes sure that recent POWER10 patches are used only for little endian.	2021-03-31 21:32:42 -05:00
CodesWithWolves	d2bda3b56a	Remove Unnecessary/Erroneous Reads In sgemm_tcopy_16.S COPY1x8 Macro. There appears to have been some code leak when copying from the COPY2x8 macro above where we're reading 8 bytes into d4-d7 directly after reading 4 bytes into s4-s7. These 32 bytes in d4-7 are unused and can possibly overrun the boundary of allocated memory -- Valgrind detected this which is what dragged my attention to it for a 128,1 copy. Additionally, there is no need to update the addresses stored in A0-A7 as the only possible paths after running this macro will overwrite A0-7 if looping to the next 8 rows, or overwrite A0-3 if moving to 4 rows -- in which case A4-7 are unused.	2021-03-31 15:44:25 -04:00
Martin Kroeker	bdd6e3a153	Merge pull request #3157 from martin-frbg/issue3020-final Add workaround for LAPACK testsuite failures with the NVIDIA HPC compiler on PPC	2021-03-19 15:23:12 +01:00
Martin Kroeker	7b8f580941	Merge pull request #3156 from martin-frbg/omatcopy_d Move x86_64 DOMATCOPY_RT back to the C implementation	2021-03-19 15:22:48 +01:00
Martin Kroeker	86c5a0013f	Add workaround for LAPACK testsuite failures with the NVIDIA HPC compiler	2021-03-19 11:47:58 +01:00
Martin Kroeker	ef85c22474	Add workaround for LAPACK test failures with the NVIDIA HPC compiler	2021-03-19 11:46:25 +01:00
Martin Kroeker	d3555d2e50	Add workaround for LAPACK test failures with the NVIDIA HPC compiler	2021-03-19 11:44:31 +01:00
Martin Kroeker	0f5e86a0d9	Remove premature entry for DOMATCOPY_RT	2021-03-18 21:53:50 +01:00
Martin Kroeker	7b294a99fd	Move common.h back to the top of the file so that SKYLAKEX (from config.h) is defined in time	2021-03-18 21:28:19 +01:00
Martin Kroeker	0934568d9c	Move includes under the ifdef for compilers w/o intrinsics support	2021-03-12 12:42:05 +01:00
Rajalakshmi Srinivasaraghavan	09d47af2c0	Optimize zscal function for POWER10. This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-03-10 17:15:33 -06:00
Martin Kroeker	ef0238ba2b	Merge pull request #3130 from martin-frbg/issue3128 Replace spurious AVX512 requirement in the Haswell srot microkernel with an AVX2/FMA3 guard	2021-03-06 19:15:53 +01:00
Martin Kroeker	a9f6f7ad39	Remove spurious AVX512 requirement and add AVX2/FMA3 guard	2021-03-06 14:35:49 +01:00
Rajalakshmi Srinivasaraghavan	41646ed006	Optimize s/dasum function for POWER10. This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-03-05 16:22:36 -06:00
Rajalakshmi Srinivasaraghavan	0571c3187b	POWER10: Rename mma builtins. The LLVM and GCC teams agreed to rename the __builtin_mma_assemble_pair and __builtin_mma_disassemble_pair built-ins to __builtin_vsx_assemble_pair and __builtin_vsx_disassemble_pair respectively. This patch is to make corresponding changes in dgemm kernel. Also made changes in inputs to those builtins to avoid some potential typecasting issues. Reference gcc commit id: 77ef995c1fbcab76a2a69b9f4700bcfd005d8e62	2021-02-26 20:56:34 -06:00
Martin Kroeker	292d1af1a0	Update omatcopy_rt.c	2021-02-24 09:34:14 +01:00
Martin Kroeker	325b398e3c	Update omatcopy_rt.c	2021-02-24 09:13:12 +01:00
Martin Kroeker	6f5667b4d4	Enable optimized S/D OMATCOPY_RT	2021-02-24 09:03:41 +01:00
Martin Kroeker	cceeee7806	Add optimized omatcopy_rt	2021-02-24 09:00:54 +01:00
Martin Kroeker	0a4546b742	Typo fix	2021-02-23 13:14:35 +01:00
Martin Kroeker	b1eed27a54	Replace naive omatcopy_rt with 4x4 blocked implementation as suggested by MigMuc in issue 2532	2021-02-22 21:35:42 +01:00
Martin Kroeker	47691c031f	Use Haswell optimizations for Zen as well	2021-02-11 09:26:15 +01:00
Martin Kroeker	ce7ddd8921	Use Haswell optimizations for Zen as well	2021-02-11 09:25:36 +01:00
Martin Kroeker	950c047b49	Use Haswell optimizations for Zen as well	2021-02-11 09:24:51 +01:00
Martin Kroeker	46509953a9	Use Haswell optimizations for Zen as well	2021-02-11 09:24:16 +01:00
Martin Kroeker	db348dcff2	Enable optimized srot/drot kernels from Haswell	2021-02-11 09:23:05 +01:00
Rajalakshmi Srinivasaraghavan	2056ffc227	Optimize cscal function for POWER10. This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-01-29 13:51:43 -06:00
Rajalakshmi Srinivasaraghavan	3ede843d50	Optimize s/dscal function for POWER10. This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-01-24 07:48:28 -06:00

1658 Commits