OpenBLAS

Author	SHA1	Message	Date
Rajalakshmi Srinivasaraghavan	2379abaa5e	POWER10: Improve dgemm performance This patch uses vector pair pointer for input load operation which helps to generate power10 lxvp instructions.	2021-04-13 22:30:06 -05:00
Rajalakshmi Srinivasaraghavan	55bb9f639a	POWER10: Optimized zgemv This patch makes use of Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1 for zgemv_n and zgemv_t.	2021-04-10 19:00:24 -05:00
Rajalakshmi Srinivasaraghavan	2dbcddd83d	POWER10: Adding check for little endian This patch makes sure that recent POWER10 patches are used only for little endian.	2021-03-31 21:32:42 -05:00
Martin Kroeker	86c5a0013f	Add workaround for LAPACK testsuite failures with the NVIDIA HPC compiler	2021-03-19 11:47:58 +01:00
Martin Kroeker	ef85c22474	Add workaround for LAPACK test failures with the NVIDIA HPC compiler	2021-03-19 11:46:25 +01:00
Martin Kroeker	d3555d2e50	Add workaround for LAPACK test failures with the NVIDIA HPC compiler	2021-03-19 11:44:31 +01:00
Rajalakshmi Srinivasaraghavan	09d47af2c0	Optimize zscal function for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-03-10 17:15:33 -06:00
Rajalakshmi Srinivasaraghavan	41646ed006	Optimize s/dasum function for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-03-05 16:22:36 -06:00
Rajalakshmi Srinivasaraghavan	0571c3187b	POWER10: Rename mma builtins The LLVM and GCC teams agreed to rename the __builtin_mma_assemble_pair and __builtin_mma_disassemble_pair built-ins to __builtin_vsx_assemble_pair and __builtin_vsx_disassemble_pair respectively. This patch is to make corresponding changes in dgemm kernel. Also made changes in inputs to those builtins to avoid some potential typecasting issues. Reference gcc commit id:77ef995c1fbcab76a2a69b9f4700bcfd005d8e62	2021-02-26 20:56:34 -06:00
Rajalakshmi Srinivasaraghavan	2056ffc227	Optimize cscal function for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-01-29 13:51:43 -06:00
Rajalakshmi Srinivasaraghavan	3ede843d50	Optimize s/dscal function for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-01-24 07:48:28 -06:00
Rajalakshmi Srinivasaraghavan	439b93f6d2	Optimize s/drot function for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-01-21 13:24:45 -06:00
Rajalakshmi Srinivasaraghavan	eff7c9166e	Optimize cdot function for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-01-15 13:40:34 -06:00
Rajalakshmi Srinivasaraghavan	601b711c78	Optimize swap function for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-01-08 08:01:36 -06:00
Rajalakshmi Srinivasaraghavan	2fb11f873b	POWER10: Improve copy performance This patch aligns the stores to 32 byte boundary for scopy and dcopy before entering into vector pair loop. For ccopy, changed the store instructions to stxv to improve performance of unaligned cases.	2020-12-13 10:41:45 -06:00
Martin Kroeker	043128cbe5	Merge pull request #3029 from RajalakshmiSR/axpyp10 POWER10: Improve axpy performance	2020-12-10 22:49:28 +01:00
Rajalakshmi Srinivasaraghavan	346e30a46a	POWER10: Improve axpy performance This patch aligns the stores to 32 byte boundary for saxpy and daxpy before entering into vector pair loop. Fox caxpy, changed the store instructions to stxv to improve performance of unaligned cases.	2020-12-10 11:51:42 -06:00
Gordon Fossum	213c0e7abb	Added special unrolled vectorized versions of "Solve" for specific sizes, in DTRSM and STRSM, to improve performance in Power9 and Power10.	2020-12-04 17:07:06 -06:00
Rajalakshmi Srinivasaraghavan	7d46e31de1	POWER10: Optimize dgemv_n Handling as 4x8 with vector pairs gives better performance than existing code in POWER10.	2020-11-29 15:28:28 -06:00
Rajalakshmi Srinivasaraghavan	6e364981a8	Optimize sdot/ddot for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2020-11-07 15:21:58 -06:00
Rajalakshmi Srinivasaraghavan	dd7a9cc5bf	POWER10: Change dgemm unroll factors Changing the unroll factors for dgemm to 8 shows improved performance with POWER10 MMA feature. Also made some minor changes in sgemm for edge cases.	2020-10-31 18:28:57 -05:00
Rajalakshmi Srinivasaraghavan	b435491885	Optimize caxpy for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2020-10-29 14:57:51 -05:00
Rajalakshmi Srinivasaraghavan	c24ba8b1dd	Optimize saxpy for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2020-10-26 13:24:59 -05:00
Martin Kroeker	34c3c407ef	label always_inline function as inline to silence a gcc warning	2020-10-22 22:14:26 +02:00
Rajalakshmi Srinivasaraghavan	ad745c0bae	Optimize scopy/ccopy for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores. Also reorganized all variants of copy functions to make use of same kernel.	2020-10-21 09:53:45 -05:00
Martin Kroeker	a61c086408	Fix spurious trailing whitespace in comment	2020-10-19 09:12:12 +02:00
Martin Kroeker	f1a4071d8c	Clean up STACKSIZE redefinition	2020-10-18 19:41:43 +02:00
Martin Kroeker	97cf10062f	Clean up STACKSIZE redefinition	2020-10-18 19:39:18 +02:00
Martin Kroeker	17e288e18d	Clean up STACKSIZE redefinition	2020-10-18 19:37:04 +02:00
Martin Kroeker	c1422f3e46	Clean up STACKSIZE redefinition	2020-10-18 19:31:01 +02:00
Martin Kroeker	d85b24e103	Clean up STACKSIZE redefinition	2020-10-18 19:29:45 +02:00
Rajalakshmi Srinivasaraghavan	0826d68f93	POWER10: Change the packing format for bfloat16 As the new MMA instructions need the inputs in 4x2 order for bfloat16, changing the format in copy/packing code. This avoids permute instructions in the gemm kernel inner loop.	2020-10-13 16:05:10 -05:00
Martin Kroeker	2061f7fdff	Rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:54:53 +02:00
Martin Kroeker	9ae80490e0	rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:39:42 +02:00
Martin Kroeker	d314d1f49f	Rename shgemm_kernel_power10.c to sbgemm_kernel_power10.c	2020-10-11 23:37:38 +02:00
Rajalakshmi Srinivasaraghavan	2df4235e00	Optimize dcopy/zcopy for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores. Tested in simulator and no new failures.	2020-09-27 21:42:32 -05:00
Rajalakshmi Srinivasaraghavan	be43d2cb96	Optimize daxpy/zaxpy for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores. Tested in simulator and no new failures.	2020-09-17 12:56:28 -05:00
Rajalakshmi Srinivasaraghavan	317ff27cda	POWER10: Avoid setting accumulators to zero in gemm kernels For the first iteration, it is better to use xvfger instead of xvfgerpp builtins which helps to avoid setting accumulators to zero. This helps to reduce few instructions.	2020-08-28 10:42:54 -05:00
Rajalakshmi Srinivasaraghavan	f77b6a83f4	dgemv optimization for POWER10 Making use of new vector pair POWER10 instructions in dgemv_n and dgemv_t. Also adding a new block 4x128 to make use of Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1. Tested on simulator and there are no new test failures.	2020-07-29 18:59:32 -05:00
Rajalakshmi Srinivasaraghavan	d557584b71	Fix compilation issues with clang on POWER As gcc defaults to -malign-power, removing that option. Also adding -fno-integrated-as to use GNU assembler for powerpc assembly optimization files. Fixed other compilation errors reported in dgemv_t.c file.	2020-07-27 14:11:07 -05:00
Rajalakshmi Srinivasaraghavan	9be2688c78	Fix to store results in correct order for POWER10 GEMM kernels There is a recent compiler change in __builtin_mma_disassemble_acc() which affects the order of storing result in POWER10. Also removing new LDFLAG -mno-power10-stub as it is handled by linker automatically.	2020-07-24 23:08:11 -05:00
Martin Kroeker	6a2a60038c	Merge pull request #2720 from martin-frbg/issue2694 WIP Further fixes for 32bit POWER8	2020-07-24 23:19:45 +02:00
Martin Kroeker	251a09ec90	Typo fix	2020-07-24 16:04:58 +00:00
Martin Kroeker	95d37e1575	Regroup the 32 and 64bit sections and restore 64bit CAXPY	2020-07-24 10:13:46 +00:00
Martin Kroeker	3523bb778e	Merge pull request #2721 from martin-frbg/p8align Fix alignment errors in the power8 saxpy kernel	2020-07-24 11:06:20 +02:00
Martin Kroeker	ca3561cab9	Add ifdefs around call to altivec microkernel	2020-07-23 18:30:42 +00:00
Martin Kroeker	21072e502a	Typo fix	2020-07-23 17:34:56 +00:00
Martin Kroeker	661c6bfa5a	Exclude altivec code paths if the compiler does not support them	2020-07-23 17:08:20 +02:00
Martin Kroeker	0033f8be0d	Use vec_vsx_ld/st to fix misaligned accesses flagged by asan	2020-07-16 23:32:54 +02:00
Martin Kroeker	f308e741b2	remove debug output and revert changes to cdot and crot	2020-07-15 10:00:07 +02:00

1 2 3 4 5

211 Commits