Commit Graph

223 Commits

Author SHA1 Message Date
Rajalakshmi Srinivasaraghavan
b06880c2cd POWER10: Improving dasum performance
Unrolling a loop in dasum micro code to help in improving
POWER10 performance.
2021-08-10 22:06:04 -05:00
Martin Kroeker
c4b464cac6 Merge pull request #3273 from austinpagan/sbgemm_gcc10_fix
Power10: Fix for SBGEMM
2021-06-15 22:58:48 +02:00
Gordon Fossum
e6dd44d989 Power10: Fix for SBGEMM
While testing bfloat16 sbgemm kernel, there are some failures for odd value inputs due to updating result for
additional bytes.
2021-06-15 13:07:47 -05:00
Martin Kroeker
2e8ff4a781 Merge pull request #3266 from martin-frbg/powerparam
Remove spurious casts from PPC parameters and fix compilation for older targets
2021-06-10 18:05:47 +02:00
Martin Kroeker
efdbdd8f82 Add prefetch values for power3 2021-06-10 11:20:29 +02:00
Martin Kroeker
3906ef3b0f Add prefetch values for power3 2021-06-10 11:19:40 +02:00
Martin Kroeker
8adf0971d8 Add prefetch values for power3 2021-06-10 11:18:22 +02:00
Martin Kroeker
08e2e60762 Add prefetch values for power3 2021-06-10 11:17:33 +02:00
Martin Kroeker
fb9e678235 Fix caxpy/zaxpy for big-endian 2021-06-10 11:15:48 +02:00
Martin Kroeker
dc4fcb48df Fix inverted conditional for caxpy/zaxpy 2021-06-10 11:14:03 +02:00
Martin Kroeker
7a48247761 fix c/zrot and sgemv for POWER5 2021-06-10 11:11:56 +02:00
Rajalakshmi Srinivasaraghavan
cbb70438df POWER10: Fixes for sbgemm kernel
While testing bfloat16 sbgemm kernel, there are some failures
for odd value inputs due to array access beyond the boundary.
2021-06-09 12:20:09 -05:00
Rajalakshmi Srinivasaraghavan
2379abaa5e POWER10: Improve dgemm performance
This patch uses vector pair pointer for input load operation
which helps to generate power10 lxvp instructions.
2021-04-13 22:30:06 -05:00
Rajalakshmi Srinivasaraghavan
55bb9f639a POWER10: Optimized zgemv
This patch makes use of Matrix-Multiply Assist (MMA)
feature introduced in POWER ISA v3.1 for zgemv_n and zgemv_t.
2021-04-10 19:00:24 -05:00
Rajalakshmi Srinivasaraghavan
2dbcddd83d POWER10: Adding check for little endian
This patch makes sure that recent POWER10 patches are used
only for little endian.
2021-03-31 21:32:42 -05:00
Martin Kroeker
86c5a0013f Add workaround for LAPACK testsuite failures with the NVIDIA HPC compiler 2021-03-19 11:47:58 +01:00
Martin Kroeker
ef85c22474 Add workaround for LAPACK test failures with the NVIDIA HPC compiler 2021-03-19 11:46:25 +01:00
Martin Kroeker
d3555d2e50 Add workaround for LAPACK test failures with the NVIDIA HPC compiler 2021-03-19 11:44:31 +01:00
Rajalakshmi Srinivasaraghavan
09d47af2c0 Optimize zscal function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-03-10 17:15:33 -06:00
Rajalakshmi Srinivasaraghavan
41646ed006 Optimize s/dasum function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-03-05 16:22:36 -06:00
Rajalakshmi Srinivasaraghavan
0571c3187b POWER10: Rename mma builtins
The LLVM and GCC teams agreed to rename the __builtin_mma_assemble_pair and
__builtin_mma_disassemble_pair built-ins to __builtin_vsx_assemble_pair and
__builtin_vsx_disassemble_pair respectively. This patch is to make
corresponding changes in dgemm kernel. Also made changes in
inputs to those builtins to avoid some potential typecasting issues.

Reference gcc commit id:77ef995c1fbcab76a2a69b9f4700bcfd005d8e62
2021-02-26 20:56:34 -06:00
Rajalakshmi Srinivasaraghavan
2056ffc227 Optimize cscal function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-29 13:51:43 -06:00
Rajalakshmi Srinivasaraghavan
3ede843d50 Optimize s/dscal function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-24 07:48:28 -06:00
Rajalakshmi Srinivasaraghavan
439b93f6d2 Optimize s/drot function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-21 13:24:45 -06:00
Rajalakshmi Srinivasaraghavan
eff7c9166e Optimize cdot function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-15 13:40:34 -06:00
Rajalakshmi Srinivasaraghavan
601b711c78 Optimize swap function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-08 08:01:36 -06:00
Rajalakshmi Srinivasaraghavan
2fb11f873b POWER10: Improve copy performance
This patch aligns the stores to 32 byte boundary for scopy and dcopy
before entering into vector pair loop. For ccopy, changed the store
instructions to stxv to improve performance of unaligned cases.
2020-12-13 10:41:45 -06:00
Martin Kroeker
043128cbe5 Merge pull request #3029 from RajalakshmiSR/axpyp10
POWER10: Improve axpy performance
2020-12-10 22:49:28 +01:00
Rajalakshmi Srinivasaraghavan
346e30a46a POWER10: Improve axpy performance
This patch aligns the stores to 32 byte boundary for saxpy and daxpy
before entering into vector pair loop. Fox caxpy, changed the store
instructions to stxv to improve performance of unaligned cases.
2020-12-10 11:51:42 -06:00
Gordon Fossum
213c0e7abb Added special unrolled vectorized versions of "Solve" for specific sizes,
in DTRSM and STRSM, to improve performance in Power9 and Power10.
2020-12-04 17:07:06 -06:00
Rajalakshmi Srinivasaraghavan
7d46e31de1 POWER10: Optimize dgemv_n
Handling as 4x8 with vector pairs gives better performance than
existing code in POWER10.
2020-11-29 15:28:28 -06:00
Rajalakshmi Srinivasaraghavan
6e364981a8 Optimize sdot/ddot for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2020-11-07 15:21:58 -06:00
Rajalakshmi Srinivasaraghavan
dd7a9cc5bf POWER10: Change dgemm unroll factors
Changing the unroll factors for dgemm to 8 shows improved performance with
POWER10 MMA feature.   Also made some minor changes in sgemm for edge cases.
2020-10-31 18:28:57 -05:00
Rajalakshmi Srinivasaraghavan
b435491885 Optimize caxpy for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2020-10-29 14:57:51 -05:00
Rajalakshmi Srinivasaraghavan
c24ba8b1dd Optimize saxpy for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2020-10-26 13:24:59 -05:00
Martin Kroeker
34c3c407ef label always_inline function as inline to silence a gcc warning 2020-10-22 22:14:26 +02:00
Rajalakshmi Srinivasaraghavan
ad745c0bae Optimize scopy/ccopy for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores. Also reorganized all variants of copy functions
to make use of same kernel.
2020-10-21 09:53:45 -05:00
Martin Kroeker
a61c086408 Fix spurious trailing whitespace in comment 2020-10-19 09:12:12 +02:00
Martin Kroeker
f1a4071d8c Clean up STACKSIZE redefinition 2020-10-18 19:41:43 +02:00
Martin Kroeker
97cf10062f Clean up STACKSIZE redefinition 2020-10-18 19:39:18 +02:00
Martin Kroeker
17e288e18d Clean up STACKSIZE redefinition 2020-10-18 19:37:04 +02:00
Martin Kroeker
c1422f3e46 Clean up STACKSIZE redefinition 2020-10-18 19:31:01 +02:00
Martin Kroeker
d85b24e103 Clean up STACKSIZE redefinition 2020-10-18 19:29:45 +02:00
Rajalakshmi Srinivasaraghavan
0826d68f93 POWER10: Change the packing format for bfloat16
As the new MMA instructions need the inputs in 4x2 order for bfloat16,
changing the format in copy/packing code.  This avoids permute instructions
in the gemm kernel inner loop.
2020-10-13 16:05:10 -05:00
Martin Kroeker
2061f7fdff Rename "HALF" and "sh" to "BFLOAT16" and "sb" 2020-10-11 23:54:53 +02:00
Martin Kroeker
9ae80490e0 rename "HALF" and "sh" to "BFLOAT16" and "sb" 2020-10-11 23:39:42 +02:00
Martin Kroeker
d314d1f49f Rename shgemm_kernel_power10.c to sbgemm_kernel_power10.c 2020-10-11 23:37:38 +02:00
Rajalakshmi Srinivasaraghavan
2df4235e00 Optimize dcopy/zcopy for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores. Tested in simulator and no new failures.
2020-09-27 21:42:32 -05:00
Rajalakshmi Srinivasaraghavan
be43d2cb96 Optimize daxpy/zaxpy for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores. Tested in simulator and no new failures.
2020-09-17 12:56:28 -05:00
Rajalakshmi Srinivasaraghavan
317ff27cda POWER10: Avoid setting accumulators to zero in gemm kernels
For the first iteration, it is better to use xvf*ger instead of xvf*gerpp
builtins which helps to avoid setting accumulators to zero. This helps
to reduce few instructions.
2020-08-28 10:42:54 -05:00