Commit Graph

1648 Commits

Author SHA1 Message Date
Martin Kroeker
4ecf631f95 Merge pull request #3228 from martin-frbg/issue3226
filter out -mavx flag on Sandybridge zgemm/ztrmm kernels
2021-05-15 09:06:12 +02:00
Martin Kroeker
310b76aad7 Merge pull request #3231 from martin-frbg/issue3227
Support compilation with pre-C99 versions of MSVC
2021-05-14 23:28:06 +02:00
Martin Kroeker
c4da892ba0 Only filter out -mavx on Sandybridge ZGEMM/ZTRMM kernels 2021-05-14 23:19:10 +02:00
Martin Kroeker
8b90e5f202 Drop redundant inclusion of complex.h 2021-05-14 15:06:44 +02:00
Martin Kroeker
bd60fb6ffc filter out -mavx flag on zgemm kernels as it can cause problems with older gcc 2021-05-13 23:05:00 +02:00
Martin Kroeker
37ea8702ee Merge pull request #3192 from damonyu1989/develop
Update the intrinsic api to the offical name.
2021-05-11 16:00:45 +02:00
damonyu
ceb44bef14 update the intrinsic api to the offical name. 2021-04-27 11:12:29 +08:00
Martin Kroeker
3d511f0e66 replace spurious avx512 requirement with fma check 2021-04-26 21:55:30 +02:00
Rajalakshmi Srinivasaraghavan
2379abaa5e POWER10: Improve dgemm performance
This patch uses vector pair pointer for input load operation
which helps to generate power10 lxvp instructions.
2021-04-13 22:30:06 -05:00
Rajalakshmi Srinivasaraghavan
55bb9f639a POWER10: Optimized zgemv
This patch makes use of Matrix-Multiply Assist (MMA)
feature introduced in POWER ISA v3.1 for zgemv_n and zgemv_t.
2021-04-10 19:00:24 -05:00
Martin Kroeker
2dfb24730d Use "old" compute(24) function with clang due to register limitations 2021-04-06 19:58:32 +02:00
Martin Kroeker
147e0a75fd Merge pull request #3170 from CodesWithWolves/sgemm_tcopy_16-invalid-read
Remove Unnecessary/Erroneous Adds/Reads In sgemm_tcopy_16.S COPY1x8 Macro
2021-04-03 19:49:47 +02:00
Rajalakshmi Srinivasaraghavan
2dbcddd83d POWER10: Adding check for little endian
This patch makes sure that recent POWER10 patches are used
only for little endian.
2021-03-31 21:32:42 -05:00
CodesWithWolves
d2bda3b56a Remove Unnecessary/Erroneous Reads In sgemm_tcopy_16.S COPY1x8 Macro
There appears to have been some code leak when copying from the COPY2x8
macro above where we're reading 8 bytes into d4-d7 directly after
reading 4 bytes into s4-s7. These 32 bytes in d4-7 are unused and can
possibly overrun the boundary of allocated memory -- Valgrind detected
this which is what dragged my attention to it for a 128,1 copy.

Additionally, there is no need to update the addresses stored in A0-A7
as the only possible paths after running this macro will overwrite A0-7
if looping to the next 8 rows, or overwrite A0-3 if moving to 4 rows --
in which case A4-7 are unused.
2021-03-31 15:44:25 -04:00
Martin Kroeker
bdd6e3a153 Merge pull request #3157 from martin-frbg/issue3020-final
Add workaround for LAPACK testsuite failures with the NVIDIA HPC compiler on PPC
2021-03-19 15:23:12 +01:00
Martin Kroeker
7b8f580941 Merge pull request #3156 from martin-frbg/omatcopy_d
Move x86_64 DOMATCOPY_RT back to the C implementation
2021-03-19 15:22:48 +01:00
Martin Kroeker
86c5a0013f Add workaround for LAPACK testsuite failures with the NVIDIA HPC compiler 2021-03-19 11:47:58 +01:00
Martin Kroeker
ef85c22474 Add workaround for LAPACK test failures with the NVIDIA HPC compiler 2021-03-19 11:46:25 +01:00
Martin Kroeker
d3555d2e50 Add workaround for LAPACK test failures with the NVIDIA HPC compiler 2021-03-19 11:44:31 +01:00
Martin Kroeker
0f5e86a0d9 Remove premature entry for DOMATCOPY_RT 2021-03-18 21:53:50 +01:00
Martin Kroeker
7b294a99fd Move common.h back to the top of the file so that SKYLAKEX (from config.h) is defined in time 2021-03-18 21:28:19 +01:00
Martin Kroeker
0934568d9c Move includes under the ifdef for compilers w/o intrinsics support 2021-03-12 12:42:05 +01:00
Rajalakshmi Srinivasaraghavan
09d47af2c0 Optimize zscal function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-03-10 17:15:33 -06:00
Martin Kroeker
ef0238ba2b Merge pull request #3130 from martin-frbg/issue3128
Replace spurious AVX512 requirement in the Haswell srot microkernel with an AVX2/FMA3 guard
2021-03-06 19:15:53 +01:00
Martin Kroeker
a9f6f7ad39 Remove spurious AVX512 requirement and add AVX2/FMA3 guard 2021-03-06 14:35:49 +01:00
Rajalakshmi Srinivasaraghavan
41646ed006 Optimize s/dasum function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-03-05 16:22:36 -06:00
Rajalakshmi Srinivasaraghavan
0571c3187b POWER10: Rename mma builtins
The LLVM and GCC teams agreed to rename the __builtin_mma_assemble_pair and
__builtin_mma_disassemble_pair built-ins to __builtin_vsx_assemble_pair and
__builtin_vsx_disassemble_pair respectively. This patch is to make
corresponding changes in dgemm kernel. Also made changes in
inputs to those builtins to avoid some potential typecasting issues.

Reference gcc commit id:77ef995c1fbcab76a2a69b9f4700bcfd005d8e62
2021-02-26 20:56:34 -06:00
Martin Kroeker
292d1af1a0 Update omatcopy_rt.c 2021-02-24 09:34:14 +01:00
Martin Kroeker
325b398e3c Update omatcopy_rt.c 2021-02-24 09:13:12 +01:00
Martin Kroeker
6f5667b4d4 Enable optimized S/D OMATCOPY_RT 2021-02-24 09:03:41 +01:00
Martin Kroeker
cceeee7806 Add optimized omatcopy_rt 2021-02-24 09:00:54 +01:00
Martin Kroeker
0a4546b742 Typo fix 2021-02-23 13:14:35 +01:00
Martin Kroeker
b1eed27a54 Replace naive omatcopy_rt with 4x4 blocked implementation
as suggested by MigMuc in issue 2532
2021-02-22 21:35:42 +01:00
Martin Kroeker
47691c031f Use Haswell optimizations for Zen as well 2021-02-11 09:26:15 +01:00
Martin Kroeker
ce7ddd8921 Use Haswell optimizations for Zen as well 2021-02-11 09:25:36 +01:00
Martin Kroeker
950c047b49 Use Haswell optimizations for Zen as well 2021-02-11 09:24:51 +01:00
Martin Kroeker
46509953a9 Use Haswell optimizations for Zen as well 2021-02-11 09:24:16 +01:00
Martin Kroeker
db348dcff2 Enable optimized srot/drot kernels from Haswell 2021-02-11 09:23:05 +01:00
Rajalakshmi Srinivasaraghavan
2056ffc227 Optimize cscal function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-29 13:51:43 -06:00
Rajalakshmi Srinivasaraghavan
3ede843d50 Optimize s/dscal function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-24 07:48:28 -06:00
Martin Kroeker
69a5558203 Merge pull request #3059 from Guobing-Chen/BF16_gemm
Initial code for Cooperlake BF16 GEMM kernel
2021-01-23 19:08:05 +01:00
Martin Kroeker
d6905403e3 Merge pull request #3068 from alexhenrie/scan-build
scan-build fixes
2021-01-23 19:06:29 +01:00
Rajalakshmi Srinivasaraghavan
439b93f6d2 Optimize s/drot function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-21 13:24:45 -06:00
Rajalakshmi Srinivasaraghavan
eff7c9166e Optimize cdot function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-15 13:40:34 -06:00
Alex Henrie
202fc9e8ed Fix uninitialized argument value in dasum_k 2021-01-14 19:40:31 -07:00
Martin Kroeker
e378b24487 Merge pull request #3067 from albertziegenhagel/fix-generic-cmake
Fix building "generic" TRMM kernel with CMake
2021-01-14 21:35:19 +01:00
Albert Ziegenhagel
e3f4063683 Fix building "generic" TRMM kernel with CMake
The CMake "TARGET_CORE" variables stores the "generic" target name in all lowercase letters, but gets compared to an all uppercase string, which results in the wrong TRMM kernel being selected.
This commit converts the TARGET_CORE to all uppercase before comparing its value to make sure case mismatches are not an issue in the future anymore.
2021-01-14 10:00:49 +01:00
Martin Kroeker
b716c0ef01 Add workaround for NVIDIA HPC 2021-01-12 16:51:35 +01:00
Martin Kroeker
2efa3b70dc Add workaround for NVIDIA HPC 2021-01-12 16:49:39 +01:00
Martin Kroeker
49959d4f1c Add workaround for NVIDIA HPC 2021-01-12 16:47:15 +01:00