1627 Commits

Author SHA1 Message Date
Martin Kroeker
0934568d9c Move includes under the ifdef for compilers w/o intrinsics support 2021-03-12 12:42:05 +01:00
Rajalakshmi Srinivasaraghavan
09d47af2c0 Optimize zscal function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-03-10 17:15:33 -06:00
Martin Kroeker
ef0238ba2b Merge pull request #3130 from martin-frbg/issue3128
Replace spurious AVX512 requirement in the Haswell srot microkernel with an AVX2/FMA3 guard
2021-03-06 19:15:53 +01:00
Martin Kroeker
a9f6f7ad39 Remove spurious AVX512 requirement and add AVX2/FMA3 guard 2021-03-06 14:35:49 +01:00
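
The guard described in this commit can be pictured roughly as follows. This is a minimal sketch, not the Haswell srot microkernel itself: the function name `srot_sketch` and the macro `HAVE_ROT_SIMD` are hypothetical, and the only assumption about the toolchain is that GCC-compatible compilers define `__AVX2__` and `__FMA__` when those instruction sets are enabled. The intrinsics header is included only inside the guard, so a compiler without intrinsics support still builds the scalar path.

```c
#include <assert.h>
#include <stddef.h>

/* Only pull in the intrinsics header when AVX2 and FMA3 are both enabled. */
#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
#define HAVE_ROT_SIMD 1 /* hypothetical macro name, for illustration only */
#endif

/* Hypothetical plane rotation: x' = c*x + s*y, y' = c*y - s*x. */
void srot_sketch(size_t n, float *x, float *y, float c, float s)
{
    size_t i = 0;
    assert(n == 0 || (x != NULL && y != NULL));

#ifdef HAVE_ROT_SIMD
    /* Vector path: 8 floats per iteration, unaligned loads and stores. */
    const __m256 vc = _mm256_set1_ps(c);
    const __m256 vs = _mm256_set1_ps(s);
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        __m256 nx = _mm256_fmadd_ps(vc, vx, _mm256_mul_ps(vs, vy));
        __m256 ny = _mm256_fmsub_ps(vc, vy, _mm256_mul_ps(vs, vx));
        _mm256_storeu_ps(x + i, nx);
        _mm256_storeu_ps(y + i, ny);
    }
#endif

    /* Scalar path: handles the tail, or everything when the guard is off. */
    for (; i < n; i++) {
        float xi = x[i];
        float yi = y[i];
        x[i] = c * xi + s * yi;
        y[i] = c * yi - s * xi;
    }
}
```

Because the guard tests only AVX2 and FMA, nothing in this sketch requires AVX512, which is the kind of spurious requirement the commit message refers to.
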
Rajalakshmi Srinivasaraghavan
41646ed006 Optimize s/dasum function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-03-05 16:22:36 -06:00
Rajalakshmi Srinivasaraghavan
0571c3187b POWER10: Rename mma builtins
The LLVM and GCC teams agreed to rename the __builtin_mma_assemble_pair and
__builtin_mma_disassemble_pair built-ins to __builtin_vsx_assemble_pair and
__builtin_vsx_disassemble_pair respectively. This patch makes the
corresponding changes in the dgemm kernel. It also changes the
inputs to those builtins to avoid some potential typecasting issues.

Reference gcc commit id:77ef995c1fbcab76a2a69b9f4700bcfd005d8e62
2021-02-26 20:56:34 -06:00
Martin Kroeker
292d1af1a0 Update omatcopy_rt.c 2021-02-24 09:34:14 +01:00
Martin Kroeker
325b398e3c Update omatcopy_rt.c 2021-02-24 09:13:12 +01:00
Martin Kroeker
6f5667b4d4 Enable optimized S/D OMATCOPY_RT 2021-02-24 09:03:41 +01:00
Martin Kroeker
cceeee7806 Add optimized omatcopy_rt 2021-02-24 09:00:54 +01:00
Martin Kroeker
0a4546b742 Typo fix 2021-02-23 13:14:35 +01:00
Martin Kroeker
b1eed27a54 Replace naive omatcopy_rt with 4x4 blocked implementation
as suggested by MigMuc in issue 2532
2021-02-22 21:35:42 +01:00
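
As an illustration of what a 4x4 blocked out-of-place transpose looks like, here is a minimal sketch. It is not the code from this commit: the function name `omatcopy_rt_blocked`, its argument order, and the use of `double` are assumptions made for the example. It computes B = alpha * A^T for a row-major matrix, visiting the input in 4x4 tiles so that reads and writes both stay within a small working set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not the OpenBLAS kernel.
 * A is rows x cols, row-major, leading dimension lda.
 * B is cols x rows, row-major, leading dimension ldb.
 * Result: B[j][i] = alpha * A[i][j]. */
void omatcopy_rt_blocked(size_t rows, size_t cols, double alpha,
                         const double *a, size_t lda,
                         double *b, size_t ldb)
{
    enum { BLK = 4 }; /* tile edge: 4x4 blocks */
    assert(lda >= cols && ldb >= rows);

    for (size_t i0 = 0; i0 < rows; i0 += BLK) {
        /* Clamp the tile at the matrix edge so partial tiles work too. */
        size_t i1 = (i0 + BLK < rows) ? i0 + BLK : rows;
        for (size_t j0 = 0; j0 < cols; j0 += BLK) {
            size_t j1 = (j0 + BLK < cols) ? j0 + BLK : cols;
            for (size_t i = i0; i < i1; i++)
                for (size_t j = j0; j < j1; j++)
                    b[j * ldb + i] = alpha * a[i * lda + j];
        }
    }
}
```

A naive version would run the two inner loops over the whole matrix, which makes either the reads or the writes stride through memory by a full row on every step; tiling bounds that stride to a 4x4 neighbourhood.
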
Martin Kroeker
47691c031f Use Haswell optimizations for Zen as well 2021-02-11 09:26:15 +01:00
Martin Kroeker
ce7ddd8921 Use Haswell optimizations for Zen as well 2021-02-11 09:25:36 +01:00
Martin Kroeker
950c047b49 Use Haswell optimizations for Zen as well 2021-02-11 09:24:51 +01:00
Martin Kroeker
46509953a9 Use Haswell optimizations for Zen as well 2021-02-11 09:24:16 +01:00
Martin Kroeker
db348dcff2 Enable optimized srot/drot kernels from Haswell 2021-02-11 09:23:05 +01:00
Rajalakshmi Srinivasaraghavan
2056ffc227 Optimize cscal function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-29 13:51:43 -06:00
Rajalakshmi Srinivasaraghavan
3ede843d50 Optimize s/dscal function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-24 07:48:28 -06:00
Martin Kroeker
69a5558203 Merge pull request #3059 from Guobing-Chen/BF16_gemm
Initial code for Cooperlake BF16 GEMM kernel
2021-01-23 19:08:05 +01:00
Martin Kroeker
d6905403e3 Merge pull request #3068 from alexhenrie/scan-build
scan-build fixes
2021-01-23 19:06:29 +01:00
Rajalakshmi Srinivasaraghavan
439b93f6d2 Optimize s/drot function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-21 13:24:45 -06:00
Rajalakshmi Srinivasaraghavan
eff7c9166e Optimize cdot function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-15 13:40:34 -06:00
Alex Henrie
202fc9e8ed Fix uninitialized argument value in dasum_k 2021-01-14 19:40:31 -07:00
Martin Kroeker
e378b24487 Merge pull request #3067 from albertziegenhagel/fix-generic-cmake
Fix building "generic" TRMM kernel with CMake
2021-01-14 21:35:19 +01:00
Albert Ziegenhagel
e3f4063683 Fix building "generic" TRMM kernel with CMake
The CMake "TARGET_CORE" variable stores the "generic" target name in all lowercase letters, but it is compared to an all-uppercase string, which results in the wrong TRMM kernel being selected.
This commit converts TARGET_CORE to all uppercase before comparing its value, so that case mismatches are no longer an issue.
2021-01-14 10:00:49 +01:00
Martin Kroeker
b716c0ef01 Add workaround for NVIDIA HPC 2021-01-12 16:51:35 +01:00
Martin Kroeker
2efa3b70dc Add workaround for NVIDIA HPC 2021-01-12 16:49:39 +01:00
Martin Kroeker
49959d4f1c Add workaround for NVIDIA HPC 2021-01-12 16:47:15 +01:00
Martin Kroeker
0f27a03607 Add workaround for NVIDIA HPC mishandling of the asm DOT kernels 2021-01-12 16:39:35 +01:00
Martin Kroeker
c2a8ebfe69 Add workaround for NVIDIA HPC mishandling of the asm DOT kernels 2021-01-12 16:38:51 +01:00
Martin Kroeker
43aac5bacc Support NVIDIA HPC compiler 2021-01-12 16:36:12 +01:00
Chen, Guobing
b0beb0b1ca Initial code for Cooperlake BF16 GEMM kernel 2021-01-11 02:15:21 +08:00
Rajalakshmi Srinivasaraghavan
601b711c78 Optimize swap function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-08 08:01:36 -06:00
Ashwin Sekhar T K
1b2508362b arm64: Fix nrm2 for input vectors with Inf
Fix double precision nrm2 kernels returning NaN when the
input vectors contain Inf/-Inf.
2021-01-01 02:49:37 -08:00
Martin Kroeker
3559c5d7a2 Merge pull request #3048 from martin-frbg/issue2998
Temporarily revert to the old NRM2 kernels for ThunderX2/3 and NeoverseN1
2020-12-21 13:30:08 +01:00
Martin Kroeker
8631e2976a Temporarily revert to the old nrm2 kernels 2020-12-21 07:45:13 +01:00
Martin Kroeker
2768bc1764 Temporarily revert to the old nrm2 kernels 2020-12-21 07:42:51 +01:00
Martin Kroeker
6f4698ee1f Temporarily revert to the old nrm2 kernel 2020-12-21 07:41:18 +01:00
Martin Kroeker
114eb159a4 Disable FMA intrinsics in the srot kernel when the compiler is PGI/NVIDIA 2020-12-19 22:15:58 +01:00
Martin Kroeker
005cce5507 Amend SkylakeX options to support the NVIDIA compiler 2020-12-19 22:11:49 +01:00
Martin Kroeker
c73d8ee40d Conditionally add -mfma to compiler options where needed 2020-12-17 11:34:05 +01:00
Rajalakshmi Srinivasaraghavan
2fb11f873b POWER10: Improve copy performance
This patch aligns the stores to a 32-byte boundary for scopy and dcopy
before entering the vector pair loop. For ccopy, it changes the store
instructions to stxv to improve performance of unaligned cases.
2020-12-13 10:41:45 -06:00
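
The idea of aligning stores before a wide loop can be sketched in portable C. This is only an illustration under stated assumptions: `peel_count` and `dcopy_sketch` are hypothetical names, and the 32-byte `memcpy` in the main loop is a stand-in for the POWER10 vector pair load and store, which this sketch does not use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: how many leading doubles must be stored one at a
 * time so that dst + count sits on a 32-byte boundary (capped at n). */
size_t peel_count(const double *dst, size_t n)
{
    uintptr_t addr = (uintptr_t)dst;
    assert(addr % sizeof(double) == 0); /* dst must be element-aligned */
    size_t misalign = (size_t)(addr % 32);
    size_t peel = misalign ? (32 - misalign) / sizeof(double) : 0;
    return peel < n ? peel : n;
}

/* Hypothetical copy routine, not the POWER10 kernel. */
void dcopy_sketch(size_t n, const double *src, double *dst)
{
    size_t i = 0;
    size_t head = peel_count(dst, n);

    /* Head: scalar stores until the destination is 32-byte aligned. */
    for (; i < head; i++)
        dst[i] = src[i];

    /* Main loop: 4 doubles (32 bytes) per step, destination aligned. */
    for (; i + 4 <= n; i += 4)
        memcpy(dst + i, src + i, 32);

    /* Tail: whatever is left over. */
    for (; i < n; i++)
        dst[i] = src[i];
}
```

Only the destination is aligned here, matching the commit's focus on stores; the source may still be misaligned relative to the 32-byte boundary.
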
Martin Kroeker
043128cbe5 Merge pull request #3029 from RajalakshmiSR/axpyp10
POWER10: Improve axpy performance
2020-12-10 22:49:28 +01:00
Martin Kroeker
3331ca492d Merge pull request #3021 from austinpagan/trsm_p10
POWER: Added special unrolled vectorized versions of "Solve" for specific si…
2020-12-10 19:42:54 +01:00
Rajalakshmi Srinivasaraghavan
346e30a46a POWER10: Improve axpy performance
This patch aligns the stores to a 32-byte boundary for saxpy and daxpy
before entering the vector pair loop. For caxpy, it changes the store
instructions to stxv to improve performance of unaligned cases.
2020-12-10 11:51:42 -06:00
gxw
4b548857d6 Add msa support for loongson
1. Using core loongson3r3 and loongson3r4 for loongson
2. Add DYNAMIC_ARCH for loongson

Change-Id: I1c6b54dbeca3a0cc31d1222af36a7e9bd6ab54c1
2020-12-09 10:28:46 +08:00
Martin Kroeker
7f11e33e8d Merge pull request #3025 from TiredNotTear/develop
MIPS: Fix two bugs
2020-12-08 09:39:27 +01:00
Martin Kroeker
53e0837809 Merge pull request #3022 from jinboson/develop
Fix test errors reported by cblas_cgemm & cblas_ctrmm
2020-12-07 08:09:11 +01:00
Hao Chen
ad38bd0e89 Fix failed cgemv and zgemv test case after using msa optimization
The cgemv and zgemv test cases call the cgemv_n/t_msa.c and zgemv_n/t_msa.c files in a MIPS environment.
When the macro CONJ is defined, the calculated result is wrong because OP2 is defined incorrectly.
This patch corrects the value of OP2 so that the corresponding tests pass.
2020-12-07 10:25:01 +08:00