Commit Graph

1615 Commits

Author SHA1 Message Date
Martin Kroeker 47691c031f
Use Haswell optimizations for Zen as well 2021-02-11 09:26:15 +01:00
Martin Kroeker ce7ddd8921
Use Haswell optimizations for Zen as well 2021-02-11 09:25:36 +01:00
Martin Kroeker 950c047b49
Use Haswell optimizations for Zen as well 2021-02-11 09:24:51 +01:00
Martin Kroeker 46509953a9
Use Haswell optimizations for Zen as well 2021-02-11 09:24:16 +01:00
Martin Kroeker db348dcff2
Enable optimized srot/drot kernels from Haswell 2021-02-11 09:23:05 +01:00
Rajalakshmi Srinivasaraghavan 2056ffc227 Optimize cscal function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-29 13:51:43 -06:00
Rajalakshmi Srinivasaraghavan 3ede843d50 Optimize s/dscal function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-24 07:48:28 -06:00
Martin Kroeker 69a5558203
Merge pull request #3059 from Guobing-Chen/BF16_gemm
Initial code for Cooperlake BF16 GEMM kernel
2021-01-23 19:08:05 +01:00
Martin Kroeker d6905403e3
Merge pull request #3068 from alexhenrie/scan-build
scan-build fixes
2021-01-23 19:06:29 +01:00
Rajalakshmi Srinivasaraghavan 439b93f6d2 Optimize s/drot function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-21 13:24:45 -06:00
Rajalakshmi Srinivasaraghavan eff7c9166e Optimize cdot function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-15 13:40:34 -06:00
Alex Henrie 202fc9e8ed Fix uninitialized argument value in dasum_k 2021-01-14 19:40:31 -07:00
Martin Kroeker e378b24487
Merge pull request #3067 from albertziegenhagel/fix-generic-cmake
Fix building "generic" TRMM kernel with CMake
2021-01-14 21:35:19 +01:00
Albert Ziegenhagel e3f4063683 Fix building "generic" TRMM kernel with CMake
The CMake "TARGET_CORE" variables stores the "generic" target name in all lowercase letters, but gets compared to an all uppercase string, which results in the wrong TRMM kernel being selected.
This commit converts the TARGET_CORE to all uppercase before comparing its value to make sure case mismatches are not an issue in the future anymore.
2021-01-14 10:00:49 +01:00
Martin Kroeker b716c0ef01
Add workaround for NVIDIA HPC 2021-01-12 16:51:35 +01:00
Martin Kroeker 2efa3b70dc
Add workaround for NVIDIA HPC 2021-01-12 16:49:39 +01:00
Martin Kroeker 49959d4f1c
Add workaround for NVIDIA HPC 2021-01-12 16:47:15 +01:00
Martin Kroeker 0f27a03607
Add workaround for NVIDIA HPC mishandling of the asm DOT kernels 2021-01-12 16:39:35 +01:00
Martin Kroeker c2a8ebfe69
Add workaround for NVIDIA HPC mishandling of the asm DOT kernels 2021-01-12 16:38:51 +01:00
Martin Kroeker 43aac5bacc
Support NVIDIA HPC compiler 2021-01-12 16:36:12 +01:00
Chen, Guobing b0beb0b1ca Initial code for Cooperlake BF16 GEMM kernel 2021-01-11 02:15:21 +08:00
Rajalakshmi Srinivasaraghavan 601b711c78 Optimize swap function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-08 08:01:36 -06:00
Ashwin Sekhar T K 1b2508362b arm64: Fix nrm2 for input vectors with Inf
Fix double precision nrm2 kernels returning NaN when the
input vectors contain Inf/-Inf.
2021-01-01 02:49:37 -08:00
Martin Kroeker 3559c5d7a2
Merge pull request #3048 from martin-frbg/issue2998
Temporarily revert to the old NRM2 kernels for ThunderX2/3 and NeoverseN1
2020-12-21 13:30:08 +01:00
Martin Kroeker 8631e2976a
Temporarily revert to the old nrm2 kernels 2020-12-21 07:45:13 +01:00
Martin Kroeker 2768bc1764
Temporarily revert to the old nrm2 kernels 2020-12-21 07:42:51 +01:00
Martin Kroeker 6f4698ee1f
Temporarily revert to the old nrm2 kernel 2020-12-21 07:41:18 +01:00
Martin Kroeker 114eb159a4
Disable FMA intrinsics in the srot kernel when the compiler is PGI/NVIDIA 2020-12-19 22:15:58 +01:00
Martin Kroeker 005cce5507
Amend SkylakeX options to support the NVIDIA compiler 2020-12-19 22:11:49 +01:00
Martin Kroeker c73d8ee40d
Conditionally add -mfma to compiler options where needed 2020-12-17 11:34:05 +01:00
Rajalakshmi Srinivasaraghavan 2fb11f873b POWER10: Improve copy performance
This patch aligns the stores to 32 byte boundary for scopy and dcopy
before entering into vector pair loop. For ccopy, changed the store
instructions to stxv to improve performance of unaligned cases.
2020-12-13 10:41:45 -06:00
Martin Kroeker 043128cbe5
Merge pull request #3029 from RajalakshmiSR/axpyp10
POWER10: Improve axpy performance
2020-12-10 22:49:28 +01:00
Martin Kroeker 3331ca492d
Merge pull request #3021 from austinpagan/trsm_p10
POWER: Added special unrolled vectorized versions of "Solve" for specific si…
2020-12-10 19:42:54 +01:00
Rajalakshmi Srinivasaraghavan 346e30a46a POWER10: Improve axpy performance
This patch aligns the stores to 32 byte boundary for saxpy and daxpy
before entering into vector pair loop. Fox caxpy, changed the store
instructions to stxv to improve performance of unaligned cases.
2020-12-10 11:51:42 -06:00
gxw 4b548857d6 Add msa support for loongson
1. Using core loongson3r3 and loongson3r4 for loongson
2. Add DYNAMIC_ARCH for loongson

Change-Id: I1c6b54dbeca3a0cc31d1222af36a7e9bd6ab54c1
2020-12-09 10:28:46 +08:00
Martin Kroeker 7f11e33e8d
Merge pull request #3025 from TiredNotTear/develop
MIPS: Fix two bugs
2020-12-08 09:39:27 +01:00
Martin Kroeker 53e0837809
Merge pull request #3022 from jinboson/develop
Fix test errors reported by cblas_cgemm & cblas_ctrmm
2020-12-07 08:09:11 +01:00
Hao Chen ad38bd0e89 Fix failed cgemv and zgemv test case after using msa optimization
The cgemv and zgemv test case will call cgemv_n/t_msa.c zgemv_n/t_msa.c files in MIPS environment.
When the macro CONJ is defined, the calculation result will be wrong due to the wrong definition of OP2.
This patch updates the value of OP2 and passes the corresponding test.
2020-12-07 10:25:01 +08:00
Hao Chen 47b639cc9b Fix failed sswap and dswap case by using msa optimization
The swap test case will call sswap_msa.c and dswap_msa.c files in MIPS environmnet.
When inc_x or inc_y is equal to zero, the calculation result of the two functions will be wrong.
This patch adds the processing of inc_x or inc_y equal to zero, and the swap test case has passed.
2020-12-07 10:24:49 +08:00
Martin Kroeker b660008c7e
Work around DOT and SWAP test failures 2020-12-06 19:15:37 +01:00
Martin Kroeker f8346603cf
Fix compilation with SolarisStudio 2020-12-06 19:14:16 +01:00
Jin Bo 65de6f5957 Fix test errors reported by cblas_cgemm & cblas_ctrmm
The file cgemm_kernel_8x4_msa.c holds the MSA optimization
codes of cblas_cgemm and cblas_ctrmm. It defines two
macros: CGEMM_SCALE_1X2 and CGEMM_TRMM_SCALE_1X2. The pc1
array index in the two macros should be 0 and 1.
2020-12-05 15:08:17 +08:00
Gordon Fossum 213c0e7abb Added special unrolled vectorized versions of "Solve" for specific sizes,
in DTRSM and STRSM, to improve performance in Power9 and Power10.
2020-12-04 17:07:06 -06:00
Martin Kroeker 441c08c9ff
Merge pull request #3016 from xiegengxin/complex-asum
Improve the performance of zasum and casum with AVX512 intrinsic
2020-12-04 22:07:16 +01:00
Gengxin Xie 0cb7a403b2 fix error declare function blas_level1_thread_with_return_value 2020-12-02 09:51:52 +08:00
Gengxin Xie b766c1e9bb Improve the performance of zasum and casum with AVX512 intrinsic 2020-12-01 16:49:26 +08:00
Rajalakshmi Srinivasaraghavan 7d46e31de1 POWER10: Optimize dgemv_n
Handling as 4x8 with vector pairs gives better performance than
existing code in POWER10.
2020-11-29 15:28:28 -06:00
Martin Kroeker f1bf040b25
Merge pull request #2988 from xiegengxin/smp-asum
Improve the performance of dasum and sasum when SMP is defined
2020-11-22 12:24:13 +01:00
Xianyi Zhang 7037849498 Merge branch 'develop' into risc-v 2020-11-22 16:04:50 +08:00
Martin Kroeker 7e9cb39a25
Merge pull request #2981 from Qiyu8/fix-sum
Fix sum optimize issues
2020-11-16 08:40:46 +01:00