Martin Kroeker
2efa3b70dc
Add workaround for NVIDIA HPC
2021-01-12 16:49:39 +01:00
Martin Kroeker
49959d4f1c
Add workaround for NVIDIA HPC
2021-01-12 16:47:15 +01:00
Martin Kroeker
0f27a03607
Add workaround for NVIDIA HPC mishandling of the asm DOT kernels
2021-01-12 16:39:35 +01:00
Martin Kroeker
c2a8ebfe69
Add workaround for NVIDIA HPC mishandling of the asm DOT kernels
2021-01-12 16:38:51 +01:00
Martin Kroeker
43aac5bacc
Support NVIDIA HPC compiler
2021-01-12 16:36:12 +01:00
Rajalakshmi Srinivasaraghavan
601b711c78
Optimize swap function for POWER10
...
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-08 08:01:36 -06:00
Ashwin Sekhar T K
1b2508362b
arm64: Fix nrm2 for input vectors with Inf
...
Fix double precision nrm2 kernels returning NaN when the
input vectors contain Inf/-Inf.
2021-01-01 02:49:37 -08:00
Martin Kroeker
3559c5d7a2
Merge pull request #3048 from martin-frbg/issue2998
...
Temporarily revert to the old NRM2 kernels for ThunderX2/3 and NeoverseN1
2020-12-21 13:30:08 +01:00
Martin Kroeker
8631e2976a
Temporarily revert to the old nrm2 kernels
2020-12-21 07:45:13 +01:00
Martin Kroeker
2768bc1764
Temporarily revert to the old nrm2 kernels
2020-12-21 07:42:51 +01:00
Martin Kroeker
6f4698ee1f
Temporarily revert to the old nrm2 kernel
2020-12-21 07:41:18 +01:00
Martin Kroeker
114eb159a4
Disable FMA intrinsics in the srot kernel when the compiler is PGI/NVIDIA
2020-12-19 22:15:58 +01:00
Martin Kroeker
005cce5507
Amend SkylakeX options to support the NVIDIA compiler
2020-12-19 22:11:49 +01:00
Martin Kroeker
c73d8ee40d
Conditionally add -mfma to compiler options where needed
2020-12-17 11:34:05 +01:00
Rajalakshmi Srinivasaraghavan
2fb11f873b
POWER10: Improve copy performance
...
This patch aligns the stores to 32 byte boundary for scopy and dcopy
before entering into vector pair loop. For ccopy, changed the store
instructions to stxv to improve performance of unaligned cases.
2020-12-13 10:41:45 -06:00
Martin Kroeker
043128cbe5
Merge pull request #3029 from RajalakshmiSR/axpyp10
...
POWER10: Improve axpy performance
2020-12-10 22:49:28 +01:00
Martin Kroeker
3331ca492d
Merge pull request #3021 from austinpagan/trsm_p10
...
POWER: Added special unrolled vectorized versions of "Solve" for specific si…
2020-12-10 19:42:54 +01:00
Rajalakshmi Srinivasaraghavan
346e30a46a
POWER10: Improve axpy performance
...
This patch aligns the stores to 32 byte boundary for saxpy and daxpy
before entering into vector pair loop. For caxpy, changed the store
instructions to stxv to improve performance of unaligned cases.
2020-12-10 11:51:42 -06:00
gxw
4b548857d6
Add msa support for loongson
...
1. Using core loongson3r3 and loongson3r4 for loongson
2. Add DYNAMIC_ARCH for loongson
Change-Id: I1c6b54dbeca3a0cc31d1222af36a7e9bd6ab54c1
2020-12-09 10:28:46 +08:00
Martin Kroeker
7f11e33e8d
Merge pull request #3025 from TiredNotTear/develop
...
MIPS: Fix two bugs
2020-12-08 09:39:27 +01:00
Martin Kroeker
53e0837809
Merge pull request #3022 from jinboson/develop
...
Fix test errors reported by cblas_cgemm & cblas_ctrmm
2020-12-07 08:09:11 +01:00
Hao Chen
ad38bd0e89
Fix failed cgemv and zgemv test case after using msa optimization
...
The cgemv and zgemv test cases call the cgemv_n/t_msa.c and zgemv_n/t_msa.c files in a MIPS environment.
When the macro CONJ is defined, the calculation result will be wrong due to the wrong definition of OP2.
This patch updates the value of OP2 and passes the corresponding test.
2020-12-07 10:25:01 +08:00
Hao Chen
47b639cc9b
Fix failed sswap and dswap case by using msa optimization
...
The swap test case will call sswap_msa.c and dswap_msa.c files in MIPS environment.
When inc_x or inc_y is equal to zero, the calculation result of the two functions will be wrong.
This patch adds the processing of inc_x or inc_y equal to zero, and the swap test case has passed.
2020-12-07 10:24:49 +08:00
Martin Kroeker
b660008c7e
Work around DOT and SWAP test failures
2020-12-06 19:15:37 +01:00
Martin Kroeker
f8346603cf
Fix compilation with SolarisStudio
2020-12-06 19:14:16 +01:00
Jin Bo
65de6f5957
Fix test errors reported by cblas_cgemm & cblas_ctrmm
...
The file cgemm_kernel_8x4_msa.c holds the MSA optimization
codes of cblas_cgemm and cblas_ctrmm. It defines two
macros: CGEMM_SCALE_1X2 and CGEMM_TRMM_SCALE_1X2. The pc1
array index in the two macros should be 0 and 1.
2020-12-05 15:08:17 +08:00
Gordon Fossum
213c0e7abb
Added special unrolled vectorized versions of "Solve" for specific sizes,
...
in DTRSM and STRSM, to improve performance in Power9 and Power10.
2020-12-04 17:07:06 -06:00
Martin Kroeker
441c08c9ff
Merge pull request #3016 from xiegengxin/complex-asum
...
Improve the performance of zasum and casum with AVX512 intrinsic
2020-12-04 22:07:16 +01:00
Gengxin Xie
0cb7a403b2
Fix erroneous declaration of function blas_level1_thread_with_return_value
2020-12-02 09:51:52 +08:00
Gengxin Xie
b766c1e9bb
Improve the performance of zasum and casum with AVX512 intrinsic
2020-12-01 16:49:26 +08:00
Rajalakshmi Srinivasaraghavan
7d46e31de1
POWER10: Optimize dgemv_n
...
Handling as 4x8 with vector pairs gives better performance than
existing code in POWER10.
2020-11-29 15:28:28 -06:00
Martin Kroeker
f1bf040b25
Merge pull request #2988 from xiegengxin/smp-asum
...
Improve the performance of dasum and sasum when SMP is defined
2020-11-22 12:24:13 +01:00
Xianyi Zhang
7037849498
Merge branch 'develop' into risc-v
2020-11-22 16:04:50 +08:00
Martin Kroeker
7e9cb39a25
Merge pull request #2981 from Qiyu8/fix-sum
...
Fix sum optimize issues
2020-11-16 08:40:46 +01:00
Gengxin Xie
d6e7e05bb3
Improve the performance of dasum and sasum when SMP is defined
2020-11-13 14:20:52 +08:00
Qiyu8
ae0b1dea19
modify system.cmake to enable fma flag
2020-11-13 10:20:24 +08:00
Qiyu8
e0dac6b53b
Fix the CI failure caused by a target-specific option mismatch
2020-11-12 20:31:03 +08:00
Qiyu8
e5c2ceb675
Fix the CI failure caused by a missing header
2020-11-12 17:35:17 +08:00
Qiyu8
a87e537b8c
modify macro
2020-11-11 15:53:48 +08:00
Qiyu8
5bc0a7583f
Only FMA3 and vectors larger than 128 have positive effects.
2020-11-11 15:18:01 +08:00
Qiyu8
8c0b206d4c
Optimize the performance of rot by using universal intrinsics
2020-11-11 14:33:12 +08:00
Qiyu8
c4c591ac5a
fix sum optimize issues
2020-11-10 16:16:38 +08:00
Xianyi Zhang
fc35b72ae1
Refs #2899
...
Merge branch 'openblas-open-910' of git://github.com/damonyu1989/OpenBLAS into damonyu1989-openblas-open-910
2020-11-10 09:38:04 +08:00
Xianyi Zhang
913cc9a4ca
Merge branch 'develop' into risc-v
2020-11-10 09:18:25 +08:00
Martin Kroeker
ff16329cb7
Merge pull request #2972 from xiegengxin/rot-intrinsic
...
Improve the performance of rot by using AVX512 and AVX2 intrinsic
2020-11-08 22:43:00 +01:00
Martin Kroeker
110c7a6de0
Merge pull request #2979 from RajalakshmiSR/dot_power10
...
Optimize sdot/ddot for POWER10
2020-11-08 10:19:34 +01:00
Rajalakshmi Srinivasaraghavan
6e364981a8
Optimize sdot/ddot for POWER10
...
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2020-11-07 15:21:58 -06:00
Martin Kroeker
b976a0bf40
Remove previous workaround for compiler flags related to cpu capabilities in x86_64 DYNAMIC_ARCH builds
2020-11-07 20:39:56 +01:00
Martin Kroeker
ff74319ea5
Merge pull request #2977 from martin-frbg/issue2976
...
Fix macro name used in ifdef for POWERPC/PGI
2020-11-07 14:41:34 +01:00
Martin Kroeker
28d2dfe2b3
Fix macro name used in ifdef
2020-11-07 12:17:49 +01:00