5781 Commits

Author SHA1 Message Date
Martin Kroeker
2efa3b70dc Add workaround for NVIDIA HPC 2021-01-12 16:49:39 +01:00
Martin Kroeker
49959d4f1c Add workaround for NVIDIA HPC 2021-01-12 16:47:15 +01:00
Martin Kroeker
0f27a03607 Add workaround for NVIDIA HPC mishandling of the asm DOT kernels 2021-01-12 16:39:35 +01:00
Martin Kroeker
c2a8ebfe69 Add workaround for NVIDIA HPC mishandling of the asm DOT kernels 2021-01-12 16:38:51 +01:00
Martin Kroeker
43aac5bacc Support NVIDIA HPC compiler 2021-01-12 16:36:12 +01:00
Martin Kroeker
bff2b7c94d Support compilation with NVIDIA HPC compilers (which do not take gcc-style arch options) 2021-01-12 16:34:18 +01:00
Martin Kroeker
2d45a262d9 Support compilation with nvfortran 2021-01-12 16:32:29 +01:00
Gordon Fossum
ed652d8136 Added definitions for GEMM_PREFERED_SIZE and SWITCH_RATIO to the POWER9 and POWER10 specific sections of param.h. 2021-01-11 21:13:53 -05:00
Martin Kroeker
6fe0f1fab9 Label get_cpu_ftr as volatile to keep gcc from rearranging the code 2021-01-11 19:05:29 +01:00
Chen, Guobing
b0beb0b1ca Initial code for Cooperlake BF16 GEMM kernel 2021-01-11 02:15:21 +08:00
Martin Kroeker
018dec8588 Merge pull request #7 from xianyi/develop
rebase
2021-01-10 17:09:46 +01:00
Martin Kroeker
5d6209e1f9 Merge pull request #3055 from RajalakshmiSR/swapp10
Optimize swap function for POWER10
2021-01-09 00:11:44 +01:00
Rajalakshmi Srinivasaraghavan
601b711c78 Optimize swap function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-08 08:01:36 -06:00
Martin Kroeker
78702753f2 Merge pull request #3053 from pkubaj/patch-1
Fix build on FreeBSD/powerpc64le
2021-01-02 16:14:07 +01:00
pkubaj
7aa1ff8ff6 Fix build on FreeBSD/powerpc64le 2021-01-01 21:19:57 +00:00
Martin Kroeker
d6c97cf010 Merge pull request #3052 from ashwinyes/arm64_fix_nrm2
arm64: Fix nrm2 for input vectors with Inf
2021-01-01 15:51:07 +01:00
Ashwin Sekhar T K
1b2508362b arm64: Fix nrm2 for input vectors with Inf
Fix double precision nrm2 kernels returning NaN when the
input vectors contain Inf/-Inf.
2021-01-01 02:49:37 -08:00
Martin Kroeker
cd898af59f Merge pull request #3050 from aurel32/riscv64-openblas-supported
getarch.c: define OPENBLAS_SUPPORTED for riscv64
2020-12-29 21:59:40 +01:00
Aurelien Jarno
0a535e58d8 getarch.c: define OPENBLAS_SUPPORTED for riscv64 2020-12-29 12:06:39 +00:00
Martin Kroeker
9ce9e295fe Merge pull request #3049 from martin-frbg/readme
Expand the introductory paragraph of the README with links to netlib docs and linear algebra lecture videos
2020-12-27 22:54:20 +01:00
Martin Kroeker
9a38592c79 Add pointers to the netlib documentation and Gilbert Strang's linear algebra primers 2020-12-27 21:55:08 +01:00
Martin Kroeker
9b3965b08c Merge pull request #6 from xianyi/develop
rebase
2020-12-27 21:28:10 +01:00
Martin Kroeker
531cb4f673 Merge pull request #3035 from Joshua-Ashton/patch-1
Define BLAS acronym in README
2020-12-27 21:26:52 +01:00
Martin Kroeker
3559c5d7a2 Merge pull request #3048 from martin-frbg/issue2998
Temporarily revert to the old NRM2 kernels for ThunderX2/3 and NeoverseN1
2020-12-21 13:30:08 +01:00
Martin Kroeker
8631e2976a Temporarily revert to the old nrm2 kernels 2020-12-21 07:45:13 +01:00
Martin Kroeker
2768bc1764 Temporarily revert to the old nrm2 kernels 2020-12-21 07:42:51 +01:00
Martin Kroeker
6f4698ee1f Temporarily revert to the old nrm2 kernel 2020-12-21 07:41:18 +01:00
Martin Kroeker
85e5165e98 Merge pull request #3046 from martin-frbg/nvidiasdk-ppc
Support NVIDIA HPC SDK on POWERPC
2020-12-20 11:55:53 +01:00
Martin Kroeker
17c16f2a71 Implement builtin_cpu_is and limit cpu choices to P8 and P9 for NVIDIA compilers 2020-12-19 23:21:22 +01:00
Martin Kroeker
91c3f86c2b NVIDIA compiler does not yet support POWER10 2020-12-19 23:19:05 +01:00
Martin Kroeker
75b1f3becc Limit POWERPC DYNAMIC_CORE list to P8 and P9 for NVIDIA compilers 2020-12-19 23:17:40 +01:00
Martin Kroeker
07c5e549b2 Merge pull request #3045 from martin-frbg/nvidiasdk
Support NVIDIA HPC SDK 20.11 compilers on x86_64
2020-12-19 23:14:02 +01:00
Martin Kroeker
114eb159a4 Disable FMA intrinsics in the srot kernel when the compiler is PGI/NVIDIA 2020-12-19 22:15:58 +01:00
Martin Kroeker
005cce5507 Amend SkylakeX options to support the NVIDIA compiler 2020-12-19 22:11:49 +01:00
Martin Kroeker
b859b6e79d Add nvfortran 2020-12-19 22:09:57 +01:00
Martin Kroeker
b212a2fb9f Add/modify "PGI" compiler options for NVIDIA SDK 20.11 2020-12-19 22:08:37 +01:00
Martin Kroeker
e40416567a Add version printout for PGI/NVIDIA compiler 2020-12-19 22:06:56 +01:00
Martin Kroeker
b37e5fa2f8 Merge pull request #5 from xianyi/develop
rebase
2020-12-19 20:11:06 +01:00
Martin Kroeker
326469ef4a Merge pull request #3042 from martin-frbg/develop
Move FMA3 option setting to the kernel makefile
2020-12-19 20:04:19 +01:00
Martin Kroeker
c73d8ee40d Conditionally add -mfma to compiler options where needed 2020-12-17 11:34:05 +01:00
Martin Kroeker
abef2ea770 Move -fma option setting to kernel/Makefile.L1 2020-12-17 11:32:27 +01:00
Martin Kroeker
b26e32c3af Merge pull request #3040 from martin-frbg/fixfcheck
Fix undefined CC variable in check for clang+gfortran combo
2020-12-16 00:05:04 +01:00
Martin Kroeker
7822eff936 Merge pull request #3038 from martin-frbg/issue3037
Fix spurious assumption of cross-compilation on some architectures
2020-12-16 00:04:45 +01:00
Martin Kroeker
865676682d Add Intel Rocket Lake 2020-12-14 22:40:23 +01:00
Martin Kroeker
0f7776af0b Add Intel Rocket Lake 2020-12-14 22:30:36 +01:00
Martin Kroeker
b03dc011be Fix undefined CC variable in clang check 2020-12-14 19:21:52 +01:00
Martin Kroeker
00ce35336e Fix spurious removal of a trailing character from the hostarch string on x86_64 2020-12-13 21:28:01 +01:00
Martin Kroeker
723776ddf7 Merge pull request #4 from xianyi/develop
rebase
2020-12-13 21:22:41 +01:00
Martin Kroeker
5a77ec7f1c Merge pull request #3036 from RajalakshmiSR/p10copyalign
POWER10: Improve copy performance
2020-12-13 21:21:34 +01:00
Rajalakshmi Srinivasaraghavan
2fb11f873b POWER10: Improve copy performance
This patch aligns the stores to 32 byte boundary for scopy and dcopy
before entering into vector pair loop. For ccopy, changed the store
instructions to stxv to improve performance of unaligned cases.
2020-12-13 10:41:45 -06:00