Commit Graph

7452 Commits

Author SHA1 Message Date
Martin Kroeker de4d5646eb
Merge pull request #3857 from martin-frbg/issue3856
Fix stride in shortcut path of C/ZSYR for small N
2022-12-09 08:25:54 +01:00
Martin Kroeker f10c266b4d
Fix stride in shortcut path for small N 2022-12-08 21:02:01 +01:00
Martin Kroeker c17b5ce75c
Merge pull request #3854 from martin-frbg/travis-gcc8arm
Travis CI: Add a DYNAMIC_ARCH build on Neoverse using older gcc8
2022-12-07 18:01:07 +01:00
Martin Kroeker 8531dbaa25
Update .travis.yml 2022-12-07 15:04:13 +01:00
Martin Kroeker ce1a9ae8bd
Add a DYNAMIC_ARCH build on Neoverse using older gcc8 2022-12-07 14:28:55 +01:00
Martin Kroeker aab9c410ef
Merge pull request #3853 from Mousius/fix-sve
Remove SVE from Arm(R) Neoverse(TM) N1 CPU in Makefile
2022-12-06 23:55:33 +01:00
Bart Oldeman 60e49b851c Fix typo in clobber list, should be xmm14 instead of ymm14. 2022-12-06 16:30:46 -05:00
Chris Sidebottom f76e3de3a5 Remove SVE from Arm(R) Neoverse(TM) N1 CPU in Makefile
I incorrectly added `+sve` to the GCC parameters for the Neoverse(TM) N1 CPU,
which doesn't support SVE. This results in failed builds when using a
compiler that doesn't support `-mtune=neoverse-n1`; support for that flag
appears to hide the mistake.
2022-12-06 21:23:07 +00:00
Bart Oldeman 4afe1439a1 Fix skylake fallback kernel name for old compilers. 2022-12-06 16:09:54 -05:00
Bart Oldeman 5ceca1a4d8 Add sscal.c + microkernels for Haswell, Zen, Skylake and newer.
Unlike [dcz]scal, sscal still used the original GotoBLAS SSE code from scal_sse.S.
This code follows dscal as closely as possible, except for the inc_x > 1 code
for which a plain C loop is used much like the one in cscal.c, instead of an
adaptation of the SSE2 asm code of dscal.c (I tried but the performance wasn't
better than the plain C loop).
2022-12-06 14:05:49 -05:00
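
As a rough illustration of the plain C loop the message above describes for the inc_x > 1 case, here is a minimal sketch. It is not the code from this commit: the function name `sscal_strided_sketch` and its signature are hypothetical, and it assumes a positive stride.

```c
#include <stddef.h>

/* Hypothetical helper, not OpenBLAS code: multiply n elements of x by
 * alpha, stepping inc_x floats between consecutive elements.
 * Assumes inc_x >= 1 and that x holds at least (n - 1) * inc_x + 1 floats. */
static void sscal_strided_sketch(size_t n, float alpha, float *x, size_t inc_x)
{
    for (size_t i = 0; i < n; i++) {
        x[i * inc_x] *= alpha;
    }
}
```

Calling `sscal_strided_sketch(3, 2.0f, x, 2)` on a six-element array doubles the elements at indices 0, 2 and 4 and leaves the others untouched.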
lilianhuang 729af6406f bugfix for sbgemm_ncopy_8_neoversen2 2022-12-05 05:10:18 -05:00
Martin Kroeker 042e3c0e7c
Merge pull request #3848 from bartoldeman/dscal-haswell-ymm
dscal: use ymm registers in Haswell microkernel
2022-12-05 08:56:08 +01:00
Martin Kroeker 02763077d6
Merge pull request #3851 from martin-frbg/lapack773
Allocate work array in LAPACKE ?TGSEN when ijob is zero (Reference-LAPACK PR 733)
2022-12-04 14:52:36 +01:00
Martin Kroeker d59dcd7b16
Allocate work array when ijob is zero (Reference-LAPACK PR 733) 2022-12-04 11:43:24 +01:00
Martin Kroeker 14aef9400d
Merge pull request #3850 from martin-frbg/lapack765
Check for NaN in ?GECON (Reference-LAPACK PR765)
2022-12-04 11:33:57 +01:00
Martin Kroeker 9b96990e5d
Check for NaN in ?GECON (Reference-LAPACK PR765) 2022-12-03 20:33:27 +01:00
Martin Kroeker 1c1e0682a0
Merge pull request #3849 from martin-frbg/lapack769
Fix uninitialized M in quick return from D/SLARRD (Reference-LAPACK PR769)
2022-12-03 20:08:20 +01:00
Martin Kroeker 00cc78cfba
Fix uninitialized M in quick return (Reference-LAPACK 769) 2022-12-03 16:19:20 +01:00
Martin Kroeker 9307b0fabc
Fix uninitialized M in quick return (Reference-LAPACK 769) 2022-12-03 16:17:54 +01:00
Martin Kroeker 0a24f631e9
Merge pull request #3844 from Mousius/switch-ratio-16
Set SWITCH_RATIO for Arm(R) Neoverse(TM) V1 CPUs
2022-12-02 12:48:43 +01:00
Martin Kroeker 65984fbe68
Merge pull request #3847 from bartoldeman/scal-benchmark
scal benchmark: eliminate y, move init/timing out of loop
2022-12-02 11:51:50 +01:00
Martin Kroeker f6f0d13b9f
Merge pull request #3842 from Mousius/sve-dot
Add SVE implementation for sdot/ddot
2022-12-02 08:30:51 +01:00
Bart Oldeman 5c3169ecd8 dscal: use ymm registers in Haswell microkernel
Using 256-bit registers in dscal makes this microkernel consistent with
cscal and zscal, and generally doubles performance if the vector fits in
L1 cache.
2022-12-01 07:48:05 -05:00
Chris Sidebottom eea006a688 Wrap SVE header with __has_include check 2022-12-01 12:07:55 +00:00
Chris Sidebottom fd4f52c797 Add SVE implementation for sdot/ddot
This adds an SVE implementation to sdot/ddot when available, falling back to the previous Advanced SIMD kernel where there's no SVE implementation for the kernel.

All the targets were essentially treating `dot_thunderx2t99.c` as the Advanced SIMD implementation so I've renamed it to better fit with the feature detection.
2022-12-01 12:07:50 +00:00
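
The two commits above (the `__has_include` guard and the SVE dot kernel with a fallback) can be sketched roughly as follows. This is an illustration under assumptions, not the OpenBLAS kernel: the names `ddot_sketch` and `DOT_SKETCH_HAVE_SVE` are hypothetical, only unit-stride vectors are handled, and the fallback here is a plain C loop rather than the Advanced SIMD kernel the commit message mentions.

```c
#include <stddef.h>
#include <stdint.h>

/* Include the SVE header only if the compiler ships it and is targeting
 * SVE. The nested #if keeps preprocessors without __has_include happy. */
#if defined(__has_include)
#  if __has_include(<arm_sve.h>) && defined(__ARM_FEATURE_SVE)
#    include <arm_sve.h>
#    define DOT_SKETCH_HAVE_SVE 1   /* hypothetical macro name */
#  endif
#endif

/* Hypothetical helper: dot product of two contiguous double vectors. */
static double ddot_sketch(size_t n, const double *x, const double *y)
{
#ifdef DOT_SKETCH_HAVE_SVE
    /* Vector-length-agnostic loop: the predicate masks off the tail. */
    svfloat64_t acc = svdup_f64(0.0);
    for (uint64_t i = 0; i < (uint64_t)n; i += svcntd()) {
        svbool_t pg = svwhilelt_b64(i, (uint64_t)n);
        acc = svmla_m(pg, acc, svld1(pg, &x[i]), svld1(pg, &y[i]));
    }
    return svaddv(svptrue_b64(), acc);
#else
    /* Portable fallback used when SVE is not available at compile time. */
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
#endif
}
```

The guard decides at compile time which path is built, so the same source file compiles on toolchains that lack `arm_sve.h` entirely.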
Martin Kroeker b6a4ef98b9
Merge pull request #3845 from Mousius/asimd-dot-opt
Remove unnecessary instructions from Advanced SIMD dot
2022-11-30 21:07:30 +01:00
Chris Sidebottom 2fb096315e Set SWITCH_RATIO for Arm(R) Neoverse(TM) V1 CPUs
From testing this yields better results than the default of `2`.
2022-11-30 09:35:38 +00:00
Bart Oldeman bae45d94d1 scal benchmark: eliminate y, move init/timing out of loop
Removing y avoids cache effects (if y is the size of the L1 cache, the
main array x is removed from it).
Moving init and timing out of the loop makes the scal benchmark behave like
the gemm benchmark, and allows higher accuracy for smaller test cases since
the loop overhead is much smaller than the timing overhead.

Example:
OPENBLAS_LOOPS=10000 ./dscal.goto 1024 8192 1024
on AMD Zen2 (7532) with 32k (4k doubles) L1 cache per core.

Before
From : 1024  To : 8192 Step = 1024 Inc_x = 1 Inc_y = 1 Loops = 10000
   SIZE       Flops
   1024 :     5627.08 MFlops   0.000000 sec
   2048 :     5907.34 MFlops   0.000000 sec
   3072 :     5553.30 MFlops   0.000001 sec
   4096 :     5446.38 MFlops   0.000001 sec
   5120 :     5504.61 MFlops   0.000001 sec
   6144 :     5501.80 MFlops   0.000001 sec
   7168 :     5547.43 MFlops   0.000001 sec
   8192 :     5548.46 MFlops   0.000001 sec

After
From : 1024  To : 8192 Step = 1024 Inc_x = 1 Inc_y = 1 Loops = 10000
   SIZE       Flops
   1024 :     6310.28 MFlops   0.000000 sec
   2048 :     6396.29 MFlops   0.000000 sec
   3072 :     6439.14 MFlops   0.000000 sec
   4096 :     6327.14 MFlops   0.000001 sec
   5120 :     5628.24 MFlops   0.000001 sec
   6144 :     5616.41 MFlops   0.000001 sec
   7168 :     5553.13 MFlops   0.000001 sec
   8192 :     5600.88 MFlops   0.000001 sec

We can see the L1->L2 switchover point is now where it should be, and the
number of flops for L1 is more accurate.
2022-11-29 08:02:45 -05:00
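
The restructuring described above, with initialisation and timing moved outside the repeat loop, can be sketched as below. This is a simplified illustration, not the benchmark source: `scal_sketch` and `time_scal_sketch` are hypothetical names, the scaling routine is a plain C stand-in, and `clock()` (CPU time) is assumed to be an acceptable timer for the sketch.

```c
#include <stddef.h>
#include <time.h>

/* Stand-in for the routine under test (hypothetical, plain C). */
static void scal_sketch(size_t n, double alpha, double *x)
{
    for (size_t i = 0; i < n; i++) {
        x[i] *= alpha;
    }
}

/* Time `loops` back-to-back calls with a single pair of clock reads and
 * return the average seconds per call. The caller initialises x once,
 * before calling this function, so no setup cost is measured and the
 * timer overhead is paid once rather than once per iteration. */
static double time_scal_sketch(size_t n, double alpha, double *x, int loops)
{
    clock_t start = clock();
    for (int l = 0; l < loops; l++) {
        scal_sketch(n, alpha, x);
    }
    clock_t stop = clock();
    return ((double)(stop - start) / CLOCKS_PER_SEC) / loops;
}
```

Because only x is touched, a vector of up to 4096 doubles (32 KiB) can stay resident in a 32 KiB L1 data cache across iterations, which matches the switchover between 4096 and 5120 in the "After" table. For large loop counts, an alpha close to 1 avoids overflowing the data.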
lilianhuang fdac8a97c1 Add sbgemm_ncopy_8 and sbgemm_tcopy_4 2022-11-29 04:46:14 -05:00
lilianhuang 135718eafc Improve the performance of sbgemm_tcopy on neoversen2 2022-11-28 04:17:54 -05:00
Chris Sidebottom 4f7b77e08a Remove unnecessary instructions from Advanced SIMD dot
The existing kernel was issuing extra instructions to organise the arguments into the same registers they would usually be in and similarly to put the result into the appropriate register.

This has an impact on smaller sized dots and seemed like a quick fix.
2022-11-25 16:19:03 +00:00
Martin Kroeker e9a911fb9f
Merge pull request #3841 from martin-frbg/lapack755+764
Fix SLATRS3 and CLATRS3 tests in TESTING/LIN (Reference-LAPACK PRs 755+764)
2022-11-23 22:38:06 +01:00
Martin Kroeker bf0e8d67b5
Merge pull request #3840 from martin-frbg/lapack760
Fix typo in EIG tests and spurious return in lapacke_?tz_trans utility (Reference-LAPACK PR760)
2022-11-23 19:16:25 +01:00
Martin Kroeker a5470521ee
Fix array indexation in copy, and fix test (Reference-LAPACK PR764) 2022-11-23 15:31:25 +01:00
Martin Kroeker b0393ea4e1
Fix test (Reference-LAPACK PR764) 2022-11-23 15:27:46 +01:00
Martin Kroeker 0d26f1a4c7
Fix wrong indexation in test (Reference-LAPACK PR755) 2022-11-23 15:22:27 +01:00
Martin Kroeker 19fd2d7f00
Use LSAME for character comparison (Reference-LAPACK PR755) 2022-11-23 15:19:07 +01:00
Martin Kroeker 663bf68dbd
Merge pull request #3839 from martin-frbg/lapack758
Fix array dimension in complex SYL01 test (Reference-LAPACK PR758)
2022-11-23 14:57:56 +01:00
Martin Kroeker c2ba4e6249
Remove unnecessary return in void function call (Reference-LAPACK PR760) 2022-11-23 10:43:34 +01:00
Martin Kroeker 74962c7f53
Remove unnecessary return in void function call (Reference-LAPACK PR760) 2022-11-23 10:42:29 +01:00
Martin Kroeker d952cbf7bc
Remove unnecessary return in void function call (Reference-LAPACK PR760) 2022-11-23 10:41:50 +01:00
Martin Kroeker 7694ff495f
Remove unnecessary return in void function call (Reference-LAPACK PR760) 2022-11-23 10:40:59 +01:00
Martin Kroeker 825ae316e2
Fix typo in EXTERNAL (Reference-LAPACK PR760) 2022-11-23 10:36:10 +01:00
Martin Kroeker 730ed549e6
Fix typo in EXTERNAL (Reference-LAPACK PR760) 2022-11-23 10:35:23 +01:00
Martin Kroeker bc3393f703
Fix array dimension (Reference-LAPACK 758) 2022-11-23 10:31:18 +01:00
Martin Kroeker 0b2f8dabbf
Fix array dimension (Reference-LAPACK 758) 2022-11-23 10:30:35 +01:00
Martin Kroeker b4c9228441
Merge pull request #3838 from martin-frbg/lapa311
Update the version number of the included LAPACK to 3.11.0
2022-11-22 17:39:51 +01:00
Martin Kroeker e6e2a63650
Update LAPACK version number to 3.11.0 2022-11-22 14:02:21 +01:00
Martin Kroeker 8408357bab
Update LAPACK version number to 3.11.0 2022-11-22 14:01:48 +01:00
Martin Kroeker ba8fb8b4b2
Merge pull request #3837 from martin-frbg/lapack655+697
Improve convergence of LAPACK ?LAED4 and fix a bug in DORCSD2BY1 (Reference-LAPACK PRs 655+697)
2022-11-22 13:51:57 +01:00