Martin Kroeker
0a24f631e9
Merge pull request #3844 from Mousius/switch-ratio-16
...
Set SWITCH_RATIO for Arm(R) Neoverse(TM) V1 CPUs
2022-12-02 12:48:43 +01:00
Martin Kroeker
65984fbe68
Merge pull request #3847 from bartoldeman/scal-benchmark
...
scal benchmark: eliminate y, move init/timing out of loop
2022-12-02 11:51:50 +01:00
Martin Kroeker
f6f0d13b9f
Merge pull request #3842 from Mousius/sve-dot
...
Add SVE implementation for sdot/ddot
2022-12-02 08:30:51 +01:00
Bart Oldeman
5c3169ecd8
dscal: use ymm registers in Haswell microkernel
...
Using 256-bit registers in dscal makes this microkernel consistent with
cscal and zscal, and generally doubles performance if the vector fits in
L1 cache.
2022-12-01 07:48:05 -05:00
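Note on 5c3169ecd8: a ymm register is 256 bits wide and holds four doubles, against two in a 128-bit xmm register, which is where the expected doubling for L1-resident vectors comes from. A minimal sketch of the idea with C intrinsics follows. It is not the OpenBLAS microkernel; the name `dscal_sketch` is hypothetical, only unit stride is handled, and the 256-bit path is compiled only when the compiler defines `__AVX__` (for example with `-mavx`).

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical helper (not OpenBLAS code): x[i] *= alpha, unit stride only. */
static void dscal_sketch(size_t n, double alpha, double *x)
{
    size_t i = 0;
#if defined(__AVX__)
    /* Broadcast alpha into all four 64-bit lanes of a 256-bit register. */
    const __m256d va = _mm256_set1_pd(alpha);
    for (; i + 4 <= n; i += 4) {
        __m256d vx = _mm256_loadu_pd(x + i);          /* unaligned load of 4 doubles */
        _mm256_storeu_pd(x + i, _mm256_mul_pd(vx, va));
    }
#endif
    /* Scalar tail; without AVX this loop handles the whole vector. */
    for (; i < n; i++)
        x[i] *= alpha;
}
```

Calling `dscal_sketch(n, 2.0, x)` doubles every element of `x` in place, whichever path was compiled.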
Chris Sidebottom
eea006a688
Wrap SVE header with __has_include check
2022-12-01 12:07:55 +00:00
Chris Sidebottom
fd4f52c797
Add SVE implementation for sdot/ddot
...
This adds an SVE implementation to sdot/ddot when available, falling back to the previous Advanced SIMD kernel where there's no SVE implementation for the kernel.
All the targets were essentially treating `dot_thunderx2t99.c` as the Advanced SIMD implementation so I've renamed it to better fit with the feature detection.
2022-12-01 12:07:50 +00:00
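Note on eea006a688 and fd4f52c797: the pattern described is a compile-time choice between an SVE kernel and a fallback, with the SVE header only included when the toolchain actually provides it. A minimal sketch follows, under stated assumptions: `dot_sketch` and `SKETCH_USE_SVE` are hypothetical names, the fallback here is a plain scalar loop rather than the Advanced SIMD kernel, and the guard is not copied from the OpenBLAS sources.

```c
#include <stddef.h>
#include <stdint.h>

/* Include the SVE header only if SVE is enabled and the header exists.
   The __has_include test is nested so that preprocessors without
   __has_include never have to parse it. */
#if defined(__ARM_FEATURE_SVE) && defined(__has_include)
#  if __has_include(<arm_sve.h>)
#    include <arm_sve.h>
#    define SKETCH_USE_SVE 1   /* hypothetical macro, for this sketch only */
#  endif
#endif

/* Hypothetical helper: dot product of two unit-stride double vectors. */
static double dot_sketch(size_t n, const double *x, const double *y)
{
#if defined(SKETCH_USE_SVE)
    svfloat64_t acc = svdup_n_f64(0.0);
    for (uint64_t i = 0; i < n; i += svcntd()) {
        /* Predicate covers the remaining elements, so no scalar tail is needed. */
        svbool_t pg = svwhilelt_b64_u64(i, (uint64_t)n);
        svfloat64_t vx = svld1_f64(pg, x + i);
        svfloat64_t vy = svld1_f64(pg, y + i);
        acc = svmla_f64_m(pg, acc, vx, vy);   /* acc += vx * vy on active lanes */
    }
    return svaddv_f64(svptrue_b64(), acc);    /* horizontal sum of all lanes */
#else
    /* Fallback when SVE is unavailable (scalar here, for brevity). */
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
#endif
}
```

The same source file then builds with or without SVE support; only the body of `dot_sketch` changes.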
Martin Kroeker
b6a4ef98b9
Merge pull request #3845 from Mousius/asimd-dot-opt
...
Remove unnecessary instructions from Advanced SIMD dot
2022-11-30 21:07:30 +01:00
Chris Sidebottom
2fb096315e
Set SWITCH_RATIO for Arm(R) Neoverse(TM) V1 CPUs
...
In testing, this yields better results than the default of `2`.
2022-11-30 09:35:38 +00:00
Bart Oldeman
bae45d94d1
scal benchmark: eliminate y, move init/timing out of loop
...
Removing y avoids cache effects (if y is the size of the L1 cache, the
main array x is evicted from it).
Moving init and timing out of the loop makes the scal benchmark behave like
the gemm benchmark, and allows higher accuracy for smaller test cases since
the loop overhead is much smaller than the timing overhead.
Example:
OPENBLAS_LOOPS=10000 ./dscal.goto 1024 8192 1024
on AMD Zen2 (7532) with 32k (4k doubles) L1 cache per core.
Before
From : 1024 To : 8192 Step = 1024 Inc_x = 1 Inc_y = 1 Loops = 10000
SIZE Flops
1024 : 5627.08 MFlops 0.000000 sec
2048 : 5907.34 MFlops 0.000000 sec
3072 : 5553.30 MFlops 0.000001 sec
4096 : 5446.38 MFlops 0.000001 sec
5120 : 5504.61 MFlops 0.000001 sec
6144 : 5501.80 MFlops 0.000001 sec
7168 : 5547.43 MFlops 0.000001 sec
8192 : 5548.46 MFlops 0.000001 sec
After
From : 1024 To : 8192 Step = 1024 Inc_x = 1 Inc_y = 1 Loops = 10000
SIZE Flops
1024 : 6310.28 MFlops 0.000000 sec
2048 : 6396.29 MFlops 0.000000 sec
3072 : 6439.14 MFlops 0.000000 sec
4096 : 6327.14 MFlops 0.000001 sec
5120 : 5628.24 MFlops 0.000001 sec
6144 : 5616.41 MFlops 0.000001 sec
7168 : 5553.13 MFlops 0.000001 sec
8192 : 5600.88 MFlops 0.000001 sec
The L1->L2 switchover now appears where it should, just above 4096 doubles
(the L1 capacity), and the MFlops figures for L1-resident sizes are more accurate.
2022-11-29 08:02:45 -05:00
lilianhuang
fdac8a97c1
Add sbgemm_ncopy_8 and sbgemm_tcopy_4
2022-11-29 04:46:14 -05:00
lilianhuang
135718eafc
Improve the performance of sbgemm_tcopy on neoversen2
2022-11-28 04:17:54 -05:00
Chris Sidebottom
4f7b77e08a
Remove unnecessary instructions from Advanced SIMD dot
...
The existing kernel was issuing extra instructions to organise the arguments into the same registers they would usually be in and similarly to put the result into the appropriate register.
This overhead affects smaller dot products, and removing it seemed like a quick fix.
2022-11-25 16:19:03 +00:00
Martin Kroeker
e9a911fb9f
Merge pull request #3841 from martin-frbg/lapack755+764
...
Fix SLATRS3 and CLATRS3 tests in TESTING/LIN (Reference-LAPACK PRs 755+764)
2022-11-23 22:38:06 +01:00
Martin Kroeker
bf0e8d67b5
Merge pull request #3840 from martin-frbg/lapack760
...
Fix typo in EIG tests and spurious return in lapacke_?tz_trans utility (Reference-LAPACK PR760)
2022-11-23 19:16:25 +01:00
Martin Kroeker
a5470521ee
Fix array indexation in copy, and fix test (Reference-LAPACK PR764)
2022-11-23 15:31:25 +01:00
Martin Kroeker
b0393ea4e1
Fix test (Reference-LAPACK PR764)
2022-11-23 15:27:46 +01:00
Martin Kroeker
0d26f1a4c7
Fix wrong indexation in test (Reference-LAPACK PR755)
2022-11-23 15:22:27 +01:00
Martin Kroeker
19fd2d7f00
Use LSAME for character comparison (Reference-LAPACK PR755)
2022-11-23 15:19:07 +01:00
Martin Kroeker
663bf68dbd
Merge pull request #3839 from martin-frbg/lapack758
...
Fix array dimension in complex SYL01 test (Reference-LAPACK PR758)
2022-11-23 14:57:56 +01:00
Martin Kroeker
c2ba4e6249
Remove unnecessary return in void function call (Reference-LAPACK PR760)
2022-11-23 10:43:34 +01:00
Martin Kroeker
74962c7f53
Remove unnecessary return in void function call (Reference-LAPACK PR760)
2022-11-23 10:42:29 +01:00
Martin Kroeker
d952cbf7bc
Remove unnecessary return in void function call (Reference-LAPACK PR760)
2022-11-23 10:41:50 +01:00
Martin Kroeker
7694ff495f
Remove unnecessary return in void function call (Reference-LAPACK PR760)
2022-11-23 10:40:59 +01:00
Martin Kroeker
825ae316e2
Fix typo in EXTERNAL (Reference-LAPACK PR760)
2022-11-23 10:36:10 +01:00
Martin Kroeker
730ed549e6
Fix typo in EXTERNAL (Reference-LAPACK PR760)
2022-11-23 10:35:23 +01:00
Martin Kroeker
bc3393f703
Fix array dimension (Reference-LAPACK PR758)
2022-11-23 10:31:18 +01:00
Martin Kroeker
0b2f8dabbf
Fix array dimension (Reference-LAPACK PR758)
2022-11-23 10:30:35 +01:00
Martin Kroeker
b4c9228441
Merge pull request #3838 from martin-frbg/lapa311
...
Update the version number of the included LAPACK to 3.11.0
2022-11-22 17:39:51 +01:00
Martin Kroeker
e6e2a63650
Update LAPACK version number to 3.11.0
2022-11-22 14:02:21 +01:00
Martin Kroeker
8408357bab
Update LAPACK version number to 3.11.0
2022-11-22 14:01:48 +01:00
Martin Kroeker
ba8fb8b4b2
Merge pull request #3837 from martin-frbg/lapack655+697
...
Improve convergence of LAPACK ?LAED4 and fix a bug in DORCSD2BY1 (Reference-LAPACK PRs 655+697)
2022-11-22 13:51:57 +01:00
Martin Kroeker
cabf9453e2
Merge pull request #3836 from martin-frbg/lapack665+735
...
Fix documentation of LAPACK functions ?TPRFB and IEEECK (Reference-LAPACK PRs 665+735)
2022-11-22 09:25:24 +01:00
Martin Kroeker
d321357558
Fix bug in DORCSD2BY1 (from Reference-LAPACK PR697)
2022-11-21 21:19:44 +01:00
Martin Kroeker
afcd7e88b6
Improve convergence of DLAED4/SLAED4 (Reference-LAPACK PR655)
2022-11-21 21:18:39 +01:00
Martin Kroeker
f8f2bebf11
Fix function documentation for LAPACK ?TPRFB (Reference-LAPACK PR665)
2022-11-21 20:01:47 +01:00
Martin Kroeker
c45edcb537
Fix typo in comment (Reference-LAPACK PR735)
2022-11-21 19:59:33 +01:00
Martin Kroeker
880a3fb20f
Merge pull request #3835 from martin-frbg/lapack217
...
Simplify ?SYSWAPR and fix its documentation (Reference-LAPACK 217)
2022-11-21 19:56:28 +01:00
Martin Kroeker
50aba02910
Simplify ?SYSWAPR and fix its documentation (Reference-LAPACK 217)
2022-11-21 18:00:31 +01:00
Martin Kroeker
0b68dd6a9b
Merge pull request #3834 from martin-frbg/lapack631
...
Use new algorithms for computing Givens rotations (Reference-LAPACK PR631)
2022-11-21 08:30:14 +01:00
Martin Kroeker
9343499256
Merge pull request #3833 from martin-frbg/lapack712+747
...
Set scale early in ?LATBS/?LATRS and fix documentation of ?LASCL2 (Reference-LAPACK PRs 712+747)
2022-11-21 08:29:49 +01:00
Martin Kroeker
7ae4269add
Use new algorithms for computing Givens rotations (Reference-LAPACK PR631)
2022-11-20 22:52:28 +01:00
Martin Kroeker
e00f0fb26a
Fix function documentation (Reference-LAPACK PR747)
2022-11-20 22:46:58 +01:00
Martin Kroeker
31d2145988
Set scale early for robust triangular solvers (Reference-LAPACK PR712)
2022-11-20 22:44:36 +01:00
Martin Kroeker
1d5a3aff0d
Merge pull request #3832 from martin-frbg/lapack681+698
...
Improve ?LAQR5 and use normwise criterion in ?LAQZ0 (Reference-LAPACK PRs 681+698)
2022-11-20 22:40:52 +01:00
Martin Kroeker
c6816bb576
Use normwise criterion in multishift QZ (Reference-LAPACK PR698)
2022-11-20 19:39:12 +01:00
Martin Kroeker
6f09e4c121
Improve FMA usage in ?LAQR5 (Reference-LAPACK PR681)
2022-11-20 19:37:28 +01:00
Martin Kroeker
f63c93274c
Merge pull request #3831 from martin-frbg/lapack647+697+702
...
Fix code and documentation for ?SORBDB?/?CUNBDB? (Reference-LAPACK PRs 647+697+702)
2022-11-20 19:34:41 +01:00
Martin Kroeker
aaea0804bc
Fix function documentation (Reference-LAPACK PR697)
2022-11-20 16:38:57 +01:00
Martin Kroeker
b946820502
Fix uninitialized variable (Reference-LAPACK PR647)
2022-11-20 16:36:19 +01:00
Martin Kroeker
9e29312c83
Fix type precision and function documentation (Reference-LAPACK PRs 647+702)
2022-11-20 16:34:45 +01:00