I incorrectly added `+sve` to the GCC parameters for the Neoverse(TM) N1
CPU, which doesn't support SVE. This results in failed builds when using a
compiler that doesn't support `-mtune=neoverse-n1`, an option which appears
to hide the mistake.
Unlike [dcz]scal, sscal still used the original GotoBLAS SSE code from scal_sse.S.
The new code follows dscal as closely as possible, except for the inc_x > 1
case, which uses a plain C loop much like the one in cscal.c instead of an
adaptation of the SSE2 asm code of dscal.c (I tried, but the performance wasn't
better than the plain C loop).
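As a minimal sketch of what a plain C loop for the inc_x > 1 case can look like (the function name `sscal_strided` and its signature are hypothetical, not the actual OpenBLAS kernel interface):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not the OpenBLAS kernel: scale n elements of x
 * by alpha, where consecutive elements are inc_x floats apart. */
void sscal_strided(size_t n, float alpha, float *x, size_t inc_x)
{
    assert(inc_x > 0);
    for (size_t i = 0, ix = 0; i < n; i++, ix += inc_x) {
        x[ix] *= alpha;  /* elements between the strides are left untouched */
    }
}
```

Calling `sscal_strided(3, 2.0f, x, 2)` on a six-element array scales x[0], x[2] and x[4] only. A loop of this shape gives the compiler little to vectorise, since the accesses are not contiguous, which fits the observation that the asm adaptation was no faster.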
Using 256-bit registers in dscal makes this microkernel consistent with
cscal and zscal, and generally doubles performance if the vector fits in
L1 cache.
This adds an SVE implementation of sdot/ddot, used when SVE is available, falling back to the previous Advanced SIMD kernel where there's no SVE implementation.
All the targets were essentially treating `dot_thunderx2t99.c` as the Advanced SIMD implementation, so I've renamed it to better fit with the feature detection.
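The selection between an SVE path and a fallback can be sketched as below. This is an illustration only: it keys on the ACLE feature macro `__ARM_FEATURE_SVE` at compile time, which is an assumption and not necessarily how the OpenBLAS build selects kernels, the function name `ddot_sketch` is hypothetical, and the fallback here is a scalar loop standing in for the Advanced SIMD kernel.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical unit-stride dot product, not the OpenBLAS kernel. */
double ddot_sketch(size_t n, const double *x, const double *y)
{
#if defined(__ARM_FEATURE_SVE)
    /* SVE path: vector-length agnostic loop with a predicate for the tail. */
    svfloat64_t acc = svdup_n_f64(0.0);
    for (size_t i = 0; i < n; i += svcntd()) {
        svbool_t pg = svwhilelt_b64_u64((uint64_t)i, (uint64_t)n);
        svfloat64_t vx = svld1_f64(pg, x + i);
        svfloat64_t vy = svld1_f64(pg, y + i);
        acc = svmla_f64_m(pg, acc, vx, vy);  /* acc += vx * vy on active lanes */
    }
    return svaddv_f64(svptrue_b64(), acc);   /* horizontal sum of all lanes */
#else
    /* Fallback path: scalar stand-in for the Advanced SIMD kernel. */
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
#endif
}
```

Because the two paths accumulate in a different order, results may differ in the last bits for general inputs; small integer-valued inputs sum exactly either way.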
Removing y avoids cache effects (if y is the size of the L1 cache, the
main array x is evicted from it).
Moving init and timing out of the loop makes the scal benchmark behave like
the gemm benchmark, and allows higher accuracy for smaller test cases since
the loop overhead is much smaller than the timing overhead.
Example:
OPENBLAS_LOOPS=10000 ./dscal.goto 1024 8192 1024
on AMD Zen2 (7532) with 32 KiB (4096 doubles) of L1 cache per core.
Before
From : 1024 To : 8192 Step = 1024 Inc_x = 1 Inc_y = 1 Loops = 10000
SIZE Flops
1024 : 5627.08 MFlops 0.000000 sec
2048 : 5907.34 MFlops 0.000000 sec
3072 : 5553.30 MFlops 0.000001 sec
4096 : 5446.38 MFlops 0.000001 sec
5120 : 5504.61 MFlops 0.000001 sec
6144 : 5501.80 MFlops 0.000001 sec
7168 : 5547.43 MFlops 0.000001 sec
8192 : 5548.46 MFlops 0.000001 sec
After
From : 1024 To : 8192 Step = 1024 Inc_x = 1 Inc_y = 1 Loops = 10000
SIZE Flops
1024 : 6310.28 MFlops 0.000000 sec
2048 : 6396.29 MFlops 0.000000 sec
3072 : 6439.14 MFlops 0.000000 sec
4096 : 6327.14 MFlops 0.000001 sec
5120 : 5628.24 MFlops 0.000001 sec
6144 : 5616.41 MFlops 0.000001 sec
7168 : 5553.13 MFlops 0.000001 sec
8192 : 5600.88 MFlops 0.000001 sec
We can see the L1->L2 switchover point is now where it should be (between
4096 and 5120 doubles), and the number of flops for L1 is more accurate.
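The benchmark structure described above, with init and timing moved out of the loop, can be sketched roughly as follows. All names here (`scal_kernel`, `time_scal`) are hypothetical, and standard `clock()` is used only for portability; it is an assumption, not the timer the real benchmark uses.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the kernel under test. */
void scal_kernel(size_t n, double alpha, double *x)
{
    for (size_t i = 0; i < n; i++) {
        x[i] *= alpha;
    }
}

/* Returns the average time in seconds of one kernel call.
 * x is initialised once, and the timer is read once before and once
 * after all the loops, so the timing overhead is paid a single time. */
double time_scal(size_t n, int loops, double alpha, double *x)
{
    for (size_t i = 0; i < n; i++) {
        x[i] = 1.0;                      /* init outside the timed region */
    }

    clock_t start = clock();
    for (int l = 0; l < loops; l++) {
        scal_kernel(n, alpha, x);        /* only the kernel is repeated */
    }
    clock_t stop = clock();

    return (double)(stop - start) / CLOCKS_PER_SEC / loops;
}
```

With only x in play and no re-initialisation between iterations, the working set stays at n doubles, which is why the drop in MFlops can line up with the L1 capacity.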
The existing kernel was issuing extra instructions to move the arguments into the registers they would usually be in, and similarly to put the result into the appropriate register.
This has an impact on smaller dot products and seemed like a quick fix.