This re-spins #3869 with some additional copy unrolling which helps maintain SYRK performance.
With #3868 in place, the SVE kernels give a noticeable performance boost.
This re-uses ARMV8SVE as a base. In follow-up patches I'm going to incrementally move everything over to ARMV8SVE, and fix up anything that's not already there.
Removing y avoids cache effects: if y is as large as the L1 cache, touching
it evicts the main array x from L1.
Moving init and timing out of the loop makes the scal benchmark behave like
the gemm benchmark, and allows higher accuracy for smaller test cases since
the loop overhead is much smaller than the timing overhead.
Example:
OPENBLAS_LOOPS=10000 ./dscal.goto 1024 8192 1024
on AMD Zen2 (7532) with 32k (4k doubles) L1 cache per core.
Before
From : 1024 To : 8192 Step = 1024 Inc_x = 1 Inc_y = 1 Loops = 10000
SIZE Flops
1024 : 5627.08 MFlops 0.000000 sec
2048 : 5907.34 MFlops 0.000000 sec
3072 : 5553.30 MFlops 0.000001 sec
4096 : 5446.38 MFlops 0.000001 sec
5120 : 5504.61 MFlops 0.000001 sec
6144 : 5501.80 MFlops 0.000001 sec
7168 : 5547.43 MFlops 0.000001 sec
8192 : 5548.46 MFlops 0.000001 sec
After
From : 1024 To : 8192 Step = 1024 Inc_x = 1 Inc_y = 1 Loops = 10000
SIZE Flops
1024 : 6310.28 MFlops 0.000000 sec
2048 : 6396.29 MFlops 0.000000 sec
3072 : 6439.14 MFlops 0.000000 sec
4096 : 6327.14 MFlops 0.000001 sec
5120 : 5628.24 MFlops 0.000001 sec
6144 : 5616.41 MFlops 0.000001 sec
7168 : 5553.13 MFlops 0.000001 sec
8192 : 5600.88 MFlops 0.000001 sec
We can see the L1->L2 switchover point is now where it should be (just above
4096 doubles), and the MFlops figures for sizes that fit in L1 are more accurate.
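
For illustration, the restructured timing described above (initialization and timer calls hoisted out of the repeat loop) could look roughly like the sketch below. This is not the benchmark's actual code: `scal_ref`, `now_sec` and `time_scal` are hypothetical names, the kernel is a plain C stand-in for dscal, and `clock_gettime` assumes a POSIX system.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the scal kernel: x[i] *= alpha. */
static void scal_ref(size_t n, double alpha, double *x)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= alpha;
}

/* Wall-clock time in seconds (POSIX monotonic clock). */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Returns the average time per call in seconds.
 * x is initialized once and the timer is read twice in total,
 * so the timer overhead is amortized over all `loops` calls. */
static double time_scal(size_t n, int loops, double alpha, double *x)
{
    assert(loops > 0);

    for (size_t i = 0; i < n; i++)   /* init outside the timed region */
        x[i] = 1.0;

    double t0 = now_sec();
    for (int l = 0; l < loops; l++)  /* only the kernel is inside the loop */
        scal_ref(n, alpha, x);
    double t1 = now_sec();

    return (t1 - t0) / loops;
}
```

Since scal performs one multiply per element, an MFlops figure would follow as `n / time_per_call * 1e-6` under this accounting.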
Benchmarks should allocate with cacheline (often 64 bytes) alignment
to avoid unreliable timings. The technique used here stores the offset in
the byte before the returned pointer, so it does not need C11's
aligned_alloc and stays compatible with older compilers.
For example, Glibc's x86_64 malloc returns 16-byte aligned buffers, which is
not sufficient for AVX/AVX2 (32-byte preferred) or AVX512 (64-byte).
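
A minimal sketch of the offset-in-the-preceding-byte technique is shown below. The helper names `aligned_alloc_sketch` and `aligned_free_sketch` are hypothetical and are not the names used in the patch. The sketch assumes the alignment is a power of two no larger than 128, so that the offset fits in one byte.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns a pointer aligned to `alignment` bytes.
 * Assumes alignment is a power of two and at most 128. */
static void *aligned_alloc_sketch(size_t alignment, size_t size)
{
    assert(alignment > 0 && alignment <= 128);
    assert((alignment & (alignment - 1)) == 0);

    /* Over-allocate so an aligned address always fits inside the block. */
    unsigned char *raw = malloc(size + alignment);
    if (raw == NULL)
        return NULL;

    /* Next aligned address strictly above raw, so that at least one
     * byte precedes it; the offset is therefore in 1..alignment. */
    uintptr_t addr = ((uintptr_t)raw + alignment) & ~(uintptr_t)(alignment - 1);
    unsigned char *p = (unsigned char *)addr;

    p[-1] = (unsigned char)(p - raw);  /* remember how far we moved */
    return p;
}

/* Hypothetical helper: frees a pointer from aligned_alloc_sketch. */
static void aligned_free_sketch(void *ptr)
{
    if (ptr == NULL)
        return;
    unsigned char *p = ptr;
    free(p - p[-1]);                   /* recover the original malloc pointer */
}
```

A buffer obtained with `aligned_alloc_sketch(64, n * sizeof(double))` must be released with `aligned_free_sketch`, not with plain `free`, because the returned pointer is not the one `malloc` handed out.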
[description]: When the matrix size exceeds 5800 in the cpotrf test, an error such as "Potrf info = 5679" is returned on ARM64 and x86 machines. Uplo = L & F.
[solution]: Changed the function that builds the matrix so that the complex Hermitian matrix stays positive definite throughout the computation.
[dts]:
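
The description above does not say how the new matrix is built. As an assumption for illustration only, one common way to get a Hermitian positive definite test matrix is to make it strictly diagonally dominant with a real positive diagonal. The sketch below does that for a column-major matrix stored as interleaved (real, imaginary) float pairs; `fill_hpd` is a hypothetical name, not the function from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sketch: fill an n x n column-major complex matrix, stored as
 * interleaved (re, im) floats, so that it is Hermitian and strictly
 * diagonally dominant, hence positive definite.
 * Off-diagonal parts lie in [-0.5, 0.5], so each |a_ij| <= sqrt(0.5) < 1
 * and every off-diagonal row sum is below n - 1 < n = a_ii. */
static void fill_hpd(size_t n, float *a)
{
    assert(n > 0);

    for (size_t j = 0; j < n; j++) {
        for (size_t i = j + 1; i < n; i++) {
            float re = (float)rand() / (float)RAND_MAX - 0.5f;
            float im = (float)rand() / (float)RAND_MAX - 0.5f;

            a[2 * (i + j * n)]     = re;   /* lower triangle: a_ij */
            a[2 * (i + j * n) + 1] = im;
            a[2 * (j + i * n)]     = re;   /* upper triangle: conj(a_ij) */
            a[2 * (j + i * n) + 1] = -im;
        }
        a[2 * (j + j * n)]     = (float)n; /* real, dominant diagonal */
        a[2 * (j + j * n) + 1] = 0.0f;
    }
}
```

Because the dominance margin grows with n, a matrix built this way should remain safely positive definite in single precision even at the sizes mentioned in the description.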