Commit Graph

162 Commits

Author SHA1 Message Date
Chris Sidebottom ec334e69dc Use SVE kernel for SGEMM/DGEMM on Arm(R) Neoverse(TM) V1
This re-spins #3869 with some additional copy unrolling which helps maintain SYRK performance.

After #3868, the SVE kernels represent a pretty good boost.

This re-uses ARMV8SVE as a base and I'm going to incrementally move everything to use ARMV8SVE in additional patches (as well as fix up anything that's not already in ARMV8SVE).
2023-04-17 17:38:42 +01:00
Bart Oldeman bae45d94d1 scal benchmark: eliminate y, move init/timing out of loop
Removing y avoids cache effects (if y is the size of the L1 cache, the
main array x is removed from it).
Moving init and timing out of the loop makes the scal benchmark behave like
the gemm benchmark, and allows higher accuracy for smaller test cases since
the loop overhead is much smaller than the timing overhead.

Example:
OPENBLAS_LOOPS=10000 ./dscal.goto 1024 8192 1024
on AMD Zen2 (7532) with 32k (4k doubles) L1 cache per core.

Before
From : 1024  To : 8192 Step = 1024 Inc_x = 1 Inc_y = 1 Loops = 10000
   SIZE       Flops
   1024 :     5627.08 MFlops   0.000000 sec
   2048 :     5907.34 MFlops   0.000000 sec
   3072 :     5553.30 MFlops   0.000001 sec
   4096 :     5446.38 MFlops   0.000001 sec
   5120 :     5504.61 MFlops   0.000001 sec
   6144 :     5501.80 MFlops   0.000001 sec
   7168 :     5547.43 MFlops   0.000001 sec
   8192 :     5548.46 MFlops   0.000001 sec

After
From : 1024  To : 8192 Step = 1024 Inc_x = 1 Inc_y = 1 Loops = 10000
   SIZE       Flops
   1024 :     6310.28 MFlops   0.000000 sec
   2048 :     6396.29 MFlops   0.000000 sec
   3072 :     6439.14 MFlops   0.000000 sec
   4096 :     6327.14 MFlops   0.000001 sec
   5120 :     5628.24 MFlops   0.000001 sec
   6144 :     5616.41 MFlops   0.000001 sec
   7168 :     5553.13 MFlops   0.000001 sec
   8192 :     5600.88 MFlops   0.000001 sec

We can see the L1->L2 switchover point is now where it should be, and the
number of flops for L1 is more accurate.
2022-11-29 08:02:45 -05:00
Martin Kroeker f92dd6e303
change line endings from CRLF to LF 2022-11-17 10:18:36 +01:00
Bart Oldeman 9e6b060bf3 Fix comment.
It stores the pointer, not an offset (that would be an alternative approach).
2022-10-20 20:11:09 -04:00
Bart Oldeman 9959a60873 Benchmarks: align malloc'ed buffers.
Benchmarks should allocate with cacheline (often 64 bytes) alignment
to avoid unreliable timings. This technique, storing the offset in the
byte before the pointer, doesn't require C11's aligned_alloc for
compatibility with older compilers.

For example, Glibc's x86_64 malloc returns 16-byte aligned buffers, which is
not sufficient for AVX/AVX2 (32-byte preferred) or AVX512 (64-byte).
2022-10-20 13:28:20 -04:00
Marius Hillenbrand f119e26354 Fix flipped indices in benchmark for gemv
Fixes #3439
2021-11-03 12:45:09 +01:00
Martin Kroeker 14e33e0f7e
Handle OPENBLAS_LOOPS in SYR2 benchmark 2021-07-10 21:27:53 +02:00
Martin Kroeker 4ed99c2ce3
Merge pull request #3292 from martin-frbg/syrk_limit
Add lower limit for multithreading in xSYRK
2021-07-07 20:46:28 +02:00
Martin Kroeker a4543e4918
Handle OPENBLAS_LOOP 2021-07-04 16:59:43 +02:00
Martin Kroeker dcfc5cf714
Handle OPENBLAS_LOOPS for more stable results 2021-07-01 17:39:37 +02:00
Martin Kroeker 06e3b07ecb
Handle OPENBLAS_LOOPS and OPENBLAS_TEST options 2021-07-01 17:38:45 +02:00
Martin Kroeker 1f8bda71b9
Add OPENBLAS_LOOPS support to potrf/potrs/potri benchmark 2021-06-26 23:46:00 +02:00
Martin Kroeker d57c681a6d
Fix compilation on older OSX versions 2021-03-26 22:29:29 +01:00
Martin Kroeker 38dcf3454b
Support timing Apple M1 2021-03-02 17:50:55 +01:00
Qiyu8 f917c26e83 Refractoring remaining benchmark cases. 2020-10-26 10:25:05 +08:00
Qiyu8 dd6ebdfdab Refactor the performance measurement system 2020-10-23 10:32:03 +08:00
Martin Kroeker 7ae9e8960e
Change "HALF" and "sh" to "BFLOAT16" and "sb" 2020-10-12 00:08:29 +02:00
Martin Kroeker 5464eb13ea
Change ifdef linux to __linux for C11 compatibility 2020-09-30 22:59:41 +02:00
Martin Kroeker 6f8fad87c5
Use POSIX2001 clock.gettime for higher resolution 2020-09-05 19:44:01 +02:00
Martin Kroeker ced49466f0
Use the fortran compiler to link LAPACK-related benchmarks
to fix linking problems with (at least) the AMD version of flang that creates dependencies on more than just the fortran runtime.
2020-05-29 13:35:51 +02:00
Martin Kroeker 6e270f91ec
add support for RETURN_BY_STACK semantics, e.g. clang 2020-05-29 13:29:10 +02:00
Rajalakshmi Srinivasaraghavan ce90e2bd3f Include shgemm in benchtest
This patch is to enable benchtest for half precision gemm
when BUILD_HALF is set during make.
2020-05-11 09:57:46 -05:00
l00536773 6b7ef6543a [OpenBLAS]: benchmark error of potrf
[description]: when the matrix size goes higher than 5800 during the cpotrf test, error info, such as "Potrf info = 5679", will be returned on ARM64 and x86 machines. Uplo = L & F.
[solution]: changed the func for building the matrix so that the complex Hermitian matrix can stay positive definite during the computation.
[dts]:
2020-04-16 10:55:10 +08:00
Martin Kroeker 717c604aeb
Merge pull request #2515 from zelong-1024/develop
[OpenBLAS]: benchmark for her/her2 LEVEL2 functions
2020-03-16 21:59:55 +01:00
Martin Kroeker ce33da4cab
Merge pull request #2513 from aaawuanjun/develop
[OpenBlas]: Add benchmark tpsv file and modify benchmark/Makefile
2020-03-16 21:58:55 +01:00
l00536773 d45c53ecf1 [OpenBLAS]: benchmark for her/her2 LEVEL2 functions
[description]: benchmark for her/her2
[solution]: added benchmark for her/her2, modified makefile in benchmark
[dts]:
2020-03-16 11:19:05 +08:00
Martin Kroeker c2840997db
Merge pull request #2508 from liujingjue/develop
[OpenBLAS]:fix the iamax benchmark error
2020-03-14 14:21:30 +01:00
Martin Kroeker c0649aa694
Merge pull request #2506 from xiaofengF/develop
Add benchmark for SPMV and fix segmentation fault when data size >= 50000
2020-03-14 13:08:36 +01:00
wuanjun 00447568 2428dc9fd3 [OpenBlas]: Add benchmark tpsv file and modify benchmark/Makefile
[Description]: Solve lack of tpsv benchmark.
2020-03-14 09:11:08 +08:00
l00546269 a0a3bf7c81 [OpenBLAS]:fix the iamax benchmark error
[Description]:the result for i?amax is not MFlops, it is MBytes
2020-03-13 10:58:39 +08:00
jayfely@qq.com ae3f2c2e49 Remove cspmv and zspmv to remove the error occured in travis CI 2020-03-11 17:02:34 +08:00
jayfely@qq.com 649733ff15 Only keep spmv.goto and spmv.atlas 2020-03-11 15:48:58 +08:00
wuanjun 00447568 3e8f1c6cc5 [OpenBlas]:Add benchmark tpmv.c and modify Makefile
[Description]:Solve the problem of missing tpmv.c benchmark file
2020-03-11 12:31:48 +08:00
jayfely@qq.com 2f4c5bb3a9 Update spmv.c: solve segmentation fault when m and n are larger than 50000 2020-03-11 10:30:09 +08:00
Martin Kroeker 047dfb216d
Merge pull request #2501 from jijiwawa/Fix_mistakes
Fix  pr #2487 error
2020-03-10 16:44:40 +01:00
s00527847 cd8871f1a1 Use the correct unit of measure 2020-03-10 19:26:06 -04:00
jayfely@qq.com 08e1d8cbae Modify Makefile in Benchmark 2020-03-10 14:32:18 +08:00
jayfely@qq.com ff40a4e726 Add benchmark for SPMV 2020-03-10 14:22:18 +08:00
s00548429 c5bdd21352 Add benchmark for ?amax, ?max, ?amin, ?min, i?max, i?amin and i?min. 2020-03-09 14:59:03 +08:00
Martin Kroeker b6a6ccbbea
Merge pull request #2495 from ZuoQ3/develop
add benchmark for axpby test
2020-03-08 08:09:58 +01:00
Martin Kroeker 8b720f7365
Merge pull request #2494 from shengyang-3390/develop
add benchmark for csrot and zdrot
2020-03-07 23:04:21 +01:00
Martin Kroeker 14df234edb
Merge pull request #2489 from jijiwawa/brightness
Remove redundant code
2020-03-07 22:26:00 +01:00
s00527847 bbeda55b7b add trmm.c 2020-03-07 13:09:19 -05:00
s00527847 efcf89aec7 Remove redundant code 2020-03-07 12:03:05 -05:00
zq 0c8162eba6 Add benchmark file axpby.c and modify benchmark/Makefile to test s/d/c/zaxpby 2020-03-07 17:48:55 +08:00
shengyang 09c7a191bd add benchmark for csrot and zdrot
modified:   benchmark/Makefile
	modified:   benchmark/rot.c
2020-03-07 15:17:49 +08:00
Martin Kroeker dca3e0cf20
Merge pull request #2491 from chenxuqiang/hbmv_benchmark
benchmark/hpmv&hbmv: add benchmark/hpmv.c and benchmark/hbmv.c
2020-03-06 15:06:42 +01:00
Martin Kroeker c9f8db979b
Merge pull request #2490 from shengyang-3390/develop
Add benchmark file rotm.c and modify benchmark/Makefile to test s/drotm
2020-03-06 15:05:55 +01:00
Martin Kroeker 97c36ca58c
Merge branch 'develop' into develop 2020-03-06 14:41:40 +01:00
Martin Kroeker 9f5a74f3c7
Merge pull request #2486 from qqqil/develop
add benchmark for trsv
2020-03-06 14:30:09 +01:00