789 Commits

Author SHA1 Message Date
Martin Kroeker 68f2501958
temporarily(?) disable the alpha=0 branch to handle Inf/NaN in x 2024-06-22 21:08:57 +02:00
Martin Kroeker 0a744a939a
temporarily(?) disable the alpha=0 branch to handle NaN/Inf in x 2024-06-22 21:07:43 +02:00
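The two entries above do not show the code they touch. As a rough illustration of why an alpha=0 branch matters for Inf/NaN in x, the sketch below contrasts two hypothetical scaling paths: one that stores zeros when alpha is 0, and one that always multiplies. The function names and the assumption that the disabled branch simply wrote zeros are illustrative, not taken from the commits.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical shortcut path: alpha == 0 stores zeros,
   which hides any NaN or Inf that was present in x. */
static void scal_zero_shortcut(size_t n, double *x) {
    for (size_t i = 0; i < n; i++) {
        x[i] = 0.0;
    }
}

/* Hypothetical general path: always multiply, so under IEEE 754
   arithmetic 0 * Inf and 0 * NaN both produce NaN. */
static void scal_multiply(size_t n, double alpha, double *x) {
    for (size_t i = 0; i < n; i++) {
        x[i] *= alpha;
    }
}

/* Self-check contrasting the two paths on the same input. */
static void check_alpha_zero(void) {
    double a[3] = {1.0, INFINITY, NAN};
    double b[3] = {1.0, INFINITY, NAN};

    scal_zero_shortcut(3, a);
    scal_multiply(3, 0.0, b);

    assert(a[1] == 0.0 && a[2] == 0.0);   /* shortcut loses Inf/NaN */
    assert(b[0] == 0.0);                  /* finite input still becomes 0 */
    assert(isnan(b[1]) && isnan(b[2]));   /* multiply propagates NaN */
}
```

With the shortcut removed, the multiply path lets non-finite inputs surface as NaN in the result, at the cost of touching every element even when alpha is 0.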
Martin Kroeker a2ee4b1966
Merge branch 'OpenMathLib:develop' into issue4728 2024-06-21 09:35:56 +02:00
Martin Kroeker dd7efcf9ef
Avoid exceeding the configured thread count in x86_64 TOBF16 (#4748)
* avoid setting nthreads higher than available
2024-06-14 14:21:13 +02:00
Martin Kroeker 1abafcd9b2
handle corner cases involving NAN and/or INF 2024-06-06 23:59:43 +02:00
Martin Kroeker 020b3e1682
fix handling of INF arguments 2024-06-01 00:51:18 +02:00
Martin Kroeker ce130f11d2
Update zscal.c 2024-05-31 15:09:03 +02:00
Martin Kroeker ab13cfef93
more fixes for infinite x 2024-05-31 14:34:49 +02:00
Martin Kroeker ad2b5c67c8
fix another corner case involving infinity 2024-05-31 01:06:58 +02:00
Bart Oldeman 62f7b244ff Replace use of FLT_MAX in x86_64 zscal.c by isinf()
Commit def4996 fixed issues with inf and nan values in zscal,
but used FLT_MAX, where DBL_MAX or isinf() is more appropriate,
as FLT_MAX is for single precision only.
Using FLT_MAX caused test case failures in the LAPACK tests.

isinf() is consistent with the later fix 969601a1
2024-05-24 17:20:27 +00:00
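The precision problem described in the entry above can be reproduced in a few lines. This is a minimal sketch, assuming the earlier check compared a double magnitude against FLT_MAX; the exact expression in zscal.c is not shown here, and both helper names are hypothetical.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Hypothetical FLT_MAX-style test: any double whose magnitude exceeds the
   single-precision maximum is treated as infinite, even if it is finite. */
static int looks_infinite_flt_max(double v) {
    return fabs(v) > FLT_MAX;
}

/* Hypothetical isinf()-based test: true only for +Inf and -Inf. */
static int looks_infinite_isinf(double v) {
    return isinf(v) != 0;
}

/* Self-check: 1e300 is a finite double but far larger than FLT_MAX. */
static void check_inf_tests(void) {
    assert(looks_infinite_flt_max(1e300) == 1);     /* false positive */
    assert(looks_infinite_isinf(1e300) == 0);       /* correctly finite */
    assert(looks_infinite_flt_max(INFINITY) == 1);
    assert(looks_infinite_isinf(-INFINITY) == 1);
}
```

Any finite double between FLT_MAX and DBL_MAX is misclassified by the first test, which is consistent with the LAPACK test failures the commit message reports.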
Zoltán Böszörményi ca64861ce8 Add forgotten conditional uses of PREFETCH
This fixes a (cross-)compilation/linker error for PRESCOTT
on Yocto.

Signed-off-by: Zoltán Böszörményi <zoltan.boszormenyi@xenial.com>
2024-04-19 10:52:28 +02:00
Martin Kroeker 8f8ef3492a
Add CSUM and ZSUM kernels (trivially derived from their existing ASUM counterparts) 2024-02-24 23:57:50 +01:00
Martin Kroeker be5e18c6f9
Add kernel definitions for CSUM and ZSUM 2024-02-24 23:55:43 +01:00
gxw 969601a1dc X86_64: Fixed bug in zscal
Fixed handling of NAN and INF arguments when
inc is greater than 1.
2024-01-31 11:23:59 +08:00
Martin Kroeker 5f5b7c4f45
Merge pull request #4423 from martin-frbg/issue4422
Check compiler support for AVX512BF16 and base COL/SPR kernel choice on that
2024-01-12 16:30:50 +01:00
Martin Kroeker 995a990e24
Make AVX512 BFLOAT16 kernels conditional on compiler capability 2024-01-12 00:12:46 +01:00
Martin Kroeker cf8b03ae8b
Use NAN rather than SNAN for portability 2024-01-07 23:09:57 +01:00
Martin Kroeker def4996170
Fix handling of NAN and INF arguments 2024-01-07 15:29:42 +01:00
Martin Kroeker f06b535566
Use C kernel for dgemv_t due to limitations of the old assembly one 2023-12-15 09:58:44 +01:00
Bart Oldeman c34e2cf380 Use _mm_set1_epi{32,64x} to init mask in x86-64 [cz]asum
for skylake kernels. This is the same method as used in [sd]asum.
_mm_set1_epi64x was commented out for zasum, but has the advantage
of avoiding possible undefined behaviour (using an uninitialized
variable), optimized out by NVHPC and icx. The new code works
fine with those compilers.

For GCC 12.3 the generated code is identical: whichever method you
use, the compiler folds the mask into a compile-time constant, so
there is no performance benefit from using _mm_cmpeq_epi8, since
the corresponding instruction (VPCMPEQB) isn't actually
generated!
2023-11-19 21:28:35 +00:00
Martin Kroeker 22aa401656
Temporarily disable the AVX512 CASUM/ZASUM microkernels for any version of NVIDIA HPC (#4327)
* Temporarily disable the C/ZASUM microkernels for any version of NVHPC
2023-11-19 00:04:31 +01:00
Bart Oldeman f8ad5344c2 Fix casum fallback kernel.
This kernel is only used on Skylake+ if the kernel with AVX512
intrinsics can't be used. It used the variable x1 incorrectly
in the tail end of the loop: x1 is still at its initial
value there instead of where x points to.

This caused 55 "other error"s in the LAPACK tests
(https://github.com/OpenMathLib/OpenBLAS/issues/4282)

This change makes casum.c as similar as possible to zasum.c,
because zasum.c does this correctly.
2023-11-17 23:53:56 +00:00
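The stale-pointer bug described in the entry above follows a common pattern in unrolled loops with a scalar tail. The sketch below is a simplified stand-in, not the OpenBLAS kernel: the function name, the unroll factor of 4, and the use_stale_pointer switch are assumptions made for illustration. Only the roles of x (advanced by the main loop) and x1 (left at the start) mirror the commit message.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Sum of |re| + |im| over n complex floats stored interleaved in x.
   use_stale_pointer != 0 reproduces the bug: the tail reads from x1,
   which was never advanced past the unrolled part. */
static float casum_tail_sketch(size_t n, const float *x, int use_stale_pointer) {
    const float *x1 = x;   /* stays at the initial position */
    float sum = 0.0f;
    size_t i = 0;

    /* Main loop: 4 complex elements (8 floats) per iteration. */
    for (; i + 4 <= n; i += 4) {
        for (int k = 0; k < 8; k++) {
            sum += fabsf(x[k]);
        }
        x += 8;
    }

    /* Tail: must continue from x, not from the start of the vector. */
    const float *tail = use_stale_pointer ? x1 : x;
    for (; i < n; i++) {
        sum += fabsf(tail[0]) + fabsf(tail[1]);
        tail += 2;
    }
    return sum;
}

/* Self-check with 5 complex elements: 4 go through the main loop, 1 is tail. */
static void check_casum_tail(void) {
    const float v[10] = {1, -2, 3, -4, 5, -6, 7, -8, 9, -10};
    assert(casum_tail_sketch(5, v, 0) == 55.0f);  /* correct: 36 + 19 */
    assert(casum_tail_sketch(5, v, 1) == 39.0f);  /* buggy: 36 + 3 */
}
```

The buggy variant gives a plausible but wrong result whenever n is not a multiple of the unroll factor, which is why this kind of error tends to show up only in broader test suites.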
Martin Kroeker 9019bc4945
Use SkylakeX ?ASUM microkernel for Cooperlake/Sapphirerapids as well 2023-11-04 22:10:06 +01:00
Martin Kroeker 675cd551da
fix improper function prototypes (empty parentheses) 2023-09-30 12:56:38 +02:00
Martin Kroeker 2c3034ff7f
Disable the C/ZASUM AVX512 microkernels when compiling with LLVM17 as well 2023-08-25 17:22:51 +02:00
Martin Kroeker 34da1a067d
Allow negative INCX (API change from version 3.10 of the reference implementation) 2023-08-10 17:01:50 +02:00
Martin Kroeker 4664b57e6e
use shortcut only when both incx and incy are zero 2023-08-04 12:25:34 +02:00
Martin Kroeker 6a428b5629
Update casum_microk_skylakex-2.c 2023-07-29 12:24:30 +02:00
Martin Kroeker ebb447e32e
Update zasum_microk_skylakex-2.c 2023-07-29 12:23:57 +02:00
Martin Kroeker 9f6847583a
nvc currently miscompiles this, hopefully fixed in release 23.09 2023-07-29 11:50:16 +02:00
Martin Kroeker fe54ee3d15
nvc currently miscompiles this, hopefully fixed in release 23.09 2023-07-29 11:48:38 +02:00
Martin Kroeker 2a62d2df96
Enable use of AVX512 microkernels with NVIDIA HPC from version 22.3 2023-07-26 19:39:11 +02:00
Honglin Zhu a76afdc047 Compatible with older version of GNU make 2023-05-20 13:58:23 +08:00
Honglin Zhu 0b83088887 spr dynamic arch support 2023-05-19 10:48:18 +08:00
Honglin Zhu f249ccb741 Fix spr sbgemm error 2023-05-19 10:48:18 +08:00
Martin Kroeker 84bcf6639f
Disable gcc's tree-vectorizer pass on all operating systems 2023-04-20 23:24:52 +02:00
Martin Kroeker c9174ae8d7
Disable gcc's tree-vectorizer pass on all operating systems 2023-04-19 23:45:44 +02:00
Martin Kroeker c2fe9cb91f
Disable gcc's tree-vectorizer pass on all operating systems 2023-04-19 23:45:14 +02:00
Martin Kroeker 66b39b835c
Disable gcc's tree-vectorizer pass on all operating systems 2023-04-19 23:44:45 +02:00
Martin Kroeker bb6d6735bf
Disable gcc's tree-vectorizer pass on all operating systems 2023-04-19 23:44:15 +02:00
Martin Kroeker d18efaed20
Disable gcc's tree-vectorizer pass on all operating systems 2023-04-19 23:43:43 +02:00
Martin Kroeker 99f6d31ed5
Disable gcc's tree-vectorizer pass on all operating systems 2023-04-19 23:42:55 +02:00
Martin Kroeker 7de9335c56
Disable gcc's tree-vectorizer pass on all operating systems 2023-04-19 23:42:09 +02:00
Bart Oldeman 60e49b851c Fix typo in clobber list, should be xmm14 instead of ymm14. 2022-12-06 16:30:46 -05:00
Bart Oldeman 4afe1439a1 Fix skylake fallback kernel name for old compilers. 2022-12-06 16:09:54 -05:00
Bart Oldeman 5ceca1a4d8 Add sscal.c + microkernels for Haswell, Zen, Skylake and newer.
Unlike [dcz]scal, sscal still used the original GotoBLAS SSE code from scal_sse.S.
This code follows dscal as closely as possible, except for the inc_x > 1 code
for which a plain C loop is used much like the one in cscal.c, instead of an
adaptation of the SSE2 asm code of dscal.c (I tried but the performance wasn't
better than the plain C loop).
2022-12-06 14:05:49 -05:00
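For readers unfamiliar with the strided case mentioned in the entry above, a plain C loop for inc_x > 1 might look roughly like this. It is a minimal sketch under assumed names and types (sscal_strided_sketch, size_t arguments); the actual sscal.c also dispatches to microkernels for unit stride and may treat special values of alpha differently.

```c
#include <assert.h>
#include <stddef.h>

/* Scale every inc_x-th element of x by alpha, n elements in total.
   Hypothetical stand-in for the inc_x > 1 path; no vectorization. */
static void sscal_strided_sketch(size_t n, float alpha, float *x, size_t inc_x) {
    assert(inc_x > 0);   /* this sketch only covers positive strides */
    for (size_t i = 0; i < n; i++) {
        x[i * inc_x] *= alpha;
    }
}

/* Self-check: scale elements 0, 2 and 4 by 2 and leave the rest alone. */
static void check_sscal_strided(void) {
    float v[5] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
    sscal_strided_sketch(3, 2.0f, v, 2);
    assert(v[0] == 2.0f && v[2] == 6.0f && v[4] == 10.0f);
    assert(v[1] == 2.0f && v[3] == 4.0f);
}
```

Strided access defeats contiguous vector loads, which fits the commit author's observation that hand-written SSE2 code was no faster than the plain loop here.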
Bart Oldeman 5c3169ecd8 dscal: use ymm registers in Haswell microkernel
Using 256-bit registers in dscal makes this microkernel consistent with
cscal and zscal, and generally doubles performance if the vector fits in
L1 cache.
2022-12-01 07:48:05 -05:00
Martin Kroeker f73cfb7e2c
change line endings from CRLF to LF 2022-11-17 09:39:56 +01:00
Martin Kroeker 1688c7da43
change line endings from CRLF to LF 2022-11-16 22:24:01 +01:00
Bart Oldeman 6c1043eb41 Add [cz]scal microkernels for SKYLAKEX
These are as similar to dscal_microk_skylakex-2.c as possible
for consistency.

Note that before this change SKYLAKEX+ used generic C functions for
cscal/zscal via commit 2271c350 from #2610 (which is masked by
commit 086d87a30). However now #3799 disables FMAs (in turn enabled
by `-march=skylake-avx512`) in the plain C code which fixes excessive
LAPACK test failures more nicely.
2022-11-09 08:57:03 -05:00