Commit Graph

2036 Commits

Author SHA1 Message Date
Bart Oldeman
c34e2cf380 Use _mm_set1_epi{32,64x} to init mask in x86-64 [cz]asum
for skylake kernels. This is the same method as used in [sd]asum.
_mm_set1_epi64x was commented out for zasum, but has the advantage
of avoiding possible undefined behaviour (using an uninitialized
variable), optimized out by NVHPC and icx. The new code works
fine with those compilers.

For GCC 12.3 the generated code is identical; no matter what method
you use, the compiler optimizes the code into a compile-time
constant, there is no performance benefit using mm_cmpeq_epi8
since the corresponding instruction (VPCMPEQB) isn't actually
generated!
2023-11-19 21:28:35 +00:00
Martin Kroeker
22aa401656 Temporarily disable the AVX512 CASUM/ZASUM microkernels for any version of NVIDIA HPC (#4327)
* Temporarily disable the C/ZASUM microkernels for any version of NVHPC
2023-11-19 00:04:31 +01:00
Bart Oldeman
f8ad5344c2 Fix casum fallback kernel.
This kernel is only used on Skylake+ if the kernel with AVX512
intrinsics can't be used, but used the variable x1 incorrectly
in the tail end of the loop, as it is still at the initial
value instead of where x points to.

This caused 55 "other error"s in the LAPACK tests
(https://github.com/OpenMathLib/OpenBLAS/issues/4282)

This change makes casum.c as similar as possible as zasum.c,
because zasum.c does this correctly.
2023-11-17 23:53:56 +00:00
Martin Kroeker
04bc801999 (Re)apply fixes for supporting only a subset of precision types from PR 3915 2023-11-04 23:48:59 +01:00
Martin Kroeker
9019bc4945 Use SkylakeX ?ASUM microkernel for Cooperlake/Sapphirerapids as well 2023-11-04 22:10:06 +01:00
Martin Kroeker
3bfa4d4dcc Fix outdated SVE kernel definitions for Cortex cpus by aliasing to ARMV8SVE 2023-11-03 14:55:31 +01:00
Rajalakshmi Srinivasaraghavan
980f702f72 POWER: AIX: Make use of power10 optimization
POWER10 optimizations are disabled when using default AIX assembler.
As we have fixed many issues recently, enabling optimization path
for default assembler.
2023-10-19 18:48:19 -05:00
Rajalakshmi Srinivasaraghavan
9f42570e33 POWER: Increase macro size limit for AIX
This patch increases the macro size limit from 4096 to 16384 to
allow compiling larger assembly files in AIX.
Tested with GCC and IBM Open XL C.
2023-10-12 12:37:40 -05:00
Martin Kroeker
9f49aef91b Merge pull request #4255 from RajalakshmiSR/AIX-P10
POWER10: Fix compilation issues with Open XL C
2023-10-12 18:59:17 +02:00
Martin Kroeker
e7d05402e0 Fix up S/D GEMM copy function definitions after #4009 2023-10-12 14:24:53 +02:00
Rajalakshmi Srinivasaraghavan
71d733e5f7 POWER: Avoid m4 conversions for C files
This patch removes intermediate m4 conversions used in sbgemm
compilation as it is not needed for .c files.
Tested on AIX with gcc and IBM Open XL C.
2023-10-11 17:18:42 -05:00
Rajalakshmi Srinivasaraghavan
82fc29a57a POWER10: Fallback to POWER8 functions
As cgemm and zgemm kernels are not optimized for big endian falling
back to POWER8 versions.  Tested on AIX using gcc and Open XL C.
2023-10-11 17:04:42 -05:00
Rajalakshmi Srinivasaraghavan
db0805906b powerpc: Fix build errors with Open XL C
This patch fixes errors when using Open XL C compiler on AIX.
Tested with gcc/xlf and ibm-clang/xlf compiler combinations.
2023-10-04 14:04:03 -05:00
Martin Kroeker
675cd551da fix improper function prototypes (empty parentheses) 2023-09-30 12:56:38 +02:00
gxw
d15e0a055c LoongArch64: Fixed compilation issues when enable DYNAMIC_ARCH 2023-09-27 10:05:27 +08:00
gxw
4670eb1462 LoongArch64: Add dtrsm kernel 2023-09-26 15:45:14 +08:00
gxw
f2cf929374 LoongArch64: Add sgemv kernel 2023-09-04 14:28:37 +08:00
Martin Kroeker
8e6d93359d Merge pull request #4196 from TiborGY/obsolete_inlines
Modernize obsolete inline order
2023-09-03 14:12:42 +02:00
gxw
394a1fd1bf LoongArch64: Compatible with early internal toolchain
__loongarch_grlen and __loongarch_frlen were introduced in gcc version 8.3.0
(Loongnix 8.3.0-6.lnd.vec.31) internally within Loongson to standardize the
general and floating-point register widths. However, previous versions did
not have them, requiring additional checks to be added.
2023-08-31 16:55:29 +08:00
Martin Kroeker
9c4ae4d4fb Merge pull request #4206 from martin-frbg/issue4201-2
Work around miscompilation of zdot_thunderx2t99 by the current NVIDIA HPC compiler
2023-08-26 10:17:27 +02:00
Martin Kroeker
88435104c8 Merge pull request #4204 from martin-frbg/llvm17-2
Work around LLVM17 miscompiling the AVX512 microkernels for CASUM/ZASUM
2023-08-26 00:32:18 +02:00
Martin Kroeker
fc8894dd98 Workaround miscompilation by NVIDIA nvc 2023-08-26 00:30:17 +02:00
Martin Kroeker
7a6203ffa1 restore default Neoverse SVE build instructions for non-NVIDIA compilers 2023-08-25 18:25:51 +02:00
Martin Kroeker
2c3034ff7f Disable the C/ZASUM AVX512 microkernels when compiling with LLVM17 as well 2023-08-25 17:22:51 +02:00
Martin Kroeker
8794544b43 Add support for compiling the Neoverse SVE kernels with the NVIDIA HPC compiler 2023-08-25 16:47:32 +02:00
gxw
553cc1372f LoongArch64: Add sgemm_kernel 2023-08-23 16:08:43 +08:00
Martin Kroeker
12ede72ab7 Merge pull request #4192 from imciner2/im/clangfix
Fix cooperlake and sapphire rapids march flags on clang
2023-08-21 15:46:35 +02:00
Ian McInerney
79c15db348 Fix power10 gcc intrinsic check
__builtin_vsx_assemble_pair was only in GCC 10-11.2 and was replaced by
__builtin_vsx_build_pair thereafter.
2023-08-17 15:05:29 +01:00
TGY
b5ba95a6c0 Modernize obsolete inline order 2023-08-16 00:48:40 +02:00
Ian McInerney
8a8a8479be Fix cooperlake and sapphire rapids march flags on clang
The march=cooperlake and march=sapphirerapids flags were never getting
added when building with Clang targetting those architectures. Instead
it was falling back to the skylake AVX512 implementation.

Clang added support for these two architectures in Clang 9 and Clang 12,
so introduce new checks for those versions to enable the appropriate
march flag, and fallback to skylake otherwise.
2023-08-14 16:12:35 +01:00
Martin Kroeker
34da1a067d Allow negative INCX (API change from version 3.10 of the reference implementation) 2023-08-10 17:01:50 +02:00
Martin Kroeker
07e32c4cb8 Allow negative INCX (API change from version 3.10 of the reference implementation) 2023-08-10 17:00:18 +02:00
Martin Kroeker
c211da0688 Allow negative INCX (API change from version 3.10 of the reference implementation) 2023-08-10 16:58:57 +02:00
Martin Kroeker
a34a0a7abc Allow negative INCX (API change from version 3.10 of the reference implementation) 2023-08-10 16:56:52 +02:00
Martin Kroeker
54d3246fc6 Allow negative INCX (API change from version 3.10 of the reference implementation) 2023-08-10 16:55:17 +02:00
Martin Kroeker
7dd441d5db Allow negative INCX (API change from version 3.10 of the reference implementation) 2023-08-10 16:53:33 +02:00
Martin Kroeker
f692178792 Allow negative INCX (API change from version 3.10 of the reference implementation) 2023-08-10 16:52:09 +02:00
Martin Kroeker
d15ffb7fdf Allow negative INCX (API change from version 3.10 of the reference implementation) 2023-08-10 16:50:44 +02:00
Martin Kroeker
a2d867f4d1 Allow negative iNCX (API change from version 3.10 of the reference implementation) 2023-08-10 16:49:05 +02:00
Martin Kroeker
afdc56a421 Merge pull request #4158 from XiWeiGu/loongarch64_update_dgemm_kernel
LoongArch64: Update dgemm kernel
2023-08-07 12:44:09 +02:00
gxw
e8b571d245 LoongArch64: Add dgemv_t_8_lasx.S and dgemv_n_8_lasx.S V2 2023-08-07 11:20:42 +08:00
gxw
71fcee6eef LoongArch64: Update dgemm kernel 2023-08-07 11:06:52 +08:00
Martin Kroeker
0f521ece25 Merge pull request #4183 from martin-frbg/issue4181
Apply USE_TRMM to MIPS64_GENERIC as to GENERIC in gmake builds
2023-08-06 18:59:50 +02:00
Martin Kroeker
41c31bc1d4 Revert "LoongArch64: Add dgemv_t_8_lasx.S and dgemv_n_8_lasx.S" 2023-08-06 16:00:03 +02:00
Martin Kroeker
61d803547a Apply USE_TRMM to MIPS64_GENERIC as to GENERIC 2023-08-06 15:17:38 +02:00
Martin Kroeker
f8ee309402 Merge pull request #4153 from XiWeiGu/dgemv
LoongArch64: Add dgemv_t_8_lasx.S and dgemv_n_8_lasx.S
2023-08-06 08:49:16 +02:00
gxw
ec1e96aac8 LoongArch64: Add dgemv_t_8_lasx.S and dgemv_n_8_lasx.S 2023-08-05 10:24:17 +08:00
gxw
d46772e037 LoongArch64: Add compiler feature checks 2023-08-05 10:21:43 +08:00
Martin Kroeker
4664b57e6e use shortcut only when both incx and incy are zero 2023-08-04 12:25:34 +02:00
Martin Kroeker
09131f79a6 Merge pull request #4164 from martin-frbg/issue4162
Enable use of AVX512 microkernels with NVIDIA HPC from version 22.3
2023-07-29 15:07:20 +02:00