2343 Commits

Author SHA1 Message Date
guxiwei e771be185e Optimize copy functions with lsx.
Signed-off-by: Hao Chen <chenhao@loongson.cn>
2023-12-29 17:30:57 +08:00
Hao Chen 179ed51d3b Add dgemm_kernel_8x4.S file. 2023-12-29 17:30:57 +08:00
Hao Chen 173a65d4e6 loongarch64: Add and refine iamax optimization functions. 2023-12-29 17:30:57 +08:00
zhoupeng ea70e165c7 loongarch64: Refine rot optimization. 2023-12-29 17:30:57 +08:00
zhoupeng 116aee7527 loongarch64: Refine imin optimization. 2023-12-29 17:30:57 +08:00
zhoupeng 8be2654193 loongarch64: Refine imax optimization. 2023-12-29 17:30:57 +08:00
zhoupeng 154baad454 loongarch64: Refine iamin optimization. 2023-12-29 17:30:57 +08:00
Shiyou Yin 36c12c4971 loongarch64: Refine copy,swap,nrm2,sum optimization. 2023-12-29 17:30:57 +08:00
Shiyou Yin c6996a80e9 loongarch64: Refine amax,amin,max,min optimization. 2023-12-29 17:30:57 +08:00
Chris Sidebottom ecae1389df Reduce duplication in kernel definitions
These files are exactly the same, so I believe we can reduce these files
down. Other files require a slightly more complex unpicking.
2023-12-23 12:39:53 +00:00
Chris Sidebottom 60e66725e4 Use numeric labels to allow repeated inlining 2023-12-19 13:11:06 +00:00
Chris Sidebottom 7a4fef4f60 Tweak SVE dot kernel
This changes the SVE dot kernel to only predicate when necessary as well
as streamlining the assembly a bit. The benchmarks seem to indicate this
can improve performance by ~33%.
2023-12-19 12:08:54 +00:00
Martin Kroeker f06b535566 Use C kernel for dgemv_t due to limitations of the old assembly one 2023-12-15 09:58:44 +01:00
barracuda156 d9653af018 KERNEL.PPC970, KERNEL.PPCG4: unbreak CMake parsing
Fixes: https://github.com/OpenMathLib/OpenBLAS/issues/4366
2023-12-14 12:00:11 +08:00
Chip-Kerchner 93747fb377 Merge remote-tracking branch 'origin/develop' into power10Copies 2023-12-12 09:32:49 -06:00
Chip-Kerchner 4e738e561a Replace two vector loads with one vector pair load and fix endianess of stores. 2023-12-08 12:36:08 -06:00
yancheng d32f38fb37 loongarch64: Add optimizations for nrm2. 2023-12-07 14:36:26 +08:00
yancheng f9b468990e loongarch64: Add optimizations for rot. 2023-12-07 14:36:26 +08:00
yancheng c80e7e27d1 loongarch64: Add optimizations for sum and asum. 2023-12-07 14:36:26 +08:00
yancheng d4c96a35a8 loongarch64: Add optimizations for axpy and axpby. 2023-12-07 14:36:26 +08:00
yancheng 360acc0a41 loongarch64: Add optimizations for swap. 2023-12-07 14:36:26 +08:00
yancheng 174c25766b loongarch64: Add optimizations for copy. 2023-12-07 14:36:26 +08:00
yancheng 49829b2b7d loongarch64: Add optimizations for iamin. 2023-12-07 14:36:07 +08:00
yancheng be83f5e4e0 loongarch64: Add optimizations for iamax. 2023-12-07 14:36:07 +08:00
yancheng e3fb2b5afa loongarch64: Add optimizations for imin. 2023-12-07 14:36:07 +08:00
yancheng e46b48e372 loongarch64: Add optimizations for imax. 2023-12-07 14:36:07 +08:00
yancheng 702fc1d56d loongarch64: Add optimization for min. 2023-12-07 14:36:07 +08:00
yancheng 346b384d1c loongarch64: Add optimization for max. 2023-12-07 14:36:07 +08:00
yancheng ff2ecc6cda loongarch64: Add optimization for amin. 2023-12-07 14:36:07 +08:00
yancheng 265b5f2e80 loongarch64: Add optimizations for amax. 2023-12-07 14:36:07 +08:00
yancheng 993ede7c70 loongarch64: Add optimizations for scal. 2023-12-07 14:36:07 +08:00
Octavian Maghiar 4a12cf53ec [RISC-V] Improve RVV kernel generator LMUL usage
The RVV kernel generation script uses the provided LMUL to increase the number of accumulator registers.
Since the effect of the LMUL is to group together the vector registers into larger ones, it actually should be used as a multiplier in the calculation of vlenmax.
At the moment, no matter what LMUL is provided, the generated kernels would only set the maximum number of vector elements equal to VLEN/SEW.
Commit changes the use of LMUL to properly adjust vlenmax. Note that an increase in LMUL results in a decrease in the number of effective vector registers.
2023-12-04 11:13:35 +00:00
Octavian Maghiar e4586e81b8 [RISC-V] Add RISC-V Vector 128-bit target
Current RVV x280 target depends on vlen=512-bits for Level 3 operations.
Commit adds generic target that supports vlen=128-bits.
New target uses the same scalable kernels as x280 for Level 1&2 operations, and autogenerated kernels for Level 3 operations.
Functional correctness of Level 3 operations tested on vlen=128-bits using QEMU v8.1.1 for ctests and BLAS-Tester.
2023-12-04 11:02:18 +00:00
Martin Kroeker 39bf8ece20 Merge pull request #4340 from yinshiyou/la-dev
Add some refines and optimizations for LoongArch.
2023-11-29 08:22:25 +01:00
Shiyou Yin 9fe07d82fd loongarch: Add LSX optimization for dot. 2023-11-28 20:24:18 +08:00
Shiyou Yin 13b8c44b44 loongarch: Add optimization for dsdot kernel. 2023-11-28 20:24:16 +08:00
Shiyou Yin 3def6a8143 loongarch: Add LASX optimization for dot. 2023-11-28 20:24:14 +08:00
Bart Oldeman c34e2cf380 Use _mm_set1_epi{32,64x} to init mask in x86-64 [cz]asum
for skylake kernels. This is the same method as used in [sd]asum.
_mm_set1_epi64x was commented out for zasum, but has the advantage
of avoiding possible undefined behaviour (using an uninitialized
variable), optimized out by NVHPC and icx. The new code works
fine with those compilers.

For GCC 12.3 the generated code is identical; no matter what method
you use, the compiler optimizes the code into a compile-time
constant, there is no performance benefit using mm_cmpeq_epi8
since the corresponding instruction (VPCMPEQB) isn't actually
generated!
2023-11-19 21:28:35 +00:00
Martin Kroeker 22aa401656 Temporarily disable the AVX512 CASUM/ZASUM microkernels for any version of NVIDIA HPC (#4327)
* Temporarily disable the C/ZASUM microkernels for any version of NVHPC
2023-11-19 00:04:31 +01:00
Bart Oldeman f8ad5344c2 Fix casum fallback kernel.
This kernel is only used on Skylake+ if the kernel with AVX512
intrinsics can't be used, but used the variable x1 incorrectly
in the tail end of the loop, as it is still at the initial
value instead of where x points to.

This caused 55 "other error"s in the LAPACK tests
(https://github.com/OpenMathLib/OpenBLAS/issues/4282)

This change makes casum.c as similar as possible as zasum.c,
because zasum.c does this correctly.
2023-11-17 23:53:56 +00:00
Martin Kroeker 04bc801999 (Re)apply fixes for supporting only a subset of precision types from PR 3915 2023-11-04 23:48:59 +01:00
Martin Kroeker 9019bc4945 Use SkylakeX ?ASUM microkernel for Cooperlake/Sapphirerapids as well 2023-11-04 22:10:06 +01:00
Martin Kroeker 3bfa4d4dcc Fix outdated SVE kernel definitions for Cortex cpus by aliasing to ARMV8SVE 2023-11-03 14:55:31 +01:00
Rajalakshmi Srinivasaraghavan 980f702f72 POWER: AIX: Make use of power10 optimization
POWER10 optimizations are disabled when using default AIX assembler.
As we have fixed many issues recently, enabling optimization path
for default assembler.
2023-10-19 18:48:19 -05:00
Rajalakshmi Srinivasaraghavan 9f42570e33 POWER: Increase macro size limit for AIX
This patch increases the macro size limit from 4096 to 16384 to
allow compiling larger assembly files in AIX.
Tested with GCC and IBM Open XL C.
2023-10-12 12:37:40 -05:00
Martin Kroeker 9f49aef91b Merge pull request #4255 from RajalakshmiSR/AIX-P10
POWER10: Fix compilation issues with Open XL C
2023-10-12 18:59:17 +02:00
Martin Kroeker e7d05402e0 Fix up S/D GEMM copy function definitions after #4009 2023-10-12 14:24:53 +02:00
Rajalakshmi Srinivasaraghavan 71d733e5f7 POWER: Avoid m4 conversions for C files
This patch removes intermediate m4 conversions used in sbgemm
compilation as it is not needed for .c files.
Tested on AIX with gcc and IBM Open XL C.
2023-10-11 17:18:42 -05:00
Rajalakshmi Srinivasaraghavan 82fc29a57a POWER10: Fallback to POWER8 functions
As cgemm and zgemm kernels are not optimized for big endian falling
back to POWER8 versions.  Tested on AIX using gcc and Open XL C.
2023-10-11 17:04:42 -05:00
Rajalakshmi Srinivasaraghavan db0805906b powerpc: Fix build errors with Open XL C
This patch fixes errors when using Open XL C compiler on AIX.
Tested with gcc/xlf and ibm-clang/xlf compiler combinations.
2023-10-04 14:04:03 -05:00