yancheng
993ede7c70
loongarch64: Add optimizations for scal.
2023-12-07 14:36:07 +08:00
Octavian Maghiar
4a12cf53ec
[RISC-V] Improve RVV kernel generator LMUL usage
...
The RVV kernel generation script uses the provided LMUL to increase the number of accumulator registers.
Since the effect of the LMUL is to group together the vector registers into larger ones, it actually should be used as a multiplier in the calculation of vlenmax.
At the moment, no matter what LMUL is provided, the generated kernels would only set the maximum number of vector elements equal to VLEN/SEW.
Commit changes the use of LMUL to properly adjust vlenmax. Note that an increase in LMUL results in a decrease in the number of effective vector registers.
2023-12-04 11:13:35 +00:00
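The relationship described in the commit above can be sketched in a few lines of C. This is only an illustration of the arithmetic, not code from the kernel generator; the function names `vlenmax` and `effective_vregs` are hypothetical, and the count of 32 architectural vector registers comes from the RVV specification rather than from the commit message.

```c
/* Maximum number of elements that one vector register group can hold.
   vlen_bits: hardware vector register width in bits (VLEN)
   sew_bits:  selected element width in bits (SEW)
   lmul:      register grouping factor (1, 2, 4 or 8)
   Hypothetical helper, not part of the generator script. */
unsigned vlenmax(unsigned vlen_bits, unsigned sew_bits, unsigned lmul)
{
    return vlen_bits / sew_bits * lmul;
}

/* RVV provides 32 architectural vector registers; grouping them by LMUL
   leaves 32 / LMUL usable register groups. */
unsigned effective_vregs(unsigned lmul)
{
    return 32u / lmul;
}
```

For VLEN=128 and SEW=64, `vlenmax` gives 2 elements at LMUL=1 and 8 elements at LMUL=4, while `effective_vregs` drops from 32 to 8 over the same range.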
Octavian Maghiar
e4586e81b8
[RISC-V] Add RISC-V Vector 128-bit target
...
Current RVV x280 target depends on vlen=512-bits for Level 3 operations.
Commit adds generic target that supports vlen=128-bits.
New target uses the same scalable kernels as x280 for Level 1&2 operations, and autogenerated kernels for Level 3 operations.
Functional correctness of Level 3 operations tested on vlen=128-bits using QEMU v8.1.1 for ctests and BLAS-Tester.
2023-12-04 11:02:18 +00:00
Martin Kroeker
39bf8ece20
Merge pull request #4340 from yinshiyou/la-dev
...
Add some refines and optimizations for LoongArch.
2023-11-29 08:22:25 +01:00
Shiyou Yin
9fe07d82fd
loongarch: Add LSX optimization for dot.
2023-11-28 20:24:18 +08:00
Shiyou Yin
13b8c44b44
loongarch: Add optimization for dsdot kernel.
2023-11-28 20:24:16 +08:00
Shiyou Yin
3def6a8143
loongarch: Add LASX optimization for dot.
2023-11-28 20:24:14 +08:00
Bart Oldeman
c34e2cf380
Use _mm_set1_epi{32,64x} to init mask in x86-64 [cz]asum
...
for skylake kernels. This is the same method as used in [sd]asum.
_mm_set1_epi64x was commented out for zasum, but has the advantage
of avoiding possible undefined behaviour (using an uninitialized
variable), optimized out by NVHPC and icx. The new code works
fine with those compilers.
For GCC 12.3 the generated code is identical; no matter what method
you use, the compiler optimizes the code into a compile-time
constant, so there is no performance benefit to using _mm_cmpeq_epi8
since the corresponding instruction (VPCMPEQB) isn't actually
generated!
2023-11-19 21:28:35 +00:00
Martin Kroeker
22aa401656
Temporarily disable the AVX512 CASUM/ZASUM microkernels for any version of NVIDIA HPC (#4327)
...
* Temporarily disable the C/ZASUM microkernels for any version of NVHPC
2023-11-19 00:04:31 +01:00
Bart Oldeman
f8ad5344c2
Fix casum fallback kernel.
...
This kernel is only used on Skylake+ if the kernel with AVX512
intrinsics can't be used, but used the variable x1 incorrectly
in the tail end of the loop, as it is still at the initial
value instead of where x points to.
This caused 55 "other error"s in the LAPACK tests
(https://github.com/OpenMathLib/OpenBLAS/issues/4282)
This change makes casum.c as similar as possible to zasum.c,
because zasum.c does this correctly.
2023-11-17 23:53:56 +00:00
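The class of bug described in the casum fallback commit above can be illustrated with a simplified sketch. This is not the actual casum.c code: the function name `casum_sketch`, the unroll factor of four complex elements, and the cursor name `p` are assumptions made for the example. The point is only that the tail loop has to continue from the advanced cursor rather than from a pointer that still holds its initial value.

```c
#include <stddef.h>

/* Absolute value without depending on libm. */
static float abs_f(float v) { return v < 0.0f ? -v : v; }

/* Sum of |re| + |im| over n single-precision complex numbers stored
   contiguously as (re, im) pairs. Hypothetical, simplified kernel. */
float casum_sketch(size_t n, const float *x)
{
    const float *p = x;       /* cursor advanced by the main loop */
    size_t blocks = n / 4;    /* unrolled part: 4 complex numbers per step */
    size_t tail = n % 4;      /* leftover complex numbers */
    float sum = 0.0f;

    for (size_t b = 0; b < blocks; b++) {
        for (size_t k = 0; k < 8; k++)
            sum += abs_f(p[k]);
        p += 8;
    }

    /* The tail must read from the advanced cursor p. Reading from the
       initial pointer x here would sum the first elements a second time
       and skip the last ones. */
    for (size_t i = 0; i < tail; i++)
        sum += abs_f(p[2 * i]) + abs_f(p[2 * i + 1]);

    return sum;
}
```

With five complex elements, one element lands in the tail, so an input whose last pair differs from its first pair exposes a stale-pointer mistake immediately.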
Martin Kroeker
04bc801999
(Re)apply fixes for supporting only a subset of precision types from PR 3915
2023-11-04 23:48:59 +01:00
Martin Kroeker
9019bc4945
Use SkylakeX ?ASUM microkernel for Cooperlake/Sapphirerapids as well
2023-11-04 22:10:06 +01:00
Martin Kroeker
3bfa4d4dcc
Fix outdated SVE kernel definitions for Cortex cpus by aliasing to ARMV8SVE
2023-11-03 14:55:31 +01:00
Rajalakshmi Srinivasaraghavan
980f702f72
POWER: AIX: Make use of power10 optimization
...
POWER10 optimizations are disabled when using default AIX assembler.
As we have fixed many issues recently, enabling optimization path
for default assembler.
2023-10-19 18:48:19 -05:00
Rajalakshmi Srinivasaraghavan
9f42570e33
POWER: Increase macro size limit for AIX
...
This patch increases the macro size limit from 4096 to 16384 to
allow compiling larger assembly files in AIX.
Tested with GCC and IBM Open XL C.
2023-10-12 12:37:40 -05:00
Martin Kroeker
9f49aef91b
Merge pull request #4255 from RajalakshmiSR/AIX-P10
...
POWER10: Fix compilation issues with Open XL C
2023-10-12 18:59:17 +02:00
Martin Kroeker
e7d05402e0
Fix up S/D GEMM copy function definitions after #4009
2023-10-12 14:24:53 +02:00
Rajalakshmi Srinivasaraghavan
71d733e5f7
POWER: Avoid m4 conversions for C files
...
This patch removes intermediate m4 conversions used in sbgemm
compilation as it is not needed for .c files.
Tested on AIX with gcc and IBM Open XL C.
2023-10-11 17:18:42 -05:00
Rajalakshmi Srinivasaraghavan
82fc29a57a
POWER10: Fallback to POWER8 functions
...
As the cgemm and zgemm kernels are not optimized for big endian, fall
back to the POWER8 versions. Tested on AIX using gcc and Open XL C.
2023-10-11 17:04:42 -05:00
Rajalakshmi Srinivasaraghavan
db0805906b
powerpc: Fix build errors with Open XL C
...
This patch fixes errors when using Open XL C compiler on AIX.
Tested with gcc/xlf and ibm-clang/xlf compiler combinations.
2023-10-04 14:04:03 -05:00
Martin Kroeker
675cd551da
fix improper function prototypes (empty parentheses)
2023-09-30 12:56:38 +02:00
gxw
d15e0a055c
LoongArch64: Fixed compilation issues when DYNAMIC_ARCH is enabled
2023-09-27 10:05:27 +08:00
gxw
4670eb1462
LoongArch64: Add dtrsm kernel
2023-09-26 15:45:14 +08:00
gxw
f2cf929374
LoongArch64: Add sgemv kernel
2023-09-04 14:28:37 +08:00
Martin Kroeker
8e6d93359d
Merge pull request #4196 from TiborGY/obsolete_inlines
...
Modernize obsolete inline order
2023-09-03 14:12:42 +02:00
gxw
394a1fd1bf
LoongArch64: Compatible with early internal toolchain
...
__loongarch_grlen and __loongarch_frlen were introduced in gcc version 8.3.0
(Loongnix 8.3.0-6.lnd.vec.31) internally within Loongson to standardize the
general and floating-point register widths. However, previous versions did
not have them, requiring additional checks to be added.
2023-08-31 16:55:29 +08:00
Martin Kroeker
9c4ae4d4fb
Merge pull request #4206 from martin-frbg/issue4201-2
...
Work around miscompilation of zdot_thunderx2t99 by the current NVIDIA HPC compiler
2023-08-26 10:17:27 +02:00
Martin Kroeker
88435104c8
Merge pull request #4204 from martin-frbg/llvm17-2
...
Work around LLVM17 miscompiling the AVX512 microkernels for CASUM/ZASUM
2023-08-26 00:32:18 +02:00
Martin Kroeker
fc8894dd98
Workaround miscompilation by NVIDIA nvc
2023-08-26 00:30:17 +02:00
Martin Kroeker
7a6203ffa1
restore default Neoverse SVE build instructions for non-NVIDIA compilers
2023-08-25 18:25:51 +02:00
Martin Kroeker
2c3034ff7f
Disable the C/ZASUM AVX512 microkernels when compiling with LLVM17 as well
2023-08-25 17:22:51 +02:00
Martin Kroeker
8794544b43
Add support for compiling the Neoverse SVE kernels with the NVIDIA HPC compiler
2023-08-25 16:47:32 +02:00
gxw
553cc1372f
LoongArch64: Add sgemm_kernel
2023-08-23 16:08:43 +08:00
Martin Kroeker
12ede72ab7
Merge pull request #4192 from imciner2/im/clangfix
...
Fix cooperlake and sapphire rapids march flags on clang
2023-08-21 15:46:35 +02:00
Ian McInerney
79c15db348
Fix power10 gcc intrinsic check
...
__builtin_vsx_assemble_pair was only in GCC 10-11.2 and was replaced by
__builtin_vsx_build_pair thereafter.
2023-08-17 15:05:29 +01:00
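One possible reading of the version range in the commit above, written as a small C predicate. The function name `has_assemble_pair_name` is hypothetical, and treating GCC 11.2 as the last version with the old name is an interpretation of "GCC 10-11.2" in the commit message, not something verified against the compiler sources.

```c
/* Returns 1 when, according to the commit message, the compiler is
   expected to provide the older __builtin_vsx_assemble_pair name, and 0
   when __builtin_vsx_build_pair should be used instead.
   Assumption: "GCC 10-11.2" is read as inclusive of 11.2. */
int has_assemble_pair_name(int gcc_major, int gcc_minor)
{
    if (gcc_major == 10)
        return 1;
    if (gcc_major == 11 && gcc_minor <= 2)
        return 1;
    return 0;
}
```

In a real build the same condition would more likely be expressed with the `__GNUC__` and `__GNUC_MINOR__` preprocessor macros so that the right builtin name is chosen at compile time.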
TGY
b5ba95a6c0
Modernize obsolete inline order
2023-08-16 00:48:40 +02:00
Ian McInerney
8a8a8479be
Fix cooperlake and sapphire rapids march flags on clang
...
The march=cooperlake and march=sapphirerapids flags were never getting
added when building with Clang targeting those architectures. Instead
it was falling back to the skylake AVX512 implementation.
Clang added support for these two architectures in Clang 9 and Clang 12,
so introduce new checks for those versions to enable the appropriate
march flag, and fall back to skylake otherwise.
2023-08-14 16:12:35 +01:00
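The selection logic described in the commit above might look roughly like the following when written out as a C function. The real check lives in the build system, so this is only a sketch: the function name `clang_march_flag` and the target strings are hypothetical, and the pairing of Clang 9 with Cooper Lake and Clang 12 with Sapphire Rapids assumes the two versions are listed in the same order as the two architectures.

```c
#include <string.h>

/* Pick the -march flag for a Clang build.
   target:      "COOPERLAKE" or "SAPPHIRERAPIDS" (hypothetical spellings)
   clang_major: major version of the Clang compiler in use
   Falls back to the Skylake AVX512 flag when the compiler is too old. */
const char *clang_march_flag(const char *target, int clang_major)
{
    if (strcmp(target, "COOPERLAKE") == 0 && clang_major >= 9)
        return "-march=cooperlake";
    if (strcmp(target, "SAPPHIRERAPIDS") == 0 && clang_major >= 12)
        return "-march=sapphirerapids";
    return "-march=skylake-avx512";
}
```

A Sapphire Rapids build with Clang 11 would therefore still get `-march=skylake-avx512`, which matches the fallback behaviour the commit message describes.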
Martin Kroeker
34da1a067d
Allow negative INCX (API change from version 3.10 of the reference implementation)
2023-08-10 17:01:50 +02:00
Martin Kroeker
07e32c4cb8
Allow negative INCX (API change from version 3.10 of the reference implementation)
2023-08-10 17:00:18 +02:00
Martin Kroeker
c211da0688
Allow negative INCX (API change from version 3.10 of the reference implementation)
2023-08-10 16:58:57 +02:00
Martin Kroeker
a34a0a7abc
Allow negative INCX (API change from version 3.10 of the reference implementation)
2023-08-10 16:56:52 +02:00
Martin Kroeker
54d3246fc6
Allow negative INCX (API change from version 3.10 of the reference implementation)
2023-08-10 16:55:17 +02:00
Martin Kroeker
7dd441d5db
Allow negative INCX (API change from version 3.10 of the reference implementation)
2023-08-10 16:53:33 +02:00
Martin Kroeker
f692178792
Allow negative INCX (API change from version 3.10 of the reference implementation)
2023-08-10 16:52:09 +02:00
Martin Kroeker
d15ffb7fdf
Allow negative INCX (API change from version 3.10 of the reference implementation)
2023-08-10 16:50:44 +02:00
Martin Kroeker
a2d867f4d1
Allow negative INCX (API change from version 3.10 of the reference implementation)
2023-08-10 16:49:05 +02:00
gxw
4d0f000db6
MIPS: Enable MSA
2023-08-07 21:00:10 +08:00
Martin Kroeker
afdc56a421
Merge pull request #4158 from XiWeiGu/loongarch64_update_dgemm_kernel
...
LoongArch64: Update dgemm kernel
2023-08-07 12:44:09 +02:00
gxw
e8b571d245
LoongArch64: Add dgemv_t_8_lasx.S and dgemv_n_8_lasx.S V2
2023-08-07 11:20:42 +08:00
gxw
71fcee6eef
LoongArch64: Update dgemm kernel
2023-08-07 11:06:52 +08:00
Martin Kroeker
0f521ece25
Merge pull request #4183 from martin-frbg/issue4181
...
Apply USE_TRMM to MIPS64_GENERIC as to GENERIC in gmake builds
2023-08-06 18:59:50 +02:00
Martin Kroeker
41c31bc1d4
Revert "LoongArch64: Add dgemv_t_8_lasx.S and dgemv_n_8_lasx.S"
2023-08-06 16:00:03 +02:00
Martin Kroeker
61d803547a
Apply USE_TRMM to MIPS64_GENERIC as to GENERIC
2023-08-06 15:17:38 +02:00
Martin Kroeker
f8ee309402
Merge pull request #4153 from XiWeiGu/dgemv
...
LoongArch64: Add dgemv_t_8_lasx.S and dgemv_n_8_lasx.S
2023-08-06 08:49:16 +02:00
gxw
ec1e96aac8
LoongArch64: Add dgemv_t_8_lasx.S and dgemv_n_8_lasx.S
2023-08-05 10:24:17 +08:00
gxw
d46772e037
LoongArch64: Add compiler feature checks
2023-08-05 10:21:43 +08:00
Martin Kroeker
4664b57e6e
use shortcut only when both incx and incy are zero
2023-08-04 12:25:34 +02:00
Martin Kroeker
09131f79a6
Merge pull request #4164 from martin-frbg/issue4162
...
Enable use of AVX512 microkernels with NVIDIA HPC from version 22.3
2023-07-29 15:07:20 +02:00
Martin Kroeker
6a428b5629
Update casum_microk_skylakex-2.c
2023-07-29 12:24:30 +02:00
Martin Kroeker
ebb447e32e
Update zasum_microk_skylakex-2.c
2023-07-29 12:23:57 +02:00
Martin Kroeker
9f6847583a
nvc currently miscompiles this, hopefully fixed in release 23.09
2023-07-29 11:50:16 +02:00
Martin Kroeker
fe54ee3d15
nvc currently miscompiles this, hopefully fixed in release 23.09
2023-07-29 11:48:38 +02:00
Martin Kroeker
5720fa02c5
Merge pull request #4168 from Mousius/sve-zgemm-cgemm
...
Use SVE zgemm/cgemm on Arm(R) Neoverse(TM) V1 core
2023-07-27 17:41:45 +02:00
Chris Sidebottom
84a268b6ca
Use SVE zgemm/cgemm on Arm(R) Neoverse(TM) V1 core
...
This patch removes the prefetches from cgemm/zgemm, which improves performance in the same way as the sgemm/dgemm change in #3868, so I'm happy to enable this on any applicable cores.
I also replicated the unrolling of the copies from sgemm and dgemm.
2023-07-27 14:12:20 +01:00
Chris Sidebottom
730ca04b48
Fix ZHEMM copy for SVE
...
Whilst disambiguating whilelt, I inadvertently used the wrong datatype
for offsets, which can be negative. This rectifies that.
2023-07-27 13:27:28 +01:00
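Why the datatype matters for offsets that can be negative is easy to show in plain C. The sketch below assumes the wrong type was an unsigned 64-bit integer, which the commit message does not state explicitly; both function names are hypothetical.

```c
#include <stdint.h>

/* Wrong (assumed): the offset is carried in an unsigned type, so a
   negative value wraps to a huge positive one and the test succeeds. */
int offset_is_positive_unsigned(int64_t offset)
{
    uint64_t u = (uint64_t)offset;
    return u > 0;
}

/* Right: the offset keeps its sign, so negative values are rejected. */
int offset_is_positive_signed(int64_t offset)
{
    return offset > 0;
}
```

For an offset of -1 the unsigned variant reports "positive" while the signed variant does not, which is the kind of divergence that would steer a copy routine down the wrong branch.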
Martin Kroeker
2a62d2df96
Enable use of AVX512 microkernels with NVIDIA HPC from version 22.3
2023-07-26 19:39:11 +02:00
Martin Kroeker
849c8806b8
Merge pull request #4161 from Mousius/non-sve-kernels
...
Use latest non-SVE kernels in ARMV8SVE
2023-07-26 15:49:40 +02:00
Chris Sidebottom
24586bc4ff
Disambiguate whilelt
2023-07-25 20:15:44 +01:00
Chris Sidebottom
aea2a4622b
Use latest non-SVE kernels in ARMV8SVE
...
These are generally better and, in some cases, include threading which helps in the cores we're targeting here.
2023-07-25 14:12:26 +01:00
Octavian Maghiar
826a9d5fa4
Adds tail undisturbed for RVV Level 2 operations
...
During the last iteration of some RVV operations, accumulators can get overwritten when VL < VLMAX and tail policy is agnostic.
Commit changes intrinsics tail policy to undisturbed.
2023-07-25 11:36:23 +01:00
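The effect of the tail policy on an accumulator can be modelled without any vector hardware. The following is a toy scalar model, not RVV intrinsics code: `VLMAX`, `vadd_model` and `sum_model` are hypothetical names, and the sentinel value -1.0 merely stands in for whatever a tail-agnostic implementation is allowed to write into the tail elements.

```c
#include <stddef.h>

#define VLMAX 4  /* elements per vector register in this toy model */

/* Model of one vector accumulate: acc[i] += x[i] for i < vl.
   With tail_undisturbed set, acc[vl..VLMAX-1] keeps its old contents.
   Otherwise the tail is clobbered with a sentinel, as a tail-agnostic
   implementation is permitted to do. */
void vadd_model(double acc[VLMAX], const double *x, size_t vl,
                int tail_undisturbed)
{
    for (size_t i = 0; i < vl; i++)
        acc[i] += x[i];
    if (!tail_undisturbed)
        for (size_t i = vl; i < VLMAX; i++)
            acc[i] = -1.0;
}

/* Strip-mined sum: accumulate per lane, then reduce once at the end. */
double sum_model(const double *x, size_t n, int tail_undisturbed)
{
    double acc[VLMAX] = {0.0, 0.0, 0.0, 0.0};
    for (size_t i = 0; i < n; i += VLMAX) {
        size_t vl = (n - i < VLMAX) ? n - i : VLMAX;
        vadd_model(acc, x + i, vl, tail_undisturbed);
    }
    double total = 0.0;
    for (size_t i = 0; i < VLMAX; i++)
        total += acc[i];
    return total;
}
```

Summing six ones gives 6 with the undisturbed policy, but the last iteration runs with VL = 2 < VLMAX, so the agnostic model loses the partial sums held in the two tail lanes.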
martin-frbg
7976deff80
Fix file permissions (issue 4095)
2023-07-23 20:37:07 +02:00
Octavian Maghiar
8df0289db6
Adds tail undisturbed for RVV Level 1 operations
...
During the last iteration of some RVV operations, accumulators can get overwritten when VL < VLMAX and tail policy is agnostic.
Commit changes intrinsics tail policy to undisturbed.
2023-07-20 15:28:35 +01:00
Martin Kroeker
76ef1672f8
Override DSDOT with generic code to get rid of qemu precision error
2023-07-19 22:31:07 +02:00
Martin Kroeker
49077e7bde
Merge pull request #4145 from martin-frbg/issue4144
...
Restore zero-initialization of variables in generic ztrsm_utcopy
2023-07-14 12:44:05 +02:00
Martin Kroeker
3d31191b0f
Work around Clang failing to disambiguate SVE intrinsics and add AppleClang crossbuild to MacOS/arm64 DYNAMIC_ARCH in AzureCI (#4140)
...
* Add AppleClang crossbuild to MacOS/arm64 DYNAMIC_ARCH
* add casts to disambiguate svwhilelt for clang
2023-07-14 11:06:48 +02:00
Martin Kroeker
cfa0a80664
Restore initialization of data variables
2023-07-13 23:23:12 +02:00
Martin Kroeker
9567305e4c
Restore initialization of data01,data02
2023-07-13 23:21:18 +02:00
Octavian Maghiar
1e4a3a2b5e
Fixes RVV masked intrinsics for izamax/izamin kernels
2023-07-12 12:55:50 +01:00
Octavian Maghiar
e1958eb705
Fixes RVV masked intrinsics for iamax/iamin/imax/imin kernels
...
Changes masked intrinsics from _m to _mu and reintroduces maskedoff argument.
2023-07-05 11:34:00 +01:00
Xianyi Zhang
e14a025bb1
Temporarily work around zaxpy vector kernel bug.
2023-06-28 11:17:38 +00:00
Martin Kroeker
772b0cc715
Fix early bailout
2023-06-27 16:12:27 +02:00
Martin Kroeker
d6be5036d7
Fix IDAMAX
2023-06-26 21:19:33 +02:00
Martin Kroeker
1fe96f8da7
Fix failures to handle increments of zero
2023-06-25 22:36:57 +02:00
Martin Kroeker
73b30b1dec
Fix VLEV_FLOAT/VSEV_FLOAT macros to compile with t-head 2.6.1
2023-06-18 17:46:29 +02:00
Martin Kroeker
c3a2d407a0
Merge pull request #4048 from imzhuhl/spr_sbgemm_fix
...
Sapphire Rapids sbgemm fix
2023-06-17 20:47:09 +02:00
Manjul Mohan
58b88aa5f0
POWER10: Fix compiler warnings
...
This patch removes the warning messages related to unused variables in
sbgemm_kernel_power10.c.
Signed-off-by: Manjul Mohan <manjul@linux.vnet.ibm.com>
2023-06-12 01:08:59 -04:00
ZhengSh
2a8bc38cdc
Merge branch 'xianyi:risc-v' into risc-v
2023-06-09 20:01:03 +08:00
Heller Zheng
0954746380
remove argument unused during compilation.
...
fix wrong vr = VFMVVF_FLOAT(0, vl);
2023-06-04 20:06:58 -07:00
sh-zheng
d3bf5a5401
Combine two reduction operations of zhe/symv into one, with tail undisturbed set.
2023-05-22 22:39:45 +08:00
Honglin Zhu
9e80a194d6
Fix dynamic_list build and gcc version check error
2023-05-21 19:52:58 +08:00
Honglin Zhu
a76afdc047
Compatible with older version of GNU make
2023-05-20 13:58:23 +08:00
sh-zheng
18d7afe69d
Add rvv support for zsymv and activate rvv support for zhemv
2023-05-20 01:19:44 +08:00
Honglin Zhu
90f041e348
Invoke the syscall to allow the use of amx tiles
2023-05-19 10:48:18 +08:00
Honglin Zhu
0b83088887
spr dynamic arch support
2023-05-19 10:48:18 +08:00
Honglin Zhu
f249ccb741
Fix spr sbgemm error
2023-05-19 10:48:18 +08:00
Martin Kroeker
e9a8d5b45f
Merge pull request #4015 from martin-frbg/issue4013-2
...
[WIP] Disable gcc's tree-vectorizer for x86_64 CGEMV
2023-04-23 18:51:12 +02:00
Martin Kroeker
72caceb324
Merge pull request #4009 from Mousius/sve-gemm
...
Use SVE kernel for SGEMM/DGEMM on Arm(R) Neoverse(TM) V1
2023-04-22 13:56:45 +02:00
Martin Kroeker
84bcf6639f
Disable gcc's tree-vectorizer pass on all operating systems
2023-04-20 23:24:52 +02:00
Martin Kroeker
c9174ae8d7
Disable gcc's tree-vectorizer pass on all operating systems
2023-04-19 23:45:44 +02:00
Martin Kroeker
c2fe9cb91f
Disable gcc's tree-vectorizer pass on all operating systems
2023-04-19 23:45:14 +02:00
Martin Kroeker
66b39b835c
Disable gcc's tree-vectorizer pass on all operating systems
2023-04-19 23:44:45 +02:00
Martin Kroeker
bb6d6735bf
Disable gcc's tree-vectorizer pass on all operating systems
2023-04-19 23:44:15 +02:00
Martin Kroeker
d18efaed20
Disable gcc's tree-vectorizer pass on all operating systems
2023-04-19 23:43:43 +02:00
Martin Kroeker
99f6d31ed5
Disable gcc's tree-vectorizer pass on all operating systems
2023-04-19 23:42:55 +02:00
Martin Kroeker
7de9335c56
Disable gcc's tree-vectorizer pass on all operating systems
2023-04-19 23:42:09 +02:00
Martin Kroeker
437c0bf2b4
Merge pull request #3843 from Mousius/switch-ratio
...
Propagate SWITCH_RATIO to DYNAMIC_ARCH builds
2023-04-19 11:51:54 +02:00
Chris Sidebottom
ec334e69dc
Use SVE kernel for SGEMM/DGEMM on Arm(R) Neoverse(TM) V1
...
This re-spins #3869 with some additional copy unrolling which helps maintain SYRK performance.
After #3868, the SVE kernels represent a pretty good boost.
This re-uses ARMV8SVE as a base and I'm going to incrementally move everything to use ARMV8SVE in additional patches (as well as fix up anything that's not already in ARMV8SVE).
2023-04-17 17:38:42 +01:00
Chris Sidebottom
32f2fafde7
Propagate SWITCH_RATIO to DYNAMIC_ARCH builds
...
Previously dynamic builds were either using the default SWITCH_RATIO
or one from the higher level architecture; this patch ensures the
dynamic builds can use this parameter as well.
2023-04-17 15:34:12 +01:00
Martin Kroeker
44164e3a3d
revert "move alpha out of register 18" (out of PR scope, no SVE on Apple hw)
2023-04-17 14:23:13 +02:00
Martin Kroeker
8be68fa7f4
move declaration of sca to really keep the compiler from throwing it out (for now)
2023-04-15 12:02:39 +02:00
Martin Kroeker
3727672a74
Improve workaround and keep compilers from optimizing it out
2023-04-13 18:07:52 +02:00
Martin Kroeker
108a21e47a
Move ALPHA out of register 18 (reserved on OSX)
2023-04-13 18:05:14 +02:00
Martin Kroeker
0b1acb0ba3
Move ALPHA_I out of register 18 (reserved on OSX)
2023-04-13 18:03:35 +02:00
Martin Kroeker
c7bbad09ad
Move ALPHA_I out of register 18 (reserved on OSX)
2023-04-13 18:00:47 +02:00
Martin Kroeker
cda29633a3
move ALPHA_I out of register 18 (reserved on OSX)
2023-04-13 17:59:48 +02:00
Martin Kroeker
09ace3cf23
Merge pull request #3846 from lilh9598/sbgemm_opt
...
Improve the performance of sbgemm_tcopy on neoversen2
2023-03-26 19:04:57 +02:00
Heller Zheng
1374a2d08b
This PR adapts to the latest spec changes
...
Add prefix (__riscv_) for all riscv intrinsics
Update some intrinsics' parameters, like vfredxxxx, vmerge
2023-03-19 23:59:03 -07:00
Zhang Xianyi
19f17c8bc6
Merge pull request #3893 from HellerZheng/develop
...
add riscv level3 C,Z kernel functions.
2023-03-15 10:17:13 +08:00
Sergei Lewis
cb0a70e0e2
dot.c early bail fix
2023-03-02 09:51:10 +00:00
Sergei Lewis
9b61be4545
factoring riscv64/dot.c fix into separate PR as requested
2023-03-01 17:40:42 +00:00
Sergei Lewis
2406958629
* update intrinsics to match latest spec at https://github.com/riscv-non-isa/rvv-intrinsic-doc (in particular, __riscv_ prefixes for rvv intrinsics)
...
* fix multiple numerical stability and corner case issues
* add a script to generate arbitrary gemm kernel shapes
* add a generic zvl256b target to demonstrate large gemm kernel unrolls
2023-02-24 10:45:03 +00:00
Martin Kroeker
38d6fb4225
Fix dependencies in builds with specified subsets of precision types
2023-02-23 23:12:06 +01:00
Martin Kroeker
e412bee313
fix GEMM kernel dependencies in builds that use only a subset of precisions
2023-02-22 00:37:14 +01:00
Martin Kroeker
d80adf253e
make SSYMV available to BUILD_DOUBLE-only builds
2023-02-22 00:30:20 +01:00
Martin Kroeker
5481c328e8
fix DYNAMIC_ARCH builds that use only a subset of precisions
2023-02-22 00:28:25 +01:00
Heller Zheng
63cf4d0166
add riscv level3 C,Z kernel functions.
2023-02-01 19:13:44 -08:00
Xianyi Zhang
c19dff0a31
Fix T-Head RVV intrinsic API changes.
2023-01-25 19:33:32 +08:00
Martin Kroeker
5a9cd87794
Merge pull request #3868 from Mousius/sve-prefetch
...
Remove prefetches from SVE kernels
2022-12-24 10:52:29 +01:00
Chris Sidebottom
1361229291
Remove prefetches from SVE kernels
...
This is a precursor to enabling the SVE kernels for Arm(R) Neoverse(TM)
V1 which has 256-bit SVE. Testing revealed that the SVE kernel was
actually worse in some cases than the existing kernel, which seemed odd;
with these prefetches removed, the underlying architecture seems to do a better job
😸
2022-12-16 14:43:09 +00:00
Bart Oldeman
60e49b851c
Fix typo in clobber list, should be xmm14 instead of ymm14.
2022-12-06 16:30:46 -05:00
Bart Oldeman
4afe1439a1
Fix skylake fallback kernel name for old compilers.
2022-12-06 16:09:54 -05:00
Bart Oldeman
5ceca1a4d8
Add sscal.c + microkernels for Haswell, Zen, Skylake and newer.
...
Unlike [dcz]scal, sscal still used the original GotoBLAS SSE code from scal_sse.S.
This code follows dscal as closely as possible, except for the inc_x > 1 code
for which a plain C loop is used much like the one in cscal.c, instead of an
adaptation of the SSE2 asm code of dscal.c (I tried but the performance wasn't
better than the plain C loop).
2022-12-06 14:05:49 -05:00
lilianhuang
729af6406f
bugfix for sbgemm_ncopy_8_neoversen2
2022-12-05 05:10:18 -05:00
Martin Kroeker
042e3c0e7c
Merge pull request #3848 from bartoldeman/dscal-haswell-ymm
...
dscal: use ymm registers in Haswell microkernel
2022-12-05 08:56:08 +01:00
Xianyi Zhang
e5313f53d5
Merge branch 'develop' of https://github.com/HellerZheng/OpenBLAS_riscv_x280 into HellerZheng-develop
2022-12-03 12:00:52 +08:00
Bart Oldeman
5c3169ecd8
dscal: use ymm registers in Haswell microkernel
...
Using 256-bit registers in dscal makes this microkernel consistent with
cscal and zscal, and generally doubles performance if the vector fits in
L1 cache.
2022-12-01 07:48:05 -05:00
Chris Sidebottom
eea006a688
Wrap SVE header with __has_include check
2022-12-01 12:07:55 +00:00
Chris Sidebottom
fd4f52c797
Add SVE implementation for sdot/ddot
...
This adds an SVE implementation to sdot/ddot when available, falling back to the previous Advanced SIMD kernel where there's no SVE implementation for the kernel.
All the targets were essentially treating `dot_thunderx2t99.c` as the Advanced SIMD implementation so I've renamed it to better fit with the feature detection.
2022-12-01 12:07:50 +00:00
lilianhuang
fdac8a97c1
Add sbgemm_ncopy_8 and sbgemm_tcopy_4
2022-11-29 04:46:14 -05:00
lilianhuang
135718eafc
Improve the performance of sbgemm_tcopy on neoversen2
2022-11-28 04:17:54 -05:00
Chris Sidebottom
4f7b77e08a
Remove unnecessary instructions from Advanced SIMD dot
...
The existing kernel was issuing extra instructions to organise the arguments into the same registers they would usually be in and similarly to put the result into the appropriate register.
This has an impact on smaller sized dots and seemed like a quick fix
2022-11-25 16:19:03 +00:00
Heller Zheng
3918d8504e
nrm2 simple optimization
2022-11-21 19:06:07 -08:00
HellerZheng
943372bdf5
Merge branch 'develop' into develop
2022-11-18 10:12:46 +08:00
Martin Kroeker
f73cfb7e2c
change line endings from CRLF to LF
2022-11-17 09:39:56 +01:00
Martin Kroeker
1688c7da43
change line endings from CRLF to LF
2022-11-16 22:24:01 +01:00
Heller Zheng
5d0d1c5551
Remove redundant files
2022-11-15 18:22:21 -08:00
Heller Zheng
bef47917bd
Initial version for riscv sifive x280
2022-11-15 00:06:25 -08:00
Bart Oldeman
6c1043eb41
Add [cz]scal microkernels for SKYLAKEX
...
These are as similar to dscal_microk_skylakex-2.c as possible
for consistency.
Note that before this change SKYLAKEX+ uses generic C functions for
cscal/zscal via commit 2271c350 from #2610 (which is masked by
commit 086d87a30). However now #3799 disables FMAs (in turn enabled
by `-march=skylake-avx512`) in the plain C code which fixes excessive
LAPACK test failures more nicely.
2022-11-09 08:57:03 -05:00
Martin Kroeker
c9d78dc3b2
Remove excess initializer (leftover from rework of PR 3793)
2022-10-31 16:57:03 +01:00
Martin Kroeker
65338a9493
Merge pull request #3799 from bartoldeman/cscal-zscal-no-fma
...
x86_64: prevent GCC and Clang from generating FMAs in cscal/zscal.
2022-10-30 18:56:10 +01:00
Honglin Zhu
79066b6bf3
Change file name to match the norm and delete useless code.
2022-10-28 17:09:39 +08:00
Bart Oldeman
e7e3aa2948
x86_64: prevent GCC and Clang from generating FMAs in cscal/zscal.
...
If e.g. -march=haswell is set in CFLAGS, GCC generates FMAs by default, which
is inconsistent with the microkernels, none of which use FMAs. These
inconsistencies cause a few failures in the LAPACK testcases, where
eigenvalue results with/without eigenvectors are compared.
Moreover using FMAs for multiplication of complex numbers can give surprising
results, see 22aa81f for more information.
This uses the same syntax as used in 22aa81f for zarch (s390x).
2022-10-27 18:16:43 -04:00
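A small model can make the "surprising results" mentioned in the commit above concrete. The example is an illustration chosen here, not taken from the referenced commit: it looks at the imaginary part of alpha * conj(alpha), which is im*re - re*im and therefore mathematically zero. The fused variant is emulated in double precision (the product of two floats is exact in a double), so the sketch does not rely on FMA hardware or on libm; the function names are hypothetical.

```c
/* Imaginary part computed with two separately rounded products, as a
   non-FMA kernel would do. Both products round to the same float, so
   the difference is exactly zero. The volatile qualifiers keep the
   compiler from fusing the operations itself. */
float imag_separate(float re, float im)
{
    volatile float p1 = im * re;
    volatile float p2 = re * im;
    return p1 - p2;
}

/* Model of what a fused multiply-add yields for im*re - round(re*im):
   one product is rounded to float, the other is kept exact (24 + 24
   mantissa bits fit in a double), so the result is the rounding error
   of the first product instead of zero. */
float imag_fused_model(float re, float im)
{
    volatile float p2 = re * im;
    double exact = (double)im * (double)re;
    return (float)(exact - (double)p2);
}
```

For re = 0.1f and im = 0.3f the separately rounded version returns exactly zero, while the fused model returns a tiny nonzero value. A discrepancy of this kind between FMA and non-FMA code paths is consistent with the inconsistencies the commit message attributes to mixing the two.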
Honglin Zhu
4989e039a5
Define SBGEMM_ALIGN_K for DYNAMIC_ARCH build
2022-10-27 14:10:26 +08:00
Honglin Zhu
843e9fd0b9
Fix typo error
2022-10-26 17:06:33 +08:00
Honglin Zhu
b00d5b9746
New sbgemm implementation for Neoverse N2
...
1. Use UZP instructions instead of gather-load and scatter-store instructions to get lower latency.
2. Padding k to a power of 4.
2022-10-26 15:09:41 +08:00
Martin Kroeker
f6f35a4288
fix copyobj declarations to work with DYNAMIC_ARCH
2022-09-29 08:47:14 +02:00
Martin Kroeker
b1d69fb3ac
Add MIPS64_GENERIC as a copy of GENERIC
2022-09-17 23:52:32 +02:00
gxw
edea1bcfaf
MIPS64: Fixed failed utest dsdot:dsdot_n_1 when TARGET=I6500
2022-09-17 16:43:22 +08:00
Martin Kroeker
101a2c77c3
Fix warnings
2022-09-15 09:19:19 +02:00
Martin Kroeker
23d59baaf1
Add -mfma to -mavx2 for Apple clang, and set AVX2 options for Zen as well
2022-09-13 22:39:27 +02:00
gxw
365936ae1b
MIPS64: Using the macro MTC rather than MTC1
2022-09-13 16:39:40 +08:00
Martin Kroeker
739c3c44a7
Work around windows/osx gcc12 x86_64 tree-optimizer problem and add an osx/gcc12 build to Azure CI (#3745)
...
Add pragma to disable the gcc tree-optimizer for some x86_64 S and Z kernels with gcc12 on OSX or Windows
2022-09-03 15:01:22 +02:00
Martin Kroeker
bd30120ba7
Merge pull request #3720 from FlyGoat/mips64
...
Make it work on general MIPS64 processors
2022-08-19 20:24:27 +02:00
Jiaxun Yang
a50b29c540
Provide a fallback MIPS64_GENERIC target
...
It is really dangerous to fall back to a Loongson core on other
MIPS64 processors.
Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
2022-08-12 13:13:28 +01:00
Jiaxun Yang
50c4eeb97d
alpha: Remove include of version.h
...
It will be defined by preprocessor argument.
Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
2022-08-11 15:02:58 +01:00
Ivan Pribec
802e71bf05
Add const attribute to lsame
2022-08-08 15:15:52 +02:00
gxw
fbfe1daf6e
LoongArch64: Add DYNAMIC_ARCH support
2022-07-28 14:28:45 +08:00
Martin Kroeker
cd8e57040c
Merge pull request #3691 from martin-frbg/issue3679-sparc
...
SPARC: fix DNRM2 returning INF instead of zero due to intermediate overflow
2022-07-25 15:41:15 +02:00
Martin Kroeker
6c118b7977
Fix DNRM2 returning INF instead of zero due to intermediate overflow
2022-07-24 17:42:31 +02:00
Martin Kroeker
c43ec53bdd
Merge pull request #3690 from RajalakshmiSR/cdotp10
...
POWER: Fix complex dot function failures
2022-07-19 13:59:16 +02:00
Martin Kroeker
b7c65d08cb
Merge pull request #3689 from RajalakshmiSR/dgemvgcc10
...
POWER10: dgemv builtin rename
2022-07-19 10:25:01 +02:00
Martin Kroeker
06ef015234
fix DNRM2 returning INF instead of zero due to intermediate overflow
2022-07-19 10:19:27 +02:00
Rajalakshmi Srinivasaraghavan
a612e78a97
POWER: Fix complex dot function failures
...
There are some test failures in complex dot functions when compiling with gcc12.
The machine constraints used now do not update all the four elements in the
expected result array. Fixing this with a reduced level of optimization.
This is not changing any performance numbers but will be converted to C code in future.
2022-07-18 14:48:43 -05:00
Rajalakshmi Srinivasaraghavan
432fd99445
POWER10: dgemv builtin rename
...
Add check to use correct builtin name for older versions
of gcc10 compilers.
2022-07-18 09:48:01 -05:00
gxw
4dd05e526b
LoongArch64: Fix dnrm2_tiny testcase failure
2022-07-15 11:18:59 +08:00
gxw
cce4b1d956
MIPS64: Fix dnrm2_tiny testcase failure
2022-07-11 19:18:38 +08:00
Martin Kroeker
e12d474780
Eliminate uses of CREAL on left-hand side of assignments
2022-07-05 00:01:09 +02:00
Martin Kroeker
9e29598575
workaround fault with ssq=inf,scale=0
2022-07-02 23:47:17 +02:00
Honglin Zhu
123e0dfb62
Neoverse N2 sbgemm:
...
1. Modify the algorithm to resolve multithreading failures
2. No memory allocation in sbgemm kernel
3. Optimize when alpha == 1.0f
2022-06-29 10:14:21 +08:00
Honglin Zhu
bc3728475f
format code
2022-06-29 10:14:21 +08:00
Honglin Zhu
55d686d41e
neoverse n2 sbgemm:
...
implement ncopy tcopy kernel_8x4
2022-06-29 10:14:21 +08:00
Honglin Zhu
04593bb27c
neoverse n2 sbgemm: init file
2022-06-29 10:14:21 +08:00
Martin Kroeker
be5500e704
Merge pull request #3669 from VFerrari/fix_small_matrix_kernel
...
POWER: fix issues with the small matrix kernel
2022-06-28 16:09:36 +02:00
Martin Kroeker
92275a7902
Merge pull request #3642 from nursik/develop
...
Add ARM64 support for Windows
2022-06-28 16:05:11 +02:00
VFerrari
cac634fce3
POWER10: Fix multithreading check when USE_THREAD=0
...
This patch fixes an issue when OpenBLAS is compiled for TARGET=POWER10
and the flag USE_THREAD is set to 0.
The function `num_cpu_avail` is only available when USE_THREAD=1,
so SMP is defined.
2022-06-25 03:46:46 -03:00
Martin Kroeker
9283c7c0b5
Merge pull request #3655 from RajalakshmiSR/zgemmasmp10
...
POWER10: Fix ZGEMM testcase failures
2022-06-18 20:52:26 +02:00
Rajalakshmi Srinivasaraghavan
f191bc652b
POWER10: Fix ZGEMM testcase failures
...
This patch fixes storing and restoring non volatile registers
in zgemm POWER10 kernel.
2022-06-17 08:18:08 -05:00
Rajalakshmi Srinivasaraghavan
8419d538ff
POWER10: convert dgemv inline assembly
...
This patch makes use of compiler builtins and matches with assembly
performance. Tested with clang14 and gcc12.
2022-06-09 10:42:57 -05:00
Xianyi Zhang
5e9a912591
Merge branch 'develop' into risc-v
2022-06-06 14:12:09 +08:00
Xianyi Zhang
968e1f51d8
Update RISC-V Intrinsic API.
2022-06-06 13:52:21 +08:00
Nursultan Zarlyk
1bb7993a97
Fix MSVC ARM64 build. Add generic kernel for ARM64
2022-06-02 16:53:54 +02:00
Martin Kroeker
dc49edd4e6
Revert "roll back DGEMM kernel ... for DYNAMIC_ARCH"
2022-05-20 11:23:30 +02:00
Rajalakshmi Srinivasaraghavan
b62173c5a0
POWER10: Changing store instructions for Level1 functions
...
This patch changes 32-byte stores to two 16-byte stores
to fix a recent degradation due to 32-byte stores.
2022-05-12 11:17:33 -05:00
Martin Kroeker
84cb58b7fb
Fix generator rules for ?laswp_ncopy and ?neg_tcopy
2022-04-30 15:28:38 +02:00
Martin Kroeker
05dcfa176e
fix undefined prefetchsizes
2022-04-16 10:04:27 +02:00
Martin Kroeker
2bbb9f05c7
fix undefined prefetchsize
2022-04-16 10:00:10 +02:00
Martin Kroeker
115bc9b98f
CortexX1 is ARMV8 like A7x
2022-03-28 17:28:29 +02:00
Martin Kroeker
b3b4672c30
Add initial support for Phytium FT2000 series and ARMV9 Cortex 510/710/X1/X2
2022-03-27 15:29:20 +02:00
Martin Kroeker
40302558ed
Remove extraneous (and wrong) definition of sbgemm_r on x86_64
2022-03-23 20:05:32 +01:00
Caroline Newcombe
5cc1111383
fix unsafe read of Y in assembly kernel
2022-03-11 11:56:33 -06:00