Bine Brank
ce329ab686
add sve zhemm copy routines
2022-01-03 15:56:05 +01:00
Bine Brank
0140373802
add sve ztrmm
2022-01-02 19:15:33 +01:00
Bine Brank
f7b6912868
ztrmm sve copy kernels
2021-12-30 21:00:16 +01:00
Bine Brank
40b14e4957
fix zgemm kernel
2021-12-29 11:42:04 +01:00
Bine Brank
6ec4aab875
zgemm sve copy routines
2021-12-26 17:05:46 +01:00
Bine Brank
878064f394
sve zgemm kernel
2021-12-26 08:44:05 +01:00
Bine Brank
683a7548bf
added macros for sve zgemm kernels
2021-12-25 11:46:41 +01:00
Martin Kroeker
7b146e590c
fix function typecast
2021-12-24 20:01:52 +01:00
Martin Kroeker
e9a0e52201
fix function typecast
2021-12-24 20:00:50 +01:00
Martin Kroeker
d1ee6ff73f
fix function typecasts
2021-12-21 18:45:28 +01:00
Bine Brank
e3c9947c0f
prepare kernel for sve zgemm
2021-12-21 11:19:27 +01:00
gxw
8d9b9c6b2a
loongarch64: Optimize dgemm_kernel
2021-12-21 09:33:06 +08:00
Wu Zhigang
92b7b949dd
fix bug in zscal function
...
memset can not be used in zscal because of
the stride parameters.
Signed-off-by: Wu Zhigang <zhigang.wu@starfivetech.com>
2021-12-15 01:23:30 -08:00
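Note on the zscal fix above: a minimal sketch of why a stride rules out memset, assuming interleaved (real, imaginary) storage and a stride counted in complex elements. The helper name zero_strided_complex is hypothetical and is not the OpenBLAS kernel; it only shows that with a stride greater than 1 the elements between the selected ones must be left untouched, which a single contiguous memset would not do.

```c
#include <stddef.h>

/* Hypothetical helper for illustration only (not the actual zscal kernel).
 * Zeroes n complex doubles stored as interleaved (re, im) pairs, where
 * consecutive selected elements are inc_x complex elements apart. */
void zero_strided_complex(size_t n, double *x, size_t inc_x)
{
    for (size_t i = 0; i < n; i++) {
        size_t k = 2 * i * inc_x;   /* index of the real part */
        x[k]     = 0.0;             /* real part */
        x[k + 1] = 0.0;             /* imaginary part */
    }
}
```

With inc_x equal to 1 the loop touches a contiguous range and a memset would be equivalent; for any larger stride it skips the elements in between.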
Martin Kroeker
b0a590f4fe
Merge pull request #3475 from wjc404/optimize-A53-dgemm
...
optimize cgemm on ARM cortex A53 & cortex A55
2021-12-12 19:09:08 +01:00
Martin Kroeker
f4d1f0333b
Merge pull request #3474 from rafaelcfsousa/rafael/cmake_power
...
Add CMake support for Power
2021-12-12 19:08:27 +01:00
Jia-Chen
b610d2de37
optimize cgemm on ARM cortex A53 & cortex A55
2021-12-12 17:22:52 +08:00
Martin Kroeker
697e2752d7
Merge pull request #3464 from binebrank/arm_sve_sgemm
...
Add sgemm part for Arm SVE
2021-12-11 20:35:22 +01:00
Bine Brank
a8f62a347b
fix UNROLL_MN and add to targets for SVE
2021-12-11 16:37:23 +01:00
Bine Brank
774267fdac
adjust Makefile.L3 for SVE
2021-12-11 16:35:08 +01:00
Rafael Cardoso Fernandes Sousa
23a7561353
Fix error cmake (small kernels)
2021-12-09 09:57:39 -06:00
Martin Kroeker
5378046abd
roll back DGEMM kernels to 4x8 when compiling for DYNAMIC_ARCH
2021-12-06 19:43:54 +01:00
Bine Brank
a1fea1fe2a
sgemm v2x8 SVE kernel
2021-12-05 18:47:29 +01:00
Bine Brank
abe1ce3434
strmm sve v1x8 kernel
2021-12-05 14:03:08 +01:00
Martin Kroeker
54d321d742
Merge pull request #3466 from rafaelcfsousa/rafael/small_matrix_p10
...
[POWER] Add small matrix for sgemm/dgemm on Power10
2021-12-03 12:12:20 +01:00
Martin Kroeker
0882db30a2
Merge pull request #3455 from cenewcombe/develop
...
Fix unsafe read during final iteration of zsymv_L_sse2.S
2021-12-03 10:01:20 +01:00
Bine Brank
0de36f7b5c
trmm sve copy functions for single precision
2021-11-29 21:25:05 +01:00
Rafael Cardoso Fernandes Sousa
c78fdcc80d
[POWER] Add support for SMALL_MATRIX_OPT
2021-11-28 12:41:16 -06:00
Bine Brank
86ae89bf33
add sgemm kernel and copy functions for sgemm and ssymm
2021-11-28 18:12:47 +01:00
Martin Kroeker
454edd741c
Merge pull request #3425 from binebrank/arm_sve_dgemm
...
Add dgemm kernel for arm64 SVE
2021-11-26 16:14:55 +01:00
Martin Kroeker
bcfbdc81b2
Merge pull request #3459 from rafaelcfsousa/fix_cmake
...
Fix issues when building OpenBLAS with cmake
2021-11-26 15:19:24 +01:00
Bine Brank
1af73ce38e
Adapt CMake for SVE
2021-11-26 10:35:01 +01:00
Martin Kroeker
e7fca060db
Merge pull request #3457 from wjc404/optimize-A53-dgemm
...
MOD: optimize zgemm on cortex-A53/cortex-A55
2021-11-26 10:30:47 +01:00
Jia-Chen
5c1cd5e0c2
MOD: add comments to a53 zgemm kernel
2021-11-25 22:48:48 +08:00
Rafael Cardoso Fernandes Sousa
d5c9353f1b
Modify the order in which cmake sets the KERNEL variables (generic is now the fallback)
2021-11-24 20:08:35 -06:00
Jia-Chen
9f59b19fcd
MOD: optimize zgemm on cortex-A53/cortex-A55
2021-11-24 21:51:45 +08:00
Bine Brank
531a28b6a0
removed unused code (compiler warnings)
2021-11-22 10:12:34 +01:00
Bine Brank
9b9cb90bb1
modify Makefile for SVE copy
2021-11-22 09:54:20 +01:00
Bine Brank
9388f05a3c
configure SVE Makefile
2021-11-21 18:33:43 +01:00
Bine Brank
b58d4f31ab
some clean-up & commentary
2021-11-21 14:56:27 +01:00
Martin Kroeker
b7df500106
Add generic mips32 target
2021-11-20 17:31:51 +01:00
Bine Brank
e6ed4be02e
symm SVE copy routines
2021-11-20 16:35:29 +01:00
Caroline Newcombe
feeb8283a5
Fix unsafe read during final iteration of zsymv_L_sse2.S
2021-11-19 14:29:32 -06:00
Jia-Chen
302f22693a
MOD: optimize normal DGEMM on ARMV8 cortex-A53 & cortex-A55
2021-11-18 21:14:43 +08:00
Bine Brank
3c7eed0e53
add remaining trmm copy routines for SVE
2021-11-14 16:00:10 +01:00
Bine Brank
7d996b1c36
dtrmm_utcopy sve function
2021-11-13 18:48:53 +01:00
Bine Brank
ab7917910d
add v2x8 kernel + fix sve dtrmm
2021-11-07 20:37:51 +01:00
Bine Brank
7093372e32
add ARMV8SVE target
2021-11-01 22:53:21 +01:00
Bine Brank
a8fbdbac34
fix sve dgemm kernel + sve dtrmm
2021-10-31 10:24:25 +01:00
Bine Brank
746b4f0f17
added SVE ncopy and tcopy
2021-10-30 12:11:44 +02:00
Bine Brank
1a10d3e09d
add sve dgemm prototype
2021-10-27 16:37:18 +02:00
Martin Kroeker
22bf5c27ba
Add basic support for the Fujitsu A64FX ( #3415 )
...
* Add initial support for Fujitsu A64FX as generic ARMV8
2021-10-18 15:00:19 +02:00
Wangyang Guo
63a103ba6e
sbgemm: spr: disable small matrix path by default
2021-10-17 19:08:03 -07:00
Wangyang Guo
82194ea9d2
sbgemm: spr: implement otcopy_16
2021-10-17 19:08:03 -07:00
Wangyang Guo
8632380a96
sbgemm: spr: reuse ncopy_16 from cooperlake as incopy
2021-10-17 19:08:03 -07:00
Wangyang Guo
6bc8204ce5
sbgemm: spr: optimization for tmp_c buffer
2021-10-17 19:08:03 -07:00
Wangyang Guo
f018aa342a
sbgemm: spr: kernel handle alpha != 1.0
2021-10-17 19:08:03 -07:00
Wangyang Guo
a52456b168
sbgemm: spr: oncopy: use tile load/store instead
2021-10-17 19:08:03 -07:00
Wangyang Guo
f2485352a6
sbgemm: spr: only load A once in tail_k handling
2021-10-17 19:08:03 -07:00
Wangyang Guo
9ab33228bb
sbgemm: spr: process k2 and odd k at the same time
2021-10-17 19:08:03 -07:00
Wangyang Guo
10d52646e2
sbgemm: spr: oncopy: avoid handling too much pointer at a time
2021-10-17 19:08:03 -07:00
Wangyang Guo
88154ed02d
sbgemm: spr: reduce tile conf loading by separate tail k handling
2021-10-17 19:08:03 -07:00
Wangyang Guo
a70bfb52d5
sbgemm: spr: kernel works for NN case when alpha is 1.0
2021-10-17 19:08:03 -07:00
Wangyang Guo
6051c86741
sbgemm: spr: kernel works for m32 in NN case
2021-10-17 19:08:03 -07:00
Wangyang Guo
d0b253ac6e
sbgemm: spr: implement oncopy_16
2021-10-17 19:08:03 -07:00
Wangyang Guo
1d48b7cb16
sbgemm: spr: add dummy source files
2021-10-17 19:08:03 -07:00
Wangyang Guo
3dc6052c7e
initial support for Sapphire Rapids platform
2021-10-12 01:30:40 -07:00
Martin Kroeker
8c20ca345a
Use Neoverse's current mix of ThunderX2 kernels for Vortex as well
2021-10-06 11:06:43 +02:00
Martin Kroeker
8e4c209002
Merge pull request #3398 from kavanabhat/aix_p10_gnuas
...
Big Endian Changes for Power10 kernels
2021-10-05 18:59:47 +02:00
kavanabhat
9cc95e5657
AIX changes for P10 with GNU Compiler
2021-10-01 05:18:35 -05:00
kavanabhat
fe3c778c51
AIX changes for P10 with GNU Compiler
2021-09-30 06:06:27 -05:00
Wangyang Guo
ee5ca8a328
x86_64: BFLOAT16: fix build warning
2021-09-28 18:30:06 +08:00
Martin Kroeker
90cc944625
Move alphaI to x22 to leave x18 unused (reserved on OSX)
2021-09-17 09:53:18 +02:00
Martin Kroeker
590fbff06e
move alpha to x19/x20 to leave x18 unused for OSX
2021-09-17 09:42:17 +02:00
Martin Kroeker
380940271b
Move temp to x21 to leave x18 unused (reserved on OSX)
2021-09-17 09:28:19 +02:00
Martin Kroeker
7d75177446
Move temp to x21 to leave x18 unused (reserved on OSX)
2021-09-17 09:24:11 +02:00
Martin Kroeker
0a4ac4b585
Use x21 for I to leave x18 unused (reserved on OSX)
2021-09-17 09:19:51 +02:00
Martin Kroeker
7d4a221579
Remove unused TEMP2 and reshuffle to leave x18 unused (reserved on OSX)
2021-09-17 09:18:25 +02:00
Martin Kroeker
d3a9c7ef7f
Merge pull request #3382 from rafaelcfsousa/rafael/cwarnings
...
[POWER] Remove unused variable warnings.
2021-09-17 09:15:16 +02:00
Martin Kroeker
8dfa61a61c
Initialize abs_mask1 with itself to silence a gcc warning
2021-09-15 22:11:35 +02:00
Martin Kroeker
99aa10b3ff
Initialize abs_mask1 with itself to silence a gcc warning
...
actual initialization is via the _mm_cmpeq_epi8, which I've seen claimed to be the fastest way to set an xmm register to all 1s
2021-09-15 22:10:43 +02:00
Rafael Cardoso Fernandes Sousa
b751edf624
Fix unused variable warnings on Power
2021-09-15 13:36:07 -05:00
Martin Kroeker
80346b8813
Merge pull request #3379 from martin-frbg/issue3369-2
...
Add casts to fix compiler warnings for SkylakeX sasum/dasum
2021-09-15 07:18:57 +02:00
Martin Kroeker
ce036a2fc0
Add casts
2021-09-14 21:41:53 +02:00
Martin Kroeker
ddf106f769
Add dedicated entries for BFLOAT16 kernels
2021-09-14 16:17:18 +02:00
Martin Kroeker
af8843875a
Merge pull request #3376 from martin-frbg/issue3370
...
Fix a few harmless compiler warnings
2021-09-12 00:01:31 +02:00
Martin Kroeker
0925dfe2c9
One instance of kernel_4x1 is used even on SKX
2021-09-11 15:30:19 +02:00
Martin Kroeker
7d873a329f
Add ifdefs around conditionally used functions
2021-09-11 14:38:47 +02:00
Martin Kroeker
ef24712030
Move a conditionally used variable
2021-09-11 14:37:44 +02:00
Martin Kroeker
d17238599b
Add casts
2021-09-11 13:38:28 +02:00
Wangyang Guo
59a1114d03
sbgemm: cooperlake: tuning for small matrix
2021-09-07 21:30:46 +08:00
Wangyang Guo
682d66555d
sbgemm: cooperlake: implement ncopy_16
2021-09-07 21:30:46 +08:00
Wangyang Guo
beccb83b16
sbgemm: cooperlake: add n24 kernel for tcopy_4
2021-09-07 21:30:46 +08:00
Wangyang Guo
5fcacad32b
sbgemm: cooperlake: implement tcopy_4
2021-09-07 21:30:46 +08:00
Wangyang Guo
bb1c4fa5bd
sbgemm: cooperlake: prefetch A & B
2021-09-07 21:30:46 +08:00
Wangyang Guo
7a2d1601ec
sbgemm: cooperlake: unroll core loop by 2
2021-09-07 21:30:46 +08:00
Wangyang Guo
45fdf951b6
sbgemm: cooperlake: reorder ptr increase for performance
2021-09-07 21:30:46 +08:00
Wangyang Guo
cece3541ab
sbgemm: cooperlake: fix bug in m64n12
2021-09-07 21:30:46 +08:00
Wangyang Guo
9df0953cde
sbgemm: cooperlake: kernel works for NN
2021-09-07 21:30:45 +08:00
Wangyang Guo
2ec9f3a8aa
sbgemm: cooperlake: change kernel size to 16x4
2021-09-07 21:30:45 +08:00
Wangyang Guo
ef8f5fecc8
sbgemm: cooperlake: implement sbgemm_tcopy_32
2021-09-07 21:30:45 +08:00
Wangyang Guo
4c294336e6
sbgemm: cooperlake: add dummy source files
2021-09-07 21:30:45 +08:00
Martin Kroeker
f1e3305974
Add workaround for Windows10 macro name clash
2021-09-01 21:36:50 +02:00
Wangyang Guo
619588fbab
sbgemm: remove unnecessary b0 files
2021-08-30 17:55:01 +08:00
Wangyang Guo
f39301935c
sbgemm: cooperlake: make sure hot buffer aligned to 64
2021-08-30 17:40:30 +08:00
Wangyang Guo
7d27b182fc
sbgemm: cooperlake: enable SBGEMM by small matrix path
2021-08-30 17:40:30 +08:00
Wangyang Guo
1d83ca4bca
Small Matrix: support BFLOAT16 data type
2021-08-30 17:40:20 +08:00
Martin Kroeker
bec9d9f63d
Merge pull request #3335 from guowangy/small-matrix-latest
...
Add GEMM optimization for small matrix and single/double kernel for skylakex
2021-08-29 22:33:33 +02:00
Wangyang Guo
dbbb39199f
sgemv: skylakex: fix build warning
2021-08-25 07:13:00 +00:00
Wangyang Guo
e9acb46431
sgemv: skylakex: bug fix for sgemv_t kernel in corner case
2021-08-25 07:07:27 +00:00
Wangyang Guo
f9dba63c28
Small Matrix: skylakex: remove unnecessary b0 source files
2021-08-13 03:28:44 +00:00
Wangyang Guo
989e6bbdd3
Small Matrix: reduce generic kernel source files
2021-08-13 03:17:38 +00:00
Martin Kroeker
04255be948
Merge pull request #3344 from gxw-loongson/develop
...
Delete the macro instruction "li" and use "li.d" instead
2021-08-12 15:16:46 +02:00
gxw
a7bc8ec1f1
Delete the macro instruction "li" and use "li.d" instead
...
Change-Id: Icff7981e2eb7df29ba5af1f8eb5be8443c67450f
2021-08-12 17:02:54 +08:00
Rajalakshmi Srinivasaraghavan
b06880c2cd
POWER10: Improving dasum performance
...
Unrolling a loop in dasum micro code to help in improving
POWER10 performance.
2021-08-10 22:06:04 -05:00
Wangyang Guo
44d0032f3b
Small Matrix: skylakex: fix build error in old compiler
2021-08-05 04:43:47 +00:00
Chen, Guobing
5d86becdae
Add all SBGEMM kernels for IA AVX512-BF16 based platforms
...
Added all SBGEMM kernels including NN/NT/TN/TT for both ColMajor and
RowMajor, based on AVX512-BF16 ISA set on IA.
Signed-off-by: Chen, Guobing <guobing.chen@intel.com>
2021-08-05 11:11:29 +08:00
Wangyang Guo
fee5abd84b
Small Matrix: support cmake build
2021-08-04 08:50:15 +00:00
Wangyang Guo
478d1086c1
Small Matrix: support DYNAMIC_ARCH build
2021-08-04 03:12:41 +00:00
Wangyang Guo
6b58bca18b
Small Matrix: disable low performance default kernel
2021-08-03 06:49:03 +00:00
Wangyang Guo
fa777f5517
Small Matrix: skylakex: add DGEMM_SMALL_M_PERMIT and tune for TN kernel
2021-08-02 07:06:54 +00:00
Wangyang Guo
8592c21af4
Small Matrix: skylakex: dgemm nn: fix typo in idx load
2021-08-02 07:06:54 +00:00
Wangyang Guo
3e79f6d89a
Small Matrix: skylakex: add dgemm tn kernel
2021-08-02 07:06:54 +00:00
Wangyang Guo
323d7da4f7
Small Matrix: skylakex: add dgemm tt kernel
2021-08-02 07:06:54 +00:00
Wangyang Guo
f57fc932ac
Small Matrix: skylakex: add dgemm nt kernel
2021-08-02 07:06:54 +00:00
Wangyang Guo
91ec21202b
Small Matrix: skylakex: add dgemm nn kernel
2021-08-02 07:06:54 +00:00
Wangyang Guo
72e070539c
Small Matrix: skylakex: add sgemm tt kernel
2021-08-02 07:06:54 +00:00
Wangyang Guo
02c6e764f2
Small Matrix: skylakex: add SGEMM_SMALL_M_PERMIT and tune for TN kernel
2021-08-02 07:06:54 +00:00
Wangyang Guo
5dc7c3c8e5
Small Matrix: add GEMM_SMALL_MATRIX_PERMIT to tune small matrix case
2021-08-02 07:06:54 +00:00
Wangyang Guo
642c393879
Small Matrix: skylakex: add sgemm tn kernel
2021-08-02 07:06:54 +00:00
Wangyang Guo
ae3f5c737c
Small Matrix: skylakex: sgemm nt: optimize for M < 12
2021-08-02 07:06:54 +00:00
Wangyang Guo
0d72d75bf9
Small Matrix: skylakex: add sgemm nt kernel
2021-08-02 07:06:54 +00:00
Wangyang Guo
ca7682e3a3
Small Matrix: skylakex: sgemm nn: fix n6 conflicts with n4
2021-08-02 07:06:54 +00:00
Wangyang Guo
9967e61abb
Small Matrix: skylakex: sgemm nn: fix error when beta not zero
2021-08-02 07:06:54 +00:00
Wangyang Guo
a87736346f
Small Matrix: skylakex: sgemm nn: add n6 to improve performance
2021-08-02 07:06:54 +00:00
Wangyang Guo
4c9d9940fd
Small Matrix: skylakex: sgemm nn: reduce store 4 N at a time
2021-08-02 07:06:54 +00:00
Wangyang Guo
13b32f69b7
Small Matrix: skylakex: sgemm nn: reduce store 4 M at a time
2021-08-02 07:06:54 +00:00
Wangyang Guo
3d8c6d9607
Small Matrix: skylakex: sgemm nn: clean up unused code
2021-08-02 07:06:54 +00:00
Wangyang Guo
49b61a3f30
Small Matrix: skylakex: sgemm_nn: optimize for M <= 8
2021-08-02 07:06:54 +00:00
Wangyang Guo
f88470323b
Optimize M < 16 using AVX512 mask
2021-08-02 07:06:54 +00:00
Wangyang Guo
9186456a12
small matrix: SkylakeX: add SGEMM NN kernel
2021-08-02 07:06:54 +00:00
Xianyi Zhang
6022e5629c
Refs #2587 fix small matrix c/zgemm bug.
2021-08-02 07:06:54 +00:00
Xianyi Zhang
57ed58cefe
Refs #2587 Add small matrix optimization reference kernel for c/zgemm.
2021-08-02 07:06:54 +00:00
Xianyi Zhang
17d32a4a82
Change a1b0 gemm to b0 gemm.
2021-08-02 07:06:54 +00:00
Xianyi Zhang
59cb5de46b
Refs #2587 Fix typos.
2021-08-02 07:06:54 +00:00
Xianyi Zhang
be3349405d
Add alpha=1.0 beta=0.0 for small gemm.
2021-08-02 07:01:47 +00:00
Xianyi Zhang
0a2077901c
Add small matrix optimization kernel interface.
...
make SMALL_MATRIX_OPT=1
2021-08-02 07:01:47 +00:00
gxw
0b8f7c8c10
Add cmake support for LOONGARCH64
2021-08-02 10:00:41 +08:00
gxw
af0a69f355
Add support for LOONGARCH64
2021-07-27 15:29:12 +08:00
Martin Kroeker
49bbf330ca
Empirical workaround for numpy SVD NaN problem from issue 3318
2021-07-18 22:19:19 +02:00
Martin Kroeker
5b4b385ecf
Temporarily disable the SkylakeX sgemv_t microkernel due to LAPACK testsuite failures
2021-07-14 20:50:14 +02:00
User User-User
39ef0880ae
copy conf
2021-06-19 21:49:58 +02:00
Martin Kroeker
c4b464cac6
Merge pull request #3273 from austinpagan/sbgemm_gcc10_fix
...
Power10: Fix for SBGEMM
2021-06-15 22:58:48 +02:00
Gordon Fossum
e6dd44d989
Power10: Fix for SBGEMM
...
While testing bfloat16 sbgemm kernel, there are some failures for odd value inputs due to updating result for
additional bytes.
2021-06-15 13:07:47 -05:00
Gilles Gouaillardet
9d292d37b2
arm64: add the missing d9 register to the clobber list
...
Refs. numpy/numpy#18422
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2021-06-14 17:01:28 +09:00
Martin Kroeker
2e8ff4a781
Merge pull request #3266 from martin-frbg/powerparam
...
Remove spurious casts from PPC parameters and fix compilation for older targets
2021-06-10 18:05:47 +02:00
Martin Kroeker
dbba381dc3
Merge pull request #3260 from intelmy/sgemv_t_opt
...
Optimized sgemv_t for small N based on AVX512
2021-06-10 16:08:24 +02:00
Martin Kroeker
efdbdd8f82
Add prefetch values for power3
2021-06-10 11:20:29 +02:00
Martin Kroeker
3906ef3b0f
Add prefetch values for power3
2021-06-10 11:19:40 +02:00
Martin Kroeker
8adf0971d8
Add prefetch values for power3
2021-06-10 11:18:22 +02:00
Martin Kroeker
08e2e60762
Add prefetch values for power3
2021-06-10 11:17:33 +02:00
Martin Kroeker
fb9e678235
Fix caxpy/zaxpy for big-endian
2021-06-10 11:15:48 +02:00
Martin Kroeker
dc4fcb48df
Fix inverted conditional for caxpy/zaxpy
2021-06-10 11:14:03 +02:00
Martin Kroeker
7a48247761
fix c/zrot and sgemv for POWER5
2021-06-10 11:11:56 +02:00
Rajalakshmi Srinivasaraghavan
cbb70438df
POWER10: Fixes for sbgemm kernel
...
While testing bfloat16 sbgemm kernel, there are some failures
for odd value inputs due to array access beyond the boundary.
2021-06-09 12:20:09 -05:00
Ma, Yu
706a08d4a0
Optimized sgemv_t for small N based on AVX512
2021-06-08 15:08:28 -04:00
Zhaofeng Li
590be3fae3
riscv64: Add Makefile
2021-06-07 22:55:56 +00:00
Zhaofeng Li
3521cd48cb
RISCV64_GENERIC: Use generic kernel for DSDOT for better precision
...
The implementation in `riscv64/dot.c` fails the `test_dsdot` test, and
the generic kernel seems to have better precision. Tested on SiFive
FU740 (HiFive Unmatched) and QEMU.
Also see #1469 .
2021-06-07 22:50:23 +00:00
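Note on the DSDOT change above: DSDOT takes single-precision inputs and returns a double-precision result, so a kernel that accumulates the products in double keeps more precision than one that accumulates in float. The sketch below shows that accumulation pattern only; the name dsdot_sketch and the unit-stride signature are assumptions for illustration, and it is not claimed to match either the generic kernel or `riscv64/dot.c`.

```c
#include <stddef.h>

/* Illustrative sketch only: dot product of two float vectors with the
 * products and the running sum held in double precision.
 * Unit stride is assumed to keep the example short. */
double dsdot_sketch(size_t n, const float *x, const float *y)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Widen each operand before multiplying so the product is exact. */
        acc += (double)x[i] * (double)y[i];
    }
    return acc;
}
```

For example, summing 16777216.0f and 1.0f in float would round the result back to 16777216, while the double accumulator above represents 16777217 exactly.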
Zhaofeng Li
1e0192a5cc
riscv64/imin: Fix wrong comparison
...
Same as #1990 .
2021-06-07 22:49:39 +00:00
Martin Kroeker
5f677e782e
Merge pull request #3196 from guowangy/skylakex-gemm-batch-k
...
GEMM: skylake: improve the performance when m is small
2021-05-22 19:25:28 +02:00
Martin Kroeker
02087a62e7
Merge pull request #3205 from intelmy/sgemv_n_opt
...
optimize on sgemv_n for small n
2021-05-17 17:49:01 +02:00
Martin Kroeker
4ecf631f95
Merge pull request #3228 from martin-frbg/issue3226
...
filter out -mavx flag on Sandybridge zgemm/ztrmm kernels
2021-05-15 09:06:12 +02:00
Martin Kroeker
310b76aad7
Merge pull request #3231 from martin-frbg/issue3227
...
Support compilation with pre-C99 versions of MSVC
2021-05-14 23:28:06 +02:00
Martin Kroeker
c4da892ba0
Only filter out -mavx on Sandybridge ZGEMM/ZTRMM kernels
2021-05-14 23:19:10 +02:00
Martin Kroeker
8b90e5f202
Drop redundant inclusion of complex.h
2021-05-14 15:06:44 +02:00
Martin Kroeker
bd60fb6ffc
filter out -mavx flag on zgemm kernels as it can cause problems with older gcc
2021-05-13 23:05:00 +02:00
Martin Kroeker
37ea8702ee
Merge pull request #3192 from damonyu1989/develop
...
Update the intrinsic api to the official name.
2021-05-11 16:00:45 +02:00
Martin Kroeker
c0ca63ea46
Fix missing conditionals for non-SKX kernels
2021-05-05 14:55:36 +02:00
pnp
3d4ccd2a13
fix for build error
2021-04-30 12:25:33 -04:00
pnp
c59652f0ce
optimize on sgemv_n for small n
2021-04-30 12:14:58 -04:00
Wangyang Guo
aa7b3dc3db
GEMM: skylake: improve the performance when m is small
2021-04-28 13:56:06 +00:00
damonyu
ceb44bef14
update the intrinsic api to the official name.
2021-04-27 11:12:29 +08:00
Martin Kroeker
3d511f0e66
replace spurious avx512 requirement with fma check
2021-04-26 21:55:30 +02:00
Rajalakshmi Srinivasaraghavan
2379abaa5e
POWER10: Improve dgemm performance
...
This patch uses vector pair pointer for input load operation
which helps to generate power10 lxvp instructions.
2021-04-13 22:30:06 -05:00
Rajalakshmi Srinivasaraghavan
55bb9f639a
POWER10: Optimized zgemv
...
This patch makes use of Matrix-Multiply Assist (MMA)
feature introduced in POWER ISA v3.1 for zgemv_n and zgemv_t.
2021-04-10 19:00:24 -05:00
Martin Kroeker
2dfb24730d
Use "old" compute(24) function with clang due to register limitations
2021-04-06 19:58:32 +02:00
Martin Kroeker
147e0a75fd
Merge pull request #3170 from CodesWithWolves/sgemm_tcopy_16-invalid-read
...
Remove Unnecessary/Erroneous Adds/Reads In sgemm_tcopy_16.S COPY1x8 Macro
2021-04-03 19:49:47 +02:00
Rajalakshmi Srinivasaraghavan
2dbcddd83d
POWER10: Adding check for little endian
...
This patch makes sure that recent POWER10 patches are used
only for little endian.
2021-03-31 21:32:42 -05:00
CodesWithWolves
d2bda3b56a
Remove Unnecessary/Erroneous Reads In sgemm_tcopy_16.S COPY1x8 Macro
...
There appears to have been some code leak when copying from the COPY2x8
macro above where we're reading 8 bytes into d4-d7 directly after
reading 4 bytes into s4-s7. These 32 bytes in d4-7 are unused and can
possibly overrun the boundary of allocated memory -- Valgrind detected
this which is what dragged my attention to it for a 128,1 copy.
Additionally, there is no need to update the addresses stored in A0-A7
as the only possible paths after running this macro will overwrite A0-7
if looping to the next 8 rows, or overwrite A0-3 if moving to 4 rows --
in which case A4-7 are unused.
2021-03-31 15:44:25 -04:00
Martin Kroeker
bdd6e3a153
Merge pull request #3157 from martin-frbg/issue3020-final
...
Add workaround for LAPACK testsuite failures with the NVIDIA HPC compiler on PPC
2021-03-19 15:23:12 +01:00
Martin Kroeker
7b8f580941
Merge pull request #3156 from martin-frbg/omatcopy_d
...
Move x86_64 DOMATCOPY_RT back to the C implementation
2021-03-19 15:22:48 +01:00
Martin Kroeker
86c5a0013f
Add workaround for LAPACK testsuite failures with the NVIDIA HPC compiler
2021-03-19 11:47:58 +01:00
Martin Kroeker
ef85c22474
Add workaround for LAPACK test failures with the NVIDIA HPC compiler
2021-03-19 11:46:25 +01:00
Martin Kroeker
d3555d2e50
Add workaround for LAPACK test failures with the NVIDIA HPC compiler
2021-03-19 11:44:31 +01:00
Martin Kroeker
0f5e86a0d9
Remove premature entry for DOMATCOPY_RT
2021-03-18 21:53:50 +01:00
Martin Kroeker
7b294a99fd
Move common.h back to the top of the file so that SKYLAKEX (from config.h) is defined in time
2021-03-18 21:28:19 +01:00
Martin Kroeker
0934568d9c
Move includes under the ifdef for compilers w/o intrinsics support
2021-03-12 12:42:05 +01:00
Rajalakshmi Srinivasaraghavan
09d47af2c0
Optimize zscal function for POWER10
...
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-03-10 17:15:33 -06:00
Martin Kroeker
ef0238ba2b
Merge pull request #3130 from martin-frbg/issue3128
...
Replace spurious AVX512 requirement in the Haswell srot microkernel with an AVX2/FMA3 guard
2021-03-06 19:15:53 +01:00
Martin Kroeker
a9f6f7ad39
Remove spurious AVX512 requirement and add AVX2/FMA3 guard
2021-03-06 14:35:49 +01:00
Rajalakshmi Srinivasaraghavan
41646ed006
Optimize s/dasum function for POWER10
...
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-03-05 16:22:36 -06:00
Rajalakshmi Srinivasaraghavan
0571c3187b
POWER10: Rename mma builtins
...
The LLVM and GCC teams agreed to rename the __builtin_mma_assemble_pair and
__builtin_mma_disassemble_pair built-ins to __builtin_vsx_assemble_pair and
__builtin_vsx_disassemble_pair respectively. This patch is to make
corresponding changes in dgemm kernel. Also made changes in
inputs to those builtins to avoid some potential typecasting issues.
Reference gcc commit id:77ef995c1fbcab76a2a69b9f4700bcfd005d8e62
2021-02-26 20:56:34 -06:00
Martin Kroeker
292d1af1a0
Update omatcopy_rt.c
2021-02-24 09:34:14 +01:00
Martin Kroeker
325b398e3c
Update omatcopy_rt.c
2021-02-24 09:13:12 +01:00
Martin Kroeker
6f5667b4d4
Enable optimized S/D OMATCOPY_RT
2021-02-24 09:03:41 +01:00
Martin Kroeker
cceeee7806
Add optimized omatcopy_rt
2021-02-24 09:00:54 +01:00
Martin Kroeker
0a4546b742
Typo fix
2021-02-23 13:14:35 +01:00
Martin Kroeker
b1eed27a54
Replace naive omatcopy_rt with 4x4 blocked implementation
...
as suggested by MigMuc in issue 2532
2021-02-22 21:35:42 +01:00
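Note on the blocked omatcopy_rt above: a minimal sketch of a 4x4 blocked out-of-place scaled transpose, assuming row-major storage and the operation B = alpha * transpose(A). The function name omatcopy_rt_sketch, its parameter order, and the edge handling are assumptions for illustration and are not taken from the OpenBLAS source.

```c
#include <stddef.h>

#define BLK 4  /* block edge; 4 matches the "4x4 blocked" description */

/* Illustrative sketch only. a is rows x cols with leading dimension lda,
 * b is cols x rows with leading dimension ldb, both row-major.
 * Computes b = alpha * transpose(a) one BLK x BLK tile at a time so that
 * reads and writes stay within a small working set. */
void omatcopy_rt_sketch(size_t rows, size_t cols, double alpha,
                        const double *a, size_t lda,
                        double *b, size_t ldb)
{
    for (size_t i0 = 0; i0 < rows; i0 += BLK) {
        size_t i1 = (i0 + BLK < rows) ? i0 + BLK : rows;  /* clip at edge */
        for (size_t j0 = 0; j0 < cols; j0 += BLK) {
            size_t j1 = (j0 + BLK < cols) ? j0 + BLK : cols;
            for (size_t i = i0; i < i1; i++) {
                for (size_t j = j0; j < j1; j++) {
                    b[j * ldb + i] = alpha * a[i * lda + j];
                }
            }
        }
    }
}
```

The naive version is the same two inner loops without the tiling; the blocked form only changes the traversal order, not the result.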
Martin Kroeker
47691c031f
Use Haswell optimizations for Zen as well
2021-02-11 09:26:15 +01:00
Martin Kroeker
ce7ddd8921
Use Haswell optimizations for Zen as well
2021-02-11 09:25:36 +01:00
Martin Kroeker
950c047b49
Use Haswell optimizations for Zen as well
2021-02-11 09:24:51 +01:00
Martin Kroeker
46509953a9
Use Haswell optimizations for Zen as well
2021-02-11 09:24:16 +01:00
Martin Kroeker
db348dcff2
Enable optimized srot/drot kernels from Haswell
2021-02-11 09:23:05 +01:00
Rajalakshmi Srinivasaraghavan
2056ffc227
Optimize cscal function for POWER10
...
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-29 13:51:43 -06:00
Rajalakshmi Srinivasaraghavan
3ede843d50
Optimize s/dscal function for POWER10
...
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-24 07:48:28 -06:00
Martin Kroeker
69a5558203
Merge pull request #3059 from Guobing-Chen/BF16_gemm
...
Initial code for Cooperlake BF16 GEMM kernel
2021-01-23 19:08:05 +01:00
Martin Kroeker
d6905403e3
Merge pull request #3068 from alexhenrie/scan-build
...
scan-build fixes
2021-01-23 19:06:29 +01:00
Rajalakshmi Srinivasaraghavan
439b93f6d2
Optimize s/drot function for POWER10
...
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-21 13:24:45 -06:00
Rajalakshmi Srinivasaraghavan
eff7c9166e
Optimize cdot function for POWER10
...
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-15 13:40:34 -06:00
Alex Henrie
202fc9e8ed
Fix uninitialized argument value in dasum_k
2021-01-14 19:40:31 -07:00
Martin Kroeker
e378b24487
Merge pull request #3067 from albertziegenhagel/fix-generic-cmake
...
Fix building "generic" TRMM kernel with CMake
2021-01-14 21:35:19 +01:00
Albert Ziegenhagel
e3f4063683
Fix building "generic" TRMM kernel with CMake
...
The CMake "TARGET_CORE" variable stores the "generic" target name in all lowercase letters, but gets compared to an all uppercase string, which results in the wrong TRMM kernel being selected.
This commit converts the TARGET_CORE to all uppercase before comparing its value to make sure case mismatches are not an issue in the future anymore.
2021-01-14 10:00:49 +01:00
Martin Kroeker
b716c0ef01
Add workaround for NVIDIA HPC
2021-01-12 16:51:35 +01:00
Martin Kroeker
2efa3b70dc
Add workaround for NVIDIA HPC
2021-01-12 16:49:39 +01:00
Martin Kroeker
49959d4f1c
Add workaround for NVIDIA HPC
2021-01-12 16:47:15 +01:00
Martin Kroeker
0f27a03607
Add workaround for NVIDIA HPC mishandling of the asm DOT kernels
2021-01-12 16:39:35 +01:00
Martin Kroeker
c2a8ebfe69
Add workaround for NVIDIA HPC mishandling of the asm DOT kernels
2021-01-12 16:38:51 +01:00
Martin Kroeker
43aac5bacc
Support NVIDIA HPC compiler
2021-01-12 16:36:12 +01:00
Chen, Guobing
b0beb0b1ca
Initial code for Cooperlake BF16 GEMM kernel
2021-01-11 02:15:21 +08:00
Rajalakshmi Srinivasaraghavan
601b711c78
Optimize swap function for POWER10
...
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2021-01-08 08:01:36 -06:00
Ashwin Sekhar T K
1b2508362b
arm64: Fix nrm2 for input vectors with Inf
...
Fix double precision nrm2 kernels returning NaN when the
input vectors contain Inf/-Inf.
2021-01-01 02:49:37 -08:00
Martin Kroeker
3559c5d7a2
Merge pull request #3048 from martin-frbg/issue2998
...
Temporarily revert to the old NRM2 kernels for ThunderX2/3 and NeoverseN1
2020-12-21 13:30:08 +01:00
Martin Kroeker
8631e2976a
Temporarily revert to the old nrm2 kernels
2020-12-21 07:45:13 +01:00
Martin Kroeker
2768bc1764
Temporarily revert to the old nrm2 kernels
2020-12-21 07:42:51 +01:00
Martin Kroeker
6f4698ee1f
Temporarily revert to the old nrm2 kernel
2020-12-21 07:41:18 +01:00
Martin Kroeker
114eb159a4
Disable FMA intrinsics in the srot kernel when the compiler is PGI/NVIDIA
2020-12-19 22:15:58 +01:00
Martin Kroeker
005cce5507
Amend SkylakeX options to support the NVIDIA compiler
2020-12-19 22:11:49 +01:00
Xianyi Zhang
a3cac9cca0
Update sgemm kernel 1x4 for C910.
2020-12-18 11:53:23 +08:00
Martin Kroeker
c73d8ee40d
Conditionally add -mfma to compiler options where needed
2020-12-17 11:34:05 +01:00
Rajalakshmi Srinivasaraghavan
2fb11f873b
POWER10: Improve copy performance
...
This patch aligns the stores to 32 byte boundary for scopy and dcopy
before entering into vector pair loop. For ccopy, changed the store
instructions to stxv to improve performance of unaligned cases.
2020-12-13 10:41:45 -06:00
Martin Kroeker
043128cbe5
Merge pull request #3029 from RajalakshmiSR/axpyp10
...
POWER10: Improve axpy performance
2020-12-10 22:49:28 +01:00
Martin Kroeker
3331ca492d
Merge pull request #3021 from austinpagan/trsm_p10
...
POWER: Added special unrolled vectorized versions of "Solve" for specific si…
2020-12-10 19:42:54 +01:00
Rajalakshmi Srinivasaraghavan
346e30a46a
POWER10: Improve axpy performance
...
This patch aligns the stores to 32 byte boundary for saxpy and daxpy
before entering into vector pair loop. For caxpy, changed the store
instructions to stxv to improve performance of unaligned cases.
2020-12-10 11:51:42 -06:00
gxw
4b548857d6
Add msa support for loongson
...
1. Using core loongson3r3 and loongson3r4 for loongson
2. Add DYNAMIC_ARCH for loongson
Change-Id: I1c6b54dbeca3a0cc31d1222af36a7e9bd6ab54c1
2020-12-09 10:28:46 +08:00
Martin Kroeker
7f11e33e8d
Merge pull request #3025 from TiredNotTear/develop
...
MIPS: Fix two bugs
2020-12-08 09:39:27 +01:00
Martin Kroeker
53e0837809
Merge pull request #3022 from jinboson/develop
...
Fix test errors reported by cblas_cgemm & cblas_ctrmm
2020-12-07 08:09:11 +01:00
Hao Chen
ad38bd0e89
Fix failed cgemv and zgemv test case after using msa optimization
...
The cgemv and zgemv test case will call cgemv_n/t_msa.c zgemv_n/t_msa.c files in MIPS environment.
When the macro CONJ is defined, the calculation result will be wrong due to the wrong definition of OP2.
This patch updates the value of OP2 and passes the corresponding test.
2020-12-07 10:25:01 +08:00
Hao Chen
47b639cc9b
Fix failed sswap and dswap case by using msa optimization
...
The swap test case will call sswap_msa.c and dswap_msa.c files in MIPS environment.
When inc_x or inc_y is equal to zero, the calculation result of the two functions will be wrong.
This patch adds the processing of inc_x or inc_y equal to zero, and the swap test case has passed.
2020-12-07 10:24:49 +08:00
Martin Kroeker
b660008c7e
Work around DOT and SWAP test failures
2020-12-06 19:15:37 +01:00
Martin Kroeker
f8346603cf
Fix compilation with SolarisStudio
2020-12-06 19:14:16 +01:00
Jin Bo
65de6f5957
Fix test errors reported by cblas_cgemm & cblas_ctrmm
...
The file cgemm_kernel_8x4_msa.c holds the MSA optimization
codes of cblas_cgemm and cblas_ctrmm. It defines two
macros: CGEMM_SCALE_1X2 and CGEMM_TRMM_SCALE_1X2. The pc1
array index in the two macros should be 0 and 1.
2020-12-05 15:08:17 +08:00