Commit Graph

1526 Commits

Author SHA1 Message Date
Zhang Xianyi
d7ba7679b6 Merge branch 'develop' into risc-v 2020-10-16 23:27:38 +08:00
Martin Kroeker
df70667043 fix core list for sse/sse2 2020-10-16 09:55:48 +02:00
Martin Kroeker
f071d1207a add sse2 2020-10-15 22:10:32 +02:00
Martin Kroeker
dc6cefd2f5 Expressly enable -msse for 32bit DYNAMIC_ARCH kernels 2020-10-15 20:16:15 +02:00
Martin Kroeker
c339c40c01 Silence a redefinition warning 2020-10-15 19:08:12 +02:00
Martin Kroeker
10379fc83b Use ifdef instead of if 2020-10-15 19:05:37 +02:00
Martin Kroeker
4c25910da0 Merge pull request #2896 from martin-frbg/intrin-double
Add compiler flag for SSE4 where available
2020-10-15 11:12:35 +02:00
Martin Kroeker
ae6ac83991 Revert "add double precision SSE" 2020-10-15 08:37:02 +02:00
Qiyu8
4fac91ef37 adapt arm platform 2020-10-15 11:08:10 +08:00
Qiyu8
bfdf4b56da Add double precision universal intrinsics for X86/ARM 2020-10-15 10:29:42 +08:00
Martin Kroeker
ebf0470fc2 add sse4.1 for DYNAMIC_ARCH kernels 2020-10-14 20:34:33 +02:00
Martin Kroeker
c9c3ae07af Add double precision operations 2020-10-14 18:10:45 +02:00
Martin Kroeker
756802df61 Merge pull request #2890 from martin-frbg/s-d-sum
Revert special handling of Windows xNRM2 and enable C+intrinsics kern…
2020-10-14 09:02:03 +02:00
Rajalakshmi Srinivasaraghavan
0826d68f93 POWER10: Change the packing format for bfloat16
As the new MMA instructions need the inputs in 4x2 order for bfloat16,
changing the format in copy/packing code.  This avoids permute instructions
in the gemm kernel inner loop.
2020-10-13 16:05:10 -05:00
Rajalakshmi Srinivasaraghavan
b5d30b390d Fix build issues with bfloat16
This patch fixes compilation errors due to recent renaming from SH to SB
with BUILD_BFLOAT16.
2020-10-13 11:00:22 -05:00
Martin Kroeker
fecedc9c69 Add -mssse3 2020-10-13 11:55:41 +02:00
Martin Kroeker
0eacbca85f Add Haswell and Zen to temporary sse3 whitelist 2020-10-13 11:42:39 +02:00
Martin Kroeker
6999086a2b whitelist SANDYBRIDGE for SSE3 2020-10-13 10:32:19 +02:00
Martin Kroeker
8d2df7d066 Revert special handling of Windows xNRM2 and enable C+intrinsics kernel for SSUM/DSUM 2020-10-13 00:14:29 +02:00
Martin Kroeker
08929430cd Merge pull request #2886 from martin-frbg/issue_2767
Rename "HALF" precision functions (sh prefix) to "BFLOAT16" with "sb" prefix
2020-10-13 00:04:35 +02:00
Martin Kroeker
0c84ffe05f Merge pull request #2881 from mattip/fninit
add fninit to reset fpu registers before assembler routines
2020-10-12 23:50:41 +02:00
Matti Picus
403eb513a0 use emms instead, add WIN guards 2020-10-12 18:15:01 +03:00
Qiyu8
0ed1f07660 Optimize the performance of sum by using universal intrinsics 2020-10-12 19:48:53 +08:00
Martin Kroeker
3aecafad80 Change "HALF" and "sh" to "BFLOAT16" and "sb" 2020-10-12 00:00:55 +02:00
Martin Kroeker
756062afa5 Rename "HALF" and "sh" to "BFLOAT16" and "sb" 2020-10-11 23:56:17 +02:00
Martin Kroeker
2061f7fdff Rename "HALF" and "sh" to "BFLOAT16" and "sb" 2020-10-11 23:54:53 +02:00
Martin Kroeker
dc8a1afa63 Rename "HALF" and "sh" to "BFLOAT16" and "sb" 2020-10-11 23:53:50 +02:00
Martin Kroeker
fd94236042 Rename "HALF" and "sh" to "BFLOAT16" and "sb" 2020-10-11 23:42:07 +02:00
Martin Kroeker
68ce719fac Rename shdot_microk_cooperlake.c to sbdot_microk_cooperlake.c 2020-10-11 23:41:13 +02:00
Martin Kroeker
d7dd9b396c Rename shdot.c to sbdot.c 2020-10-11 23:40:43 +02:00
Martin Kroeker
9ae80490e0 rename "HALF" and "sh" to "BFLOAT16" and "sb" 2020-10-11 23:39:42 +02:00
Martin Kroeker
d314d1f49f Rename shgemm_kernel_power10.c to sbgemm_kernel_power10.c 2020-10-11 23:37:38 +02:00
Martin Kroeker
c589c3e2a1 Merge pull request #2882 from martin-frbg/issue2709
Use generic C for (D/Z)NRM2 on Windows x86_64
2020-10-11 22:22:30 +02:00
Martin Kroeker
ec638a82bf Merge pull request #2852 from martin-frbg/issue2588-cmake
Support building only a subset of variable types
2020-10-11 22:21:33 +02:00
Martin Kroeker
6b6adf8a4a Allow compiling only a subset of kernels for specific variable types 2020-10-11 14:52:09 +02:00
Martin Kroeker
ac653c94f3 Merge branch 'develop' into issue2588-cmake 2020-10-11 13:57:07 +02:00
Martin Kroeker
7a53128481 Add whitelist of DYNAMIC_ARCH kernels for which -msse3 needs to be enabled 2020-10-11 01:06:46 +02:00
Martin Kroeker
e1b7123bbe Merge pull request #2867 from Qiyu8/usimd-floatdot
Optimize the performance of dot by using universal intrinsics in X86/ARM
2020-10-10 12:10:25 +02:00
Qiyu8
f32d34a015 add sse3 compiler flag 2020-10-10 10:36:15 +08:00
Martin Kroeker
7812486091 Use generic C for D/Z nrm2 kernels on Windows to work around fpu exception bug 2020-10-06 21:33:16 +02:00
Matti Picus
a5b164946c add fninit to reset fpu registers before assembler routines 2020-10-05 22:13:25 +03:00
User User-User
d2333e7842 aarch64 fix std=c18 compilation 2020-10-03 18:00:34 +03:00
Qiyu8
60e6c68e38 Adapt ARM architect 2020-09-29 16:36:14 +08:00
Qiyu8
1b1a757f5f Optimize the performance of dot by using universal intrinsics in X86/ARM 2020-09-28 20:36:53 +08:00
Rajalakshmi Srinivasaraghavan
2df4235e00 Optimize dcopy/zcopy for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores. Tested in simulator and no new failures.
2020-09-27 21:42:32 -05:00
Martin Kroeker
dfbc62ef7e Support building only a subset of types 2020-09-22 23:25:59 +02:00
Qiyu8
14f7dad3b7 performance improved 2020-09-22 16:52:15 +08:00
Qiyu8
325b539c26 Optimize the performance of daxpy by using universal intrinsics 2020-09-22 10:38:35 +08:00
Marius Hillenbrand
22aa81f3e5 s390x: fix cscal and zscal implementations
The implementation of complex scalar * vector multiplication for Z14
makes some LAPACK tests fail because the numerical differences to the
reference implementation exceed the threshold (as can be seen by running
make lapack-test and replacing kernel/zarch/cscal.c with a generic
implementation for comparison).

The complex multiplication uses terms of the form a * b + c * d for both
real and imaginary parts. The assembly code (and compiler-emitted code
as well) uses fused multiply add operations for the second product and
sum. The results can be "surprising", for example when both terms in the
imaginary part nearly cancel each other out. In that case, the second
product contributes more digits to the sum than the first product that
has been rounded before.

One option is to use separate multiplications (which then round the same
way) and a distinct add. Change the code to pursue that path, by (1)
requesting the compiler not to contract the operations into FMAs and (2)
replacing the assembly kernel with corresponding vectorized C code
(where change 1 also applies).

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-21 13:10:05 +02:00
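
The rounding effect described in commit 22aa81f3e5 can be reproduced outside the s390x kernels. What follows is a minimal Python sketch, not the OpenBLAS code. The helper names are hypothetical, and the fused multiply-add is emulated with exact rational arithmetic (`fractions.Fraction`) rather than a hardware instruction. The operands are chosen so that the two products in a * b + c * d cancel exactly in real arithmetic.

```python
from fractions import Fraction


def fused_multiply_add(x: float, y: float, z: float) -> float:
    """Emulate fma(x, y, z): form x*y + z exactly, then round once to a double.

    Assumption: this stands in for a hardware FMA; it is not a library API.
    """
    return float(Fraction(x) * Fraction(y) + Fraction(z))


def sum_of_products_fma(a: float, b: float, c: float, d: float) -> float:
    # The first product a*b is rounded to a double; the second product c*d
    # is fused into the add and therefore enters the sum unrounded.
    return fused_multiply_add(c, d, a * b)


def sum_of_products_separate(a: float, b: float, c: float, d: float) -> float:
    # Both products are rounded the same way before a distinct add.
    return a * b + c * d


# a*b = 1 + 2**-29 + 2**-60 exactly, which does not fit in a 53-bit significand.
a = b = 1.0 + 2.0**-30
c, d = -a, b  # c*d is the exact negative of a*b

fma_result = sum_of_products_fma(a, b, c, d)
separate_result = sum_of_products_separate(a, b, c, d)

print("fused second product:", fma_result)
print("separate products:   ", separate_result)
```

With separate multiplications the two rounded products cancel to zero. In the fused form, what remains is the rounding error of the first product (here -2**-60), which is the kind of "surprising" result the commit message refers to when both terms nearly cancel.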
Marius Hillenbrand
f91057cbad s390x: move common vector definitions and utils into header
... to facilitate reuse beyond gemm_vec.c and avoid code duplication.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-21 11:32:08 +02:00