Commit Graph

644 Commits

Author SHA1 Message Date
Martin Kroeker
0f5e86a0d9 Remove premature entry for DOMATCOPY_RT 2021-03-18 21:53:50 +01:00
Martin Kroeker
0934568d9c Move includes under the ifdef for compilers w/o intrinsics support 2021-03-12 12:42:05 +01:00
Martin Kroeker
a9f6f7ad39 Remove spurious AVX512 requirement and add AVX2/FMA3 guard 2021-03-06 14:35:49 +01:00
Martin Kroeker
292d1af1a0 Update omatcopy_rt.c 2021-02-24 09:34:14 +01:00
Martin Kroeker
325b398e3c Update omatcopy_rt.c 2021-02-24 09:13:12 +01:00
Martin Kroeker
6f5667b4d4 Enable optimized S/D OMATCOPY_RT 2021-02-24 09:03:41 +01:00
Martin Kroeker
cceeee7806 Add optimized omatcopy_rt 2021-02-24 09:00:54 +01:00
Martin Kroeker
47691c031f Use Haswell optimizations for Zen as well 2021-02-11 09:26:15 +01:00
Martin Kroeker
ce7ddd8921 Use Haswell optimizations for Zen as well 2021-02-11 09:25:36 +01:00
Martin Kroeker
950c047b49 Use Haswell optimizations for Zen as well 2021-02-11 09:24:51 +01:00
Martin Kroeker
46509953a9 Use Haswell optimizations for Zen as well 2021-02-11 09:24:16 +01:00
Martin Kroeker
db348dcff2 Enable optimized srot/drot kernels from Haswell 2021-02-11 09:23:05 +01:00
Martin Kroeker
69a5558203 Merge pull request #3059 from Guobing-Chen/BF16_gemm
Initial code for Cooperlake BF16 GEMM kernel
2021-01-23 19:08:05 +01:00
Alex Henrie
202fc9e8ed Fix uninitialized argument value in dasum_k 2021-01-14 19:40:31 -07:00
Chen, Guobing
b0beb0b1ca Initial code for Cooperlake BF16 GEMM kernel 2021-01-11 02:15:21 +08:00
Martin Kroeker
114eb159a4 Disable FMA intrinsics in the srot kernel when the compiler is PGI/NVIDIA 2020-12-19 22:15:58 +01:00
Martin Kroeker
441c08c9ff Merge pull request #3016 from xiegengxin/complex-asum
Improve the performance of zasum and casum with AVX512 intrinsic
2020-12-04 22:07:16 +01:00
Gengxin Xie
0cb7a403b2 fix error declare function blas_level1_thread_with_return_value 2020-12-02 09:51:52 +08:00
Gengxin Xie
b766c1e9bb Improve the performance of zasum and casum with AVX512 intrinsic 2020-12-01 16:49:26 +08:00
Martin Kroeker
f1bf040b25 Merge pull request #2988 from xiegengxin/smp-asum
Improve the performance of dasum and sasum when SMP is defined
2020-11-22 12:24:13 +01:00
Gengxin Xie
d6e7e05bb3 Improve the performance of dasum and sasum when SMP is defined 2020-11-13 14:20:52 +08:00
Qiyu8
a87e537b8c modify macro 2020-11-11 15:53:48 +08:00
Qiyu8
5bc0a7583f only FMA3 and vector larger than 128 have positive effects. 2020-11-11 15:18:01 +08:00
Qiyu8
8c0b206d4c Optimize the performance of rot by using universal intrinsics 2020-11-11 14:33:12 +08:00
Martin Kroeker
ff16329cb7 Merge pull request #2972 from xiegengxin/rot-intrinsic
Improve the performance of rot by using AVX512 and AVX2 intrinsic
2020-11-08 22:43:00 +01:00
Gengxin Xie
725ffbf041 fix typo 2020-11-05 16:25:17 +08:00
Gengxin Xie
d9ba49165a Improve the performance of rot by using AVX512 and AVX2 intrinsic 2020-11-05 15:12:36 +08:00
Chen, Guobing
a7b1f9b1bb Implementation of BF16 based gemv
1. Add a new API -- sbgemv to support bfloat16 based gemv
2. Implement a generic kernel for sbgemv
3. Implement an avx512-bf16 based kernel for sbgemv

Signed-off-by: Chen, Guobing <guobing.chen@intel.com>
2020-10-29 02:08:23 +08:00
İsmail Dönmez
4a1d00f589 Fix build with -Werror=return-type
dgemm_tcopy_16_skylakex.c CNAME function should return an int, add a
return 0 similar to other files.
2020-10-21 08:43:39 +02:00
Bart Oldeman
b073d759d0 x86_64: clobber all xmm registers after vzeroupper
As observed using GCC 10 using -march=native -ftree-vectorize
on Knights Landing, it is now smart enough to find clobbers inside
non-inlined static functions.

In particular, sgemv counted on a kernel to preserve the whole
%ymm2 register (since it was not in the clobber list), but the top
part was destroyed by vzeroupper. This caused many tests to fail.

This patch makes sure all xmm (and ymm/zmm by extension) registers
are listed as clobbered to avoid this happening, as most kernels
already did correctly in fact.
2020-10-20 02:16:47 +00:00
Bart Oldeman
03e781b766 sgemm_direct_skylakex: fix 75eeb26 regression.
The
`#if defined(SKYLAKEX) || defined (COOPERLAKE)`
from that commit was before #include "common.h" so caused the
compiled function to be empty, returning garbage results for
qualifying sgemm's on those architectures.

Closes #2914
2020-10-18 19:58:07 +00:00
Martin Kroeker
c339c40c01 Silence a redefinition warning 2020-10-15 19:08:12 +02:00
Qiyu8
bfdf4b56da Add double precision universal intrinsics for X86/ARM 2020-10-15 10:29:42 +08:00
Martin Kroeker
756802df61 Merge pull request #2890 from martin-frbg/s-d-sum
Revert special handling of Windows xNRM2 and enable C+intrinsics kern…
2020-10-14 09:02:03 +02:00
Martin Kroeker
8d2df7d066 Revert special handling of Windows xNRM2 and enable C+intrinsics kernel for SSUM/DSUM 2020-10-13 00:14:29 +02:00
Martin Kroeker
08929430cd Merge pull request #2886 from martin-frbg/issue_2767
Rename "HALF" precision functions (sh prefix) to "BFLOAT16" with "sb" prefix
2020-10-13 00:04:35 +02:00
Martin Kroeker
0c84ffe05f Merge pull request #2881 from mattip/fninit
add fninit to reset fpu registers before assembler routines
2020-10-12 23:50:41 +02:00
Matti Picus
403eb513a0 use emms instead, add WIN guards 2020-10-12 18:15:01 +03:00
Martin Kroeker
dc8a1afa63 Rename "HALF" and "sh" to "BFLOAT16" and "sb" 2020-10-11 23:53:50 +02:00
Martin Kroeker
fd94236042 Rename "HALF" and "sh" to "BFLOAT16" and "sb" 2020-10-11 23:42:07 +02:00
Martin Kroeker
68ce719fac Rename shdot_microk_cooperlake.c to sbdot_microk_cooperlake.c 2020-10-11 23:41:13 +02:00
Martin Kroeker
d7dd9b396c Rename shdot.c to sbdot.c 2020-10-11 23:40:43 +02:00
Martin Kroeker
7812486091 Use generic C for D/Z nrm2 kernels on Windows to work around fpu exception bug 2020-10-06 21:33:16 +02:00
Matti Picus
a5b164946c add fninit to reset fpu registers before assembler routines 2020-10-05 22:13:25 +03:00
Qiyu8
14f7dad3b7 performance improved 2020-09-22 16:52:15 +08:00
Qiyu8
325b539c26 Optimize the performance of daxpy by using universal intrinsics 2020-09-22 10:38:35 +08:00
Martin Kroeker
91c84e1c01 Merge pull request #2796 from Guobing-Chen/BF16_dot_coversion_apis
Add bfloat16 based dot and conversion with single/double
2020-09-14 15:00:19 +02:00
Martin Kroeker
e72430fe46 Merge pull request #2803 from xiegengxin/AVX2-asum
Implementaion of dasum, sasum with AVX2 & AVX512 intrinsic
2020-09-06 18:32:15 +02:00
Chen, Guobing
deaeb6c5b8 Add bfloat16 based dot and conversion with single/double
1. Added bfloat16 based dot as new API: shdot
2. Implemented generic kernel and cooperlake-specific (AVX512-BF16) kernel for shdot
3. Added 4 conversion APIs for bfloat16 data type <=> single/double: shstobf16 shdtobf16 sbf16tos dbf16tod
     shstobf16 -- convert single float array to bfloat16 array
     shdtobf16 -- convert double float array to bfloat16 array
     sbf16tos  -- convert bfloat16 array to single float array
     dbf16tod  -- convert bfloat16 array to double float array
4. Implemented generic kernels for all 4 conversion APIs, and cooperlake-specific kernel for shstobf16 and shdtobf16
5. Update level1 thread facilitate functions and macros to support multi-threading for these new APIs
6. Fix Cooperlake platform detection/specify issue when under dynamic-arch building
7. Change the typedef of bfloat16 from unsigned short to more strict uint16_t

Signed-off-by: Chen, Guobing <guobing.chen@intel.com>
2020-09-04 02:31:25 +08:00
Gengxin Xie
1b0f17eeed align to 64, using SSE when input size is small 2020-09-03 14:25:54 +08:00