789 Commits

Author SHA1 Message Date
Martin Kroeker 6f5667b4d4
Enable optimized S/D OMATCOPY_RT 2021-02-24 09:03:41 +01:00
Martin Kroeker cceeee7806
Add optimized omatcopy_rt 2021-02-24 09:00:54 +01:00
Martin Kroeker 47691c031f
Use Haswell optimizations for Zen as well 2021-02-11 09:26:15 +01:00
Martin Kroeker ce7ddd8921
Use Haswell optimizations for Zen as well 2021-02-11 09:25:36 +01:00
Martin Kroeker 950c047b49
Use Haswell optimizations for Zen as well 2021-02-11 09:24:51 +01:00
Martin Kroeker 46509953a9
Use Haswell optimizations for Zen as well 2021-02-11 09:24:16 +01:00
Martin Kroeker db348dcff2
Enable optimized srot/drot kernels from Haswell 2021-02-11 09:23:05 +01:00
Martin Kroeker 69a5558203
Merge pull request #3059 from Guobing-Chen/BF16_gemm
Initial code for Cooperlake BF16 GEMM kernel
2021-01-23 19:08:05 +01:00
Alex Henrie 202fc9e8ed Fix uninitialized argument value in dasum_k 2021-01-14 19:40:31 -07:00
Chen, Guobing b0beb0b1ca Initial code for Cooperlake BF16 GEMM kernel 2021-01-11 02:15:21 +08:00
Martin Kroeker 114eb159a4
Disable FMA intrinsics in the srot kernel when the compiler is PGI/NVIDIA 2020-12-19 22:15:58 +01:00
Martin Kroeker 441c08c9ff
Merge pull request #3016 from xiegengxin/complex-asum
Improve the performance of zasum and casum with AVX512 intrinsic
2020-12-04 22:07:16 +01:00
Gengxin Xie 0cb7a403b2 fix error declare function blas_level1_thread_with_return_value 2020-12-02 09:51:52 +08:00
Gengxin Xie b766c1e9bb Improve the performance of zasum and casum with AVX512 intrinsic 2020-12-01 16:49:26 +08:00
Martin Kroeker f1bf040b25
Merge pull request #2988 from xiegengxin/smp-asum
Improve the performance of dasum and sasum when SMP is defined
2020-11-22 12:24:13 +01:00
Gengxin Xie d6e7e05bb3 Improve the performance of dasum and sasum when SMP is defined 2020-11-13 14:20:52 +08:00
Qiyu8 a87e537b8c modify macro 2020-11-11 15:53:48 +08:00
Qiyu8 5bc0a7583f only FMA3 and vector larger than 128 have positive effects. 2020-11-11 15:18:01 +08:00
Qiyu8 8c0b206d4c Optimize the performance of rot by using universal intrinsics 2020-11-11 14:33:12 +08:00
Martin Kroeker ff16329cb7
Merge pull request #2972 from xiegengxin/rot-intrinsic
Improve the performance of rot by using AVX512 and AVX2 intrinsic
2020-11-08 22:43:00 +01:00
Gengxin Xie 725ffbf041 fix typo 2020-11-05 16:25:17 +08:00
Gengxin Xie d9ba49165a Improve the performance of rot by using AVX512 and AVX2 intrinsic 2020-11-05 15:12:36 +08:00
Chen, Guobing a7b1f9b1bb Implementation of BF16 based gemv
1. Add a new API -- sbgemv to support bfloat16 based gemv
2. Implement a generic kernel for sbgemv
3. Implement an avx512-bf16 based kernel for sbgemv

Signed-off-by: Chen, Guobing <guobing.chen@intel.com>
2020-10-29 02:08:23 +08:00
İsmail Dönmez 4a1d00f589
Fix build with -Werror=return-type
dgemm_tcopy_16_skylakex.c CNAME function should return an int, add a
return 0 similar to other files.
2020-10-21 08:43:39 +02:00
Bart Oldeman b073d759d0 x86_64: clobber all xmm registers after vzeroupper
As observed using GCC 10 using -march=native -ftree-vectorize
on Knights Landing, it is now smart enough to find clobbers inside
non-inlined static functions.

In particular, sgemv counted on a kernel to preserve the whole
%ymm2 register (since it was not in the clobber list), but the top
part was destroyed by vzeroupper. This caused many tests to fail.

This patch makes sure all xmm (and ymm/zmm by extension) registers
are listed as clobbered to avoid this happening, as most kernels
already did correctly in fact.
2020-10-20 02:16:47 +00:00
Bart Oldeman 03e781b766 sgemm_direct_skylakex: fix 75eeb26 regression.
The
`#if defined(SKYLAKEX) || defined (COOPERLAKE)`
from that commit was before #include "common.h" so caused the
compiled function to be empty, returning garbage results for
qualifying sgemm's on those architectures.

Closes #2914
2020-10-18 19:58:07 +00:00
Martin Kroeker c339c40c01
Silence a redefinition warning 2020-10-15 19:08:12 +02:00
Qiyu8 bfdf4b56da Add double precision universal intrinsics for X86/ARM 2020-10-15 10:29:42 +08:00
Martin Kroeker 756802df61
Merge pull request #2890 from martin-frbg/s-d-sum
Revert special handling of Windows xNRM2 and enable C+intrinsics kern…
2020-10-14 09:02:03 +02:00
Martin Kroeker 8d2df7d066
Revert special handling of Windows xNRM2 and enable C+intrinsics kernel for SSUM/DSUM 2020-10-13 00:14:29 +02:00
Martin Kroeker 08929430cd
Merge pull request #2886 from martin-frbg/issue_2767
Rename "HALF" precision functions (sh prefix) to "BFLOAT16" with "sb" prefix
2020-10-13 00:04:35 +02:00
Martin Kroeker 0c84ffe05f
Merge pull request #2881 from mattip/fninit
add fninit to reset fpu registers before assembler routines
2020-10-12 23:50:41 +02:00
Matti Picus 403eb513a0 use emms instead, add WIN guards 2020-10-12 18:15:01 +03:00
Martin Kroeker dc8a1afa63
Rename "HALF" and "sh" to "BFLOAT16" and "sb" 2020-10-11 23:53:50 +02:00
Martin Kroeker fd94236042
Rename "HALF" and "sh" to "BFLOAT16" and "sb" 2020-10-11 23:42:07 +02:00
Martin Kroeker 68ce719fac
Rename shdot_microk_cooperlake.c to sbdot_microk_cooperlake.c 2020-10-11 23:41:13 +02:00
Martin Kroeker d7dd9b396c
Rename shdot.c to sbdot.c 2020-10-11 23:40:43 +02:00
Martin Kroeker 7812486091
Use generic C for D/Z nrm2 kernels on Windows to work around fpu exception bug 2020-10-06 21:33:16 +02:00
Matti Picus a5b164946c add fninit to reset fpu registers before assembler routines 2020-10-05 22:13:25 +03:00
Qiyu8 14f7dad3b7 performance improved 2020-09-22 16:52:15 +08:00
Qiyu8 325b539c26 Optimize the performance of daxpy by using universal intrinsics 2020-09-22 10:38:35 +08:00
Martin Kroeker 91c84e1c01
Merge pull request #2796 from Guobing-Chen/BF16_dot_coversion_apis
Add bfloat16 based dot and conversion with single/double
2020-09-14 15:00:19 +02:00
Martin Kroeker e72430fe46
Merge pull request #2803 from xiegengxin/AVX2-asum
Implementation of dasum, sasum with AVX2 & AVX512 intrinsic
2020-09-06 18:32:15 +02:00
Chen, Guobing deaeb6c5b8 Add bfloat16 based dot and conversion with single/double
1. Added bfloat16 based dot as new API: shdot
2. Implemented generic kernel and cooperlake-specific (AVX512-BF16) kernel for shdot
3. Added 4 conversion APIs for bfloat16 data type <=> single/double: shstobf16 shdtobf16 sbf16tos dbf16tod
     shstobf16 -- convert single float array to bfloat16 array
     shdtobf16 -- convert double float array to bfloat16 array
     sbf16tos  -- convert bfloat16 array to single float array
     dbf16tod  -- convert bfloat16 array to double float array
4. Implemented generic kernels for all 4 conversion APIs, and cooperlake-specific kernel for shstobf16 and shdtobf16
5. Update level1 thread facilitate functions and macros to support multi-threading for these new APIs
6. Fix Cooperlake platform detection/specify issue when under dynamic-arch building
7. Change the typedef of bfloat16 from unsigned short to more strict uint16_t

Signed-off-by: Chen, Guobing <guobing.chen@intel.com>
2020-09-04 02:31:25 +08:00
Gengxin Xie 1b0f17eeed align to 64, using SSE when input size is small 2020-09-03 14:25:54 +08:00
Gengxin Xie 448152cdd8 define __AVX2__ to ensure the haswell code compiled with avx2 2020-08-31 14:39:08 +08:00
Gengxin Xie cb3c190a3a Implementation of dasum, sasum with AVX2 & AVX512 intrinsic 2020-08-31 11:44:08 +08:00
Martin Kroeker b2053239fc
Fix missing dummy parameter (imag part of alpha) of zdot_thread_function 2020-08-23 15:08:16 +02:00
Martin Kroeker 9ee21a0a39
Merge pull request #2780 from Guobing-Chen/CPL_build_support
Enable COOPERLAKE build target
2020-08-20 19:54:29 +02:00
Martin Kroeker 75eeb265d7
[WIP] Refactor the driver code for direct SGEMM (#2782)
Move "direct SGEMM" functionality out of the SkylakeX SGEMM kernel and make it available
(on x86_64 targets only for now) in DYNAMIC_ARCH builds
* Add sgemm_direct targets in the kernel Makefile.L3 and CMakeLists.txt
* Add direct_sgemm functions to the gotoblas struct in common_param.h
* Move sgemm_direct_performant helper to separate file
* Update gemm.c to use macros for sgemm_direct to support dynamic_arch naming via common_s.h
* (Conditionally) add sgemm_direct functions in setparam-ref.c
2020-08-19 14:51:09 +02:00
Chen, Guobing e740c4873d Enable COOPERLAKE build target
Enable new build target platform -- COOPERLAKE. This target platform
supports all the SKYLAKEX supported ISAs + avx512bf16. So all the
SKYLAKEX specific kernels/drivers and related code are now extended
to be also active on COOPERLAKE. Besides, new BF16 related kernels
are active under this target.
2020-08-13 06:18:00 +08:00
Martin Kroeker 81dcfdcf39
Multiply by 2 instead of left-shifting a potentially negative number
fixes GCC ubsan warning in the BLAS tests
2020-08-02 18:29:56 +02:00
Martin Kroeker 0ef4b3f1f2
Multiply instead of doing a left shift of a potentially negative number
fixes GCC ubsan report in the BLAS tests
2020-08-02 18:27:40 +02:00
Martin Kroeker aa53a8a5cb
Multiply by two instead of left-shifting one place
fixes GCC ubsan report of "left shift of negative value -2" in the BLAS tests
2020-08-02 18:25:09 +02:00
Martin Kroeker aa3a1e7d8c
Multiply by two rather than left shift by one place
fixes GCC ubsan report of "left shift of negative value -2" in the BLAS tests
2020-08-02 18:22:31 +02:00
Martin Kroeker e30ad0e521
Strip UTF8 byte order marker from source 2020-06-26 09:00:43 +02:00
Martin Kroeker 93592d1260
Merge pull request #2675 from wjc404/develop
AVX512 DGEMM TCOPY_16 Function
2020-06-23 09:29:02 +02:00
wjc404 086d87a302
AVX512 dgemm tcopy_16 function 2020-06-20 00:07:43 +08:00
Martin Kroeker c3574ffe53
Merge pull request #2646 from wjc404/develop
Optimize AVX512 parallel DGEMM performance
2020-06-07 13:18:22 +02:00
wjc404 0e3ac4a06b
Add files via upload 2020-06-06 14:56:57 +08:00
Martin Kroeker 2271c3506b
Work around excessive LAPACK test failures on Skylake-X
Something in the plain C parts of x86_64 cscal.c and zscal.c appears to be miscompiled by both gfortran9 and ifort when compiling for skylakex-avx512, even when the optimized Haswell microkernel is not in use.
2020-05-09 23:49:18 +02:00
Martin Kroeker 90dba9f716
Duplicate earlier Clang 9.0.0 workaround for corresponding Apple Clang version
As discussed on the original PR #2329, the "Apple Clang 11.0.3" that appears to be based on the same LLVM release produces the same miscompilation of this file.
2020-05-05 10:44:50 +02:00
Martin Kroeker 5b0093b5fe
Convert aligned moves to unaligned
should have no performance impact on reasonably modern cpus and fixes occasional crashes in actual user code.
2020-04-13 14:58:52 +02:00
Martin Kroeker 567d2760e6
Merge pull request #2520 from wjc404/develop
Fix avx512 sgemm performance bug when ldc is a multiple of 1024
2020-03-30 20:15:59 +02:00
wjc404 b8307768e2
Add files via upload 2020-03-21 05:42:10 +08:00
Martin Kroeker af8a619e1f
Merge pull request #2517 from wjc404/develop
Temporary fix for SKX STRSM
2020-03-17 10:12:53 +01:00
wjc404 62b9608986
Update KERNEL.SKYLAKEX 2020-03-17 12:52:55 +08:00
Martin Kroeker a1b181cea2
Merge pull request #2516 from wjc404/develop
AVX2 STRSM kernels
2020-03-16 21:58:34 +01:00
wjc404 cdc0e9011e
Update KERNEL.ZEN 2020-03-16 16:39:37 +00:00
wjc404 fa049d49c2
AVX2 STRSM kernel 2020-03-17 00:34:08 +08:00
Martin Kroeker ea8eec5d17
Merge pull request #2422 from wjc404/develop
Adjust SkylakeX GEMM3M parameters, add an AVX512 STRMM kernel and fix performance bugs in AVX2 s/c/z GEMM
2020-02-29 19:07:35 +01:00
wjc404 dd22eb7621
Update cgemm_kernel_8x2_haswell.c 2020-02-27 22:26:15 +08:00
wjc404 2352331e60
Update zgemm_kernel_4x2_haswell.c 2020-02-27 22:25:19 +08:00
wjc404 1b980001dd
Update zgemm_kernel_4x2_haswell.c 2020-02-26 18:38:12 +08:00
wjc404 2515e1152f
Update cgemm_kernel_8x2_haswell.c 2020-02-26 18:36:54 +08:00
wjc404 903854c168
Add files via upload 2020-02-22 23:40:02 +08:00
wjc404 a2ff577a30
Update KERNEL.ZEN 2020-02-22 23:39:43 +08:00
wjc404 97a32cb0a5
Update KERNEL.HASWELL 2020-02-22 23:39:20 +08:00
Martin Liska aeea14ee40
Come up with LOAD_AND_COMPARE_TO_MXX macro in iamax_sse.S. 2020-02-17 09:01:53 +01:00
Martin Liska 18bcc36a69
Fix implementation of iamax_sse.S as reported in #2116.
There was a typo in iamax_sse.S where one of the comparisons
was cmpeqps instead of cmpeqss. This caused the wrong index to be detected
for sequences where the minimum value was 0.
2020-02-17 09:01:53 +01:00
wjc404 f566787e6e
Update KERNEL.SKYLAKEX 2020-02-16 22:58:44 +08:00
wjc404 e3368cbf18
AVX512 STRMM kernel 2020-02-16 22:58:00 +08:00
Bart Oldeman 7ea5e07d1c Fix inline asm in dscal: mark x, x1 as clobbered. Fixes #2408
The leaq instructions in dscal_kernel_inc_8 modify x and x1 so they
must be declared as input/output constraints, otherwise the compiler
may assume the corresponding registers are not modified.
2020-02-12 14:11:44 +00:00
wjc404 3447d04eaf
Update dgemm_kernel_16x2_skylakex.c 2020-02-06 02:14:10 +00:00
wjc404 8b5cdcc64c
Update sgemm_kernel_8x4_haswell.c 2020-02-06 01:47:46 +00:00
wjc404 4e00d96a78
Update dgemm_kernel_16x2_skylakex.c 2020-02-06 01:46:36 +00:00
wjc404 096da2f51a
Update dgemm_kernel_16x2_skylakex.c 2020-02-05 13:36:57 +08:00
wjc404 081b188529
Update KERNEL.SKYLAKEX 2020-02-03 21:38:08 +08:00
wjc404 8019e70211
AVX512 16x2 DGEMM kernel 2020-02-03 21:32:56 +08:00
wjc404 e5dcdeb550
Update sgemm_direct_skylakex.c 2020-01-13 16:59:23 +08:00
wjc404 952cc2ba38
Update sgemm_kernel_16x4_skylakex_2.c 2020-01-13 16:58:54 +08:00
wjc404 feaafbedd3
make skylakex sgemm code more friendly for readers
BTW some kernels were adjusted to improve performance
2020-01-13 16:28:41 +08:00
wjc404 3a100b2797
Update KERNEL.SKYLAKEX 2020-01-09 13:48:41 +08:00
wjc404 bd4c032f52
Update sgemm_kernel_8x4_haswell.c 2020-01-07 11:22:46 +08:00
wjc404 9dc9b7b95e
Update sgemm_kernel_8x4_haswell.c 2020-01-06 20:11:36 +08:00
wjc404 92b10212de
optimize AVX2 SGEMM 2020-01-06 12:11:21 +08:00
wjc404 b73bf01378
optimize AVX2 SGEMM 2020-01-06 12:09:14 +08:00
wjc404 eb3c9f1db9
optimize AVX2 SGEMM 2020-01-06 12:07:02 +08:00
wjc404 a0f0a802fc
Update zgemm3m_kernel_4x4_haswell.c 2019-12-30 17:33:42 +08:00
wjc404 700fe5b5ee
Add files via upload 2019-12-30 17:18:59 +08:00
wjc404 f60840c420
Update KERNEL.ZEN 2019-12-30 16:04:23 +08:00
wjc404 109e18cd96
Update KERNEL.HASWELL 2019-12-30 16:03:24 +08:00
wjc404 ae1579be13
Create zgemm3m_kernel_4x4_haswell.c 2019-12-30 16:02:51 +08:00
wjc404 cd765f094b
Update cgemm3m_kernel_8x4_haswell.c 2019-12-27 18:23:29 +08:00
wjc404 3a66c8cac1
Update KERNEL.ZEN 2019-12-27 18:04:08 +08:00
wjc404 ed9af2f7da
Update KERNEL.HASWELL 2019-12-27 18:01:38 +08:00
wjc404 5fd1edead9
Create cgemm3m_kernel_8x4_haswell.c 2019-12-27 18:00:55 +08:00
wjc404 eeecd623d8
Update cgemm_kernel_8x2_haswell.c 2019-12-24 00:40:16 +08:00
wjc404 2cd9306bb5
Update KERNEL.ZEN 2019-12-23 23:42:30 +08:00
wjc404 c418c81224
Update KERNEL.HASWELL 2019-12-23 23:41:44 +08:00
wjc404 025741f16a
Fast Haswell CGEMM kernel 2019-12-23 23:40:03 +08:00
wjc404 f41d52665d
Fast Haswell ZGEMM kernel 2019-12-21 14:37:06 +08:00
wjc404 d573d24de7
Fast Haswell ZGEMM kernel 2019-12-21 14:35:15 +08:00
Isuru Fernando b863b32ac5 Workaround an ICE in clang 9.0.0
This bug is not there in 8.x nor in the 9.0 daily snapshot.
2019-12-01 12:59:46 -06:00
wjc404 934e601e93
Update dgemm_kernel_4x8_skylakex_2.c 2019-11-28 19:56:35 +08:00
wjc404 eb1e9c8c92
some optimizations 2019-11-26 14:12:20 +08:00
Wang, Long bfb5fbdb4d Revised Windows compatibility fix for #2313
Signed-off-by: Wang, Long <long1.wang@intel.com>
2019-11-21 10:22:58 +08:00
Wang, Long 1191db1a49 For the sake of Windows compatibility, use "unsigned long long" to ensure 64-bit length
Signed-off-by: Wang, Long <long1.wang@intel.com>
2019-11-20 21:30:47 +08:00
Wang, Long 0caf1434c9 Fix the integer overflow issue for large matrix size
For large matrices, e.g. M=N=K with M>1290, int mnk=M*N*K will overflow.
This leads to wrongly branching to the single-threaded path, and performance
degrades significantly.

Signed-off-by: Wang, Long <long1.wang@intel.com>
2019-11-20 14:11:17 +08:00
wjc404 819e852ae7
AVX512 CGEMM & ZGEMM kernels
96-99% 1-thread performance of MKL2018
2019-11-11 20:04:52 +08:00
wjc404 836c414e22
optimizations of software prefetching 2019-11-05 13:36:56 +08:00
wjc404 430c11e135
Add files via upload 2019-11-04 20:10:12 +08:00
wjc404 fbacd2605d
optimizations via software prefetches 2019-11-04 19:37:19 +08:00
wjc404 1df9a2013d
new sgemm kernel for skylakex 2019-11-02 00:00:48 +08:00
wjc404 6ff013bae0
native support for icopy_4
90% MKL 1-thread performance.
2019-10-19 03:54:44 +08:00
wjc404 0d669e04bb
Update dgemm_kernel_8x8_skylakex.c 2019-10-18 15:00:17 +08:00
wjc404 17cdd9f9e1
some correction 2019-10-18 14:58:07 +08:00
wjc404 6bcb06fcb1
make further changes to icopy_8 easier 2019-10-18 10:47:31 +08:00
wjc404 b7315f8401
Add files via upload 2019-10-16 19:23:36 +08:00
wjc404 9b19e9e1b0
Update dgemm_kernel_8x8_skylakex.c 2019-10-16 10:14:51 +08:00
wjc404 6bd67ddbab
Update dgemm_kernel_8x8_skylakex.c 2019-10-16 03:20:08 +08:00
wjc404 844629af57
Add files via upload 2019-10-16 02:00:34 +08:00
Martin Kroeker 11c59acfb1
Keep both PGI/SUN and default code paths to avoid breaking Clang/Windows 2019-08-28 18:07:44 +02:00
Martin Kroeker 3a55dca2dc
Make x86_64 zdot compile with PGI and Sun C again
broken by #2222 as CREAL,CIMAG do not expand to a valid lvalue with these compilers
2019-08-28 11:35:31 +02:00
Martin Kroeker 9ef96b32a6
Add multithreading support to the x86_64 zdot kernel (#2222)
* Add multithreading support

copied from the ThunderX2T99 kernel. For #2221
2019-08-15 22:09:12 +02:00
Martin Kroeker dccff2e785
Merge pull request #2206 from martin-frbg/zen-dtrmm
Replace vpermpd with vpermilpd in the Haswell DTRMM kernel
2019-08-09 07:55:20 +02:00
Martin Kroeker 5c3458a6e7
Merge pull request #2199 from martin-frbg/zen-dtrsm
Replace most vpermpd calls in the Haswell DTRSM_RN kernel
2019-08-09 07:55:02 +02:00
Martin Kroeker acf6002ab2
Replace most vpermpd calls in the Haswell DTRSM_RN kernel 2019-08-03 12:40:13 +02:00
Martin Kroeker 2dfb804cb9
Replace vpermpd with vpermilpd in the Haswell DTRMM kernel
to improve performance on AMD Zen (#2180) applying wjc404's improvement of the DGEMM kernel from #2186
2019-07-28 23:17:28 +02:00
Martin Kroeker 4c153ec9da
Merge pull request #2196 from wjc404/develop
Add vbroadcastsd kernel to dgemm_kernel_4x8_haswell.S
2019-07-28 23:11:40 +02:00
wjc404 7eecd8e39c
Add files via upload 2019-07-28 07:39:09 +08:00
Martin Kroeker 7b0b7c11d2
Merge pull request #2190 from martin-frbg/zdot-zen
Replace vpermpd with vpermilpd in the Haswell/Zen zdot microkernel
2019-07-23 16:15:08 +02:00
Martin Kroeker 28e96458e5
Replace vpermpd with vpermilpd
to improve performance on Zen/Zen2 (as demonstrated by wjc404 in #2180)
2019-07-22 08:28:16 +02:00
wjc404 95fb98f556
Update dgemm_kernel_4x8_haswell.S 2019-07-21 01:10:32 +08:00
wjc404 4801c6d36b
Update dgemm_kernel_4x8_haswell.S 2019-07-21 00:47:45 +08:00
wjc404 9440fa607d
Add files via upload 2019-07-20 22:08:22 +08:00
wjc404 94db259e5b
Add files via upload 2019-07-20 22:04:41 +08:00
wjc404 f49f8047ac
Add files via upload 2019-07-20 14:33:37 +08:00
wjc404 825777faab
Update dgemm_kernel_4x8_haswell.S 2019-07-19 23:58:24 +08:00
wjc404 9c89757562
Add files via upload 2019-07-19 23:47:58 +08:00
wjc404 9b04baeaee
Update dgemm_kernel_4x8_haswell.S 2019-07-17 23:50:03 +08:00
wjc404 8a074b3965
Update dgemm_kernel_4x8_haswell.S 2019-07-17 23:47:30 +08:00
wjc404 211ab03b14
Update dgemm_kernel_4x8_haswell.S 2019-07-17 22:39:15 +08:00
wjc404 1733f927e6
Update dgemm_kernel_4x8_haswell.S 2019-07-17 21:27:41 +08:00
wjc404 182b06d6ad
Update dgemm_kernel_4x8_haswell.S 2019-07-17 17:02:35 +08:00
wjc404 7a9050d681
Update dgemm_kernel_4x8_haswell.S 2019-07-17 00:55:06 +08:00
wjc404 0ba29fd262
Update dgemm_kernel_4x8_haswell.S for zen2
replaced a bunch of vpermpd instructions with vpermilpd and vperm2f128
2019-07-17 00:46:51 +08:00
Martin Kroeker 9ea30f3788
Replace ISMIN and ISAMIN kernels on all x86_64 platforms (#2125)
* Mark iamax_sse.S as unsuitable for MIN due to issue #2116
* Use iamax.S rather than iamax_sse.S for ISMIN/ISAMIN on all x86_64 as workaround for #2116
2019-05-09 14:42:36 +02:00
Martin Kroeker b1561ecc68
Disable DGEMMINCOPY as well for now
#1955
2019-05-05 15:52:01 +02:00
Martin Kroeker 7ed8431527
Disable the SkyLakeX DGEMMITCOPY kernel as well
as a stopgap measure for https://github.com/numpy/numpy/issues/13401 as mentioned in #1955
2019-05-04 22:54:41 +02:00
Martin Kroeker c04a729081
Add ?sum definitions for generic kernel 2019-03-31 13:55:49 +02:00
Martin Kroeker 9d717cb5ee
Add x86_64 implementation of ?sum
as trivial copy of ?asum with the fabs calls removed
2019-03-30 22:27:04 +01:00
Martin Kroeker 32c7063cb0
Merge pull request #2061 from martin-frbg/martin-frbg-patch-1
Disable the AVX512 DGEMM kernel (again)
2019-03-30 21:21:38 +01:00
Martin Kroeker e608d4f7fe
Disable the AVX512 DGEMM kernel (again)
Due to as yet unresolved errors seen in #1955 and #2029
2019-03-13 22:10:28 +01:00
Celelibi b7f59da42d Fix crash in sgemm SSE/nano kernel on x86_64
Fix bug #2047.

Signed-off-by: Celelibi <celelibi@gmail.com>
2019-03-07 16:55:13 +01:00
Andrew 6eee1beac5 move fix to right place 2019-02-24 20:41:02 +02:00
Martin Kroeker e12cdf58ef
Merge pull request #2024 from martin-frbg/gcc9fixes4
Fix inline assembly constraints in Bulldozer TRSM kernels
2019-02-17 11:49:15 +01:00
Martin Kroeker 1860c9456d
Merge pull request #2023 from martin-frbg/gcc9fixes3
Fix inline assembly constraints in various x86_64 GEMVN kernels
2019-02-17 11:48:57 +01:00
Martin Kroeker f9bb76d29a
Fix inline assembly constraints in Bulldozer TRSM kernels
rework indices to allow marking i,as and bs as both input and output (marked operand n1 as well for simplicity). For #2009
2019-02-16 20:06:48 +01:00
Martin Kroeker efb9038f72
Fix inline assembly constraints 2019-02-16 18:46:17 +01:00
Martin Kroeker e976557d29
Fix inline assembly constraints
rework indices to allow marking argument lda as input and output.
2019-02-16 18:36:39 +01:00
Martin Kroeker 9d8be15789
Fix inline assembly constraints
rework indices to allow marking argument lda4 as input and output. For #2009
2019-02-16 18:24:11 +01:00
Martin Kroeker d752799a0f
Merge pull request #2021 from martin-frbg/gcc9fixes2
Fix wrong constraints in inline assembly of Haswell DTRSM kernel
2019-02-16 18:05:40 +01:00
Martin Kroeker c26c0b77a7
Fix wrong constraints in inline assembly
for #2009
2019-02-15 15:08:16 +01:00
Martin Kroeker 1c6da2d03c
Merge pull request #2019 from martin-frbg/gcc9fixes
Fix unannounced modification of input operand 8 (lda4) in Haswell GEMVN microkernel
2019-02-15 15:02:54 +01:00
Martin Kroeker 4255a58cd2
Rename operands to put lda on the input/output constraint list 2019-02-15 10:10:04 +01:00
Martin Kroeker 46e415b140
Save and restore input argument 8 (lda4)
Fixes miscompilation with gcc9 -ftree-vectorize (related to issue #2009)
2019-02-14 22:43:18 +01:00
Bart Oldeman 69a97ca7b9 dgemv_kernel_4x4(Haswell): add missing clobbers for xmm0,xmm1,xmm2,xmm3
This fixes a crash in dblat2 when OpenBLAS is compiled using
-march=znver1 -ftree-vectorize -O2

See also:
https://github.com/easybuilders/easybuild-easyconfigs/issues/7180
2019-02-14 16:27:58 +00:00
Martin Kroeker ab1630f9fa
Fix declaration of arguments in inline assembly
Argument 0 is modified so should be input and output
2019-02-12 16:14:02 +01:00
Martin Kroeker b824fa70eb
Fix declaration of assembly arguments in SSYMV and DSYMV microkernels
Arguments 0 and 1 are both input and output
2019-02-12 16:00:18 +01:00
Martin Kroeker 91481a3e4e
Fix declaration of input arguments in inline assembly
Argument 0 is modified as it doubles as a counter
2019-02-12 15:51:43 +01:00
Martin Kroeker dc6ac9eab0
Fix declaration of input arguments in the x86_64 s/dGEMV_T and s/dGEMV_N kernels
Arguments 0 and 1 need to be tagged as both input and output
2019-02-12 15:33:48 +01:00
Martin Kroeker 32b0f1168e
Fix declaration of input arguments in the Sandybridge GER microkernels (#1967)
* Tag arguments 0 and 1 as both input and output
2019-01-18 08:11:39 +01:00
Martin Kroeker b495e54310
Fix declaration of input arguments in the x86_64 SCAL microkernels (#1966)
* Tag arguments 0 and 1 as both input and output (see #1964)
2019-01-18 08:11:07 +01:00
Martin Kroeker d5e6940253
Fix declaration of input arguments in the x86_64 microkernels for DOT and AXPY (#1965)
* Tag operands 0 and 1 as both input and output

For #1964 (basically a continuation of coding problems first seen in #1292)
2019-01-17 23:20:32 +01:00
Arjan van de Ven 795285c587 Fix thinko in skylake beta handling
casting ints is cheaper but it has a rounding, not memory casting effect, resulting in
an invalid outcome
2018-12-24 18:49:50 +00:00
Arjan van de Ven d321448a63 dgemm: use dgemm_ncopy_8_skylakex.c also for Haswell
The dgemm_ncopy_8_skylakex.c code is not avx512 specific and gives
a nice performance boost for medium sized matrices
2018-12-16 23:09:22 +00:00
Arjan van de Ven c43331ad0a dgemm: Use the skylakex beta function also for haswell
it's more efficient for certain tall/skinny matrices
2018-12-16 23:09:17 +00:00
Arjan van de Ven 69d206440a Make the skylakex/haswell sgemm code compile and run even with compilers without avx2 support 2018-12-16 00:19:41 +00:00
Arjan van de Ven 0586899a10 Use sgemm_ncopy_4_skylakex.c also for Haswell
sgemm_ncopy_4_skylakex.c uses SSE transpose operations where the
real perf win happens; this also works great for Haswell.

This gives double digit percentage gains on small and skinny matrices
2018-12-15 13:49:19 +00:00
Arjan van de Ven 00dc09ad19 Use the skylake sgemm beta code also for haswell
with a few small changes it's possible to use the skylake sgemm code
also for haswell, this gives a modest gain (10% range) for smallish
matrixes but does wonders for very skinny matrixes
2018-12-15 13:49:13 +00:00
Arjan van de Ven cdc668d82b Add a "sgemm direct" mode for small matrixes
OpenBLAS has a fancy algorithm for copying the input data while laying
it out in a more CPU friendly memory layout.

This is great for large matrixes; the cost of the copy is easily
amortized by the gains from the better memory layout.

But for small matrixes (on CPUs that can do efficient unaligned loads) this
copy can be a net loss.

This patch adds (for SKYLAKEX initially) a "sgemm direct" mode, that bypasses
the whole copy machinery for ALPHA=1/BETA=0/... standard arguments,
for small matrixes only.

What is small? For the non-threaded case this has been measured to be
in the M*N*K = 28 * 512 * 512 range, while in the threaded case it's
less, around M*N*K = 1 * 512 * 512
2018-12-13 13:47:31 +00:00
Martin Kroeker 701ea88347
Use p2align instead of align for OSX compatibility
fixes #1902
2018-12-03 13:06:43 +01:00
Andrew 19c4bdd8b3 Add return value so that freebsd system clang does not err out 2018-11-25 21:35:01 +01:00
Arjan van de Ven dcc5d6291e skylakex: Make the sgemm/dgemm beta code robust for a N=0 or M=0 case
in the threading code there are cases where N or M can become 0,
and the optimized beta code did not handle this well, leading
to a crash

during the audit for the crash a few edge conditions on the if statements
were found and fixed as well
2018-11-01 01:42:09 +00:00
Arjan van de Ven 55b244ca0d enable the SGEMM/SKX C based kernel
In QA the final bug was found so now the skylakex sgemm C based kernel can
be activated....
2018-10-12 09:30:35 +00:00
Arjan van de Ven d4bad73834 Add a C+intrinsics version of the SGEMM/skylakex kernel
for most sizes this is 1.2x to 1.4x faster than the current code
2018-10-10 01:49:22 +00:00
Arjan van de Ven 582c589727 dgemm/skylakex: replace discrete mul/add with fma
very minor gains since it's not super hot code, but general principles
2018-10-06 23:13:26 +00:00
Arjan van de Ven adbf6afa25 Add vector optimizations for ncopy as well for dgemm/skylakex 2018-10-06 21:18:12 +00:00
Arjan van de Ven 32bec8afbb add a skylakex optimized dgemm beta function 2018-10-06 16:36:26 +00:00
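
Commit 0caf1434c9 in the log above reports that `int mnk=M*N*K` overflows once M=N=K exceeds 1290. The following is a minimal C sketch of that arithmetic, not the OpenBLAS code itself. The helper names `mnk_wide` and `mnk_fits_int` are hypothetical, and the dimensions are assumed to be non-negative.

```c
#include <limits.h>

/* Hypothetical helper (not from OpenBLAS): compute M*N*K in 64 bits.
 * Each operand is widened before multiplying, so no intermediate
 * product is evaluated in 32-bit int arithmetic.
 * Assumes m, n and k are non-negative. */
unsigned long long mnk_wide(int m, int n, int k) {
    return (unsigned long long)m * (unsigned long long)n * (unsigned long long)k;
}

/* Hypothetical helper: returns 1 if M*N*K fits in a signed int, 0 otherwise. */
int mnk_fits_int(int m, int n, int k) {
    return mnk_wide(m, n, k) <= (unsigned long long)INT_MAX;
}
```

With a 32-bit `int`, `mnk_fits_int(1290, 1290, 1290)` holds because 1290^3 = 2146689000 is below `INT_MAX` (2147483647), while 1291^3 = 2151685171 is not. The sketch uses `unsigned long long`, the type named in the follow-up commit 1191db1a49, because `long` is only 32 bits wide on 64-bit Windows.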