OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Wangyang Guo	ca7682e3a3	Small Matrix: skylakex: sgemm nn: fix n6 conflicts with n4	2021-08-02 07:06:54 +00:00
Wangyang Guo	9967e61abb	Small Matrix: skylakex: sgemm nn: fix error when beta not zero	2021-08-02 07:06:54 +00:00
Wangyang Guo	a87736346f	Small Matrix: skylakex: sgemm nn: add n6 to improve performance	2021-08-02 07:06:54 +00:00
Wangyang Guo	4c9d9940fd	Small Matrix: skylakex: sgemm nn: reduce store 4 N at a time	2021-08-02 07:06:54 +00:00
Wangyang Guo	13b32f69b7	Small Matrix: skylakex: sgemm nn: reduce store 4 M at a time	2021-08-02 07:06:54 +00:00
Wangyang Guo	3d8c6d9607	Small Matrix: skylakex: sgemm nn: clean up unused code	2021-08-02 07:06:54 +00:00
Wangyang Guo	49b61a3f30	Small Matrix: skylakex: sgemm_nn: optimize for M <= 8	2021-08-02 07:06:54 +00:00
Wangyang Guo	f88470323b	Optimize M < 16 using AVX512 mask	2021-08-02 07:06:54 +00:00
Wangyang Guo	9186456a12	small matrix: SkylakeX: add SGEMM NN kernel	2021-08-02 07:06:54 +00:00
Martin Kroeker	5b4b385ecf	Temporarily disable the SkylakeX sgemv_t microkernel due to LAPACK testsuite failures	2021-07-14 20:50:14 +02:00
Ma, Yu	706a08d4a0	Optimized sgemv_t for small N based on AVX512	2021-06-08 15:08:28 -04:00
Martin Kroeker	5f677e782e	Merge pull request #3196 from guowangy/skylakex-gemm-batch-k GEMM: skylake: improve the performance when m is small	2021-05-22 19:25:28 +02:00
Martin Kroeker	02087a62e7	Merge pull request #3205 from intelmy/sgemv_n_opt optimize on sgemv_n for small n	2021-05-17 17:49:01 +02:00
Martin Kroeker	8b90e5f202	Drop redundant inclusion of complex.h	2021-05-14 15:06:44 +02:00
Martin Kroeker	c0ca63ea46	Fix missing conditionals for non-SKX kernels	2021-05-05 14:55:36 +02:00
pnp	3d4ccd2a13	fix for build error	2021-04-30 12:25:33 -04:00
pnp	c59652f0ce	optimize on sgemv_n for small n	2021-04-30 12:14:58 -04:00
Wangyang Guo	aa7b3dc3db	GEMM: skylake: improve the performance when m is small	2021-04-28 13:56:06 +00:00
Martin Kroeker	3d511f0e66	replace spurious avx512 requirement with fma check	2021-04-26 21:55:30 +02:00
Martin Kroeker	2dfb24730d	Use "old" compute(24) function with clang due to register limitations	2021-04-06 19:58:32 +02:00
Martin Kroeker	7b8f580941	Merge pull request #3156 from martin-frbg/omatcopy_d Move x86_64 DOMATCOPY_RT back to the C implementation	2021-03-19 15:22:48 +01:00
Martin Kroeker	0f5e86a0d9	Remove premature entry for DOMATCOPY_RT	2021-03-18 21:53:50 +01:00
Martin Kroeker	7b294a99fd	Move common.h back to the top of the file so that SKYLAKEX (from config.h) is defined in time	2021-03-18 21:28:19 +01:00
Martin Kroeker	0934568d9c	Move includes under the ifdef for compilers w/o intrinsics support	2021-03-12 12:42:05 +01:00
Martin Kroeker	a9f6f7ad39	Remove spurious AVX512 requirement and add AVX2/FMA3 guard	2021-03-06 14:35:49 +01:00
Martin Kroeker	292d1af1a0	Update omatcopy_rt.c	2021-02-24 09:34:14 +01:00
Martin Kroeker	325b398e3c	Update omatcopy_rt.c	2021-02-24 09:13:12 +01:00
Martin Kroeker	6f5667b4d4	Enable optimized S/D OMATCOPY_RT	2021-02-24 09:03:41 +01:00
Martin Kroeker	cceeee7806	Add optimized omatcopy_rt	2021-02-24 09:00:54 +01:00
Martin Kroeker	47691c031f	Use Haswell optimizations for Zen as well	2021-02-11 09:26:15 +01:00
Martin Kroeker	ce7ddd8921	Use Haswell optimizations for Zen as well	2021-02-11 09:25:36 +01:00
Martin Kroeker	950c047b49	Use Haswell optimizations for Zen as well	2021-02-11 09:24:51 +01:00
Martin Kroeker	46509953a9	Use Haswell optimizations for Zen as well	2021-02-11 09:24:16 +01:00
Martin Kroeker	db348dcff2	Enable optimized srot/drot kernels from Haswell	2021-02-11 09:23:05 +01:00
Martin Kroeker	69a5558203	Merge pull request #3059 from Guobing-Chen/BF16_gemm Initial code for Cooperlake BF16 GEMM kernel	2021-01-23 19:08:05 +01:00
Alex Henrie	202fc9e8ed	Fix uninitialized argument value in dasum_k	2021-01-14 19:40:31 -07:00
Chen, Guobing	b0beb0b1ca	Initial code for Cooperlake BF16 GEMM kernel	2021-01-11 02:15:21 +08:00
Martin Kroeker	114eb159a4	Disable FMA intrinsics in the srot kernel when the compiler is PGI/NVIDIA	2020-12-19 22:15:58 +01:00
Martin Kroeker	441c08c9ff	Merge pull request #3016 from xiegengxin/complex-asum Improve the performance of zasum and casum with AVX512 intrinsic	2020-12-04 22:07:16 +01:00
Gengxin Xie	0cb7a403b2	fix error declare function blas_level1_thread_with_return_value	2020-12-02 09:51:52 +08:00
Gengxin Xie	b766c1e9bb	Improve the performance of zasum and casum with AVX512 intrinsic	2020-12-01 16:49:26 +08:00
Martin Kroeker	f1bf040b25	Merge pull request #2988 from xiegengxin/smp-asum Improve the performance of dasum and sasum when SMP is defined	2020-11-22 12:24:13 +01:00
Gengxin Xie	d6e7e05bb3	Improve the performance of dasum and sasum when SMP is defined	2020-11-13 14:20:52 +08:00
Qiyu8	a87e537b8c	modify macro	2020-11-11 15:53:48 +08:00
Qiyu8	5bc0a7583f	only FMA3 and vector larger than 128 have positive effects.	2020-11-11 15:18:01 +08:00
Qiyu8	8c0b206d4c	Optimize the performance of rot by using universal intrinsics	2020-11-11 14:33:12 +08:00
Martin Kroeker	ff16329cb7	Merge pull request #2972 from xiegengxin/rot-intrinsic Improve the performance of rot by using AVX512 and AVX2 intrinsic	2020-11-08 22:43:00 +01:00
Gengxin Xie	725ffbf041	fix typo	2020-11-05 16:25:17 +08:00
Gengxin Xie	d9ba49165a	Improve the performance of rot by using AVX512 and AVX2 intrinsic	2020-11-05 15:12:36 +08:00
Chen, Guobing	a7b1f9b1bb	Implementation of BF16 based gemv 1. Add a new API -- sbgemv to support bfloat16 based gemv 2. Implement a generic kernel for sbgemv 3. Implement an avx512-bf16 based kernel for sbgemv Signed-off-by: Chen, Guobing <guobing.chen@intel.com>	2020-10-29 02:08:23 +08:00
İsmail Dönmez	4a1d00f589	Fix build with -Werror=return-type dgemm_tcopy_16_skylakex.c CNAME function should return an int, add a return 0 similar to other files.	2020-10-21 08:43:39 +02:00
Bart Oldeman	b073d759d0	x86_64: clobber all xmm registers after vzeroupper As observed using GCC 10 using -march=native -ftree-vectorize on Knights Landing, it is now smart enough to find clobbers inside non-inlined static functions. In particular, sgemv counted on a kernel to preserve the whole %ymm2 register (since it was not in the clobber list), but the top part was destroyed by vzeroupper. This caused many tests to fail. This patch makes sure all xmm (and ymm/zmm by extension) registers are listed as clobbered to avoid this happening, as most kernels already did correctly in fact.	2020-10-20 02:16:47 +00:00
Bart Oldeman	03e781b766	sgemm_direct_skylakex: fix `75eeb26` regression. The `#if defined(SKYLAKEX) || defined (COOPERLAKE)` from that commit was before #include "common.h" so caused the compiled function to be empty, returning garbage results for qualifying sgemm's on those architectures. Closes #2914	2020-10-18 19:58:07 +00:00
Martin Kroeker	c339c40c01	Silence a redefinition warning	2020-10-15 19:08:12 +02:00
Qiyu8	bfdf4b56da	Add double precision universal intrinsics for X86/ARM	2020-10-15 10:29:42 +08:00
Martin Kroeker	756802df61	Merge pull request #2890 from martin-frbg/s-d-sum Revert special handling of Windows xNRM2 and enable C+intrinsics kern…	2020-10-14 09:02:03 +02:00
Martin Kroeker	8d2df7d066	Revert special handling of Windows xNRM2 and enable C+intrinsics kernel for SSUM/DSUM	2020-10-13 00:14:29 +02:00
Martin Kroeker	08929430cd	Merge pull request #2886 from martin-frbg/issue_2767 Rename "HALF" precision functions (sh prefix) to "BFLOAT16" with "sb" prefix	2020-10-13 00:04:35 +02:00
Martin Kroeker	0c84ffe05f	Merge pull request #2881 from mattip/fninit add fninit to reset fpu registers before assembler routines	2020-10-12 23:50:41 +02:00
Matti Picus	403eb513a0	use emms instead, add WIN guards	2020-10-12 18:15:01 +03:00
Martin Kroeker	dc8a1afa63	Rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:53:50 +02:00
Martin Kroeker	fd94236042	Rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:42:07 +02:00
Martin Kroeker	68ce719fac	Rename shdot_microk_cooperlake.c to sbdot_microk_cooperlake.c	2020-10-11 23:41:13 +02:00
Martin Kroeker	d7dd9b396c	Rename shdot.c to sbdot.c	2020-10-11 23:40:43 +02:00
Martin Kroeker	7812486091	Use generic C for D/Z nrm2 kernels on Windows to work around fpu exception bug	2020-10-06 21:33:16 +02:00
Matti Picus	a5b164946c	add fninit to reset fpu registers before assembler routines	2020-10-05 22:13:25 +03:00
Qiyu8	14f7dad3b7	performance improved	2020-09-22 16:52:15 +08:00
Qiyu8	325b539c26	Optimize the performance of daxpy by using universal intrinsics	2020-09-22 10:38:35 +08:00
Martin Kroeker	91c84e1c01	Merge pull request #2796 from Guobing-Chen/BF16_dot_coversion_apis Add bfloat16 based dot and conversion with single/double	2020-09-14 15:00:19 +02:00
Martin Kroeker	e72430fe46	Merge pull request #2803 from xiegengxin/AVX2-asum Implementaion of dasum, sasum with AVX2 & AVX512 intrinsic	2020-09-06 18:32:15 +02:00
Chen, Guobing	deaeb6c5b8	Add bfloat16 based dot and conversion with single/double 1. Added bfloat16 based dot as new API: shdot 2. Implemented generic kernel and cooperlake-specific (AVX512-BF16) kernel for shdot 3. Added 4 conversion APIs for bfloat16 data type <=> single/double: shstobf16 shdtobf16 sbf16tos dbf16tod shstobf16 -- convert single float array to bfloat16 array shdtobf16 -- convert double float array to bfloat16 array sbf16tos -- convert bfloat16 array to single float array dbf16tod -- convert bfloat16 array to double float array 4. Implemented generic kernels for all 4 conversion APIs, and cooperlake-specific kernel for shstobf16 and shdtobf16 5. Update level1 thread facilitate functions and macros to support multi-threading for these new APIs 6. Fix Cooperlake platform detection/specify issue when under dynamic-arch building 7. Change the typedef of bfloat16 from unsigned short to more strict uint16_t Signed-off-by: Chen, Guobing <guobing.chen@intel.com>	2020-09-04 02:31:25 +08:00
Gengxin Xie	1b0f17eeed	align to 64, using SSE when input size is small	2020-09-03 14:25:54 +08:00
Gengxin Xie	448152cdd8	define __AVX2__ to ensure the haswell code compiled with avx2	2020-08-31 14:39:08 +08:00
Gengxin Xie	cb3c190a3a	Implementaion of dasum, sasum with AVX2 & AVX512 intrinsic	2020-08-31 11:44:08 +08:00
Martin Kroeker	b2053239fc	Fix missing dummy parameter (imag part of alpha) of zdot_thread_function	2020-08-23 15:08:16 +02:00
Martin Kroeker	9ee21a0a39	Merge pull request #2780 from Guobing-Chen/CPL_build_support Enable COOPERLAKE build target	2020-08-20 19:54:29 +02:00
Martin Kroeker	75eeb265d7	[WIP] Refactor the driver code for direct SGEMM (#2782) Move "direct SGEMM" functionality out of the SkylakeX SGEMM kernel and make it available (on x86_64 targets only for now) in DYNAMIC_ARCH builds * Add sgemm_direct targets in the kernel Makefile.L3 and CMakeLists.txt * Add direct_sgemm functions to the gotoblas struct in common_param.h * Move sgemm_direct_performant helper to separate file * Update gemm.c to macros for sgemm_direct to support dynamic_arch naming via common_s.h * (Conditionally) add sgemm_direct functions in setparam-ref.c	2020-08-19 14:51:09 +02:00
Chen, Guobing	e740c4873d	Enable COOPERLAKE build target Enable new build target platform -- COOPERLAKE. This target platform supports all the SKYLAKEX supported ISAs + avx512bf16. So all the SKYLAKEX specific kernels/drivers and related code are now extended to be also active on COOPERLAKE. Besides, new BF16 related kernels are active under this target.	2020-08-13 06:18:00 +08:00
Martin Kroeker	81dcfdcf39	Multiply by 2 instead of left-shifting a potentially negative number fixes GCC ubsan warning in the BLAS tests	2020-08-02 18:29:56 +02:00
Martin Kroeker	0ef4b3f1f2	Multiply instead of doing a left shift of a potentially negative number fixes GCC ubsan report in the BLAS tests	2020-08-02 18:27:40 +02:00
Martin Kroeker	aa53a8a5cb	Multiply by two instead of left-shifting one place fixes GCC ubsan report of "left shift of negative value -2" in the BLAS tests	2020-08-02 18:25:09 +02:00
Martin Kroeker	aa3a1e7d8c	Multiply by two rather than left shift by one place fixes GCC ubsan report of "left shift of negative value -2" in the BLAS tests	2020-08-02 18:22:31 +02:00
Martin Kroeker	e30ad0e521	Strip UTF8 byte order marker from source	2020-06-26 09:00:43 +02:00
Martin Kroeker	93592d1260	Merge pull request #2675 from wjc404/develop AVX512 DGEMM TCOPY_16 Function	2020-06-23 09:29:02 +02:00
wjc404	086d87a302	AVX512 dgemm tcopy_16 function	2020-06-20 00:07:43 +08:00
Martin Kroeker	c3574ffe53	Merge pull request #2646 from wjc404/develop Optimize AVX512 parallel DGEMM performance	2020-06-07 13:18:22 +02:00
wjc404	0e3ac4a06b	Add files via upload	2020-06-06 14:56:57 +08:00
Martin Kroeker	2271c3506b	Work around excessive LAPACK test failures on Skylake-X Something in the plain C parts of x86_64 cscal.c and zscal.c appears to be miscompiled by both gfortran9 and ifort when compiling for skylakex-avx512, even when the optimized Haswell microkernel is not in use.	2020-05-09 23:49:18 +02:00
Martin Kroeker	90dba9f716	Duplicate earlier Clang 9.0.0 workaround for corresponding Apple Clang version As discussed on the original PR #2329, the "Apple Clang 11.0.3" that appears to be based the same LLVM release produces the same miscompilation of this file.	2020-05-05 10:44:50 +02:00
Martin Kroeker	5b0093b5fe	Convert aligned moves to unaligned should have no performance impact on reasonably modern cpus and fixes occasional crashes in actual user code.	2020-04-13 14:58:52 +02:00
Martin Kroeker	567d2760e6	Merge pull request #2520 from wjc404/develop Fix avx512 sgemm performance bug when ldc is a multiple of 1024	2020-03-30 20:15:59 +02:00
wjc404	b8307768e2	Add files via upload	2020-03-21 05:42:10 +08:00
Martin Kroeker	af8a619e1f	Merge pull request #2517 from wjc404/develop Temporary fix for SKX STRSM	2020-03-17 10:12:53 +01:00
wjc404	62b9608986	Update KERNEL.SKYLAKEX	2020-03-17 12:52:55 +08:00
Martin Kroeker	a1b181cea2	Merge pull request #2516 from wjc404/develop AVX2 STRSM kernels	2020-03-16 21:58:34 +01:00
wjc404	cdc0e9011e	Update KERNEL.ZEN	2020-03-16 16:39:37 +00:00
wjc404	fa049d49c2	AVX2 STRSM kernel	2020-03-17 00:34:08 +08:00
Martin Kroeker	ea8eec5d17	Merge pull request #2422 from wjc404/develop Adjust SkylakeX GEMM3M parameters, add an AVX512 STRMM kernel and fix performance bugs in AVX2 s/c/z GEMM	2020-02-29 19:07:35 +01:00
wjc404	dd22eb7621	Update cgemm_kernel_8x2_haswell.c	2020-02-27 22:26:15 +08:00
wjc404	2352331e60	Update zgemm_kernel_4x2_haswell.c	2020-02-27 22:25:19 +08:00
wjc404	1b980001dd	Update zgemm_kernel_4x2_haswell.c	2020-02-26 18:38:12 +08:00
wjc404	2515e1152f	Update cgemm_kernel_8x2_haswell.c	2020-02-26 18:36:54 +08:00
wjc404	903854c168	Add files via upload	2020-02-22 23:40:02 +08:00
wjc404	a2ff577a30	Update KERNEL.ZEN	2020-02-22 23:39:43 +08:00
wjc404	97a32cb0a5	Update KERNEL.HASWELL	2020-02-22 23:39:20 +08:00
Martin Liska	aeea14ee40	Come up with LOAD_AND_COMPARE_TO_MXX macro in iamax_sse.S.	2020-02-17 09:01:53 +01:00
Martin Liska	18bcc36a69	Fix implementation of iamax_sse.S as reported in #2116. The was a typo in iamax_sse.S where one of the comparison was cmpeqps instead of cmpeqss. That misdetected index for sequences where the minimum value was 0.	2020-02-17 09:01:53 +01:00
wjc404	f566787e6e	Update KERNEL.SKYLAKEX	2020-02-16 22:58:44 +08:00
wjc404	e3368cbf18	AVX512 STRMM kernel	2020-02-16 22:58:00 +08:00
Bart Oldeman	7ea5e07d1c	Fix inline asm in dscal: mark x, x1 as clobbered. Fixes #2408 The leaq instructions in dscal_kernel_inc_8 modify x and x1 so they must be declared as input/output constraints, otherwise the compiler may assume the corresponding registers are not modified.	2020-02-12 14:11:44 +00:00
wjc404	3447d04eaf	Update dgemm_kernel_16x2_skylakex.c	2020-02-06 02:14:10 +00:00
wjc404	8b5cdcc64c	Update sgemm_kernel_8x4_haswell.c	2020-02-06 01:47:46 +00:00
wjc404	4e00d96a78	Update dgemm_kernel_16x2_skylakex.c	2020-02-06 01:46:36 +00:00
wjc404	096da2f51a	Update dgemm_kernel_16x2_skylakex.c	2020-02-05 13:36:57 +08:00
wjc404	081b188529	Update KERNEL.SKYLAKEX	2020-02-03 21:38:08 +08:00
wjc404	8019e70211	AVX512 16x2 DGEMM kernel	2020-02-03 21:32:56 +08:00
wjc404	e5dcdeb550	Update sgemm_direct_skylakex.c	2020-01-13 16:59:23 +08:00
wjc404	952cc2ba38	Update sgemm_kernel_16x4_skylakex_2.c	2020-01-13 16:58:54 +08:00
wjc404	feaafbedd3	make skylakex sgemm code more friendly for readers BTW some kernels were adjusted to improve performance	2020-01-13 16:28:41 +08:00
wjc404	3a100b2797	Update KERNEL.SKYLAKEX	2020-01-09 13:48:41 +08:00
wjc404	bd4c032f52	Update sgemm_kernel_8x4_haswell.c	2020-01-07 11:22:46 +08:00
wjc404	9dc9b7b95e	Update sgemm_kernel_8x4_haswell.c	2020-01-06 20:11:36 +08:00
wjc404	92b10212de	optimize AVX2 SGEMM	2020-01-06 12:11:21 +08:00
wjc404	b73bf01378	optimize AVX2 SGEMM	2020-01-06 12:09:14 +08:00
wjc404	eb3c9f1db9	optimize AVX2 SGEMM	2020-01-06 12:07:02 +08:00
wjc404	a0f0a802fc	Update zgemm3m_kernel_4x4_haswell.c	2019-12-30 17:33:42 +08:00
wjc404	700fe5b5ee	Add files via upload	2019-12-30 17:18:59 +08:00
wjc404	f60840c420	Update KERNEL.ZEN	2019-12-30 16:04:23 +08:00
wjc404	109e18cd96	Update KERNEL.HASWELL	2019-12-30 16:03:24 +08:00
wjc404	ae1579be13	Create zgemm3m_kernel_4x4_haswell.c	2019-12-30 16:02:51 +08:00
wjc404	cd765f094b	Update cgemm3m_kernel_8x4_haswell.c	2019-12-27 18:23:29 +08:00
wjc404	3a66c8cac1	Update KERNEL.ZEN	2019-12-27 18:04:08 +08:00
wjc404	ed9af2f7da	Update KERNEL.HASWELL	2019-12-27 18:01:38 +08:00
wjc404	5fd1edead9	Create cgemm3m_kernel_8x4_haswell.c	2019-12-27 18:00:55 +08:00
wjc404	eeecd623d8	Update cgemm_kernel_8x2_haswell.c	2019-12-24 00:40:16 +08:00
wjc404	2cd9306bb5	Update KERNEL.ZEN	2019-12-23 23:42:30 +08:00
wjc404	c418c81224	Update KERNEL.HASWELL	2019-12-23 23:41:44 +08:00
wjc404	025741f16a	Fast Haswell CGEMM kernel	2019-12-23 23:40:03 +08:00
wjc404	f41d52665d	Fast Haswell ZGEMM kernel	2019-12-21 14:37:06 +08:00
wjc404	d573d24de7	Fast Haswell ZGEMM kernel	2019-12-21 14:35:15 +08:00
Isuru Fernando	b863b32ac5	Workaround an ICE in clang 9.0.0 This bug is not there in 8.x nor in the 9.0 daily snapshot.	2019-12-01 12:59:46 -06:00
wjc404	934e601e93	Update dgemm_kernel_4x8_skylakex_2.c	2019-11-28 19:56:35 +08:00
wjc404	eb1e9c8c92	some optimizations	2019-11-26 14:12:20 +08:00
Wang, Long	bfb5fbdb4d	revised fix windows compatible for #2313 Signed-off-by: Wang, Long <long1.wang@intel.com>	2019-11-21 10:22:58 +08:00
Wang, Long	1191db1a49	For the sake of windows compatible, used "unsigned long long" to ensure 64-bit length Signed-off-by: Wang, Long <long1.wang@intel.com>	2019-11-20 21:30:47 +08:00
Wang, Long	0caf1434c9	Fix the integer overflow issue for large matrix size For large matrix, e.g. M=N=K, and M>1290, int mnk=M*N*K will overflow. This will lead to wrong branching to single-threading. The performance is downgraded significantly. Signed-off-by: Wang, Long <long1.wang@intel.com>	2019-11-20 14:11:17 +08:00
wjc404	819e852ae7	AVX512 CGEMM & ZGEMM kernels 96-99% 1-thread performance of MKL2018	2019-11-11 20:04:52 +08:00
wjc404	836c414e22	optimizations of software prefetching	2019-11-05 13:36:56 +08:00
wjc404	430c11e135	Add files via upload	2019-11-04 20:10:12 +08:00
wjc404	fbacd2605d	optimizations via software prefetches	2019-11-04 19:37:19 +08:00


766 Commits