OpenBLAS

Commit Graph

Author	SHA1	Message	Date
İsmail Dönmez	4a1d00f589	Fix build with -Werror=return-type dgemm_tcopy_16_skylakex.c CNAME function should return an int, add a return 0 similar to other files.	2020-10-21 08:43:39 +02:00
Bart Oldeman	b073d759d0	x86_64: clobber all xmm registers after vzeroupper As observed using GCC 10 using -march=native -ftree-vectorize on Knights Landing, it is now smart enough to find clobbers inside non-inlined static functions. In particular, sgemv counted on a kernel to preserve the whole %ymm2 register (since it was not in the clobber list), but the top part was destroyed by vzeroupper. This caused many tests to fail. This patch makes sure all xmm (and ymm/zmm by extension) registers are listed as clobbered to avoid this happening, as most kernels already did correctly in fact.	2020-10-20 02:16:47 +00:00
Bart Oldeman	03e781b766	sgemm_direct_skylakex: fix `75eeb26` regression. The `#if defined(SKYLAKEX) \|\| defined (COOPERLAKE)` from that commit was before #include "common.h" so caused the compiled function to be empty, returning garbage results for qualifying sgemm's on those architectures. Closes #2914	2020-10-18 19:58:07 +00:00
Martin Kroeker	c339c40c01	Silence a redefinition warning	2020-10-15 19:08:12 +02:00
Qiyu8	bfdf4b56da	Add double precision universal intrinsics for X86/ARM	2020-10-15 10:29:42 +08:00
Martin Kroeker	756802df61	Merge pull request #2890 from martin-frbg/s-d-sum Revert special handling of Windows xNRM2 and enable C+intrinsics kern…	2020-10-14 09:02:03 +02:00
Martin Kroeker	8d2df7d066	Revert special handling of Windows xNRM2 and enable C+intrinsics kernel for SSUM/DSUM	2020-10-13 00:14:29 +02:00
Martin Kroeker	08929430cd	Merge pull request #2886 from martin-frbg/issue_2767 Rename "HALF" precision functions (sh prefix) to "BFLOAT16" with "sb" prefix	2020-10-13 00:04:35 +02:00
Martin Kroeker	0c84ffe05f	Merge pull request #2881 from mattip/fninit add fninit to reset fpu registers before assembler routines	2020-10-12 23:50:41 +02:00
Matti Picus	403eb513a0	use emms instead, add WIN guards	2020-10-12 18:15:01 +03:00
Martin Kroeker	dc8a1afa63	Rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:53:50 +02:00
Martin Kroeker	fd94236042	Rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:42:07 +02:00
Martin Kroeker	68ce719fac	Rename shdot_microk_cooperlake.c to sbdot_microk_cooperlake.c	2020-10-11 23:41:13 +02:00
Martin Kroeker	d7dd9b396c	Rename shdot.c to sbdot.c	2020-10-11 23:40:43 +02:00
Martin Kroeker	7812486091	Use generic C for D/Z nrm2 kernels on Windows to work around fpu exception bug	2020-10-06 21:33:16 +02:00
Matti Picus	a5b164946c	add fninit to reset fpu registers before assembler routines	2020-10-05 22:13:25 +03:00
Qiyu8	14f7dad3b7	performance improved	2020-09-22 16:52:15 +08:00
Qiyu8	325b539c26	Optimize the performance of daxpy by using universal intrinsics	2020-09-22 10:38:35 +08:00
Martin Kroeker	91c84e1c01	Merge pull request #2796 from Guobing-Chen/BF16_dot_coversion_apis Add bfloat16 based dot and conversion with single/double	2020-09-14 15:00:19 +02:00
Martin Kroeker	e72430fe46	Merge pull request #2803 from xiegengxin/AVX2-asum Implementaion of dasum, sasum with AVX2 & AVX512 intrinsic	2020-09-06 18:32:15 +02:00
Chen, Guobing	deaeb6c5b8	Add bfloat16 based dot and conversion with single/double 1. Added bfloat16 based dot as new API: shdot 2. Implemented generic kernel and cooperlake-specific (AVX512-BF16) kernel for shdot 3. Added 4 conversion APIs for bfloat16 data type <=> single/double: shstobf16 shdtobf16 sbf16tos dbf16tod shstobf16 -- convert single float array to bfloat16 array shdtobf16 -- convert double float array to bfloat16 array sbf16tos -- convert bfloat16 array to single float array dbf16tod -- convert bfloat16 array to double float array 4. Implemented generic kernels for all 4 conversion APIs, and cooperlake-specific kernel for shstobf16 and shdtobf16 5. Update level1 thread facilitate functions and macros to support multi-threading for these new APIs 6. Fix Cooperlake platform detection/specify issue when under dynamic-arch building 7. Change the typedef of bfloat16 from unsigned short to more strict uint16_t Signed-off-by: Chen, Guobing <guobing.chen@intel.com>	2020-09-04 02:31:25 +08:00
Gengxin Xie	1b0f17eeed	align to 64, using SSE when input size is small	2020-09-03 14:25:54 +08:00
Gengxin Xie	448152cdd8	define __AVX2__ to ensure the haswell code compiled with avx2	2020-08-31 14:39:08 +08:00
Gengxin Xie	cb3c190a3a	Implementaion of dasum, sasum with AVX2 & AVX512 intrinsic	2020-08-31 11:44:08 +08:00
Martin Kroeker	b2053239fc	Fix mssing dummy parameter (imag part of alpha) of zdot_thread_function	2020-08-23 15:08:16 +02:00
Martin Kroeker	9ee21a0a39	Merge pull request #2780 from Guobing-Chen/CPL_build_support Enable COOPERLAKE build target	2020-08-20 19:54:29 +02:00
Martin Kroeker	75eeb265d7	[WIP] Refactor the driver code for direct SGEMM (#2782 ) Move "direct SGEMM" functionality out of the SkylakeX SGEMM kernel and make it available (on x86_64 targets only for now) in DYNAMIC_ARCH builds * Add sgemm_direct targets in the kernel Makefile.L3 and CMakeLists.txt * Add direct_sgemm functions to the gotoblas struct in common_param.h * Move sgemm_direct_performant helper to separate file * Update gemm.c to macros for sgemm_direct to support dynamic_arch naming via common_s,h * (Conditionally) add sgemm_direct functions in setparam-ref.c	2020-08-19 14:51:09 +02:00
Chen, Guobing	e740c4873d	Enable COOPERLAKE build target Enable new build target platform -- COOPERLAKE. This target platform supports all the SKYLAKEX supported ISAs + avx512bf16. So all the SKYLAKEX specific kernels/drivers and related code are now extended to be also active on COOPERLAKE. Besides, new BF16 related kernels are active under this target.	2020-08-13 06:18:00 +08:00
Martin Kroeker	81dcfdcf39	Multiply by 2 instead of left-shifting a potentially negative number fixes GCC ubsan warning in the BLAS tests	2020-08-02 18:29:56 +02:00
Martin Kroeker	0ef4b3f1f2	Multiply instead of doing a left shift of a potentially negative number fixes GCC ubsan report in the BLAS tests	2020-08-02 18:27:40 +02:00
Martin Kroeker	aa53a8a5cb	Multiply by two instead of left-shifting one place fixes GCC ubsan report of "left shift of negative value -2" in the BLAS tests	2020-08-02 18:25:09 +02:00
Martin Kroeker	aa3a1e7d8c	Multiply by two rather than left shift by one place fixes GCC ubsan report of "left shift of negative value -2" in the BLAS tests	2020-08-02 18:22:31 +02:00
Martin Kroeker	e30ad0e521	Strip UTF8 byte order marker from source	2020-06-26 09:00:43 +02:00
Martin Kroeker	93592d1260	Merge pull request #2675 from wjc404/develop AVX512 DGEMM TCOPY_16 Function	2020-06-23 09:29:02 +02:00
wjc404	086d87a302	AVX512 dgemm tcopy_16 function	2020-06-20 00:07:43 +08:00
Martin Kroeker	c3574ffe53	Merge pull request #2646 from wjc404/develop Optimize AVX512 parallel DGEMM performance	2020-06-07 13:18:22 +02:00
wjc404	0e3ac4a06b	Add files via upload	2020-06-06 14:56:57 +08:00
Martin Kroeker	2271c3506b	Work around excessive LAPACK test failures on Skylake-X Something in the plain C parts of x86_64 cscal.c and zscal.c appears to be miscompiled by both gfortran9 and ifort when compiling for skylakex-avx512, even when the optimized Haswell microkernel is not in use.	2020-05-09 23:49:18 +02:00
Martin Kroeker	90dba9f716	Duplicate earlier Clang 9.0.0 workaround for corresponding Apple Clang version As discussed on the original PR #2329, the "Apple Clang 11.0.3" that appears to be based the same LLVM release produces the same miscompilation of this file.	2020-05-05 10:44:50 +02:00
Martin Kroeker	5b0093b5fe	Convert aligned moves to unaligned should have no performance impact on reasonably modern cpus and fixes occasional crashes in actual user code.	2020-04-13 14:58:52 +02:00
Martin Kroeker	567d2760e6	Merge pull request #2520 from wjc404/develop Fix avx512 sgemm performance bug when ldc is a multiple of 1024	2020-03-30 20:15:59 +02:00
wjc404	b8307768e2	Add files via upload	2020-03-21 05:42:10 +08:00
Martin Kroeker	af8a619e1f	Merge pull request #2517 from wjc404/develop Temporary fix for SKX STRSM	2020-03-17 10:12:53 +01:00
wjc404	62b9608986	Update KERNEL.SKYLAKEX	2020-03-17 12:52:55 +08:00
Martin Kroeker	a1b181cea2	Merge pull request #2516 from wjc404/develop AVX2 STRSM kernels	2020-03-16 21:58:34 +01:00
wjc404	cdc0e9011e	Update KERNEL.ZEN	2020-03-16 16:39:37 +00:00
wjc404	fa049d49c2	AVX2 STRSM kernel	2020-03-17 00:34:08 +08:00
Martin Kroeker	ea8eec5d17	Merge pull request #2422 from wjc404/develop Adjust SkylakeX GEMM3M parameters, add an AVX512 STRMM kernel and fix performance bugs in AVX2 s/c/z GEMM	2020-02-29 19:07:35 +01:00
wjc404	dd22eb7621	Update cgemm_kernel_8x2_haswell.c	2020-02-27 22:26:15 +08:00
wjc404	2352331e60	Update zgemm_kernel_4x2_haswell.c	2020-02-27 22:25:19 +08:00
wjc404	1b980001dd	Update zgemm_kernel_4x2_haswell.c	2020-02-26 18:38:12 +08:00
wjc404	2515e1152f	Update cgemm_kernel_8x2_haswell.c	2020-02-26 18:36:54 +08:00
wjc404	903854c168	Add files via upload	2020-02-22 23:40:02 +08:00
wjc404	a2ff577a30	Update KERNEL.ZEN	2020-02-22 23:39:43 +08:00
wjc404	97a32cb0a5	Update KERNEL.HASWELL	2020-02-22 23:39:20 +08:00
Martin Liska	aeea14ee40	Come up with LOAD_AND_COMPARE_TO_MXX macro in iamax_sse.S.	2020-02-17 09:01:53 +01:00
Martin Liska	18bcc36a69	Fix implementation of iamax_sse.S as reported in #2116 . The was a typo in iamax_sse.S where one of the comparison was cmpeqps instead of cmpeqss. That misdetected index for sequences where the minimum value was 0.	2020-02-17 09:01:53 +01:00
wjc404	f566787e6e	Update KERNEL.SKYLAKEX	2020-02-16 22:58:44 +08:00
wjc404	e3368cbf18	AVX512 STRMM kernel	2020-02-16 22:58:00 +08:00
Bart Oldeman	7ea5e07d1c	Fix inline asm in dscal: mark x, x1 as clobbered. Fixes #2408 The leaq instructions in dscal_kernel_inc_8 modify x and x1 so they must be declared as input/output constraints, otherwise the compiler may assume the corresponding registers are not modified.	2020-02-12 14:11:44 +00:00
wjc404	3447d04eaf	Update dgemm_kernel_16x2_skylakex.c	2020-02-06 02:14:10 +00:00
wjc404	8b5cdcc64c	Update sgemm_kernel_8x4_haswell.c	2020-02-06 01:47:46 +00:00
wjc404	4e00d96a78	Update dgemm_kernel_16x2_skylakex.c	2020-02-06 01:46:36 +00:00
wjc404	096da2f51a	Update dgemm_kernel_16x2_skylakex.c	2020-02-05 13:36:57 +08:00
wjc404	081b188529	Update KERNEL.SKYLAKEX	2020-02-03 21:38:08 +08:00
wjc404	8019e70211	AVX512 16x2 DGEMM kernel	2020-02-03 21:32:56 +08:00
wjc404	e5dcdeb550	Update sgemm_direct_skylakex.c	2020-01-13 16:59:23 +08:00
wjc404	952cc2ba38	Update sgemm_kernel_16x4_skylakex_2.c	2020-01-13 16:58:54 +08:00
wjc404	feaafbedd3	make skylakex sgemm code more friendly for readers BTW some kernels were adjusted to improve performance	2020-01-13 16:28:41 +08:00
wjc404	3a100b2797	Update KERNEL.SKYLAKEX	2020-01-09 13:48:41 +08:00
wjc404	bd4c032f52	Update sgemm_kernel_8x4_haswell.c	2020-01-07 11:22:46 +08:00
wjc404	9dc9b7b95e	Update sgemm_kernel_8x4_haswell.c	2020-01-06 20:11:36 +08:00
wjc404	92b10212de	optimize AVX2 SGEMM	2020-01-06 12:11:21 +08:00
wjc404	b73bf01378	optimize AVX2 SGEMM	2020-01-06 12:09:14 +08:00
wjc404	eb3c9f1db9	optimize AVX2 SGEMM	2020-01-06 12:07:02 +08:00
wjc404	a0f0a802fc	Update zgemm3m_kernel_4x4_haswell.c	2019-12-30 17:33:42 +08:00
wjc404	700fe5b5ee	Add files via upload	2019-12-30 17:18:59 +08:00
wjc404	f60840c420	Update KERNEL.ZEN	2019-12-30 16:04:23 +08:00
wjc404	109e18cd96	Update KERNEL.HASWELL	2019-12-30 16:03:24 +08:00
wjc404	ae1579be13	Create zgemm3m_kernel_4x4_haswell.c	2019-12-30 16:02:51 +08:00
wjc404	cd765f094b	Update cgemm3m_kernel_8x4_haswell.c	2019-12-27 18:23:29 +08:00
wjc404	3a66c8cac1	Update KERNEL.ZEN	2019-12-27 18:04:08 +08:00
wjc404	ed9af2f7da	Update KERNEL.HASWELL	2019-12-27 18:01:38 +08:00
wjc404	5fd1edead9	Create cgemm3m_kernel_8x4_haswell.c	2019-12-27 18:00:55 +08:00
wjc404	eeecd623d8	Update cgemm_kernel_8x2_haswell.c	2019-12-24 00:40:16 +08:00
wjc404	2cd9306bb5	Update KERNEL.ZEN	2019-12-23 23:42:30 +08:00
wjc404	c418c81224	Update KERNEL.HASWELL	2019-12-23 23:41:44 +08:00
wjc404	025741f16a	Fast Haswell CGEMM kernel	2019-12-23 23:40:03 +08:00
wjc404	f41d52665d	Fast Haswell ZGEMM kernel	2019-12-21 14:37:06 +08:00
wjc404	d573d24de7	Fast Haswell ZGEMM kernel	2019-12-21 14:35:15 +08:00
Isuru Fernando	b863b32ac5	Workaround an ICE in clang 9.0.0 This bug is not there in 8.x nor in the 9.0 daily snapshot.	2019-12-01 12:59:46 -06:00
wjc404	934e601e93	Update dgemm_kernel_4x8_skylakex_2.c	2019-11-28 19:56:35 +08:00
wjc404	eb1e9c8c92	some optimizations	2019-11-26 14:12:20 +08:00
Wang, Long	bfb5fbdb4d	revised fix windows compatible for #2313 Signed-off-by: Wang, Long <long1.wang@intel.com>	2019-11-21 10:22:58 +08:00
Wang, Long	1191db1a49	For the sake of windows compatible, used "unsigned long long" to ensure 64-bit length Signed-off-by: Wang, Long <long1.wang@intel.com>	2019-11-20 21:30:47 +08:00
Wang, Long	0caf1434c9	Fix the integer overflow issue for large matrix size For large matrix, e.g. M=N=K, and M>1290, int mnk=MNK will overflow. This will lead to wrong branching to single-threading. The performance is downgraded significantly. Signed-off-by: Wang, Long <long1.wang@intel.com>	2019-11-20 14:11:17 +08:00
wjc404	819e852ae7	AVX512 CGEMM & ZGEMM kernels 96-99% 1-thread performance of MKL2018	2019-11-11 20:04:52 +08:00
wjc404	836c414e22	optimizations of software prefetching	2019-11-05 13:36:56 +08:00
wjc404	430c11e135	Add files via upload	2019-11-04 20:10:12 +08:00
wjc404	fbacd2605d	optimizations via software prefetches	2019-11-04 19:37:19 +08:00
wjc404	1df9a2013d	new sgemm kernel for skylakex	2019-11-02 00:00:48 +08:00
wjc404	6ff013bae0	native support for icopy_4 90% MKL 1-thread performance.	2019-10-19 03:54:44 +08:00
wjc404	0d669e04bb	Update dgemm_kernel_8x8_skylakex.c	2019-10-18 15:00:17 +08:00
wjc404	17cdd9f9e1	some correction	2019-10-18 14:58:07 +08:00
wjc404	6bcb06fcb1	make further changes to icopy_8 easier	2019-10-18 10:47:31 +08:00
wjc404	b7315f8401	Add files via upload	2019-10-16 19:23:36 +08:00
wjc404	9b19e9e1b0	Update dgemm_kernel_8x8_skylakex.c	2019-10-16 10:14:51 +08:00
wjc404	6bd67ddbab	Update dgemm_kernel_8x8_skylakex.c	2019-10-16 03:20:08 +08:00
wjc404	844629af57	Add files via upload	2019-10-16 02:00:34 +08:00
Martin Kroeker	11c59acfb1	Keep both PGI/SUN and default code paths to avoid breaking Clang/WIndows	2019-08-28 18:07:44 +02:00
Martin Kroeker	3a55dca2dc	Make x86_64 zdot compile with PGI and Sun C again broken by #2222 as CREAL,CIMAG do not expand to a valid lvalue with these compilers	2019-08-28 11:35:31 +02:00
Martin Kroeker	9ef96b32a6	Add multithreading support to the x86_64 zdot kernel (#2222 ) * Add multithreading support copied from the ThunderX2T99 kernel. For #2221	2019-08-15 22:09:12 +02:00
Martin Kroeker	dccff2e785	Merge pull request #2206 from martin-frbg/zen-dtrmm Replace vpermpd with vpermilpd in the Haswell DTRMM kernel	2019-08-09 07:55:20 +02:00
Martin Kroeker	5c3458a6e7	Merge pull request #2199 from martin-frbg/zen-dtrsm Replace most vpermpd calls in the Haswell DTRSM_RN kernel	2019-08-09 07:55:02 +02:00
Martin Kroeker	acf6002ab2	Replace most vpermpd calls in the Haswell DTRSM_RN kernel	2019-08-03 12:40:13 +02:00
Martin Kroeker	2dfb804cb9	Replace vpermpd with vpermilpd in the Haswell DTRMM kernel to improve performance on AMD Zen (#2180) applying wjc404's improvement of the DGEMM kernel from #2186	2019-07-28 23:17:28 +02:00
Martin Kroeker	4c153ec9da	Merge pull request #2196 from wjc404/develop Add vbroadcastsd kernel to dgemm_kernel_4x8_haswell.S	2019-07-28 23:11:40 +02:00
wjc404	7eecd8e39c	Add files via upload	2019-07-28 07:39:09 +08:00
Martin Kroeker	7b0b7c11d2	Merge pull request #2190 from martin-frbg/zdot-zen Replace vpermpd with vpermilpd in the Haswell/Zen zdot microkernel	2019-07-23 16:15:08 +02:00
Martin Kroeker	28e96458e5	Replace vpermpd with vpermilpd to improve performance on Zen/Zen2 (as demonstrated by wjc404 in #2180)	2019-07-22 08:28:16 +02:00
wjc404	95fb98f556	Update dgemm_kernel_4x8_haswell.S	2019-07-21 01:10:32 +08:00
wjc404	4801c6d36b	Update dgemm_kernel_4x8_haswell.S	2019-07-21 00:47:45 +08:00
wjc404	9440fa607d	Add files via upload	2019-07-20 22:08:22 +08:00
wjc404	94db259e5b	Add files via upload	2019-07-20 22:04:41 +08:00
wjc404	f49f8047ac	Add files via upload	2019-07-20 14:33:37 +08:00
wjc404	825777faab	Update dgemm_kernel_4x8_haswell.S	2019-07-19 23:58:24 +08:00
wjc404	9c89757562	Add files via upload	2019-07-19 23:47:58 +08:00
wjc404	9b04baeaee	Update dgemm_kernel_4x8_haswell.S	2019-07-17 23:50:03 +08:00
wjc404	8a074b3965	Update dgemm_kernel_4x8_haswell.S	2019-07-17 23:47:30 +08:00
wjc404	211ab03b14	Update dgemm_kernel_4x8_haswell.S	2019-07-17 22:39:15 +08:00
wjc404	1733f927e6	Update dgemm_kernel_4x8_haswell.S	2019-07-17 21:27:41 +08:00
wjc404	182b06d6ad	Update dgemm_kernel_4x8_haswell.S	2019-07-17 17:02:35 +08:00
wjc404	7a9050d681	Update dgemm_kernel_4x8_haswell.S	2019-07-17 00:55:06 +08:00
wjc404	0ba29fd262	Update dgemm_kernel_4x8_haswell.S for zen2 replaced a bunch of vpermpd instructions with vpermilpd and vperm2f128	2019-07-17 00:46:51 +08:00
Martin Kroeker	9ea30f3788	Replace ISMIN and ISAMIN kernels on all x86_64 platforms (#2125 ) * Mark iamax_sse.S as unsuitable for MIN due to issue #2116 * Use iamax.S rather than iamax_sse.S for ISMIN/ISAMIN on all x86_64 as workaround for #2116	2019-05-09 14:42:36 +02:00
Martin Kroeker	b1561ecc68	Disable DGEMMINCOPY as well for now #1955	2019-05-05 15:52:01 +02:00
Martin Kroeker	7ed8431527	Disable the SkyLakeX DGEMMITCOPY kernel as well as a stopgap measure for https://github.com/numpy/numpy/issues/13401 as mentioned in #1955	2019-05-04 22:54:41 +02:00
Martin Kroeker	c04a729081	Add ?sum definitions for generic kernel	2019-03-31 13:55:49 +02:00
Martin Kroeker	9d717cb5ee	Add x86_64 implementation of ?sum as trivial copy of ?asum with the fabs calls removed	2019-03-30 22:27:04 +01:00
Martin Kroeker	32c7063cb0	Merge pull request #2061 from martin-frbg/martin-frbg-patch-1 Disable the AVX512 DGEMM kernel (again)	2019-03-30 21:21:38 +01:00
Martin Kroeker	e608d4f7fe	Disable the AVX512 DGEMM kernel (again) Due to as yet unresolved errors seen in #1955 and #2029	2019-03-13 22:10:28 +01:00
Celelibi	b7f59da42d	Fix crash in sgemm SSE/nano kernel on x86_64 Fix bug #2047. Signed-off-by: Celelibi <celelibi@gmail.com>	2019-03-07 16:55:13 +01:00
Andrew	6eee1beac5	move fix to right place	2019-02-24 20:41:02 +02:00
Martin Kroeker	e12cdf58ef	Merge pull request #2024 from martin-frbg/gcc9fixes4 Fix inline assembly constraints in Bulldozer TRSM kernels	2019-02-17 11:49:15 +01:00
Martin Kroeker	1860c9456d	Merge pull request #2023 from martin-frbg/gcc9fixes3 Fix inline assembly constraints in various x86_64 GEMVN kernels	2019-02-17 11:48:57 +01:00
Martin Kroeker	f9bb76d29a	Fix inline assembly constraints in Bulldozer TRSM kernels rework indices to allow marking i,as and bs as both input and output (marked operand n1 as well for simplicity). For #2009	2019-02-16 20:06:48 +01:00
Martin Kroeker	efb9038f72	Fix inline assembly constraints	2019-02-16 18:46:17 +01:00
Martin Kroeker	e976557d29	Fix inline assembly constraints rework indices to allow marking argument lda as input and output.	2019-02-16 18:36:39 +01:00
Martin Kroeker	9d8be15789	Fix inline assembly constraints rework indices to allow marking argument lda4 as input and output. For #2009	2019-02-16 18:24:11 +01:00
Martin Kroeker	d752799a0f	Merge pull request #2021 from martin-frbg/gcc9fixes2 Fix wrong constraints in inline assembly of Haswell DTRSM kernel	2019-02-16 18:05:40 +01:00
Martin Kroeker	c26c0b77a7	Fix wrong constraints in inline assembly for #2009	2019-02-15 15:08:16 +01:00
Martin Kroeker	1c6da2d03c	Merge pull request #2019 from martin-frbg/gcc9fixes Fix unannounced modification of input operand 8 (lda4) in Haswell GEMVN microkernel	2019-02-15 15:02:54 +01:00
Martin Kroeker	4255a58cd2	Rename operands to put lda on the input/output constraint list	2019-02-15 10:10:04 +01:00
Martin Kroeker	46e415b140	Save and restore input argument 8 (lda4) Fixes miscompilation with gcc9 -ftree-vectorize (related to issue #2009)	2019-02-14 22:43:18 +01:00
Bart Oldeman	69a97ca7b9	dgemv_kernel_4x4(Haswell): add missing clobbers for xmm0,xmm1,xmm2,xmm3 This fixes a crash in dblat2 when OpenBLAS is compiled using -march=znver1 -ftree-vectorize -O2 See also: https://github.com/easybuilders/easybuild-easyconfigs/issues/7180	2019-02-14 16:27:58 +00:00
Martin Kroeker	ab1630f9fa	Fix declaration of arguments in inline assembly Argument 0 is modified so should be input and output	2019-02-12 16:14:02 +01:00
Martin Kroeker	b824fa70eb	Fix declaration of assembly arguments in SSYMV and DSYMV microkernels Arguments 0 and 1 are both input and output	2019-02-12 16:00:18 +01:00
Martin Kroeker	91481a3e4e	Fix declaration of input arguments in inline assembly Argument 0 is modified as it doubles as a counter	2019-02-12 15:51:43 +01:00
Martin Kroeker	dc6ac9eab0	Fix declaration of input arguments in the x86_64 s/dGEMV_T and s/dGEMV_N kernels Arguments 0 and 1 need to be tagged as both input and output	2019-02-12 15:33:48 +01:00
Martin Kroeker	32b0f1168e	Fix declaration of input arguments in the Sandybridge GER microkernels (#1967 ) * Tag arguments 0 and 1 as both input and output	2019-01-18 08:11:39 +01:00
Martin Kroeker	b495e54310	Fix declaration of input arguments in the x86_64 SCAL microkernels (#1966 ) * Tag arguments 0 and 1 as both input and output (see #1964)	2019-01-18 08:11:07 +01:00
Martin Kroeker	d5e6940253	Fix declaration of input arguments in the x86_64 microkernels for DOT and AXPY (#1965 ) * Tag operands 0 and 1 as both input and output For #1964 (basically a continuation of coding problems first seen in #1292)	2019-01-17 23:20:32 +01:00
Arjan van de Ven	795285c587	Fix thinko in skylake beta handling casting ints is cheaper but it has a rounding, not memory casing effect, resulting in invalid outcome	2018-12-24 18:49:50 +00:00
Arjan van de Ven	d321448a63	dgemm: use dgemm_ncopy_8_skylakex.c also for Haswell The dgemm_ncopy_8_skylakex.c code is not avx512 specific and gives a nice performance boost for medium sized matrices	2018-12-16 23:09:22 +00:00
Arjan van de Ven	c43331ad0a	dgemm: Use the skylakex beta function also for haswell it's more efficient for certain tall/skinny matrices	2018-12-16 23:09:17 +00:00
Arjan van de Ven	69d206440a	Make the skylakex/haswell sgemm code compile and run even with compilers without avx2 support	2018-12-16 00:19:41 +00:00
Arjan van de Ven	0586899a10	Use sgemm_ncopy_4_skylakex.c also for Haswell sgemm_ncopy_4_skylakex.c uses SSE transpose operations where the real perf win happens; this also works great for Haswell. This gives double digit percentage gains on small and skinny matrices	2018-12-15 13:49:19 +00:00
Arjan van de Ven	00dc09ad19	Use the skylake sgemm beta code also for haswell with a few small changes it's possible to use the skylake sgemm code also for haswell, this gives a modest gain (10% range) for smallish matrixes but does wonders for very skinny matrixes	2018-12-15 13:49:13 +00:00
Arjan van de Ven	cdc668d82b	Add a "sgemm direct" mode for small matrixes OpenBLAS has a fancy algorithm for copying the input data while laying it out in a more CPU friendly memory layout. This is great for large matrixes; the cost of the copy is easily ammortized by the gains from the better memory layout. But for small matrixes (on CPUs that can do efficient unaligned loads) this copy can be a net loss. This patch adds (for SKYLAKEX initially) a "sgemm direct" mode, that bypasses the whole copy machinary for ALPHA=1/BETA=0/... standard arguments, for small matrixes only. What is small? For the non-threaded case this has been measured to be in the MNK = 28 * 512 * 512 range, while in the threaded case it's less, around MNK = 1 * 512 * 512	2018-12-13 13:47:31 +00:00
Martin Kroeker	701ea88347	Use p2align instead of align for OSX compatibility fixes #1902	2018-12-03 13:06:43 +01:00
Andrew	19c4bdd8b3	Add return value so that freebsd system clang does not err out	2018-11-25 21:35:01 +01:00
Arjan van de Ven	dcc5d6291e	skylakex: Make the sgemm/dgemm beta code robust for a N=0 or M=0 case in the threading code there are cases where N or M can become 0, and the optimized beta code did not handle this well, leading to a crash during the audit for the crash a few edge conditions on the if statements were found and fixed as well	2018-11-01 01:42:09 +00:00
Arjan van de Ven	55b244ca0d	enable the SGEMM/SKX C based kernel In QA the final bug was found so now the sklyakex sgemm C based kernel can be activated....	2018-10-12 09:30:35 +00:00
Arjan van de Ven	d4bad73834	Add a C+intrinsics version of the SGEMM/skylakex kernel for most sizes this is 1.2x to 1.4x faster than the current code	2018-10-10 01:49:22 +00:00
Arjan van de Ven	582c589727	dgemm/skylakex: replace discrete mul/add with fma very minor gains since it's not super hot code, but general principles	2018-10-06 23:13:26 +00:00
Arjan van de Ven	adbf6afa25	Add vector optimizations for ncopy as well for dgemm/skylakex	2018-10-06 21:18:12 +00:00
Arjan van de Ven	32bec8afbb	add a skylakex optimized dgemm beta function	2018-10-06 16:36:26 +00:00
Arjan van de Ven	20c5d668fe	dgemm/avx512 simplify and speed up the 4x4 kernel	2018-10-06 14:12:32 +00:00
Arjan van de Ven	6d43c51ccf	undo slow dgemm/skylake microoptimization the compare is more costly than the work	2018-10-06 14:00:37 +00:00
Arjan van de Ven	d74dc39b0f	Add optimized *copy versions for skylakex Add optimized n/t copy versions for skylakex; in the patch the tcopy is also rewritten using intrinsics; the ncopy file will be worked on in a future commit	2018-10-06 13:51:44 +00:00
Arjan van de Ven	66b43affbc	Add a 24x8 kernel to the skylakex dgemm implementation Minor gains for small matrixes, but at 512x512 and above the gain gets more significant.	2018-10-05 13:22:21 +00:00
Arjan van de Ven	1938819c25	skylake dgemm: Add a 16x8 kernel The next step for the avx512 dgemm code is adding a 16x8 kernel. In the 8x8 kernel, each FMA has a matching load (the broadcast); in the 16x8 kernel we can reuse this load for 2 FMAs, which in turn reduces pressure on the load ports of the CPU and gives a nice performance boost (in the 25% range).	2018-10-05 13:11:35 +00:00
Martin Kroeker	b7496c3638	Function name needs to be CNAME, set from outside to allow suffixing for dynamic_arch	2018-10-04 19:14:59 +02:00
Arjan van de Ven	45fe8cb0c5	Create a AVX512 enabled version of DGEMM This patch adds dgemm_kernel_4x8_skylakex.c which is * dgemm_kernel_4x8_haswell.s converted to C + intrinsics * 8x8 support added * 8x8 kernel implemented using AVX512 Performance is a work in progress, but already shows a 10% - 20% increase for a wide range of matrix sizes.	2018-10-03 14:45:25 +00:00
Martin Kroeker	375dff54fc	Merge pull request #1733 from fenrus75/dsymv Add an AVX512 enabled DSYMV (L) function	2018-08-12 18:18:36 +02:00
Martin Kroeker	a5f165275a	Merge pull request #1732 from fenrus75/dgemv Add an AVX512 enabled DGEMV (n) function	2018-08-12 18:17:42 +02:00
Martin Kroeker	8c13aa495a	Merge pull request #1730 from fenrus75/fix-sdot Fix typo in sdot function	2018-08-12 18:17:01 +02:00
Arjan van de Ven	9bec34cb67	Add an AVX512 enabled DSYMV (L) function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-11 17:46:24 +00:00
Arjan van de Ven	87bebdbd8a	Add an AVX512 enabled DGEMV (n) function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-11 17:38:12 +00:00
Arjan van de Ven	36add7570a	Fix typo in sdot function it looks like my previous pull request was short the final commit; fix a typo in sdot	2018-08-11 17:16:45 +00:00
Arjan van de Ven	cacacc8007	Add an AVX512 enabled DSCAL function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-11 17:14:57 +00:00
Martin Kroeker	1a00ef3d27	Merge pull request #1725 from fenrus75/axpy Add a AVX512 enabled SAXPY/DAXPY functions	2018-08-11 11:01:20 +02:00
Arjan van de Ven	2e99873ff7	Add a AVX512 enabled SAXPY/DAXPY functions written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-10 02:58:32 +00:00
Arjan van de Ven	00abaa865b	Add an AVX512 enabled SDOT function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-10 02:33:43 +00:00
Arjan van de Ven	7932ff3ea9	Add an AVX512 enabled DDOT function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-09 03:55:52 +00:00
Martin Kroeker	6e54b0a027	Disable the 16x2 DTRMM kernel on SkylakeX as well	2018-06-30 17:31:06 +02:00
Martin Kroeker	f0a8dc2eec	Disable the AVX512 DGEMM kernel for now due to #1643	2018-06-30 11:34:48 +02:00
Craig Donner	c2545b0fd6	Fixed a few more unnecessary calls to num_cpu_avail. I don't have as many benchmarks for these as for gemm, but it should still make a difference for small matrices.	2018-06-11 10:17:16 +01:00
Arjan van de Ven	89372e0993	Use AVX512 also for DGEMM this required switching to the generic gemm_beta code (which is faster anyway on SKX) for both DGEMM and SGEMM Performance for the not-retuned version is in the 30% range	2018-06-03 22:17:27 +00:00
Arjan van de Ven	99c7bba8e4	Initial support for SkylakeX / AVX512 This patch adds the basic infrastructure for adding the SkylakeX (Intel Skylake server) target. The SkylakeX target will use the AVX512 (AVX512VL level) instruction set, which brings 2 basic things: 1) 512 bit wide SIMD (2x width of AVX2) 2) 32 SIMD registers (2x the number on AVX2) This initial patch only contains a trivial transofrmation of the Haswell SGEMM kernel to AVX512VL; more will follow later but this patch aims to get the infrastructure in place for this "later". Full performance tuning has not been done yet; with more registers and wider SIMD it's in theory possible to retune the kernels but even without that there's an interesting enough performance increase (30-40% range) with just this change.	2018-06-03 07:58:52 +00:00

... 2 3 4 5 6 ...

766 Commits