OpenBLAS

Commit Graph

Author	SHA1	Message	Date
wjc404	1b980001dd	Update zgemm_kernel_4x2_haswell.c	2020-02-26 18:38:12 +08:00
wjc404	2515e1152f	Update cgemm_kernel_8x2_haswell.c	2020-02-26 18:36:54 +08:00
wjc404	903854c168	Add files via upload	2020-02-22 23:40:02 +08:00
wjc404	a2ff577a30	Update KERNEL.ZEN	2020-02-22 23:39:43 +08:00
wjc404	97a32cb0a5	Update KERNEL.HASWELL	2020-02-22 23:39:20 +08:00
Martin Liska	aeea14ee40	Come up with LOAD_AND_COMPARE_TO_MXX macro in iamax_sse.S.	2020-02-17 09:01:53 +01:00
Martin Liska	18bcc36a69	Fix implementation of iamax_sse.S as reported in #2116. There was a typo in iamax_sse.S where one of the comparisons was cmpeqps instead of cmpeqss. That misdetected the index for sequences where the minimum value was 0.	2020-02-17 09:01:53 +01:00
wjc404	f566787e6e	Update KERNEL.SKYLAKEX	2020-02-16 22:58:44 +08:00
wjc404	e3368cbf18	AVX512 STRMM kernel	2020-02-16 22:58:00 +08:00
Bart Oldeman	7ea5e07d1c	Fix inline asm in dscal: mark x, x1 as clobbered. Fixes #2408 The leaq instructions in dscal_kernel_inc_8 modify x and x1 so they must be declared as input/output constraints, otherwise the compiler may assume the corresponding registers are not modified.	2020-02-12 14:11:44 +00:00
wjc404	3447d04eaf	Update dgemm_kernel_16x2_skylakex.c	2020-02-06 02:14:10 +00:00
wjc404	8b5cdcc64c	Update sgemm_kernel_8x4_haswell.c	2020-02-06 01:47:46 +00:00
wjc404	4e00d96a78	Update dgemm_kernel_16x2_skylakex.c	2020-02-06 01:46:36 +00:00
wjc404	096da2f51a	Update dgemm_kernel_16x2_skylakex.c	2020-02-05 13:36:57 +08:00
wjc404	081b188529	Update KERNEL.SKYLAKEX	2020-02-03 21:38:08 +08:00
wjc404	8019e70211	AVX512 16x2 DGEMM kernel	2020-02-03 21:32:56 +08:00
wjc404	e5dcdeb550	Update sgemm_direct_skylakex.c	2020-01-13 16:59:23 +08:00
wjc404	952cc2ba38	Update sgemm_kernel_16x4_skylakex_2.c	2020-01-13 16:58:54 +08:00
wjc404	feaafbedd3	make skylakex sgemm code more friendly for readers BTW some kernels were adjusted to improve performance	2020-01-13 16:28:41 +08:00
wjc404	3a100b2797	Update KERNEL.SKYLAKEX	2020-01-09 13:48:41 +08:00
wjc404	bd4c032f52	Update sgemm_kernel_8x4_haswell.c	2020-01-07 11:22:46 +08:00
wjc404	9dc9b7b95e	Update sgemm_kernel_8x4_haswell.c	2020-01-06 20:11:36 +08:00
wjc404	92b10212de	optimize AVX2 SGEMM	2020-01-06 12:11:21 +08:00
wjc404	b73bf01378	optimize AVX2 SGEMM	2020-01-06 12:09:14 +08:00
wjc404	eb3c9f1db9	optimize AVX2 SGEMM	2020-01-06 12:07:02 +08:00
wjc404	a0f0a802fc	Update zgemm3m_kernel_4x4_haswell.c	2019-12-30 17:33:42 +08:00
wjc404	700fe5b5ee	Add files via upload	2019-12-30 17:18:59 +08:00
wjc404	f60840c420	Update KERNEL.ZEN	2019-12-30 16:04:23 +08:00
wjc404	109e18cd96	Update KERNEL.HASWELL	2019-12-30 16:03:24 +08:00
wjc404	ae1579be13	Create zgemm3m_kernel_4x4_haswell.c	2019-12-30 16:02:51 +08:00
wjc404	cd765f094b	Update cgemm3m_kernel_8x4_haswell.c	2019-12-27 18:23:29 +08:00
wjc404	3a66c8cac1	Update KERNEL.ZEN	2019-12-27 18:04:08 +08:00
wjc404	ed9af2f7da	Update KERNEL.HASWELL	2019-12-27 18:01:38 +08:00
wjc404	5fd1edead9	Create cgemm3m_kernel_8x4_haswell.c	2019-12-27 18:00:55 +08:00
wjc404	eeecd623d8	Update cgemm_kernel_8x2_haswell.c	2019-12-24 00:40:16 +08:00
wjc404	2cd9306bb5	Update KERNEL.ZEN	2019-12-23 23:42:30 +08:00
wjc404	c418c81224	Update KERNEL.HASWELL	2019-12-23 23:41:44 +08:00
wjc404	025741f16a	Fast Haswell CGEMM kernel	2019-12-23 23:40:03 +08:00
wjc404	f41d52665d	Fast Haswell ZGEMM kernel	2019-12-21 14:37:06 +08:00
wjc404	d573d24de7	Fast Haswell ZGEMM kernel	2019-12-21 14:35:15 +08:00
Isuru Fernando	b863b32ac5	Workaround an ICE in clang 9.0.0 This bug is not there in 8.x nor in the 9.0 daily snapshot.	2019-12-01 12:59:46 -06:00
wjc404	934e601e93	Update dgemm_kernel_4x8_skylakex_2.c	2019-11-28 19:56:35 +08:00
wjc404	eb1e9c8c92	some optimizations	2019-11-26 14:12:20 +08:00
Wang, Long	bfb5fbdb4d	revised fix windows compatible for #2313 Signed-off-by: Wang, Long <long1.wang@intel.com>	2019-11-21 10:22:58 +08:00
Wang, Long	1191db1a49	For the sake of windows compatible, used "unsigned long long" to ensure 64-bit length Signed-off-by: Wang, Long <long1.wang@intel.com>	2019-11-20 21:30:47 +08:00
Wang, Long	0caf1434c9	Fix the integer overflow issue for large matrix size For large matrix, e.g. M=N=K, and M>1290, int mnk=M*N*K will overflow. This will lead to wrong branching to single-threading. The performance is downgraded significantly. Signed-off-by: Wang, Long <long1.wang@intel.com>	2019-11-20 14:11:17 +08:00
wjc404	819e852ae7	AVX512 CGEMM & ZGEMM kernels 96-99% 1-thread performance of MKL2018	2019-11-11 20:04:52 +08:00
wjc404	836c414e22	optimizations of software prefetching	2019-11-05 13:36:56 +08:00
wjc404	430c11e135	Add files via upload	2019-11-04 20:10:12 +08:00
wjc404	fbacd2605d	optimizations via software prefetches	2019-11-04 19:37:19 +08:00
wjc404	1df9a2013d	new sgemm kernel for skylakex	2019-11-02 00:00:48 +08:00
wjc404	6ff013bae0	native support for icopy_4 90% MKL 1-thread performance.	2019-10-19 03:54:44 +08:00
wjc404	0d669e04bb	Update dgemm_kernel_8x8_skylakex.c	2019-10-18 15:00:17 +08:00
wjc404	17cdd9f9e1	some correction	2019-10-18 14:58:07 +08:00
wjc404	6bcb06fcb1	make further changes to icopy_8 easier	2019-10-18 10:47:31 +08:00
wjc404	b7315f8401	Add files via upload	2019-10-16 19:23:36 +08:00
wjc404	9b19e9e1b0	Update dgemm_kernel_8x8_skylakex.c	2019-10-16 10:14:51 +08:00
wjc404	6bd67ddbab	Update dgemm_kernel_8x8_skylakex.c	2019-10-16 03:20:08 +08:00
wjc404	844629af57	Add files via upload	2019-10-16 02:00:34 +08:00
Martin Kroeker	11c59acfb1	Keep both PGI/SUN and default code paths to avoid breaking Clang/Windows	2019-08-28 18:07:44 +02:00
Martin Kroeker	3a55dca2dc	Make x86_64 zdot compile with PGI and Sun C again broken by #2222 as CREAL,CIMAG do not expand to a valid lvalue with these compilers	2019-08-28 11:35:31 +02:00
Martin Kroeker	9ef96b32a6	Add multithreading support to the x86_64 zdot kernel (#2222 ) * Add multithreading support copied from the ThunderX2T99 kernel. For #2221	2019-08-15 22:09:12 +02:00
Martin Kroeker	dccff2e785	Merge pull request #2206 from martin-frbg/zen-dtrmm Replace vpermpd with vpermilpd in the Haswell DTRMM kernel	2019-08-09 07:55:20 +02:00
Martin Kroeker	5c3458a6e7	Merge pull request #2199 from martin-frbg/zen-dtrsm Replace most vpermpd calls in the Haswell DTRSM_RN kernel	2019-08-09 07:55:02 +02:00
Martin Kroeker	acf6002ab2	Replace most vpermpd calls in the Haswell DTRSM_RN kernel	2019-08-03 12:40:13 +02:00
Martin Kroeker	2dfb804cb9	Replace vpermpd with vpermilpd in the Haswell DTRMM kernel to improve performance on AMD Zen (#2180) applying wjc404's improvement of the DGEMM kernel from #2186	2019-07-28 23:17:28 +02:00
Martin Kroeker	4c153ec9da	Merge pull request #2196 from wjc404/develop Add vbroadcastsd kernel to dgemm_kernel_4x8_haswell.S	2019-07-28 23:11:40 +02:00
wjc404	7eecd8e39c	Add files via upload	2019-07-28 07:39:09 +08:00
Martin Kroeker	7b0b7c11d2	Merge pull request #2190 from martin-frbg/zdot-zen Replace vpermpd with vpermilpd in the Haswell/Zen zdot microkernel	2019-07-23 16:15:08 +02:00
Martin Kroeker	28e96458e5	Replace vpermpd with vpermilpd to improve performance on Zen/Zen2 (as demonstrated by wjc404 in #2180)	2019-07-22 08:28:16 +02:00
wjc404	95fb98f556	Update dgemm_kernel_4x8_haswell.S	2019-07-21 01:10:32 +08:00
wjc404	4801c6d36b	Update dgemm_kernel_4x8_haswell.S	2019-07-21 00:47:45 +08:00
wjc404	9440fa607d	Add files via upload	2019-07-20 22:08:22 +08:00
wjc404	94db259e5b	Add files via upload	2019-07-20 22:04:41 +08:00
wjc404	f49f8047ac	Add files via upload	2019-07-20 14:33:37 +08:00
wjc404	825777faab	Update dgemm_kernel_4x8_haswell.S	2019-07-19 23:58:24 +08:00
wjc404	9c89757562	Add files via upload	2019-07-19 23:47:58 +08:00
wjc404	9b04baeaee	Update dgemm_kernel_4x8_haswell.S	2019-07-17 23:50:03 +08:00
wjc404	8a074b3965	Update dgemm_kernel_4x8_haswell.S	2019-07-17 23:47:30 +08:00
wjc404	211ab03b14	Update dgemm_kernel_4x8_haswell.S	2019-07-17 22:39:15 +08:00
wjc404	1733f927e6	Update dgemm_kernel_4x8_haswell.S	2019-07-17 21:27:41 +08:00
wjc404	182b06d6ad	Update dgemm_kernel_4x8_haswell.S	2019-07-17 17:02:35 +08:00
wjc404	7a9050d681	Update dgemm_kernel_4x8_haswell.S	2019-07-17 00:55:06 +08:00
wjc404	0ba29fd262	Update dgemm_kernel_4x8_haswell.S for zen2 replaced a bunch of vpermpd instructions with vpermilpd and vperm2f128	2019-07-17 00:46:51 +08:00
Martin Kroeker	9ea30f3788	Replace ISMIN and ISAMIN kernels on all x86_64 platforms (#2125 ) * Mark iamax_sse.S as unsuitable for MIN due to issue #2116 * Use iamax.S rather than iamax_sse.S for ISMIN/ISAMIN on all x86_64 as workaround for #2116	2019-05-09 14:42:36 +02:00
Martin Kroeker	b1561ecc68	Disable DGEMMINCOPY as well for now #1955	2019-05-05 15:52:01 +02:00
Martin Kroeker	7ed8431527	Disable the SkyLakeX DGEMMITCOPY kernel as well as a stopgap measure for https://github.com/numpy/numpy/issues/13401 as mentioned in #1955	2019-05-04 22:54:41 +02:00
Martin Kroeker	c04a729081	Add ?sum definitions for generic kernel	2019-03-31 13:55:49 +02:00
Martin Kroeker	9d717cb5ee	Add x86_64 implementation of ?sum as trivial copy of ?asum with the fabs calls removed	2019-03-30 22:27:04 +01:00
Martin Kroeker	32c7063cb0	Merge pull request #2061 from martin-frbg/martin-frbg-patch-1 Disable the AVX512 DGEMM kernel (again)	2019-03-30 21:21:38 +01:00
Martin Kroeker	e608d4f7fe	Disable the AVX512 DGEMM kernel (again) Due to as yet unresolved errors seen in #1955 and #2029	2019-03-13 22:10:28 +01:00
Celelibi	b7f59da42d	Fix crash in sgemm SSE/nano kernel on x86_64 Fix bug #2047. Signed-off-by: Celelibi <celelibi@gmail.com>	2019-03-07 16:55:13 +01:00
Andrew	6eee1beac5	move fix to right place	2019-02-24 20:41:02 +02:00
Martin Kroeker	e12cdf58ef	Merge pull request #2024 from martin-frbg/gcc9fixes4 Fix inline assembly constraints in Bulldozer TRSM kernels	2019-02-17 11:49:15 +01:00
Martin Kroeker	1860c9456d	Merge pull request #2023 from martin-frbg/gcc9fixes3 Fix inline assembly constraints in various x86_64 GEMVN kernels	2019-02-17 11:48:57 +01:00
Martin Kroeker	f9bb76d29a	Fix inline assembly constraints in Bulldozer TRSM kernels rework indices to allow marking i,as and bs as both input and output (marked operand n1 as well for simplicity). For #2009	2019-02-16 20:06:48 +01:00
Martin Kroeker	efb9038f72	Fix inline assembly constraints	2019-02-16 18:46:17 +01:00
Martin Kroeker	e976557d29	Fix inline assembly constraints rework indices to allow marking argument lda as input and output.	2019-02-16 18:36:39 +01:00
Martin Kroeker	9d8be15789	Fix inline assembly constraints rework indices to allow marking argument lda4 as input and output. For #2009	2019-02-16 18:24:11 +01:00
Martin Kroeker	d752799a0f	Merge pull request #2021 from martin-frbg/gcc9fixes2 Fix wrong constraints in inline assembly of Haswell DTRSM kernel	2019-02-16 18:05:40 +01:00
Martin Kroeker	c26c0b77a7	Fix wrong constraints in inline assembly for #2009	2019-02-15 15:08:16 +01:00
Martin Kroeker	1c6da2d03c	Merge pull request #2019 from martin-frbg/gcc9fixes Fix unannounced modification of input operand 8 (lda4) in Haswell GEMVN microkernel	2019-02-15 15:02:54 +01:00
Martin Kroeker	4255a58cd2	Rename operands to put lda on the input/output constraint list	2019-02-15 10:10:04 +01:00
Martin Kroeker	46e415b140	Save and restore input argument 8 (lda4) Fixes miscompilation with gcc9 -ftree-vectorize (related to issue #2009)	2019-02-14 22:43:18 +01:00
Bart Oldeman	69a97ca7b9	dgemv_kernel_4x4(Haswell): add missing clobbers for xmm0,xmm1,xmm2,xmm3 This fixes a crash in dblat2 when OpenBLAS is compiled using -march=znver1 -ftree-vectorize -O2 See also: https://github.com/easybuilders/easybuild-easyconfigs/issues/7180	2019-02-14 16:27:58 +00:00
Martin Kroeker	ab1630f9fa	Fix declaration of arguments in inline assembly Argument 0 is modified so should be input and output	2019-02-12 16:14:02 +01:00
Martin Kroeker	b824fa70eb	Fix declaration of assembly arguments in SSYMV and DSYMV microkernels Arguments 0 and 1 are both input and output	2019-02-12 16:00:18 +01:00
Martin Kroeker	91481a3e4e	Fix declaration of input arguments in inline assembly Argument 0 is modified as it doubles as a counter	2019-02-12 15:51:43 +01:00
Martin Kroeker	dc6ac9eab0	Fix declaration of input arguments in the x86_64 s/dGEMV_T and s/dGEMV_N kernels Arguments 0 and 1 need to be tagged as both input and output	2019-02-12 15:33:48 +01:00
Martin Kroeker	32b0f1168e	Fix declaration of input arguments in the Sandybridge GER microkernels (#1967 ) * Tag arguments 0 and 1 as both input and output	2019-01-18 08:11:39 +01:00
Martin Kroeker	b495e54310	Fix declaration of input arguments in the x86_64 SCAL microkernels (#1966 ) * Tag arguments 0 and 1 as both input and output (see #1964)	2019-01-18 08:11:07 +01:00
Martin Kroeker	d5e6940253	Fix declaration of input arguments in the x86_64 microkernels for DOT and AXPY (#1965 ) * Tag operands 0 and 1 as both input and output For #1964 (basically a continuation of coding problems first seen in #1292)	2019-01-17 23:20:32 +01:00
Arjan van de Ven	795285c587	Fix thinko in skylake beta handling casting ints is cheaper but it has a rounding, not memory casting effect, resulting in invalid outcome	2018-12-24 18:49:50 +00:00
Arjan van de Ven	d321448a63	dgemm: use dgemm_ncopy_8_skylakex.c also for Haswell The dgemm_ncopy_8_skylakex.c code is not avx512 specific and gives a nice performance boost for medium sized matrices	2018-12-16 23:09:22 +00:00
Arjan van de Ven	c43331ad0a	dgemm: Use the skylakex beta function also for haswell it's more efficient for certain tall/skinny matrices	2018-12-16 23:09:17 +00:00
Arjan van de Ven	69d206440a	Make the skylakex/haswell sgemm code compile and run even with compilers without avx2 support	2018-12-16 00:19:41 +00:00
Arjan van de Ven	0586899a10	Use sgemm_ncopy_4_skylakex.c also for Haswell sgemm_ncopy_4_skylakex.c uses SSE transpose operations where the real perf win happens; this also works great for Haswell. This gives double digit percentage gains on small and skinny matrices	2018-12-15 13:49:19 +00:00
Arjan van de Ven	00dc09ad19	Use the skylake sgemm beta code also for haswell with a few small changes it's possible to use the skylake sgemm code also for haswell, this gives a modest gain (10% range) for smallish matrixes but does wonders for very skinny matrixes	2018-12-15 13:49:13 +00:00
Arjan van de Ven	cdc668d82b	Add a "sgemm direct" mode for small matrixes OpenBLAS has a fancy algorithm for copying the input data while laying it out in a more CPU friendly memory layout. This is great for large matrixes; the cost of the copy is easily amortized by the gains from the better memory layout. But for small matrixes (on CPUs that can do efficient unaligned loads) this copy can be a net loss. This patch adds (for SKYLAKEX initially) a "sgemm direct" mode, that bypasses the whole copy machinery for ALPHA=1/BETA=0/... standard arguments, for small matrixes only. What is small? For the non-threaded case this has been measured to be in the MNK = 28 * 512 * 512 range, while in the threaded case it's less, around MNK = 1 * 512 * 512	2018-12-13 13:47:31 +00:00
Martin Kroeker	701ea88347	Use p2align instead of align for OSX compatibility fixes #1902	2018-12-03 13:06:43 +01:00
Andrew	19c4bdd8b3	Add return value so that freebsd system clang does not err out	2018-11-25 21:35:01 +01:00
Arjan van de Ven	dcc5d6291e	skylakex: Make the sgemm/dgemm beta code robust for a N=0 or M=0 case in the threading code there are cases where N or M can become 0, and the optimized beta code did not handle this well, leading to a crash during the audit for the crash a few edge conditions on the if statements were found and fixed as well	2018-11-01 01:42:09 +00:00
Arjan van de Ven	55b244ca0d	enable the SGEMM/SKX C based kernel In QA the final bug was found so now the skylakex sgemm C based kernel can be activated....	2018-10-12 09:30:35 +00:00
Arjan van de Ven	d4bad73834	Add a C+intrinsics version of the SGEMM/skylakex kernel for most sizes this is 1.2x to 1.4x faster than the current code	2018-10-10 01:49:22 +00:00
Arjan van de Ven	582c589727	dgemm/skylakex: replace discrete mul/add with fma very minor gains since it's not super hot code, but general principles	2018-10-06 23:13:26 +00:00
Arjan van de Ven	adbf6afa25	Add vector optimizations for ncopy as well for dgemm/skylakex	2018-10-06 21:18:12 +00:00
Arjan van de Ven	32bec8afbb	add a skylakex optimized dgemm beta function	2018-10-06 16:36:26 +00:00
Arjan van de Ven	20c5d668fe	dgemm/avx512 simplify and speed up the 4x4 kernel	2018-10-06 14:12:32 +00:00
Arjan van de Ven	6d43c51ccf	undo slow dgemm/skylake microoptimization the compare is more costly than the work	2018-10-06 14:00:37 +00:00
Arjan van de Ven	d74dc39b0f	Add optimized *copy versions for skylakex Add optimized n/t copy versions for skylakex; in the patch the tcopy is also rewritten using intrinsics; the ncopy file will be worked on in a future commit	2018-10-06 13:51:44 +00:00
Arjan van de Ven	66b43affbc	Add a 24x8 kernel to the skylakex dgemm implementation Minor gains for small matrixes, but at 512x512 and above the gain gets more significant.	2018-10-05 13:22:21 +00:00
Arjan van de Ven	1938819c25	skylake dgemm: Add a 16x8 kernel The next step for the avx512 dgemm code is adding a 16x8 kernel. In the 8x8 kernel, each FMA has a matching load (the broadcast); in the 16x8 kernel we can reuse this load for 2 FMAs, which in turn reduces pressure on the load ports of the CPU and gives a nice performance boost (in the 25% range).	2018-10-05 13:11:35 +00:00
Martin Kroeker	b7496c3638	Function name needs to be CNAME, set from outside to allow suffixing for dynamic_arch	2018-10-04 19:14:59 +02:00
Arjan van de Ven	45fe8cb0c5	Create a AVX512 enabled version of DGEMM This patch adds dgemm_kernel_4x8_skylakex.c which is * dgemm_kernel_4x8_haswell.s converted to C + intrinsics * 8x8 support added * 8x8 kernel implemented using AVX512 Performance is a work in progress, but already shows a 10% - 20% increase for a wide range of matrix sizes.	2018-10-03 14:45:25 +00:00
Martin Kroeker	375dff54fc	Merge pull request #1733 from fenrus75/dsymv Add an AVX512 enabled DSYMV (L) function	2018-08-12 18:18:36 +02:00
Martin Kroeker	a5f165275a	Merge pull request #1732 from fenrus75/dgemv Add an AVX512 enabled DGEMV (n) function	2018-08-12 18:17:42 +02:00
Martin Kroeker	8c13aa495a	Merge pull request #1730 from fenrus75/fix-sdot Fix typo in sdot function	2018-08-12 18:17:01 +02:00
Arjan van de Ven	9bec34cb67	Add an AVX512 enabled DSYMV (L) function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-11 17:46:24 +00:00
Arjan van de Ven	87bebdbd8a	Add an AVX512 enabled DGEMV (n) function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-11 17:38:12 +00:00
Arjan van de Ven	36add7570a	Fix typo in sdot function it looks like my previous pull request was short the final commit; fix a typo in sdot	2018-08-11 17:16:45 +00:00
Arjan van de Ven	cacacc8007	Add an AVX512 enabled DSCAL function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-11 17:14:57 +00:00
Martin Kroeker	1a00ef3d27	Merge pull request #1725 from fenrus75/axpy Add a AVX512 enabled SAXPY/DAXPY functions	2018-08-11 11:01:20 +02:00
Arjan van de Ven	2e99873ff7	Add a AVX512 enabled SAXPY/DAXPY functions written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-10 02:58:32 +00:00
Arjan van de Ven	00abaa865b	Add an AVX512 enabled SDOT function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-10 02:33:43 +00:00
Arjan van de Ven	7932ff3ea9	Add an AVX512 enabled DDOT function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-09 03:55:52 +00:00
Martin Kroeker	6e54b0a027	Disable the 16x2 DTRMM kernel on SkylakeX as well	2018-06-30 17:31:06 +02:00
Martin Kroeker	f0a8dc2eec	Disable the AVX512 DGEMM kernel for now due to #1643	2018-06-30 11:34:48 +02:00
Craig Donner	c2545b0fd6	Fixed a few more unnecessary calls to num_cpu_avail. I don't have as many benchmarks for these as for gemm, but it should still make a difference for small matrices.	2018-06-11 10:17:16 +01:00
Arjan van de Ven	89372e0993	Use AVX512 also for DGEMM this required switching to the generic gemm_beta code (which is faster anyway on SKX) for both DGEMM and SGEMM Performance for the not-retuned version is in the 30% range	2018-06-03 22:17:27 +00:00
Arjan van de Ven	99c7bba8e4	Initial support for SkylakeX / AVX512 This patch adds the basic infrastructure for adding the SkylakeX (Intel Skylake server) target. The SkylakeX target will use the AVX512 (AVX512VL level) instruction set, which brings 2 basic things: 1) 512 bit wide SIMD (2x width of AVX2) 2) 32 SIMD registers (2x the number on AVX2) This initial patch only contains a trivial transformation of the Haswell SGEMM kernel to AVX512VL; more will follow later but this patch aims to get the infrastructure in place for this "later". Full performance tuning has not been done yet; with more registers and wider SIMD it's in theory possible to retune the kernels but even without that there's an interesting enough performance increase (30-40% range) with just this change.	2018-06-03 07:58:52 +00:00
Martin Kroeker	840e01061f	Merge pull request #1491 from martin-frbg/ddot_mt Add multithreading support for Haswell DDOT	2018-03-27 21:43:05 +02:00
Martin Kroeker	a55694dd5b	Declare dot_compute static to avoid conflicts in multiarch builds	2018-03-16 22:23:36 +01:00
Martin Kroeker	85a41e9cdb	Add multithreading support for Haswell DDOT copied from ashwinyes' implementation in dot_thunderx2t99.c	2018-03-16 16:58:47 +01:00
Martin Kroeker	81215711a2	Re-enable DAXPY microkernels for x86_64 as the inaccuracies seen in the original testcase for #1332 appear to be due to an artefact that amplifies the very small rounding differences between FMA and discrete multiply+add	2018-03-04 19:37:03 +01:00
Martin Kroeker	497f0c3d8a	Replace .align with .p2align in the Nehalem microkernels	2018-02-26 20:58:33 +01:00
Martin Kroeker	ea37db828e	Convert .align to .p2align for OSX compatibility	2018-02-26 20:48:03 +01:00
Martin Kroeker	7c1925acec	Use .p2align instead of .align for compatibility on Sandybridge as well	2018-02-24 19:43:15 +01:00
Martin Kroeker	2359c7c1a9	Use .p2align instead of .align for portability The OSX assembler apparently mishandles the argument to decimal .align, leading to a significant loss of performance as observed in #730, #901 and most recently #1470	2018-02-24 17:50:13 +01:00
Martin Kroeker	e388459a27	Merge pull request #1419 from brada4/develop Initialize uninitialized values for repeated calls	2018-01-31 23:48:34 +01:00
Andrew	4938faa822	core.IdenticalExpr clang501 checker	2018-01-19 23:15:58 +01:00
Martin Kroeker	42285d8e70	Merge pull request #1410 from brada4/develop Address warnings #1357	2018-01-06 20:02:46 +01:00
Andrew	4d0b005e5b	Eliminate remaining unused results in kernels (clang5 analyzer)	2018-01-01 20:54:39 +01:00
Martin Kroeker	b81656936f	Merge pull request #1409 from martin-frbg/issue1292-2 Tag %1 and %2 as both input and output operands	2017-12-31 20:18:48 +01:00
Martin Kroeker	b973990df2	Tag %1 and %2 as both input and output operands fix from #1292 extended to the other gemv microkernels	2017-12-31 18:03:36 +01:00
Martin Kroeker	1e31124eb0	Merge pull request #1406 from martin-frbg/issue1292 Tag %1 and %2 as both input and output	2017-12-30 14:52:03 +01:00
Martin Kroeker	723f396a20	Tag %1 and %2 as both input and output The inline assembly modifies its input operands, so mark them as output to avoid surprises with optimization. Fixes #1292	2017-12-29 23:56:41 +01:00
Martin Kroeker	43c0622e7b	Retire Piledriver/Steamroller/Excavator daxpy microkernels as well related to issue #1332	2017-12-13 18:40:39 +01:00
Martin Kroeker	0623636c98	Use Sandybridge daxpy kernel on Haswell and Zen for now The testcase from #1332 exposes a problem in daxpy_microk_haswell-2.c that is not seen with any of the other Intel x86_64 microkernels.	2017-12-10 19:24:31 +01:00
Andrew	281a2b952f	warning cleanup (#1380 ) * dead increments in driver/level2 * dead increments in kernel/generic * part dead increments in kernel/x86_64	2017-12-05 19:54:10 +01:00
Martin Kroeker	6c77b5f267	Merge pull request #1369 from martin-frbg/dsdot Add optimized dsdot to all other x86_64 kernels that use sdot.c	2017-11-28 18:15:31 +01:00
Martin Kroeker	c92cd6d162	Add trivially optimized dsdot based on sdot	2017-11-24 20:05:27 +01:00
Martin Kroeker	cae5d9a20b	Add trivially optimized dsdot based on sdot	2017-11-24 20:04:29 +01:00
Martin Kroeker	3d891c3106	Add trivially optimized dsdot based on sdot	2017-11-24 20:03:40 +01:00
Martin Kroeker	4fbdcfa823	Add trivially optimized dsdot based on sdot	2017-11-24 20:02:28 +01:00
Martin Kroeker	1bb6a96ebc	Add trivially optimized dsdot based on sdot	2017-11-24 20:01:42 +01:00
Martin Kroeker	6bd163f37a	Add trivially optimized dsdot based on sdot	2017-11-24 20:00:23 +01:00
Martin Kroeker	f0333333d1	Add trivially optimized dsdot based on sdot	2017-11-24 19:59:28 +01:00
Andrew	e89b979b2c	fix spurious compiler warning fix (no code change)	2017-11-24 18:39:04 +01:00
Andrew	7e9b29b9b8	fix spurious compiler warning (no code change)	2017-11-24 18:36:37 +01:00
Martin Kroeker	6157d0902a	Merge pull request #1358 from martin-frbg/unused_vars Clean up spurious unused variables in the kernels	2017-11-15 11:31:43 +01:00
Martin Kroeker	3fea849bbf	Remove unused variables from Haswell dtrmm and Bulldozer dtrsm	2017-11-14 23:35:10 +01:00
Martin Kroeker	8f177621bc	Remove unused variables at0...at3 from ?symv_U	2017-11-14 23:32:25 +01:00
Martin Kroeker	5f402b7759	Remove unused (loop?) variable j from the gemv_n_4 implementations	2017-11-14 23:29:42 +01:00
Martin Kroeker	a07807caac	Eliminate loop code when called as/from dsdot	2017-10-25 16:45:41 +02:00
Martin Kroeker	5e3e91d0fc	Split the microkernel workload into chunks of 32 floats for dsdot mode to limit loss of precision	2017-10-22 18:18:51 +02:00
Martin Kroeker	28c3fa8950	Add dsdot	2017-10-16 23:29:03 +02:00
Martin Kroeker	8ac87c1cb6	Implement DSDOT with unchanged sdot microkernels	2017-10-16 23:27:51 +02:00
Isuru Fernando	505b218829	Merge remote-tracking branch 'upstream/develop' into dyn	2017-08-06 19:07:00 +05:30
Isuru Fernando	1d1854032b	Add missing EXCAVATOR	2017-08-02 19:03:04 +05:30
Isuru Fernando	2c51a990ac	Fix extra whitespaces. CMake parser macro fails with it TODO: Fix the parser macro to strip trailing whitespaces	2017-08-02 18:26:57 +05:30
Isuru Fernando	ca17b4b75c	Fix complex support for MSVC headers	2017-07-28 11:50:29 +05:30
Denis Steckelmacher	c9ff735da6	Add ZEN support (tested for auto-detected static backend)	2017-03-19 15:32:50 +01:00
Martin Kroeker	a6efabf155	Replace gnu _real_ , _imag_ extensions in initializers	2017-03-13 00:38:37 +01:00
Martin Kroeker	dc34a0da96	Merge pull request #915 from mdong/small_fix_for_icc remove input from clobbered list	2017-02-23 20:00:22 +01:00
Martin Kroeker	4998e19869	Change file comments to work around clang 3.9 assembler bug	2016-10-13 16:51:08 +02:00
Martin Kroeker	16446d1d23	Remove explicit include of complex.h	2016-09-29 23:45:56 +02:00
mdong	098d8ec5d6	remove input from clobbered list	2016-06-24 16:37:58 -04:00
Werner Saar	298b13bba4	updated some kernel files for EXCAVATOR	2016-04-25 10:36:23 +02:00
Zhang Xianyi	f24d5307cf	Refs #834 . Fix zgemv config bug on Steamroller.	2016-04-12 22:26:11 +08:00
Zhang Xianyi	d4380c1fe4	Refs xianyi/OpenBLAS-CI#10 , Fix sdot for scipy test_iterative.test_convergence test failure on AMD bulldozer and piledriver.	2016-04-07 01:44:18 +08:00
Werner Saar	faa5e2e5e3	FIX: forgot to add the files cgemv_n_4.c and cgemv_t_4.c	2016-03-10 11:10:38 +01:00
Werner Saar	fdf291be30	Added optimized cgemv_n and cgemv_t kernels for bulldozer, piledriver and steamroller	2016-03-10 09:42:07 +01:00
Werner Saar	c99cc41cbd	Added optimized zgemv_n kernel for bulldozer, piledriver and steamroller	2016-03-09 14:02:03 +01:00
Werner Saar	acdff55a6a	Bugfix for ztrmv	2016-03-07 09:39:34 +01:00
Zhang Xianyi	7d6b68eb4a	Refs #786 . Revert to default assembly kernel.	2016-03-07 11:34:58 +08:00
Zhang Xianyi	8f758eeff9	Refs #786 . avoid old assembly c/zgemv kernels.	2016-03-05 08:32:03 +08:00
Zhang Xianyi	efa4f5c936	Refs #695 #783 . Replace default x86_64 cgemv_t asm kernel by C kernel.	2016-03-01 11:18:56 +08:00
Zhang Xianyi	6e7be06e07	Refs JuliaLang/julia#5728 . Fix gemv performance bug on Haswell Mac OSX. On Mac OS X, it should use .align 4 (equal to .align 16 on Linux). I didn't get the performance benefit from .align. Thus, I deleted it.	2016-02-19 17:56:07 -05:00
Zhang Xianyi	962376664d	Refs #768 . Swap the result of zdot x87 fp kernel.	2016-02-02 09:15:02 +08:00
Zhang Xianyi	c44ff4d648	Refs #714 . avoid compiling warnings.	2016-01-28 04:38:07 +08:00
Werner Saar	c8f2c5d636	added optimized trsm_kernels	2016-01-05 13:05:05 +01:00
Zhang Xianyi	69363622a8	Fix DYNAMIC_ARCH=1 bug.	2015-10-27 05:10:40 +08:00
Zhang Xianyi	f874465bb8	Use cmake to build OpenBLAS GENERIC Target on MSVC x86 64-bit. Disable CBLAS and LAPACK.	2015-08-10 14:10:44 -05:00
Zhang Xianyi	ab0a0a75fc	Merge branch 'develop' into cmake	2015-08-03 23:59:01 -05:00
Zhang Xianyi	1cf2b10224	Use pure C generic target on x86 and x86_64. make TARGET=GENERIC ?gemm3m is unimplemented on generic target.	2015-08-03 23:55:56 -05:00
Zhang Xianyi	7ac7e147d4	Fixed cmake building bugs on Linux. Disable LAPACK by default.	2015-08-04 04:37:05 +08:00
Werner Saar	e7c969e164	added optimized dtrmm_kernel for haswell	2015-06-13 16:16:29 +02:00
Werner Saar	9bd962f655	modified haswell parameter dgemm_unroll_n	2015-06-13 10:28:27 +02:00
Werner Saar	24f58c8bb1	added optimized cscal and zscal kernels for steamroller	2015-05-18 12:40:07 +02:00
Werner Saar	95b1faf667	added optimized cscal and zscal kernels for steamroller and piledriver	2015-05-18 10:50:57 +02:00
Werner Saar	2d9e406050	added optimized cscal kernel for sandybridge	2015-05-18 08:46:06 +02:00
Werner Saar	59083e3ce1	added optimized cscal kernel for bulldozer	2015-05-18 07:33:52 +02:00
wernsaar	685be40339	Merge pull request #571 from wernsaar/develop added optimized cscal and zscal functions	2015-05-17 14:09:14 +02:00
Werner Saar	31c9e399e9	added optimized cscal kernel for haswell	2015-05-17 13:44:09 +02:00
Werner Saar	7de6bb9889	added optimized zscal kernel for bulldozer	2015-05-17 11:45:19 +02:00
Werner Saar	d63034303b	added optimized zscal kernel for haswell	2015-05-16 16:41:45 +02:00
Zhang Xianyi	51ff17d46e	Add AMD Excavator target.	2015-05-13 16:16:30 -05:00
Werner Saar	18e90ee2e3	bugfix: added static to functions	2015-05-13 13:31:26 +02:00
Werner Saar	e00cccc41e	added optimized dscal kernel for piledriver	2015-05-13 13:05:35 +02:00
Werner Saar	73f09bf64f	optimized dscal kernel for increment != 1	2015-05-13 12:14:39 +02:00
Werner Saar	02e772c7e4	added optimized dscal kernel for haswell	2015-05-12 17:19:58 +02:00
Werner Saar	7aee913991	added optimized dscal kernel for sandybridge	2015-05-12 16:27:43 +02:00
Werner Saar	e50a933037	added optimized dscal kernel for bulldozer	2015-05-12 12:28:44 +02:00
Werner Saar	133c11a156	updated dgemv_n kernel for nehalem	2015-04-30 14:38:06 +02:00
Werner Saar	30f52d53df	optimized dgemv_n kernel for haswell	2015-04-30 12:11:39 +02:00
Werner Saar	5e83d80725	optimized dger kernel for sandybridge	2015-04-28 16:58:11 +02:00
Werner Saar	b2e1797dc6	added optimized sger kernel for sandybridge	2015-04-28 15:33:38 +02:00
Werner Saar	e216f686cb	optimized saxpy and daxpy for sandybridge	2015-04-28 10:18:32 +02:00
Werner Saar	fc0e0391f3	bugfixes: replaced int with BLASLONG	2015-04-24 14:30:44 +02:00
Werner Saar	c22068c406	optimized sdot.c for increments != 1	2015-04-24 13:13:20 +02:00
Werner Saar	dee100d0e4	optimized saxpy.c for increments != 1	2015-04-24 11:52:59 +02:00
Werner Saar	0273966abb	optimized daxpy kernel for increments != 1	2015-04-24 11:39:17 +02:00
Werner Saar	3a67daa954	optimized ddot.c for increments != 1	2015-04-24 10:56:55 +02:00
Werner Saar	b4f2153dcd	added optimized ssymv kernels for sandybridge	2015-04-23 12:19:24 +02:00
Werner Saar	1c4b0eeae3	added optimized ssymv kernels for haswell	2015-04-23 10:23:13 +02:00
Werner Saar	1bec9abb9a	added optimized dsymv kernels for sandybridge	2015-04-22 12:09:43 +02:00
Werner Saar	3814bf60d3	added optimized dsymv kernels for haswell	2015-04-22 10:42:50 +02:00
Werner Saar	6d0db0151f	added optimized zaxpy-kernels	2015-04-16 11:19:37 +02:00
Zhang Xianyi	37b9033c90	Merge pull request #543 from jeromerobert/develop Fix a buffer overflow with MAX_STACK_ALLOC size in dgemv_t	2015-04-15 11:18:14 -05:00
Werner Saar	13889515b3	added optimized caxpy-kernel for sandybridge	2015-04-15 16:29:25 +02:00
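
Note on 0caf1434c9 and 1191db1a49 (integer overflow for large matrix sizes): the M>1290 threshold in that commit message can be checked with a little arithmetic. The following is only a sketch, not code from OpenBLAS; the helper names mnk_wide and mnk_fits_in_int are hypothetical, and it assumes a 32-bit int. It widens each factor to unsigned long long before multiplying, which is the type the follow-up commit mentions for keeping 64 bits on Windows.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper (not OpenBLAS code): compute M*N*K in 64 bits.
   Each factor is widened before the multiplication, so the product is
   not evaluated in int. Assumes the true product fits in 64 bits. */
static unsigned long long mnk_wide(int m, int n, int k)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    return (unsigned long long)m * (unsigned long long)n * (unsigned long long)k;
}

/* Hypothetical helper: returns 1 if M*N*K is still representable as int. */
static int mnk_fits_in_int(int m, int n, int k)
{
    return mnk_wide(m, n, k) <= (unsigned long long)INT_MAX;
}
```

With a 32-bit int, 1290*1290*1290 = 2146689000 still fits, while 1291*1291*1291 = 2151685171 exceeds INT_MAX (2147483647), which matches the threshold quoted in the commit message.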

766 Commits
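
Note on the inline assembly constraint fixes (for example 7ea5e07d1c, 723f396a20 and the gcc9 series for #2009): several messages above describe operands that the assembly modifies but that were declared as inputs only. The sketch below illustrates the general idea with GCC-style extended asm; it is not taken from any OpenBLAS kernel, the function name advance_by_8 is hypothetical, and the asm path is only used on x86_64 with a GCC-compatible compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for illustration; not taken from the OpenBLAS sources.
   Advances a pointer by 8 doubles (64 bytes on x86_64). */
static double *advance_by_8(double *x)
{
    assert(x != NULL);
#if defined(__GNUC__) && defined(__x86_64__)
    /* The instruction reads and writes the register holding x, so the
       operand is declared "+r" (input and output). Declaring it as an
       input only ("r") would allow the compiler to assume the register
       still holds the old pointer after the asm statement. */
    __asm__("leaq 64(%0), %0" : "+r"(x));
#else
    x += 8; /* portable fallback for other compilers and architectures */
#endif
    return x;
}
```

A caller that keeps using x after the asm statement only sees the updated value reliably because of the "+r" constraint, which is the kind of declaration those commits add.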