OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Arjan van de Ven	55b244ca0d	enable the SGEMM/SKX C based kernel In QA the final bug was found so now the skylakex sgemm C based kernel can be activated....	2018-10-12 09:30:35 +00:00
Arjan van de Ven	d4bad73834	Add a C+intrinsics version of the SGEMM/skylakex kernel for most sizes this is 1.2x to 1.4x faster than the current code	2018-10-10 01:49:22 +00:00
Arjan van de Ven	582c589727	dgemm/skylakex: replace discrete mul/add with fma very minor gains since it's not super hot code, but general principles	2018-10-06 23:13:26 +00:00
Arjan van de Ven	adbf6afa25	Add vector optimizations for ncopy as well for dgemm/skylakex	2018-10-06 21:18:12 +00:00
Arjan van de Ven	32bec8afbb	add a skylakex optimized dgemm beta function	2018-10-06 16:36:26 +00:00
Arjan van de Ven	20c5d668fe	dgemm/avx512 simplify and speed up the 4x4 kernel	2018-10-06 14:12:32 +00:00
Arjan van de Ven	6d43c51ccf	undo slow dgemm/skylake microoptimization the compare is more costly than the work	2018-10-06 14:00:37 +00:00
Arjan van de Ven	d74dc39b0f	Add optimized *copy versions for skylakex Add optimized n/t copy versions for skylakex; in the patch the tcopy is also rewritten using intrinsics; the ncopy file will be worked on in a future commit	2018-10-06 13:51:44 +00:00
Arjan van de Ven	66b43affbc	Add a 24x8 kernel to the skylakex dgemm implementation Minor gains for small matrices, but at 512x512 and above the gain gets more significant.	2018-10-05 13:22:21 +00:00
Arjan van de Ven	1938819c25	skylake dgemm: Add a 16x8 kernel The next step for the avx512 dgemm code is adding a 16x8 kernel. In the 8x8 kernel, each FMA has a matching load (the broadcast); in the 16x8 kernel we can reuse this load for 2 FMAs, which in turn reduces pressure on the load ports of the CPU and gives a nice performance boost (in the 25% range).	2018-10-05 13:11:35 +00:00
Martin Kroeker	b7496c3638	Function name needs to be CNAME, set from outside to allow suffixing for dynamic_arch	2018-10-04 19:14:59 +02:00
Arjan van de Ven	45fe8cb0c5	Create a AVX512 enabled version of DGEMM This patch adds dgemm_kernel_4x8_skylakex.c which is * dgemm_kernel_4x8_haswell.s converted to C + intrinsics * 8x8 support added * 8x8 kernel implemented using AVX512 Performance is a work in progress, but already shows a 10% - 20% increase for a wide range of matrix sizes.	2018-10-03 14:45:25 +00:00
Martin Kroeker	375dff54fc	Merge pull request #1733 from fenrus75/dsymv Add an AVX512 enabled DSYMV (L) function	2018-08-12 18:18:36 +02:00
Martin Kroeker	a5f165275a	Merge pull request #1732 from fenrus75/dgemv Add an AVX512 enabled DGEMV (n) function	2018-08-12 18:17:42 +02:00
Martin Kroeker	8c13aa495a	Merge pull request #1730 from fenrus75/fix-sdot Fix typo in sdot function	2018-08-12 18:17:01 +02:00
Arjan van de Ven	9bec34cb67	Add an AVX512 enabled DSYMV (L) function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-11 17:46:24 +00:00
Arjan van de Ven	87bebdbd8a	Add an AVX512 enabled DGEMV (n) function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-11 17:38:12 +00:00
Arjan van de Ven	36add7570a	Fix typo in sdot function it looks like my previous pull request was short the final commit; fix a typo in sdot	2018-08-11 17:16:45 +00:00
Arjan van de Ven	cacacc8007	Add an AVX512 enabled DSCAL function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-11 17:14:57 +00:00
Martin Kroeker	1a00ef3d27	Merge pull request #1725 from fenrus75/axpy Add a AVX512 enabled SAXPY/DAXPY functions	2018-08-11 11:01:20 +02:00
Arjan van de Ven	2e99873ff7	Add a AVX512 enabled SAXPY/DAXPY functions written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-10 02:58:32 +00:00
Arjan van de Ven	00abaa865b	Add an AVX512 enabled SDOT function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-10 02:33:43 +00:00
Arjan van de Ven	7932ff3ea9	Add an AVX512 enabled DDOT function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-09 03:55:52 +00:00
Martin Kroeker	6e54b0a027	Disable the 16x2 DTRMM kernel on SkylakeX as well	2018-06-30 17:31:06 +02:00
Martin Kroeker	f0a8dc2eec	Disable the AVX512 DGEMM kernel for now due to #1643	2018-06-30 11:34:48 +02:00
Craig Donner	c2545b0fd6	Fixed a few more unnecessary calls to num_cpu_avail. I don't have as many benchmarks for these as for gemm, but it should still make a difference for small matrices.	2018-06-11 10:17:16 +01:00
Arjan van de Ven	89372e0993	Use AVX512 also for DGEMM this required switching to the generic gemm_beta code (which is faster anyway on SKX) for both DGEMM and SGEMM Performance for the not-retuned version is in the 30% range	2018-06-03 22:17:27 +00:00
Arjan van de Ven	99c7bba8e4	Initial support for SkylakeX / AVX512 This patch adds the basic infrastructure for adding the SkylakeX (Intel Skylake server) target. The SkylakeX target will use the AVX512 (AVX512VL level) instruction set, which brings 2 basic things: 1) 512 bit wide SIMD (2x width of AVX2) 2) 32 SIMD registers (2x the number on AVX2) This initial patch only contains a trivial transformation of the Haswell SGEMM kernel to AVX512VL; more will follow later but this patch aims to get the infrastructure in place for this "later". Full performance tuning has not been done yet; with more registers and wider SIMD it's in theory possible to retune the kernels but even without that there's an interesting enough performance increase (30-40% range) with just this change.	2018-06-03 07:58:52 +00:00
Martin Kroeker	840e01061f	Merge pull request #1491 from martin-frbg/ddot_mt Add multithreading support for Haswell DDOT	2018-03-27 21:43:05 +02:00
Martin Kroeker	a55694dd5b	Declare dot_compute static to avoid conflicts in multiarch builds	2018-03-16 22:23:36 +01:00
Martin Kroeker	85a41e9cdb	Add multithreading support for Haswell DDOT copied from ashwinyes' implementation in dot_thunderx2t99.c	2018-03-16 16:58:47 +01:00
Martin Kroeker	81215711a2	Re-enable DAXPY microkernels for x86_64 as the inaccuracies seen in the original testcase for #1332 appear to be due to an artefact that amplifies the very small rounding differences between FMA and discrete multiply+add	2018-03-04 19:37:03 +01:00
Martin Kroeker	497f0c3d8a	Replace .align with .p2align in the Nehalem microkernels	2018-02-26 20:58:33 +01:00
Martin Kroeker	ea37db828e	Convert .align to .p2align for OSX compatibility	2018-02-26 20:48:03 +01:00
Martin Kroeker	7c1925acec	Use .p2align instead of .align for compatibility on Sandybridge as well	2018-02-24 19:43:15 +01:00
Martin Kroeker	2359c7c1a9	Use .p2align instead of .align for portability The OSX assembler apparently mishandles the argument to decimal .align, leading to a significant loss of performance as observed in #730, #901 and most recently #1470	2018-02-24 17:50:13 +01:00
Martin Kroeker	e388459a27	Merge pull request #1419 from brada4/develop Initialize uninitialized values for repeated calls	2018-01-31 23:48:34 +01:00
Andrew	4938faa822	core.IdenticalExpr clang501 checker	2018-01-19 23:15:58 +01:00
Martin Kroeker	42285d8e70	Merge pull request #1410 from brada4/develop Address warnings #1357	2018-01-06 20:02:46 +01:00
Andrew	4d0b005e5b	Eliminate remaining unused results in kernels (clang5 analyzer)	2018-01-01 20:54:39 +01:00
Martin Kroeker	b81656936f	Merge pull request #1409 from martin-frbg/issue1292-2 Tag %1 and %2 as both input and output operands	2017-12-31 20:18:48 +01:00
Martin Kroeker	b973990df2	Tag %1 and %2 as both input and output operands fix from #1292 extended to the other gemv microkernels	2017-12-31 18:03:36 +01:00
Martin Kroeker	1e31124eb0	Merge pull request #1406 from martin-frbg/issue1292 Tag %1 and %2 as both input and output	2017-12-30 14:52:03 +01:00
Martin Kroeker	723f396a20	Tag %1 and %2 as both input and output The inline assembly modifies its input operands, so mark them as output to avoid surprises with optimization. Fixes #1292	2017-12-29 23:56:41 +01:00
Martin Kroeker	43c0622e7b	Retire Piledriver/Steamroller/Excavator daxpy microkernels as well related to issue #1332	2017-12-13 18:40:39 +01:00
Martin Kroeker	0623636c98	Use Sandybridge daxpy kernel on Haswell and Zen for now The testcase from #1332 exposes a problem in daxpy_microk_haswell-2.c that is not seen with any of the other Intel x86_64 microkernels.	2017-12-10 19:24:31 +01:00
Andrew	281a2b952f	warning cleanup (#1380 ) * dead increments in driver/level2 * dead increments in kernel/generic * part dead increments in kernel/x86_64	2017-12-05 19:54:10 +01:00
Martin Kroeker	6c77b5f267	Merge pull request #1369 from martin-frbg/dsdot Add optimized dsdot to all other x86_64 kernels that use sdot.c	2017-11-28 18:15:31 +01:00
Martin Kroeker	c92cd6d162	Add trivially optimized dsdot based on sdot	2017-11-24 20:05:27 +01:00
Martin Kroeker	cae5d9a20b	Add trivially optimized dsdot based on sdot	2017-11-24 20:04:29 +01:00

444 Commits