OpenBLAS

Commit Graph

Author	SHA1	Message	Date
wjc404	1df9a2013d	new sgemm kernel for skylakex	2019-11-02 00:00:48 +08:00
wjc404	6ff013bae0	native support for icopy_4 90% MKL 1-thread performance.	2019-10-19 03:54:44 +08:00
wjc404	0d669e04bb	Update dgemm_kernel_8x8_skylakex.c	2019-10-18 15:00:17 +08:00
wjc404	17cdd9f9e1	some correction	2019-10-18 14:58:07 +08:00
wjc404	6bcb06fcb1	make further changes to icopy_8 easier	2019-10-18 10:47:31 +08:00
wjc404	b7315f8401	Add files via upload	2019-10-16 19:23:36 +08:00
wjc404	9b19e9e1b0	Update dgemm_kernel_8x8_skylakex.c	2019-10-16 10:14:51 +08:00
wjc404	6bd67ddbab	Update dgemm_kernel_8x8_skylakex.c	2019-10-16 03:20:08 +08:00
wjc404	844629af57	Add files via upload	2019-10-16 02:00:34 +08:00
Martin Kroeker	11c59acfb1	Keep both PGI/SUN and default code paths to avoid breaking Clang/Windows	2019-08-28 18:07:44 +02:00
Martin Kroeker	3a55dca2dc	Make x86_64 zdot compile with PGI and Sun C again broken by #2222 as CREAL,CIMAG do not expand to a valid lvalue with these compilers	2019-08-28 11:35:31 +02:00
Martin Kroeker	9ef96b32a6	Add multithreading support to the x86_64 zdot kernel (#2222 ) * Add multithreading support copied from the ThunderX2T99 kernel. For #2221	2019-08-15 22:09:12 +02:00
Martin Kroeker	dccff2e785	Merge pull request #2206 from martin-frbg/zen-dtrmm Replace vpermpd with vpermilpd in the Haswell DTRMM kernel	2019-08-09 07:55:20 +02:00
Martin Kroeker	5c3458a6e7	Merge pull request #2199 from martin-frbg/zen-dtrsm Replace most vpermpd calls in the Haswell DTRSM_RN kernel	2019-08-09 07:55:02 +02:00
Martin Kroeker	acf6002ab2	Replace most vpermpd calls in the Haswell DTRSM_RN kernel	2019-08-03 12:40:13 +02:00
Martin Kroeker	2dfb804cb9	Replace vpermpd with vpermilpd in the Haswell DTRMM kernel to improve performance on AMD Zen (#2180) applying wjc404's improvement of the DGEMM kernel from #2186	2019-07-28 23:17:28 +02:00
Martin Kroeker	4c153ec9da	Merge pull request #2196 from wjc404/develop Add vbroadcastsd kernel to dgemm_kernel_4x8_haswell.S	2019-07-28 23:11:40 +02:00
wjc404	7eecd8e39c	Add files via upload	2019-07-28 07:39:09 +08:00
Martin Kroeker	7b0b7c11d2	Merge pull request #2190 from martin-frbg/zdot-zen Replace vpermpd with vpermilpd in the Haswell/Zen zdot microkernel	2019-07-23 16:15:08 +02:00
Martin Kroeker	28e96458e5	Replace vpermpd with vpermilpd to improve performance on Zen/Zen2 (as demonstrated by wjc404 in #2180)	2019-07-22 08:28:16 +02:00
wjc404	95fb98f556	Update dgemm_kernel_4x8_haswell.S	2019-07-21 01:10:32 +08:00
wjc404	4801c6d36b	Update dgemm_kernel_4x8_haswell.S	2019-07-21 00:47:45 +08:00
wjc404	9440fa607d	Add files via upload	2019-07-20 22:08:22 +08:00
wjc404	94db259e5b	Add files via upload	2019-07-20 22:04:41 +08:00
wjc404	f49f8047ac	Add files via upload	2019-07-20 14:33:37 +08:00
wjc404	825777faab	Update dgemm_kernel_4x8_haswell.S	2019-07-19 23:58:24 +08:00
wjc404	9c89757562	Add files via upload	2019-07-19 23:47:58 +08:00
wjc404	9b04baeaee	Update dgemm_kernel_4x8_haswell.S	2019-07-17 23:50:03 +08:00
wjc404	8a074b3965	Update dgemm_kernel_4x8_haswell.S	2019-07-17 23:47:30 +08:00
wjc404	211ab03b14	Update dgemm_kernel_4x8_haswell.S	2019-07-17 22:39:15 +08:00
wjc404	1733f927e6	Update dgemm_kernel_4x8_haswell.S	2019-07-17 21:27:41 +08:00
wjc404	182b06d6ad	Update dgemm_kernel_4x8_haswell.S	2019-07-17 17:02:35 +08:00
wjc404	7a9050d681	Update dgemm_kernel_4x8_haswell.S	2019-07-17 00:55:06 +08:00
wjc404	0ba29fd262	Update dgemm_kernel_4x8_haswell.S for zen2 replaced a bunch of vpermpd instructions with vpermilpd and vperm2f128	2019-07-17 00:46:51 +08:00
Martin Kroeker	9ea30f3788	Replace ISMIN and ISAMIN kernels on all x86_64 platforms (#2125 ) * Mark iamax_sse.S as unsuitable for MIN due to issue #2116 * Use iamax.S rather than iamax_sse.S for ISMIN/ISAMIN on all x86_64 as workaround for #2116	2019-05-09 14:42:36 +02:00
Martin Kroeker	b1561ecc68	Disable DGEMMINCOPY as well for now #1955	2019-05-05 15:52:01 +02:00
Martin Kroeker	7ed8431527	Disable the SkyLakeX DGEMMITCOPY kernel as well as a stopgap measure for https://github.com/numpy/numpy/issues/13401 as mentioned in #1955	2019-05-04 22:54:41 +02:00
Martin Kroeker	c04a729081	Add ?sum definitions for generic kernel	2019-03-31 13:55:49 +02:00
Martin Kroeker	9d717cb5ee	Add x86_64 implementation of ?sum as trivial copy of ?asum with the fabs calls removed	2019-03-30 22:27:04 +01:00
Martin Kroeker	32c7063cb0	Merge pull request #2061 from martin-frbg/martin-frbg-patch-1 Disable the AVX512 DGEMM kernel (again)	2019-03-30 21:21:38 +01:00
Martin Kroeker	e608d4f7fe	Disable the AVX512 DGEMM kernel (again) Due to as yet unresolved errors seen in #1955 and #2029	2019-03-13 22:10:28 +01:00
Celelibi	b7f59da42d	Fix crash in sgemm SSE/nano kernel on x86_64 Fix bug #2047. Signed-off-by: Celelibi <celelibi@gmail.com>	2019-03-07 16:55:13 +01:00
Andrew	6eee1beac5	move fix to right place	2019-02-24 20:41:02 +02:00
Martin Kroeker	e12cdf58ef	Merge pull request #2024 from martin-frbg/gcc9fixes4 Fix inline assembly constraints in Bulldozer TRSM kernels	2019-02-17 11:49:15 +01:00
Martin Kroeker	1860c9456d	Merge pull request #2023 from martin-frbg/gcc9fixes3 Fix inline assembly constraints in various x86_64 GEMVN kernels	2019-02-17 11:48:57 +01:00
Martin Kroeker	f9bb76d29a	Fix inline assembly constraints in Bulldozer TRSM kernels rework indices to allow marking i,as and bs as both input and output (marked operand n1 as well for simplicity). For #2009	2019-02-16 20:06:48 +01:00
Martin Kroeker	efb9038f72	Fix inline assembly constraints	2019-02-16 18:46:17 +01:00
Martin Kroeker	e976557d29	Fix inline assembly constraints rework indices to allow marking argument lda as input and output.	2019-02-16 18:36:39 +01:00
Martin Kroeker	9d8be15789	Fix inline assembly constraints rework indices to allow marking argument lda4 as input and output. For #2009	2019-02-16 18:24:11 +01:00
Martin Kroeker	d752799a0f	Merge pull request #2021 from martin-frbg/gcc9fixes2 Fix wrong constraints in inline assembly of Haswell DTRSM kernel	2019-02-16 18:05:40 +01:00
Martin Kroeker	c26c0b77a7	Fix wrong constraints in inline assembly for #2009	2019-02-15 15:08:16 +01:00
Martin Kroeker	1c6da2d03c	Merge pull request #2019 from martin-frbg/gcc9fixes Fix unannounced modification of input operand 8 (lda4) in Haswell GEMVN microkernel	2019-02-15 15:02:54 +01:00
Martin Kroeker	4255a58cd2	Rename operands to put lda on the input/output constraint list	2019-02-15 10:10:04 +01:00
Martin Kroeker	46e415b140	Save and restore input argument 8 (lda4) Fixes miscompilation with gcc9 -ftree-vectorize (related to issue #2009)	2019-02-14 22:43:18 +01:00
Bart Oldeman	69a97ca7b9	dgemv_kernel_4x4(Haswell): add missing clobbers for xmm0,xmm1,xmm2,xmm3 This fixes a crash in dblat2 when OpenBLAS is compiled using -march=znver1 -ftree-vectorize -O2 See also: https://github.com/easybuilders/easybuild-easyconfigs/issues/7180	2019-02-14 16:27:58 +00:00
Martin Kroeker	ab1630f9fa	Fix declaration of arguments in inline assembly Argument 0 is modified so should be input and output	2019-02-12 16:14:02 +01:00
Martin Kroeker	b824fa70eb	Fix declaration of assembly arguments in SSYMV and DSYMV microkernels Arguments 0 and 1 are both input and output	2019-02-12 16:00:18 +01:00
Martin Kroeker	91481a3e4e	Fix declaration of input arguments in inline assembly Argument 0 is modified as it doubles as a counter	2019-02-12 15:51:43 +01:00
Martin Kroeker	dc6ac9eab0	Fix declaration of input arguments in the x86_64 s/dGEMV_T and s/dGEMV_N kernels Arguments 0 and 1 need to be tagged as both input and output	2019-02-12 15:33:48 +01:00
Martin Kroeker	32b0f1168e	Fix declaration of input arguments in the Sandybridge GER microkernels (#1967 ) * Tag arguments 0 and 1 as both input and output	2019-01-18 08:11:39 +01:00
Martin Kroeker	b495e54310	Fix declaration of input arguments in the x86_64 SCAL microkernels (#1966 ) * Tag arguments 0 and 1 as both input and output (see #1964)	2019-01-18 08:11:07 +01:00
Martin Kroeker	d5e6940253	Fix declaration of input arguments in the x86_64 microkernels for DOT and AXPY (#1965 ) * Tag operands 0 and 1 as both input and output For #1964 (basically a continuation of coding problems first seen in #1292)	2019-01-17 23:20:32 +01:00
Arjan van de Ven	795285c587	Fix thinko in skylake beta handling casting ints is cheaper but it has a rounding, not memory casting effect, resulting in invalid outcome	2018-12-24 18:49:50 +00:00
Arjan van de Ven	d321448a63	dgemm: use dgemm_ncopy_8_skylakex.c also for Haswell The dgemm_ncopy_8_skylakex.c code is not avx512 specific and gives a nice performance boost for medium sized matrices	2018-12-16 23:09:22 +00:00
Arjan van de Ven	c43331ad0a	dgemm: Use the skylakex beta function also for haswell it's more efficient for certain tall/skinny matrices	2018-12-16 23:09:17 +00:00
Arjan van de Ven	69d206440a	Make the skylakex/haswell sgemm code compile and run even with compilers without avx2 support	2018-12-16 00:19:41 +00:00
Arjan van de Ven	0586899a10	Use sgemm_ncopy_4_skylakex.c also for Haswell sgemm_ncopy_4_skylakex.c uses SSE transpose operations where the real perf win happens; this also works great for Haswell. This gives double digit percentage gains on small and skinny matrices	2018-12-15 13:49:19 +00:00
Arjan van de Ven	00dc09ad19	Use the skylake sgemm beta code also for haswell with a few small changes it's possible to use the skylake sgemm code also for haswell, this gives a modest gain (10% range) for smallish matrixes but does wonders for very skinny matrixes	2018-12-15 13:49:13 +00:00
Arjan van de Ven	cdc668d82b	Add a "sgemm direct" mode for small matrixes OpenBLAS has a fancy algorithm for copying the input data while laying it out in a more CPU friendly memory layout. This is great for large matrixes; the cost of the copy is easily amortized by the gains from the better memory layout. But for small matrixes (on CPUs that can do efficient unaligned loads) this copy can be a net loss. This patch adds (for SKYLAKEX initially) a "sgemm direct" mode, that bypasses the whole copy machinery for ALPHA=1/BETA=0/... standard arguments, for small matrixes only. What is small? For the non-threaded case this has been measured to be in the MNK = 28 * 512 * 512 range, while in the threaded case it's less, around MNK = 1 * 512 * 512	2018-12-13 13:47:31 +00:00
Martin Kroeker	701ea88347	Use p2align instead of align for OSX compatibility fixes #1902	2018-12-03 13:06:43 +01:00
Andrew	19c4bdd8b3	Add return value so that freebsd system clang does not err out	2018-11-25 21:35:01 +01:00
Arjan van de Ven	dcc5d6291e	skylakex: Make the sgemm/dgemm beta code robust for a N=0 or M=0 case in the threading code there are cases where N or M can become 0, and the optimized beta code did not handle this well, leading to a crash during the audit for the crash a few edge conditions on the if statements were found and fixed as well	2018-11-01 01:42:09 +00:00
Arjan van de Ven	55b244ca0d	enable the SGEMM/SKX C based kernel In QA the final bug was found so now the skylakex sgemm C based kernel can be activated....	2018-10-12 09:30:35 +00:00
Arjan van de Ven	d4bad73834	Add a C+intrinsics version of the SGEMM/skylakex kernel for most sizes this is 1.2x to 1.4x faster than the current code	2018-10-10 01:49:22 +00:00
Arjan van de Ven	582c589727	dgemm/skylakex: replace discrete mul/add with fma very minor gains since it's not super hot code, but general principles	2018-10-06 23:13:26 +00:00
Arjan van de Ven	adbf6afa25	Add vector optimizations for ncopy as well for dgemm/skylakex	2018-10-06 21:18:12 +00:00
Arjan van de Ven	32bec8afbb	add a skylakex optimized dgemm beta function	2018-10-06 16:36:26 +00:00
Arjan van de Ven	20c5d668fe	dgemm/avx512 simplify and speed up the 4x4 kernel	2018-10-06 14:12:32 +00:00
Arjan van de Ven	6d43c51ccf	undo slow dgemm/skylake microoptimization the compare is more costly than the work	2018-10-06 14:00:37 +00:00
Arjan van de Ven	d74dc39b0f	Add optimized *copy versions for skylakex Add optimized n/t copy versions for skylakex; in the patch the tcopy is also rewritten using intrinsics; the ncopy file will be worked on in a future commit	2018-10-06 13:51:44 +00:00
Arjan van de Ven	66b43affbc	Add a 24x8 kernel to the skylakex dgemm implementation Minor gains for small matrixes, but at 512x512 and above the gain gets more significant.	2018-10-05 13:22:21 +00:00
Arjan van de Ven	1938819c25	skylake dgemm: Add a 16x8 kernel The next step for the avx512 dgemm code is adding a 16x8 kernel. In the 8x8 kernel, each FMA has a matching load (the broadcast); in the 16x8 kernel we can reuse this load for 2 FMAs, which in turn reduces pressure on the load ports of the CPU and gives a nice performance boost (in the 25% range).	2018-10-05 13:11:35 +00:00
Martin Kroeker	b7496c3638	Function name needs to be CNAME, set from outside to allow suffixing for dynamic_arch	2018-10-04 19:14:59 +02:00
Arjan van de Ven	45fe8cb0c5	Create a AVX512 enabled version of DGEMM This patch adds dgemm_kernel_4x8_skylakex.c which is * dgemm_kernel_4x8_haswell.s converted to C + intrinsics * 8x8 support added * 8x8 kernel implemented using AVX512 Performance is a work in progress, but already shows a 10% - 20% increase for a wide range of matrix sizes.	2018-10-03 14:45:25 +00:00
Martin Kroeker	375dff54fc	Merge pull request #1733 from fenrus75/dsymv Add an AVX512 enabled DSYMV (L) function	2018-08-12 18:18:36 +02:00
Martin Kroeker	a5f165275a	Merge pull request #1732 from fenrus75/dgemv Add an AVX512 enabled DGEMV (n) function	2018-08-12 18:17:42 +02:00
Martin Kroeker	8c13aa495a	Merge pull request #1730 from fenrus75/fix-sdot Fix typo in sdot function	2018-08-12 18:17:01 +02:00
Arjan van de Ven	9bec34cb67	Add an AVX512 enabled DSYMV (L) function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-11 17:46:24 +00:00
Arjan van de Ven	87bebdbd8a	Add an AVX512 enabled DGEMV (n) function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-11 17:38:12 +00:00
Arjan van de Ven	36add7570a	Fix typo in sdot function it looks like my previous pull request was short the final commit; fix a typo in sdot	2018-08-11 17:16:45 +00:00
Arjan van de Ven	cacacc8007	Add an AVX512 enabled DSCAL function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-11 17:14:57 +00:00
Martin Kroeker	1a00ef3d27	Merge pull request #1725 from fenrus75/axpy Add a AVX512 enabled SAXPY/DAXPY functions	2018-08-11 11:01:20 +02:00
Arjan van de Ven	2e99873ff7	Add a AVX512 enabled SAXPY/DAXPY functions written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-10 02:58:32 +00:00
Arjan van de Ven	00abaa865b	Add an AVX512 enabled SDOT function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-10 02:33:43 +00:00
Arjan van de Ven	7932ff3ea9	Add an AVX512 enabled DDOT function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-09 03:55:52 +00:00
Martin Kroeker	6e54b0a027	Disable the 16x2 DTRMM kernel on SkylakeX as well	2018-06-30 17:31:06 +02:00
Martin Kroeker	f0a8dc2eec	Disable the AVX512 DGEMM kernel for now due to #1643	2018-06-30 11:34:48 +02:00
Craig Donner	c2545b0fd6	Fixed a few more unnecessary calls to num_cpu_avail. I don't have as many benchmarks for these as for gemm, but it should still make a difference for small matrices.	2018-06-11 10:17:16 +01:00
Arjan van de Ven	89372e0993	Use AVX512 also for DGEMM this required switching to the generic gemm_beta code (which is faster anyway on SKX) for both DGEMM and SGEMM Performance for the not-retuned version is in the 30% range	2018-06-03 22:17:27 +00:00
Arjan van de Ven	99c7bba8e4	Initial support for SkylakeX / AVX512 This patch adds the basic infrastructure for adding the SkylakeX (Intel Skylake server) target. The SkylakeX target will use the AVX512 (AVX512VL level) instruction set, which brings 2 basic things: 1) 512 bit wide SIMD (2x width of AVX2) 2) 32 SIMD registers (2x the number on AVX2) This initial patch only contains a trivial transformation of the Haswell SGEMM kernel to AVX512VL; more will follow later but this patch aims to get the infrastructure in place for this "later". Full performance tuning has not been done yet; with more registers and wider SIMD it's in theory possible to retune the kernels but even without that there's an interesting enough performance increase (30-40% range) with just this change.	2018-06-03 07:58:52 +00:00
Martin Kroeker	840e01061f	Merge pull request #1491 from martin-frbg/ddot_mt Add multithreading support for Haswell DDOT	2018-03-27 21:43:05 +02:00
Martin Kroeker	a55694dd5b	Declare dot_compute static to avoid conflicts in multiarch builds	2018-03-16 22:23:36 +01:00
Martin Kroeker	85a41e9cdb	Add multithreading support for Haswell DDOT copied from ashwinyes' implementation in dot_thunderx2t99.c	2018-03-16 16:58:47 +01:00
Martin Kroeker	81215711a2	Re-enable DAXPY microkernels for x86_64 as the inaccuracies seen in the original testcase for #1332 appear to be due to an artefact that amplifies the very small rounding differences between FMA and discrete multiply+add	2018-03-04 19:37:03 +01:00
Martin Kroeker	497f0c3d8a	Replace .align with .p2align in the Nehalem microkernels	2018-02-26 20:58:33 +01:00
Martin Kroeker	ea37db828e	Convert .align to .p2align for OSX compatibility	2018-02-26 20:48:03 +01:00
Martin Kroeker	7c1925acec	Use .p2align instead of .align for compatibility on Sandybridge as well	2018-02-24 19:43:15 +01:00
Martin Kroeker	2359c7c1a9	Use .p2align instead of .align for portability The OSX assembler apparently mishandles the argument to decimal .align, leading to a significant loss of performance as observed in #730, #901 and most recently #1470	2018-02-24 17:50:13 +01:00
Martin Kroeker	e388459a27	Merge pull request #1419 from brada4/develop Initialize uninitialized values for repeated calls	2018-01-31 23:48:34 +01:00
Andrew	4938faa822	core.IdenticalExpr clang501 checker	2018-01-19 23:15:58 +01:00
Martin Kroeker	42285d8e70	Merge pull request #1410 from brada4/develop Address warnings #1357	2018-01-06 20:02:46 +01:00
Andrew	4d0b005e5b	Eliminate remaining unused results in kernels (clang5 analyzer)	2018-01-01 20:54:39 +01:00
Martin Kroeker	b81656936f	Merge pull request #1409 from martin-frbg/issue1292-2 Tag %1 and %2 as both input and output operands	2017-12-31 20:18:48 +01:00
Martin Kroeker	b973990df2	Tag %1 and %2 as both input and output operands fix from #1292 extended to the other gemv microkernels	2017-12-31 18:03:36 +01:00
Martin Kroeker	1e31124eb0	Merge pull request #1406 from martin-frbg/issue1292 Tag %1 and %2 as both input and output	2017-12-30 14:52:03 +01:00
Martin Kroeker	723f396a20	Tag %1 and %2 as both input and output The inline assembly modifies its input operands, so mark them as output to avoid surprises with optimization. Fixes #1292	2017-12-29 23:56:41 +01:00
Martin Kroeker	43c0622e7b	Retire Piledriver/Steamroller/Excavator daxpy microkernels as well related to issue #1332	2017-12-13 18:40:39 +01:00
Martin Kroeker	0623636c98	Use Sandybridge daxpy kernel on Haswell and Zen for now The testcase from #1332 exposes a problem in daxpy_microk_haswell-2.c that is not seen with any of the other Intel x86_64 microkernels.	2017-12-10 19:24:31 +01:00
Andrew	281a2b952f	warning cleanup (#1380 ) * dead increments in driver/level2 * dead increments in kernel/generic * part dead increments in kernel/x86_64	2017-12-05 19:54:10 +01:00
Martin Kroeker	6c77b5f267	Merge pull request #1369 from martin-frbg/dsdot Add optimized dsdot to all other x86_64 kernels that use sdot.c	2017-11-28 18:15:31 +01:00
Martin Kroeker	c92cd6d162	Add trivially optimized dsdot based on sdot	2017-11-24 20:05:27 +01:00
Martin Kroeker	cae5d9a20b	Add trivially optimized dsdot based on sdot	2017-11-24 20:04:29 +01:00
Martin Kroeker	3d891c3106	Add trivially optimized dsdot based on sdot	2017-11-24 20:03:40 +01:00
Martin Kroeker	4fbdcfa823	Add trivially optimized dsdot based on sdot	2017-11-24 20:02:28 +01:00
Martin Kroeker	1bb6a96ebc	Add trivially optimized dsdot based on sdot	2017-11-24 20:01:42 +01:00
Martin Kroeker	6bd163f37a	Add trivially optimized dsdot based on sdot	2017-11-24 20:00:23 +01:00
Martin Kroeker	f0333333d1	Add trivially optimized dsdot based on sdot	2017-11-24 19:59:28 +01:00
Andrew	e89b979b2c	fix spurious compiler warning fix (no code change)	2017-11-24 18:39:04 +01:00
Andrew	7e9b29b9b8	fix spurious compiler warning (no code change)	2017-11-24 18:36:37 +01:00
Martin Kroeker	6157d0902a	Merge pull request #1358 from martin-frbg/unused_vars Clean up spurious unused variables in the kernels	2017-11-15 11:31:43 +01:00
Martin Kroeker	3fea849bbf	Remove unused variables from Haswell dtrmm and Bulldozer dtrsm	2017-11-14 23:35:10 +01:00
Martin Kroeker	8f177621bc	Remove unused variables at0...at3 from ?symv_U	2017-11-14 23:32:25 +01:00
Martin Kroeker	5f402b7759	Remove unused (loop?) variable j from the gemv_n_4 implementations	2017-11-14 23:29:42 +01:00
Martin Kroeker	a07807caac	Eliminate loop code when called as/from dsdot	2017-10-25 16:45:41 +02:00
Martin Kroeker	5e3e91d0fc	Split the microkernel workload into chunks of 32 floats for dsdot mode to limit loss of precision	2017-10-22 18:18:51 +02:00
Martin Kroeker	28c3fa8950	Add dsdot	2017-10-16 23:29:03 +02:00
Martin Kroeker	8ac87c1cb6	Implement DSDOT with unchanged sdot microkernels	2017-10-16 23:27:51 +02:00
Isuru Fernando	505b218829	Merge remote-tracking branch 'upstream/develop' into dyn	2017-08-06 19:07:00 +05:30
Isuru Fernando	1d1854032b	Add missing EXCAVATOR	2017-08-02 19:03:04 +05:30
Isuru Fernando	2c51a990ac	Fix extra whitespaces. CMake parser macro fails with it TODO: Fix the parser macro to strip trailing whitespaces	2017-08-02 18:26:57 +05:30
Isuru Fernando	ca17b4b75c	Fix complex support for MSVC headers	2017-07-28 11:50:29 +05:30
Denis Steckelmacher	c9ff735da6	Add ZEN support (tested for auto-detected static backend)	2017-03-19 15:32:50 +01:00
Martin Kroeker	a6efabf155	Replace gnu _real_ , _imag_ extensions in initializers	2017-03-13 00:38:37 +01:00
Martin Kroeker	dc34a0da96	Merge pull request #915 from mdong/small_fix_for_icc remove input from clobbered list	2017-02-23 20:00:22 +01:00
Martin Kroeker	4998e19869	Change file comments to work around clang 3.9 assembler bug	2016-10-13 16:51:08 +02:00
Martin Kroeker	16446d1d23	Remove explicit include of complex.h	2016-09-29 23:45:56 +02:00
mdong	098d8ec5d6	remove input from clobbered list	2016-06-24 16:37:58 -04:00
Werner Saar	298b13bba4	updated some kernel files for EXCAVATOR	2016-04-25 10:36:23 +02:00
Zhang Xianyi	f24d5307cf	Refs #834 . Fix zgemv config bug on Steamroller.	2016-04-12 22:26:11 +08:00
Zhang Xianyi	d4380c1fe4	Refs xianyi/OpenBLAS-CI#10 , Fix sdot for scipy test_iterative.test_convergence test failure on AMD bulldozer and piledriver.	2016-04-07 01:44:18 +08:00
Werner Saar	faa5e2e5e3	FIX: forgot to add the files cgemv_n_4.c and cgemv_t_4.c	2016-03-10 11:10:38 +01:00
Werner Saar	fdf291be30	Added optimized cgemv_n and cgemv_t kernels for bulldozer, piledriver and steamroller	2016-03-10 09:42:07 +01:00
Werner Saar	c99cc41cbd	Added optimized zgemv_n kernel for bulldozer, piledriver and steamroller	2016-03-09 14:02:03 +01:00
Werner Saar	acdff55a6a	Bugfix for ztrmv	2016-03-07 09:39:34 +01:00
Zhang Xianyi	7d6b68eb4a	Refs #786 . Revert to default assembly kernel.	2016-03-07 11:34:58 +08:00
Zhang Xianyi	8f758eeff9	Refs #786 . avoid old assembly c/zgemv kernels.	2016-03-05 08:32:03 +08:00
Zhang Xianyi	efa4f5c936	Refs #695 #783 . Replace default x86_64 cgemv_t asm kernel by C kernel.	2016-03-01 11:18:56 +08:00
Zhang Xianyi	6e7be06e07	Refs JuliaLang/julia#5728 . Fix gemv performance bug on Haswell Mac OSX. On Mac OS X, it should use .align 4 (equal to .align 16 on Linux). I didn't get the performance benefit from .align. Thus, I deleted it.	2016-02-19 17:56:07 -05:00
Zhang Xianyi	962376664d	Refs #768 . Swap the result of zdot x87 fp kernel.	2016-02-02 09:15:02 +08:00
Zhang Xianyi	c44ff4d648	Refs #714 . avoid compiling warnings.	2016-01-28 04:38:07 +08:00
Werner Saar	c8f2c5d636	added optimized trsm_kernels	2016-01-05 13:05:05 +01:00
Zhang Xianyi	69363622a8	Fix DYNAMIC_ARCH=1 bug.	2015-10-27 05:10:40 +08:00
Zhang Xianyi	f874465bb8	Use cmake to build OpenBLAS GENERIC Target on MSVC x86 64-bit. Disable CBLAS and LAPACK.	2015-08-10 14:10:44 -05:00
Zhang Xianyi	ab0a0a75fc	Merge branch 'develop' into cmake	2015-08-03 23:59:01 -05:00
Zhang Xianyi	1cf2b10224	Use pure C generic target on x86 and x86_64. make TARGET=GENERIC ?gemm3m is unimplemented on generic target.	2015-08-03 23:55:56 -05:00
Zhang Xianyi	7ac7e147d4	Fixed cmake building bugs on Linux. Disable LAPACK by default.	2015-08-04 04:37:05 +08:00
Werner Saar	e7c969e164	added optimized dtrmm_kernel for haswell	2015-06-13 16:16:29 +02:00
Werner Saar	9bd962f655	modified haswell parameter dgemm_unroll_n	2015-06-13 10:28:27 +02:00
Werner Saar	24f58c8bb1	added optimized cscal and zscal kernels for steamroller	2015-05-18 12:40:07 +02:00
Werner Saar	95b1faf667	added optimized cscal and zscal kernels for steamroller and piledriver	2015-05-18 10:50:57 +02:00
Werner Saar	2d9e406050	added optimized cscal kernel for sandybridge	2015-05-18 08:46:06 +02:00
Werner Saar	59083e3ce1	added optimized cscal kernel for bulldozer	2015-05-18 07:33:52 +02:00
wernsaar	685be40339	Merge pull request #571 from wernsaar/develop added optimized cscal and zscal functions	2015-05-17 14:09:14 +02:00
Werner Saar	31c9e399e9	added optimized cscal kernel for haswell	2015-05-17 13:44:09 +02:00
Werner Saar	7de6bb9889	added optimized zscal kernel for bulldozer	2015-05-17 11:45:19 +02:00
Werner Saar	d63034303b	added optimized zscal kernel for haswell	2015-05-16 16:41:45 +02:00
Zhang Xianyi	51ff17d46e	Add AMD Excavator target.	2015-05-13 16:16:30 -05:00
Werner Saar	18e90ee2e3	bugfix: added static to functions	2015-05-13 13:31:26 +02:00
Werner Saar	e00cccc41e	added optimized dscal kernel for piledriver	2015-05-13 13:05:35 +02:00
Werner Saar	73f09bf64f	optimized dscal kernel for increment != 1	2015-05-13 12:14:39 +02:00
Werner Saar	02e772c7e4	added optimized dscal kernel for haswell	2015-05-12 17:19:58 +02:00
Werner Saar	7aee913991	added optimized dscal kernel for sandybridge	2015-05-12 16:27:43 +02:00
Werner Saar	e50a933037	added optimized dscal kernel for bulldozer	2015-05-12 12:28:44 +02:00
Werner Saar	133c11a156	updated dgemv_n kernel for nehalem	2015-04-30 14:38:06 +02:00
Werner Saar	30f52d53df	optimized dgemv_n kernel for haswell	2015-04-30 12:11:39 +02:00
Werner Saar	5e83d80725	optimized dger kernel for sandybridge	2015-04-28 16:58:11 +02:00
Werner Saar	b2e1797dc6	added optimized sger kernel for sandybridge	2015-04-28 15:33:38 +02:00
Werner Saar	e216f686cb	optimized saxpy and daxpy for sandybridge	2015-04-28 10:18:32 +02:00
Werner Saar	fc0e0391f3	bugfixes: replaced int with BLASLONG	2015-04-24 14:30:44 +02:00
Werner Saar	c22068c406	optimized sdot.c for increments != 1	2015-04-24 13:13:20 +02:00
Werner Saar	dee100d0e4	optimized saxpy.c for increments != 1	2015-04-24 11:52:59 +02:00
Werner Saar	0273966abb	optimized daxpy kernel for increments != 1	2015-04-24 11:39:17 +02:00
Werner Saar	3a67daa954	optimized ddot.c for increments != 1	2015-04-24 10:56:55 +02:00
Werner Saar	b4f2153dcd	added optimized ssymv kernels for sandybridge	2015-04-23 12:19:24 +02:00
Werner Saar	1c4b0eeae3	added optimized ssymv kernels for haswell	2015-04-23 10:23:13 +02:00
Werner Saar	1bec9abb9a	added optimized dsymv kernels for sandybridge	2015-04-22 12:09:43 +02:00
Werner Saar	3814bf60d3	added optimized dsymv kernels for haswell	2015-04-22 10:42:50 +02:00
Werner Saar	6d0db0151f	added optimized zaxpy-kernels	2015-04-16 11:19:37 +02:00
Zhang Xianyi	37b9033c90	Merge pull request #543 from jeromerobert/develop Fix a buffer overflow with MAX_STACK_ALLOC size in dgemv_t	2015-04-15 11:18:14 -05:00
Werner Saar	13889515b3	added optimized caxpy-kernel for sandybridge	2015-04-15 16:29:25 +02:00
Werner Saar	248c9340c3	added optimized caxpy-kernel for haswell	2015-04-15 15:16:31 +02:00
Werner Saar	e9f33b4ca7	added optimized caxpy-kernel for steamroller	2015-04-15 13:49:23 +02:00
Werner Saar	f5d847122a	updated caxpy_microk_bulldozer-2.c and caxpy.c	2015-04-15 11:59:38 +02:00
Jerome Robert	a4c96eca67	Fix a buffer overflow with MAX_STACK_ALLOC size in dgemv_t Refs #478, #482, `9798481`, `fd9fd42`	2015-04-15 11:46:48 +02:00
Werner Saar	baa0363ea2	add optimized ddot-kernel for piledriver	2015-04-14 15:09:13 +02:00
Werner Saar	34ba66606a	add optimized daxpy-kernel for piledriver	2015-04-14 14:23:29 +02:00
Werner Saar	f615dc7603	added optimized saxpy kernel for steamroller	2015-04-14 09:09:39 +02:00
Werner Saar	331c417637	optimized saxpy for piledriver	2015-04-14 08:34:11 +02:00
Werner Saar	d7a17ad85d	optimized sdot-kernel for piledriver	2015-04-13 13:19:21 +02:00
Werner Saar	d35f6c63c2	add optimized daxpy-kernel for steamroller	2015-04-13 12:22:43 +02:00
Werner Saar	166d76e864	added optimized sdot-kernel for steamroller	2015-04-11 08:48:18 +02:00
Werner Saar	f9f127d838	added optimized ddot kernel for steamroller	2015-04-10 16:18:03 +02:00
wernsaar	62231ab337	Merge pull request #538 from wernsaar/develop Added optimized cdot- and zdot-kernels	2015-04-10 16:03:37 +02:00
Werner Saar	3119def9a7	updated cdot and zdot	2015-04-10 11:10:31 +02:00
Werner Saar	33b332372a	add optimized cdot- and zdot-kernel for sandybridge	2015-04-10 09:37:26 +02:00
Werner Saar	fd838c75bc	add optimized cdot- and zdot-kernel for haswell	2015-04-09 15:13:52 +02:00
Werner Saar	b57a60dac8	updated cdot and zdot for piledriver	2015-04-09 10:33:46 +02:00
Werner Saar	5c51163972	added optimized cdot- and zdot-kernel for steamroller	2015-04-09 09:45:23 +02:00
Werner Saar	9299d8cfd6	added optimized cdot- and zdot-kernels for bulldozer	2015-04-08 16:29:55 +02:00
Zhang Xianyi	0a3d3b945d	Refs #535 . Fix the wrong vector instruction in sgemm sandy bridge kernel.	2015-04-08 03:55:49 +08:00
Werner Saar	60c6dec6e6	updated some lines for bulldozer	2015-04-06 18:47:16 +02:00
Werner Saar	47898cca35	added optimized saxpy- and daxpy-kernel for sandybridge	2015-04-06 16:05:16 +02:00
Werner Saar	53bb924287	added optimized saxpy- and daxpy-kernel for haswell	2015-04-06 12:33:16 +02:00
Werner Saar	a901b065d3	added optimized ddot-kernel for sandybridge	2015-04-05 20:19:38 +02:00
Werner Saar	3937e2a0a0	add optimized sdot-kernel for sandybridge	2015-04-05 19:47:05 +02:00
Werner Saar	9707d608d5	removed double definition line	2015-04-05 18:35:34 +02:00
Werner Saar	701b9d7556	added optimized sdot- and ddot-kernel for HASWELL	2015-04-05 17:57:53 +02:00
Zhang Xianyi	41aad0407f	Merge pull request #482 from jeromerobert/develop Allow to do gemv and ger buffer allocation on the stack	2015-01-02 02:26:17 +08:00
Werner Saar	ddf983d643	added optimizations for steamroller	2014-12-30 20:14:45 +08:00
Werner Saar	4319769b79	added target processor STEAMROLLER	2014-12-28 20:16:46 +08:00
Jerome Robert	e9d9a8eae3	Allow to do gemv and ger buffer allocation on the stack ger and gemv call blas_memory_alloc/free which in their turn call blas_lock. blas_lock create thread contention when matrices are small and the number of thread is high enough. We avoid call blas_memory_alloc by replacing it with stack allocation. This can be enabled with: make -DMAX_STACK_ALLOC=2048 The given size (in byte) must be high enough to avoid thread contention and small enough to avoid stack overflow. Fix #478	2014-12-27 14:33:12 +01:00
Werner Saar	587e16fba3	Ref #458: Backport, sandybridge uses nehalem zgemm kernel	2014-12-22 17:01:18 +01:00
Werner Saar	6261342de3	small optimization on dgemm_kernel for N=1	2014-12-18 20:35:51 +01:00
Werner Saar	bc5fff7085	changed inline assembler labels to short form	2014-12-07 12:38:54 +01:00
Zhang Xianyi	0cf29ba6d2	Fixed a bug of sgemm sandy bridge kernel. Reported by Julia project. JuliaLang/julia#9084	2014-12-03 17:38:41 +08:00
Zhang Xianyi	2fb02626da	Update organization info.	2014-11-25 15:28:58 +08:00
Zhang Xianyi	a85c2785ae	Refs #467. Added generic kernel file for x86_64.	2014-11-24 15:34:48 +08:00
wernsaar	b7c9566eea	removed obsolete gemv kernel files	2014-09-14 11:00:53 +02:00
wernsaar	6df1b0be81	optimized zgemv_n_microk_sandy-4.c	2014-09-14 10:21:22 +02:00
wernsaar	2ac1e076c1	added optimized zgemv_n kernel for sandybridge	2014-09-14 09:02:05 +02:00
wernsaar	9908b6031c	bugfix in KERNEL.PILEDRIVER	2014-09-13 16:26:53 +02:00
wernsaar	8f100a14f2	optimized cgemv_t kernel for haswell	2014-09-13 16:13:27 +02:00
wernsaar	53b5726b04	added optimized cgemv_t kernel for haswell	2014-09-13 15:14:12 +02:00
wernsaar	1a352b24e6	updated KERNEL.HASWELL	2014-09-13 12:23:27 +02:00
wernsaar	5194818d4b	updated zgemv_t_4.c	2014-09-13 09:48:34 +02:00
wernsaar	8a39cdb1c1	added optimized zgemv_t kernel for haswell	2014-09-13 09:47:07 +02:00
wernsaar	0a1390f2d8	enabled optimized zgemv_t kernel for bulldozer	2014-09-12 17:43:47 +02:00
wernsaar	a8b0812feb	optimized zgemv_t for bulldozer	2014-09-12 17:42:25 +02:00
wernsaar	a0fb68ab42	added optimized zgemv_t kernel for bulldozer	2014-09-12 17:04:22 +02:00
wernsaar	44c11165d5	bugfix in cgemv_t_4.c	2014-09-12 14:12:24 +02:00
wernsaar	564be4eb72	added optimized cgemv_t kernel	2014-09-12 13:38:01 +02:00
wernsaar	107c3ea7d5	added optimized zgemv_t routine	2014-09-12 12:35:20 +02:00
wernsaar	bb8d698335	optimized zgemv_n_microk_haswell-4.c for small size	2014-09-11 13:44:55 +02:00
wernsaar	e0192a6914	bugfix in zgemv_n_4.c	2014-09-11 13:18:00 +02:00
wernsaar	bced4594bb	added optimized zgemv_n kernel	2014-09-11 12:34:57 +02:00
wernsaar	cafba99b6b	bugfix in cgemv_n_microk_haswell-4.c	2014-09-11 11:12:44 +02:00
wernsaar	ac8f232b2a	more optimizations	2014-09-11 10:25:48 +02:00
wernsaar	f98e1244c4	optimized cgemv_n_4.c	2014-09-10 19:26:14 +02:00
wernsaar	be95700b30	added optimized cgemv_kernel for haswell	2014-09-10 14:11:24 +02:00
wernsaar	4aa534ae93	added cgemv_n kernel, optimized for small sizes	2014-09-10 13:45:13 +02:00
wernsaar	baa46e4fba	added and tested optimized dgemv_n kernel for haswell	2014-09-09 16:17:45 +02:00
wernsaar	faab7a181d	added optimized dgemv_n kernel for haswell	2014-09-09 15:32:32 +02:00
wernsaar	8109d8232c	optimized dgemv_t kernel for haswell	2014-09-09 14:38:08 +02:00
wernsaar	debc6d1a05	bugfix in KERNEL.HASWELL	2014-09-09 14:04:44 +02:00
wernsaar	e73a0113ec	added optimized gemv kernels	2014-09-09 13:54:55 +02:00
wernsaar	44f2bf9bae	added optimized dgemv_t kernel for haswell	2014-09-09 13:34:22 +02:00
wernsaar	cd34e9701b	removed obsolete files	2014-09-08 19:15:31 +02:00
wernsaar	658939faaa	optimized dgemv_n kernel for small sizes	2014-09-08 15:22:35 +02:00
wernsaar	c4d9d4e5f8	added haswell optimized kernel	2014-09-08 12:25:16 +02:00
wernsaar	7c0a94ff47	bugfix in sgemv_n_microk_haswell-4.c	2014-09-08 10:54:33 +02:00
wernsaar	cbbc80aad3	added optimized sgemv_t kernel for haswell	2014-09-08 10:13:39 +02:00
wernsaar	2be5c7a640	bugfix for windows	2014-09-07 21:48:42 +02:00
wernsaar	80f7786875	enabled optimized sgemv kernels for piledriver	2014-09-07 21:13:57 +02:00
wernsaar	553e275407	optimized sgemv_n kernel for sandybridge	2014-09-07 20:53:30 +02:00
wernsaar	7b3932b3f3	optimized sgemv_n kernel for nehalem	2014-09-07 19:20:08 +02:00
wernsaar	75207b1148	optimized sgemv_n for very small size of m	2014-09-07 18:23:48 +02:00
wernsaar	274828fa50	optimizations for very small sizes	2014-09-07 13:45:03 +02:00
wernsaar	5ae1731fe6	better optimizations for sgemv_t kernel	2014-09-06 21:28:57 +02:00
wernsaar	c8eaf3ae2d	optimized sgemv_t_4 kernel for very small sizes	2014-09-06 19:41:57 +02:00
wernsaar	3a7ab47ee9	optimized sgemv_t	2014-09-06 18:34:25 +02:00
wernsaar	cf5544b417	optimization for small size	2014-09-06 13:17:56 +02:00
wernsaar	d143f84dd2	added optimized sgemv_n kernel for haswell	2014-09-06 12:08:48 +02:00
wernsaar	a64fe9bcc9	added optimized sgemv_n kernel for sandybridge	2014-09-06 08:41:53 +02:00
wernsaar	6df7a88930	optimized sgemv_t for sandybridge	2014-09-05 10:22:50 +02:00
wernsaar	53de943690	bugfix for sgemv_n_4.c	2014-09-04 18:55:52 +02:00
wernsaar	7f910010a0	optimized sgemv_n kernel for small sizes	2014-09-04 13:09:27 +02:00
wernsaar	3a5d8dbff9	optimized sgemv_n_4.c	2014-09-03 15:34:30 +02:00
wernsaar	2a60c6d4b0	optimized sgemv_n for small sizes	2014-09-03 14:48:45 +02:00
wernsaar	0fc560ba23	bugfix for buffer overflow	2014-09-03 10:13:47 +02:00
wernsaar	f3b50dcf5b	removed obsolete instructions from sgemv_t_4.c	2014-09-02 13:35:41 +02:00
wernsaar	93eaba959d	optimized sgemv_t for bulldozer	2014-09-02 12:42:36 +02:00
wernsaar	9570e56965	optimized sgemv_t_4.c for small sizes	2014-09-01 15:11:37 +02:00
wernsaar	bc99faef1b	optimized sgemv_t_4.c for uneven sizes	2014-08-31 14:33:15 +02:00
wernsaar	848c0f16f7	optimized sgemv_t_4.c for small size	2014-08-31 13:23:44 +02:00
wernsaar	53e6dbf6ca	optimized sgemv_t kernel for small sizes	2014-08-30 13:36:27 +02:00
wernsaar	20cd850125	modification for clang compiler	2014-08-27 09:00:20 +02:00
wernsaar	3885eebdb8	added optimized zaxpy bulldozer kernel	2014-08-25 15:52:35 +02:00
wernsaar	ee74445155	added optimized caxpy kernel for bulldozer	2014-08-25 14:53:28 +02:00
wernsaar	9d2ace8bac	added optimized daxpy kernel for bulldozer	2014-08-24 10:57:12 +02:00
wernsaar	b55f997302	added optimized daxpy kernel for nehalem	2014-08-23 17:53:07 +02:00

766 Commits