OpenBLAS

Author	SHA1	Message	Date
Arjan van de Ven	5b708e5eb1	sgemm/dgemm: add a way for an arch kernel to specify preferred sizes. The current gemm threading code can make very unfortunate choices; for example, on my 10-core system a 1024x1024x1024 matrix multiply ends up chunking into blocks of 102, which is not a vector-friendly size, and performance ends up horrible. This patch adds a helper define where an architecture can specify a preference for size multiples. This is different from existing defines that specify minimum sizes and such. The performance increase with this patch for the 1024x1024x1024 sgemm is 2.3x (!!)	2018-11-01 01:43:20 +00:00
Ashwin Sekhar T K	d50abc8903	ARM64: Move parameters from parameter.c to param.h. Remove the runtime setting of P, Q, R parameters for targets ARMV8 and THUNDERX2T99; instead, set them as constants in param.h at compile time.	2018-10-22 01:45:51 -07:00
Ashwin Sekhar T K	21f46a1cf2	ARM64: Use THUNDERX2T99 Neon kernels for ARMV8. Currently the generic ARMV8 target uses C implementations for many routines. Replace these with the Neon implementations written for the THUNDERX2T99 target, which are up to 6x faster for certain routines.	2018-10-17 10:44:37 -07:00
Martin Kroeker	4cf7315a5d	Adjust ARMV8 SGEMM unrolling when using the C fallback kernel_2x2 for IOS	2018-09-06 21:41:54 +02:00
Arjan van de Ven	6eb4b9ae7c	Tune HASWELL SWITCH_RATIO as well. Similar to the SKYLAKEX patch, 32 seems to work best (much better than 4 or 16). Results for SWITCH_RATIO values of 4, 16 and 32:	2018-06-17 17:08:36 +00:00
  Before (SWITCH_RATIO = 4)
  Matrix     SGEMM cycles  MPC   %        DGEMM cycles  MPC   %
  48 x 48    15554.3       7.2   0.2%     30353.8       3.7   0.3%
  64 x 64    30346.8       8.7   1.6%     63495.0       4.1   -0.1%
  65 x 65    81668.1       3.4   -123.3%  82705.2       3.3   -21.2%
  80 x 80    105045.9      4.9   -95.5%   115226.0      4.5   -2.2%
  96 x 96    152461.2      5.8   -74.3%   148156.3      6.0   16.4%
  112 x 112  188505.2      7.5   -42.2%   171187.3      8.2   36.4%
  128 x 128  257884.0      8.1   -39.5%   224764.8      9.3   46.0%
  Intermediate (SWITCH_RATIO = 16)
  Matrix     SGEMM cycles  MPC   %        DGEMM cycles  MPC   %
  48 x 48    15565.7       7.2   0.2%     30378.9       3.7   0.2%
  64 x 64    30430.2       8.7   1.3%     63046.4       4.2   0.6%
  65 x 65    27306.0       10.1  25.3%    38879.2       7.1   43.0%
  80 x 80    51008.7       10.1  5.1%     61007.6       8.4   45.9%
  96 x 96    70856.7       12.5  19.0%    83403.1       10.6  53.0%
  112 x 112  84769.9       16.6  36.0%    99920.1       14.1  62.9%
  128 x 128  84213.2       25.0  54.5%    113024.2      18.6  72.8%
  After (SWITCH_RATIO = 32)
  Matrix     SGEMM cycles  MPC   %        DGEMM cycles  MPC   %
  48 x 48    15537.3       7.2   0.3%     30537.0       3.6   -0.3%
  64 x 64    30352.7       8.7   1.6%     62597.8       4.2   1.3%
  65 x 65    36857.0       7.5   -0.8%    56167.6       4.9   17.7%
  80 x 80    42552.6       12.1  20.8%    69536.7       7.4   38.3%
  96 x 96    52101.5       17.1  40.5%    91016.1       9.7   48.7%
  112 x 112  63853.7       22.1  51.8%    110507.4      12.7  58.9%
  128 x 128  73966.1       28.4  60.0%    163146.4      12.9  60.8%
Arjan van de Ven	5c6f008365	Tune param.h for SkylakeX. param.h defines a per-platform SWITCH_RATIO, which is used as a measure of how finely the gemm blocks need to be split up. Many platforms define this to 4. The reality is that the gemm low-level implementation for SkylakeX likes bigger blocks due to the nature of SIMD; by tuning SWITCH_RATIO to 32 the threading performance improves significantly. With this change, threading starts to be a win already at 96x96.	2018-06-17 15:47:50 +00:00
  Before
  Matrix     SGEMM cycles  MPC   %        DGEMM cycles  MPC   %
  48 x 48    10756.0       10.5  -0.5%    18296.7       6.1   -1.7%
  64 x 64    20490.0       12.9  1.4%     40615.0       6.5   0.0%
  65 x 65    83528.3       3.3   -210.9%  96319.0       2.9   -83.3%
  80 x 80    101453.5      5.1   -166.3%  128021.7      4.0   -76.6%
  96 x 96    149795.1      5.9   -143.1%  168059.4      5.3   -47.4%
  112 x 112  191481.2      7.3   -105.8%  204165.0      6.9   -14.6%
  128 x 128  265019.2      7.9   -99.0%   272006.4      7.7   -5.3%
  After (SWITCH_RATIO = 32)
  Matrix     SGEMM cycles  MPC   %        DGEMM cycles  MPC   %
  48 x 48    10666.3       10.6  0.4%     18236.9       6.2   -1.4%
  64 x 64    20410.1       13.0  1.8%     39925.8       6.6   1.7%
  65 x 65    34983.0       7.9   -30.2%   51494.6       5.4   2.0%
  80 x 80    39769.1       13.0  -4.4%    63805.2       8.1   12.0%
  96 x 96    45169.6       19.7  26.7%    80065.8       11.1  29.8%
  112 x 112  57026.1       24.7  38.7%    99535.5       14.2  44.1%
  128 x 128  64789.8       32.5  51.3%    117407.2      17.9  54.6%
Arjan van de Ven	99c7bba8e4	Initial support for SkylakeX / AVX512. This patch adds the basic infrastructure for the SkylakeX (Intel Skylake server) target. The SkylakeX target will use the AVX512 (AVX512VL level) instruction set, which brings two basic things: 1) 512-bit wide SIMD (2x the width of AVX2) and 2) 32 SIMD registers (2x the number on AVX2). This initial patch only contains a trivial transformation of the Haswell SGEMM kernel to AVX512VL; more will follow later, but this patch aims to get the infrastructure in place for that. Full performance tuning has not been done yet; with more registers and wider SIMD it is in theory possible to retune the kernels, but even without that there is an interesting performance increase (30-40% range) with just this change.	2018-06-03 07:58:52 +00:00
Martin Kroeker	d94d7baf7e	Add mips32r2 api target	2018-05-02 20:17:26 +02:00
Shivraj Patil	e3d844b062	Added MIPS I6500 core. Signed-off-by: Shivraj Patil <shivraj.patil@imgtec.com>	2017-09-22 11:57:43 +05:30
Gian-Carlo Pascutto	832a272784	Revert Zen param.h to Haswell values (instead of Excavator).	2017-04-18 12:40:25 +02:00
Denis Steckelmacher	c9ff735da6	Add ZEN support (tested for auto-detected static backend)	2017-03-19 15:32:50 +01:00
Martin Kroeker	cd135e2b59	Merge pull request #1130 from quickwritereader/develop: BLAS 3 for single precision	2017-03-15 10:00:52 +01:00
Abdurrauf	08786c4b95	strmm and ctrmm	2017-03-13 01:23:16 +04:00
Abdurrauf	82e80fa82b	initial strmm(sgemm). not tuned yet	2017-03-06 04:27:40 +04:00
Martin Kroeker	ffc1d6c468	Merge pull request #1108 from ashwinyes/develop_20170203_thunderx2t99: Optimized implementations for ThunderX2T99	2017-02-28 16:02:19 +01:00
Ashwin Sekhar T K	19ba133383	THUNDERX2T99: Add Optimized ZGEMM Implementation	2017-02-28 05:31:41 +00:00
Abdurrauf	0d96b0e2a7	Merge branch 'z13' into develop	2017-02-26 06:17:33 +04:00
Abdurrauf	848cb27b1e	ztrmm kernel.	2017-02-26 06:14:12 +04:00
Ashwin Sekhar T K	2757b49767	THUNDERX2T99: Add Optimized CGEMM Implementation	2017-01-30 17:44:26 +05:30
Ashwin Sekhar T K	f279ff4789	THUNDERX2T99: Add Optimized SGEMM Implementation	2017-01-16 21:44:33 +05:30
Ashwin Sekhar T K	4b55fae337	ARM64: Add Cavium THUNDERX2T99 Target	2017-01-11 11:18:40 +05:30
Andrew Pinski	fb200c7245	ARM64: Add Cavium THUNDERX Target	2017-01-10 15:01:37 +05:30
Ashwin Sekhar T K	4713e7c47f	ARM64: Add the VULCAN Target	2017-01-10 15:01:17 +05:30
Zhang Xianyi	b678471d65	Merge branch 'z13' into develop. Conflicts: CONTRIBUTORS.md	2017-01-09 05:52:42 -05:00
Abdurrauf	6418667818	dtrmm and dgemm for z13	2017-01-04 19:32:33 +04:00
Shivraj Patil	9687437928	MIPS n32 ABI and build-time MIPS SIMD support check. Signed-off-by: Shivraj Patil <shivraj.patil@imgtec.com>	2016-08-10 17:44:22 +05:30
Shivraj Patil	d1c6469283	MIPS n32 ABI support, MSA support detection and rename of ARCH, ARCHFLAGS. Signed-off-by: Shivraj Patil <shivraj.patil@imgtec.com>	2016-08-08 11:58:01 +05:30
Shivraj Patil	beb1d076a4	Added MSA optimization for GEMV_N, GEMV_T, ASUM, DOT functions. Signed-off-by: Shivraj Patil <shivraj.patil@imgtec.com>	2016-07-15 18:38:25 +05:30
Zhang Xianyi	8a592ee386	Merge pull request #924 from ashwinyes/develop_aarch64_improvements_20160714: Improvements to AArch64 kernels	2016-07-14 15:47:55 -04:00
Ashwin Sekhar T K	0a5ff9f9f9	Improvements to TRMM and GEMM kernels	2016-07-14 13:56:04 +05:30
Shivraj Patil	57df7956ee	Added CGEMM, ZGEMM, STRMM, DTRMM, CTRMM, ZTRMM. Updated macros in SGEMM, DGEMM, STRMM. Signed-off-by: Shivraj Patil <shivraj.patil@imgtec.com>	2016-06-28 17:51:10 +05:30
Shivraj Patil	c4ba40e308	SGEMM optimization for MIPS P5600 and I6400 using MSA. Unrolled k loop in DGEMM kernel function. Signed-off-by: Shivraj Patil <shivraj.patil@imgtec.com>	2016-05-19 11:04:42 +05:30
Werner Saar	88011f625d	Merge pull request #876 from wernsaar/develop: optimized dgemm on POWER8 for 20 threads	2016-05-16 14:52:40 +02:00
Werner Saar	8310d4d3f7	optimized dgemm for 20 threads	2016-05-16 14:14:25 +02:00
Shivraj Patil	085cf236c2	Conflict resolved by syncing with 'xianyi:develop'. Signed-off-by: Shivraj Patil <shivraj.patil@imgtec.com>	2016-05-04 11:07:14 +05:30
Shivraj Patil	b7b3d8ec8e	DGEMM optimization for MIPS P5600 and I6400 using MSA. Signed-off-by: Shivraj Patil <shivraj.patil@imgtec.com>	2016-05-03 14:42:26 +05:30
Zhang Xianyi	cd7af5260a	Merge pull request #847 from sva-img/develop: MIPS P5600 (32-bit) and I6400 (64-bit) core support added.	2016-04-29 11:44:36 -04:00
Werner Saar	782f75ba94	optimized param.h for POWER8	2016-04-27 15:48:09 +02:00
Werner Saar	0d0c6f7d7d	optimized dgemm for POWER8	2016-04-27 14:01:08 +02:00
Werner Saar	40ac64ae4f	updated param.h for EXCAVATOR	2016-04-25 10:40:04 +02:00
Werner Saar	089aad57f7	updated param.h for POWER8	2016-04-23 14:26:24 +02:00
Werner Saar	879a51165f	Optimized zgemm and tested zgemm again	2016-04-22 13:07:12 +02:00
Shivraj Patil	2c3dfe2bf3	MIPS P5600 (32-bit) and I6400 (64-bit) core support added. Separated mips and mips64 files. Configuration support for 32-bit MIPS. Signed-off-by: Shivraj Patil <shivraj.patil@imgtec.com>	2016-04-22 14:03:18 +05:30
Werner Saar	3c6294ca3d	added optimized sgemm_tcopy for power8	2016-04-19 16:08:54 +02:00
Zhang Xianyi	dd43661cfd	Init IBM z system (s390x) porting.	2016-04-15 18:02:24 -04:00
Werner Saar	e173c51c04	updated zgemm- and ztrmm-kernel for POWER8	2016-04-08 09:05:37 +02:00
Werner Saar	9c42f0374a	Updated cgemm- and sgemm-kernel for POWER8 SMP	2016-04-07 15:08:15 +02:00
Werner Saar	a51102e9b7	bugfixes for sgemm- and cgemm-kernel	2016-04-06 11:15:21 +02:00
Werner Saar	c5b1fbcb2e	updated optimized cgemm- and ctrmm-kernel for POWER8	2016-04-04 09:12:08 +02:00
Werner Saar	6a9bbfc227	updated sgemm- and strmm-kernel for POWER8	2016-04-02 17:16:36 +02:00
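
The chunk-size problem described in commit 5b708e5eb1 can be illustrated with a small C sketch. It is not the OpenBLAS threading code: the preferred multiple of 16 and the helper names `round_up_width` and `split_columns` are hypothetical. The idea shown is the one the commit message describes: give each thread an even share of the columns, then round that share up to a SIMD-friendly multiple instead of accepting an awkward size such as 102.

```c
#include <assert.h>

/* Hypothetical preferred chunk multiple (e.g. a SIMD-friendly width).
   The name and value are assumptions for this sketch. */
#define PREFERRED_MULTIPLE 16

/* Round a per-thread chunk width up to the next multiple of `multiple`,
   unless the remaining work or the width is already no larger than one
   multiple, in which case rounding would not help. */
static int round_up_width(int remainder, int width, int multiple) {
    if (multiple > remainder || width <= multiple)
        return width;
    return ((width + multiple - 1) / multiple) * multiple;
}

/* Split `n` columns across at most `nthreads` threads, writing each chunk
   width into `widths`. Returns the number of chunks produced. */
static int split_columns(int n, int nthreads, int multiple, int *widths) {
    assert(n >= 0 && nthreads >= 1 && multiple >= 1);
    int count = 0;
    int remaining = n;
    while (remaining > 0 && count < nthreads) {
        int left = nthreads - count;
        int width = (remaining + left - 1) / left;   /* even share, rounded up */
        width = round_up_width(remaining, width, multiple);
        if (width > remaining)                       /* never overshoot */
            width = remaining;
        widths[count++] = width;
        remaining -= width;
    }
    return count;
}
```

Under these assumptions, `split_columns(1024, 10, PREFERRED_MULTIPLE, w)` yields four chunks of 112 followed by six of 96 (sum 1024), all multiples of 16. With a multiple of 1 the shares come out as 103 and 102, which is the kind of size the commit message calls out.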

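Commit d50abc8903 replaces the runtime setting of P, Q and R with compile-time constants in param.h. The sketch below shows that general pattern: per-target macros selected by the preprocessor and then used as block sizes. It assumes the usual GEMM blocking convention where P blocks the M dimension, Q blocks K and R blocks N. The target name `TARGET_EXAMPLE_BIG`, the `EX_GEMM_*` macro names and all numeric values are hypothetical, not the real ARMV8 or THUNDERX2T99 settings.

```c
#include <assert.h>

/* Hypothetical per-target blocking constants chosen at compile time,
   in the style of param.h. Values are illustrative only. */
#if defined(TARGET_EXAMPLE_BIG)
#define EX_GEMM_P 512   /* block size along M */
#define EX_GEMM_Q 512   /* block size along K */
#define EX_GEMM_R 8192  /* block size along N */
#else
#define EX_GEMM_P 128
#define EX_GEMM_Q 352
#define EX_GEMM_R 4096
#endif

/* Checked by the compiler, so no runtime validation is needed. */
_Static_assert(EX_GEMM_P > 0 && EX_GEMM_Q > 0 && EX_GEMM_R > 0,
               "blocking parameters must be positive");

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* Number of (M, K, N) blocks a blocked GEMM would visit for an
   m x k by k x n product. */
static long count_blocks(int m, int n, int k) {
    assert(m > 0 && n > 0 && k > 0);
    return (long)ceil_div(m, EX_GEMM_P) * ceil_div(k, EX_GEMM_Q) *
           ceil_div(n, EX_GEMM_R);
}
```

With the default branch, a 1024 x 1024 x 1024 product is visited as 8 x 3 x 1 = 24 blocks. Because the values are macros, the compiler can fold them into loop bounds; that is one plausible benefit of fixing them at build time.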
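
Commits 5c6f008365 and 6eb4b9ae7c describe SWITCH_RATIO as a measure of how finely gemm work is split across threads. The following is a simplified model, not OpenBLAS's exact rule: it assumes each thread must receive at least SWITCH_RATIO rows, and `threads_for_rows` is a hypothetical name. It shows why raising the ratio from 4 to 32 keeps small matrices on fewer threads.

```c
#include <assert.h>

/* Simplified model (an assumption, not OpenBLAS's exact logic): cap the
   thread count so that every thread gets at least `switch_ratio` rows. */
static int threads_for_rows(int m, int max_threads, int switch_ratio) {
    assert(m >= 0 && max_threads >= 1 && switch_ratio >= 1);
    int t = m / switch_ratio;
    if (t < 1)
        t = 1;              /* always run at least one thread */
    if (t > max_threads)
        t = max_threads;    /* never exceed the available threads */
    return t;
}
```

Under this model, a 64 x 64 problem on 10 threads uses all 10 threads with a ratio of 4 (64 / 4 = 16, capped at 10) but only 2 with a ratio of 32. Small matrices are therefore not split into slivers too thin for the SIMD kernels.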

126 Commits