OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Chris Sidebottom	32f2fafde7	Propagate SWITCH_RATIO to DYNAMIC_ARCH builds Previously dynamic builds were either using the default SWITCH_RATIO or one from the higher level architecture; this patch ensures the dynamic builds can use this parameter as well.	2023-04-17 15:34:12 +01:00
Honglin Zhu	4989e039a5	Define SBGEMM_ALIGN_K for DYNAMIC_ARCH build	2022-10-27 14:10:26 +08:00
Honglin Zhu	b00d5b9746	New sbgemm implementation for Neoverse N2 1. Use UZP instructions but not gather load and scatter store instructions to get lower latency. 2. Padding k to a power of 4.	2022-10-26 15:09:41 +08:00
Wangyang Guo	3dc6052c7e	initial support for Sapphire Rapids platform	2021-10-12 01:30:40 -07:00
Martin Kroeker	2f8220d757	Add sbgemm	2021-09-14 16:14:43 +02:00
Martin Kroeker	307c4c0786	Fix typo	2021-06-16 13:41:16 +02:00
Martin Kroeker	e83df93975	Work around another recent macro name collision with winnt.h	2021-06-16 12:32:34 +02:00
Martin Kroeker	a554712439	remove extra/intermediate size step for min_jj introduced in PR747	2020-12-08 21:01:36 +01:00
Martin Kroeker	5d26223f4a	remove extra/intermediate size step of min_jj from PR747	2020-12-08 20:59:56 +01:00
Martin Kroeker	d3ff1f889f	Convert ifndefs to ifneq	2020-11-22 16:27:17 +01:00
Rajalakshmi Srinivasaraghavan	b5d30b390d	Fix build issues with bfloat16 This patch fixes compilation errors due to recent renaming from SH to SB with BUILD_BFLOAT16.	2020-10-13 11:00:22 -05:00
Martin Kroeker	006c7f6671	Change "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-12 00:06:06 +02:00
Martin Kroeker	886a8e3190	Adapt for supporting only a subset of variable types	2020-10-11 14:57:32 +02:00
Martin Kroeker	ac653c94f3	Merge branch 'develop' into issue2588-cmake	2020-10-11 13:57:07 +02:00
Martin Kroeker	988a6f429e	Add BUILD_vartype defines	2020-09-22 23:23:33 +02:00
Martin Kroeker	e5e2fbd593	Support building only selected types	2020-09-22 23:21:30 +02:00
y00512012	06cf73a239	fix a bug of trmm	2020-09-22 16:47:10 +08:00
Martin Kroeker	ddec244a5a	Merge pull request #2838 from austinpagan/gordon_trmm Adding performance patch for trmm, just like trsm (#2836)	2020-09-15 21:17:48 +02:00
fossum	dfeca46098	Adding performance patch for trmm, just like #2836	2020-09-15 08:59:50 -05:00
fossum	274d6e015b	Fixing a performance bug in trsm_[LR].c.	2020-09-14 13:10:48 -05:00
Martin Kroeker	330044d821	Fix potentiol domain error in sqrt	2020-09-05 09:44:33 +02:00
Chen, Guobing	e740c4873d	Enable COOPERLAKE build target Enable new build target platform -- COOPERLAKE. This target platform supports all the SKYLAKEX supported ISAs + avx512bf16. So all the SKYLAKEX specific kernels/drivers and related code are now extended to be also active on COOPERLAKE. Besides, new BF16 related kernels are active under this target.	2020-08-13 06:18:00 +08:00
Martin Kroeker	ce45af8151	Update conditional for atomics to use HAVE_C11	2020-07-18 17:09:56 +00:00
Martin Kroeker	6f38de06d2	Update conditional for atomics to use HAVE_C11	2020-07-18 17:09:01 +00:00
Martin Kroeker	5dd14e3d48	Make building the bfloat16 functions conditional on option BUILD_HALF (#2590) * make building the bfloat16 BLAS functions conditional on BUILD_HALF * pass the BUILD_HALF option to gensymbol * Pass BUILD_HALF as a compiler define for dynamic_arch builds	2020-05-01 09:58:30 +02:00
Rajalakshmi Srinivasaraghavan	7eb55504b1	RFC : Add half precision gemm for bfloat16 in OpenBLAS This patch adds support for bfloat16 data type matrix multiplication kernel. For architectures that don't support bfloat16, it is defined as unsigned short (2 bytes). Default unroll sizes can be changed as per architecture as done for SGEMM and for now 8 and 4 are used for M and N. Size of ncopy/tcopy can be changed as per architecture requirement and for now, size 2 is used. Added shgemm in kernel/power/KERNEL.POWER9 and tested in powerpc64le and powerpc64. For reference, added a small test compare_sgemm_shgemm.c to compare sgemm and shgemm output. This patch does not cover OpenBLAS test, benchmark and lapack tests for shgemm. Complex type implementation can be discussed and added once this is approved.	2020-04-14 14:55:08 -05:00
Ali Saidi	97ce6bbce2	Fix barriers in level3_thread	2020-02-29 17:45:17 +00:00
wjc404	2f96a2c55b	Update trmm_R.c	2020-02-05 10:15:02 +08:00
wjc404	833bd0f8ff	Update trmm_L.c	2020-02-05 10:09:41 +08:00
wjc404	77b8f49556	Update level3_thread.c	2020-02-04 20:33:08 +08:00
wjc404	1c3e20ce48	Update level3.c	2020-02-04 20:30:23 +08:00
wjc404	e9fb8f62b1	Update level3_gemm3m_thread.c	2020-01-22 17:40:03 +00:00
wjc404	4c35b8dbaa	Update gemm3m_level3.c	2019-12-27 18:03:01 +08:00
Martin Kroeker	f3065a0eed	Fix race conditions in multithreaded GEMM3M by adding barriers (and a mutex lock for the non-OpenMP case) like it was already done for GEMM in level3_thread.c some time ago	2019-11-23 19:54:56 +01:00
Martin Kroeker	f343ed65b5	Avoid taking the root of a negative number Fixes #1924 where numpy 1.17+ would report the (transient) FE_INVALID exception raised for the domain error.	2018-12-22 22:30:29 +01:00
Martin Kroeker	f72fdf525c	Merge pull request #1875 from martin-frbg/issue1851 Serialize accesses to parallelized level3 functions from multiple cal…	2018-11-25 20:53:46 +01:00
Martin Kroeker	113cb00b95	fix missing parenthesis	2018-11-19 21:01:36 +01:00
Martin Kroeker	5192651706	Add CriticalSection handling instead of mutexes for Windows	2018-11-19 17:58:22 +01:00
Martin Kroeker	2e6fae2aad	Serialize accesses to parallelized level3 functions from multiple callers for #1851	2018-11-19 14:02:50 +01:00
Arjan van de Ven	5b708e5eb1	sgemm/dgemm: add a way for an arch kernel to specify prefered sizes The current gemm threading code can make very unfortunate choices, for example on my 10 core system a 1024x1024x1024 matrix multiply ends up chunking into blocks of 102... which is not a vector friendly size and performance ends up horrible. this patch adds a helper define where an architecture can specify a preference for size multiples. This is different from existing defines that are minimum sizes and such. The performance increase with this patch for the 1024x1024x1024 sgemm is 2.3x (!!)	2018-11-01 01:43:20 +00:00
Martin Kroeker	5f2a3c05cd	Revert "Rewrite &= -> = and simplify the initial blocking phase."	2018-07-03 21:42:28 +02:00
Craig Donner	0144068537	Rewrite &= -> = and simplify the initial blocking phase.	2018-06-25 15:08:55 +01:00
Arjan van de Ven	73de17664d	Add missing barriers in gemm scheduler a few places in the gemm scheduler code were missing barriers; the code likely worked OK due to heavy use of volatile / _Atomic but there's no reason to get this incorrect	2018-06-17 17:50:43 +00:00
Arjan van de Ven	d148ec4ea1	Don't use _Atomic for jobs sometimes... The use of _Atomic leads to really bad code generation in the compiler (on x86, you get 2 "mfence" memory barriers around each access with gcc8, despite x86 being ordered and cache coherent). But there's a fallback in the code that just uses volatile which is more than plenty in practice. If we're nervous about cross thread synchronization for these variables, we should make the YIELD function be a compiler/memory barrier instead. performance before (after last commit) Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 10630.0 10.6 0.7% 18112.8 6.2 -0.7% 64 x 64 20374.8 13.0 1.9% 40487.0 6.5 0.4% 65 x 65 141955.2 1.9 -428.3% 146708.8 1.9 -179.2% 80 x 80 178921.1 2.9 -369.6% 186032.7 2.8 -156.6% 96 x 96 205436.2 4.3 -233.4% 224513.1 3.9 -97.0% 112 x 112 244408.2 5.8 -162.7% 262158.7 5.4 -47.1% 128 x 128 321334.5 6.5 -141.3% 333829.0 6.3 -29.2% Performance with this patch (roughly a 2x improvement): Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 10756.0 10.5 -0.5% 18296.7 6.1 -1.7% 64 x 64 20490.0 12.9 1.4% 40615.0 6.5 0.0% 65 x 65 83528.3 3.3 -210.9% 96319.0 2.9 -83.3% 80 x 80 101453.5 5.1 -166.3% 128021.7 4.0 -76.6% 96 x 96 149795.1 5.9 -143.1% 168059.4 5.3 -47.4% 112 x 112 191481.2 7.3 -105.8% 204165.0 6.9 -14.6% 128 x 128 265019.2 7.9 -99.0% 272006.4 7.7 -5.3%	2018-06-17 15:39:15 +00:00
Arjan van de Ven	9e162146a9	Only initialize the part of the jobs array that will get used The jobs array is getting initialized in O(compiled cpus^2) complexity. Distros and people with bigger systems will use pretty high values (128 or 256 or more) for this value, leading to interesting bubbles in performance. Baseline (single threaded performance) gets roughly 13 - 15 multiplications per cycle in the interesting range (threading kicks in at 65x65 mult by 65x65). The hardware is capable of 32 multiplications per cycle theoretically. Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 10703.9 10.6 0.0% 17990.6 6.3 0.0% 64 x 64 20778.4 12.8 0.0% 40629.2 6.5 0.0% 65 x 65 26869.9 10.3 0.0% 52545.7 5.3 0.0% 80 x 80 38104.5 13.5 0.0% 72492.7 7.1 0.0% 96 x 96 61626.4 14.4 0.0% 113983.8 7.8 0.0% 112 x 112 91803.8 15.3 0.0% 180987.3 7.8 0.0% 128 x 128 133161.4 15.8 0.0% 258374.3 8.1 0.0% When threading is turned on TARGET=SKYLAKEX F_COMPILER=GFORTRAN SHARED=1 DYNAMIC_THREADS=1 USE_OPENMP=0 NUM_THREADS=128 Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 10725.9 10.5 -0.2% 18134.9 6.2 -0.8% 64 x 64 20500.6 12.9 1.3% 40929.1 6.5 -0.7% 65 x 65 2040832.1 0.1 -7495.2% 2097633.6 0.1 -3892.0% 80 x 80 2063129.1 0.2 -5314.4% 2119925.2 0.2 -2824.3% 96 x 96 2070374.5 0.4 -3259.6% 2173604.4 0.4 -1806.9% 112 x 112 2111721.5 0.7 -2169.6% 2263330.8 0.6 -1170.0% 128 x 128 2276181.5 0.9 -1609.3% 2377228.9 0.9 -820.1% There is a deep deep cliff once you hit 65x65 With this patch Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 10630.0 10.6 0.7% 18112.8 6.2 -0.7% 64 x 64 20374.8 13.0 1.9% 40487.0 6.5 0.4% 65 x 65 141955.2 1.9 -428.3% 146708.8 1.9 -179.2% 80 x 80 178921.1 2.9 -369.6% 186032.7 2.8 -156.6% 96 x 96 205436.2 4.3 -233.4% 224513.1 3.9 -97.0% 112 x 112 244408.2 5.8 -162.7% 262158.7 5.4 -47.1% 128 x 128 321334.5 6.5 -141.3% 333829.0 6.3 -29.2% The cliff is very significantly reduced. (more to follow)	2018-06-17 15:32:03 +00:00
Martin Kroeker	a91f1587b9	Work around name clash with Windows10's winnt.h fixes #1503	2018-05-31 13:26:00 +02:00
Zhiyong Dang	3716267124	Change _STDC_VERSION__ to __STDC_VERSION__ Change-Id: Id3fa4e8d9eedd4ef7230df69b611e7f397301a42	2018-05-11 12:15:08 +08:00
Martin Kroeker	6a99fcce94	Use _Atomic instead of volatile for thread safety where C11 is supported Suggested by dodomorandi in #660	2018-03-10 00:03:49 +01:00
Andrew	11a627c54e	remove surplus parentheses to silence clang5	2018-01-01 20:56:26 +01:00
Andrew	bfc2a88594	remove unused buffer	2017-12-22 00:55:40 +01:00

110 Commits