OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Martin Kroeker	f72fdf525c	Merge pull request #1875 from martin-frbg/issue1851 Serialize accesses to parallelized level3 functions from multiple cal…	2018-11-25 20:53:46 +01:00
Martin Kroeker	113cb00b95	fix missing parenthesis	2018-11-19 21:01:36 +01:00
Martin Kroeker	5192651706	Add CriticalSection handling instead of mutexes for Windows	2018-11-19 17:58:22 +01:00
Martin Kroeker	2e6fae2aad	Serialize accesses to parallelized level3 functions from multiple callers for #1851	2018-11-19 14:02:50 +01:00
Arjan van de Ven	5b708e5eb1	sgemm/dgemm: add a way for an arch kernel to specify prefered sizes The current gemm threading code can make very unfortunate choices, for example on my 10 core system a 1024x1024x1024 matrix multiply ends up chunking into blocks of 102... which is not a vector friendly size and performance ends up horrible. this patch adds a helper define where an architecture can specify a preference for size multiples. This is different from existing defines that are minimum sizes and such. The performance increase with this patch for the 1024x1024x1024 sgemm is 2.3x (!!)	2018-11-01 01:43:20 +00:00
Martin Kroeker	5f2a3c05cd	Revert "Rewrite &= -> = and simplify the initial blocking phase."	2018-07-03 21:42:28 +02:00
Craig Donner	0144068537	Rewrite &= -> = and simplify the initial blocking phase.	2018-06-25 15:08:55 +01:00
Arjan van de Ven	73de17664d	Add missing barriers in gemm scheduler a few places in the gemm scheduler code were missing barriers; the code likely worked OK due to heavy use of volatile / _Atomic but there's no reason to get this incorrect	2018-06-17 17:50:43 +00:00
Arjan van de Ven	d148ec4ea1	Don't use _Atomic for jobs sometimes... The use of _Atomic leads to really bad code generation in the compiler (on x86, you get 2 "mfence" memory barriers around each access with gcc8, despite x86 being ordered and cache coherent). But there's a fallback in the code that just uses volatile which is more than plenty in practice. If we're nervous about cross thread synchronization for these variables, we should make the YIELD function be a compiler/memory barrier instead. performance before (after last commit) Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 10630.0 10.6 0.7% 18112.8 6.2 -0.7% 64 x 64 20374.8 13.0 1.9% 40487.0 6.5 0.4% 65 x 65 141955.2 1.9 -428.3% 146708.8 1.9 -179.2% 80 x 80 178921.1 2.9 -369.6% 186032.7 2.8 -156.6% 96 x 96 205436.2 4.3 -233.4% 224513.1 3.9 -97.0% 112 x 112 244408.2 5.8 -162.7% 262158.7 5.4 -47.1% 128 x 128 321334.5 6.5 -141.3% 333829.0 6.3 -29.2% Performance with this patch (roughly a 2x improvement): Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 10756.0 10.5 -0.5% 18296.7 6.1 -1.7% 64 x 64 20490.0 12.9 1.4% 40615.0 6.5 0.0% 65 x 65 83528.3 3.3 -210.9% 96319.0 2.9 -83.3% 80 x 80 101453.5 5.1 -166.3% 128021.7 4.0 -76.6% 96 x 96 149795.1 5.9 -143.1% 168059.4 5.3 -47.4% 112 x 112 191481.2 7.3 -105.8% 204165.0 6.9 -14.6% 128 x 128 265019.2 7.9 -99.0% 272006.4 7.7 -5.3%	2018-06-17 15:39:15 +00:00
Arjan van de Ven	9e162146a9	Only initialize the part of the jobs array that will get used The jobs array is getting initialized in O(compiled cpus^2) complexity. Distros and people with bigger systems will use pretty high values (128 or 256 or more) for this value, leading to interesting bubbles in performance. Baseline (single threaded performance) gets roughly 13 - 15 multiplications per cycle in the interesting range (threading kicks in at 65x65 mult by 65x65). The hardware is capable of 32 multiplications per cycle theoretically. Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 10703.9 10.6 0.0% 17990.6 6.3 0.0% 64 x 64 20778.4 12.8 0.0% 40629.2 6.5 0.0% 65 x 65 26869.9 10.3 0.0% 52545.7 5.3 0.0% 80 x 80 38104.5 13.5 0.0% 72492.7 7.1 0.0% 96 x 96 61626.4 14.4 0.0% 113983.8 7.8 0.0% 112 x 112 91803.8 15.3 0.0% 180987.3 7.8 0.0% 128 x 128 133161.4 15.8 0.0% 258374.3 8.1 0.0% When threading is turned on TARGET=SKYLAKEX F_COMPILER=GFORTRAN SHARED=1 DYNAMIC_THREADS=1 USE_OPENMP=0 NUM_THREADS=128 Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 10725.9 10.5 -0.2% 18134.9 6.2 -0.8% 64 x 64 20500.6 12.9 1.3% 40929.1 6.5 -0.7% 65 x 65 2040832.1 0.1 -7495.2% 2097633.6 0.1 -3892.0% 80 x 80 2063129.1 0.2 -5314.4% 2119925.2 0.2 -2824.3% 96 x 96 2070374.5 0.4 -3259.6% 2173604.4 0.4 -1806.9% 112 x 112 2111721.5 0.7 -2169.6% 2263330.8 0.6 -1170.0% 128 x 128 2276181.5 0.9 -1609.3% 2377228.9 0.9 -820.1% There is a deep deep cliff once you hit 65x65 With this patch Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 10630.0 10.6 0.7% 18112.8 6.2 -0.7% 64 x 64 20374.8 13.0 1.9% 40487.0 6.5 0.4% 65 x 65 141955.2 1.9 -428.3% 146708.8 1.9 -179.2% 80 x 80 178921.1 2.9 -369.6% 186032.7 2.8 -156.6% 96 x 96 205436.2 4.3 -233.4% 224513.1 3.9 -97.0% 112 x 112 244408.2 5.8 -162.7% 262158.7 5.4 -47.1% 128 x 128 321334.5 6.5 -141.3% 333829.0 6.3 -29.2% The cliff is very significantly reduced. (more to follow)	2018-06-17 15:32:03 +00:00
Martin Kroeker	a91f1587b9	Work around name clash with Windows10's winnt.h fixes #1503	2018-05-31 13:26:00 +02:00
Zhiyong Dang	3716267124	Change _STDC_VERSION__ to __STDC_VERSION__ Change-Id: Id3fa4e8d9eedd4ef7230df69b611e7f397301a42	2018-05-11 12:15:08 +08:00
Martin Kroeker	6a99fcce94	Use _Atomic instead of volatile for thread safety where C11 is supported Suggested by dodomorandi in #660	2018-03-10 00:03:49 +01:00
Andrew	11a627c54e	remove surplus parentheses to silence clang5	2018-01-01 20:56:26 +01:00
Andrew	bfc2a88594	remove unused buffer	2017-12-22 00:55:40 +01:00
Andrew	ef95cd471f	elminate unread variable, after reiteration 3 of them (clang4)	2017-11-25 02:54:37 +01:00
Martin Kroeker	db72ad8f6a	Merge pull request #1320 from timmoon10/develop 2D thread distribution for multi-threaded GEMMs	2017-10-08 23:31:33 +02:00
Martin Kroeker	514d237257	Merge pull request #1279 from xsacha/develop CMake improvements	2017-10-06 21:13:45 +02:00
Tim Moon	30486a356c	Reduce number of data partitions in n.	2017-10-04 12:37:49 -07:00
Tim Moon	9de52b489a	Cleaning up and documenting multi-threaded GEMM code.	2017-10-03 16:32:08 -07:00
Tim Moon	860dcfc703	Use 2D thread distribution for small GEMMs. Allows maximum use of available cores if one of M and N is small and the other is large.	2017-10-03 13:43:39 -07:00
Tim Moon	6aaa107865	Reducing threads for multi-threaded GEMMs on small matrices.	2017-09-27 19:25:33 -07:00
Sacha Refshauge	37858d1146	Fix threading usage in CMake: s/SMP/USE_THREAD/	2017-08-19 15:07:42 +10:00
Isuru Fernando	d245caa49a	Support out-of-source build	2017-08-01 15:16:14 +05:30
Martin Kroeker	49e62c0e77	fixed syrk_thread.c taken from wernsaar Stride calculation fix copied from https://github.com/wernsaar/OpenBLAS/commit/88900e1	2017-07-06 17:30:12 +02:00
Werner Saar	a2672d5589	prepared driver/level3 functions for UNROLL values, that are not a power of two	2017-01-09 10:38:15 +01:00
John Biddiscombe	053044ae4d	Replace CMAKE_SOURCE_DIR/CMAKE_BINARY_DIR with PROJECT_SOURCE_DIR/PROJECT_BINARY_DIR If OpenBLAS is built using add_subdirectory(OpenBlas) as part of another project then the paths set by CMAKE_XXX_DIR are relative to the parent project and not the OpenBLAS project.	2016-05-25 09:13:28 +02:00
Zhang Xianyi	d06b92906a	Add gemm3m building for CMake.	2016-02-12 05:02:51 +08:00
Werner Saar	b07d733a71	added updates for syrk and syr2k	2016-01-21 13:16:44 +01:00
Zhang Xianyi	055b481386	Fixed CMake bug for single core.	2016-01-15 06:42:54 +08:00
Ralph Campbell	fbc21266e6	Minor C code fixes in driver/	2015-11-09 14:15:49 +05:30
Zhang Xianyi	d8392c1245	Fixe cmake config bugs.	2015-10-20 04:30:55 +08:00
Zhang Xianyi	f874465bb8	Use cmake to build OpenBLAS GENERIC Target on MSVC x86 64-bit. Disable CBLAS and LAPACK.	2015-08-10 14:10:44 -05:00
Hank Anderson	9eaea02f33	Added additional gemm defines for complex types.	2015-02-25 09:39:11 -06:00
Hank Anderson	0d8e227ea7	Changed strategy for setting preprocessor definitions. Instead of generating separate object files for each permutation of defines for a source file, GenerateNamedObjects now writes an entirely new source file and inserts the defines as #define c statements. This solves a problem I ran into with ar.exe where it was refusing to link objects that had the same filename despite having different paths.	2015-02-24 12:26:33 -06:00
Hank Anderson	371071d461	Added CONJ defines for trmm/trsm.	2015-02-21 10:59:02 -06:00
Hank Anderson	8a143516e3	Added alternate_name to a couple of the name mangling schemes. Added zherk_k sources to driver/level3.	2015-02-20 17:03:33 -06:00
Hank Anderson	e5897ecb9b	Added zherk_kernel.c objects to driver/level3.	2015-02-19 16:19:56 -06:00
Hank Anderson	4662a0b13a	Changed generate functions to iterate through a list of float types. This will generate obj files for SINGLE/DOUBLE/COMPLEX/DOUBLE COMPLEX.	2015-02-15 17:44:37 -06:00
Hank Anderson	e74462a3f5	Moved declarations to start of functions to satisfy MSVC C89 implementation.	2015-02-11 11:16:57 -06:00
Hank Anderson	056ba26755	Changed a number of inline calls to use __inline. MSVC doesn't inmplement C99, so can't use the inline keyword. __inline appears to work in MSVC and GCC.	2015-02-11 11:13:17 -06:00
Hank Anderson	e8c39138c6	Removed return value from GenerateNamedObjects. It sets DBLAS_OBJS directly to save a bunch of list appending in the CMakeLists.txt files.	2015-02-09 12:28:09 -06:00
Hank Anderson	627d5e7401	Added SMP objects to driver/level3.	2015-02-05 12:22:48 -06:00
Hank Anderson	943fa2fb58	Fixed object names in level2.	2015-02-05 10:49:11 -06:00
Hank Anderson	461e691127	Codes when define is absent are now a parameter to AllCombinations. The level3 object names should now be correct.	2015-02-05 09:23:47 -06:00
Hank Anderson	cfaf1c678f	Added option to append define codes with an underscore. Fixed the code array not getting reset on subsequent AllCombinations calls.	2015-02-05 09:17:18 -06:00
Hank Anderson	0d7bad1f35	Changed GenerateObjects to append combination codes (e.g. dtrmm_TU).	2015-02-05 09:02:54 -06:00
Hank Anderson	d11bde60d0	DOUBLE define for DBLAS objects is now set in main CMakeLists.txt. Since the objects are the same, could generate SINGLE/COMPLEX/etc here without having to rewrite all the object enumeration code again.	2015-02-02 15:00:44 -06:00
Hank Anderson	5057a4b4df	Added openblas add_library call that uses DBLAS_OBJS ojbects.	2015-01-30 15:21:21 -06:00
Hank Anderson	d3dcdddf75	Moved functions into util cmake file.	2015-01-30 13:47:40 -06:00
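
Note on commit 5b708e5eb1: the message describes a 1024-wide GEMM on 10 threads being cut into chunks of about 102, which is not a multiple of the vector width. The following is a minimal sketch of the idea, not the OpenBLAS implementation. The function names and the choice of 16 as the preferred multiple are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round width up to the next multiple of `multiple`. */
static size_t round_up_to_multiple(size_t width, size_t multiple) {
    assert(multiple > 0);
    return ((width + multiple - 1) / multiple) * multiple;
}

/*
 * Hypothetical partitioner: split n columns across at most nthreads chunks,
 * preferring chunk widths that are multiples of `multiple`.
 * Writes the chunk widths to out (which must hold nthreads entries)
 * and returns the number of chunks actually used.
 */
static size_t partition_columns(size_t n, size_t nthreads,
                                size_t multiple, size_t *out) {
    assert(nthreads > 0);
    size_t used = 0;
    size_t remaining = n;
    while (remaining > 0 && used < nthreads) {
        size_t threads_left = nthreads - used;
        /* Even share of what is left, rounded up. */
        size_t width = (remaining + threads_left - 1) / threads_left;
        /* Snap the share to the preferred multiple. */
        width = round_up_to_multiple(width, multiple);
        /* Never hand out more than what remains. */
        if (width > remaining) width = remaining;
        out[used++] = width;
        remaining -= width;
    }
    return used;
}
```

Under these assumptions, partition_columns(1024, 10, 16, out) yields four chunks of 112 followed by six chunks of 96, so every chunk is a multiple of 16 and the widths still sum to 1024. An architecture without such a preference would pass a multiple of 1 and get the plain even split.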

75 Commits