OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Martin Kroeker	eb71d61c7c	Expose CBLAS interface to BLAS extensions iXamin	2018-06-23 13:31:09 +02:00
Martin Kroeker	9cf22b7d91	Build cblas_iXamin interfaces	2018-06-23 13:27:30 +02:00
Martin Kroeker	cc66743b66	Merge pull request #1634 from oon3m0oo/develop Fix data races reported by TSAN.	2018-06-21 21:01:03 +02:00
oon3m0oo	2aa0a5804e	Use BLAS rather than CBLAS in test_fork.c (#1626) This is handy for people not using lapack.	2018-06-21 18:47:45 +02:00
Craig Donner	28c28ed275	Fix data races reported by TSAN.	2018-06-21 16:41:02 +01:00
oon3m0oo	a399d00425	Further improvements to memory.c. (#1625) - Compiler TLS is now used only when the compiler supports it - If compiler TLS is unsupported, we use platform-specific TLS - Only one variable (an index) is now in TLS - We only access TLS once per alloc, and never when freeing - Allocation / release info is now stored within the allocation itself, by over-allocating; this saves having external structures do the bookkeeping, and reduces some of the redundant data that was being stored (such as addresses) - We never hit the alloc lock when not using SMP or when using OpenMP (that was my fault) - Now that there are fewer tracking structures I think this is a bit easier to read than before	2018-06-20 22:04:03 +02:00
Martin Kroeker	f66b9c8826	Merge pull request #1630 from martin-frbg/x86-march Add -march=skylake-avx512 to flags if target is skylake x	2018-06-20 21:51:57 +02:00
Martin Kroeker	2946c46024	Merge pull request #1631 from oon3m0oo/stack Avoid declaring arrays of size 0 when making large stack allocations.	2018-06-20 21:51:38 +02:00
Craig Donner	05978528c3	Avoid declaring arrays of size 0 when making large stack allocations.	2018-06-20 17:03:18 +01:00
Martin Kroeker	ef6f0b645e	Merge pull request #1629 from martin-frbg/issue1628 Make gfortran link libomp for clang in the tests; avoid two typical gotchas with NOFORTRAN	2018-06-20 16:41:13 +02:00
Martin Kroeker	0c5b7b400b	Add -march=skylake-avx512 to flags if target is skylake x	2018-06-20 15:16:19 +02:00
Martin Kroeker	952541e840	Need to use filter-out to handle NOFORTRAN not set	2018-06-20 13:20:30 +02:00
Martin Kroeker	9369d3e6e5	Modify NOFORTRAN tests to always check the value; fix rewriting of NO_FORTRAN	2018-06-19 23:28:06 +02:00
Martin Kroeker	10b70c904d	Handle erroneous user settings NOFORTRAN=0 and NO_FORTRAN	2018-06-19 20:53:19 +02:00
Martin Kroeker	6a5ab083b7	Handle special case of gfortran+clang+OpenMP	2018-06-19 20:47:33 +02:00
Martin Kroeker	1f9e4f3193	Handle special case of gfortran+clang+OpenMP	2018-06-19 20:46:36 +02:00
Martin Kroeker	5a6a2bed9a	Merge pull request #1623 from fenrus75/fast-thread Initialize only the required subset of the jobs array, fix barriers and improve switch ratio on SkylakeX and Haswell. For issue #1622	2018-06-18 09:02:40 +02:00
Martin Kroeker	2d8cc7193a	Support upcoming Intel Cannon Lake CPUs as Skylake X (#1621) * Support upcoming Cannon Lake as Skylake X	2018-06-17 23:38:14 +02:00
Arjan van de Ven	2ddc96c9e5	make WMB / MB safer on x86-64 make it so that if (foo) RMB; else MB; is always done correctly and without syntax surprises	2018-06-17 18:06:24 +00:00
Arjan van de Ven	7e39ffe113	On x86-64, make MB/WMB compiler barriers While on x86(64) one does not normally need full memory barriers, it's good practice to at least use compiler barriers for places where on other architectures memory barriers are used; this prevents the compiler from over-optimizing.	2018-06-17 17:53:15 +00:00
Arjan van de Ven	73de17664d	Add missing barriers in gemm scheduler a few places in the gemm scheduler code were missing barriers; the code likely worked OK due to heavy use of volatile / _Atomic but there's no reason to get this incorrect	2018-06-17 17:50:43 +00:00
Arjan van de Ven	6eb4b9ae7c	Tune HASWELL SWITCH_RATIO as well Similar to the SKYLAKEX patch, 32 seems to work best (much better than 4 or 16) Before (4) Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 15554.3 7.2 0.2% 30353.8 3.7 0.3% 64 x 64 30346.8 8.7 1.6% 63495.0 4.1 -0.1% 65 x 65 81668.1 3.4 -123.3% 82705.2 3.3 -21.2% 80 x 80 105045.9 4.9 -95.5% 115226.0 4.5 -2.2% 96 x 96 152461.2 5.8 -74.3% 148156.3 6.0 16.4% 112 x 112 188505.2 7.5 -42.2% 171187.3 8.2 36.4% 128 x 128 257884.0 8.1 -39.5% 224764.8 9.3 46.0% Intermediate (16) Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 15565.7 7.2 0.2% 30378.9 3.7 0.2% 64 x 64 30430.2 8.7 1.3% 63046.4 4.2 0.6% 65 x 65 27306.0 10.1 25.3% 38879.2 7.1 43.0% 80 x 80 51008.7 10.1 5.1% 61007.6 8.4 45.9% 96 x 96 70856.7 12.5 19.0% 83403.1 10.6 53.0% 112 x 112 84769.9 16.6 36.0% 99920.1 14.1 62.9% 128 x 128 84213.2 25.0 54.5% 113024.2 18.6 72.8% After (32) Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 15537.3 7.2 0.3% 30537.0 3.6 -0.3% 64 x 64 30352.7 8.7 1.6% 62597.8 4.2 1.3% 65 x 65 36857.0 7.5 -0.8% 56167.6 4.9 17.7% 80 x 80 42552.6 12.1 20.8% 69536.7 7.4 38.3% 96 x 96 52101.5 17.1 40.5% 91016.1 9.7 48.7% 112 x 112 63853.7 22.1 51.8% 110507.4 12.7 58.9% 128 x 128 73966.1 28.4 60.0% 163146.4 12.9 60.8%	2018-06-17 17:08:36 +00:00
Arjan van de Ven	5c6f008365	Tune param.h for SkylakeX param.h defines a per-platform SWITCH_RATIO, which is used as a measure for how fine grained the blocks for gemm need to be split up. Many platforms define this to 4. The reality is that the gemm low level implementation for SkylakeX likes bigger blocks due to the nature of SIMD... by tuning the SWITCH_RATIO to 32 the threading performance improves significantly: Before Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 10756.0 10.5 -0.5% 18296.7 6.1 -1.7% 64 x 64 20490.0 12.9 1.4% 40615.0 6.5 0.0% 65 x 65 83528.3 3.3 -210.9% 96319.0 2.9 -83.3% 80 x 80 101453.5 5.1 -166.3% 128021.7 4.0 -76.6% 96 x 96 149795.1 5.9 -143.1% 168059.4 5.3 -47.4% 112 x 112 191481.2 7.3 -105.8% 204165.0 6.9 -14.6% 128 x 128 265019.2 7.9 -99.0% 272006.4 7.7 -5.3% After Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 10666.3 10.6 0.4% 18236.9 6.2 -1.4% 64 x 64 20410.1 13.0 1.8% 39925.8 6.6 1.7% 65 x 65 34983.0 7.9 -30.2% 51494.6 5.4 2.0% 80 x 80 39769.1 13.0 -4.4% 63805.2 8.1 12.0% 96 x 96 45169.6 19.7 26.7% 80065.8 11.1 29.8% 112 x 112 57026.1 24.7 38.7% 99535.5 14.2 44.1% 128 x 128 64789.8 32.5 51.3% 117407.2 17.9 54.6% With this change, threading starts to be a win already at 96x96	2018-06-17 15:47:50 +00:00
Arjan van de Ven	d148ec4ea1	Don't use _Atomic for jobs sometimes... The use of _Atomic leads to really bad code generation in the compiler (on x86, you get 2 "mfence" memory barriers around each access with gcc8, despite x86 being ordered and cache coherent). But there's a fallback in the code that just uses volatile which is more than plenty in practice. If we're nervous about cross thread synchronization for these variables, we should make the YIELD function be a compiler/memory barrier instead. performance before (after last commit) Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 10630.0 10.6 0.7% 18112.8 6.2 -0.7% 64 x 64 20374.8 13.0 1.9% 40487.0 6.5 0.4% 65 x 65 141955.2 1.9 -428.3% 146708.8 1.9 -179.2% 80 x 80 178921.1 2.9 -369.6% 186032.7 2.8 -156.6% 96 x 96 205436.2 4.3 -233.4% 224513.1 3.9 -97.0% 112 x 112 244408.2 5.8 -162.7% 262158.7 5.4 -47.1% 128 x 128 321334.5 6.5 -141.3% 333829.0 6.3 -29.2% Performance with this patch (roughly a 2x improvement): Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 10756.0 10.5 -0.5% 18296.7 6.1 -1.7% 64 x 64 20490.0 12.9 1.4% 40615.0 6.5 0.0% 65 x 65 83528.3 3.3 -210.9% 96319.0 2.9 -83.3% 80 x 80 101453.5 5.1 -166.3% 128021.7 4.0 -76.6% 96 x 96 149795.1 5.9 -143.1% 168059.4 5.3 -47.4% 112 x 112 191481.2 7.3 -105.8% 204165.0 6.9 -14.6% 128 x 128 265019.2 7.9 -99.0% 272006.4 7.7 -5.3%	2018-06-17 15:39:15 +00:00
Arjan van de Ven	9e162146a9	Only initialize the part of the jobs array that will get used The jobs array is getting initialized in O(compiled cpus^2) complexity. Distros and people with bigger systems will use pretty high values (128 or 256 or more) for this value, leading to interesting bubbles in performance. Baseline (single threaded performance) gets roughly 13 - 15 multiplications per cycle in the interesting range (threading kicks in at 65x65 mult by 65x65). The hardware is capable of 32 multiplications per cycle theoretically. Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 10703.9 10.6 0.0% 17990.6 6.3 0.0% 64 x 64 20778.4 12.8 0.0% 40629.2 6.5 0.0% 65 x 65 26869.9 10.3 0.0% 52545.7 5.3 0.0% 80 x 80 38104.5 13.5 0.0% 72492.7 7.1 0.0% 96 x 96 61626.4 14.4 0.0% 113983.8 7.8 0.0% 112 x 112 91803.8 15.3 0.0% 180987.3 7.8 0.0% 128 x 128 133161.4 15.8 0.0% 258374.3 8.1 0.0% When threading is turned on TARGET=SKYLAKEX F_COMPILER=GFORTRAN SHARED=1 DYNAMIC_THREADS=1 USE_OPENMP=0 NUM_THREADS=128 Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 10725.9 10.5 -0.2% 18134.9 6.2 -0.8% 64 x 64 20500.6 12.9 1.3% 40929.1 6.5 -0.7% 65 x 65 2040832.1 0.1 -7495.2% 2097633.6 0.1 -3892.0% 80 x 80 2063129.1 0.2 -5314.4% 2119925.2 0.2 -2824.3% 96 x 96 2070374.5 0.4 -3259.6% 2173604.4 0.4 -1806.9% 112 x 112 2111721.5 0.7 -2169.6% 2263330.8 0.6 -1170.0% 128 x 128 2276181.5 0.9 -1609.3% 2377228.9 0.9 -820.1% There is a deep deep cliff once you hit 65x65 With this patch Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 10630.0 10.6 0.7% 18112.8 6.2 -0.7% 64 x 64 20374.8 13.0 1.9% 40487.0 6.5 0.4% 65 x 65 141955.2 1.9 -428.3% 146708.8 1.9 -179.2% 80 x 80 178921.1 2.9 -369.6% 186032.7 2.8 -156.6% 96 x 96 205436.2 4.3 -233.4% 224513.1 3.9 -97.0% 112 x 112 244408.2 5.8 -162.7% 262158.7 5.4 -47.1% 128 x 128 321334.5 6.5 -141.3% 333829.0 6.3 -29.2% The cliff is very significantly reduced. (more to follow)	2018-06-17 15:32:03 +00:00
Martin Kroeker	47bf0dba8f	Add build-time option for OMP scheduler; document MULTITHREAD_THRESHOLD range (#1620) * Allow choosing the OpenMP scheduler and add range hint for GEMM_MULTITHREAD_THRESHOLD * Amended description of GEMM_MULTITHREAD_THRESHOLD to reflect #742 making it track floating point operations rather than matrix size	2018-06-15 11:25:05 +02:00
Martin Kroeker	12603b7dbb	Merge pull request #1618 from oon3m0oo/less_locking Remove the need for most locking in memory.c.	2018-06-15 00:10:29 +02:00
Craig Donner	bf40f806ef	Remove the need for most locking in memory.c. Using thread local storage for tracking memory allocations means that threads no longer have to lock at all when doing memory allocations / frees. This particularly helps the gemm driver since it does an allocation per invocation. Even without threading at all, this helps, since even calling a lock with no contention has a cost: Before this change, no threading: ``` ---------------------------------------------------- Benchmark Time CPU Iterations ---------------------------------------------------- BM_SGEMM/4 102 ns 102 ns 13504412 BM_SGEMM/6 175 ns 175 ns 7997580 BM_SGEMM/8 205 ns 205 ns 6842073 BM_SGEMM/10 266 ns 266 ns 5294919 BM_SGEMM/16 478 ns 478 ns 2963441 BM_SGEMM/20 690 ns 690 ns 2144755 BM_SGEMM/32 1906 ns 1906 ns 716981 BM_SGEMM/40 2983 ns 2983 ns 473218 BM_SGEMM/64 9421 ns 9422 ns 148450 BM_SGEMM/72 12630 ns 12631 ns 112105 BM_SGEMM/80 15845 ns 15846 ns 89118 BM_SGEMM/90 25675 ns 25676 ns 54332 BM_SGEMM/100 29864 ns 29865 ns 47120 BM_SGEMM/112 37841 ns 37842 ns 36717 BM_SGEMM/128 56531 ns 56532 ns 25361 BM_SGEMM/140 75886 ns 75888 ns 18143 BM_SGEMM/150 98493 ns 98496 ns 14299 BM_SGEMM/160 102620 ns 102622 ns 13381 BM_SGEMM/170 135169 ns 135173 ns 10231 BM_SGEMM/180 146170 ns 146172 ns 9535 BM_SGEMM/189 190226 ns 190231 ns 7397 BM_SGEMM/200 194513 ns 194519 ns 7210 BM_SGEMM/256 396561 ns 396573 ns 3531 ``` with this change: ``` ---------------------------------------------------- Benchmark Time CPU Iterations ---------------------------------------------------- BM_SGEMM/4 95 ns 95 ns 14500387 BM_SGEMM/6 166 ns 166 ns 8381763 BM_SGEMM/8 196 ns 196 ns 7277044 BM_SGEMM/10 256 ns 256 ns 5515721 BM_SGEMM/16 463 ns 463 ns 3025197 BM_SGEMM/20 636 ns 636 ns 2070213 BM_SGEMM/32 1885 ns 1885 ns 739444 BM_SGEMM/40 2969 ns 2969 ns 472152 BM_SGEMM/64 9371 ns 9372 ns 148932 BM_SGEMM/72 12431 ns 12431 ns 112919 BM_SGEMM/80 15615 ns 15616 ns 89978 BM_SGEMM/90 25397 ns 25398 ns 55041 BM_SGEMM/100 
29445 ns 29446 ns 47540 BM_SGEMM/112 37530 ns 37531 ns 37286 BM_SGEMM/128 55373 ns 55375 ns 25277 BM_SGEMM/140 76241 ns 76241 ns 18259 BM_SGEMM/150 102196 ns 102200 ns 13736 BM_SGEMM/160 101521 ns 101525 ns 13556 BM_SGEMM/170 136182 ns 136184 ns 10567 BM_SGEMM/180 146861 ns 146864 ns 9035 BM_SGEMM/189 192632 ns 192632 ns 7231 BM_SGEMM/200 198547 ns 198555 ns 6995 BM_SGEMM/256 392316 ns 392330 ns 3539 ``` Before, when built with USE_THREAD=1, GEMM_MULTITHREAD_THRESHOLD = 4, the cost of small matrix operations was overshadowed by thread locking (look smaller than 32) even when not explicitly spawning threads: ``` ---------------------------------------------------- Benchmark Time CPU Iterations ---------------------------------------------------- BM_SGEMM/4 328 ns 328 ns 4170562 BM_SGEMM/6 396 ns 396 ns 3536400 BM_SGEMM/8 418 ns 418 ns 3330102 BM_SGEMM/10 491 ns 491 ns 2863047 BM_SGEMM/16 710 ns 710 ns 2028314 BM_SGEMM/20 871 ns 871 ns 1581546 BM_SGEMM/32 2132 ns 2132 ns 657089 BM_SGEMM/40 3197 ns 3196 ns 437969 BM_SGEMM/64 9645 ns 9645 ns 144987 BM_SGEMM/72 35064 ns 32881 ns 50264 BM_SGEMM/80 37661 ns 35787 ns 42080 BM_SGEMM/90 36507 ns 36077 ns 40091 BM_SGEMM/100 32513 ns 31850 ns 48607 BM_SGEMM/112 41742 ns 41207 ns 37273 BM_SGEMM/128 67211 ns 65095 ns 21933 BM_SGEMM/140 68263 ns 67943 ns 19245 BM_SGEMM/150 121854 ns 115439 ns 10660 BM_SGEMM/160 116826 ns 115539 ns 10000 BM_SGEMM/170 126566 ns 122798 ns 11960 BM_SGEMM/180 130088 ns 127292 ns 11503 BM_SGEMM/189 120309 ns 116634 ns 13162 BM_SGEMM/200 114559 ns 110993 ns 10000 BM_SGEMM/256 217063 ns 207806 ns 6417 ``` and after, it's gone (note this includes my other change which reduces calls to num_cpu_avail): ``` ---------------------------------------------------- Benchmark Time CPU Iterations ---------------------------------------------------- BM_SGEMM/4 95 ns 95 ns 12347650 BM_SGEMM/6 166 ns 166 ns 8259683 BM_SGEMM/8 193 ns 193 ns 7162210 BM_SGEMM/10 258 ns 258 ns 5415657 BM_SGEMM/16 471 ns 471 ns 2981009 
BM_SGEMM/20 666 ns 666 ns 2148002 BM_SGEMM/32 1903 ns 1903 ns 738245 BM_SGEMM/40 2969 ns 2969 ns 473239 BM_SGEMM/64 9440 ns 9440 ns 148442 BM_SGEMM/72 37239 ns 33330 ns 46813 BM_SGEMM/80 57350 ns 55949 ns 32251 BM_SGEMM/90 36275 ns 36249 ns 42259 BM_SGEMM/100 31111 ns 31008 ns 45270 BM_SGEMM/112 43782 ns 40912 ns 34749 BM_SGEMM/128 67375 ns 64406 ns 22443 BM_SGEMM/140 76389 ns 67003 ns 21430 BM_SGEMM/150 72952 ns 71830 ns 19793 BM_SGEMM/160 97039 ns 96858 ns 11498 BM_SGEMM/170 123272 ns 122007 ns 11855 BM_SGEMM/180 126828 ns 126505 ns 11567 BM_SGEMM/189 115179 ns 114665 ns 11044 BM_SGEMM/200 89289 ns 87259 ns 16147 BM_SGEMM/256 226252 ns 222677 ns 7375 ``` I've also tested this with ThreadSanitizer and found no data races during execution. I'm not sure why 200 is always faster than its neighbors; we must be hitting some optimal cache size or something.	2018-06-14 16:54:58 +01:00
Martin Kroeker	ed682a4a0c	Merge pull request #1619 from martin-frbg/issue1580 Update OSX deployment target to 10.8	2018-06-14 17:48:51 +02:00
Martin Kroeker	fcb77ab129	Update OSX deployment target to 10.8 fixes #1580	2018-06-14 16:57:58 +02:00
Martin Kroeker	26e1cfb653	Merge pull request #1607 from martin-frbg/dynarch Move some x86_64 DYNAMIC_ARCH targets to new DYNAMIC_OLDER option	2018-06-14 16:52:55 +02:00
Martin Kroeker	c628c6fa59	Merge pull request #1612 from oon3m0oo/cpus Fixed a few more unnecessary calls to num_cpu_avail.	2018-06-14 16:51:31 +02:00
Martin Kroeker	67d81ab49d	Merge pull request #1609 from martin-frbg/issue1529 Create OpenBLASConfig.cmake in cmake builds as well	2018-06-12 23:00:24 +02:00
Martin Kroeker	2f957947a6	Merge pull request #1613 from xianyi/revert-1600-noyield Revert "Use usleep instead of sched_yield by default"	2018-06-11 17:14:49 +02:00
Martin Kroeker	de8fff671d	Revert "Use usleep instead of sched_yield by default"	2018-06-11 17:05:27 +02:00
Martin Kroeker	6f71c0fce4	Return a somewhat sane default value for L2 cache size if cpuid retur… (#1611) * Return a somewhat sane default value for L2 cache size if cpuid returned something unexpected Fixes #1610, the KVM hypervisor on Google Chromebooks returning zero for CPUID 0x80000006, causing DYNAMIC_ARCH builds of OpenBLAS to hang	2018-06-11 13:26:19 +02:00
Craig Donner	c2545b0fd6	Fixed a few more unnecessary calls to num_cpu_avail. I don't have as many benchmarks for these as for gemm, but it should still make a difference for small matrices.	2018-06-11 10:17:16 +01:00
Martin Kroeker	e65f451409	include CMakePackageConfigHelpers	2018-06-10 15:09:43 +02:00
Martin Kroeker	02634b549b	Add template for OpenBLASConfig.cmake	2018-06-10 09:25:46 +02:00
Martin Kroeker	0bea6bb9e7	Create OpenBLASConfig.cmake from cmake as well	2018-06-10 09:24:37 +02:00
Martin Kroeker	3313e4b946	Merge pull request #1608 from martin-frbg/issue874 Enable parallel make on MS Windows by default	2018-06-09 19:57:33 +02:00
Martin Kroeker	e9cd11768c	Enable parallel make on MS Windows by default fixes #874	2018-06-09 17:54:36 +02:00
Martin Kroeker	63f7395fb4	Move some DYNAMIC_ARCH targets to new DYNAMIC_OLDER option	2018-06-09 16:31:38 +02:00
Martin Kroeker	1cbd8f3ae4	Move some DYNAMIC_ARCH targets to new DYNAMIC_OLDER option	2018-06-09 16:30:46 +02:00
Martin Kroeker	6c2d90ba77	Move some DYNAMIC_ARCH targets to new DYNAMIC_OLDER option	2018-06-09 16:29:17 +02:00
Martin Kroeker	0297b3211a	Merge pull request #1605 from oon3m0oo/develop Improve performance of GEMM for small matrices when SMP is defined.	2018-06-09 12:42:34 +02:00
Craig Donner	66316b9f4c	Improve performance of GEMM for small matrices when SMP is defined. Always checking num_cpu_avail() regardless of whether threading will actually be used adds noticeable overhead for small matrices. Most other uses of num_cpu_avail() do so only if threading will be used, so do the same here.	2018-06-07 15:29:13 +01:00
Martin Kroeker	6adc4b7b36	Merge pull request #1601 from martin-frbg/zaxpy Use a single thread for small input size in zaxpy	2018-06-07 14:09:58 +02:00
Martin Kroeker	2ade0ef085	Merge pull request #1600 from martin-frbg/noyield Use usleep instead of sched_yield by default	2018-06-07 12:42:00 +02:00
Martin Kroeker	e8880c1699	Use a single thread for small input size copies daxpy improvement from #27, see #1560	2018-06-07 10:26:55 +02:00
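
Commits 2ddc96c9e5 and 7e39ffe113 above describe turning MB/WMB into compiler barriers on x86-64 and making them safe in constructs such as `if (foo) RMB; else MB;`. The following is a minimal sketch of that general technique, not the definitions actually used in OpenBLAS: the `SKETCH_*` macro names and the `publish` helper are hypothetical, and the empty `asm` with a `"memory"` clobber assumes a GCC- or Clang-compatible compiler.

```c
#include <assert.h>

/* Hypothetical names for illustration; not OpenBLAS's real macros. */

/* Compiler-only barrier (GCC/Clang extended asm): emits no instruction,
   but the "memory" clobber keeps the compiler from reordering or caching
   memory accesses across this point. */
#define SKETCH_COMPILER_BARRIER() __asm__ __volatile__("" : : : "memory")

/* Wrapping the body in do { ... } while (0) makes each macro expand to
   exactly one statement that still needs its trailing semicolon, so it
   can sit in an unbraced if/else without changing how the else binds. */
#define SKETCH_MB  do { SKETCH_COMPILER_BARRIER(); } while (0)
#define SKETCH_WMB do { SKETCH_COMPILER_BARRIER(); } while (0)
#define SKETCH_RMB do { SKETCH_COMPILER_BARRIER(); } while (0)

static int shared_flag = 0;

/* Returns 1 when the read-barrier branch was taken, 2 otherwise. */
static int publish(int read_only)
{
    int taken;

    if (read_only)
        SKETCH_RMB;   /* one statement, so the else below pairs with this if */
    else
        SKETCH_MB;

    taken = read_only ? 1 : 2;
    shared_flag = taken;
    SKETCH_WMB;       /* keep the store above from being sunk past this point */
    return taken;
}
```

Calling `publish(1)` takes the `SKETCH_RMB` branch and `publish(0)` takes the `SKETCH_MB` branch. If a macro instead expanded to a braced block or to several statements, the semicolon after it would either make the `else` a syntax error or leave part of the expansion outside the `if`.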

3414 Commits