OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Martin Kroeker	24e344038d	Merge pull request #1654 from martin-frbg/avx512check Add compiler option to avx512 test and hide test output	2018-07-01 01:17:03 +02:00
Martin Kroeker	4e9c34018e	Fix apparent off-by-one error in calculation of MAX_ALLOCATING_THREADS fixes #1641	2018-06-30 23:57:50 +02:00
Martin Kroeker	f5243e8e1f	Add compiler option to avx512 test and hide test output	2018-06-30 23:47:44 +02:00
Martin Kroeker	ba8388cee0	Merge pull request #1651 from martin-frbg/avx512-nodgemm Disable the 16x2 DTRMM kernel on SkylakeX as well	2018-06-30 17:48:03 +02:00
Martin Kroeker	6e54b0a027	Disable the 16x2 DTRMM kernel on SkylakeX as well	2018-06-30 17:31:06 +02:00
Martin Kroeker	40c8cbc3bf	Merge pull request #1650 from martin-frbg/avx512-nodgemm Disable the AVX512 DGEMM kernel for now	2018-06-30 13:05:46 +02:00
Martin Kroeker	d3c9eb4c7d	Merge pull request #1639 from martin-frbg/dyn_list Add DYNAMIC_LIST option for user-defined list of dynamic targets	2018-06-30 13:05:30 +02:00
Martin Kroeker	f0a8dc2eec	Disable the AVX512 DGEMM kernel for now due to #1643	2018-06-30 11:34:48 +02:00
Martin Kroeker	cc92257ea6	Update Makefile	2018-06-27 00:09:21 +02:00
Martin Kroeker	2aba1b1658	Merge branch 'develop' into nofort	2018-06-27 00:07:32 +02:00
Martin Kroeker	8396e9e777	Handle NOFORTRAN=0	2018-06-27 00:00:27 +02:00
Martin Kroeker	bfad307ed7	Merge pull request #1647 from martin-frbg/armv7-dot Remove premature exits from ARMV7 xdot codes	2018-06-26 22:27:30 +02:00
Martin Kroeker	b83e4c60c7	Remove premature exit for INC_X or INC_Y zero	2018-06-26 20:46:42 +02:00
Martin Kroeker	e344db269b	Remove premature exit for INC_X or INC_Y zero	2018-06-26 20:45:57 +02:00
Martin Kroeker	545b82efd3	Remove premature exit for INC_X or INC_Y zero	2018-06-26 20:45:00 +02:00
Martin Kroeker	e322a951fe	Remove premature exit for INC_X or INC_Y zero	2018-06-26 20:44:13 +02:00
Martin Kroeker	ff2f171036	Merge pull request #1644 from martin-frbg/revert-filterout Revert changes to NOFORTRAN handling in Makefile	2018-06-26 10:15:15 +02:00
Martin Kroeker	092175cfec	Revert changes to NOFORTRAN handling from `952541e`	2018-06-26 08:09:52 +02:00
Martin Kroeker	750162a05f	Try gradual fallback for cores not in the dynamic core list	2018-06-25 21:02:31 +02:00
Martin Kroeker	e6d93f20f1	Merge pull request #2 from martin-frbg/develop merge develop	2018-06-25 20:48:10 +02:00
Martin Kroeker	c38c65eb65	Merge pull request #1 from xianyi/develop Merge xianyi:develop into develop	2018-06-25 20:45:56 +02:00
Martin Kroeker	ce3651516f	Merge pull request #1642 from oon3m0oo/develop Rewrite &= -> = and simplify the initial blocking phase.	2018-06-25 19:23:40 +02:00
Craig Donner	0144068537	Rewrite &= -> = and simplify the initial blocking phase.	2018-06-25 15:08:55 +01:00
Martin Kroeker	1833a67071	Add support for a user-defined list of dynamic targets	2018-06-23 19:42:15 +02:00
Martin Kroeker	0b2b83d9ed	Add support for a user-defined list of dynamic targets	2018-06-23 19:41:32 +02:00
Martin Kroeker	62cf769aa6	Merge pull request #1638 from martin-frbg/issue1637 Expose the CBLAS interface to the IxAMIN functions and have make build it	2018-06-23 15:01:02 +02:00
Martin Kroeker	eb71d61c7c	Expose CBLAS interface to BLAS extensions iXamin	2018-06-23 13:31:09 +02:00
Martin Kroeker	9cf22b7d91	Build cblas_iXamin interfaces	2018-06-23 13:27:30 +02:00
Martin Kroeker	cc66743b66	Merge pull request #1634 from oon3m0oo/develop Fix data races reported by TSAN.	2018-06-21 21:01:03 +02:00
oon3m0oo	2aa0a5804e	Use BLAS rather than CBLAS in test_fork.c (#1626 ) This is handy for people not using lapack.	2018-06-21 18:47:45 +02:00
Craig Donner	28c28ed275	Fix data races reported by TSAN.	2018-06-21 16:41:02 +01:00
oon3m0oo	a399d00425	Further improvements to memory.c. (#1625 ) - Compiler TLS is now used only used when the compiler supports it - If compiler TLS is unsupported, we use platform-specific TLS - Only one variable (an index) is now in TLS - We only access TLS once per alloc, and never when freeing - Allocation / release info is now stored within the allocation itself, by over-allocating; this saves having external structures do the bookkeeping, and reduces some of the redundant data that was being stored (such as addresses) - We never hit the alloc lock when not using SMP or when using OpenMP (that was my fault) - Now that there are fewer tracking structures I think this is a bit easier to read than before	2018-06-20 22:04:03 +02:00
Martin Kroeker	f66b9c8826	Merge pull request #1630 from martin-frbg/x86-march Add -march=skylake-avx512 to flags if target is skylake x	2018-06-20 21:51:57 +02:00
Martin Kroeker	2946c46024	Merge pull request #1631 from oon3m0oo/stack Avoid declaring arrays of size 0 when making large stack allocations.	2018-06-20 21:51:38 +02:00
Craig Donner	05978528c3	Avoid declaring arrays of size 0 when making large stack allocations.	2018-06-20 17:03:18 +01:00
Martin Kroeker	ef6f0b645e	Merge pull request #1629 from martin-frbg/issue1628 Make gfortran link libomp for clang in the tests; avoid two typical gotchas with NOFORTRAN	2018-06-20 16:41:13 +02:00
Martin Kroeker	0c5b7b400b	Add -march=skylake-avx512 to flags if target is skylake x	2018-06-20 15:16:19 +02:00
Martin Kroeker	952541e840	Need to use filter-out to handle NOFORTRAN not set	2018-06-20 13:20:30 +02:00
Martin Kroeker	9369d3e6e5	Modify NOFORTRAN tests to always check the value; fix rewriting of NO_FORTRAN	2018-06-19 23:28:06 +02:00
Martin Kroeker	10b70c904d	Handle erroneous user settings NOFORTRAN=0 and NO_FORTRAN	2018-06-19 20:53:19 +02:00
Martin Kroeker	6a5ab083b7	Handle special case of gfortran+clang+OpenMP	2018-06-19 20:47:33 +02:00
Martin Kroeker	1f9e4f3193	Handle special case of gfortran+clang+OpenMP	2018-06-19 20:46:36 +02:00
Martin Kroeker	5a6a2bed9a	Merge pull request #1623 from fenrus75/fast-thread Initialize only the required subset of the jobs array, fix barriers and improve switch ratio on SkylakeX and Haswell. For issue #1622	2018-06-18 09:02:40 +02:00
Martin Kroeker	2d8cc7193a	Support upcoming Intel Cannon Lake CPUs as Skylake X (#1621 ) * Support upcoming Cannon Lake as Skylake X	2018-06-17 23:38:14 +02:00
Arjan van de Ven	2ddc96c9e5	make WMB / MB safer on x86-64 make it so that if (foo) RMB; else MB; is always done correctly and without syntax surprises	2018-06-17 18:06:24 +00:00
Arjan van de Ven	7e39ffe113	On x86-64, make MB/WMB compiler barriers Whie on x86(64) one does not normally need full memory barriers, it's good practice to at least use compiler barriers for places where on other architectures memory barriers are used; this prevents the compiler from over-optimizing.	2018-06-17 17:53:15 +00:00
Arjan van de Ven	73de17664d	Add missing barriers in gemm scheduler a few places in the gemm scheduler code were missing barriers; the code likely worked OK due to heavy use of volatile / _Atomic but there's no reason to get this incorrect	2018-06-17 17:50:43 +00:00
Arjan van de Ven	6eb4b9ae7c	Tune HASWELL SWITCH_RATIO as well Similar to the SKYLAKEX patch, 32 seems to work best (much better than 4 or 16) Before (4) Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 15554.3 7.2 0.2% 30353.8 3.7 0.3% 64 x 64 30346.8 8.7 1.6% 63495.0 4.1 -0.1% 65 x 65 81668.1 3.4 -123.3% 82705.2 3.3 -21.2% 80 x 80 105045.9 4.9 -95.5% 115226.0 4.5 -2.2% 96 x 96 152461.2 5.8 -74.3% 148156.3 6.0 16.4% 112 x 112 188505.2 7.5 -42.2% 171187.3 8.2 36.4% 128 x 128 257884.0 8.1 -39.5% 224764.8 9.3 46.0% Intermediate (16) Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 15565.7 7.2 0.2% 30378.9 3.7 0.2% 64 x 64 30430.2 8.7 1.3% 63046.4 4.2 0.6% 65 x 65 27306.0 10.1 25.3% 38879.2 7.1 43.0% 80 x 80 51008.7 10.1 5.1% 61007.6 8.4 45.9% 96 x 96 70856.7 12.5 19.0% 83403.1 10.6 53.0% 112 x 112 84769.9 16.6 36.0% 99920.1 14.1 62.9% 128 x 128 84213.2 25.0 54.5% 113024.2 18.6 72.8% After (32) Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 15537.3 7.2 0.3% 30537.0 3.6 -0.3% 64 x 64 30352.7 8.7 1.6% 62597.8 4.2 1.3% 65 x 65 36857.0 7.5 -0.8% 56167.6 4.9 17.7% 80 x 80 42552.6 12.1 20.8% 69536.7 7.4 38.3% 96 x 96 52101.5 17.1 40.5% 91016.1 9.7 48.7% 112 x 112 63853.7 22.1 51.8% 110507.4 12.7 58.9% 128 x 128 73966.1 28.4 60.0% 163146.4 12.9 60.8%	2018-06-17 17:08:36 +00:00
Arjan van de Ven	5c6f008365	Tune param.h for SkylakeX param.h defines a per-platform SWITCH_RATIO, which is used as a measure for how fine grained the blocks for gemm need to be split up. Many platforms define this to 4. The reality is that the gemm low level implementation for SkylakeX likes bigger blocks due to the nature of SIMD... by tuning the SWITCH_RATIO to 32 the threading performance improves significantly: Before Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 10756.0 10.5 -0.5% 18296.7 6.1 -1.7% 64 x 64 20490.0 12.9 1.4% 40615.0 6.5 0.0% 65 x 65 83528.3 3.3 -210.9% 96319.0 2.9 -83.3% 80 x 80 101453.5 5.1 -166.3% 128021.7 4.0 -76.6% 96 x 96 149795.1 5.9 -143.1% 168059.4 5.3 -47.4% 112 x 112 191481.2 7.3 -105.8% 204165.0 6.9 -14.6% 128 x 128 265019.2 7.9 -99.0% 272006.4 7.7 -5.3% After Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 10666.3 10.6 0.4% 18236.9 6.2 -1.4% 64 x 64 20410.1 13.0 1.8% 39925.8 6.6 1.7% 65 x 65 34983.0 7.9 -30.2% 51494.6 5.4 2.0% 80 x 80 39769.1 13.0 -4.4% 63805.2 8.1 12.0% 96 x 96 45169.6 19.7 26.7% 80065.8 11.1 29.8% 112 x 112 57026.1 24.7 38.7% 99535.5 14.2 44.1% 128 x 128 64789.8 32.5 51.3% 117407.2 17.9 54.6% With this change, threading starts to be a win already at 96x96	2018-06-17 15:47:50 +00:00
Arjan van de Ven	d148ec4ea1	Don't use _Atomic for jobs sometimes... The use of _Atomic leads to really bad code generation in the compiler (on x86, you get 2 "mfence" memory barriers around each access with gcc8, despite x86 being ordered and cache coherent). But there's a fallback in the code that just uses volatile which is more than plenty in practice. If we're nervous about cross thread synchronization for these variables, we should make the YIELD function be a compiler/memory barrier instead. performance before (after last commit) Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 10630.0 10.6 0.7% 18112.8 6.2 -0.7% 64 x 64 20374.8 13.0 1.9% 40487.0 6.5 0.4% 65 x 65 141955.2 1.9 -428.3% 146708.8 1.9 -179.2% 80 x 80 178921.1 2.9 -369.6% 186032.7 2.8 -156.6% 96 x 96 205436.2 4.3 -233.4% 224513.1 3.9 -97.0% 112 x 112 244408.2 5.8 -162.7% 262158.7 5.4 -47.1% 128 x 128 321334.5 6.5 -141.3% 333829.0 6.3 -29.2% Performance with this patch (roughly a 2x improvement): Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 10756.0 10.5 -0.5% 18296.7 6.1 -1.7% 64 x 64 20490.0 12.9 1.4% 40615.0 6.5 0.0% 65 x 65 83528.3 3.3 -210.9% 96319.0 2.9 -83.3% 80 x 80 101453.5 5.1 -166.3% 128021.7 4.0 -76.6% 96 x 96 149795.1 5.9 -143.1% 168059.4 5.3 -47.4% 112 x 112 191481.2 7.3 -105.8% 204165.0 6.9 -14.6% 128 x 128 265019.2 7.9 -99.0% 272006.4 7.7 -5.3%	2018-06-17 15:39:15 +00:00

... 6 7 8 9 10 ...

3390 Commits All Branches Search

3390 Commits

All Branches