OpenBLAS/driver/level3
Arjan van de Ven d148ec4ea1 Don't use _Atomic for jobs sometimes...
The use of _Atomic leads to really bad code generation in the compiler
(on x86, gcc8 emits 2 "mfence" memory barriers around each access, even though
x86 is strongly ordered and cache coherent). But there's a fallback in the code that
just uses volatile, which is plenty in practice.

If we're nervous about cross-thread synchronization for these variables, we should
make the YIELD function act as a compiler/memory barrier instead.
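
A minimal sketch of the idea above, not the actual OpenBLAS code: a volatile flag
stands in for the job state, and a yield-style macro doubles as a barrier. The names
YIELD_BARRIER, job_slot_t, producer, wait_for_job and run_demo are hypothetical and
do not appear in level3_thread.c. The sketch assumes GCC or Clang (for the inline asm
and __sync_synchronize builtin) and POSIX threads. On x86 it relies on the hardware's
strong ordering and only restrains the compiler; elsewhere it falls back to a full fence.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a YIELD macro that also acts as a barrier.
   On x86 an empty asm with a "memory" clobber emits no instruction but
   stops the compiler from caching or reordering memory accesses across it.
   On other architectures this sketch uses a full hardware fence instead. */
#if defined(__x86_64__) || defined(__i386__)
#define YIELD_BARRIER() __asm__ __volatile__("" ::: "memory")
#else
#define YIELD_BARRIER() __sync_synchronize()
#endif

/* Hypothetical job slot: a volatile flag rather than an _Atomic one. */
typedef struct {
    volatile long working; /* 0 = not ready, nonzero = payload published */
    long payload;
} job_slot_t;

/* Writes the payload, then raises the flag. */
static void *producer(void *arg) {
    job_slot_t *slot = arg;
    slot->payload = 42;
    YIELD_BARRIER(); /* keep the payload store ahead of the flag store */
    slot->working = 1;
    return NULL;
}

/* Spins until the flag is raised, then reads the payload. */
static long wait_for_job(job_slot_t *slot) {
    while (slot->working == 0) {
        YIELD_BARRIER();
    }
    YIELD_BARRIER(); /* keep the payload load after the flag load */
    return slot->payload;
}

/* Runs one producer thread and returns the payload seen by the waiter,
   or -1 if the thread could not be created. */
long run_demo(void) {
    job_slot_t slot = {0, 0};
    pthread_t tid;
    if (pthread_create(&tid, NULL, producer, &slot) != 0) {
        return -1;
    }
    long got = wait_for_job(&slot);
    pthread_join(tid, NULL);
    return got;
}
```

Calling run_demo() from a program built with -pthread should return the payload the
producer wrote. Note that this pattern is formally a data race under the C11 memory
model; it is shown only to illustrate the volatile-plus-barrier trade-off described above.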

Performance before this patch (after the last commit):

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10630.0   10.6       0.7%                             18112.8      6.2      -0.7%
  64 x 64               20374.8   13.0       1.9%                             40487.0      6.5       0.4%
  65 x 65              141955.2    1.9    -428.3%                            146708.8      1.9    -179.2%
  80 x 80              178921.1    2.9    -369.6%                            186032.7      2.8    -156.6%
  96 x 96              205436.2    4.3    -233.4%                            224513.1      3.9     -97.0%
 112 x 112             244408.2    5.8    -162.7%                            262158.7      5.4     -47.1%
 128 x 128             321334.5    6.5    -141.3%                            333829.0      6.3     -29.2%

Performance with this patch (roughly a 2x improvement):

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10756.0   10.5      -0.5%                             18296.7      6.1      -1.7%
  64 x 64               20490.0   12.9       1.4%                             40615.0      6.5       0.0%
  65 x 65               83528.3    3.3    -210.9%                             96319.0      2.9     -83.3%
  80 x 80              101453.5    5.1    -166.3%                            128021.7      4.0     -76.6%
  96 x 96              149795.1    5.9    -143.1%                            168059.4      5.3     -47.4%
 112 x 112             191481.2    7.3    -105.8%                            204165.0      6.9     -14.6%
 128 x 128             265019.2    7.9     -99.0%                            272006.4      7.7      -5.3%
2018-06-17 15:39:15 +00:00
CMakeLists.txt Fix threading usage in CMake: s/SMP/USE_THREAD/ 2017-08-19 15:07:42 +10:00
Makefile Work around name clash with Windows10's winnt.h 2018-05-31 13:26:00 +02:00
gemm.c Import GotoBLAS2 1.13 BSD version codes. 2011-01-24 14:54:24 +00:00
gemm3m.c Import GotoBLAS2 1.13 BSD version codes. 2011-01-24 14:54:24 +00:00
gemm3m_level3.c prepared driver/level3 functions for UNROLL values, that are not a power of two 2017-01-09 10:38:15 +01:00
gemm_thread_m.c Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
gemm_thread_mn.c Minor C code fixes in driver/ 2015-11-09 14:15:49 +05:30
gemm_thread_n.c Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
gemm_thread_variable.c Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
hemm3m_k.c Import GotoBLAS2 1.13 BSD version codes. 2011-01-24 14:54:24 +00:00
level3.c remove surplus parentheses to silence clang5 2018-01-01 20:56:26 +01:00
level3_gemm3m_thread.c Change _STDC_VERSION__ to __STDC_VERSION__ 2018-05-11 12:15:08 +08:00
level3_syr2k.c remove surplus parentheses to silence clang5 2018-01-01 20:56:26 +01:00
level3_syrk.c remove surplus parentheses to silence clang5 2018-01-01 20:56:26 +01:00
level3_syrk_threaded.c Change _STDC_VERSION__ to __STDC_VERSION__ 2018-05-11 12:15:08 +08:00
level3_thread.c Don't use _Atomic for jobs sometimes... 2018-06-17 15:39:15 +00:00
symm3m_k.c Import GotoBLAS2 1.13 BSD version codes. 2011-01-24 14:54:24 +00:00
symm_k.c Import GotoBLAS2 1.13 BSD version codes. 2011-01-24 14:54:24 +00:00
syr2k_k.c Changed a number of inline calls to use __inline. 2015-02-11 11:13:17 -06:00
syr2k_kernel.c Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
syrk_k.c Changed a number of inline calls to use __inline. 2015-02-11 11:13:17 -06:00
syrk_kernel.c prepared driver/level3 functions for UNROLL values, that are not a power of two 2017-01-09 10:38:15 +01:00
syrk_thread.c fixed syrk_thread.c taken from wernsaar 2017-07-06 17:30:12 +02:00
trmm_L.c optimizations for trmm 2014-07-25 10:00:23 +02:00
trmm_R.c Moved declarations to start of functions to satisfy MSVC C89 implementation. 2015-02-11 11:16:57 -06:00
trsm_L.c Moved declarations to start of functions to satisfy MSVC C89 implementation. 2015-02-11 11:16:57 -06:00
trsm_R.c Moved declarations to start of functions to satisfy MSVC C89 implementation. 2015-02-11 11:16:57 -06:00
zhemm_k.c Import GotoBLAS2 1.13 BSD version codes. 2011-01-24 14:54:24 +00:00
zher2k_k.c Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
zher2k_kernel.c prepared driver/level3 functions for UNROLL values, that are not a power of two 2017-01-09 10:38:15 +01:00
zherk_beta.c Import GotoBLAS2 1.13 BSD version codes. 2011-01-24 14:54:24 +00:00
zherk_k.c Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
zherk_kernel.c prepared driver/level3 functions for UNROLL values, that are not a power of two 2017-01-09 10:38:15 +01:00
zsyrk_beta.c Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00