The current gemm threading code can make very unfortunate choices, for
example on my 10 core system a 1024x1024x1024 matrix multiply ends up
chunking into blocks of 102... which is not a vector friendly size
and performance ends up horrible.
this patch adds a helper define where an architecture can specify
a preference for size multiples.
This is different from existing defines that are minimum sizes and such.
The performance increase with this patch for the 1024x1024x1024 sgemm
is 2.3x (!!)
a few places in the gemm scheduler code were missing barriers;
the code likely worked OK due to heavy use of volatile / _Atomic
but there's no reason to get this incorrect
The use of _Atomic leads to really bad code generation in the compiler
(on x86, you get 2 "mfence" memory barriers around each access with gcc8, despite
x86 being ordered and cache coherent). But there's a fallback in the code that
just uses volatile which is more than plenty in practice.
If we're nervous about cross thread synchronization for these variables, we should
make the YIELD function be a compiler/memory barrier instead.
performance before (after last commit)
Matrix SGEMM cycles MPC DGEMM cycles MPC
48 x 48 10630.0 10.6 0.7% 18112.8 6.2 -0.7%
64 x 64 20374.8 13.0 1.9% 40487.0 6.5 0.4%
65 x 65 141955.2 1.9 -428.3% 146708.8 1.9 -179.2%
80 x 80 178921.1 2.9 -369.6% 186032.7 2.8 -156.6%
96 x 96 205436.2 4.3 -233.4% 224513.1 3.9 -97.0%
112 x 112 244408.2 5.8 -162.7% 262158.7 5.4 -47.1%
128 x 128 321334.5 6.5 -141.3% 333829.0 6.3 -29.2%
Performance with this patch (roughly a 2x improvement):
Matrix SGEMM cycles MPC DGEMM cycles MPC
48 x 48 10756.0 10.5 -0.5% 18296.7 6.1 -1.7%
64 x 64 20490.0 12.9 1.4% 40615.0 6.5 0.0%
65 x 65 83528.3 3.3 -210.9% 96319.0 2.9 -83.3%
80 x 80 101453.5 5.1 -166.3% 128021.7 4.0 -76.6%
96 x 96 149795.1 5.9 -143.1% 168059.4 5.3 -47.4%
112 x 112 191481.2 7.3 -105.8% 204165.0 6.9 -14.6%
128 x 128 265019.2 7.9 -99.0% 272006.4 7.7 -5.3%
The jobs array is getting initialized in O(compiled cpus^2) complexity.
Distros and people with bigger systems will use pretty high values
(128 or 256 or more) for this value, leading to interesting bubbles
in performance.
Baseline (single threaded performance) gets roughly 13 - 15 multiplications per cycle
in the interesting range (threading kicks in at 65x65 mult by 65x65).
The hardware is capable of 32 multiplications per cycle theoretically.
Matrix SGEMM cycles MPC DGEMM cycles MPC
48 x 48 10703.9 10.6 0.0% 17990.6 6.3 0.0%
64 x 64 20778.4 12.8 0.0% 40629.2 6.5 0.0%
65 x 65 26869.9 10.3 0.0% 52545.7 5.3 0.0%
80 x 80 38104.5 13.5 0.0% 72492.7 7.1 0.0%
96 x 96 61626.4 14.4 0.0% 113983.8 7.8 0.0%
112 x 112 91803.8 15.3 0.0% 180987.3 7.8 0.0%
128 x 128 133161.4 15.8 0.0% 258374.3 8.1 0.0%
When threading is turned on
TARGET=SKYLAKEX F_COMPILER=GFORTRAN SHARED=1 DYNAMIC_THREADS=1 USE_OPENMP=0 NUM_THREADS=128
Matrix SGEMM cycles MPC DGEMM cycles MPC
48 x 48 10725.9 10.5 -0.2% 18134.9 6.2 -0.8%
64 x 64 20500.6 12.9 1.3% 40929.1 6.5 -0.7%
65 x 65 2040832.1 0.1 -7495.2% 2097633.6 0.1 -3892.0%
80 x 80 2063129.1 0.2 -5314.4% 2119925.2 0.2 -2824.3%
96 x 96 2070374.5 0.4 -3259.6% 2173604.4 0.4 -1806.9%
112 x 112 2111721.5 0.7 -2169.6% 2263330.8 0.6 -1170.0%
128 x 128 2276181.5 0.9 -1609.3% 2377228.9 0.9 -820.1%
There is a deep deep cliff once you hit 65x65
With this patch
Matrix SGEMM cycles MPC DGEMM cycles MPC
48 x 48 10630.0 10.6 0.7% 18112.8 6.2 -0.7%
64 x 64 20374.8 13.0 1.9% 40487.0 6.5 0.4%
65 x 65 141955.2 1.9 -428.3% 146708.8 1.9 -179.2%
80 x 80 178921.1 2.9 -369.6% 186032.7 2.8 -156.6%
96 x 96 205436.2 4.3 -233.4% 224513.1 3.9 -97.0%
112 x 112 244408.2 5.8 -162.7% 262158.7 5.4 -47.1%
128 x 128 321334.5 6.5 -141.3% 333829.0 6.3 -29.2%
The cliff is very significantly reduced.
(more to follow)
If OpenBLAS is built using add_subdirectory(OpenBlas) as part of another project
then the paths set by CMAKE_XXX_DIR are relative to the parent project
and not the OpenBLAS project.
Instead of generating separate object files for each permutation of
defines for a source file, GenerateNamedObjects now writes an entirely
new source file and inserts the defines as #define c statements.
This solves a problem I ran into with ar.exe where it was refusing to
link objects that had the same filename despite having different paths.