Running make DYNAMIC_ARCH=1 on POWER 8 BE with gcc10.2 version, gives
the following error due to the difference in UNROLL_M/N.
'No rule to make target 'dgemm_incopy_POWER10.o', needed by kernel'
Enable new build target platform -- COOPERLAKE. This target platform
supports all the SKYLAKEX supported ISAs + avx512bf16. So all the
SKYLAKEX specific kernels/drivers and related code are now extended
to be also active on COOPERLAKE. Besides, new BF16 related kernels
are active under this target.
We recently changed the register blocking for SGEMM on s390x to 16x4.
However, we did not adjust Q to a multiple of 16 and thus fell back to
the 8x4 kernel at each block's margin, without need. Adjust P and Q to
multiples of 16 to employ the faster 16x4 kernel for complete full-sized
blocks.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
This patch introduces new optimized version of SHGEMM kernel
using power10 Matrix-Multiply Assist (MMA) feature introduced in
POWER ISA v3.1. This patch makes use of new POWER10 compute instructions
for matrix multiplication operation.
Tested on simulator and there are no new test failures.
Change register blocking for SGEMM (and STRMM) on z14 from 8x4 to 16x4
by adjusting SGEMM_DEFAULT_UNROLL_M and choosing the appropriate copy
implementations. Actually make KERNEL.Z14 more flexible, so that the
change in param.h suffices. As a result, performance for SGEMM improves
by around 30% on z15.
On z14, FP SIMD instructions can operate on float-sized scalars in
vector registers, while z13 could do that for double-sized scalars only.
Thus, we can double the amount of elements of C that are held in
registers in an SGEMM kernel.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
As shown in #2538, default buffersizes on some platforms were smaller than required in memory.c
and the requirement could never be fulfilled for a calculated GEMM_R on PPC given the fomula used
There is currently no simple way to query cache sizes on ARMV8, so this takes the number of cores as a trivial indication if the target is a server-class device with a big cache, or just a single-board toy or smartphone.
fixes:
gemm.c: In function 'sgemm_':
../common_param.h:981:18: error: 'SGEMM_DEFAULT_P' undeclared (first use in this function)
#define SGEMM_P SGEMM_DEFAULT_P
^
OpenBLAS has a fancy algorithm for copying the input data while laying
it out in a more CPU friendly memory layout.
This is great for large matrixes; the cost of the copy is easily
ammortized by the gains from the better memory layout.
But for small matrixes (on CPUs that can do efficient unaligned loads) this
copy can be a net loss.
This patch adds (for SKYLAKEX initially) a "sgemm direct" mode, that bypasses
the whole copy machinary for ALPHA=1/BETA=0/... standard arguments,
for small matrixes only.
What is small? For the non-threaded case this has been measured to be
in the M*N*K = 28 * 512 * 512 range, while in the threaded case it's
less, around M*N*K = 1 * 512 * 512
ARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode
(which is not right because TX2 is ARMv8.1) as well as requiring a few
redundancies in the defines, making it harder to maintain and understand
what core has what. A few other minor issues were also fixed.
Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX,
ThunderX2, and XGene.
Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester.
A summary:
* Removed TX2 code from ARMv8 build, to make sure it is compatible with
all ARMv8 cores, not just v8.1. Also, the TX2 code has actually
harmed performance on big cores.
* Commoned up ARMv8 architectures' defines in params.h, to make sure
that all will benefit from ARMv8 settings, in addition to their own.
* Adding a few more cores, using ARMv8's include strategy, to benefit
from compiler optimisations using mtune. Also updated cache
information from the manuals, making sure we set good conservative
values by default. Removed Vulcan, as it's an alias to TX2.
* Auto-detecting most of those cores, but also updating the forced
compilation in getarch.c, to make sure the parameters are the same
whether compiled natively or forced arch.
Benefits:
* ARMv8 build is now guaranteed to work on all ARMv8 cores
* Improved performance for ARMv8 builds on some cores (A72, Falkor,
ThunderX1 and 2: up to 11%) over current develop
* Improved performance for *all* cores comparing to develop branch
before TX2's patch (9% ~ 36%)
* ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than
current develop's branch and 8% faster than deveop before tx2 patches
Issues:
* Regression from current develop branch for A53 (-12%) and A57 (-3%)
with ARMv8 builds, but still faster than before TX2's commit (+15%
and +24% respectively). This can be improved with a simplification of
TX2's code, to be done in future patches. At least the code is
guaranteed to be ARMv8.0 now.
Comments:
* CortexA57 builds are unchanged on A57 hardware from develop's branch,
which makes sense, as it's untouched.
* CortexA72 builds improve over A57 on A72 hardware, even if they're
using the same includes due to new compiler tunning in the makefile.
The current gemm threading code can make very unfortunate choices, for
example on my 10 core system a 1024x1024x1024 matrix multiply ends up
chunking into blocks of 102... which is not a vector friendly size
and performance ends up horrible.
this patch adds a helper define where an architecture can specify
a preference for size multiples.
This is different from existing defines that are minimum sizes and such.
The performance increase with this patch for the 1024x1024x1024 sgemm
is 2.3x (!!)
Currently the generic ARMV8 target uses C implementations
for many routines. Replace these with the neon implementations
written for THUNDERX2T99 target which are upto 6x faster for
certain routines.
param.h defines a per-platform SWITCH_RATIO, which is used as a measure for how fine
grained the blocks for gemm need to be split up. Many platforms define this to 4.
The reality is that the gemm low level implementation for SkylakeX likes bigger blocks
due to the nature of SIMD... by tuning the SWITCH_RATIO to 32 the threading performance
improves significantly:
Before
Matrix SGEMM cycles MPC DGEMM cycles MPC
48 x 48 10756.0 10.5 -0.5% 18296.7 6.1 -1.7%
64 x 64 20490.0 12.9 1.4% 40615.0 6.5 0.0%
65 x 65 83528.3 3.3 -210.9% 96319.0 2.9 -83.3%
80 x 80 101453.5 5.1 -166.3% 128021.7 4.0 -76.6%
96 x 96 149795.1 5.9 -143.1% 168059.4 5.3 -47.4%
112 x 112 191481.2 7.3 -105.8% 204165.0 6.9 -14.6%
128 x 128 265019.2 7.9 -99.0% 272006.4 7.7 -5.3%
After
Matrix SGEMM cycles MPC DGEMM cycles MPC
48 x 48 10666.3 10.6 0.4% 18236.9 6.2 -1.4%
64 x 64 20410.1 13.0 1.8% 39925.8 6.6 1.7%
65 x 65 34983.0 7.9 -30.2% 51494.6 5.4 2.0%
80 x 80 39769.1 13.0 -4.4% 63805.2 8.1 12.0%
96 x 96 45169.6 19.7 26.7% 80065.8 11.1 29.8%
112 x 112 57026.1 24.7 38.7% 99535.5 14.2 44.1%
128 x 128 64789.8 32.5 51.3% 117407.2 17.9 54.6%
With this change, threading starts to be a win already at 96x96
This patch adds the basic infrastructure for adding the SkylakeX (Intel Skylake server)
target. The SkylakeX target will use the AVX512 (AVX512VL level) instruction set,
which brings 2 basic things:
1) 512 bit wide SIMD (2x width of AVX2)
2) 32 SIMD registers (2x the number on AVX2)
This initial patch only contains a trivial transofrmation of the Haswell SGEMM kernel
to AVX512VL; more will follow later but this patch aims to get the infrastructure
in place for this "later".
Full performance tuning has not been done yet; with more registers and wider SIMD
it's in theory possible to retune the kernels but even without that there's an
interesting enough performance increase (30-40% range) with just this change.