Move "direct SGEMM" functionality out of the SkylakeX SGEMM kernel and make it available
(on x86_64 targets only for now) in DYNAMIC_ARCH builds
* Add sgemm_direct targets in the kernel Makefile.L3 and CMakeLists.txt
* Add direct_sgemm functions to the gotoblas struct in common_param.h
* Move sgemm_direct_performant helper to separate file
* Update gemm.c to macros for sgemm_direct to support dynamic_arch naming via common_s,h
* (Conditionally) add sgemm_direct functions in setparam-ref.c
Enable new build target platform -- COOPERLAKE. This target platform
supports all the SKYLAKEX supported ISAs + avx512bf16. So all the
SKYLAKEX specific kernels/drivers and related code are now extended
to be also active on COOPERLAKE. Besides, new BF16 related kernels
are active under this target.
For small register blockings that are too small to fill up vector
registers with column vectors, we currently use a generic code block.
Replace that with instantiations of the generic code as individual
functions, so that the compiler can optimize each one separately.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
Improve performance of SGEMM and DGEMM on z14 and z15 by unrolling and
interleaving the inner loop of the SGEMM 16x4 and DGEMM 8x4 blocks.
Specifically, we explicitly interleave vector register loads and
computation of two iterations.
Note that this change only adds one C function, since SGEMM 16x4 and
DGEMM 8x4 actually map to the same C code: they both hold intermediate
results in a 4x4 grid of vector registers, and the C implementation is
built around that.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
Making use of new vector pair POWER10 instructions in dgemv_n and dgemv_t.
Also adding a new block 4x128 to make use of Matrix-Multiply Assist (MMA)
feature introduced in POWER ISA v3.1. Tested on simulator and there
are no new test failures.
As gcc defaults to -malign-power, removing that option. Also
adding -fno-integrated-as to use GNU assembler for powerpc
assembly optimization files. Fixed other compilation errors
reported in dgemv_t.c file.
There is a recent compiler change in __builtin_mma_disassemble_acc() which
affects the order of storing result in POWER10. Also removing new LDFLAG
-mno-power10-stub as it is handled by linker automatically.
This patch introduces new optimized version of SHGEMM kernel
using power10 Matrix-Multiply Assist (MMA) feature introduced in
POWER ISA v3.1. This patch makes use of new POWER10 compute instructions
for matrix multiplication operation.
Tested on simulator and there are no new test failures.
This patch introduces new optimized version of ZGEMM kernel
using power10 Matrix-Multiply Assist (MMA) feature introduced in
POWER ISA v3.1. This patch makes use of new POWER10 compute instructions
for matrix multiplication operation.
Tested on simulator and there are no new test failures.
Cycles count reduced by 30-50% compared to POWER9 version depending on
M/N/K sizes.
This patch introduces new optimized version of SGEMM, CGEMM and DGEMM
using power10 Matrix-Multiply Assist (MMA) feature introduced in
POWER ISA v3.1. This patch makes use of new POWER10 compute instructions
for matrix multiplication operation.
Tested on simulator and there are no new test failures.
Cycles count reduced by 30-50% compared to POWER9 version depending on
M/N/K sizes.
MMA GCC patch for reference:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=8ee2640bfdc62f835ec9740278f948034bc7d9f1