Commit Graph

76 Commits

Author SHA1 Message Date
Marius Hillenbrand 22aa81f3e5 s390x: fix cscal and zscal implementations
The implementation of complex scalar * vector multiplication for Z14
makes some LAPACK tests fail because the numerical differences to the
reference implementation exceed the threshold (as can be seen by running
make lapack-test and replacing kernel/zarch/cscal.c with a generic
implementation for comparison).

The complex multiplication uses terms of the form a * b + c * d for both
real and imaginary parts. The assembly code (and compiler-emitted code
as well) uses fused multiply add operations for the second product and
sum. The results can be "surprising", for example when both terms in the
imaginary part nearly cancel each other out. In that case, the second
product contributes more digits to the sum than the first product, which
has already been rounded.

One option is to use separate multiplications (which then round the same
way) and a distinct add. Change the code to pursue that path, by (1)
requesting the compiler not to contract the operations into FMAs and (2)
replacing the assembly kernel with corresponding vectorized C code
(where change 1 also applies).

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-21 13:10:05 +02:00
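
A minimal scalar sketch of the rounding behavior this commit addresses (the
function and variable names are illustrative, not taken from the kernel; the
real fix operates on vectorized code). Building with -ffp-contract=off keeps
each product individually rounded, matching the reference implementation:

    /* Compile with -ffp-contract=off (or the compiler's equivalent) so the
     * second multiply and the add are not fused into an FMA; otherwise the
     * two products of each part are rounded differently, which shows up when
     * the terms nearly cancel. */
    static void cscal_sketch(int n, float da_r, float da_i, float *x) {
        for (int i = 0; i < n; i++) {
            float x_r = x[2 * i];
            float x_i = x[2 * i + 1];
            x[2 * i]     = da_r * x_r - da_i * x_i;   /* real part */
            x[2 * i + 1] = da_r * x_i + da_i * x_r;   /* imaginary part */
        }
    }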
Marius Hillenbrand f91057cbad s390x: move common vector definitions and utils into header
... to facilitate reuse beyond gemm_vec.c and avoid code duplication.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-21 11:32:08 +02:00
Marius Hillenbrand 2ee5b899ce s390x: enable S/DGEMM block with explicit loop unrolling + interleaving with clang
The code for SGEMM 16x4 and DGEMM 8x4 blocks on z14 and z15 uses
explicit unrolling and interleaving to improve performance. The code
employs an empty inline asm statement with operands that constrain the
compiler's instruction scheduling and thereby enforce proper overlapping
of load and compute phases. Fix an ifdef to apply that for clang builds,
as well.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-02 13:49:31 +02:00
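
A hedged sketch of the empty-asm idiom this commit enables for clang builds
(names are illustrative; the actual statement lives in the z14/z15 GEMM block
code and assumes -march=z14 -mzvector):

    #include <vecintrin.h>

    typedef __vector float vector_float;

    /* Emits no instructions, but the "+v" operands create dependencies that
     * keep the compiler from scheduling the next iteration's loads past the
     * current iteration's multiply-adds. */
    static inline void sched_barrier(vector_float *a, vector_float *b) {
        __asm__("" : "+v"(*a), "+v"(*b));
    }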
Marius Hillenbrand 87e5bbd887 s390x: avoid variable-length arrays in struct for asm operands
... since it is not required and clang does not support that gcc
extension. Instead, use a variable-length array directly for these
operands.

Note that, while the actual inline assembly code does not directly use
these memory operands, they serve to inform the compiler that it cannot
reorder reads or writes to/from the input and output data across the
inline asm statements.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-02 13:49:31 +02:00
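
An illustrative sketch of the memory-operand idiom described here, with
hypothetical names: passing the data as VLA-typed memory operands tells the
compiler which arrays the asm may read or write, without resorting to a
blanket "memory" clobber:

    #include <stddef.h>

    static void block_wrapper(size_t n, const double *A, double *C) {
        /* The asm body is elided; the operands alone keep the compiler from
         * moving loads/stores of A and C across the statement. */
        __asm__("" : "+m"(*(double (*)[n]) C)
                   : "m"(*(const double (*)[n]) A));
    }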
Marius Hillenbrand b9b3265ec8 s390x: avoid inline assembly for vector loads for clang
... since clang does not support the instruction format used in the inline
assembly, and the inline assembly is no longer required with current
versions of clang anyway.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-02 13:49:30 +02:00
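
A sketch of the substitution, assuming a plain vector load is all that is
needed on clang (names are hypothetical; assumes -mzvector and vecintrin.h):

    #include <vecintrin.h>

    typedef __vector double vector_double;

    static inline vector_double load_two(const double *addr) {
    #if defined(__clang__)
        return vec_xl(0, addr);   /* intrinsic; clang emits the vector load */
    #else
        vector_double out;        /* inline-asm form kept for gcc */
        __asm__("vl %0,%1" : "=v"(out) : "R"(*(const double (*)[2]) addr));
        return out;
    #endif
    }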
Marius Hillenbrand a1616a0b86 s390x: replace nop with "nop 0" in inline assembly
... as a band-aid for building with clang until LLVM's internal assembler
supports a nop without an operand.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-02 13:49:30 +02:00
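
A tiny illustration of the workaround (arbitrary context): "nop" without an
operand is an assembler alias that LLVM's integrated assembler rejected,
while "nop 0" spells the operand out and assembles with both toolchains:

    static inline void spacer(void) {
        __asm__ volatile("nop 0");   /* previously: __asm__ volatile("nop"); */
    }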
Marius Hillenbrand 60ef193258 s390x: use "lghi" for immediate values to fix build with clang
Some of the kernels written in assembly utilize a "load address"
instruction for loading an immediate value into a register. That is
unnecessarily complex, and LLVM's assembler does not understand that
specific syntax. Thus, replace it with the appropriate "load immediate"
instruction, which is also clearer to read.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-02 13:49:30 +02:00
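
An illustration of the mnemonic change, with an arbitrary register and value;
"lghi" is a plain load-immediate, whereas the old "load address" form relied
on address syntax that LLVM's assembler does not parse:

    static inline long make_four(void) {
        long tmp;
        /* previously: "la %0,4" (load the address 4, i.e. the constant 4) */
        __asm__("lghi %0,4" : "=d"(tmp));   /* load immediate 4 into a GPR */
        return tmp;
    }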
Marius Hillenbrand 07c334e7be s390x: Factor out small block sizes for SGEMM/DGEMM on z14
For small register blockings that are too small to fill up vector
registers with column vectors, we currently use a generic code block.
Replace that with instantiations of the generic code as individual
functions, so that the compiler can optimize each one separately.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-08-11 12:56:39 +02:00
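
A hedged sketch of the restructuring (macro and function names are
hypothetical): the generic block body is instantiated once per small
blocking, so every size becomes a compile-time constant that the compiler
can unroll and schedule on its own:

    #define INSTANTIATE_SMALL_BLOCK(rows, cols)                              \
        static void gemm_block_##rows##x##cols(long k, const double *A,      \
                                               const double *B, double *C,   \
                                               long ldc, double alpha) {     \
            double acc[rows][cols] = {{0.0}};                                \
            for (long l = 0; l < k; l++)          /* packed A and B */       \
                for (int i = 0; i < (rows); i++)                             \
                    for (int j = 0; j < (cols); j++)                         \
                        acc[i][j] += A[l * (rows) + i] * B[l * (cols) + j];  \
            for (int j = 0; j < (cols); j++)      /* write back to C */      \
                for (int i = 0; i < (rows); i++)                             \
                    C[j * ldc + i] += alpha * acc[i][j];                     \
        }

    INSTANTIATE_SMALL_BLOCK(4, 2)   /* defines gemm_block_4x2 */
    INSTANTIATE_SMALL_BLOCK(2, 2)   /* defines gemm_block_2x2 */
    INSTANTIATE_SMALL_BLOCK(1, 1)   /* defines gemm_block_1x1 */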
Marius Hillenbrand e2828e30aa s390x: Optimize SGEMM/DGEMM blocks for z14 with explicit loop unrolling/interleaving
Improve performance of SGEMM and DGEMM on z14 and z15 by unrolling and
interleaving the inner loop of the SGEMM 16x4 and DGEMM 8x4 blocks.
Specifically, we explicitly interleave vector register loads and
computation of two iterations.

Note that this change only adds one C function, since SGEMM 16x4 and
DGEMM 8x4 actually map to the same C code: they both hold intermediate
results in a 4x4 grid of vector registers, and the C implementation is
built around that.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-08-11 12:55:42 +02:00
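
A simplified sketch of the interleaving structure (assumes -march=z14
-mzvector; names are illustrative, and the real 16x4/8x4 blocks keep a 4x4
grid of accumulators rather than one). The loads of iteration l+1 are issued
before the multiply-adds of iteration l, so memory latency overlaps with
computation:

    #include <vecintrin.h>

    typedef __vector float vector_float;

    /* Accumulate a dot-product-like column block; requires k >= 1. */
    static vector_float interleaved_block(long k, const float *A, const float *B) {
        vector_float acc = vec_splats(0.0f);
        vector_float a0 = vec_xl(0, A), b0 = vec_xl(0, B);
        for (long l = 0; l < k - 1; l++) {
            /* start loading the operands of the next iteration ... */
            vector_float a1 = vec_xl(0, A + 4 * (l + 1));
            vector_float b1 = vec_xl(0, B + 4 * (l + 1));
            /* ... while the current iteration's FMA is still in flight */
            acc = vec_madd(a0, b0, acc);
            a0 = a1;
            b0 = b1;
        }
        return vec_madd(a0, b0, acc);
    }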
Marius Hillenbrand 89fe17f20e s390x: Use new sgemm kernel also for DGEMM and DTRMM on Z14
Apply our new GEMM kernel implementation, written in C with vector intrinsics,
also for DGEMM and DTRMM on Z14 and newer (i.e., architectures with FP32 SIMD
instructions). As a result, we gain around 10% in performance on z15, in
addition to improving maintainability.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-20 10:23:35 +02:00
Marius Hillenbrand bdd795ed03 s390x/GEMM: replace 0-init with peeled first iteration
... since it gains another ~2% of SGEMM and DGEMM performance on z15;
also, the code just called for that cleanup.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-20 10:23:35 +02:00
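
A sketch of the peeled-first-iteration idiom (hypothetical names, assumes
-mzvector): the first iteration initializes the accumulator with a plain
multiply, so the explicit zeroing of every accumulator register at the start
of the block disappears:

    #include <vecintrin.h>

    typedef __vector float vector_float;

    /* Requires k >= 1; the element-wise '*' on vector types is GNU C. */
    static vector_float column_dot(long k, const float *a, const float *b) {
        vector_float acc = vec_xl(0, a) * vec_xl(0, b);   /* peeled iteration */
        for (long l = 1; l < k; l++)
            acc = vec_madd(vec_xl(0, a + 4 * l), vec_xl(0, b + 4 * l), acc);
        return acc;
    }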
Marius Hillenbrand 2840432e49 s390x: improvise vector alignment hints for older compilers
Introduce inline assembly so that we can employ vector loads with
alignment hints on older compilers (pre gcc-9), since these are still
used in distributions such as RHEL 8 and Ubuntu 18.04 LTS.

Informing the hardware about alignment can speed up vector loads. For
that purpose, we can encode hints about 8-byte or 16-byte alignment of
the memory operand into the opcodes. gcc-9 and newer automatically emit
such hints, where applicable. Add a bit of inline assembly that achieves
the same for older compilers. Since an older binutils may not know about
the additional operand for the hints, we explicitly encode the opcode in
hex.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-14 15:36:03 +02:00
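
A hedged sketch of the described fallback (assumes -mzvector; the helper name
and the hint value in the last operand are assumptions, the real header
defines the exact encoding). 0xe70000000006 is the VL opcode in VRX format;
emitting it via .insn spares an old binutils from having to understand the
extra alignment-hint operand:

    #include <vecintrin.h>

    typedef __vector float vector_float;

    static inline vector_float vec_load_hinted(const vector_float *addr) {
    #if defined(__GNUC__) && __GNUC__ >= 9
        return *addr;             /* gcc-9+ adds the alignment hint itself */
    #else
        vector_float y;
        __asm__(".insn vrx,0xe70000000006,%[out],%[addr],3"
                : [out] "=v"(y)
                : [addr] "R"(*addr));
        return y;
    #endif
    }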
Marius Hillenbrand 1b0b4349a1 s390x/Z14: Change register blocking for SGEMM to 16x4
Change register blocking for SGEMM (and STRMM) on z14 from 8x4 to 16x4
by adjusting SGEMM_DEFAULT_UNROLL_M and choosing the appropriate copy
implementations. Actually make KERNEL.Z14 more flexible, so that the
change in param.h suffices. As a result, performance for SGEMM improves
by around 30% on z15.

On z14, FP SIMD instructions can operate on float-sized scalars in
vector registers, while z13 could do that for double-sized scalars only.
Thus, we can double the number of elements of C that are held in
registers in an SGEMM kernel.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-12 15:59:51 +02:00
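
The param.h-level change described here, reconstructed from the text (the
macro names are the standard OpenBLAS ones; the surrounding conditional is an
assumption):

    #if defined(Z14)
    #define SGEMM_DEFAULT_UNROLL_M 16   /* was 8: twice the rows of C per block */
    #define SGEMM_DEFAULT_UNROLL_N 4
    #endif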
Marius Hillenbrand 71b6eaf459 s390x: Use new sgemm kernel also for strmm on Z14 and newer
Employ the newly added GEMM kernel also for STRMM on Z14. The
implementation in C with vector intrinsics exploits FP32 SIMD operations
and thereby gains performance over the existing assembly code. Extend
the implementation for handling triangular matrix multiplication,
accordingly. As an added benefit, the more flexible C code enables us to
adjust register blocking in the subsequent commit.

Tested via make -C test / ctest / utest and by a couple of additional
unit tests that exercise blocking.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-12 15:59:51 +02:00
Marius Hillenbrand 43c0d4f312 s390x: Add vectorized sgemm kernel for Z14 and newer
Add a new GEMM kernel implementation to exploit the FP32 SIMD
operations introduced with z14 and employ it for SGEMM on z14 and newer
architectures.

The SIMD extensions introduced with z13 support operations on
double-sized scalars in vector registers. Thus, the existing SGEMM code
would extend floats to doubles before operating on them. z14 extended
SIMD support to operations on 32-bit floats. By employing these
instructions, we can operate on twice the number of scalars per
instruction (four floats in each vector register) and avoid the
conversion operations.

The code is written in C with explicit vectorization. In experiments,
this kernel improves performance on z14 and z15 by around 2x over the
current implementation in assembly. The flexibility of the C code paves
the way for adjustments in subsequent commits.

Tested via make -C test / ctest / utest and by a couple of additional
unit tests that exercise blocking (e.g., partial register blocks with
fewer than UNROLL_M rows and/or fewer than UNROLL_N columns).

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-12 15:59:51 +02:00
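
A minimal sketch of the FP32 SIMD approach (assumes -march=z14 -mzvector; an
illustration of the idea, not the actual 16x4 kernel). Each vector register
holds four floats, so a column of four C elements is updated with one
vec_madd per element of B, with no float-to-double conversion:

    #include <vecintrin.h>

    typedef __vector float vector_float;

    /* 4x4 register block; A packed as 4 rows x k, B packed as k x 4 columns. */
    static void sgemm_block_4x4(long k, const float *A, const float *B,
                                float *C, long ldc, float alpha) {
        vector_float acc[4] = {vec_splats(0.0f), vec_splats(0.0f),
                               vec_splats(0.0f), vec_splats(0.0f)};
        for (long l = 0; l < k; l++) {
            vector_float a = vec_xl(0, A + 4 * l);   /* four rows of A at once */
            for (int j = 0; j < 4; j++)
                acc[j] = vec_madd(a, vec_splats(B[4 * l + j]), acc[j]);
        }
        for (int j = 0; j < 4; j++) {                /* C = alpha*A*B + C */
            vector_float c = vec_xl(0, C + j * ldc);
            vec_xst(vec_madd(vec_splats(alpha), acc[j], c), 0, C + j * ldc);
        }
    }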
int_13h 96ad579428 add in runtime cpu detection for zarch (#2349)
2019-12-31 18:03:27 +01:00
Andreas Arnez d117dfd505 Change bad usage of "asum" to "sum" in ZARCH versions of ?sum
The ZARCH implementations of ?sum contain a cut-and-paste error: an inline
assembly argument is named "sum", but the assembly references "asum"
instead.  The mismatch causes a build error.  This is fixed.
2019-11-21 13:49:13 +01:00
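
A tiny illustration of the mismatch (not the actual kernel): the operand is
declared with the symbolic name [sum], so the template has to reference
%[sum]; a leftover %[asum] fails to assemble with an undefined-operand error:

    static double keep_sum(double x) {
        double sum = x;
        __asm__("" : [sum] "+f"(sum));   /* a stray %[asum] here breaks the build */
        return sum;
    }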
Martin Kroeker 246ca29679
Add ZARCH implementations of ?sum
as trivial copies of the respective ?asum kernels, with the ABS and vflpsb calls removed
2019-03-30 22:49:05 +01:00
maamountki 0a54c98b9d
[ZARCH] Modify constraints 2019-02-13 21:06:25 +02:00
maamountki bec54ae366
[ZARCH] Fix caxpy 2019-02-13 12:54:35 +02:00
maamountki f583674109
[ZARCH] Fix cgemv_t_4 2019-02-12 13:12:28 +02:00
maamountki 77fe70019f
[ZARCH] Fix constraints and source code formatting 2019-02-11 16:01:13 +02:00
maamountki 7039770165
[ZARCH] Undo the last commit 2019-02-06 20:11:44 +02:00
maamountki 11a43e8116
[ZARCH] Set alignment hint for vl/vst 2019-02-05 19:17:08 +02:00
maamountki 61526480f9
[ZARCH] Fix copy constraint 2019-02-05 07:51:19 +02:00
maamountki 81daf6bc38
[ZARCH] Format source code, Fix constraints 2019-02-05 07:30:38 +02:00
Martin Kroeker 874df65491
Fix incorrect sgemv results for IBM z14
part of PR #1993 that was inadvertently misplaced into the toplevel directory
2019-02-01 12:58:59 +01:00
Martin Kroeker 877023e1e1
Fix precision of zarch DSDOT
from patch provided by aarnez in #991
2019-01-31 21:22:26 +01:00
Martin Kroeker 265142edd5
Fix typo in the zarch min/max kernels
from patch provided by aarnez in #991
2019-01-31 21:21:40 +01:00
maamountki 29416cb5a3
[ZARCH] Add Z13 version for max/min functions 2019-01-31 19:11:11 +02:00
maamountki 48b9b94f7f
[ZARCH] Improve loading performance for camax/icamax 2019-01-31 18:52:11 +02:00
maamountki fcd814a8d2
[ZARCH] Fix bug in max/min functions 2019-01-29 17:59:38 +02:00
maamountki dc4d3bccd5
[ZARCH] Fix icamax/icamin 2019-01-29 03:47:49 +02:00
maamountki c7143c1019
[ZARCH] Fix iamax/imax single precision 2019-01-28 17:52:23 +02:00
maamountki 04873bb174
[ZARCH] Undo the last commit 2019-01-28 17:32:24 +02:00
maamountki c8ef9fb220
[ZARCH] Fix bug in iamax/iamin/imax/imin 2019-01-28 17:16:18 +02:00
maamountki b111829226
[ZARCH] Update max/min functions 2019-01-21 15:56:04 +02:00
maamountki b815a04c87
[ZARCH] fix a bug in max/min functions 2019-01-15 21:04:22 +02:00
maamountki 1a7925b3a3
[ZARCH] Update dgemv_n_4.c 2019-01-11 17:43:11 +02:00
maamountki 406f835f00
[ZARCH] update cgemv_n_4.c 2019-01-11 17:39:17 +02:00
maamountki 621dedb37b
[ZARCH] Update cgemv_t_4.c 2019-01-11 17:37:11 +02:00
maamountki b731e8246f
Update sgemv_t_4.c 2019-01-11 17:14:04 +02:00
maamountki ecc31b743f
Update dgemv_t_4.c 2019-01-11 17:13:02 +02:00
maamountki 5d89d6b143
[ZARCH] fix sgemv_n_4.c 2019-01-11 17:08:24 +02:00
maamountki 67432b23c2
[ZARCH] fix cgemv_n_4.c 2019-01-11 16:44:46 +02:00
maamountki be66f5d5c2
[ZARCH] fix data prefetch type in sdot 2019-01-09 16:50:07 +02:00
maamountki c2ffef8156
[ZARCH] fix data prefetch type in ddot 2019-01-09 16:49:44 +02:00
maamountki e7455f500c
[ZARCH] fix dsdot.c 2019-01-09 16:33:54 +02:00
maamountki 3eafcfa650
[ZARCH] fix cgemv_n_4.c 2019-01-09 07:43:45 +02:00
maamountki 94cd946b96
[ZARCH] fix cgemv_n_4.c 2019-01-04 17:45:56 +02:00