OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Marius Hillenbrand	77ea73f5e5	s390x: for clang use fp-contract=on instead of fast Make clang slightly more cautious when contracting floating-point operations (e.g., when applying fused multiply add) by setting -ffp-contract=on (instead of fast). Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-21 11:32:08 +02:00
Marius Hillenbrand	095f4e6964	s390x: allow clang to emit fused multiply-adds (replicates gcc's default behavior) gcc's default setting for floating-point expression contraction is "fast", which allows the compiler to emit fused multiply adds instead of separate multiplies and adds (amongst others). Fused multiply-adds, which assembly kernels typically apply, also bring a significant performance advantage to the C implementation for matrix-matrix multiplication on s390x. To enable that performance advantage for builds with clang, add -ffp-contract=fast to the compiler options. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-02 13:49:31 +02:00
Marius Hillenbrand	43c0d4f312	s390x: Add vectorized sgemm kernel for Z14 and newer Add a new GEMM kernel implementation to exploit the FP32 SIMD operations introduced with z14 and employ it for SGEMM on z14 and newer architectures. The SIMD extensions introduced with z13 support operations on double-sized scalars in vector registers. Thus, the existing SGEMM code would extend floats to doubles before operating on them. z14 extended SIMD support to operations on 32-bit floats. By employing these instructions, we can operate on twice the number of scalars per instruction (four floats in each vector registers) and avoid the conversion operations. The code is written in C with explicit vectorization. In experiments, this kernel improves performance on z14 and z15 by around 2x over the current implementation in assembly. The flexibilty of the C code paves the way for adjustments in subsequent commits. Tested via make -C test / ctest / utest and by a couple of additional unit tests that exercise blocking (e.g., partial register blocks with fewer than UNROLL_M rows and/or fewer than UNROLL_N columns). Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 15:59:51 +02:00
maamountki	23229011db	[ZARCH] Z14 support, BLAS 1/2 single precision implementations, Some missing double precision implementations, Gemv optimization	2018-08-06 18:20:40 +03:00
Abdurrauf	6418667818	dtrmm and dgemm for z13	2017-01-04 19:32:33 +04:00
Zhang Xianyi	dd43661cfd	Init IBM z system (s390x) porting.	2016-04-15 18:02:24 -04:00

6 Commits