Commit Graph

76 Commits

Author SHA1 Message Date
Marius Hillenbrand 22aa81f3e5 s390x: fix cscal and zscal implementations
The implementation of complex scalar * vector multiplication for Z14
makes some LAPACK tests fail because the numerical differences to the
reference implementation exceed the threshold (as can be seen by running
make lapack-test and replacing kernel/zarch/cscal.c with a generic
implementation for comparison).

The complex multiplication uses terms of the form a * b + c * d for both
real and imaginary parts. The assembly code (and compiler-emitted code
as well) uses fused multiply add operations for the second product and
sum. The results can be "surprising", for example when both terms in the
imaginary part nearly cancel each other out. In that case, the second
product contributes more digits to the sum than the first product, which
has already been rounded.

One option is to use separate multiplications (which then round the same
way) and a distinct add. Change the code to pursue that path, by (1)
requesting the compiler not to contract the operations into FMAs and (2)
replacing the assembly kernel with corresponding vectorized C code
(where change 1 also applies).

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-21 13:10:05 +02:00
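
A minimal scalar sketch of the rounding behavior this commit addresses (the
function and variable names are illustrative, not taken from the kernel; the
real fix operates on vectorized code). Building with -ffp-contract=off keeps
each product individually rounded, matching the reference implementation:

    /* Compile with -ffp-contract=off (or the compiler's equivalent) so the
     * second multiply and the add are not fused into an FMA; otherwise the
     * two products of each part are rounded differently, which shows up when
     * the terms nearly cancel. */
    static void cscal_sketch(int n, float da_r, float da_i, float *x) {
        for (int i = 0; i < n; i++) {
            float x_r = x[2 * i];
            float x_i = x[2 * i + 1];
            x[2 * i]     = da_r * x_r - da_i * x_i;   /* real part */
            x[2 * i + 1] = da_r * x_i + da_i * x_r;   /* imaginary part */
        }
    }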
Marius Hillenbrand f91057cbad s390x: move common vector definitions and utils into header
... to facilitate reuse beyond gemm_vec.c and avoid code duplication.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-21 11:32:08 +02:00
Marius Hillenbrand 2ee5b899ce s390x: enable S/DGEMM block with explicit loop unrolling + interleaving with clang
The code for SGEMM 16x4 and DGEMM 8x4 blocks on z14 and z15 uses
explicit unrolling and interleaving to improve performance. The code
employs an empty inline asm statement with operands that constrain the
compiler's instruction scheduling and thereby enforce proper overlapping
of load and compute phases. Fix an ifdef to apply that for clang builds,
as well.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-02 13:49:31 +02:00
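
A hedged sketch of the empty-asm idiom this commit enables for clang builds
(names are illustrative; the actual statement lives in the z14/z15 GEMM block
code and assumes -march=z14 -mzvector):

    #include <vecintrin.h>

    typedef __vector float vector_float;

    /* Emits no instructions, but the "+v" operands create dependencies that
     * keep the compiler from scheduling the next iteration's loads past the
     * current iteration's multiply-adds. */
    static inline void sched_barrier(vector_float *a, vector_float *b) {
        __asm__("" : "+v"(*a), "+v"(*b));
    }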
Marius Hillenbrand 87e5bbd887 s390x: avoid variable-length arrays in struct for asm operands
... since it is not required and clang does not support that gcc
extension. Instead, use a variable-length array directly for these
operands.

Note that, while the actual inline assembly code does not directly use
these memory operands, they serve to inform the compiler that it cannot
reorder reads or writes to/from the input and output data across the
inline asm statements.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-02 13:49:31 +02:00
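
An illustrative sketch of the memory-operand idiom described here, with
hypothetical names: passing the data as VLA-typed memory operands tells the
compiler which arrays the asm may read or write, without resorting to a
blanket "memory" clobber:

    #include <stddef.h>

    static void block_wrapper(size_t n, const double *A, double *C) {
        /* The asm body is elided; the operands alone keep the compiler from
         * moving loads/stores of A and C across the statement. */
        __asm__("" : "+m"(*(double (*)[n]) C)
                   : "m"(*(const double (*)[n]) A));
    }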
Marius Hillenbrand b9b3265ec8 s390x: avoid inline assembly for vector loads for clang
... since clang does not support the instruction format used in the inline
assembly, and the inline assembly is no longer required with current
versions of clang anyway.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-02 13:49:30 +02:00
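
A sketch of the substitution, assuming a plain vector load is all that is
needed on clang (names are hypothetical; assumes -mzvector and vecintrin.h):

    #include <vecintrin.h>

    typedef __vector double vector_double;

    static inline vector_double load_two(const double *addr) {
    #if defined(__clang__)
        return vec_xl(0, addr);   /* intrinsic; clang emits the vector load */
    #else
        vector_double out;        /* inline-asm form kept for gcc */
        __asm__("vl %0,%1" : "=v"(out) : "R"(*(const double (*)[2]) addr));
        return out;
    #endif
    }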
Marius Hillenbrand a1616a0b86 s390x: replace nop with "nop 0" in inline assembly
... as a band-aid for building with clang until LLVM's internal assembler
supports a nop without an operand.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-02 13:49:30 +02:00
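
A tiny illustration of the workaround (arbitrary context): "nop" without an
operand is an assembler alias that LLVM's integrated assembler rejected,
while "nop 0" spells the operand out and assembles with both toolchains:

    static inline void spacer(void) {
        __asm__ volatile("nop 0");   /* previously: __asm__ volatile("nop"); */
    }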
Marius Hillenbrand 60ef193258 s390x: use "lghi" for immediate values to fix build with clang
Some of the kernels written in assembly utilize a "load address"
instruction for loading an immediate value into a register. That is
unnecessarily complex, and LLVM's assembler does not understand that
specific syntax. Thus, replace it with the appropriate "load immediate"
instruction, which is also clearer to read.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-02 13:49:30 +02:00
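
An illustration of the mnemonic change, with an arbitrary register and value;
"lghi" is a plain load-immediate, whereas the old "load address" form relied
on address syntax that LLVM's assembler does not parse:

    static inline long make_four(void) {
        long tmp;
        /* previously: "la %0,4" (load the address 4, i.e. the constant 4) */
        __asm__("lghi %0,4" : "=d"(tmp));   /* load immediate 4 into a GPR */
        return tmp;
    }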
Marius Hillenbrand 07c334e7be s390x: Factor out small block sizes for SGEMM/DGEMM on z14
For small register blockings that are too small to fill up vector
registers with column vectors, we currently use a generic code block.
Replace that with instantiations of the generic code as individual
functions, so that the compiler can optimize each one separately.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-08-11 12:56:39 +02:00
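
A hedged sketch of the restructuring (macro and function names are
hypothetical): the generic block body is instantiated once per small
blocking, so every size becomes a compile-time constant that the compiler
can unroll and schedule on its own:

    #define INSTANTIATE_SMALL_BLOCK(rows, cols)                              \
        static void gemm_block_##rows##x##cols(long k, const double *A,      \
                                               const double *B, double *C,   \
                                               long ldc, double alpha) {     \
            double acc[rows][cols] = {{0.0}};                                \
            for (long l = 0; l < k; l++)          /* packed A and B */       \
                for (int i = 0; i < (rows); i++)                             \
                    for (int j = 0; j < (cols); j++)                         \
                        acc[i][j] += A[l * (rows) + i] * B[l * (cols) + j];  \
            for (int j = 0; j < (cols); j++)      /* write back to C */      \
                for (int i = 0; i < (rows); i++)                             \
                    C[j * ldc + i] += alpha * acc[i][j];                     \
        }

    INSTANTIATE_SMALL_BLOCK(4, 2)   /* defines gemm_block_4x2 */
    INSTANTIATE_SMALL_BLOCK(2, 2)   /* defines gemm_block_2x2 */
    INSTANTIATE_SMALL_BLOCK(1, 1)   /* defines gemm_block_1x1 */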
Marius Hillenbrand e2828e30aa s390x: Optimize SGEMM/DGEMM blocks for z14 with explicit loop unrolling/interleaving
Improve performance of SGEMM and DGEMM on z14 and z15 by unrolling and
interleaving the inner loop of the SGEMM 16x4 and DGEMM 8x4 blocks.
Specifically, we explicitly interleave vector register loads and
computation of two iterations.

Note that this change only adds one C function, since SGEMM 16x4 and
DGEMM 8x4 actually map to the same C code: they both hold intermediate
results in a 4x4 grid of vector registers, and the C implementation is
built around that.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-08-11 12:55:42 +02:00
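
A simplified sketch of the interleaving structure (assumes -march=z14
-mzvector; names are illustrative, and the real 16x4/8x4 blocks keep a 4x4
grid of accumulators rather than one). The loads of iteration l+1 are issued
before the multiply-adds of iteration l, so memory latency overlaps with
computation:

    #include <vecintrin.h>

    typedef __vector float vector_float;

    /* Accumulate a dot-product-like column block; requires k >= 1. */
    static vector_float interleaved_block(long k, const float *A, const float *B) {
        vector_float acc = vec_splats(0.0f);
        vector_float a0 = vec_xl(0, A), b0 = vec_xl(0, B);
        for (long l = 0; l < k - 1; l++) {
            /* start loading the operands of the next iteration ... */
            vector_float a1 = vec_xl(0, A + 4 * (l + 1));
            vector_float b1 = vec_xl(0, B + 4 * (l + 1));
            /* ... while the current iteration's FMA is still in flight */
            acc = vec_madd(a0, b0, acc);
            a0 = a1;
            b0 = b1;
        }
        return vec_madd(a0, b0, acc);
    }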
Marius Hillenbrand 89fe17f20e s390x: Use new sgemm kernel also for DGEMM and DTRMM on Z14
Apply our new GEMM kernel implementation, written in C with vector intrinsics,
also for DGEMM and DTRMM on Z14 and newer (i.e., architectures with FP32 SIMD
instructions). As a result, we gain around 10% in performance on z15, in
addition to improving maintainability.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-20 10:23:35 +02:00
Marius Hillenbrand bdd795ed03 s390x/GEMM: replace 0-init with peeled first iteration
... since it gains another ~2% of SGEMM and DGEMM performance on z15;
also, the code just called for that cleanup.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-20 10:23:35 +02:00
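
A sketch of the peeled-first-iteration idiom (hypothetical names, assumes
-mzvector): the first iteration initializes the accumulator with a plain
multiply, so the explicit zeroing of every accumulator register at the start
of the block disappears:

    #include <vecintrin.h>

    typedef __vector float vector_float;

    /* Requires k >= 1; the element-wise '*' on vector types is GNU C. */
    static vector_float column_dot(long k, const float *a, const float *b) {
        vector_float acc = vec_xl(0, a) * vec_xl(0, b);   /* peeled iteration */
        for (long l = 1; l < k; l++)
            acc = vec_madd(vec_xl(0, a + 4 * l), vec_xl(0, b + 4 * l), acc);
        return acc;
    }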
Marius Hillenbrand 2840432e49 s390x: improvise vector alignment hints for older compilers
Introduce inline assembly so that we can employ vector loads with
alignment hints on older compilers (pre gcc-9), since these are still
used in distributions such as RHEL 8 and Ubuntu 18.04 LTS.

Informing the hardware about alignment can speed up vector loads. For
that purpose, we can encode hints about 8-byte or 16-byte alignment of
the memory operand into the opcodes. gcc-9 and newer automatically emit
such hints, where applicable. Add a bit of inline assembly that achieves
the same for older compilers. Since an older binutils may not know about
the additional operand for the hints, we explicitly encode the opcode in
hex.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-14 15:36:03 +02:00
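
A hedged sketch of the described fallback (assumes -mzvector; the helper name
and the hint value in the last operand are assumptions, the real header
defines the exact encoding). 0xe70000000006 is the VL opcode in VRX format;
emitting it via .insn spares an old binutils from having to understand the
extra alignment-hint operand:

    #include <vecintrin.h>

    typedef __vector float vector_float;

    static inline vector_float vec_load_hinted(const vector_float *addr) {
    #if defined(__GNUC__) && __GNUC__ >= 9
        return *addr;             /* gcc-9+ adds the alignment hint itself */
    #else
        vector_float y;
        __asm__(".insn vrx,0xe70000000006,%[out],%[addr],3"
                : [out] "=v"(y)
                : [addr] "R"(*addr));
        return y;
    #endif
    }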
Marius Hillenbrand 1b0b4349a1 s390x/Z14: Change register blocking for SGEMM to 16x4
Change register blocking for SGEMM (and STRMM) on z14 from 8x4 to 16x4
by adjusting SGEMM_DEFAULT_UNROLL_M and choosing the appropriate copy
implementations. Actually make KERNEL.Z14 more flexible, so that the
change in param.h suffices. As a result, performance for SGEMM improves
by around 30% on z15.

On z14, FP SIMD instructions can operate on float-sized scalars in
vector registers, while z13 could do that for double-sized scalars only.
Thus, we can double the number of elements of C that are held in
registers in an SGEMM kernel.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-12 15:59:51 +02:00
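
The param.h-level change described here, reconstructed from the text (the
macro names are the standard OpenBLAS ones; the surrounding conditional is an
assumption):

    #if defined(Z14)
    #define SGEMM_DEFAULT_UNROLL_M 16   /* was 8: twice the rows of C per block */
    #define SGEMM_DEFAULT_UNROLL_N 4
    #endif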
Marius Hillenbrand 71b6eaf459 s390x: Use new sgemm kernel also for strmm on Z14 and newer
Employ the newly added GEMM kernel also for STRMM on Z14. The
implementation in C with vector intrinsics exploits FP32 SIMD operations
and thereby gains performance over the existing assembly code. Extend
the implementation for handling triangular matrix multiplication,
accordingly. As an added benefit, the more flexible C code enables us to
adjust register blocking in the subsequent commit.

Tested via make -C test / ctest / utest and by a couple of additional
unit tests that exercise blocking.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-12 15:59:51 +02:00
Marius Hillenbrand 43c0d4f312 s390x: Add vectorized sgemm kernel for Z14 and newer
Add a new GEMM kernel implementation to exploit the FP32 SIMD
operations introduced with z14 and employ it for SGEMM on z14 and newer
architectures.

The SIMD extensions introduced with z13 support operations on
double-sized scalars in vector registers. Thus, the existing SGEMM code
would extend floats to doubles before operating on them. z14 extended
SIMD support to operations on 32-bit floats. By employing these
instructions, we can operate on twice the number of scalars per
instruction (four floats in each vector register) and avoid the
conversion operations.

The code is written in C with explicit vectorization. In experiments,
this kernel improves performance on z14 and z15 by around 2x over the
current implementation in assembly. The flexibility of the C code paves
the way for adjustments in subsequent commits.

Tested via make -C test / ctest / utest and by a couple of additional
unit tests that exercise blocking (e.g., partial register blocks with
fewer than UNROLL_M rows and/or fewer than UNROLL_N columns).

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-12 15:59:51 +02:00
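
A minimal sketch of the FP32 SIMD approach (assumes -march=z14 -mzvector; an
illustration of the idea, not the actual 16x4 kernel). Each vector register
holds four floats, so a column of four C elements is updated with one
vec_madd per element of B, with no float-to-double conversion:

    #include <vecintrin.h>

    typedef __vector float vector_float;

    /* 4x4 register block; A packed as 4 rows x k, B packed as k x 4 columns. */
    static void sgemm_block_4x4(long k, const float *A, const float *B,
                                float *C, long ldc, float alpha) {
        vector_float acc[4] = {vec_splats(0.0f), vec_splats(0.0f),
                               vec_splats(0.0f), vec_splats(0.0f)};
        for (long l = 0; l < k; l++) {
            vector_float a = vec_xl(0, A + 4 * l);   /* four rows of A at once */
            for (int j = 0; j < 4; j++)
                acc[j] = vec_madd(a, vec_splats(B[4 * l + j]), acc[j]);
        }
        for (int j = 0; j < 4; j++) {                /* C = alpha*A*B + C */
            vector_float c = vec_xl(0, C + j * ldc);
            vec_xst(vec_madd(vec_splats(alpha), acc[j], c), 0, C + j * ldc);
        }
    }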
int_13h 96ad579428 add in runtime cpu detection for zarch (#2349)
2019-12-31 18:03:27 +01:00
Andreas Arnez d117dfd505 Change bad usage of "asum" to "sum" in ZARCH versions of ?sum
The ZARCH implementations of ?sum contain a cut-and-paste error: an inline
assembly argument is named "sum", but the assembly references "asum"
instead.  The mismatch causes a build error.  This is fixed.
2019-11-21 13:49:13 +01:00
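
A tiny illustration of the mismatch (not the actual kernel): the operand is
declared with the symbolic name [sum], so the template has to reference
%[sum]; a leftover %[asum] fails to assemble with an undefined-operand error:

    static double keep_sum(double x) {
        double sum = x;
        __asm__("" : [sum] "+f"(sum));   /* a stray %[asum] here breaks the build */
        return sum;
    }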
Martin Kroeker 246ca29679
Add ZARCH implementations of ?sum
as trivial copies of the respective ?asum kernels, with the ABS and vflpsb calls removed
2019-03-30 22:49:05 +01:00
maamountki 0a54c98b9d
[ZARCH] Modify constraints 2019-02-13 21:06:25 +02:00
maamountki bec54ae366
[ZARCH] Fix caxpy 2019-02-13 12:54:35 +02:00
maamountki f583674109
[ZARCH] Fix cgemv_t_4 2019-02-12 13:12:28 +02:00
maamountki 77fe70019f
[ZARCH] Fix constraints and source code formatting 2019-02-11 16:01:13 +02:00
maamountki 7039770165
[ZARCH] Undo the last commit 2019-02-06 20:11:44 +02:00
maamountki 11a43e8116
[ZARCH] Set alignment hint for vl/vst 2019-02-05 19:17:08 +02:00
maamountki 61526480f9
[ZARCH] Fix copy constraint 2019-02-05 07:51:19 +02:00
maamountki 81daf6bc38
[ZARCH] Format source code, Fix constraints 2019-02-05 07:30:38 +02:00
Martin Kroeker 874df65491
Fix incorrect sgemv results for IBM z14
part of PR #1993 that was inadvertently misplaced into the toplevel directory
2019-02-01 12:58:59 +01:00
Martin Kroeker 877023e1e1
Fix precision of zarch DSDOT
from patch provided by aarnez in #991
2019-01-31 21:22:26 +01:00
Martin Kroeker 265142edd5
Fix typo in the zarch min/max kernels
from patch provided by aarnez in #991
2019-01-31 21:21:40 +01:00
maamountki 29416cb5a3
[ZARCH] Add Z13 version for max/min functions 2019-01-31 19:11:11 +02:00
maamountki 48b9b94f7f
[ZARCH] Improve loading performance for camax/icamax 2019-01-31 18:52:11 +02:00
maamountki fcd814a8d2
[ZARCH] Fix bug in max/min functions 2019-01-29 17:59:38 +02:00
maamountki dc4d3bccd5
[ZARCH] Fix icamax/icamin 2019-01-29 03:47:49 +02:00
maamountki c7143c1019
[ZARCH] Fix iamax/imax single precision 2019-01-28 17:52:23 +02:00
maamountki 04873bb174
[ZARCH] Undo the last commit 2019-01-28 17:32:24 +02:00
maamountki c8ef9fb220
[ZARCH] Fix bug in iamax/iamin/imax/imin 2019-01-28 17:16:18 +02:00
maamountki b111829226
[ZARCH] Update max/min functions 2019-01-21 15:56:04 +02:00
maamountki b815a04c87
[ZARCH] fix a bug in max/min functions 2019-01-15 21:04:22 +02:00
maamountki 1a7925b3a3
[ZARCH] Update dgemv_n_4.c 2019-01-11 17:43:11 +02:00
maamountki 406f835f00
[ZARCH] update cgemv_n_4.c 2019-01-11 17:39:17 +02:00
maamountki 621dedb37b
[ZARCH] Update cgemv_t_4.c 2019-01-11 17:37:11 +02:00
maamountki b731e8246f
Update sgemv_t_4.c 2019-01-11 17:14:04 +02:00
maamountki ecc31b743f
Update dgemv_t_4.c 2019-01-11 17:13:02 +02:00
maamountki 5d89d6b143
[ZARCH] fix sgemv_n_4.c 2019-01-11 17:08:24 +02:00
maamountki 67432b23c2
[ZARCH] fix cgemv_n_4.c 2019-01-11 16:44:46 +02:00
maamountki be66f5d5c2
[ZARCH] fix data prefetch type in sdot 2019-01-09 16:50:07 +02:00
maamountki c2ffef8156
[ZARCH] fix data prefetch type in ddot 2019-01-09 16:49:44 +02:00
maamountki e7455f500c
[ZARCH] fix dsdot.c 2019-01-09 16:33:54 +02:00
maamountki 3eafcfa650
[ZARCH] fix cgemv_n_4.c 2019-01-09 07:43:45 +02:00
maamountki 94cd946b96
[ZARCH] fix cgemv_n_4.c 2019-01-04 17:45:56 +02:00