OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Marius Hillenbrand	89fe17f20e	s390x: Use new sgemm kernel also for DGEMM and DTRMM on Z14 Apply our new GEMM kernel implementation, written in C with vector intrinsics, also for DGEMM and DTRMM on Z14 and newer (i.e., architectures with FP32 SIMD instructions). As a result, we gain around 10% in performance on z15, in addition to improving maintainability. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-20 10:23:35 +02:00
Marius Hillenbrand	bdd795ed03	s390x/GEMM: replace 0-init with peeled first iteration ... since it gains another ~2% of SGEMM and DGEMM performance on z15; also, the code just called for that cleanup. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-20 10:23:35 +02:00
Marius Hillenbrand	2840432e49	s390x: improvise vector alignment hints for older compilers Introduce inline assembly so that we can employ vector loads with alignment hints on older compilers (pre gcc-9), since these are still used in distributions such as RHEL 8 and Ubuntu 18.04 LTS. Informing the hardware about alignment can speed up vector loads. For that purpose, we can encode hints about 8-byte or 16-byte alignment of the memory operand into the opcodes. gcc-9 and newer automatically emit such hints, where applicable. Add a bit of inline assembly that achieves the same for older compilers. Since an older binutils may not know about the additional operand for the hints, we explicitly encode the opcode in hex. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-14 15:36:03 +02:00
Marius Hillenbrand	1b0b4349a1	s390x/Z14: Change register blocking for SGEMM to 16x4 Change register blocking for SGEMM (and STRMM) on z14 from 8x4 to 16x4 by adjusting SGEMM_DEFAULT_UNROLL_M and choosing the appropriate copy implementations. Actually make KERNEL.Z14 more flexible, so that the change in param.h suffices. As a result, performance for SGEMM improves by around 30% on z15. On z14, FP SIMD instructions can operate on float-sized scalars in vector registers, while z13 could do that for double-sized scalars only. Thus, we can double the amount of elements of C that are held in registers in an SGEMM kernel. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 15:59:51 +02:00
Marius Hillenbrand	71b6eaf459	s390x: Use new sgemm kernel also for strmm on Z14 and newer Employ the newly added GEMM kernel also for STRMM on Z14. The implementation in C with vector intrinsics exploits FP32 SIMD operations and thereby gains performance over the existing assembly code. Extend the implementation for handling triangular matrix multiplication, accordingly. As an added benefit, the more flexible C code enables us to adjust register blocking in the subsequent commit. Tested via make -C test / ctest / utest and by a couple of additional unit tests that exercise blocking. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 15:59:51 +02:00
Marius Hillenbrand	43c0d4f312	s390x: Add vectorized sgemm kernel for Z14 and newer Add a new GEMM kernel implementation to exploit the FP32 SIMD operations introduced with z14 and employ it for SGEMM on z14 and newer architectures. The SIMD extensions introduced with z13 support operations on double-sized scalars in vector registers. Thus, the existing SGEMM code would extend floats to doubles before operating on them. z14 extended SIMD support to operations on 32-bit floats. By employing these instructions, we can operate on twice the number of scalars per instruction (four floats in each vector register) and avoid the conversion operations. The code is written in C with explicit vectorization. In experiments, this kernel improves performance on z14 and z15 by around 2x over the current implementation in assembly. The flexibility of the C code paves the way for adjustments in subsequent commits. Tested via make -C test / ctest / utest and by a couple of additional unit tests that exercise blocking (e.g., partial register blocks with fewer than UNROLL_M rows and/or fewer than UNROLL_N columns). Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 15:59:51 +02:00
int_13h	96ad579428	add in runtime cpu detection for zarch (#2349)	2019-12-31 18:03:27 +01:00
Andreas Arnez	d117dfd505	Change bad usage of "asum" to "sum" in ZARCH versions of ?sum The ZARCH implementations of ?sum contain a cut & paste-error: An inline assembly argument is named "sum", but the assembly references "asum" instead. The mismatch causes a build error. This is fixed.	2019-11-21 13:49:13 +01:00
Martin Kroeker	246ca29679	Add ZARCH implementation of ?sum as trivial copies of the respective ?asum kernels with the ABS and vflpsb calls removed	2019-03-30 22:49:05 +01:00
maamountki	0a54c98b9d	[ZARCH] Modify constraints	2019-02-13 21:06:25 +02:00
maamountki	bec54ae366	[ZARCH] Fix caxpy	2019-02-13 12:54:35 +02:00
maamountki	f583674109	[ZARCH] Fix cgemv_t_4	2019-02-12 13:12:28 +02:00
maamountki	77fe70019f	[ZARCH] Fix constraints and source code formatting	2019-02-11 16:01:13 +02:00
maamountki	7039770165	[ZARCH] Undo the last commit	2019-02-06 20:11:44 +02:00
maamountki	11a43e8116	[ZARCH] Set alignment hint for vl/vst	2019-02-05 19:17:08 +02:00
maamountki	61526480f9	[ZARCH] Fix copy constraint	2019-02-05 07:51:19 +02:00
maamountki	81daf6bc38	[ZARCH] Format source code, Fix constraints	2019-02-05 07:30:38 +02:00
Martin Kroeker	874df65491	Fix incorrect sgemv results for IBM z14 part of PR #1993 that was inadvertently misplaced into the toplevel directory	2019-02-01 12:58:59 +01:00
Martin Kroeker	877023e1e1	Fix precision of zarch DSDOT from patch provided by aarnez in #991	2019-01-31 21:22:26 +01:00
Martin Kroeker	265142edd5	Fix typo in the zarch min/max kernels from patch provided by aarnez in #991	2019-01-31 21:21:40 +01:00
maamountki	29416cb5a3	[ZARCH] Add Z13 version for max/min functions	2019-01-31 19:11:11 +02:00
maamountki	48b9b94f7f	[ZARCH] Improve loading performance for camax/icamax	2019-01-31 18:52:11 +02:00
maamountki	fcd814a8d2	[ZARCH] Fix bug in max/min functions	2019-01-29 17:59:38 +02:00
maamountki	dc4d3bccd5	[ZARCH] Fix icamax/icamin	2019-01-29 03:47:49 +02:00
maamountki	c7143c1019	[ZARCH] Fix iamax/imax single precision	2019-01-28 17:52:23 +02:00
maamountki	04873bb174	[ZARCH] Undo the last commit	2019-01-28 17:32:24 +02:00
maamountki	c8ef9fb220	[ZARCH] Fix bug in iamax/iamin/imax/imin	2019-01-28 17:16:18 +02:00
maamountki	b111829226	[ZARCH] Update max/min functions	2019-01-21 15:56:04 +02:00
maamountki	b815a04c87	[ZARCH] fix a bug in max/min functions	2019-01-15 21:04:22 +02:00
maamountki	1a7925b3a3	[ZARCH] Update dgemv_n_4.c	2019-01-11 17:43:11 +02:00
maamountki	406f835f00	[ZARCH] update cgemv_n_4.c	2019-01-11 17:39:17 +02:00
maamountki	621dedb37b	[ZARCH] Update cgemv_t_4.c	2019-01-11 17:37:11 +02:00
maamountki	b731e8246f	Update sgemv_t_4.c	2019-01-11 17:14:04 +02:00
maamountki	ecc31b743f	Update dgemv_t_4.c	2019-01-11 17:13:02 +02:00
maamountki	5d89d6b143	[ZARCH] fix sgemv_n_4.c	2019-01-11 17:08:24 +02:00
maamountki	67432b23c2	[ZARCH] fix cgemv_n_4.c	2019-01-11 16:44:46 +02:00
maamountki	be66f5d5c2	[ZARCH] fix data prefetch type in sdot	2019-01-09 16:50:07 +02:00
maamountki	c2ffef8156	[ZARCH] fix data prefetch type in ddot	2019-01-09 16:49:44 +02:00
maamountki	e7455f500c	[ZARCH] fix dsdot.c	2019-01-09 16:33:54 +02:00
maamountki	3eafcfa650	[ZARCH] fix cgemv_n_4.c	2019-01-09 07:43:45 +02:00
maamountki	94cd946b96	[ZARCH] fix cgemv_n_4.c	2019-01-04 17:45:56 +02:00
maamountki	1aa840a0a2	[ZARCH] fix sgemv_t_4.c	2019-01-04 01:38:18 +02:00
maamountki	e6c0e39492	Optimize Zgemv	2018-08-13 12:23:40 +03:00
maamountki	23229011db	[ZARCH] Z14 support, BLAS 1/2 single precision implementations, Some missing double precision implementations, Gemv optimization	2018-08-06 18:20:40 +03:00
Martin Kroeker	c7b55b6082	Merge pull request #1499 from quickwritereader/develop Implemented missing vsx simd kernels for power8 blas1/2 double. z13 modifications	2018-03-27 21:43:23 +02:00
QWR QWR	28ca97015d	power8: Added initial zgemv_(t|n), i(d|z)amax, i(d|z)amin, dgemv_t (transposed), zrot z13: improved zgemv_(t|n)_4, zscal, zaxpy	2018-03-27 14:54:41 +00:00
Martin Kroeker	22167170b3	Merge pull request #1477 from quickwritereader/develop Power8 blas3 copy-pack routines	2018-02-28 18:46:54 +01:00
Martin Kroeker	58f236ad73	Use generic/dot.c for DSDOT on zarch	2018-02-25 19:52:14 +01:00
Martin Kroeker	e207107150	Use generic/dot.c for DSDOT on z13 The implementation in arm/dot.c has lower precision, as shown by the utest for dsdot.	2018-02-25 19:51:25 +01:00
the mslm	c5425daa6b	power8 ?gemm_tcopy save/restore	2018-02-16 23:36:46 +00:00

67 Commits