OpenBLAS

Commit Graph

Author	SHA1	Message	Date
张丹枫	9df79ae9a3	update sgemm and strmm kernel selecting strategy	2020-05-20 22:26:58 +08:00
张丹枫	a1fc6041cd	use general registers for speedup	2020-05-20 22:26:58 +08:00
张丹枫	edb423d772	align general register usage with strmm_kernel_8x8	2020-05-20 22:26:58 +08:00
zhangdanfeng	0e6eb8c247	sgemm kernel use sgemm_kernel_8x8_cortexa53 Signed-off-by: zhangdanfeng <zhangdanfeng@cloudwalk.cn>	2020-05-20 22:26:58 +08:00
zhangdanfeng	d475db29c6	optimized for cortex-a53 Signed-off-by: zhangdanfeng <zhangdanfeng@cloudwalk.cn>	2020-05-20 22:26:58 +08:00
Martin Kroeker	729ac6bd4a	Merge pull request #2623 from mhillenibm/zarch_dgemm_z14 s390x: Use new sgemm kernel also for DGEMM and DTRMM on Z14 (+ small cleanup)	2020-05-20 14:51:04 +02:00
Marius Hillenbrand	89fe17f20e	s390x: Use new sgemm kernel also for DGEMM and DTRMM on Z14 Apply our new GEMM kernel implementation, written in C with vector intrinsics, also for DGEMM and DTRMM on Z14 and newer (i.e., architectures with FP32 SIMD instructions). As a result, we gain around 10% in performance on z15, in addition to improving maintainability. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-20 10:23:35 +02:00
Marius Hillenbrand	bdd795ed03	s390x/GEMM: replace 0-init with peeled first iteration ... since it gains another ~2% of SGEMM and DGEMM performance on z15; also, the code just called for that cleanup. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-20 10:23:35 +02:00
Martin Kroeker	e1038ea836	Merge pull request #2622 from martin-frbg/issue2619 Improve declaration of LAPACKE_get_nancheck	2020-05-19 23:07:22 +02:00
Martin Kroeker	6baa9a778d	Improve declaration of LAPACKE_get_nancheck	2020-05-19 17:59:31 +02:00
Martin Kroeker	cf46c9f84e	Merge pull request #2617 from martin-frbg/issue2616 Add workaround for unhandled gmake jobserver flags in c_check/f_check	2020-05-18 13:23:58 +02:00
Martin Kroeker	55602fce56	Ignore spurious all-numeric library names derived from mishandled jobserver flags	2020-05-17 15:28:14 +02:00
Martin Kroeker	3d5e159e7a	Ignore spurious all-numeric library names derived from mishandled jobserver flags	2020-05-17 15:26:57 +02:00
Martin Kroeker	2931feb575	Merge pull request #58 from xianyi/develop rebase	2020-05-17 15:23:32 +02:00
Martin Kroeker	20245ded5f	Merge pull request #2615 from mhillenibm/z14_alignment_hints s390x: improvise vector alignment hints for older compilers	2020-05-14 21:06:34 +02:00
Marius Hillenbrand	2840432e49	s390x: improvise vector alignment hints for older compilers Introduce inline assembly so that we can employ vector loads with alignment hints on older compilers (pre gcc-9), since these are still used in distributions such as RHEL 8 and Ubuntu 18.04 LTS. Informing the hardware about alignment can speed up vector loads. For that purpose, we can encode hints about 8-byte or 16-byte alignment of the memory operand into the opcodes. gcc-9 and newer automatically emit such hints, where applicable. Add a bit of inline assembly that achieves the same for older compilers. Since an older binutils may not know about the additional operand for the hints, we explicitly encode the opcode in hex. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-14 15:36:03 +02:00
Martin Kroeker	ea78106c71	Merge pull request #2614 from mhillenibm/gemm_vec_z14 s390x: Improve performance of SGEMM and STRMM on z14 and newer	2020-05-13 15:09:23 +02:00
Marius Hillenbrand	cb9dc36dd5	Update CONTRIBUTORS.md Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 16:14:00 +02:00
Marius Hillenbrand	1b0b4349a1	s390x/Z14: Change register blocking for SGEMM to 16x4 Change register blocking for SGEMM (and STRMM) on z14 from 8x4 to 16x4 by adjusting SGEMM_DEFAULT_UNROLL_M and choosing the appropriate copy implementations. Actually make KERNEL.Z14 more flexible, so that the change in param.h suffices. As a result, performance for SGEMM improves by around 30% on z15. On z14, FP SIMD instructions can operate on float-sized scalars in vector registers, while z13 could do that for double-sized scalars only. Thus, we can double the amount of elements of C that are held in registers in an SGEMM kernel. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 15:59:51 +02:00
Marius Hillenbrand	71b6eaf459	s390x: Use new sgemm kernel also for strmm on Z14 and newer Employ the newly added GEMM kernel also for STRMM on Z14. The implementation in C with vector intrinsics exploits FP32 SIMD operations and thereby gains performance over the existing assembly code. Extend the implementation for handling triangular matrix multiplication, accordingly. As an added benefit, the more flexible C code enables us to adjust register blocking in the subsequent commit. Tested via make -C test / ctest / utest and by a couple of additional unit tests that exercise blocking. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 15:59:51 +02:00
Marius Hillenbrand	43c0d4f312	s390x: Add vectorized sgemm kernel for Z14 and newer Add a new GEMM kernel implementation to exploit the FP32 SIMD operations introduced with z14 and employ it for SGEMM on z14 and newer architectures. The SIMD extensions introduced with z13 support operations on double-sized scalars in vector registers. Thus, the existing SGEMM code would extend floats to doubles before operating on them. z14 extended SIMD support to operations on 32-bit floats. By employing these instructions, we can operate on twice the number of scalars per instruction (four floats in each vector register) and avoid the conversion operations. The code is written in C with explicit vectorization. In experiments, this kernel improves performance on z14 and z15 by around 2x over the current implementation in assembly. The flexibility of the C code paves the way for adjustments in subsequent commits. Tested via make -C test / ctest / utest and by a couple of additional unit tests that exercise blocking (e.g., partial register blocks with fewer than UNROLL_M rows and/or fewer than UNROLL_N columns). Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 15:59:51 +02:00
Marius Hillenbrand	d7c1677c20	Update CONTRIBUTORS.md, adding myself Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 11:09:28 +02:00
Marius Hillenbrand	0dbe61a612	s390x: choose SIMD kernels at run-time based on OS and compiler support Extend and simplify the run-time detection for dynamic architecture support for z to check HW_CAP and only use SIMD features if advertised by the OS. While at it, also honor the env variable LD_HWCAP_MASK and do not use the CPU features masked there. Note that we can only use the SIMD features on z13 or newer (i.e., Vector Facility or Vector-Enhancements Facilities) when the operating system supports properly context-switching the vector registers. The OS advertises that support as a bit in the HW_CAP value in the auxiliary vector. While all recent Linux kernels have that support, we should maintain compatibility with older versions that may still be in use. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 11:01:16 +02:00
Marius Hillenbrand	62cf391cbb	s390x: only build kernels supported by gcc with dynamic arch support When building with dynamic arch support, only build kernels for architectures that are supported by the gcc we are building with. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 11:01:16 +02:00
Marius Hillenbrand	8c338616f9	s390x: gate dynamic arch detection on gcc version and add generic When building OpenBLAS with DYNAMIC_ARCH=1 on s390x (aka zarch), make sure to include support for systems without the facilities introduced with z13 (i.e., zarch_generic). Adjust runtime detection to fall back to that generic code when running on an unknown platform other than Z13 through Z15. When detecting a Z13 or newer system, add a check for gcc support for the architecture-specific features before selecting the respective kernel. Fall back to Z13 or generic code otherwise. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 11:01:16 +02:00
Martin Kroeker	f94c53ec0a	Merge pull request #2612 from RajalakshmiSR/testshgemm Improve shgemm test	2020-05-12 08:34:02 +02:00
Rajalakshmi Srinivasaraghavan	8efba9b7c0	Improve shgemm test This patch adds another check to test shgemm results.	2020-05-11 17:15:10 -05:00
Martin Kroeker	4fffa556d8	Merge pull request #2611 from RajalakshmiSR/bench_half Include shgemm in benchtest	2020-05-11 21:08:41 +02:00
Rajalakshmi Srinivasaraghavan	ce90e2bd3f	Include shgemm in benchtest This patch is to enable benchtest for half precision gemm when BUILD_HALF is set during make.	2020-05-11 09:57:46 -05:00
Martin Kroeker	948b6712ba	Merge pull request #2610 from martin-frbg/issue2552-3 Temporary workaround for excessive LAPACK test failures with COMPLEX on Skylake-X	2020-05-10 13:10:31 +02:00
Martin Kroeker	2271c3506b	Work around excessive LAPACK test failures on Skylake-X Something in the plain C parts of x86_64 cscal.c and zscal.c appears to be miscompiled by both gfortran9 and ifort when compiling for skylakex-avx512, even when the optimized Haswell microkernel is not in use.	2020-05-09 23:49:18 +02:00
Martin Kroeker	db00b21445	Merge pull request #2609 from martin-frbg/issue2552-2 Correct ifort options	2020-05-09 21:33:02 +02:00
Martin Kroeker	58d26b4448	Correct ifort options to same as suggested by reference-lapack	2020-05-09 17:15:36 +02:00
Martin Kroeker	8e47d14053	Merge pull request #2608 from martin-frbg/issue2604 Handle trailing whitespace and empty variables in KERNEL files	2020-05-09 16:36:14 +02:00
Martin Kroeker	cd10b35fe9	Handle trailing spaces and empty condition variables	2020-05-09 13:42:33 +02:00
Martin Kroeker	9472dd99cd	Merge pull request #57 from xianyi/develop rebase	2020-05-09 13:20:44 +02:00
Martin Kroeker	7181665452	Merge pull request #2605 from RajalakshmiSR/cmake-power Fix cmake compilation issue - POWER9	2020-05-09 11:29:28 +02:00
Rajalakshmi Srinivasaraghavan	bd9ff820bc	Fix cmake compilation issue - POWER9 This patch removes an extra space in the sgemmotcopy filename, thereby allowing it to create an entry in the kernel/Makefile created by cmake.	2020-05-08 20:31:56 -05:00
Martin Kroeker	63e45def70	Merge pull request #2603 from martin-frbg/issue2552 Add FFLAGS_DRV entry to the generated make.inc to fix lapack-test failure with Intel compilers	2020-05-08 22:08:39 +02:00
Martin Kroeker	ec0f228632	Add FFLAGS_DRV to the generated make.inc to fix lapack-test on x86_64 with icc/ifort fixes #2552	2020-05-08 18:06:12 +02:00
Martin Kroeker	90e2941c61	Merge pull request #56 from xianyi/develop rebase	2020-05-07 22:43:48 +02:00
Martin Kroeker	10d5f3c87b	Merge pull request #2602 from ashwinyes/thunderx2_develop DAXPY Optimizations for ThunderX2	2020-05-07 22:06:41 +02:00
Ashwin Sekhar T K	8353cb245a	ARM64: Improve DAXPY for ThunderX2 Improve performance of DAXPY for ThunderX2 when the vector fits in L1 Cache.	2020-05-07 09:22:50 -07:00
Martin Kroeker	ec2dd7b875	Merge pull request #2601 from martin-frbg/issue818 Undefine NAME/CNAME etc in Makefile.system before defining them	2020-05-07 10:12:33 +02:00
Martin Kroeker	4e82eb9f8a	Undefine ASMNAME/NAME/CNAME before defining them to avoid redefinition warning when environment variables like CFLAGS are being used (fixes #818)	2020-05-07 00:31:32 +02:00
Martin Kroeker	61300bb735	Merge pull request #55 from xianyi/develop rebase	2020-05-07 00:27:14 +02:00
Martin Kroeker	33e9b12464	Merge pull request #2597 from martin-frbg/appleclang Use Clang 9.0.0 miscompilation fix for corresponding AppleClang version as well	2020-05-05 13:55:08 +02:00
Martin Kroeker	90dba9f716	Duplicate earlier Clang 9.0.0 workaround for corresponding Apple Clang version As discussed on the original PR #2329, the "Apple Clang 11.0.3" that appears to be based on the same LLVM release produces the same miscompilation of this file.	2020-05-05 10:44:50 +02:00
Martin Kroeker	424d551e01	Merge pull request #53 from xianyi/develop rebase	2020-05-01 15:18:46 +02:00
Martin Kroeker	596f5df9e8	Merge pull request #2591 from RajalakshmiSR/testhalf Add test for shgemm	2020-05-01 09:59:39 +02:00

7452 Commits
