Commit Graph

1400 Commits

Author SHA1 Message Date
张丹枫 a1fc6041cd use general register to speedup 2020-05-20 22:26:58 +08:00
张丹枫 edb423d772 align general register using to strmm_kernel_8x8 2020-05-20 22:26:58 +08:00
zhangdanfeng 0e6eb8c247 sgemm kernel use sgemm_kernel_8x8_cortexa53
Signed-off-by: zhangdanfeng <zhangdanfeng@cloudwalk.cn>
2020-05-20 22:26:58 +08:00
zhangdanfeng d475db29c6 optimized for cortex-a53
Signed-off-by: zhangdanfeng <zhangdanfeng@cloudwalk.cn>
2020-05-20 22:26:58 +08:00
Marius Hillenbrand 89fe17f20e s390x: Use new sgemm kernel also for DGEMM and DTRMM on Z14
Apply our new GEMM kernel implementation, written in C with vector intrinsics,
also for DGEMM and DTRMM on Z14 and newer (i.e., architectures with FP32 SIMD
instructions). As a result, we gain around 10% in performance on z15, in
addition to improving maintainability.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-20 10:23:35 +02:00
Marius Hillenbrand bdd795ed03 s390x/GEMM: replace 0-init with peeled first iteration
... since it gains another ~2% of SGEMM and DGEMM performance on z15;
also, the code just called for that cleanup.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-20 10:23:35 +02:00
Marius Hillenbrand 2840432e49 s390x: improvise vector alignment hints for older compilers
Introduce inline assembly so that we can employ vector loads with
alignment hints on older compilers (pre gcc-9), since these are still
used in distributions such as RHEL 8 and Ubuntu 18.04 LTS.

Informing the hardware about alignment can speed up vector loads. For
that purpose, we can encode hints about 8-byte or 16-byte alignment of
the memory operand into the opcodes. gcc-9 and newer automatically emit
such hints, where applicable. Add a bit of inline assembly that achieves
the same for older compilers. Since an older binutils may not know about
the additional operand for the hints, we explicitly encode the opcode in
hex.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-14 15:36:03 +02:00
Marius Hillenbrand 1b0b4349a1 s390x/Z14: Change register blocking for SGEMM to 16x4
Change register blocking for SGEMM (and STRMM) on z14 from 8x4 to 16x4
by adjusting SGEMM_DEFAULT_UNROLL_M and choosing the appropriate copy
implementations. Actually make KERNEL.Z14 more flexible, so that the
change in param.h suffices. As a result, performance for SGEMM improves
by around 30% on z15.

On z14, FP SIMD instructions can operate on float-sized scalars in
vector registers, while z13 could do that for double-sized scalars only.
Thus, we can double the amount of elements of C that are held in
registers in an SGEMM kernel.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-12 15:59:51 +02:00
Marius Hillenbrand 71b6eaf459 s390x: Use new sgemm kernel also for strmm on Z14 and newer
Employ the newly added GEMM kernel also for STRMM on Z14. The
implementation in C with vector intrinsics exploits FP32 SIMD operations
and thereby gains performance over the existing assembly code. Extend
the implementation for handling triangular matrix multiplication,
accordingly. As added benefit, the more flexible C code enables us to
adjust register blocking in the subsequent commit.

Tested via make -C test / ctest / utest and by a couple of additional
unit tests that exercise blocking.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-12 15:59:51 +02:00
Marius Hillenbrand 43c0d4f312 s390x: Add vectorized sgemm kernel for Z14 and newer
Add a new GEMM kernel implementation to exploit the FP32 SIMD
operations introduced with z14 and employ it for SGEMM on z14 and newer
architectures.

The SIMD extensions introduced with z13 support operations on
double-sized scalars in vector registers. Thus, the existing SGEMM code
would extend floats to doubles before operating on them. z14 extended
SIMD support to operations on 32-bit floats. By employing these
instructions, we can operate on twice the number of scalars per
instruction (four floats in each vector registers) and avoid the
conversion operations.

The code is written in C with explicit vectorization. In experiments,
this kernel improves performance on z14 and z15 by around 2x over the
current implementation in assembly. The flexibilty of the C code paves
the way for adjustments in subsequent commits.

Tested via make -C test / ctest / utest and by a couple of additional
unit tests that exercise blocking (e.g., partial register blocks with
fewer than UNROLL_M rows and/or fewer than UNROLL_N columns).

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-12 15:59:51 +02:00
Martin Kroeker 2271c3506b
Work around excessive LAPACK test failures on Skylake-X
Something in the plain C parts of x86_64 cscal.c and zscal.c appears to be miscompiled by both gfortran9 and ifort when compiling for skylakex-avx512, even when the optimized Haswell microkernel is not in use.
2020-05-09 23:49:18 +02:00
Rajalakshmi Srinivasaraghavan bd9ff820bc Fix cmake compilation issue - POWER9
This patch removes extra space in the sgemmotcopy filename
thereby allowing it to create entry in kernel/Makefile
created by cmake.
2020-05-08 20:31:56 -05:00
Ashwin Sekhar T K 8353cb245a ARM64: Improve DAXPY for ThunderX2
Improve performance of DAXPY for ThunderX2
when the vector fits in L1 Cache.
2020-05-07 09:22:50 -07:00
Martin Kroeker 90dba9f716
Duplicate earlier Clang 9.0.0 workaround for corresponding Apple Clang version
As discussed on the original PR #2329, the "Apple Clang 11.0.3" that appears to be based the same LLVM release produces the same miscompilation of this file.
2020-05-05 10:44:50 +02:00
Martin Kroeker 5dd14e3d48
Make building the bfloat16 functions conditional on option BUILD_HALF (#2590)
* make building the bfloat16 BLAS functions conditional on BUILD_HALF

* pass the BUILD_HALF option to gensymbol

* Pass BUILD_HALF as a compiler define for dynamic_arch builds
2020-05-01 09:58:30 +02:00
Martin Kroeker 06208c8d01
Limit this fix to ELFv2 builds 2020-04-22 14:16:40 +02:00
Martin Kroeker f5c4c28b98
Work around POWER8BE bugs on FreeBSD (ELFv2)
for #2299
2020-04-21 17:17:17 +02:00
Martin Kroeker fa42588e1f
Merge pull request #2565 from martin-frbg/mips24k
Support MIPS32 24K family as P5600
2020-04-20 17:13:53 +02:00
Martin Kroeker e55ec82bb9
Delete KERNEL.1004K 2020-04-19 15:44:30 +02:00
Martin Kroeker 7353ea5afc
Delete KERNEL.24K 2020-04-19 15:44:19 +02:00
Martin Kroeker 6a04efb122
Rename KERNEL files to include MIPS prefix 2020-04-19 15:43:54 +02:00
Martin Kroeker d712ea724c
Add MIPS24K support 2020-04-18 21:10:18 +02:00
Rajalakshmi Srinivasaraghavan 22bb50fb81 cmake fixes 2020-04-17 13:35:17 -05:00
Rajalakshmi Srinivasaraghavan 67cc4b9e16 Fix warnings in clang and export symbol 2020-04-15 19:15:23 -05:00
Rajalakshmi Srinivasaraghavan a87793e03c Fix DYNAMIC_ARCH compilation errors 2020-04-15 09:09:50 -05:00
Rajalakshmi Srinivasaraghavan ff010f496e Build shgemm for all architecture 2020-04-14 20:38:53 -05:00
Rajalakshmi Srinivasaraghavan 7eb55504b1 RFC : Add half precision gemm for bfloat16 in OpenBLAS
This patch adds support for bfloat16 data type matrix multiplication kernel.
For architectures that don't support bfloat16, it is defined as unsigned short
(2 bytes).  Default unroll sizes can be changed as per architecture as done for
SGEMM and for now 8 and 4 are used for M and N.  Size of ncopy/tcopy can be
changed as per architecture requirement and for now, size 2 is used.

Added shgemm in kernel/power/KERNEL.POWER9 and tested in powerpc64le and
powerpc64.  For reference, added a small test compare_sgemm_shgemm.c to compare
sgemm and shgemm output.

This patch does not cover OpenBLAS test, benchmark and lapack tests for shgemm.
Complex type implementation can be discussed and added once this is approved.
2020-04-14 14:55:08 -05:00
Martin Kroeker 5b0093b5fe
Convert aligned moves to unaligned
should have no performance impact on reasonably modern cpus and fixes occasional crashes in actual user code.
2020-04-13 14:58:52 +02:00
Martin Kroeker e9bfa2291a
Fix parameter overflow 2020-04-12 19:47:02 +02:00
gxw 8d07cf9b67 Fix compilation problem on loongson platform
Using "make TARGET=GENERIC" on loongson platform will get the following
error messages:
"make[1]: *** No rule to make target 'sgemm_incopy.o', needed by 'libs'"
Add kernel/mips64/KERNEL.generic to slove the problem.
2020-04-09 19:28:15 +08:00
Martin Kroeker 806f89166e
Make ARMV7 compile with xcode and add a CI job for it (#2537)
* Add an ARMV7 iOS build on Travis

* thread_local appears to be unavailable on ARMV7 iOS

* Add no-thumb option for ARMV7 IOS build to get it to accept DMB ISH

* Make local labels in macros of nrm2_vfpv3.S compatible with the xcode assembler
2020-04-02 10:30:37 +02:00
Martin Kroeker c6af9bbb32
Merge pull request #2534 from martin-frbg/issue2496
Fix zero initialization for beta=0 case
2020-03-31 20:53:13 +02:00
Martin Kroeker 144be81ca1
fix initialization to zero in the NEON SGEMM_BETA kernel as well 2020-03-31 16:53:56 +02:00
Martin Kroeker 07cdd5d05c
Fix zero initialization for beta=0 case
use immediate initialization instead of multiplication in case register content is a NaN
2020-03-31 00:21:02 +02:00
Martin Kroeker 567d2760e6
Merge pull request #2520 from wjc404/develop
Fix avx512 sgemm performance bug when ldc is a multiple of 1024
2020-03-30 20:15:59 +02:00
wjc404 b8307768e2
Add files via upload 2020-03-21 05:42:10 +08:00
Martin Kroeker af8a619e1f
Merge pull request #2517 from wjc404/develop
Temporary fix for SKX STRSM
2020-03-17 10:12:53 +01:00
wjc404 62b9608986
Update KERNEL.SKYLAKEX 2020-03-17 12:52:55 +08:00
Martin Kroeker a1b181cea2
Merge pull request #2516 from wjc404/develop
AVX2 STRSM kernels
2020-03-16 21:58:34 +01:00
wjc404 cdc0e9011e
Update KERNEL.ZEN 2020-03-16 16:39:37 +00:00
wjc404 fa049d49c2
AVX2 STRSM kernel 2020-03-17 00:34:08 +08:00
s00548429 bec7923a0d Fix the functional bugs for zamax. 2020-03-09 15:36:50 +08:00
Rajalakshmi Srinivasaraghavan 2afc074803 Fix DYNAMIC_ARCH build for POWER9
Setting DYNAMIC_ARCH=1 on POWER9 does not build POWER9 files due to some
compiler version checks.  This patch fixes some of the macros that are used
to check compiler version.  On fixing those checks, there are some new make
failures related to icamin, icamax, isamin, isamax and caxpy files on POWER9.
This patch fixes those failures as well.
2020-03-03 12:35:10 -06:00
Martin Kroeker 4f371b0fbf
Use POWER8 kernels on big-endian POWER9 for now 2020-03-01 23:45:58 +01:00
Martin Kroeker ea8eec5d17
Merge pull request #2422 from wjc404/develop
Adjust SkylakeX GEMM3M parameters, add an AVX512 STRMM kernel and fix performance bugs in AVX2 s/c/z GEMM
2020-02-29 19:07:35 +01:00
Ali Saidi c623a965f9 Add Neoverse-N1 core
The implementation is a hybird of the ARMV8 one with some of the
improved TX2 rountines along with specifying -march=v8.2-a
2020-02-29 03:22:04 +00:00
wjc404 dd22eb7621
Update cgemm_kernel_8x2_haswell.c 2020-02-27 22:26:15 +08:00
wjc404 2352331e60
Update zgemm_kernel_4x2_haswell.c 2020-02-27 22:25:19 +08:00
wjc404 1b980001dd
Update zgemm_kernel_4x2_haswell.c 2020-02-26 18:38:12 +08:00
wjc404 2515e1152f
Update cgemm_kernel_8x2_haswell.c 2020-02-26 18:36:54 +08:00