Commit Graph

1473 Commits

Author SHA1 Message Date
Marius Hillenbrand 22aa81f3e5 s390x: fix cscal and zscal implementations
The implementation of complex scalar * vector multiplication for Z14
makes some LAPACK tests fail because the numerical differences to the
reference implementation exceed the threshold (as can be seen by running
make lapack-test and replacing kernel/zarch/cscal.c with a generic
implementation for comparison).

The complex multiplication uses terms of the form a * b + c * d for both
real and imaginary parts. The assembly code (and compiler-emitted code
as well) uses fused multiply add operations for the second product and
sum. The results can be "surprising", for example when both terms in the
imaginary part nearly cancel each other out. In that case, the second
product contributes more digits to the sum than the first product that
has been rounded before.

One option is to use separate multiplications (which then round the same
way) and a distinct add. Change the code to pursue that path, by (1)
requesting the compiler not to contract the operations into FMAs and (2)
replacing the assembly kernel with corresponding vectorized C code
(where change 1 also applies).

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-21 13:10:05 +02:00
Marius Hillenbrand f91057cbad s390x: move common vector definitions and utils into header
... to facilitate reuse beyond gemm_vec.c and avoid code duplication.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-21 11:32:08 +02:00
Rajalakshmi Srinivasaraghavan be43d2cb96 Optimize daxpy/zaxpy for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores. Tested in simulator and no new failures.
2020-09-17 12:56:28 -05:00
Martin Kroeker 91c84e1c01
Merge pull request #2796 from Guobing-Chen/BF16_dot_coversion_apis
Add bfloat16 based dot and conversion with single/double
2020-09-14 15:00:19 +02:00
Martin Kroeker e72430fe46
Merge pull request #2803 from xiegengxin/AVX2-asum
Implementaion of dasum, sasum with AVX2 & AVX512 intrinsic
2020-09-06 18:32:15 +02:00
Chen, Guobing deaeb6c5b8 Add bfloat16 based dot and conversion with single/double
1. Added bfloat16 based dot as new API: shdot
2. Implemented generic kernel and cooperlake-specific (AVX512-BF16) kernel for shdot
3. Added 4 conversion APIs for bfloat16 data type <=> single/double: shstobf16 shdtobf16 sbf16tos dbf16tod
     shstobf16 -- convert single float array to bfloat16 array
     shdtobf16 -- convert double float array to bfloat16 array
     sbf16tos  -- convert bfloat16 array to single float array
     dbf16tod  -- convert bfloat16 array to double float array
4. Implemented generic kernels for all 4 conversion APIs, and cooperlake-specific kernel for shstobf16 and shdtobf16
5. Update level1 thread facilitate functions and macros to support multi-threading for these new APIs
6. Fix Cooperlake platform detection/specify issue when under dynamic-arch building
7. Change the typedef of bfloat16 from unsigned short to more strict uint16_t

Signed-off-by: Chen, Guobing <guobing.chen@intel.com>
2020-09-04 02:31:25 +08:00
Martin Kroeker 775a87242d
Rename KERNEL.SILICON to KERNEL.VORTEX 2020-09-03 08:44:20 +02:00
Gengxin Xie 1b0f17eeed align to 64, using SSE when input size is small 2020-09-03 14:25:54 +08:00
Martin Kroeker 80794fe8fd
Create KERNEL.SILICON 2020-09-02 22:56:58 +02:00
Marius Hillenbrand 2ee5b899ce s390x: enable S/DGEMM block with explicit loop unrolling + interleaving with clang
The code for SGEMM 16x4 and DGEMM 8x4 blocks on z14 and z15 uses
explicit unrolling and interleaving to improve performance. The code
employs an empty inline asm statement with operands that constrain the
compiler's instruction scheduling and thereby enforce proper overlapping
of load and compute phases. Fix an ifdef to apply that for clang builds,
as well.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-02 13:49:31 +02:00
Marius Hillenbrand 87e5bbd887 s390x: avoid variable-length arrays in struct for asm operands
... since it is not required and clang does not support that gcc
extension. Instead, use a variable-length array directly for these
operands.

Note that, while the actual inline assembly code does not directly use
these memory operands, they serve to inform the compiler that it cannot
reorder reads or writes to/from the input and output data across the
inline asm statements.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-02 13:49:31 +02:00
Marius Hillenbrand b9b3265ec8 s390x: avoid inline assembly for vector loads for clang
... since clang does not support the instruction format for inline
assembly and also it is not required for current versions of clang.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-02 13:49:30 +02:00
Marius Hillenbrand a1616a0b86 s390x: replace nop with "nop 0" in inline assembly
... as a bandaid for building with clang until LLVM's internal assembler
supports nops without operand.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-02 13:49:30 +02:00
Marius Hillenbrand 60ef193258 s390x: use "lghi" for immediate values to fix build with clang
Some of the kernels written in assembly utilize a "load address"
instruction for loading an immediate value into a register. That is
both unnecessarily complex and LLVM's assembler does not understand that
specific syntax. Thus, replace with the appropriate "load immediate"
instruction, which is also clearer to read.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-02 13:49:30 +02:00
Gengxin Xie 448152cdd8 define __AVX2__ to ensure the haswell code compiled with avx2 2020-08-31 14:39:08 +08:00
Gengxin Xie cb3c190a3a Implementaion of dasum, sasum with AVX2 & AVX512 intrinsic 2020-08-31 11:44:08 +08:00
Rajalakshmi Srinivasaraghavan 317ff27cda POWER10: Avoid setting accumulators to zero in gemm kernels
For the first iteration, it is better to use xvf*ger instead of xvf*gerpp
builtins which helps to avoid setting accumulators to zero. This helps
to reduce few instructions.
2020-08-28 10:42:54 -05:00
Martin Kroeker b2053239fc
Fix mssing dummy parameter (imag part of alpha) of zdot_thread_function 2020-08-23 15:08:16 +02:00
Martin Kroeker 9ee21a0a39
Merge pull request #2780 from Guobing-Chen/CPL_build_support
Enable COOPERLAKE build target
2020-08-20 19:54:29 +02:00
Martin Kroeker 6f4dc7445d
Fix typo 2020-08-19 16:36:55 +02:00
Martin Kroeker 81fbe8d088
-march=cooperlake only available in gcc >= 10 2020-08-19 16:10:15 +02:00
Martin Kroeker 75eeb265d7
[WIP] Refactor the driver code for direct SGEMM (#2782)
Move "direct SGEMM" functionality out of the SkylakeX SGEMM kernel and make it available
(on x86_64 targets only for now) in DYNAMIC_ARCH builds
* Add  sgemm_direct targets in the kernel Makefile.L3 and CMakeLists.txt
* Add direct_sgemm functions to the gotoblas struct in common_param.h
* Move sgemm_direct_performant helper to separate file
* Update gemm.c  to macros for sgemm_direct to support dynamic_arch naming via common_s,h
* (Conditionally) add sgemm_direct functions in setparam-ref.c
2020-08-19 14:51:09 +02:00
Chen, Guobing e740c4873d Enable COOPERLAKE build target
Enable new build target platform -- COOPERLAKE. This target platform
supports all the SKYLAKEX supported ISAs + avx512bf16. So all the
SKYLAKEX specific kernels/drivers and related code are now extended
to be also active on COOPERLAKE. Besides, new BF16 related kernels
are active under this target.
2020-08-13 06:18:00 +08:00
Martin Kroeker cbbe38bb88
Merge pull request #2772 from mhillenibm/s390x_gemm_tuning
s390x: GEMM tuning for z14
2020-08-11 18:14:09 +02:00
Marius Hillenbrand 07c334e7be s390x: Factor out small block sizes for SGEMM/DGEMM on z14
For small register blockings that are too small to fill up vector
registers with column vectors, we currently use a generic code block.
Replace that with instantiations of the generic code as individual
functions, so that the compiler can optimize each one separately.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-08-11 12:56:39 +02:00
Marius Hillenbrand e2828e30aa s390x: Optimize SGEMM/DGEMM blocks for z14 with explicit loop unrolling/interleaving
Improve performance of SGEMM and DGEMM on z14 and z15 by unrolling and
interleaving the inner loop of the SGEMM 16x4 and DGEMM 8x4 blocks.
Specifically, we explicitly interleave vector register loads and
computation of two iterations.

Note that this change only adds one C function, since SGEMM 16x4 and
DGEMM 8x4 actually map to the same C code: they both hold intermediate
results in a 4x4 grid of vector registers, and the C implementation is
built around that.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-08-11 12:55:42 +02:00
Rajalakshmi Srinivasaraghavan 475b5c95b9 Remove extra symbol in Makefile
While trying out different unroll values, noted that
make failed due to this extra symbol.
2020-08-07 15:27:44 -05:00
Martin Kroeker 81dcfdcf39
Multiply by 2 instead of left-shifting a potentially negative number
fixes GCC ubsan warning in the BLAS tests
2020-08-02 18:29:56 +02:00
Martin Kroeker 0ef4b3f1f2
Multiply instead of doing a left shift of a potentially negative number
fixes GCC ubsan report in the BLAS tests
2020-08-02 18:27:40 +02:00
Martin Kroeker aa53a8a5cb
Multiply by two instead of left-shifting one place
fixes GCC ubsan report of "left shift of negative value -2" in the BLAS tests
2020-08-02 18:25:09 +02:00
Martin Kroeker aa3a1e7d8c
Multiply by two rather than left shift by one place
fixes GCC ubsan report of "left shift of negative value -2" in the BLAS tests
2020-08-02 18:22:31 +02:00
Rajalakshmi Srinivasaraghavan f77b6a83f4 dgemv optimization for POWER10
Making use of new vector pair POWER10 instructions in dgemv_n and dgemv_t.
Also adding a new block 4x128 to make use of Matrix-Multiply Assist (MMA)
feature introduced in POWER ISA v3.1.  Tested on simulator and there
are no new test failures.
2020-07-29 18:59:32 -05:00
Rajalakshmi Srinivasaraghavan d557584b71 Fix compilation issues with clang on POWER
As gcc defaults to -malign-power, removing that option. Also
adding -fno-integrated-as to use GNU assembler for powerpc
assembly optimization files. Fixed other compilation errors
reported in dgemv_t.c file.
2020-07-27 14:11:07 -05:00
Ashwin Sekhar T K 4e1be0e481 ARM64: Add THUNDERX3T110 Target 2020-07-26 23:32:24 -07:00
Rajalakshmi Srinivasaraghavan 9be2688c78 Fix to store results in correct order for POWER10 GEMM kernels
There is a recent compiler change in __builtin_mma_disassemble_acc() which
affects the order of storing result in POWER10. Also removing new LDFLAG
-mno-power10-stub as it is handled by linker automatically.
2020-07-24 23:08:11 -05:00
Martin Kroeker 6a2a60038c
Merge pull request #2720 from martin-frbg/issue2694
WIP Further fixes for 32bit POWER8
2020-07-24 23:19:45 +02:00
Martin Kroeker 251a09ec90
Typo fix 2020-07-24 16:04:58 +00:00
Martin Kroeker 95d37e1575
Regroup the 32 and 64bit sections and restore 64bit CAXPY 2020-07-24 10:13:46 +00:00
Martin Kroeker 3523bb778e
Merge pull request #2721 from martin-frbg/p8align
Fix alignment errors in the power8 saxpy kernel
2020-07-24 11:06:20 +02:00
Martin Kroeker bf1f0734ff
Use OPENBLAS_MAKE_COMPLEX_FLOAT on PPC only 2020-07-23 20:40:13 +00:00
Martin Kroeker ca3561cab9
Add ifdefs around call to altivec microkernel 2020-07-23 18:30:42 +00:00
Martin Kroeker 21072e502a
Typo fix 2020-07-23 17:34:56 +00:00
Martin Kroeker 7c6e56b5df
Rewrite assignment to complex for better portability 2020-07-23 17:10:59 +02:00
Martin Kroeker 661c6bfa5a
Exclude altivec code paths if the compiler does not support them 2020-07-23 17:08:20 +02:00
Martin Kroeker 0033f8be0d
Use vec_vsx_ld/st to fix misaligned accesses flagged by asan 2020-07-16 23:32:54 +02:00
Martin Kroeker f308e741b2
remove debug output and revert changes to cdot and crot 2020-07-15 10:00:07 +02:00
Martin Kroeker da17abec87
fix trailing whitespace 2020-07-14 18:20:03 +02:00
Martin Kroeker f8c2697701
Use POWER6 GEMM, TRMM and DTRSM on 32bit POWER8 2020-07-14 18:11:19 +02:00
Martin Kroeker b144423f0f
Do not define USE_TRMM for 32bit POWER8 2020-07-14 18:10:12 +02:00
Martin Kroeker ed7e155c35
Merge branch 'develop' into aix 2020-07-07 18:52:06 +02:00