Gordon Fossum
213c0e7abb
Added special unrolled vectorized versions of "Solve" for specific sizes,
...
in DTRSM and STRSM, to improve performance in Power9 and Power10.
2020-12-04 17:07:06 -06:00
Martin Kroeker
441c08c9ff
Merge pull request #3016 from xiegengxin/complex-asum
...
Improve the performance of zasum and casum with AVX512 intrinsic
2020-12-04 22:07:16 +01:00
Gengxin Xie
0cb7a403b2
fix error declare function blas_level1_thread_with_return_value
2020-12-02 09:51:52 +08:00
Gengxin Xie
b766c1e9bb
Improve the performance of zasum and casum with AVX512 intrinsic
2020-12-01 16:49:26 +08:00
Rajalakshmi Srinivasaraghavan
7d46e31de1
POWER10: Optimize dgemv_n
...
Handling as 4x8 with vector pairs gives better performance than
existing code in POWER10.
2020-11-29 15:28:28 -06:00
Martin Kroeker
f1bf040b25
Merge pull request #2988 from xiegengxin/smp-asum
...
Improve the performance of dasum and sasum when SMP is defined
2020-11-22 12:24:13 +01:00
Xianyi Zhang
7037849498
Merge branch 'develop' into risc-v
2020-11-22 16:04:50 +08:00
Martin Kroeker
7e9cb39a25
Merge pull request #2981 from Qiyu8/fix-sum
...
Fix sum optimize issues
2020-11-16 08:40:46 +01:00
Gengxin Xie
d6e7e05bb3
Improve the performance of dasum and sasum when SMP is defined
2020-11-13 14:20:52 +08:00
Qiyu8
ae0b1dea19
modify system.cmake to enable fma flag
2020-11-13 10:20:24 +08:00
Qiyu8
e0dac6b53b
fix the CI failure of target specific option mismatch
2020-11-12 20:31:03 +08:00
Qiyu8
e5c2ceb675
fix the CI failure of lack the head
2020-11-12 17:35:17 +08:00
Qiyu8
a87e537b8c
modify macro
2020-11-11 15:53:48 +08:00
Qiyu8
5bc0a7583f
only FMA3 and vector larger than 128 have positive effects.
2020-11-11 15:18:01 +08:00
Qiyu8
8c0b206d4c
Optimize the performance of rot by using universal intrinsics
2020-11-11 14:33:12 +08:00
Qiyu8
c4c591ac5a
fix sum optimize issues
2020-11-10 16:16:38 +08:00
Xianyi Zhang
fc35b72ae1
Refs #2899
...
Merge branch 'openblas-open-910' of git://github.com/damonyu1989/OpenBLAS into damonyu1989-openblas-open-910
2020-11-10 09:38:04 +08:00
Xianyi Zhang
913cc9a4ca
Merge branch 'develop' into risc-v
2020-11-10 09:18:25 +08:00
Martin Kroeker
ff16329cb7
Merge pull request #2972 from xiegengxin/rot-intrinsic
...
Improve the performance of rot by using AVX512 and AVX2 intrinsic
2020-11-08 22:43:00 +01:00
Martin Kroeker
110c7a6de0
Merge pull request #2979 from RajalakshmiSR/dot_power10
...
Optimize sdot/ddot for POWER10
2020-11-08 10:19:34 +01:00
Rajalakshmi Srinivasaraghavan
6e364981a8
Optimize sdot/ddot for POWER10
...
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2020-11-07 15:21:58 -06:00
Martin Kroeker
b976a0bf40
Remove previous workaround for compiler flags related to cpu capabilities in x86_64 DYNAMIC_ARCH builds
2020-11-07 20:39:56 +01:00
Martin Kroeker
ff74319ea5
Merge pull request #2977 from martin-frbg/issue2976
...
Fix macro name used in ifdef for POWERPC/PGI
2020-11-07 14:41:34 +01:00
Martin Kroeker
28d2dfe2b3
Fix macro name used in ifdef
2020-11-07 12:17:49 +01:00
Gengxin Xie
725ffbf041
fix typo
2020-11-05 16:25:17 +08:00
Gengxin Xie
d9ba49165a
Improve the performance of rot by using AVX512 and AVX2 intrinsic
2020-11-05 15:12:36 +08:00
Rajalakshmi Srinivasaraghavan
dd7a9cc5bf
POWER10: Change dgemm unroll factors
...
Changing the unroll factors for dgemm to 8 shows improved performance with
POWER10 MMA feature. Also made some minor changes in sgemm for edge cases.
2020-10-31 18:28:57 -05:00
Rajalakshmi Srinivasaraghavan
b435491885
Optimize caxpy for POWER10
...
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2020-10-29 14:57:51 -05:00
Chen, Guobing
a7b1f9b1bb
Implementation of BF16 based gemv
...
1. Add a new API -- sbgemv to support bfloat16 based gemv
2. Implement a generic kernel for sbgemv
3. Implement an avx512-bf16 based kernel for sbgemv
Signed-off-by: Chen, Guobing <guobing.chen@intel.com>
2020-10-29 02:08:23 +08:00
Martin Kroeker
67f39ad813
Merge pull request #2939 from thrasibule/Makefile_cleanup
...
reuse variables defined in Makefile.system
2020-10-28 09:38:40 +01:00
Rajalakshmi Srinivasaraghavan
c24ba8b1dd
Optimize saxpy for POWER10
...
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
2020-10-26 13:24:59 -05:00
Martin Kroeker
6f9460f0f6
Merge pull request #2937 from martin-frbg/pwr-buffersz
...
Increase and unify BUFFERSIZE on POWER;fix gcc inline warning
2020-10-23 07:15:32 +02:00
Guillaume Horel
1917a4e7b8
reuse variables defined in Makefile.system
2020-10-22 22:04:25 -04:00
Martin Kroeker
34c3c407ef
label always_inline function as inline to silence a gcc warning
2020-10-22 22:14:26 +02:00
Martin Kroeker
2e48d560ba
Fix compiler version check
2020-10-22 16:23:29 +02:00
Rajalakshmi Srinivasaraghavan
ad745c0bae
Optimize scopy/ccopy for POWER10
...
This patch makes use of new POWER10 vector pair instructions for
loads and stores. Also reorganized all variants of copy functions
to make use of same kernel.
2020-10-21 09:53:45 -05:00
İsmail Dönmez
4a1d00f589
Fix build with -Werror=return-type
...
dgemm_tcopy_16_skylakex.c CNAME function should return an int, add a
return 0 similar to other files.
2020-10-21 08:43:39 +02:00
Bart Oldeman
b073d759d0
x86_64: clobber all xmm registers after vzeroupper
...
As observed using GCC 10 using -march=native -ftree-vectorize
on Knights Landing, it is now smart enough to find clobbers inside
non-inlined static functions.
In particular, sgemv counted on a kernel to preserve the whole
%ymm2 register (since it was not in the clobber list), but the top
part was destroyed by vzeroupper. This caused many tests to fail.
This patch makes sure all xmm (and ymm/zmm by extension) registers
are listed as clobbered to avoid this happening, as most kernels
already did correctly in fact.
2020-10-20 02:16:47 +00:00
Martin Kroeker
dc6e44c3f8
Merge pull request #2916 from martin-frbg/issue2911
...
Clean up duplicate definitions in POWER8 kernels and fix power10 option passing
2020-10-19 23:33:31 +02:00
Martin Kroeker
a61c086408
Fix spurious trailing whitespace in comment
2020-10-19 09:12:12 +02:00
Bart Oldeman
03e781b766
sgemm_direct_skylakex: fix 75eeb26
regression.
...
The
`#if defined(SKYLAKEX) || defined (COOPERLAKE)`
from that commit was before #include "common.h" so caused the
compiled function to be empty, returning garbage results for
qualifying sgemm's on those architectures.
Closes #2914
2020-10-18 19:58:07 +00:00
Martin Kroeker
f1a4071d8c
Clean up STACKSIZE redefinition
2020-10-18 19:41:43 +02:00
Martin Kroeker
97cf10062f
Clean up STACKSIZE redefinition
2020-10-18 19:39:18 +02:00
Martin Kroeker
17e288e18d
Clean up STACKSIZE redefinition
2020-10-18 19:37:04 +02:00
Martin Kroeker
c1422f3e46
Clean up STACKSIZE redefinition
2020-10-18 19:31:01 +02:00
Martin Kroeker
d85b24e103
Clean up STACKSIZE redefinition
2020-10-18 19:29:45 +02:00
Zhang Xianyi
d7ba7679b6
Merge branch 'develop' into risc-v
2020-10-16 23:27:38 +08:00
Martin Kroeker
df70667043
fix core list for sse/sse2
2020-10-16 09:55:48 +02:00
Martin Kroeker
f071d1207a
add sse2
2020-10-15 22:10:32 +02:00
Martin Kroeker
dc6cefd2f5
Expressly enable -msse for 32bit DYNAMIC_ARCH kernels
2020-10-15 20:16:15 +02:00
Martin Kroeker
c339c40c01
Silence a redefinition warning
2020-10-15 19:08:12 +02:00
Martin Kroeker
10379fc83b
Use ifdef instead of if
2020-10-15 19:05:37 +02:00
Martin Kroeker
4c25910da0
Merge pull request #2896 from martin-frbg/intrin-double
...
Add compiler flag for SSE4 where available
2020-10-15 11:12:35 +02:00
damonyu
ef8e7d0279
Add the support for RISC-V Vector.
...
Change-Id: Iae7800a32f5af3903c330882cdf6f292d885f266
2020-10-15 16:09:02 +08:00
Martin Kroeker
ae6ac83991
Revert "add double precision SSE"
2020-10-15 08:37:02 +02:00
Qiyu8
4fac91ef37
adapt arm platform
2020-10-15 11:08:10 +08:00
Qiyu8
bfdf4b56da
Add double precision universal intrinsics for X86/ARM
2020-10-15 10:29:42 +08:00
Martin Kroeker
ebf0470fc2
add sse4.1 for DYNAMIC_ARCH kernels
2020-10-14 20:34:33 +02:00
Martin Kroeker
c9c3ae07af
Add double precision operations
2020-10-14 18:10:45 +02:00
Martin Kroeker
756802df61
Merge pull request #2890 from martin-frbg/s-d-sum
...
Revert special handling of Windows xNRM2 and enable C+intrinsics kern…
2020-10-14 09:02:03 +02:00
Rajalakshmi Srinivasaraghavan
0826d68f93
POWER10: Change the packing format for bfloat16
...
As the new MMA instructions need the inputs in 4x2 order for bfloat16,
changing the format in copy/packing code. This avoids permute instructions
in the gemm kernel inner loop.
2020-10-13 16:05:10 -05:00
Rajalakshmi Srinivasaraghavan
b5d30b390d
Fix build issues with bfloat16
...
This patch fixes compilation errors due to recent renaming from SH to SB
with BUILD_BFLOAT16.
2020-10-13 11:00:22 -05:00
Martin Kroeker
fecedc9c69
Add -mssse3
2020-10-13 11:55:41 +02:00
Martin Kroeker
0eacbca85f
Add Haswell and Zen to temporary sse3 whitelist
2020-10-13 11:42:39 +02:00
Martin Kroeker
6999086a2b
whitelist SANDYBRIDGE for SSE3
2020-10-13 10:32:19 +02:00
Martin Kroeker
8d2df7d066
Revert special handling of Windows xNRM2 and enable C+intrinsics kernel for SSUM/DSUM
2020-10-13 00:14:29 +02:00
Martin Kroeker
08929430cd
Merge pull request #2886 from martin-frbg/issue_2767
...
Rename "HALF" precision functions (sh prefix) to "BFLOAT16" with "sb" prefix
2020-10-13 00:04:35 +02:00
Martin Kroeker
0c84ffe05f
Merge pull request #2881 from mattip/fninit
...
add fninit to reset fpu registers before assembler routines
2020-10-12 23:50:41 +02:00
Matti Picus
403eb513a0
use emms instead, add WIN guards
2020-10-12 18:15:01 +03:00
Qiyu8
0ed1f07660
Optimize the performance of sum by using universal intrinsics
2020-10-12 19:48:53 +08:00
Martin Kroeker
3aecafad80
Change "HALF" and "sh" to "BFLOAT16" and "sb"
2020-10-12 00:00:55 +02:00
Martin Kroeker
756062afa5
Rename "HALF" and "sh" to "BFLOAT16" and "sb"
2020-10-11 23:56:17 +02:00
Martin Kroeker
2061f7fdff
Rename "HALF" and "sh" to "BFLOAT16" and "sb"
2020-10-11 23:54:53 +02:00
Martin Kroeker
dc8a1afa63
Rename "HALF" and "sh" to "BFLOAT16" and "sb"
2020-10-11 23:53:50 +02:00
Martin Kroeker
fd94236042
Rename "HALF" and "sh" to "BFLOAT16" and "sb"
2020-10-11 23:42:07 +02:00
Martin Kroeker
68ce719fac
Rename shdot_microk_cooperlake.c to sbdot_microk_cooperlake.c
2020-10-11 23:41:13 +02:00
Martin Kroeker
d7dd9b396c
Rename shdot.c to sbdot.c
2020-10-11 23:40:43 +02:00
Martin Kroeker
9ae80490e0
rename "HALF" and "sh" to "BFLOAT16" and "sb"
2020-10-11 23:39:42 +02:00
Martin Kroeker
d314d1f49f
Rename shgemm_kernel_power10.c to sbgemm_kernel_power10.c
2020-10-11 23:37:38 +02:00
Martin Kroeker
c589c3e2a1
Merge pull request #2882 from martin-frbg/issue2709
...
Use generic C for (D/Z)NRM2 on Windows x86_64
2020-10-11 22:22:30 +02:00
Martin Kroeker
ec638a82bf
Merge pull request #2852 from martin-frbg/issue2588-cmake
...
Support building only a subset of variable types
2020-10-11 22:21:33 +02:00
Martin Kroeker
6b6adf8a4a
Allow compiling only a subset of kernels for specific variable types
2020-10-11 14:52:09 +02:00
Martin Kroeker
ac653c94f3
Merge branch 'develop' into issue2588-cmake
2020-10-11 13:57:07 +02:00
Martin Kroeker
7a53128481
Add whitelist of DYNAMIC_ARCH kernels for which -msse3 needs to be enabled
2020-10-11 01:06:46 +02:00
Martin Kroeker
e1b7123bbe
Merge pull request #2867 from Qiyu8/usimd-floatdot
...
Optimize the performance of dot by using universal intrinsics in X86/ARM
2020-10-10 12:10:25 +02:00
Qiyu8
f32d34a015
add sse3 compiler flag
2020-10-10 10:36:15 +08:00
Martin Kroeker
7812486091
Use generic C for D/Z nrm2 kernels on Windows to work around fpu exception bug
2020-10-06 21:33:16 +02:00
Matti Picus
a5b164946c
add fninit to reset fpu registers before assembler routines
2020-10-05 22:13:25 +03:00
User User-User
d2333e7842
aarch64 fix std=c18 compilation
2020-10-03 18:00:34 +03:00
Qiyu8
60e6c68e38
Adapt ARM architect
2020-09-29 16:36:14 +08:00
Qiyu8
1b1a757f5f
Optimize the performance of dot by using universal intrinsics in X86/ARM
2020-09-28 20:36:53 +08:00
Rajalakshmi Srinivasaraghavan
2df4235e00
Optimize dcopy/zcopy for POWER10
...
This patch makes use of new POWER10 vector pair instructions for
loads and stores. Tested in simulator and no new failures.
2020-09-27 21:42:32 -05:00
Martin Kroeker
dfbc62ef7e
Support building only a subset of types
2020-09-22 23:25:59 +02:00
Qiyu8
14f7dad3b7
performance improved
2020-09-22 16:52:15 +08:00
Qiyu8
325b539c26
Optimize the performance of daxpy by using universal intrinsics
2020-09-22 10:38:35 +08:00
Marius Hillenbrand
22aa81f3e5
s390x: fix cscal and zscal implementations
...
The implementation of complex scalar * vector multiplication for Z14
makes some LAPACK tests fail because the numerical differences to the
reference implementation exceed the threshold (as can be seen by running
make lapack-test and replacing kernel/zarch/cscal.c with a generic
implementation for comparison).
The complex multiplication uses terms of the form a * b + c * d for both
real and imaginary parts. The assembly code (and compiler-emitted code
as well) uses fused multiply add operations for the second product and
sum. The results can be "surprising", for example when both terms in the
imaginary part nearly cancel each other out. In that case, the second
product contributes more digits to the sum than the first product that
has been rounded before.
One option is to use separate multiplications (which then round the same
way) and a distinct add. Change the code to pursue that path, by (1)
requesting the compiler not to contract the operations into FMAs and (2)
replacing the assembly kernel with corresponding vectorized C code
(where change 1 also applies).
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-21 13:10:05 +02:00
Marius Hillenbrand
f91057cbad
s390x: move common vector definitions and utils into header
...
... to facilitate reuse beyond gemm_vec.c and avoid code duplication.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-21 11:32:08 +02:00
Rajalakshmi Srinivasaraghavan
be43d2cb96
Optimize daxpy/zaxpy for POWER10
...
This patch makes use of new POWER10 vector pair instructions for
loads and stores. Tested in simulator and no new failures.
2020-09-17 12:56:28 -05:00
Martin Kroeker
91c84e1c01
Merge pull request #2796 from Guobing-Chen/BF16_dot_coversion_apis
...
Add bfloat16 based dot and conversion with single/double
2020-09-14 15:00:19 +02:00
Martin Kroeker
e72430fe46
Merge pull request #2803 from xiegengxin/AVX2-asum
...
Implementaion of dasum, sasum with AVX2 & AVX512 intrinsic
2020-09-06 18:32:15 +02:00
Chen, Guobing
deaeb6c5b8
Add bfloat16 based dot and conversion with single/double
...
1. Added bfloat16 based dot as new API: shdot
2. Implemented generic kernel and cooperlake-specific (AVX512-BF16) kernel for shdot
3. Added 4 conversion APIs for bfloat16 data type <=> single/double: shstobf16 shdtobf16 sbf16tos dbf16tod
shstobf16 -- convert single float array to bfloat16 array
shdtobf16 -- convert double float array to bfloat16 array
sbf16tos -- convert bfloat16 array to single float array
dbf16tod -- convert bfloat16 array to double float array
4. Implemented generic kernels for all 4 conversion APIs, and cooperlake-specific kernel for shstobf16 and shdtobf16
5. Update level1 thread facilitate functions and macros to support multi-threading for these new APIs
6. Fix Cooperlake platform detection/specify issue when under dynamic-arch building
7. Change the typedef of bfloat16 from unsigned short to more strict uint16_t
Signed-off-by: Chen, Guobing <guobing.chen@intel.com>
2020-09-04 02:31:25 +08:00
Martin Kroeker
775a87242d
Rename KERNEL.SILICON to KERNEL.VORTEX
2020-09-03 08:44:20 +02:00
Gengxin Xie
1b0f17eeed
align to 64, using SSE when input size is small
2020-09-03 14:25:54 +08:00
Martin Kroeker
80794fe8fd
Create KERNEL.SILICON
2020-09-02 22:56:58 +02:00
Marius Hillenbrand
2ee5b899ce
s390x: enable S/DGEMM block with explicit loop unrolling + interleaving with clang
...
The code for SGEMM 16x4 and DGEMM 8x4 blocks on z14 and z15 uses
explicit unrolling and interleaving to improve performance. The code
employs an empty inline asm statement with operands that constrain the
compiler's instruction scheduling and thereby enforce proper overlapping
of load and compute phases. Fix an ifdef to apply that for clang builds,
as well.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-02 13:49:31 +02:00
Marius Hillenbrand
87e5bbd887
s390x: avoid variable-length arrays in struct for asm operands
...
... since it is not required and clang does not support that gcc
extension. Instead, use a variable-length array directly for these
operands.
Note that, while the actual inline assembly code does not directly use
these memory operands, they serve to inform the compiler that it cannot
reorder reads or writes to/from the input and output data across the
inline asm statements.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-02 13:49:31 +02:00
Marius Hillenbrand
b9b3265ec8
s390x: avoid inline assembly for vector loads for clang
...
... since clang does not support the instruction format for inline
assembly and also it is not required for current versions of clang.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-02 13:49:30 +02:00
Marius Hillenbrand
a1616a0b86
s390x: replace nop with "nop 0" in inline assembly
...
... as a bandaid for building with clang until LLVM's internal assembler
supports nops without operand.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-02 13:49:30 +02:00
Marius Hillenbrand
60ef193258
s390x: use "lghi" for immediate values to fix build with clang
...
Some of the kernels written in assembly utilize a "load address"
instruction for loading an immediate value into a register. That is
both unnecessarily complex and LLVM's assembler does not understand that
specific syntax. Thus, replace with the appropriate "load immediate"
instruction, which is also clearer to read.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-02 13:49:30 +02:00
Gengxin Xie
448152cdd8
define __AVX2__ to ensure the haswell code compiled with avx2
2020-08-31 14:39:08 +08:00
Gengxin Xie
cb3c190a3a
Implementaion of dasum, sasum with AVX2 & AVX512 intrinsic
2020-08-31 11:44:08 +08:00
Rajalakshmi Srinivasaraghavan
317ff27cda
POWER10: Avoid setting accumulators to zero in gemm kernels
...
For the first iteration, it is better to use xvf*ger instead of xvf*gerpp
builtins which helps to avoid setting accumulators to zero. This helps
to reduce few instructions.
2020-08-28 10:42:54 -05:00
Martin Kroeker
b2053239fc
Fix mssing dummy parameter (imag part of alpha) of zdot_thread_function
2020-08-23 15:08:16 +02:00
Martin Kroeker
9ee21a0a39
Merge pull request #2780 from Guobing-Chen/CPL_build_support
...
Enable COOPERLAKE build target
2020-08-20 19:54:29 +02:00
Martin Kroeker
6f4dc7445d
Fix typo
2020-08-19 16:36:55 +02:00
Martin Kroeker
81fbe8d088
-march=cooperlake only available in gcc >= 10
2020-08-19 16:10:15 +02:00
Martin Kroeker
75eeb265d7
[WIP] Refactor the driver code for direct SGEMM ( #2782 )
...
Move "direct SGEMM" functionality out of the SkylakeX SGEMM kernel and make it available
(on x86_64 targets only for now) in DYNAMIC_ARCH builds
* Add sgemm_direct targets in the kernel Makefile.L3 and CMakeLists.txt
* Add direct_sgemm functions to the gotoblas struct in common_param.h
* Move sgemm_direct_performant helper to separate file
* Update gemm.c to macros for sgemm_direct to support dynamic_arch naming via common_s,h
* (Conditionally) add sgemm_direct functions in setparam-ref.c
2020-08-19 14:51:09 +02:00
Chen, Guobing
e740c4873d
Enable COOPERLAKE build target
...
Enable new build target platform -- COOPERLAKE. This target platform
supports all the SKYLAKEX supported ISAs + avx512bf16. So all the
SKYLAKEX specific kernels/drivers and related code are now extended
to be also active on COOPERLAKE. Besides, new BF16 related kernels
are active under this target.
2020-08-13 06:18:00 +08:00
Martin Kroeker
cbbe38bb88
Merge pull request #2772 from mhillenibm/s390x_gemm_tuning
...
s390x: GEMM tuning for z14
2020-08-11 18:14:09 +02:00
Marius Hillenbrand
07c334e7be
s390x: Factor out small block sizes for SGEMM/DGEMM on z14
...
For small register blockings that are too small to fill up vector
registers with column vectors, we currently use a generic code block.
Replace that with instantiations of the generic code as individual
functions, so that the compiler can optimize each one separately.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-08-11 12:56:39 +02:00
Marius Hillenbrand
e2828e30aa
s390x: Optimize SGEMM/DGEMM blocks for z14 with explicit loop unrolling/interleaving
...
Improve performance of SGEMM and DGEMM on z14 and z15 by unrolling and
interleaving the inner loop of the SGEMM 16x4 and DGEMM 8x4 blocks.
Specifically, we explicitly interleave vector register loads and
computation of two iterations.
Note that this change only adds one C function, since SGEMM 16x4 and
DGEMM 8x4 actually map to the same C code: they both hold intermediate
results in a 4x4 grid of vector registers, and the C implementation is
built around that.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-08-11 12:55:42 +02:00
Rajalakshmi Srinivasaraghavan
475b5c95b9
Remove extra symbol in Makefile
...
While trying out different unroll values, noted that
make failed due to this extra symbol.
2020-08-07 15:27:44 -05:00
Martin Kroeker
81dcfdcf39
Multiply by 2 instead of left-shifting a potentially negative number
...
fixes GCC ubsan warning in the BLAS tests
2020-08-02 18:29:56 +02:00
Martin Kroeker
0ef4b3f1f2
Multiply instead of doing a left shift of a potentially negative number
...
fixes GCC ubsan report in the BLAS tests
2020-08-02 18:27:40 +02:00
Martin Kroeker
aa53a8a5cb
Multiply by two instead of left-shifting one place
...
fixes GCC ubsan report of "left shift of negative value -2" in the BLAS tests
2020-08-02 18:25:09 +02:00
Martin Kroeker
aa3a1e7d8c
Multiply by two rather than left shift by one place
...
fixes GCC ubsan report of "left shift of negative value -2" in the BLAS tests
2020-08-02 18:22:31 +02:00
Rajalakshmi Srinivasaraghavan
f77b6a83f4
dgemv optimization for POWER10
...
Making use of new vector pair POWER10 instructions in dgemv_n and dgemv_t.
Also adding a new block 4x128 to make use of Matrix-Multiply Assist (MMA)
feature introduced in POWER ISA v3.1. Tested on simulator and there
are no new test failures.
2020-07-29 18:59:32 -05:00
Rajalakshmi Srinivasaraghavan
d557584b71
Fix compilation issues with clang on POWER
...
As gcc defaults to -malign-power, removing that option. Also
adding -fno-integrated-as to use GNU assembler for powerpc
assembly optimization files. Fixed other compilation errors
reported in dgemv_t.c file.
2020-07-27 14:11:07 -05:00
Ashwin Sekhar T K
4e1be0e481
ARM64: Add THUNDERX3T110 Target
2020-07-26 23:32:24 -07:00
Rajalakshmi Srinivasaraghavan
9be2688c78
Fix to store results in correct order for POWER10 GEMM kernels
...
There is a recent compiler change in __builtin_mma_disassemble_acc() which
affects the order of storing result in POWER10. Also removing new LDFLAG
-mno-power10-stub as it is handled by linker automatically.
2020-07-24 23:08:11 -05:00
Martin Kroeker
6a2a60038c
Merge pull request #2720 from martin-frbg/issue2694
...
WIP Further fixes for 32bit POWER8
2020-07-24 23:19:45 +02:00
Martin Kroeker
251a09ec90
Typo fix
2020-07-24 16:04:58 +00:00
Martin Kroeker
95d37e1575
Regroup the 32 and 64bit sections and restore 64bit CAXPY
2020-07-24 10:13:46 +00:00
Martin Kroeker
3523bb778e
Merge pull request #2721 from martin-frbg/p8align
...
Fix alignment errors in the power8 saxpy kernel
2020-07-24 11:06:20 +02:00
Martin Kroeker
bf1f0734ff
Use OPENBLAS_MAKE_COMPLEX_FLOAT on PPC only
2020-07-23 20:40:13 +00:00
Martin Kroeker
ca3561cab9
Add ifdefs around call to altivec microkernel
2020-07-23 18:30:42 +00:00
Martin Kroeker
21072e502a
Typo fix
2020-07-23 17:34:56 +00:00
Martin Kroeker
7c6e56b5df
Rewrite assignment to complex for better portability
2020-07-23 17:10:59 +02:00
Martin Kroeker
661c6bfa5a
Exclude altivec code paths if the compiler does not support them
2020-07-23 17:08:20 +02:00
Martin Kroeker
0033f8be0d
Use vec_vsx_ld/st to fix misaligned accesses flagged by asan
2020-07-16 23:32:54 +02:00
Martin Kroeker
f308e741b2
remove debug output and revert changes to cdot and crot
2020-07-15 10:00:07 +02:00
Martin Kroeker
da17abec87
fix trailing whitespace
2020-07-14 18:20:03 +02:00
Martin Kroeker
f8c2697701
Use POWER6 GEMM, TRMM and DTRSM on 32bit POWER8
2020-07-14 18:11:19 +02:00
Martin Kroeker
b144423f0f
Do not define USE_TRMM for 32bit POWER8
2020-07-14 18:10:12 +02:00
Martin Kroeker
ed7e155c35
Merge branch 'develop' into aix
2020-07-07 18:52:06 +02:00
EGuesnet
634e1305f9
Update cgemm_kernel_8x4_power8.S
2020-06-30 15:16:39 +02:00
Martin Kroeker
28d69e0097
Merge pull request #2687 from martin-frbg/utfbom
...
Strip UTF8 byte order marker from source files
2020-06-26 22:53:09 +02:00
Martin Kroeker
c2467c9619
Merge pull request #2686 from RajalakshmiSR/p10_shgemm
...
powerpc: Optimized SHGEMM kernel for POWER10
2020-06-26 22:52:45 +02:00
Martin Kroeker
d199c2787d
Merge pull request #2680 from kavanabhat/aix_makefile_fix
...
Fix for #2671
2020-06-26 11:27:28 +02:00
Martin Kroeker
e30ad0e521
Strip UTF8 byte order marker from source
2020-06-26 09:00:43 +02:00
Rajalakshmi Srinivasaraghavan
d23419accc
powerpc: Optimized SHGEMM kernel for POWER10
...
This patch introduces new optimized version of SHGEMM kernel
using power10 Matrix-Multiply Assist (MMA) feature introduced in
POWER ISA v3.1. This patch makes use of new POWER10 compute instructions
for matrix multiplication operation.
Tested on simulator and there are no new test failures.
2020-06-25 22:19:08 -05:00
Martin Kroeker
c854ef5471
Fix variable names in conditional
2020-06-25 13:29:52 +02:00
Martin Kroeker
c0afc11742
Fix POWERPC builds on AIX (gcc/gfortran 7)
...
1. macro preprocessing for POWER8 and later kernels only
2. default buffer size used by AIX version of m4 is too small
2020-06-25 13:12:36 +02:00
Gordon Fossum
bb2f52844b
powerpc: Optimized ZGEMM kernel for POWER10
...
This patch introduces new optimized version of ZGEMM kernel
using power10 Matrix-Multiply Assist (MMA) feature introduced in
POWER ISA v3.1. This patch makes use of new POWER10 compute instructions
for matrix multiplication operation.
Tested on simulator and there are no new test failures.
Cycles count reduced by 30-50% compared to POWER9 version depending on
M/N/K sizes.
2020-06-24 14:50:12 -05:00
Rajalakshmi Srinivasaraghavan
571eadb880
powerpc: Optimized SGEMM/DGEMM/CGEMM for POWER10
...
This patch introduces new optimized version of SGEMM, CGEMM and DGEMM
using power10 Matrix-Multiply Assist (MMA) feature introduced in
POWER ISA v3.1. This patch makes use of new POWER10 compute instructions
for matrix multiplication operation.
Tested on simulator and there are no new test failures.
Cycles count reduced by 30-50% compared to POWER9 version depending on
M/N/K sizes.
MMA GCC patch for reference:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=8ee2640bfdc62f835ec9740278f948034bc7d9f1
2020-06-24 14:48:15 -05:00
Kavana Bhat
df4ade070f
Fix for #2671
2020-06-24 04:25:47 -05:00
Martin Kroeker
93592d1260
Merge pull request #2675 from wjc404/develop
...
AVX512 DGEMM TCOPY_16 Function
2020-06-23 09:29:02 +02:00
wjc404
086d87a302
AVX512 dgemm tcopy_16 function
2020-06-20 00:07:43 +08:00
Rajalakshmi Srinivasaraghavan
9fe930f205
powerpc: Add support for future processor
...
This is the initial patch to support build infrastructure
for POWER10 architecture.
2020-06-11 15:47:20 -05:00
ZhangDanfeng
bc6fd20a40
fix INIT8x4
...
Signed-off-by: ZhangDanfeng <467688405@qq.com>
2020-06-10 01:01:16 +08:00
Martin Kroeker
89091e6b64
Merge pull request #2645 from martin-frbg/misc_fixes
...
Miscellaneous fixes
2020-06-07 19:44:50 +02:00
Martin Kroeker
c3574ffe53
Merge pull request #2646 from wjc404/develop
...
Optimize AVX512 parallel DGEMM performance
2020-06-07 13:18:22 +02:00
wjc404
0e3ac4a06b
Add files via upload
2020-06-06 14:56:57 +08:00
Martin Kroeker
7f60fb6b91
Delete spurious copy of common_param.h
2020-06-05 10:04:16 +02:00
ZhangDanfeng
9b7877ccf1
sgemm copy source init
...
Signed-off-by: ZhangDanfeng <467688405@qq.com>
2020-06-04 02:10:45 +08:00
ZhangDanfeng
f82fa802d1
Insert prefetch
...
Signed-off-by: ZhangDanfeng <467688405@qq.com>
2020-06-04 02:08:48 +08:00
Martin Kroeker
b1ee81228a
Change complex DOT and ROT to generic kernels and switch CGEMM
...
in response to test failures seen in #2628 and BLAS-Tester
2020-06-03 09:13:29 +02:00
张丹枫
9df79ae9a3
update sgemm and strmm kernel selecting strategy
2020-05-20 22:26:58 +08:00
张丹枫
a1fc6041cd
use general register to speedup
2020-05-20 22:26:58 +08:00
张丹枫
edb423d772
align general register using to strmm_kernel_8x8
2020-05-20 22:26:58 +08:00
zhangdanfeng
0e6eb8c247
sgemm kernel use sgemm_kernel_8x8_cortexa53
...
Signed-off-by: zhangdanfeng <zhangdanfeng@cloudwalk.cn>
2020-05-20 22:26:58 +08:00
zhangdanfeng
d475db29c6
optimized for cortex-a53
...
Signed-off-by: zhangdanfeng <zhangdanfeng@cloudwalk.cn>
2020-05-20 22:26:58 +08:00
Marius Hillenbrand
89fe17f20e
s390x: Use new sgemm kernel also for DGEMM and DTRMM on Z14
...
Apply our new GEMM kernel implementation, written in C with vector intrinsics,
also for DGEMM and DTRMM on Z14 and newer (i.e., architectures with FP32 SIMD
instructions). As a result, we gain around 10% in performance on z15, in
addition to improving maintainability.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-20 10:23:35 +02:00
Marius Hillenbrand
bdd795ed03
s390x/GEMM: replace 0-init with peeled first iteration
...
... since it gains another ~2% of SGEMM and DGEMM performance on z15;
also, the code just called for that cleanup.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-20 10:23:35 +02:00
Marius Hillenbrand
2840432e49
s390x: improvise vector alignment hints for older compilers
...
Introduce inline assembly so that we can employ vector loads with
alignment hints on older compilers (pre gcc-9), since these are still
used in distributions such as RHEL 8 and Ubuntu 18.04 LTS.
Informing the hardware about alignment can speed up vector loads. For
that purpose, we can encode hints about 8-byte or 16-byte alignment of
the memory operand into the opcodes. gcc-9 and newer automatically emit
such hints, where applicable. Add a bit of inline assembly that achieves
the same for older compilers. Since an older binutils may not know about
the additional operand for the hints, we explicitly encode the opcode in
hex.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-14 15:36:03 +02:00
Marius Hillenbrand
1b0b4349a1
s390x/Z14: Change register blocking for SGEMM to 16x4
...
Change register blocking for SGEMM (and STRMM) on z14 from 8x4 to 16x4
by adjusting SGEMM_DEFAULT_UNROLL_M and choosing the appropriate copy
implementations. Actually make KERNEL.Z14 more flexible, so that the
change in param.h suffices. As a result, performance for SGEMM improves
by around 30% on z15.
On z14, FP SIMD instructions can operate on float-sized scalars in
vector registers, while z13 could do that for double-sized scalars only.
Thus, we can double the amount of elements of C that are held in
registers in an SGEMM kernel.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-12 15:59:51 +02:00
Marius Hillenbrand
71b6eaf459
s390x: Use new sgemm kernel also for strmm on Z14 and newer
...
Employ the newly added GEMM kernel also for STRMM on Z14. The
implementation in C with vector intrinsics exploits FP32 SIMD operations
and thereby gains performance over the existing assembly code. Extend
the implementation for handling triangular matrix multiplication,
accordingly. As added benefit, the more flexible C code enables us to
adjust register blocking in the subsequent commit.
Tested via make -C test / ctest / utest and by a couple of additional
unit tests that exercise blocking.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-12 15:59:51 +02:00
Marius Hillenbrand
43c0d4f312
s390x: Add vectorized sgemm kernel for Z14 and newer
...
Add a new GEMM kernel implementation to exploit the FP32 SIMD
operations introduced with z14 and employ it for SGEMM on z14 and newer
architectures.
The SIMD extensions introduced with z13 support operations on
double-sized scalars in vector registers. Thus, the existing SGEMM code
would extend floats to doubles before operating on them. z14 extended
SIMD support to operations on 32-bit floats. By employing these
instructions, we can operate on twice the number of scalars per
instruction (four floats in each vector registers) and avoid the
conversion operations.
The code is written in C with explicit vectorization. In experiments,
this kernel improves performance on z14 and z15 by around 2x over the
current implementation in assembly. The flexibilty of the C code paves
the way for adjustments in subsequent commits.
Tested via make -C test / ctest / utest and by a couple of additional
unit tests that exercise blocking (e.g., partial register blocks with
fewer than UNROLL_M rows and/or fewer than UNROLL_N columns).
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-12 15:59:51 +02:00
Martin Kroeker
2271c3506b
Work around excessive LAPACK test failures on Skylake-X
...
Something in the plain C parts of x86_64 cscal.c and zscal.c appears to be miscompiled by both gfortran9 and ifort when compiling for skylakex-avx512, even when the optimized Haswell microkernel is not in use.
2020-05-09 23:49:18 +02:00
Rajalakshmi Srinivasaraghavan
bd9ff820bc
Fix cmake compilation issue - POWER9
...
This patch removes extra space in the sgemmotcopy filename
thereby allowing it to create entry in kernel/Makefile
created by cmake.
2020-05-08 20:31:56 -05:00
Ashwin Sekhar T K
8353cb245a
ARM64: Improve DAXPY for ThunderX2
...
Improve performance of DAXPY for ThunderX2
when the vector fits in L1 Cache.
2020-05-07 09:22:50 -07:00
Martin Kroeker
90dba9f716
Duplicate earlier Clang 9.0.0 workaround for corresponding Apple Clang version
...
As discussed on the original PR #2329 , the "Apple Clang 11.0.3" that appears to be based the same LLVM release produces the same miscompilation of this file.
2020-05-05 10:44:50 +02:00
Martin Kroeker
5dd14e3d48
Make building the bfloat16 functions conditional on option BUILD_HALF ( #2590 )
...
* make building the bfloat16 BLAS functions conditional on BUILD_HALF
* pass the BUILD_HALF option to gensymbol
* Pass BUILD_HALF as a compiler define for dynamic_arch builds
2020-05-01 09:58:30 +02:00
Martin Kroeker
06208c8d01
Limit this fix to ELFv2 builds
2020-04-22 14:16:40 +02:00
Martin Kroeker
f5c4c28b98
Work around POWER8BE bugs on FreeBSD (ELFv2)
...
for #2299
2020-04-21 17:17:17 +02:00
Martin Kroeker
fa42588e1f
Merge pull request #2565 from martin-frbg/mips24k
...
Support MIPS32 24K family as P5600
2020-04-20 17:13:53 +02:00
Martin Kroeker
e55ec82bb9
Delete KERNEL.1004K
2020-04-19 15:44:30 +02:00
Martin Kroeker
7353ea5afc
Delete KERNEL.24K
2020-04-19 15:44:19 +02:00
Martin Kroeker
6a04efb122
Rename KERNEL files to include MIPS prefix
2020-04-19 15:43:54 +02:00
Martin Kroeker
d712ea724c
Add MIPS24K support
2020-04-18 21:10:18 +02:00
Rajalakshmi Srinivasaraghavan
22bb50fb81
cmake fixes
2020-04-17 13:35:17 -05:00
Rajalakshmi Srinivasaraghavan
67cc4b9e16
Fix warnings in clang and export symbol
2020-04-15 19:15:23 -05:00
Rajalakshmi Srinivasaraghavan
a87793e03c
Fix DYNAMIC_ARCH compilation errors
2020-04-15 09:09:50 -05:00
Rajalakshmi Srinivasaraghavan
ff010f496e
Build shgemm for all architecture
2020-04-14 20:38:53 -05:00
Rajalakshmi Srinivasaraghavan
7eb55504b1
RFC : Add half precision gemm for bfloat16 in OpenBLAS
...
This patch adds support for bfloat16 data type matrix multiplication kernel.
For architectures that don't support bfloat16, it is defined as unsigned short
(2 bytes). Default unroll sizes can be changed as per architecture as done for
SGEMM and for now 8 and 4 are used for M and N. Size of ncopy/tcopy can be
changed as per architecture requirement and for now, size 2 is used.
Added shgemm in kernel/power/KERNEL.POWER9 and tested in powerpc64le and
powerpc64. For reference, added a small test compare_sgemm_shgemm.c to compare
sgemm and shgemm output.
This patch does not cover OpenBLAS test, benchmark and lapack tests for shgemm.
Complex type implementation can be discussed and added once this is approved.
2020-04-14 14:55:08 -05:00
Martin Kroeker
5b0093b5fe
Convert aligned moves to unaligned
...
should have no performance impact on reasonably modern cpus and fixes occasional crashes in actual user code.
2020-04-13 14:58:52 +02:00
Martin Kroeker
e9bfa2291a
Fix parameter overflow
2020-04-12 19:47:02 +02:00
gxw
8d07cf9b67
Fix compilation problem on loongson platform
...
Using "make TARGET=GENERIC" on loongson platform will get the following
error messages:
"make[1]: *** No rule to make target 'sgemm_incopy.o', needed by 'libs'"
Add kernel/mips64/KERNEL.generic to slove the problem.
2020-04-09 19:28:15 +08:00
Martin Kroeker
806f89166e
Make ARMV7 compile with xcode and add a CI job for it ( #2537 )
...
* Add an ARMV7 iOS build on Travis
* thread_local appears to be unavailable on ARMV7 iOS
* Add no-thumb option for ARMV7 IOS build to get it to accept DMB ISH
* Make local labels in macros of nrm2_vfpv3.S compatible with the xcode assembler
2020-04-02 10:30:37 +02:00
Martin Kroeker
c6af9bbb32
Merge pull request #2534 from martin-frbg/issue2496
...
Fix zero initialization for beta=0 case
2020-03-31 20:53:13 +02:00
Martin Kroeker
144be81ca1
fix initialization to zero in the NEON SGEMM_BETA kernel as well
2020-03-31 16:53:56 +02:00
Martin Kroeker
07cdd5d05c
Fix zero initialization for beta=0 case
...
use immediate initialization instead of multiplication in case register content is a NaN
2020-03-31 00:21:02 +02:00
Martin Kroeker
567d2760e6
Merge pull request #2520 from wjc404/develop
...
Fix avx512 sgemm performance bug when ldc is a multiple of 1024
2020-03-30 20:15:59 +02:00
wjc404
b8307768e2
Add files via upload
2020-03-21 05:42:10 +08:00
Martin Kroeker
af8a619e1f
Merge pull request #2517 from wjc404/develop
...
Temporary fix for SKX STRSM
2020-03-17 10:12:53 +01:00
wjc404
62b9608986
Update KERNEL.SKYLAKEX
2020-03-17 12:52:55 +08:00
Martin Kroeker
a1b181cea2
Merge pull request #2516 from wjc404/develop
...
AVX2 STRSM kernels
2020-03-16 21:58:34 +01:00
wjc404
cdc0e9011e
Update KERNEL.ZEN
2020-03-16 16:39:37 +00:00
wjc404
fa049d49c2
AVX2 STRSM kernel
2020-03-17 00:34:08 +08:00
s00548429
bec7923a0d
Fix the functional bugs for zamax.
2020-03-09 15:36:50 +08:00
Rajalakshmi Srinivasaraghavan
2afc074803
Fix DYNAMIC_ARCH build for POWER9
...
Setting DYNAMIC_ARCH=1 on POWER9 does not build POWER9 files due to some
compiler version checks. This patch fixes some of the macros that are used
to check compiler version. On fixing those checks, there are some new make
failures related to icamin, icamax, isamin, isamax and caxpy files on POWER9.
This patch fixes those failures as well.
2020-03-03 12:35:10 -06:00
Martin Kroeker
4f371b0fbf
Use POWER8 kernels on big-endian POWER9 for now
2020-03-01 23:45:58 +01:00
Martin Kroeker
ea8eec5d17
Merge pull request #2422 from wjc404/develop
...
Adjust SkylakeX GEMM3M parameters, add an AVX512 STRMM kernel and fix performance bugs in AVX2 s/c/z GEMM
2020-02-29 19:07:35 +01:00
Ali Saidi
c623a965f9
Add Neoverse-N1 core
...
The implementation is a hybird of the ARMV8 one with some of the
improved TX2 rountines along with specifying -march=v8.2-a
2020-02-29 03:22:04 +00:00
wjc404
dd22eb7621
Update cgemm_kernel_8x2_haswell.c
2020-02-27 22:26:15 +08:00
wjc404
2352331e60
Update zgemm_kernel_4x2_haswell.c
2020-02-27 22:25:19 +08:00
Xianyi Zhang
265ab484c8
Change default RISC-V 64-bit corename to RISCV64_GENERIC
...
e.g. make CC=riscv64-unknown-linux-gnu-gcc FC=riscv64-unknown-linux-gnu-gfortran TARGET=RISCV64_GENERIC HOSTCC=gcc
2020-02-27 14:46:15 +08:00
Xianyi Zhang
44020a42a4
Fixed compile bug for RV64.
2020-02-27 14:29:42 +08:00
Xianyi Zhang
4aa2d89217
Merge branch 'develop' into risc-v
2020-02-27 13:53:49 +08:00
wjc404
1b980001dd
Update zgemm_kernel_4x2_haswell.c
2020-02-26 18:38:12 +08:00
wjc404
2515e1152f
Update cgemm_kernel_8x2_haswell.c
2020-02-26 18:36:54 +08:00
Martin Kroeker
ddcbed6690
Merge pull request #2437 from martin-frbg/issue2434
...
[WIP] Add support for Ampere EMAG8180 ARMV8 cpu
2020-02-25 18:42:52 +01:00
wjc404
903854c168
Add files via upload
2020-02-22 23:40:02 +08:00
wjc404
a2ff577a30
Update KERNEL.ZEN
2020-02-22 23:39:43 +08:00
wjc404
97a32cb0a5
Update KERNEL.HASWELL
2020-02-22 23:39:20 +08:00
Martin Kroeker
07454bf4d5
Add proper defaults for IxMIN/IxMAX kernels
...
the fallbacks from Makefile.L1 assume a combined source for absolute value and non-absolute (with ifdef USE_ABS) but here we have separate implementations
2020-02-21 11:58:15 +01:00
Martin Kroeker
4046985913
Add proper defaults for IxMIN/IxMAX kernels
...
the fallbacks from Makefile.L1 assume a combined source for absolute value and non-absolute (with ifdef USE_ABS) but here we have separate implementations
2020-02-21 11:55:52 +01:00
Martin Kroeker
e57b11acca
Add preliminary support for EMAG8180
2020-02-19 19:00:28 +01:00
Martin Kroeker
0b39cf95b0
Fix endianness conditionals
2020-02-19 18:09:54 +01:00
Martin Kroeker
9f39f0a2c3
Specify ismin/ismax assembly kernels for POWER8 directly
...
to fix utest failure in new ismin test - Makefile.L1 defaults look wrong
2020-02-17 19:55:39 +01:00
Martin Liska
aeea14ee40
Come up with LOAD_AND_COMPARE_TO_MXX macro in iamax_sse.S.
2020-02-17 09:01:53 +01:00
Martin Liska
18bcc36a69
Fix implementation of iamax_sse.S as reported in #2116 .
...
The was a typo in iamax_sse.S where one of the comparison
was cmpeqps instead of cmpeqss. That misdetected index
for sequences where the minimum value was 0.
2020-02-17 09:01:53 +01:00
Martin Liska
0e7f43c898
Add missing USE_MIN in kernel/CMakeLists.txt.
2020-02-17 09:01:53 +01:00
wjc404
f566787e6e
Update KERNEL.SKYLAKEX
2020-02-16 22:58:44 +08:00
wjc404
e3368cbf18
AVX512 STRMM kernel
2020-02-16 22:58:00 +08:00
Martin Kroeker
cafdd999b8
Update caxpy_power8.S
2020-02-13 22:44:09 +01:00
Martin Kroeker
92ca92a46c
Update caxpy_power8.S
2020-02-13 21:24:54 +01:00
Martin Kroeker
486c35c5dc
Update icamin_power8.S
2020-02-13 18:38:43 +01:00
Martin Kroeker
5ba3699f41
Update isamin_power8.S
2020-02-13 00:00:32 +01:00
Martin Kroeker
8eefa530cd
Update isamax_power8.S
2020-02-12 23:59:50 +01:00
Martin Kroeker
de40d47edf
Update isamin_power8.S
2020-02-12 23:57:48 +01:00
Martin Kroeker
7c162b8a21
Update isamax_power8.S
2020-02-12 23:56:57 +01:00
Martin Kroeker
0544cbc806
Fix syntax of endianness conditional
2020-02-12 20:00:29 +01:00
Martin Kroeker
120d20731f
Fix syntax of endianness conditional
2020-02-12 19:58:42 +01:00
Martin Kroeker
dc345d84df
Fix syntax of endianness conditional and add gcc version check for workaround
2020-02-12 19:56:52 +01:00
Bart Oldeman
7ea5e07d1c
Fix inline asm in dscal: mark x, x1 as clobbered. Fixes #2408
...
The leaq instructions in dscal_kernel_inc_8 modify x and x1 so they
must be declared as input/output constraints, otherwise the compiler
may assume the corresponding registers are not modified.
2020-02-12 14:11:44 +00:00
Martin Kroeker
7e5cbb6f35
Fix bad conditional syntax that caused spurious application of USE_TRMM
2020-02-10 21:17:39 +01:00
wjc404
3447d04eaf
Update dgemm_kernel_16x2_skylakex.c
2020-02-06 02:14:10 +00:00
wjc404
8b5cdcc64c
Update sgemm_kernel_8x4_haswell.c
2020-02-06 01:47:46 +00:00
wjc404
4e00d96a78
Update dgemm_kernel_16x2_skylakex.c
2020-02-06 01:46:36 +00:00
wjc404
096da2f51a
Update dgemm_kernel_16x2_skylakex.c
2020-02-05 13:36:57 +08:00
wjc404
081b188529
Update KERNEL.SKYLAKEX
2020-02-03 21:38:08 +08:00
wjc404
8019e70211
AVX512 16x2 DGEMM kernel
2020-02-03 21:32:56 +08:00
Qiyu8
ff42e68652
Optimize genenal Gemm Beta
2020-01-20 11:49:42 +08:00
Martin Kroeker
70f45749b9
Merge pull request #2367 from wjc404/develop
...
Improve paralleled SGEMM performance on SKYLAKEX CPUs
2020-01-15 21:13:43 +01:00
wjc404
e5dcdeb550
Update sgemm_direct_skylakex.c
2020-01-13 16:59:23 +08:00
wjc404
952cc2ba38
Update sgemm_kernel_16x4_skylakex_2.c
2020-01-13 16:58:54 +08:00
wjc404
feaafbedd3
make skylakex sgemm code more friendly for readers
...
BTW some kernels were adjusted to improve performance
2020-01-13 16:28:41 +08:00
Martin Kroeker
b36018be6d
Merge pull request #2365 from wjc404/develop
...
Fix SKYLAKEX STRMM issues
2020-01-09 23:23:09 +01:00
wjc404
3a100b2797
Update KERNEL.SKYLAKEX
2020-01-09 13:48:41 +08:00
Martin Kroeker
38742d5547
Merge pull request #2361 from wjc404/develop
...
Optimize AVX2 SGEMM & STRMM
2020-01-08 16:20:28 +01:00
wjc404
bd4c032f52
Update sgemm_kernel_8x4_haswell.c
2020-01-07 11:22:46 +08:00
wjc404
9dc9b7b95e
Update sgemm_kernel_8x4_haswell.c
2020-01-06 20:11:36 +08:00
wjc404
92b10212de
optimize AVX2 SGEMM
2020-01-06 12:11:21 +08:00
wjc404
b73bf01378
optimize AVX2 SGEMM
2020-01-06 12:09:14 +08:00
wjc404
eb3c9f1db9
optimize AVX2 SGEMM
2020-01-06 12:07:02 +08:00
Martin Kroeker
456ee2e1f0
Merge pull request #2357 from chenxuqiang/dgemm_beta_zero
...
kernel/arm64/dgemm_beta.S: add beta == zero branch
2020-01-02 22:28:36 +01:00
shengyang
80db5f11e1
update
2020-01-02 11:01:57 +08:00
chenxuqiang
52de4cc8fd
kernel/arm64/dgemm_beta.S: add beta == zero branch
...
added beta == zero branch, and no need to load C matrix.
Signed by: Xuqiang Chen <chenxuqiang3@hisilicon.com>
2020-01-01 21:50:45 -05:00
Martin Kroeker
44028581cc
Merge pull request #2355 from Zeyiii/dev-zeyi2
...
Use arm neon instructions to optimize sgemm_beta operation
2020-01-01 22:14:16 +01:00
Martin Kroeker
86ab939936
Merge pull request #2354 from ZuoQ3/develop
...
[WIP] Use arm neon instructions to optimize tcopy operation
2020-01-01 22:13:37 +01:00
Martin Kroeker
6c85cb1869
Merge pull request #2352 from wjc404/develop
...
AVX2 ZGEMM3M kernel
2019-12-31 18:08:10 +01:00
Martin Kroeker
995768bbc5
Merge pull request #2351 from Zeyiii/develop
...
prefetching for dgemm_beta
2019-12-31 18:07:37 +01:00
int_13h
96ad579428
add in runtime cpu detection for zarch ( #2349 )
...
add in runtime cpu detection for zarch
2019-12-31 18:03:27 +01:00
shengyang
8d84403205
Use arm neon instructions to optimize ncopy operation
...
modified: KERNEL.ARMV8
modified: KERNEL.TSV110
new file: sgemm_ncopy_4.S
2019-12-31 17:06:35 +08:00
w00421467
0833a4846a
Use arm neon instructions to optimize sgemm_beta operation
2019-12-31 10:42:03 +08:00
zq
50f7fc1401
[WIP] Use arm neon instructions to optimize tcopy operation
2019-12-31 10:21:23 +08:00
w00421467
d1b53806be
Merge remote-tracking branch 'pub/develop' into develop
2019-12-31 10:13:24 +08:00
wjc404
a0f0a802fc
Update zgemm3m_kernel_4x4_haswell.c
2019-12-30 17:33:42 +08:00
wjc404
700fe5b5ee
Add files via upload
2019-12-30 17:18:59 +08:00
wjc404
f60840c420
Update KERNEL.ZEN
2019-12-30 16:04:23 +08:00
wjc404
109e18cd96
Update KERNEL.HASWELL
2019-12-30 16:03:24 +08:00
wjc404
ae1579be13
Create zgemm3m_kernel_4x4_haswell.c
2019-12-30 16:02:51 +08:00
w00421467
3ccf8885ac
prefetching for dgemm_beta
2019-12-30 11:45:49 +08:00
wjc404
cd765f094b
Update cgemm3m_kernel_8x4_haswell.c
2019-12-27 18:23:29 +08:00
wjc404
3a66c8cac1
Update KERNEL.ZEN
2019-12-27 18:04:08 +08:00
wjc404
ed9af2f7da
Update KERNEL.HASWELL
2019-12-27 18:01:38 +08:00
wjc404
5fd1edead9
Create cgemm3m_kernel_8x4_haswell.c
2019-12-27 18:00:55 +08:00
wjc404
eeecd623d8
Update cgemm_kernel_8x2_haswell.c
2019-12-24 00:40:16 +08:00
wjc404
2cd9306bb5
Update KERNEL.ZEN
2019-12-23 23:42:30 +08:00
wjc404
c418c81224
Update KERNEL.HASWELL
2019-12-23 23:41:44 +08:00
wjc404
025741f16a
Fast Haswell CGEMM kernel
2019-12-23 23:40:03 +08:00
wjc404
f41d52665d
Fast Haswell ZGEMM kernel
2019-12-21 14:37:06 +08:00
wjc404
d573d24de7
Fast Haswell ZGEMM kernel
2019-12-21 14:35:15 +08:00
w00421467
b7cc69ee62
declare DGEMM_BETA in KERNEL.ARMV8 rather than the generic KERNEL
2019-12-20 10:11:50 +08:00
w00421467
aeef942c4f
use arm neon instructions to optimize gemm beta operation
2019-12-17 10:00:13 +08:00
Martin Kroeker
1a6ea8ee6d
Merge pull request #2338 from kavanabhat/aix_mod
...
Changes to build on AIX in POWER8 mode
2019-12-09 17:54:49 +01:00
Kavana Bhat
6baa9b07d7
AIX changes for Power8
2019-12-06 04:33:32 -06:00
Kavana Bhat
3938e59569
AIX changes for Power8
2019-12-04 00:23:46 -06:00
Isuru Fernando
b863b32ac5
Workaround an ICE in clang 9.0.0
...
This bug is not there in 8.x nor in the 9.0 daily snapshot.
2019-12-01 12:59:46 -06:00
Martin Kroeker
dd04143d4a
Merge pull request #2328 from martin-frbg/ppc9
...
Fix precompiled kernels on POWER9 and make their use conditional on (old) gcc version
2019-11-30 12:23:57 +01:00
Martin Kroeker
f3a6164bff
Merge pull request #2324 from antonblanchard/power9_segv
...
Fix SEGV in cdot_power9
2019-11-30 00:03:42 +01:00
Martin Kroeker
dedd822d1a
Fix caxpy/caxpyc naming in localentry
2019-11-29 23:56:57 +01:00
Martin Kroeker
2181fb7047
Fix caxpy/caxpyc naming in localentry
2019-11-29 23:54:15 +01:00
Martin Kroeker
a9b62c03f8
Substitute precompiled gcc7 codes only when gcc is older than 9.x
2019-11-29 23:49:50 +01:00
Martin Kroeker
97762234f9
Add variable for gcc >=9 test
...
used in KERNEL.POWER9
2019-11-29 23:47:23 +01:00
wjc404
934e601e93
Update dgemm_kernel_4x8_skylakex_2.c
2019-11-28 19:56:35 +08:00
Anton Blanchard
cf2a8e410c
Fix SEGV in cdot_power9
...
We were corrupting r2 because the local entry wasn't being
setup correctly.
2019-11-26 21:55:04 -07:00
wjc404
eb1e9c8c92
some optimizations
2019-11-26 14:12:20 +08:00
Andreas Arnez
d117dfd505
Change bad usage of "asum" to "sum" in ZARCH versions of ?sum
...
The ZARCH implementations of ?sum contain a cut & paste-error: An inline
assembly argument is named "sum", but the assembly references "asum"
instead. The mismatch causes a build error. This is fixed.
2019-11-21 13:49:13 +01:00
Martin Kroeker
b09b5be0a4
Merge pull request #2315 from ewanglong/develop
...
revised fix windows compatible for #2313
2019-11-21 05:06:44 +01:00
Wang, Long
bfb5fbdb4d
revised fix windows compatible for #2313
...
Signed-off-by: Wang, Long <long1.wang@intel.com>
2019-11-21 10:22:58 +08:00
Martin Kroeker
08fa83aba2
Merge pull request #2312 from martin-frbg/power8be
...
Further Power8 big-endian corrections
2019-11-20 15:12:06 +01:00
Wang, Long
1191db1a49
For the sake of windows compatible, used "unsigned long long" to ensure 64-bit length
...
Signed-off-by: Wang, Long <long1.wang@intel.com>
2019-11-20 21:30:47 +08:00
Wang, Long
0caf1434c9
Fix the integer overflow issue for large matrix size
...
For large matrix, e.g. M=N=K, and M>1290, int mnk=M*N*K will overflow.
This will lead to wrong branching to single-threading. The performance
is downgraded significantly.
Signed-off-by: Wang, Long <long1.wang@intel.com>
2019-11-20 14:11:17 +08:00
Martin Kroeker
cad0d150db
Define alternate kernels for big-endian POWER8
2019-11-17 23:12:10 +01:00
Martin Kroeker
eba0aeb7cd
Fix compilation for big-endian POWER8
2019-11-17 22:58:32 +01:00
Martin Kroeker
0c07c356c1
Define alternate kernels for big-endian PPC440
2019-11-17 19:25:08 +01:00
Martin Kroeker
3e67017ac8
Merge pull request #2309 from martin-frbg/ppc970-be
...
Fix PPC970 big-endian support
2019-11-17 18:22:24 +01:00
Martin Kroeker
b3ac6ee222
Define alternate kernels for big-endian PPC970
...
The altivec versions of SGEMM and CGEMM fail most test in LAPACK-TESTING when compiled for big endian, STRSM/CTRSM even cause segfaults. The rot kernels either fail the corresponding utest or lead to failures in LAPACK-TESTING.
2019-11-17 15:19:39 +01:00
Martin Kroeker
71e96163db
Merge pull request #2305 from wjc404/develop
...
AVX512 CGEMM & ZGEMM kernels
2019-11-12 07:38:37 +01:00
wjc404
819e852ae7
AVX512 CGEMM & ZGEMM kernels
...
96-99% 1-thread performance of MKL2018
2019-11-11 20:04:52 +08:00
Martin Kroeker
4c6a457358
Merge pull request #2300 from wjc404/develop
...
Optimize SGEMM on SKYLAKEX CPUs
2019-11-06 07:27:33 +01:00
wjc404
836c414e22
optimizations of software prefetching
2019-11-05 13:36:56 +08:00
Martin Kroeker
3cd97f1a80
Merge pull request #2301 from martin-frbg/ppc8be
...
Disable IDAMIN/MAX and IZAMIN/MAX optimizations on big-endian POWER8
2019-11-04 22:54:28 +01:00
wjc404
430c11e135
Add files via upload
2019-11-04 20:10:12 +08:00
wjc404
fbacd2605d
optimizations via software prefetches
2019-11-04 19:37:19 +08:00
Martin Kroeker
68597002ea
The assembly microkernel is not safe to use on ELFv1
2019-11-03 22:42:46 +01:00
Martin Kroeker
d2a6285549
The assembly microkernel is not safe to use on ELFv1
2019-11-03 22:41:19 +01:00
Martin Kroeker
d999688d1a
The assembly microkernel is not safe to use on ELFv1
2019-11-03 22:39:06 +01:00
Martin Kroeker
928fe1b28e
The assembly microkernel is not safe to use on ELFv1
2019-11-03 22:37:27 +01:00
wjc404
1df9a2013d
new sgemm kernel for skylakex
2019-11-02 00:00:48 +08:00
Martin Kroeker
85ccdce8c4
Remove the IOS fallbacks to generic C kernels
2019-10-25 23:02:37 +02:00
wjc404
6ff013bae0
native support for icopy_4
...
90% MKL 1-thread performance.
2019-10-19 03:54:44 +08:00
wjc404
0d669e04bb
Update dgemm_kernel_8x8_skylakex.c
2019-10-18 15:00:17 +08:00
wjc404
17cdd9f9e1
some correction
2019-10-18 14:58:07 +08:00
wjc404
6bcb06fcb1
make further changes to icopy_8 easier
2019-10-18 10:47:31 +08:00
wjc404
b7315f8401
Add files via upload
2019-10-16 19:23:36 +08:00
wjc404
9b19e9e1b0
Update dgemm_kernel_8x8_skylakex.c
2019-10-16 10:14:51 +08:00
wjc404
6bd67ddbab
Update dgemm_kernel_8x8_skylakex.c
2019-10-16 03:20:08 +08:00
wjc404
844629af57
Add files via upload
2019-10-16 02:00:34 +08:00
Martin Kroeker
a448884a63
Remove automatic label postfixes from macro included only once
2019-10-08 08:37:50 +02:00
Martin Kroeker
3a2df19db6
Fix accidental duplication of jump instruction
2019-10-08 08:09:26 +02:00
Martin Kroeker
d2093a40d3
Merge pull request #2277 from martin-frbg/issue2275
...
Rewrite ARMV8 code to allow cross-compilation for IOS
2019-10-06 23:01:54 +02:00
Martin Kroeker
56837e9d92
Make local labels in macro compatible with the xcode assembler
...
... which does not perform the automatic numbering on instantiation that the _@ suffix signifies
2019-10-04 14:53:23 +02:00
Martin Kroeker
5e244d80f2
Merge pull request #2271 from quickwritereader/strmm_fix
...
fixed bug power9 strmm . BLAS-TESTER passes
2019-09-29 13:53:45 +02:00
AbdelRauf
ede5efebab
trmm fix
2019-09-29 02:28:34 +00:00
Martin Kroeker
596a22325a
Fix prologue of power9 assembly cdot(c) kernel to provide cdotc
2019-09-27 00:47:18 +02:00
Martin Kroeker
7f58f3ad0e
Fix mis-edits in the gcc-derived power8 caxpy kernel
2019-09-27 00:44:26 +02:00
Martin Kroeker
673e5a0495
Replace several POWER8/9 C kernels with their gcc7-generated assembly versions ( #2263 )
...
* Add gcc7-generated assembly files for POWER8/9 isa/ica-min/max and POWER9 caxpy
To work around internal compiler errors encountered when compiling the original C source with gcc 4 and 5, and wrong code generated by gcc 8.3.0
* Use gcc-generated assembly instead of original C sources
to work around internal compiler errors encountered with gcc 4.8/5.4 and wrong code generation by gcc 8.3
* Use gcc-generated assembly instead of the original C source
to work around internal compiler errors encountered with gcc 4.8 and 5.4, and wrong code generation by gcc 8.3
* Add gcc7-generated assembler version of caxpy for power8
to work around wrong code generated by gcc 8.3
* Handle CONJ define for caxpyc
* Handle CONJ define for caxpyc
* Add gcc7-generated assembly cdot for POWER9
* Use prebuilt assembly for POWER9 cdot
created with gcc 7.3.1 to work around ICE in older gcc versions
* Exclude POWER9 from DYNAMIC_ARCH when gcc versions is lower than 6
* Update Makefile.system
* Use PROLOGUE macro to ensure correct function name for DYNAMIC_ARCH
* Disable POWER9 with old gcc versions
2019-09-22 22:35:22 +02:00
Martin Kroeker
e7c4d6705a
Revert #2051 and replace with a better fix ( #2261 )
...
* Revert #2051 and add a better fix for TARGET=generic with DYNAMIC_ARCH
fixes #2257 without breaking #2048 again
2019-09-17 18:56:04 +02:00
Martin Kroeker
f3c314550c
Merge pull request #2243 from quickwritereader/develop
...
possible cgemv,caxpy,cdot fix
2019-08-30 23:06:23 +02:00
AbdelRauf
847c20c9b7
fix uninitialized variables i
2019-08-30 11:14:55 +00:00
AbdelRauf
4c22828812
caxpy and cdot are using vec_vsx_ld
2019-08-30 04:09:15 +00:00
AbdelRauf
e79712d969
cgemv using vec_vsx_ld instead of letting gcc to decide
2019-08-30 02:52:04 +00:00
AbdelRauf
be09551cdf
aligned
2019-08-29 23:22:23 +00:00
Martin Kroeker
11c59acfb1
Keep both PGI/SUN and default code paths to avoid breaking Clang/WIndows
2019-08-28 18:07:44 +02:00
Martin Kroeker
3a55dca2dc
Make x86_64 zdot compile with PGI and Sun C again
...
broken by #2222 as CREAL,CIMAG do not expand to a valid lvalue with these compilers
2019-08-28 11:35:31 +02:00
Kavana Bhat
3dc6b26eff
AIX changes for Power8
2019-08-20 06:51:35 -05:00
Martin Kroeker
9ef96b32a6
Add multithreading support to the x86_64 zdot kernel ( #2222 )
...
* Add multithreading support
copied from the ThunderX2T99 kernel. For #2221
2019-08-15 22:09:12 +02:00
Martin Kroeker
103b32fdb7
Merge pull request #2216 from martin-frbg/issue2214
...
Remove case-sensitivity in x86 LSAME on (AMD) cpus without CMOV
2019-08-13 13:59:33 +02:00
Martin Kroeker
aef9804089
Fix unwanted case-sensitivity in x86 LSAME for (AMD) processors without CMOV
...
Problem was already noticed some years ago in #238 , but back then the problem was only corrected in one of the #ifdef branches.
Fixes #2214
2019-08-13 10:19:10 +02:00
Martin Kroeker
dccff2e785
Merge pull request #2206 from martin-frbg/zen-dtrmm
...
Replace vpermpd with vpermilpd in the Haswell DTRMM kernel
2019-08-09 07:55:20 +02:00
Martin Kroeker
5c3458a6e7
Merge pull request #2199 from martin-frbg/zen-dtrsm
...
Replace most vpermpd calls in the Haswell DTRSM_RN kernel
2019-08-09 07:55:02 +02:00
Martin Kroeker
acf6002ab2
Replace most vpermpd calls in the Haswell DTRSM_RN kernel
2019-08-03 12:40:13 +02:00
Martin Kroeker
2dfb804cb9
Replace vpermpd with vpermilpd in the Haswell DTRMM kernel
...
to improve performance on AMD Zen (#2180 ) applying wjc404's improvement of the DGEMM kernel from #2186
2019-07-28 23:17:28 +02:00
Martin Kroeker
4c153ec9da
Merge pull request #2196 from wjc404/develop
...
Add vbroadcastsd kernel to dgemm_kernel_4x8_haswell.S
2019-07-28 23:11:40 +02:00
wjc404
7eecd8e39c
Add files via upload
2019-07-28 07:39:09 +08:00
Martin Kroeker
7b0b7c11d2
Merge pull request #2190 from martin-frbg/zdot-zen
...
Replace vpermpd with vpermilpd in the Haswell/Zen zdot microkernel
2019-07-23 16:15:08 +02:00
Martin Kroeker
28e96458e5
Replace vpermpd with vpermilpd
...
to improve performance on Zen/Zen2 (as demonstrated by wjc404 in #2180 )
2019-07-22 08:28:16 +02:00
wjc404
95fb98f556
Update dgemm_kernel_4x8_haswell.S
2019-07-21 01:10:32 +08:00
wjc404
4801c6d36b
Update dgemm_kernel_4x8_haswell.S
2019-07-21 00:47:45 +08:00
wjc404
9440fa607d
Add files via upload
2019-07-20 22:08:22 +08:00
wjc404
94db259e5b
Add files via upload
2019-07-20 22:04:41 +08:00
wjc404
f49f8047ac
Add files via upload
2019-07-20 14:33:37 +08:00
wjc404
825777faab
Update dgemm_kernel_4x8_haswell.S
2019-07-19 23:58:24 +08:00
wjc404
9c89757562
Add files via upload
2019-07-19 23:47:58 +08:00
wjc404
9b04baeaee
Update dgemm_kernel_4x8_haswell.S
2019-07-17 23:50:03 +08:00
wjc404
8a074b3965
Update dgemm_kernel_4x8_haswell.S
2019-07-17 23:47:30 +08:00
wjc404
211ab03b14
Update dgemm_kernel_4x8_haswell.S
2019-07-17 22:39:15 +08:00
wjc404
1733f927e6
Update dgemm_kernel_4x8_haswell.S
2019-07-17 21:27:41 +08:00
wjc404
182b06d6ad
Update dgemm_kernel_4x8_haswell.S
2019-07-17 17:02:35 +08:00
wjc404
7a9050d681
Update dgemm_kernel_4x8_haswell.S
2019-07-17 00:55:06 +08:00
wjc404
0ba29fd262
Update dgemm_kernel_4x8_haswell.S for zen2
...
replaced a bunch of vpermpd instructions with vpermilpd and vperm2f128
2019-07-17 00:46:51 +08:00
Martin Kroeker
6b6c9b1441
Merge pull request #2172 from quickwritereader/develop
...
power9 cgemm/ctrmm. new sgemm 8x16
2019-07-01 21:06:02 +02:00
AbdelRauf
a97b301aaa
cgemm/ctrmm power9
2019-07-01 14:07:54 +00:00
Piotr Kubaj
eebfeba768
Fix build on FreeBSD/powerpc64.
...
Signed-off-by: Piotr Kubaj <pkubaj@anongoth.pl>
2019-06-25 10:58:56 +02:00
kavanabhat
a575f1e4c7
Update dtrmm_kernel_16x4_power8.S
2019-06-19 15:27:14 +05:30
AbdelRauf
cdbfb891da
new sgemm 8x16
2019-06-17 15:33:38 +00:00
Martin Kroeker
a17cf36225
Merge pull request #2153 from quickwritereader/develop
...
improved power9 zgemm,sgemm
2019-06-06 07:42:56 +02:00
AbdelRauf
148c4cc5fd
conflict resolve
2019-06-05 20:50:50 +00:00
AbdelRauf
d0c3543c3f
power9 zgemm ztrmm optimized
2019-06-05 20:07:16 +00:00
AbdelRauf
a469b32cf4
sgemm pipeline improved, zgemm rewritten without inner packs, ABI lxvx v20 fixed with vs52
2019-06-04 07:11:30 +00:00
AbdelRauf
8fe794f059
improved zgemm power9 based on power8
2019-05-30 15:31:25 +00:00
Martin Kroeker
74c10b57c6
Use generic kernels for complex (I)AMAX to support softfp
2019-05-30 11:38:11 +02:00
Martin Kroeker
c5495d2056
Ensure correct output for DAMAX with softfp
2019-05-30 11:25:43 +02:00
Martin Kroeker
c70496b108
Separate implementations of AMAX and IAMAX on arm
...
As noted in #1912 and comment on #1942 , the combined implementation happens to "do the right thing" on hardfp, but cannot return both value and index on softfp where they would have to share the return register
2019-05-29 15:02:51 +02:00
Martin Kroeker
9ea30f3788
Replace ISMIN and ISAMIN kernels on all x86_64 platforms ( #2125 )
...
* Mark iamax_sse.S as unsuitable for MIN due to issue #2116
* Use iamax.S rather than iamax_sse.S for ISMIN/ISAMIN on all x86_64 as workaround for #2116
2019-05-09 14:42:36 +02:00
Martin Kroeker
6a8b4269b5
Merge pull request #2111 from martin-frbg/issue1955
...
Disable the SkyLakeX DGEMMIxCOPY kernels as well
2019-05-05 18:08:49 +02:00
Martin Kroeker
b1561ecc68
Disable DGEMMINCOPY as well for now
...
#1955
2019-05-05 15:52:01 +02:00
Martin Kroeker
7ed8431527
Disable the SkyLakeX DGEMMITCOPY kernel as well
...
as a stopgap measure for https://github.com/numpy/numpy/issues/13401 as mentioned in #1955
2019-05-04 22:54:41 +02:00
Martin Kroeker
3f427c0cf9
Merge pull request #2107 from quickwritereader/develop
...
sgemm/strmm kernel for power9
2019-05-02 07:56:57 +02:00
AbdelRauf
47f892198c
conflict resolve
2019-05-01 19:36:22 +00:00
AbdelRauf
628b335e83
Merge branch 'develop' of https://github.com/quickwritereader/OpenBLAS into develop
2019-04-29 08:57:44 +00:00
AbdelRauf
0f105dd8a5
sgemm/strmm
2019-04-29 08:49:50 +00:00
Martin Kroeker
ccfb7ead15
Merge pull request #2072 from martin-frbg/sum
...
Add (C)BLAS extension ?sum
2019-04-23 20:11:36 +02:00
Rashmica Gupta
bcdf1d4917
Add in runtime CPU detection for POWER.
2019-04-09 14:20:16 +10:00
Martin Kroeker
c04a729081
Add ?sum definitions for generic kernel
2019-03-31 13:55:49 +02:00
Martin Kroeker
100d94f94e
Add ?sum
2019-03-31 13:55:05 +02:00
Martin Kroeker
246ca29679
Add ZARCH implementation of ?sum
...
as trivial copies of the respective ?asum kernels with the ABS and vflpsb calls removed
2019-03-30 22:49:05 +01:00
Martin Kroeker
9d717cb5ee
Add x86_64 implementation of ?sum
...
as trivial copy of ?asum with the fabs calls removed
2019-03-30 22:27:04 +01:00
Martin Kroeker
e3bc83f2a8
Add x86 implementation of ?sum
...
as trivial copy of ?asum with the fabs calls removed
2019-03-30 22:26:10 +01:00
Martin Kroeker
70f2a4e0d7
Add SPARC implementation of ?sum
...
as trivial copy of ?asum with the fabs replaced by fmov to preserve code structure
2019-03-30 22:25:06 +01:00
Martin Kroeker
706dfe263b
Add POWER implementation of ?sum
...
as trivial copy of ?asum with the fabs replaced by fmr to preserve code structure
2019-03-30 22:23:42 +01:00
Martin Kroeker
688fa9201c
Add MIPS64 implementation of ?sum
...
as trivial copy of ?asum with the fabs replaced by mov to preserve code structure
2019-03-30 22:22:15 +01:00
Martin Kroeker
cdbe0f0235
Add MIPS implementation of ?sum
...
as trivial copy of ?asum with the fabs calls removed
2019-03-30 22:20:14 +01:00
Martin Kroeker
f8b82bc6dc
Add ia64 implementation of ?sum
...
as trivial copy of asum with the fabs calls removed
2019-03-30 22:18:03 +01:00
Martin Kroeker
3e3ccb9011
Add ARM64 implementations of ?sum
...
as trivial copies of the respective ?asum kernels with the fabs calls removed
2019-03-30 22:13:36 +01:00
Martin Kroeker
94ab4e6fb2
Add ARM implementations of ?sum
...
(trivial copies of the respective ?asum with the fabs calls removed)
2019-03-30 22:11:38 +01:00
Martin Kroeker
c3cfc6986b
Add implementations of ssum/dsum and csum/zsum
...
as trivial copies of asum/zsasum with the fabs calls replaced by fmov to preserve code structure
2019-03-30 22:05:11 +01:00
Martin Kroeker
b9f4943a14
Add ?sum
2019-03-30 22:01:13 +01:00
Martin Kroeker
32c7063cb0
Merge pull request #2061 from martin-frbg/martin-frbg-patch-1
...
Disable the AVX512 DGEMM kernel (again)
2019-03-30 21:21:38 +01:00
Martin Kroeker
7c51cc8527
Merge branch 'develop' into develop
2019-03-29 19:36:29 +01:00
AbdelRauf
853a18bc17
power9 makefile. dgemm based on power8 kernel with following changes : 32x unrolled 16x4 kernel and 8x4 kernel using (lxv stxv butterfly rank1 update). improvement from 17 to 22-23gflops. dtrmm cases were added into dgemm itself
2019-03-29 15:49:40 +00:00
Martin Kroeker
e608d4f7fe
Disable the AVX512 DGEMM kernel (again)
...
Due to as yet unresolved errors seen in #1955 and #2029
2019-03-13 22:10:28 +01:00
Martin Kroeker
03d7110900
Merge pull request #2042 from maomao194313/develop
...
add TARGET support for HiSilicon tsv110 CPUs
2019-03-12 22:57:39 +01:00
Martin Kroeker
f18ab6c17b
Merge pull request #2051 from martin-frbg/issue2048
...
Make TARGET=GENERIC compatible with DYNAMIC_ARCH=1
2019-03-09 16:39:35 +01:00
Martin Kroeker
5b95534afc
Make TARGET=GENERIC compatible with DYNAMIC_ARCH=1
...
for issue #2048
2019-03-09 11:21:16 +01:00
Celelibi
b7f59da42d
Fix crash in sgemm SSE/nano kernel on x86_64
...
Fix bug #2047 .
Signed-off-by: Celelibi <celelibi@gmail.com>
2019-03-07 16:55:13 +01:00
maomao194313
783ba8058f
HiSilicon tsv110 CPUs optimization branch
...
add HiSilicon tsv110 CPUs optimization branch
2019-03-04 16:30:50 +08:00
Andrew
6eee1beac5
move fix to right place
2019-02-24 20:41:02 +02:00
Martin Kroeker
e12cdf58ef
Merge pull request #2024 from martin-frbg/gcc9fixes4
...
Fix inline assembly constraints in Bulldozer TRSM kernels
2019-02-17 11:49:15 +01:00
Martin Kroeker
1860c9456d
Merge pull request #2023 from martin-frbg/gcc9fixes3
...
Fix inline assembly constraints in various x86_64 GEMVN kernels
2019-02-17 11:48:57 +01:00
Martin Kroeker
f9bb76d29a
Fix inline assembly constraints in Bulldozer TRSM kernels
...
rework indices to allow marking i,as and bs as both input and output (marked operand n1 as well for simplicity). For #2009
2019-02-16 20:06:48 +01:00
Martin Kroeker
efb9038f72
Fix inline assembly constraints
2019-02-16 18:46:17 +01:00
Martin Kroeker
e976557d29
Fix inline assembly constraints
...
rework indices to allow marking argument lda as input and output.
2019-02-16 18:36:39 +01:00
Martin Kroeker
9d8be15789
Fix inline assembly constraints
...
rework indices to allow marking argument lda4 as input and output. For #2009
2019-02-16 18:24:11 +01:00
Martin Kroeker
d752799a0f
Merge pull request #2021 from martin-frbg/gcc9fixes2
...
Fix wrong constraints in inline assembly of Haswell DTRSM kernel
2019-02-16 18:05:40 +01:00
Martin Kroeker
c26c0b77a7
Fix wrong constraints in inline assembly
...
for #2009
2019-02-15 15:08:16 +01:00
Martin Kroeker
1c6da2d03c
Merge pull request #2019 from martin-frbg/gcc9fixes
...
Fix unannounced modification of input operand 8 (lda4) in Haswell GEMVN microkernel
2019-02-15 15:02:54 +01:00
Martin Kroeker
4255a58cd2
Rename operands to put lda on the input/output constraint list
2019-02-15 10:10:04 +01:00
Martin Kroeker
46e415b140
Save and restore input argument 8 (lda4)
...
Fixes miscompilation with gcc9 -ftree-vectorize (related to issue #2009 )
2019-02-14 22:43:18 +01:00
Bart Oldeman
69a97ca7b9
dgemv_kernel_4x4(Haswell): add missing clobbers for xmm0,xmm1,xmm2,xmm3
...
This fixes a crash in dblat2 when OpenBLAS is compiled using
-march=znver1 -ftree-vectorize -O2
See also:
https://github.com/easybuilders/easybuild-easyconfigs/issues/7180
2019-02-14 16:27:58 +00:00
Martin Kroeker
056917d616
Merge pull request #2013 from martin-frbg/issue2011
...
Fix invalid memory access in PPC gemm_beta
2019-02-14 09:29:34 +01:00
Martin Kroeker
718efcec6f
Fix out-of-bounds memory access in gemm_beta
...
Fixes #2011 (as suggested by davemq), assuming typo by K.Goto
2019-02-13 22:08:37 +01:00
Martin Kroeker
f9d67bb5e8
Fix out-of-bounds memory access in gemm_beta
...
Fixes #2011 (as suggested by davemq) presuming typo by K.Goto
2019-02-13 22:06:41 +01:00
Martin Kroeker
76bb74fcd4
Merge pull request #2012 from maamountki/z14
...
[ZARCH] Many improvements
2019-02-13 20:15:56 +01:00
maamountki
0a54c98b9d
[ZARCH] Modify constraints
2019-02-13 21:06:25 +02:00
maamountki
bec54ae366
[ZARCH] Fix caxpy
2019-02-13 12:54:35 +02:00
Martin Kroeker
ab1630f9fa
Fix declaration of arguments in inline assembly
...
Argument 0 is modified so should be input and output
2019-02-12 16:14:02 +01:00
Martin Kroeker
b824fa70eb
Fix declaration of assembly arguments in SSYMV and DSYMV microkernels
...
Arguments 0 and 1 are both input and output
2019-02-12 16:00:18 +01:00
Martin Kroeker
91481a3e4e
Fix declaration of input arguments in inline assembly
...
Argument 0 is modified as it doubles as a counter
2019-02-12 15:51:43 +01:00
Martin Kroeker
dc6ac9eab0
Fix declaration of input arguments in the x86_64 s/dGEMV_T and s/dGEMV_N kernels
...
Arguments 0 and 1 need to be tagged as both input and output
2019-02-12 15:33:48 +01:00
maamountki
f583674109
[ZARCH] Fix cgemv_t_4
2019-02-12 13:12:28 +02:00
maamountki
77fe70019f
[ZARCH] Fix constraints and source code formatting
2019-02-11 16:01:13 +02:00
maamountki
7039770165
[ZARCH] Undo the last commit
2019-02-06 20:11:44 +02:00
maamountki
11a43e8116
[ZARCH] Set alignment hint for vl/vst
2019-02-05 19:17:08 +02:00
maamountki
61526480f9
[ZARCH] Fix copy constraint
2019-02-05 07:51:19 +02:00
maamountki
81daf6bc38
[ZARCH] Format source code, Fix constraints
2019-02-05 07:30:38 +02:00
Martin Kroeker
729e925174
Merge pull request #1996 from quickwritereader/develop
...
NBMAX=4096 for gemvn, added sgemvn 8x8 for future
2019-02-04 16:52:04 +01:00
Ubuntu
498ac98581
Note for unused kernels
2019-02-04 15:41:56 +00:00
Ubuntu
cd9ea45463
NBMAX=4096 for gemvn, added sgemvn 8x8 for future
2019-02-04 06:57:11 +00:00
Martin Kroeker
f9c5023e04
Merge pull request #1994 from quickwritereader/develop
...
sgemv cgemv pairs
2019-02-01 21:04:47 +01:00
Ubuntu
4abc375a91
sgemv cgemv pairs
2019-02-01 13:45:00 +00:00
Martin Kroeker
874df65491
Fix incorrect sgemv results for IBM z14
...
part of PR #1993 that was inadvertently misplaced into the toplevel directory
2019-02-01 12:58:59 +01:00
Martin Kroeker
877023e1e1
Fix precision of zarch DSDOT
...
from patch provided by aarnez in #991
2019-01-31 21:22:26 +01:00
Martin Kroeker
265142edd5
Fix typo in the zarch min/max kernels
...
from patch provided by aarnez in #991
2019-01-31 21:21:40 +01:00
Martin Kroeker
885a3c4350
USE_TRMM on Z14
...
from patch provided by aarnez in #991
2019-01-31 21:18:09 +01:00
maamountki
82124729af
Merge branch 'develop' into z14
2019-01-31 19:36:41 +02:00
maamountki
29416cb5a3
[ZARCH] Add Z13 version for max/min functions
2019-01-31 19:11:11 +02:00
maamountki
48b9b94f7f
[ZARCH] Improve loading performance for camax/icamax
2019-01-31 18:52:11 +02:00
Martin Kroeker
86a824c97f
Fix wrong comparison that made IMIN identical to IMAX
...
as reported by aarnez in #1990
2019-01-31 15:27:21 +01:00
Martin Kroeker
808410c2c7
Fix wrong comparison that made IMIN identical to IMAX
...
as suggested in #1990
2019-01-31 15:25:15 +01:00
maamountki
fcd814a8d2
[ZARCH] Fix bug in max/min functions
2019-01-29 17:59:38 +02:00
maamountki
dc4d3bccd5
[ZARCH] Fix icamax/icamin
2019-01-29 03:47:49 +02:00
maamountki
c7143c1019
[ZARCH] Fix iamax/imax single precision
2019-01-28 17:52:23 +02:00
maamountki
04873bb174
[ZARCH] Undo the last commit
2019-01-28 17:32:24 +02:00
maamountki
c8ef9fb220
[ZARCH] Fix bug in iamax/iamin/imax/imin
2019-01-28 17:16:18 +02:00
maamountki
b111829226
[ZARCH] Update max/min functions
2019-01-21 15:56:04 +02:00
Martin Kroeker
32b0f1168e
Fix declaration of input arguments in the Sandybridge GER microkernels ( #1967 )
...
* Tag arguments 0 and 1 as both input and output
2019-01-18 08:11:39 +01:00
Martin Kroeker
b495e54310
Fix declaration of input arguments in the x86_64 SCAL microkernels ( #1966 )
...
* Tag arguments 0 and 1 as both input and output (see #1964 )
2019-01-18 08:11:07 +01:00
Martin Kroeker
d5e6940253
Fix declaration of input arguments in the x86_64 microkernels for DOT and AXPY ( #1965 )
...
* Tag operands 0 and 1 as both input and output
For #1964 (basically a continuation of coding problems first seen in #1292 )
2019-01-17 23:20:32 +01:00
Ubuntu
43a4572038
crot fix
2019-01-17 14:45:31 +00:00
Abdelrauf
a034e65512
Merge branch 'develop' into develop
2019-01-16 19:25:13 +04:00
Ubuntu
8c3386be87
Added missing Blas1 single fp {saxpy, caxpy, cdot, crot(refactored version of srot),isamax ,isamin, icamax, icamin},
...
Fixed idamin,icamin choosing the first occurance index of equal minimals
2019-01-16 15:16:21 +00:00
maamountki
b815a04c87
[ZARCH] fix a bug in max/min functions
2019-01-15 21:04:22 +02:00
maamountki
1a7925b3a3
[ZARCH] Update dgemv_n_4.c
2019-01-11 17:43:11 +02:00
maamountki
406f835f00
[ZARCH] update cgemv_n_4.c
2019-01-11 17:39:17 +02:00
maamountki
621dedb37b
[ZARCH] Update cgemv_t_4.c
2019-01-11 17:37:11 +02:00
maamountki
b731e8246f
Update sgemv_t_4.c
2019-01-11 17:14:04 +02:00
maamountki
ecc31b743f
Update dgemv_t_4.c
2019-01-11 17:13:02 +02:00
maamountki
5d89d6b143
[ZARCH] fix sgemv_n_4.c
2019-01-11 17:08:24 +02:00
maamountki
67432b23c2
[ZARCH] fix cgemv_n_4.c
2019-01-11 16:44:46 +02:00
maamountki
be66f5d5c2
[ZARCH] fix data prefetch type in sdot
2019-01-09 16:50:07 +02:00
maamountki
c2ffef8156
[ZARCH] fix data prefetch type in ddot
2019-01-09 16:49:44 +02:00
maamountki
e7455f500c
[ZARCH] fix dsdot.c
2019-01-09 16:33:54 +02:00
maamountki
3eafcfa650
[ZARCH] fix cgemv_n_4.c
2019-01-09 07:43:45 +02:00
maamountki
94cd946b96
[ZARCH] fix cgemv_n_4.c
2019-01-04 17:45:56 +02:00
maamountki
1aa840a0a2
[ZARCH] fix sgemv_t_4.c
2019-01-04 01:38:18 +02:00
Arjan van de Ven
795285c587
Fix thinko in skylake beta handling
...
casting ints is cheaper but it has a rounding, not memory casing effect, resulting in
invalid outcome
2018-12-24 18:49:50 +00:00