Commit Graph

2023 Commits

Author SHA1 Message Date
Martin Kroeker dd04143d4a
Merge pull request #2328 from martin-frbg/ppc9
Fix precompiled kernels on POWER9 and make their use conditional on (old) gcc version
2019-11-30 12:23:57 +01:00
Martin Kroeker f3a6164bff
Merge pull request #2324 from antonblanchard/power9_segv
Fix SEGV in cdot_power9
2019-11-30 00:03:42 +01:00
Martin Kroeker dedd822d1a
Fix caxpy/caxpyc naming in localentry 2019-11-29 23:56:57 +01:00
Martin Kroeker 2181fb7047
Fix caxpy/caxpyc naming in localentry 2019-11-29 23:54:15 +01:00
Martin Kroeker a9b62c03f8
Substitute precompiled gcc7 codes only when gcc is older than 9.x 2019-11-29 23:49:50 +01:00
Martin Kroeker 97762234f9
Add variable for gcc >=9 test
used in KERNEL.POWER9
2019-11-29 23:47:23 +01:00
wjc404 934e601e93
Update dgemm_kernel_4x8_skylakex_2.c 2019-11-28 19:56:35 +08:00
Anton Blanchard cf2a8e410c Fix SEGV in cdot_power9
We were corrupting r2 because the local entry wasn't being
setup correctly.
2019-11-26 21:55:04 -07:00
wjc404 eb1e9c8c92
some optimizations 2019-11-26 14:12:20 +08:00
Andreas Arnez d117dfd505 Change bad usage of "asum" to "sum" in ZARCH versions of ?sum
The ZARCH implementations of ?sum contain a cut & paste-error: An inline
assembly argument is named "sum", but the assembly references "asum"
instead.  The mismatch causes a build error.  This is fixed.
2019-11-21 13:49:13 +01:00
Martin Kroeker b09b5be0a4
Merge pull request #2315 from ewanglong/develop
revised fix windows compatible for #2313
2019-11-21 05:06:44 +01:00
Wang, Long bfb5fbdb4d revised fix windows compatible for #2313
Signed-off-by: Wang, Long <long1.wang@intel.com>
2019-11-21 10:22:58 +08:00
Martin Kroeker 08fa83aba2
Merge pull request #2312 from martin-frbg/power8be
Further Power8 big-endian corrections
2019-11-20 15:12:06 +01:00
Wang, Long 1191db1a49 For the sake of windows compatible, used "unsigned long long" to ensure 64-bit length
Signed-off-by: Wang, Long <long1.wang@intel.com>
2019-11-20 21:30:47 +08:00
Wang, Long 0caf1434c9 Fix the integer overflow issue for large matrix size
For large matrix, e.g. M=N=K, and M>1290, int mnk=M*N*K will overflow.
This will lead to wrong branching to single-threading. The performance
is downgraded significantly.

Signed-off-by: Wang, Long <long1.wang@intel.com>
2019-11-20 14:11:17 +08:00
Martin Kroeker cad0d150db
Define alternate kernels for big-endian POWER8 2019-11-17 23:12:10 +01:00
Martin Kroeker eba0aeb7cd
Fix compilation for big-endian POWER8 2019-11-17 22:58:32 +01:00
Martin Kroeker 0c07c356c1
Define alternate kernels for big-endian PPC440 2019-11-17 19:25:08 +01:00
Martin Kroeker 3e67017ac8
Merge pull request #2309 from martin-frbg/ppc970-be
Fix PPC970 big-endian support
2019-11-17 18:22:24 +01:00
Martin Kroeker b3ac6ee222
Define alternate kernels for big-endian PPC970
The altivec versions of SGEMM and CGEMM fail most test in LAPACK-TESTING when compiled for big endian, STRSM/CTRSM even cause segfaults. The rot kernels either fail the corresponding utest or lead to failures in LAPACK-TESTING.
2019-11-17 15:19:39 +01:00
Martin Kroeker 71e96163db
Merge pull request #2305 from wjc404/develop
AVX512 CGEMM & ZGEMM kernels
2019-11-12 07:38:37 +01:00
wjc404 819e852ae7
AVX512 CGEMM & ZGEMM kernels
96-99% 1-thread performance of MKL2018
2019-11-11 20:04:52 +08:00
Martin Kroeker 4c6a457358
Merge pull request #2300 from wjc404/develop
Optimize SGEMM on SKYLAKEX CPUs
2019-11-06 07:27:33 +01:00
wjc404 836c414e22
optimizations of software prefetching 2019-11-05 13:36:56 +08:00
Martin Kroeker 3cd97f1a80
Merge pull request #2301 from martin-frbg/ppc8be
Disable IDAMIN/MAX and IZAMIN/MAX optimizations on big-endian POWER8
2019-11-04 22:54:28 +01:00
wjc404 430c11e135
Add files via upload 2019-11-04 20:10:12 +08:00
wjc404 fbacd2605d
optimizations via software prefetches 2019-11-04 19:37:19 +08:00
Martin Kroeker 68597002ea
The assembly microkernel is not safe to use on ELFv1 2019-11-03 22:42:46 +01:00
Martin Kroeker d2a6285549
The assembly microkernel is not safe to use on ELFv1 2019-11-03 22:41:19 +01:00
Martin Kroeker d999688d1a
The assembly microkernel is not safe to use on ELFv1 2019-11-03 22:39:06 +01:00
Martin Kroeker 928fe1b28e
The assembly microkernel is not safe to use on ELFv1 2019-11-03 22:37:27 +01:00
wjc404 1df9a2013d
new sgemm kernel for skylakex 2019-11-02 00:00:48 +08:00
Martin Kroeker 85ccdce8c4
Remove the IOS fallbacks to generic C kernels 2019-10-25 23:02:37 +02:00
wjc404 6ff013bae0
native support for icopy_4
90% MKL 1-thread performance.
2019-10-19 03:54:44 +08:00
wjc404 0d669e04bb
Update dgemm_kernel_8x8_skylakex.c 2019-10-18 15:00:17 +08:00
wjc404 17cdd9f9e1
some correction 2019-10-18 14:58:07 +08:00
wjc404 6bcb06fcb1
make further changes to icopy_8 easier 2019-10-18 10:47:31 +08:00
wjc404 b7315f8401
Add files via upload 2019-10-16 19:23:36 +08:00
wjc404 9b19e9e1b0
Update dgemm_kernel_8x8_skylakex.c 2019-10-16 10:14:51 +08:00
wjc404 6bd67ddbab
Update dgemm_kernel_8x8_skylakex.c 2019-10-16 03:20:08 +08:00
wjc404 844629af57
Add files via upload 2019-10-16 02:00:34 +08:00
Martin Kroeker a448884a63
Remove automatic label postfixes from macro included only once 2019-10-08 08:37:50 +02:00
Martin Kroeker 3a2df19db6
Fix accidental duplication of jump instruction 2019-10-08 08:09:26 +02:00
Martin Kroeker d2093a40d3
Merge pull request #2277 from martin-frbg/issue2275
Rewrite ARMV8 code to allow cross-compilation for IOS
2019-10-06 23:01:54 +02:00
Martin Kroeker 56837e9d92
Make local labels in macro compatible with the xcode assembler
... which does not perform the automatic numbering on instantiation that the _@ suffix signifies
2019-10-04 14:53:23 +02:00
Martin Kroeker 5e244d80f2
Merge pull request #2271 from quickwritereader/strmm_fix
fixed bug power9 strmm . BLAS-TESTER passes
2019-09-29 13:53:45 +02:00
AbdelRauf ede5efebab trmm fix 2019-09-29 02:28:34 +00:00
Martin Kroeker 596a22325a
Fix prologue of power9 assembly cdot(c) kernel to provide cdotc 2019-09-27 00:47:18 +02:00
Martin Kroeker 7f58f3ad0e
Fix mis-edits in the gcc-derived power8 caxpy kernel 2019-09-27 00:44:26 +02:00
Martin Kroeker 673e5a0495
Replace several POWER8/9 C kernels with their gcc7-generated assembly versions (#2263)
* Add gcc7-generated assembly files for POWER8/9 isa/ica-min/max and POWER9 caxpy

To work around internal compiler errors encountered when compiling the original C source with gcc 4 and 5, and wrong code generated by gcc 8.3.0

* Use gcc-generated assembly instead of original C sources

to work around internal compiler errors encountered with gcc 4.8/5.4 and wrong code generation by gcc 8.3

* Use gcc-generated assembly instead of the original C source

to work around internal compiler errors encountered with gcc 4.8 and 5.4, and wrong code generation by gcc 8.3

* Add gcc7-generated assembler version of caxpy for power8

to work around wrong code generated by gcc 8.3

* Handle CONJ define for caxpyc

* Handle CONJ define for caxpyc

* Add gcc7-generated assembly cdot for POWER9

* Use prebuilt assembly for POWER9 cdot

created with gcc 7.3.1 to work around ICE in older gcc versions

* Exclude POWER9 from DYNAMIC_ARCH when gcc versions is lower than 6

* Update Makefile.system

* Use PROLOGUE macro to ensure correct function name for DYNAMIC_ARCH

* Disable POWER9 with old gcc versions
2019-09-22 22:35:22 +02:00
Martin Kroeker e7c4d6705a
Revert #2051 and replace with a better fix (#2261)
* Revert #2051 and add a better fix for TARGET=generic with DYNAMIC_ARCH
fixes #2257 without breaking #2048 again
2019-09-17 18:56:04 +02:00
Martin Kroeker f3c314550c
Merge pull request #2243 from quickwritereader/develop
possible cgemv,caxpy,cdot fix
2019-08-30 23:06:23 +02:00
AbdelRauf 847c20c9b7 fix uninitialized variables i 2019-08-30 11:14:55 +00:00
AbdelRauf 4c22828812 caxpy and cdot are using vec_vsx_ld 2019-08-30 04:09:15 +00:00
AbdelRauf e79712d969 cgemv using vec_vsx_ld instead of letting gcc to decide 2019-08-30 02:52:04 +00:00
AbdelRauf be09551cdf aligned 2019-08-29 23:22:23 +00:00
Martin Kroeker 11c59acfb1
Keep both PGI/SUN and default code paths to avoid breaking Clang/WIndows 2019-08-28 18:07:44 +02:00
Martin Kroeker 3a55dca2dc
Make x86_64 zdot compile with PGI and Sun C again
broken by #2222 as CREAL,CIMAG do not expand to a valid lvalue with these compilers
2019-08-28 11:35:31 +02:00
Kavana Bhat 3dc6b26eff AIX changes for Power8 2019-08-20 06:51:35 -05:00
Martin Kroeker 9ef96b32a6
Add multithreading support to the x86_64 zdot kernel (#2222)
* Add multithreading support

copied from the ThunderX2T99 kernel. For #2221
2019-08-15 22:09:12 +02:00
Martin Kroeker 103b32fdb7
Merge pull request #2216 from martin-frbg/issue2214
Remove case-sensitivity in x86 LSAME on (AMD) cpus without CMOV
2019-08-13 13:59:33 +02:00
Martin Kroeker aef9804089
Fix unwanted case-sensitivity in x86 LSAME for (AMD) processors without CMOV
Problem was already noticed some years ago in #238, but back then the problem was only corrected in one of the #ifdef branches.
Fixes #2214
2019-08-13 10:19:10 +02:00
Martin Kroeker dccff2e785
Merge pull request #2206 from martin-frbg/zen-dtrmm
Replace vpermpd with vpermilpd in the Haswell DTRMM kernel
2019-08-09 07:55:20 +02:00
Martin Kroeker 5c3458a6e7
Merge pull request #2199 from martin-frbg/zen-dtrsm
Replace most vpermpd calls in the Haswell DTRSM_RN kernel
2019-08-09 07:55:02 +02:00
Martin Kroeker acf6002ab2
Replace most vpermpd calls in the Haswell DTRSM_RN kernel 2019-08-03 12:40:13 +02:00
Martin Kroeker 2dfb804cb9
Replace vpermpd with vpermilpd in the Haswell DTRMM kernel
to improve performance on AMD Zen (#2180) applying wjc404's improvement of the DGEMM kernel from #2186
2019-07-28 23:17:28 +02:00
Martin Kroeker 4c153ec9da
Merge pull request #2196 from wjc404/develop
Add vbroadcastsd kernel to dgemm_kernel_4x8_haswell.S
2019-07-28 23:11:40 +02:00
wjc404 7eecd8e39c
Add files via upload 2019-07-28 07:39:09 +08:00
Martin Kroeker 7b0b7c11d2
Merge pull request #2190 from martin-frbg/zdot-zen
Replace vpermpd with vpermilpd in the Haswell/Zen zdot microkernel
2019-07-23 16:15:08 +02:00
Martin Kroeker 28e96458e5
Replace vpermpd with vpermilpd
to improve performance on Zen/Zen2 (as demonstrated by wjc404 in #2180)
2019-07-22 08:28:16 +02:00
wjc404 95fb98f556
Update dgemm_kernel_4x8_haswell.S 2019-07-21 01:10:32 +08:00
wjc404 4801c6d36b
Update dgemm_kernel_4x8_haswell.S 2019-07-21 00:47:45 +08:00
wjc404 9440fa607d
Add files via upload 2019-07-20 22:08:22 +08:00
wjc404 94db259e5b
Add files via upload 2019-07-20 22:04:41 +08:00
wjc404 f49f8047ac
Add files via upload 2019-07-20 14:33:37 +08:00
wjc404 825777faab
Update dgemm_kernel_4x8_haswell.S 2019-07-19 23:58:24 +08:00
wjc404 9c89757562
Add files via upload 2019-07-19 23:47:58 +08:00
wjc404 9b04baeaee
Update dgemm_kernel_4x8_haswell.S 2019-07-17 23:50:03 +08:00
wjc404 8a074b3965
Update dgemm_kernel_4x8_haswell.S 2019-07-17 23:47:30 +08:00
wjc404 211ab03b14
Update dgemm_kernel_4x8_haswell.S 2019-07-17 22:39:15 +08:00
wjc404 1733f927e6
Update dgemm_kernel_4x8_haswell.S 2019-07-17 21:27:41 +08:00
wjc404 182b06d6ad
Update dgemm_kernel_4x8_haswell.S 2019-07-17 17:02:35 +08:00
wjc404 7a9050d681
Update dgemm_kernel_4x8_haswell.S 2019-07-17 00:55:06 +08:00
wjc404 0ba29fd262
Update dgemm_kernel_4x8_haswell.S for zen2
replaced a bunch of vpermpd instructions with vpermilpd and vperm2f128
2019-07-17 00:46:51 +08:00
Martin Kroeker 6b6c9b1441
Merge pull request #2172 from quickwritereader/develop
power9 cgemm/ctrmm. new sgemm 8x16
2019-07-01 21:06:02 +02:00
AbdelRauf a97b301aaa cgemm/ctrmm power9 2019-07-01 14:07:54 +00:00
Piotr Kubaj eebfeba768 Fix build on FreeBSD/powerpc64.
Signed-off-by: Piotr Kubaj <pkubaj@anongoth.pl>
2019-06-25 10:58:56 +02:00
kavanabhat a575f1e4c7
Update dtrmm_kernel_16x4_power8.S 2019-06-19 15:27:14 +05:30
AbdelRauf cdbfb891da new sgemm 8x16 2019-06-17 15:33:38 +00:00
Martin Kroeker a17cf36225
Merge pull request #2153 from quickwritereader/develop
improved power9 zgemm,sgemm
2019-06-06 07:42:56 +02:00
AbdelRauf 148c4cc5fd conflict resolve 2019-06-05 20:50:50 +00:00
AbdelRauf d0c3543c3f power9 zgemm ztrmm optimized 2019-06-05 20:07:16 +00:00
AbdelRauf a469b32cf4 sgemm pipeline improved, zgemm rewritten without inner packs, ABI lxvx v20 fixed with vs52 2019-06-04 07:11:30 +00:00
AbdelRauf 8fe794f059 improved zgemm power9 based on power8 2019-05-30 15:31:25 +00:00
Martin Kroeker 74c10b57c6
Use generic kernels for complex (I)AMAX to support softfp 2019-05-30 11:38:11 +02:00
Martin Kroeker c5495d2056
Ensure correct output for DAMAX with softfp 2019-05-30 11:25:43 +02:00
Martin Kroeker c70496b108
Separate implementations of AMAX and IAMAX on arm
As noted in #1912 and comment on #1942, the combined implementation happens to "do the right thing" on hardfp, but cannot return both value and index on softfp where they would have to share the return register
2019-05-29 15:02:51 +02:00
Martin Kroeker 9ea30f3788
Replace ISMIN and ISAMIN kernels on all x86_64 platforms (#2125)
* Mark iamax_sse.S as unsuitable for MIN due to issue #2116
* Use iamax.S rather than iamax_sse.S for ISMIN/ISAMIN on all x86_64 as workaround for #2116
2019-05-09 14:42:36 +02:00
Martin Kroeker 6a8b4269b5
Merge pull request #2111 from martin-frbg/issue1955
Disable the SkyLakeX DGEMMIxCOPY kernels as well
2019-05-05 18:08:49 +02:00
Martin Kroeker b1561ecc68
Disable DGEMMINCOPY as well for now
#1955
2019-05-05 15:52:01 +02:00
Martin Kroeker 7ed8431527
Disable the SkyLakeX DGEMMITCOPY kernel as well
as a stopgap measure for https://github.com/numpy/numpy/issues/13401 as mentioned in #1955
2019-05-04 22:54:41 +02:00
Martin Kroeker 3f427c0cf9
Merge pull request #2107 from quickwritereader/develop
sgemm/strmm kernel for power9
2019-05-02 07:56:57 +02:00
AbdelRauf 47f892198c conflict resolve 2019-05-01 19:36:22 +00:00
AbdelRauf 628b335e83 Merge branch 'develop' of https://github.com/quickwritereader/OpenBLAS into develop 2019-04-29 08:57:44 +00:00
AbdelRauf 0f105dd8a5 sgemm/strmm 2019-04-29 08:49:50 +00:00
Martin Kroeker ccfb7ead15
Merge pull request #2072 from martin-frbg/sum
Add (C)BLAS extension ?sum
2019-04-23 20:11:36 +02:00
Rashmica Gupta bcdf1d4917 Add in runtime CPU detection for POWER. 2019-04-09 14:20:16 +10:00
Martin Kroeker c04a729081
Add ?sum definitions for generic kernel 2019-03-31 13:55:49 +02:00
Martin Kroeker 100d94f94e
Add ?sum 2019-03-31 13:55:05 +02:00
Martin Kroeker 246ca29679
Add ZARCH implementation of ?sum
as trivial copies of the respective ?asum kernels with the ABS and vflpsb calls removed
2019-03-30 22:49:05 +01:00
Martin Kroeker 9d717cb5ee
Add x86_64 implementation of ?sum
as trivial copy of ?asum with the fabs calls removed
2019-03-30 22:27:04 +01:00
Martin Kroeker e3bc83f2a8
Add x86 implementation of ?sum
as trivial copy of ?asum with the fabs calls removed
2019-03-30 22:26:10 +01:00
Martin Kroeker 70f2a4e0d7
Add SPARC implementation of ?sum
as trivial copy of ?asum with the fabs replaced by fmov to preserve code structure
2019-03-30 22:25:06 +01:00
Martin Kroeker 706dfe263b
Add POWER implementation of ?sum
as trivial copy of ?asum with the fabs replaced by fmr to preserve code structure
2019-03-30 22:23:42 +01:00
Martin Kroeker 688fa9201c
Add MIPS64 implementation of ?sum
as trivial copy of ?asum with the fabs replaced by mov to preserve code structure
2019-03-30 22:22:15 +01:00
Martin Kroeker cdbe0f0235
Add MIPS implementation of ?sum
as trivial copy of ?asum with the fabs calls removed
2019-03-30 22:20:14 +01:00
Martin Kroeker f8b82bc6dc
Add ia64 implementation of ?sum
as trivial copy of asum with the fabs calls removed
2019-03-30 22:18:03 +01:00
Martin Kroeker 3e3ccb9011
Add ARM64 implementations of ?sum
as trivial copies of the respective ?asum kernels with the fabs calls removed
2019-03-30 22:13:36 +01:00
Martin Kroeker 94ab4e6fb2
Add ARM implementations of ?sum
(trivial copies of the respective ?asum with the fabs calls removed)
2019-03-30 22:11:38 +01:00
Martin Kroeker c3cfc6986b
Add implementations of ssum/dsum and csum/zsum
as trivial copies of asum/zsasum with the fabs calls replaced by fmov to preserve code structure
2019-03-30 22:05:11 +01:00
Martin Kroeker b9f4943a14
Add ?sum 2019-03-30 22:01:13 +01:00
Martin Kroeker 32c7063cb0
Merge pull request #2061 from martin-frbg/martin-frbg-patch-1
Disable the AVX512 DGEMM kernel (again)
2019-03-30 21:21:38 +01:00
Martin Kroeker 7c51cc8527
Merge branch 'develop' into develop 2019-03-29 19:36:29 +01:00
AbdelRauf 853a18bc17 power9 makefile. dgemm based on power8 kernel with following changes : 32x unrolled 16x4 kernel and 8x4 kernel using (lxv stxv butterfly rank1 update). improvement from 17 to 22-23gflops. dtrmm cases were added into dgemm itself 2019-03-29 15:49:40 +00:00
Martin Kroeker e608d4f7fe
Disable the AVX512 DGEMM kernel (again)
Due to as yet unresolved errors seen in #1955 and #2029
2019-03-13 22:10:28 +01:00
Martin Kroeker 03d7110900
Merge pull request #2042 from maomao194313/develop
add TARGET support for HiSilicon tsv110 CPUs
2019-03-12 22:57:39 +01:00
Martin Kroeker f18ab6c17b
Merge pull request #2051 from martin-frbg/issue2048
Make TARGET=GENERIC compatible with DYNAMIC_ARCH=1
2019-03-09 16:39:35 +01:00
Martin Kroeker 5b95534afc
Make TARGET=GENERIC compatible with DYNAMIC_ARCH=1
for issue #2048
2019-03-09 11:21:16 +01:00
Celelibi b7f59da42d Fix crash in sgemm SSE/nano kernel on x86_64
Fix bug #2047.

Signed-off-by: Celelibi <celelibi@gmail.com>
2019-03-07 16:55:13 +01:00
maomao194313 783ba8058f
HiSilicon tsv110 CPUs optimization branch
add HiSilicon tsv110 CPUs  optimization branch
2019-03-04 16:30:50 +08:00
Andrew 6eee1beac5 move fix to right place 2019-02-24 20:41:02 +02:00
Martin Kroeker e12cdf58ef
Merge pull request #2024 from martin-frbg/gcc9fixes4
Fix inline assembly constraints in Bulldozer TRSM kernels
2019-02-17 11:49:15 +01:00
Martin Kroeker 1860c9456d
Merge pull request #2023 from martin-frbg/gcc9fixes3
Fix inline assembly constraints in various x86_64 GEMVN kernels
2019-02-17 11:48:57 +01:00
Martin Kroeker f9bb76d29a
Fix inline assembly constraints in Bulldozer TRSM kernels
rework indices to allow marking i,as and bs as both input and output (marked operand n1 as well for simplicity). For #2009
2019-02-16 20:06:48 +01:00
Martin Kroeker efb9038f72
Fix inline assembly constraints 2019-02-16 18:46:17 +01:00
Martin Kroeker e976557d29
Fix inline assembly constraints
rework indices to allow marking argument lda as input and output.
2019-02-16 18:36:39 +01:00
Martin Kroeker 9d8be15789
Fix inline assembly constraints
rework indices to allow marking argument lda4 as input and output. For #2009
2019-02-16 18:24:11 +01:00
Martin Kroeker d752799a0f
Merge pull request #2021 from martin-frbg/gcc9fixes2
Fix wrong constraints in inline assembly of Haswell DTRSM kernel
2019-02-16 18:05:40 +01:00
Martin Kroeker c26c0b77a7
Fix wrong constraints in inline assembly
for #2009
2019-02-15 15:08:16 +01:00
Martin Kroeker 1c6da2d03c
Merge pull request #2019 from martin-frbg/gcc9fixes
Fix unannounced modification of input operand 8 (lda4) in Haswell GEMVN microkernel
2019-02-15 15:02:54 +01:00
Martin Kroeker 4255a58cd2
Rename operands to put lda on the input/output constraint list 2019-02-15 10:10:04 +01:00
Martin Kroeker 46e415b140
Save and restore input argument 8 (lda4)
Fixes miscompilation with gcc9 -ftree-vectorize (related to issue #2009)
2019-02-14 22:43:18 +01:00
Bart Oldeman 69a97ca7b9 dgemv_kernel_4x4(Haswell): add missing clobbers for xmm0,xmm1,xmm2,xmm3
This fixes a crash in dblat2 when OpenBLAS is compiled using
-march=znver1 -ftree-vectorize -O2

See also:
https://github.com/easybuilders/easybuild-easyconfigs/issues/7180
2019-02-14 16:27:58 +00:00
Martin Kroeker 056917d616
Merge pull request #2013 from martin-frbg/issue2011
Fix invalid memory access in PPC gemm_beta
2019-02-14 09:29:34 +01:00
Martin Kroeker 718efcec6f
Fix out-of-bounds memory access in gemm_beta
Fixes #2011 (as suggested by davemq), assuming typo by K.Goto
2019-02-13 22:08:37 +01:00
Martin Kroeker f9d67bb5e8
Fix out-of-bounds memory access in gemm_beta
Fixes #2011 (as suggested by davemq) presuming typo by K.Goto
2019-02-13 22:06:41 +01:00
Martin Kroeker 76bb74fcd4
Merge pull request #2012 from maamountki/z14
[ZARCH] Many improvements
2019-02-13 20:15:56 +01:00
maamountki 0a54c98b9d
[ZARCH] Modify constraints 2019-02-13 21:06:25 +02:00
maamountki bec54ae366
[ZARCH] Fix caxpy 2019-02-13 12:54:35 +02:00
Martin Kroeker ab1630f9fa
Fix declaration of arguments in inline assembly
Argument 0 is modified so should be input and output
2019-02-12 16:14:02 +01:00
Martin Kroeker b824fa70eb
Fix declaration of assembly arguments in SSYMV and DSYMV microkernels
Arguments 0 and 1 are both input and output
2019-02-12 16:00:18 +01:00
Martin Kroeker 91481a3e4e
Fix declaration of input arguments in inline assembly
Argument 0 is modified as it doubles as a counter
2019-02-12 15:51:43 +01:00
Martin Kroeker dc6ac9eab0
Fix declaration of input arguments in the x86_64 s/dGEMV_T and s/dGEMV_N kernels
Arguments 0 and 1 need to be tagged as both input and output
2019-02-12 15:33:48 +01:00
maamountki f583674109
[ZARCH] Fix cgemv_t_4 2019-02-12 13:12:28 +02:00
maamountki 77fe70019f
[ZARCH] Fix constraints and source code formatting 2019-02-11 16:01:13 +02:00
maamountki 7039770165
[ZARCH] Undo the last commit 2019-02-06 20:11:44 +02:00
maamountki 11a43e8116
[ZARCH] Set alignment hint for vl/vst 2019-02-05 19:17:08 +02:00
maamountki 61526480f9
[ZARCH] Fix copy constraint 2019-02-05 07:51:19 +02:00
maamountki 81daf6bc38
[ZARCH] Format source code, Fix constraints 2019-02-05 07:30:38 +02:00
Martin Kroeker 729e925174
Merge pull request #1996 from quickwritereader/develop
NBMAX=4096 for gemvn, added sgemvn 8x8 for future
2019-02-04 16:52:04 +01:00
Ubuntu 498ac98581 Note for unused kernels 2019-02-04 15:41:56 +00:00
Ubuntu cd9ea45463 NBMAX=4096 for gemvn, added sgemvn 8x8 for future 2019-02-04 06:57:11 +00:00
Martin Kroeker f9c5023e04
Merge pull request #1994 from quickwritereader/develop
sgemv cgemv pairs
2019-02-01 21:04:47 +01:00
Ubuntu 4abc375a91 sgemv cgemv pairs 2019-02-01 13:45:00 +00:00
Martin Kroeker 874df65491
Fix incorrect sgemv results for IBM z14
part of PR #1993 that was inadvertently misplaced into the toplevel directory
2019-02-01 12:58:59 +01:00
Martin Kroeker 877023e1e1
Fix precision of zarch DSDOT
from patch provided by aarnez in #991
2019-01-31 21:22:26 +01:00
Martin Kroeker 265142edd5
Fix typo in the zarch min/max kernels
from patch provided by aarnez in #991
2019-01-31 21:21:40 +01:00
Martin Kroeker 885a3c4350
USE_TRMM on Z14
from patch provided by aarnez in #991
2019-01-31 21:18:09 +01:00
maamountki 82124729af
Merge branch 'develop' into z14 2019-01-31 19:36:41 +02:00
maamountki 29416cb5a3
[ZARCH] Add Z13 version for max/min functions 2019-01-31 19:11:11 +02:00
maamountki 48b9b94f7f
[ZARCH] Improve loading performance for camax/icamax 2019-01-31 18:52:11 +02:00
Martin Kroeker 86a824c97f
Fix wrong comparison that made IMIN identical to IMAX
as reported by aarnez in #1990
2019-01-31 15:27:21 +01:00
Martin Kroeker 808410c2c7
Fix wrong comparison that made IMIN identical to IMAX
as suggested in #1990
2019-01-31 15:25:15 +01:00
maamountki fcd814a8d2
[ZARCH] Fix bug in max/min functions 2019-01-29 17:59:38 +02:00
maamountki dc4d3bccd5
[ZARCH] Fix icamax/icamin 2019-01-29 03:47:49 +02:00
maamountki c7143c1019
[ZARCH] Fix iamax/imax single precision 2019-01-28 17:52:23 +02:00
maamountki 04873bb174
[ZARCH] Undo the last commit 2019-01-28 17:32:24 +02:00
maamountki c8ef9fb220
[ZARCH] Fix bug in iamax/iamin/imax/imin 2019-01-28 17:16:18 +02:00
maamountki b111829226
[ZARCH] Update max/min functions 2019-01-21 15:56:04 +02:00
Martin Kroeker 32b0f1168e
Fix declaration of input arguments in the Sandybridge GER microkernels (#1967)
* Tag arguments 0 and 1 as both input and output
2019-01-18 08:11:39 +01:00
Martin Kroeker b495e54310
Fix declaration of input arguments in the x86_64 SCAL microkernels (#1966)
* Tag arguments 0 and 1 as both input and output (see #1964)
2019-01-18 08:11:07 +01:00
Martin Kroeker d5e6940253
Fix declaration of input arguments in the x86_64 microkernels for DOT and AXPY (#1965)
* Tag operands 0 and 1 as both input and output

For #1964 (basically a continuation of coding problems first seen in #1292)
2019-01-17 23:20:32 +01:00
Ubuntu 43a4572038 crot fix 2019-01-17 14:45:31 +00:00
Abdelrauf a034e65512
Merge branch 'develop' into develop 2019-01-16 19:25:13 +04:00
Ubuntu 8c3386be87 Added missing Blas1 single fp {saxpy, caxpy, cdot, crot(refactored version of srot),isamax ,isamin, icamax, icamin},
Fixed idamin,icamin choosing the first occurance index of equal minimals
2019-01-16 15:16:21 +00:00
maamountki b815a04c87
[ZARCH] fix a bug in max/min functions 2019-01-15 21:04:22 +02:00
maamountki 1a7925b3a3
[ZARCH] Update dgemv_n_4.c 2019-01-11 17:43:11 +02:00
maamountki 406f835f00
[ZARCH] update cgemv_n_4.c 2019-01-11 17:39:17 +02:00
maamountki 621dedb37b
[ZARCH] Update cgemv_t_4.c 2019-01-11 17:37:11 +02:00
maamountki b731e8246f
Update sgemv_t_4.c 2019-01-11 17:14:04 +02:00
maamountki ecc31b743f
Update dgemv_t_4.c 2019-01-11 17:13:02 +02:00
maamountki 5d89d6b143
[ZARCH] fix sgemv_n_4.c 2019-01-11 17:08:24 +02:00
maamountki 67432b23c2
[ZARCH] fix cgemv_n_4.c 2019-01-11 16:44:46 +02:00
maamountki be66f5d5c2
[ZARCH] fix data prefetch type in sdot 2019-01-09 16:50:07 +02:00
maamountki c2ffef8156
[ZARCH] fix data prefetch type in ddot 2019-01-09 16:49:44 +02:00
maamountki e7455f500c
[ZARCH] fix dsdot.c 2019-01-09 16:33:54 +02:00
maamountki 3eafcfa650
[ZARCH] fix cgemv_n_4.c 2019-01-09 07:43:45 +02:00
maamountki 94cd946b96
[ZARCH] fix cgemv_n_4.c 2019-01-04 17:45:56 +02:00
maamountki 1aa840a0a2
[ZARCH] fix sgemv_t_4.c 2019-01-04 01:38:18 +02:00
Arjan van de Ven 795285c587 Fix thinko in skylake beta handling
casting ints is cheaper but it has a rounding, not memory casing effect, resulting in
invalid outcome
2018-12-24 18:49:50 +00:00
Arjan van de Ven d321448a63 dgemm: use dgemm_ncopy_8_skylakex.c also for Haswell
The dgemm_ncopy_8_skylakex.c code is not avx512 specific and gives
a nice performance boost for medium sized matrices
2018-12-16 23:09:22 +00:00
Arjan van de Ven c43331ad0a dgemm: Use the skylakex beta function also for haswell
it's more efficient for certain tall/skinny matrices
2018-12-16 23:09:17 +00:00
Martin Kroeker c4e23dd016
Update Makefile 2018-12-16 18:14:40 +01:00
Martin Kroeker cfc4acc221
typo 2018-12-16 16:19:51 +01:00
Martin Kroeker 545c2b1bbb
Add -mavx2 on Haswell only if the compiler supports it 2018-12-16 13:09:19 +01:00
Arjan van de Ven 69d206440a Make the skylakex/haswell sgemm code compile and run even with compilers without avx2 support 2018-12-16 00:19:41 +00:00
Martin Kroeker 3843e3e017
use -maxv2 on haswell 2018-12-15 23:30:31 +01:00
Martin Kroeker fbcb14a74b
should be core-avx2 2018-12-15 20:18:59 +01:00
Martin Kroeker 2a3190dc76
fix elseifeq and use older option core2-avx for compatibility 2018-12-15 20:17:44 +01:00
Martin Kroeker 1ebe5c0f49
Add -march=haswell to HASWELL part of DYNAMIC_ARCH build 2018-12-15 19:35:35 +01:00
Arjan van de Ven 0586899a10 Use sgemm_ncopy_4_skylakex.c also for Haswell
sgemm_ncopy_4_skylakex.c uses SSE transpose operations where the
real perf win happens; this also works great for Haswell.

This gives double digit percentage gains on small and skinny matrices
2018-12-15 13:49:19 +00:00
Arjan van de Ven 00dc09ad19 Use the skylake sgemm beta code also for haswell
with a few small changes it's possible to use the skylake sgemm code
also for haswell, this gives a modest gain (10% range) for smallish
matrixes but does wonders for very skinny matrixes
2018-12-15 13:49:13 +00:00
Arjan van de Ven cdc668d82b Add a "sgemm direct" mode for small matrixes
OpenBLAS has a fancy algorithm for copying the input data while laying
it out in a more CPU friendly memory layout.

This is great for large matrixes; the cost of the copy is easily
ammortized by the gains from the better memory layout.

But for small matrixes (on CPUs that can do efficient unaligned loads) this
copy can be a net loss.

This patch adds (for SKYLAKEX initially) a "sgemm direct" mode, that bypasses
the whole copy machinary for ALPHA=1/BETA=0/... standard arguments,
for small matrixes only.

What is small? For the non-threaded case this has been measured to be
in the M*N*K = 28 * 512 * 512 range, while in the threaded case it's
less, around M*N*K = 1 * 512 * 512
2018-12-13 13:47:31 +00:00
Martin Kroeker 87718807f0
Merge pull request #1910 from martin-frbg/issue1909
Fix for DYNAMIC_ARCH builds made on a AVX512-capable host
2018-12-12 14:56:25 +01:00
Martin Kroeker 51aec8e96b
make sure the added march=skylake-avx512 does not cause problems on Windows 2018-12-11 22:47:32 +01:00
Martin Kroeker 06f7d78d70
Add -march=skylake-avx512 to SkylakeX part of DYNAMIC_ARCH builds 2018-12-11 21:10:38 +01:00
Martin Kroeker 7639f2e1f0
Rewrite the conditional for OSX to fix cmake parsing on others
The Makefile variable parser in utils.cmake currently does not handle conditionals. Having the definitions for non-OSX last will at least make cmake builds work again on non-OSX platforms.
2018-12-06 14:04:27 +01:00
Martin Kroeker 2fc712469d
Avoid creating spurious non-suffixed c/zgemm_kernels
Plain cgemm_kernel and zgemm_kernel are not used anywhere, only cgemm_kernel_b etc.
Needlessly building them (without any define like NN, CN, etc.) just happened to work on most platforms, but not on arm64. See #1870
2018-12-06 13:56:06 +01:00
Martin Kroeker 6ba30e270d
Fix typo that broke CNRM2 on ARMV8 since 0.3.0
must have happened in my #1449
2018-12-06 13:42:25 +01:00
Martin Kroeker 701ea88347
Use p2align instead of align for OSX compatibility
fixes #1902
2018-12-03 13:06:43 +01:00
Martin Kroeker 6c7b691083
Really revert xDOT changes from 1832
neglected to rebase #1892 on merging
2018-11-30 21:32:01 +01:00
Martin Kroeker 5f4c550c27
Merge pull request #1892 from martin-frbg/mipsdot
revert MIPS64 xDOT kernel changes from #1832
2018-11-30 21:28:21 +01:00
Martin Kroeker 95a5542e3c
Revert DOT kernel changes from #1834
as the failures seen on Loongson3A appear to be limited to DSDOT/SDSDOT (i.e. my hackish "fix" from #1684)
2018-11-30 11:16:24 +01:00
Martin Kroeker 7a2e1bc804
Use generic kernel for DSDOT/SDSDOT
as discussed in #1834
2018-11-30 10:57:09 +01:00
Martin Kroeker 35653e38b3
Merge pull request #1834 from fengrl/develop
register push/pop command change
2018-11-30 10:48:46 +01:00
Andrew 19c4bdd8b3 Add return value so that freebsd system clang does not err out 2018-11-25 21:35:01 +01:00
Renato Golin 310ea55f29 Simplifying ARMv8 build parameters
ARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode
(which is not right because TX2 is ARMv8.1) as well as requiring a few
redundancies in the defines, making it harder to maintain and understand
what core has what. A few other minor issues were also fixed.

Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX,
ThunderX2, and XGene.

Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester.

A summary:
 * Removed TX2 code from ARMv8 build, to make sure it is compatible with
   all ARMv8 cores, not just v8.1. Also, the TX2 code has actually
   harmed performance on big cores.
 * Commoned up ARMv8 architectures' defines in params.h, to make sure
   that all will benefit from ARMv8 settings, in addition to their own.
 * Adding a few more cores, using ARMv8's include strategy, to benefit
   from compiler optimisations using mtune. Also updated cache
   information from the manuals, making sure we set good conservative
   values by default. Removed Vulcan, as it's an alias to TX2.
 * Auto-detecting most of those cores, but also updating the forced
   compilation in getarch.c, to make sure the parameters are the same
   whether compiled natively or forced arch.

Benefits:
 * ARMv8 build is now guaranteed to work on all ARMv8 cores
 * Improved performance for ARMv8 builds on some cores (A72, Falkor,
   ThunderX1 and 2: up to 11%) over current develop
 * Improved performance for *all* cores comparing to develop branch
   before TX2's patch (9% ~ 36%)
 * ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than
   current develop's branch and 8% faster than deveop before tx2 patches

Issues:
 * Regression from current develop branch for A53 (-12%) and A57 (-3%)
   with ARMv8 builds, but still faster than before TX2's commit (+15%
   and +24% respectively). This can be improved with a simplification of
   TX2's code, to be done in future patches. At least the code is
   guaranteed to be ARMv8.0 now.

Comments:
 * CortexA57 builds are unchanged on A57 hardware from develop's branch,
   which makes sense, as it's untouched.
 * CortexA72 builds improve over A57 on A72 hardware, even if they're
   using the same includes due to new compiler tunning in the makefile.
2018-11-19 16:41:49 +00:00
fengruilin 43bb386b10 fix dot problem on 64bit mips 2018-11-15 11:11:59 +08:00
Arjan van de Ven dcc5d6291e skylakex: Make the sgemm/dgemm beta code robust for a N=0 or M=0 case
in the threading code there are cases where N or M can become 0,
and the optimized beta code did not handle this well, leading
to a crash

during the audit for the crash a few edge conditions on the if statements
were found and fixed as well
2018-11-01 01:42:09 +00:00
fengrl 2d8064174c
register push/pop command change
64bit push/pop register command should be used. Otherwise, data will lost.
2018-10-26 17:55:15 +08:00
Ashwin Sekhar T K d5aeff636f ARM64: Enable DYNAMIC_ARCH
Enable DYNAMIC_ARCH feature on ARM64. This patch uses the cpuid
feature in linux kernel to detect the core type at runtime
(https://www.kernel.org/doc/Documentation/arm64/cpu-feature-registers.txt).

If this feature is missing in kernel, then the user should use the
OPENBLAS_CORETYPE env variable to select the desired core type.
2018-10-22 01:49:35 -07:00
Ashwin Sekhar T K e7b66cd36e ARM64: Fix DYNAMIC_ARCH compilation for cores which dont use GEMM3M 2018-10-22 01:45:51 -07:00
Ashwin Sekhar T K d50abc8903 ARM64: Move parameters from parameter.c to param.h
Remove the runtime setting of P, Q, R parameters for
targets ARMV8, THUNDERX2T99. Instead set them as constants
in param.h at compile time.
2018-10-22 01:45:51 -07:00
Ashwin Sekhar T K 351a0c777c ARM64: Remove XGENE1 references
Remove XGENE1 target as the implementation for the
same is incomplete. Moreover whoever wishes to use
on XGENE1 can use the generic ARMV8 target as there
are no XGENE1 specific optimizations in OpenBLAS.
2018-10-22 01:45:51 -07:00
Ashwin Sekhar T K 21f46a1cf2 ARM64: Use THUNDERX2T99 Neon Kernels for ARMV8
Currently the generic ARMV8 target uses C implementations
for many routines. Replace these with the neon implementations
written for THUNDERX2T99 target which are upto 6x faster for
certain routines.
2018-10-17 10:44:37 -07:00
Ashwin Sekhar T K caf339412f ARM64: Remove dependency of THUNDERX2T99 Makefile on CORTEXA57 Makefile 2018-10-17 08:02:40 -07:00
Ashwin Sekhar T K 8001fdcd2a ARM64: Remove dependency of THUNDERX Makefile on ARMV8 Makefile 2018-10-17 08:02:16 -07:00
Ashwin Sekhar T K 162e312832 ARM64: Remove dependency of CORTEXA57 Makefile on ARMV8 Makefile 2018-10-17 08:01:45 -07:00
Ashwin Sekhar T K c3d93caa8d ARM64: Remove dependency of XGENE1 Makefile on ARMV8 Makefile 2018-10-17 08:01:27 -07:00
Arjan van de Ven 55b244ca0d enable the SGEMM/SKX C based kernel
In QA the final bug was found so now the sklyakex sgemm C based kernel can
be activated....
2018-10-12 09:30:35 +00:00
Arjan van de Ven d4bad73834 Add a C+intrinsics version of the SGEMM/skylakex kernel
for most sizes this is 1.2x to 1.4x faster than the current code
2018-10-10 01:49:22 +00:00
Arjan van de Ven 582c589727 dgemm/skylakex: replace discrete mul/add with fma
very minor gains since it's not super hot code, but general principles
2018-10-06 23:13:26 +00:00
Arjan van de Ven adbf6afa25 Add vector optimizations for ncopy as well for dgemm/skylakex 2018-10-06 21:18:12 +00:00
Arjan van de Ven 32bec8afbb add a skylakex optimized dgemm beta function 2018-10-06 16:36:26 +00:00
Arjan van de Ven 20c5d668fe dgemm/avx512 simplify and speed up the 4x4 kernel 2018-10-06 14:12:32 +00:00
Arjan van de Ven 6d43c51ccf undo slow dgemm/skylake microoptimization
the compare is more costly than the work
2018-10-06 14:00:37 +00:00
Arjan van de Ven d74dc39b0f Add optimized *copy versions for skylakex
Add optimized n/t copy versions for skylakex; in the patch the
tcopy is also rewritten using intrinsics; the ncopy file
will be worked on in a future commit
2018-10-06 13:51:44 +00:00
Arjan van de Ven 66b43affbc Add a 24x8 kernel to the skylakex dgemm implementation
Minor gains for small matrixes, but at 512x512 and above the gain
gets more significant.
2018-10-05 13:22:21 +00:00
Arjan van de Ven 1938819c25 skylake dgemm: Add a 16x8 kernel
The next step for the avx512 dgemm code is adding a 16x8 kernel.
In the 8x8 kernel, each FMA has a matching load (the broadcast);
in the 16x8 kernel we can reuse this load for 2 FMAs, which
in turn reduces pressure on the load ports of the CPU and gives
a nice performance boost (in the 25% range).
2018-10-05 13:11:35 +00:00
Martin Kroeker b7496c3638
Function name needs to be CNAME, set from outside to allow suffixing for dynamic_arch 2018-10-04 19:14:59 +02:00
Arjan van de Ven 45fe8cb0c5 Create a AVX512 enabled version of DGEMM
This patch adds dgemm_kernel_4x8_skylakex.c which is
* dgemm_kernel_4x8_haswell.s converted to C + intrinsics
* 8x8 support added
* 8x8 kernel implemented using AVX512

Performance is a work in progress, but already shows a 10% - 20%
increase for a wide range of matrix sizes.
2018-10-03 14:45:25 +00:00
Martin Kroeker 544b069e85
Merge pull request #1780 from martin-frbg/issue1774-2
Convert fldmia/fstmia instructions to UAL syntax for clang7
2018-09-29 09:27:47 +02:00
Martin Kroeker 9b2a7ad40d
Convert fldmia/fstmia instructions to UAL syntax for clang7
second part of fix for #1774, containing files missed in #1775
2018-09-28 23:05:15 +02:00
fengruilin 6fc85a6359 test_axpy work error on LOONGSON3A platform #1777 2018-09-26 15:14:04 +08:00
Martin Kroeker 7e5df34e6a
Convert fldmia/fstmia instructions to UAL syntax for clang7
fixes #1774
2018-09-25 09:41:58 +02:00
Andrew 1e531701b7 fix small typo 2018-09-09 16:52:25 +02:00
Martin Kroeker ba4f433321
Merge pull request #1749 from martin-frbg/issue1531
Fix ARMV8 cross-compilation for IOS
2018-09-07 11:02:01 +02:00
Martin Kroeker 1cb7b9015e
Conditional compilation of assembly files that IOS does not like 2018-09-04 11:06:51 +02:00
Martin Kroeker a4bd41e9f2
Fix paths to C kernels for nrm2 2018-09-04 10:51:19 +02:00
Martin Kroeker e11126b26a
Merge pull request #1745 from martin-frbg/issue1743
Set USE_TRMM for all ZARCH variants to fix TRMM faults with zarch-gen…
2018-08-29 07:43:58 +02:00
Martin Kroeker f3fd44a731
Set USE_TRMM for all ZARCH variants to fix TRMM faults with zarch-generic
fixes #1743
2018-08-28 21:34:07 +02:00
maamountki e6c0e39492
Optimize Zgemv 2018-08-13 12:23:40 +03:00
Martin Kroeker 375dff54fc
Merge pull request #1733 from fenrus75/dsymv
Add an AVX512 enabled DSYMV (L) function
2018-08-12 18:18:36 +02:00
Martin Kroeker a5f165275a
Merge pull request #1732 from fenrus75/dgemv
Add an AVX512 enabled DGEMV (n)  function
2018-08-12 18:17:42 +02:00
Martin Kroeker 8c13aa495a
Merge pull request #1730 from fenrus75/fix-sdot
Fix typo in sdot function
2018-08-12 18:17:01 +02:00
Arjan van de Ven 9bec34cb67 Add an AVX512 enabled DSYMV (L) function
written in C intrinsics for best readability.
(the same C code works for Haswell as well)

For logistical reasons the code falls back to the existing
haswell AVX2 implementation if the GCC or LLVM compiler is not new enough
2018-08-11 17:46:24 +00:00
Arjan van de Ven 87bebdbd8a Add an AVX512 enabled DGEMV (n) function
written in C intrinsics for best readability.
(the same C code works for Haswell as well)

For logistical reasons the code falls back to the existing
haswell AVX2 implementation if the GCC or LLVM compiler is not new enough
2018-08-11 17:38:12 +00:00
Arjan van de Ven 36add7570a Fix typo in sdot function
it looks like my previous pull request was short the final commit;
fix a typo in sdot
2018-08-11 17:16:45 +00:00
Arjan van de Ven cacacc8007 Add an AVX512 enabled DSCAL function
written in C intrinsics for best readability.
(the same C code works for Haswell as well)

For logistical reasons the code falls back to the existing
haswell AVX2 implementation if the GCC or LLVM compiler is not new enough
2018-08-11 17:14:57 +00:00
Martin Kroeker 1a00ef3d27
Merge pull request #1725 from fenrus75/axpy
Add a AVX512 enabled SAXPY/DAXPY functions
2018-08-11 11:01:20 +02:00
Arjan van de Ven 2e99873ff7 Add a AVX512 enabled SAXPY/DAXPY functions
written in C intrinsics for best readability.
(the same C code works for Haswell as well)

For logistical reasons the code falls back to the existing
haswell AVX2 implementation if the GCC or LLVM compiler is not new enough
2018-08-10 02:58:32 +00:00
Arjan van de Ven 00abaa865b Add an AVX512 enabled SDOT function
written in C intrinsics for best readability.
(the same C code works for Haswell as well)

For logistical reasons the code falls back to the existing
haswell AVX2 implementation if the GCC or LLVM compiler is not new enough
2018-08-10 02:33:43 +00:00
Arjan van de Ven 7932ff3ea9 Add an AVX512 enabled DDOT function
written in C intrinsics for best readability.
(the same C code works for Haswell as well)

For logistical reasons the code falls back to the existing
haswell AVX2 implementation if the GCC or LLVM compiler is not new enough
2018-08-09 03:55:52 +00:00
maamountki 23229011db
[ZARCH] Z14 support, BLAS 1/2 single precision implementations, Some missing double precision implementations, Gemv optimization 2018-08-06 18:20:40 +03:00
Martin Kroeker 4e103c822c
typo fix 2018-07-16 12:56:39 +02:00
Martin Kroeker d2142760e0
Fix precision problem in DSDOT 2018-07-15 17:11:40 +02:00
Martin Kroeker 2fbfc64da8
Use C kernels for default c/zAXPY, xROT, c/zSWAP 2018-07-15 17:09:55 +02:00
Martin Kroeker ba8388cee0
Merge pull request #1651 from martin-frbg/avx512-nodgemm
Disable the 16x2 DTRMM kernel on SkylakeX as well
2018-06-30 17:48:03 +02:00
Martin Kroeker 6e54b0a027
Disable the 16x2 DTRMM kernel on SkylakeX as well 2018-06-30 17:31:06 +02:00
Martin Kroeker 40c8cbc3bf
Merge pull request #1650 from martin-frbg/avx512-nodgemm
Disable the AVX512 DGEMM kernel for now
2018-06-30 13:05:46 +02:00
Martin Kroeker f0a8dc2eec
Disable the AVX512 DGEMM kernel for now
due to #1643
2018-06-30 11:34:48 +02:00
Martin Kroeker b83e4c60c7
Remove premature exit for INC_X or INC_Y zero 2018-06-26 20:46:42 +02:00
Martin Kroeker e344db269b
Remove premature exit for INC_X or INC_Y zero 2018-06-26 20:45:57 +02:00
Martin Kroeker 545b82efd3
Remove premature exit for INC_X or INC_Y zero 2018-06-26 20:45:00 +02:00
Martin Kroeker e322a951fe
Remove premature exit for INC_X or INC_Y zero 2018-06-26 20:44:13 +02:00
Martin Kroeker c628c6fa59
Merge pull request #1612 from oon3m0oo/cpus
Fixed a few more unnecessary calls to num_cpu_avail.
2018-06-14 16:51:31 +02:00
Martin Kroeker 6f71c0fce4
Return a somewhat sane default value for L2 cache size if cpuid retur… (#1611)
* Return a somewhat sane default value for L2 cache size if cpuid returned something unexpected

Fixes #1610, the KVM hypervisor on Google Chromebooks returning zero for CPUID  0x80000006, causing DYNAMIC_ARCH
builds of OpenBLAS to hang
2018-06-11 13:26:19 +02:00
Craig Donner c2545b0fd6 Fixed a few more unnecessary calls to num_cpu_avail.
I don't have as many benchmarks for these as for gemm, but it should still
make a difference for small matrices.
2018-06-11 10:17:16 +01:00
Arjan van de Ven 89372e0993 Use AVX512 also for DGEMM
this required switching to the generic gemm_beta code (which is faster anyway on SKX)
for both DGEMM and SGEMM

Performance for the not-retuned version is in the 30% range
2018-06-03 22:17:27 +00:00
Martin Kroeker 0023515733
Typo fix (misplaced parenthesis) 2018-06-03 13:22:59 +02:00
Arjan van de Ven 99c7bba8e4 Initial support for SkylakeX / AVX512
This patch adds the basic infrastructure for adding the SkylakeX (Intel Skylake server)
target. The SkylakeX target will use the AVX512 (AVX512VL level) instruction set,
which brings 2 basic things:
1) 512 bit wide SIMD (2x width of AVX2)
2) 32 SIMD registers (2x the number on AVX2)

This initial patch only contains a trivial transofrmation of the Haswell SGEMM kernel
to AVX512VL; more will follow later but this patch aims to get the infrastructure
in place for this "later".

Full performance tuning has not been done yet; with more registers and wider SIMD
it's in theory possible to retune the kernels but even without that there's an
interesting enough performance increase (30-40% range) with just this change.
2018-06-03 07:58:52 +00:00
Martin Kroeker 8562d5787a
Merge pull request #1583 from martin-frbg/issue1575
Handle INCX=0,INCY=0 case
2018-05-31 21:55:26 +02:00
Martin Kroeker 7df8c4f76f
typo fix 2018-05-31 17:23:08 +02:00
Martin Kroeker 2fc748bf72
Restore optimized swap kernel now that we have a proper fix 2018-05-31 13:41:12 +02:00
Martin Kroeker d1b7be14aa
Handle INCX=0,INCY=0 case
Fixes #1575 (sswap/dswap failing the swap utest on x86) as suggested by atsampson.
2018-05-31 12:52:04 +02:00
Martin Kroeker 961d25e9c7
Use the new zrot.c on POWER8 for crot as well
fixes #1571 (the old zrot.S assembly does not handle incx=0 correctly)
2018-05-23 22:54:39 +02:00
Martin Kroeker f5959f2543
Merge pull request #1567 from martin-frbg/mipstrmm
Revert " Switch mips32 target to USE_TRMM to fix complex TRMM"
2018-05-17 20:50:23 +02:00
Martin Kroeker 82012b960b
Revert " Switch mips32 target to USE_TRMM to fix complex TRMM"
... as it was just a silly workaround for the issue seen in #1563, caused by #1419
2018-05-17 20:30:03 +02:00
Martin Kroeker 8dd3515fa2
Merge pull request #1565 from martin-frbg/mipstypo
Remove extraneous brace from previous commit of mips dsdot fix
2018-05-17 20:22:58 +02:00
Martin Kroeker 95f7f0229c
Remove extraneous brace from previous commit 2018-05-17 18:43:59 +02:00
Martin Kroeker 5082fe4306
Merge pull request #1564 from martin-frbg/issue1563
Revert changes from PR#1419
2018-05-17 14:04:13 +02:00
Martin Kroeker 7a7619af6d
Revert changes from PR#1419
at least one of these changes apparently is an oversimplification, leading to TRMM breakage on some platforms as observed in #1563
2018-05-17 11:40:08 +02:00
Martin Kroeker 893b535540
Use correct data type for initializers of v2f64, v4f32
Fixes #1561
2018-05-15 14:42:12 +02:00
Martin Kroeker 018f2dad27
Switch mips32 target to USE_TRMM to fix complex TRMM 2018-05-02 20:25:32 +02:00
Martin Kroeker 9d5098dbc9
Add MIPS 1004K target (Mediatek MT7621 SOC) 2018-05-02 20:20:44 +02:00
Martin Kroeker 954f1832de
Merge pull request #1540 from martin-frbg/mips32-zasum
Fix typo in MIPS P5600 complex ASUM code selection
2018-04-25 23:23:00 +02:00
Martin Kroeker 941ad280a8
Fix typo in MIPS P5600 complex ASUM code selection 2018-04-25 22:50:10 +02:00
Martin Kroeker 1da365312a
Merge pull request #1538 from martin-frbg/arm7utest
Fix handling of zero INCX, INCY in ArmV7 AXPY and ROT
2018-04-25 08:38:58 +02:00
Martin Kroeker 2d0929fa7c
Move the test for zero incx,incy in ARMV7 ROT
to pass the related utest (see #1469)
2018-04-24 22:43:00 +02:00
Martin Kroeker 125343cc88
Drop test for zero incx,incy in armv7 AXPY
...to pass the related utest (see #1469)
2018-04-24 22:39:50 +02:00
Martin Kroeker 8a3b6fa108
Use generic zrot.c on ppc64/POWER6 to work around utest failure from … (#1535)
* Use generic C implementation of zrot on ppc64/POWER6 to work around utest failure from #1469
2018-04-23 19:05:49 +02:00
Martin Kroeker 9c5518319a
Revert "Fix 32bit HASWELL builds" 2018-04-22 20:20:04 +02:00
Jerry Zhao 0ee395db35 Fixed TRMM and SYMM for RISCV 2018-04-18 18:03:32 -07:00
Jerry Zhao c167a3d6f4 Added RISCV build 2018-04-16 14:08:31 -07:00
Martin Kroeker 2ca0faf495
Merge pull request #1515 from martin-frbg/mipsdot
Correct precision of mips dsdot
2018-04-11 08:21:25 +02:00
Martin Kroeker 0fe434598b
Fix precision of mips dsdot 2018-04-10 23:30:59 +02:00
Martin Kroeker c7b55b6082
Merge pull request #1499 from quickwritereader/develop
Implemented missing vsx simd  kernels for power8 blas1/2 double. z13 modifications
2018-03-27 21:43:23 +02:00
Martin Kroeker 840e01061f
Merge pull request #1491 from martin-frbg/ddot_mt
Add multithreading support for Haswell DDOT
2018-03-27 21:43:05 +02:00
QWR QWR 28ca97015d power8:Added initial zgemv_(t|n) ,i(d|z)amax,i(d|z)amin,dgemv_t(transposed),zrot
z13: improved zgemv_(t|n)_4,zscal,zaxpy
2018-03-27 14:54:41 +00:00
Martin Kroeker 6a6ffaff1e
Merge pull request #1494 from martin-frbg/x86_dsdot
Use generic/dot.c instead of the inferior arm/dot.c for x86 DSDOT
2018-03-17 15:26:47 +01:00
Martin Kroeker 28ac9ea5a6
Use generic/dot.c instead of the inferior arm/dot.c for x86 DSDOT
to resolve dsdot utest failure seen in #1492
2018-03-17 13:49:15 +01:00
Martin Kroeker a55694dd5b
Declare dot_compute static to avoid conflicts in multiarch builds 2018-03-16 22:23:36 +01:00
Martin Kroeker 85a41e9cdb
Add multithreading support for Haswell DDOT
copied from ashwinyes' implementation in dot_thunderx2t99.c
2018-03-16 16:58:47 +01:00
Martin Kroeker 81215711a2
Re-enable DAXPY microkernels for x86_64
as the inaccuracies seen in the original testcase for #1332 appear to be due to an artefact that amplifies the very small rounding differences between FMA and discrete multiply+add
2018-03-04 19:37:03 +01:00
Martin Kroeker 22167170b3
Merge pull request #1477 from quickwritereader/develop
Power8 blas3 copy-pack routines
2018-02-28 18:46:54 +01:00
Ashwin Sekhar T K fa9ca65c0e ARM64: Fix utest dsdot errors 2018-02-27 10:47:55 +00:00
Martin Kroeker 719b68f077
Merge pull request #1473 from martin-frbg/p2align
Replace .align with .p2aligns in dscal.c and the Nehalem microkernels as well
2018-02-27 08:28:20 +01:00
Martin Kroeker fe9f15f2d8
Merge pull request #1472 from martin-frbg/utest-fixes
Fix limited DSDOT precision on arm,aarch64 and zarch
2018-02-26 22:48:07 +01:00
Martin Kroeker 497f0c3d8a
Replace .align with .p2align in the Nehalem microkernels 2018-02-26 20:58:33 +01:00
Martin Kroeker ea37db828e
Convert .align to .p2align for OSX compatibility 2018-02-26 20:48:03 +01:00
Martin Kroeker 6e70287776
Use generic/dot.c for DSDOT on ARMV5 and above
The default arm/dot.c is less precise when used for DSDOT, as shown by utest
2018-02-25 19:57:23 +01:00
Martin Kroeker 58f236ad73
Use generic/dot.c for DSDOT on zarch 2018-02-25 19:52:14 +01:00
Martin Kroeker e207107150
Use generic/dot.c for DSDOT on z13
The implementation in arm/dot.c has lower precision, as shown by the utest for dsdot.
2018-02-25 19:51:25 +01:00
Martin Kroeker c9d408064a
Use dot.S also for DSDOT on CORTEXA57 2018-02-25 19:48:09 +01:00
Martin Kroeker 288d1a3f6e
Use dot.S also for DSDOT on ARMV8 2018-02-25 19:45:16 +01:00
Martin Kroeker 7c1925acec
Use .p2align instead of .align for compatibility on Sandybridge as well 2018-02-24 19:43:15 +01:00
Martin Kroeker 2359c7c1a9
Use .p2align instead of .align for portability
The OSX assembler apparently mishandles the argument to decimal .align, leading to a significant loss of performance 
as observed in #730, #901 and most recently #1470
2018-02-24 17:50:13 +01:00
Martin Kroeker e7366a4161
Restore the remaining utests (#1462)
* Restore the remaining utests

* Try fork test on Cygwin and Linux only, it hangs on at least ARMv8/Android as well

* Use generic sswap/dswap kernels for NEHALEM 32bit to fix fault found by the restored swap utest

* Disable zdotu test for MS cl to work around runtime error -1073741819 on AppVeyor for now
(probably coding error in the initialization of the complex numbers or wrong choice of zdotu API)
2018-02-20 10:07:17 +01:00
the mslm 2c0a008281 dgemm_ncopy_4_ save/restore 2018-02-18 01:30:17 +00:00
the mslm c5425daa6b power8 ?gemm_tcopy save/restore 2018-02-16 23:36:46 +00:00
Martin Kroeker b47e6822aa
Enable most assembly kernels in the generic ARMV8 target
ref #1439
2018-02-06 11:42:58 +01:00
Abdelrauf 60596a1abc
Merge branch 'develop' into develop 2018-01-31 16:17:04 -08:00
Abdelrauf afd514c25d small fix inside ifdef z13mvc . (z13mvc code is not used in production) 2018-01-31 18:30:59 -05:00
Martin Kroeker f45776ec1f
Merge pull request #1440 from quickwritereader/develop
small corrections
2018-01-31 23:48:47 +01:00
Martin Kroeker e388459a27
Merge pull request #1419 from brada4/develop
Initialize unitialized values for repeated calls
2018-01-31 23:48:34 +01:00
Abdelrauf f653e7a18d
small fix
small fix inside ifdef z13mvc . (z13mvc code is not used in production)
2018-01-31 07:49:38 -08:00
the mslm f946a89432 zscal (case: real alpha=0 ) mikrokernel shift&mem fix , da_i as input reg. small typo fixes 2018-01-26 19:25:27 -08:00
Martin Kroeker 485df77612
Make USE_TRMM depend on TARGET_CORE not TARGET
Fixes #1432 (and possibly other DTRMM-related failures on Haswell and related architectures when built with cmake)
2018-01-26 23:20:00 +01:00
Martin Kroeker e4c71a799a
Merge pull request #1426 from quickwritereader/develop
(Z13 ) Blas1 mikrokernels can be inlined by gcc. Refactoring,fixes,tunings
2018-01-20 17:34:54 +01:00
the mslm 2619ad7ea5 Blas1 mikrokernels can be inlined by gcc. Refactoring ( symbolic operan
names). Some fixes and tunings
2018-01-19 19:24:35 -08:00
Andrew e5cc3d72c0 core.IdenticalExpr clang501 checker 2018-01-19 23:17:43 +01:00
Andrew 4938faa822 core.IdenticalExpr clang501 checker 2018-01-19 23:15:58 +01:00
Andrew 9fa986337d add missing brackets to silence indentation warnings gcc721 2018-01-19 23:11:12 +01:00
Andrew 3eed97f6b9 Initialize values to silence cppcheck 2018-01-12 22:35:00 +01:00
Andrew 13e137fbc9 Initialize uninitialized variables (cppcheck) 2018-01-12 22:33:41 +01:00
Martin Kroeker 3d23f45107
Merge pull request #1415 from quickwritereader/develop
(Z systems Z13) small fixes, some (i(dz)amin,i(dz)amax,(dz)dot,(dz)asum) mikrokernels…
2018-01-11 08:35:02 +01:00
Abdelrauf 87669d1c0a small fixes, some (i(dz)amin,i(dz)amax,(dz)dot,(dz)asum) mikrokernels can be inlined 2018-01-10 20:36:53 -05:00
Martin Kroeker 42285d8e70
Merge pull request #1410 from brada4/develop
Address warnings #1357
2018-01-06 20:02:46 +01:00
Andrew d602b99386 LAPACK helpers in C that need care too 2018-01-02 14:38:50 +01:00
Andrew 4d0b005e5b Eliminate remaining unused results in kernels (clang5 analyzer) 2018-01-01 20:54:39 +01:00
Martin Kroeker b81656936f
Merge pull request #1409 from martin-frbg/issue1292-2
Tag %1 and %2 as both input and output operands
2017-12-31 20:18:48 +01:00
Martin Kroeker b973990df2
Tag %1 and %2 as both input and output operands
fix from #1292 extended to the other gemv microkernels
2017-12-31 18:03:36 +01:00
Martin Kroeker 1e31124eb0
Merge pull request #1406 from martin-frbg/issue1292
Tag %1 and %2 as both input and output
2017-12-30 14:52:03 +01:00
Martin Kroeker cc9500db41
Merge pull request #1403 from brada4/develop
Address few more warnings
2017-12-30 14:51:34 +01:00
Martin Kroeker 723f396a20
Tag %1 and %2 as both input and output
The inline assembly modifies its input operands, so mark them as output to avoid surprises with optimization. Fixes #1292
2017-12-29 23:56:41 +01:00
Andrew 03e5ff0687 initialize potentially unitialized variables (clang5) 2017-12-26 09:24:24 +01:00
Andrew 47deec2c1a fix couple of dead assignment warnings 2017-12-22 00:56:35 +01:00
Martin Kroeker 43c0622e7b
Retire Piledriver/Steamroller/Excavator daxpy microkernels as well
related to issue #1332
2017-12-13 18:40:39 +01:00
Martin Kroeker 0623636c98
Use Sandybridge daxpy kernel on Haswell and Zen for now
The testcase from #1332 exposes a problem in daxpy_microk_haswell-2.c that is not seen with
any of the other Intel x86_64 microkernels.
2017-12-10 19:24:31 +01:00
Andrew 281a2b952f warning cleanup (#1380)
* dead increments in driver/level2

* dead increments in kernel/generic

* part dead increments in kernel/x86_64
2017-12-05 19:54:10 +01:00
Martin Kroeker 8213385ab8
Work around compiler warnings for unused variables in the generic zgemm3m_Xcopy kernels 2017-12-02 22:51:58 +01:00
Martin Kroeker db00a51e6b
Merge pull request #1371 from martin-frbg/develop
Add trivially optimized DSDOT for POWER8
2017-11-29 19:55:21 +01:00
martin 7a4b3cfbf8 Add trivially optimized DSDOT for POWER8 2017-11-28 18:38:07 +01:00
Martin Kroeker 6c77b5f267
Merge pull request #1369 from martin-frbg/dsdot
Add optimized dsdot to all other x86_64 kernels that use sdot.c
2017-11-28 18:15:31 +01:00
Andrew 441a9c8385 more dead increments clang4 scan-build deadcode.deadstores 2017-11-26 17:24:08 +01:00
Andrew 1236dbe5a6 Eliminate 2-8 dead increments code 2017-11-26 13:26:11 +01:00
Martin Kroeker c92cd6d162
Add trivially optimized dsdot based on sdot 2017-11-24 20:05:27 +01:00
Martin Kroeker cae5d9a20b
Add trivially optimized dsdot based on sdot 2017-11-24 20:04:29 +01:00
Martin Kroeker 3d891c3106
Add trivially optimized dsdot based on sdot 2017-11-24 20:03:40 +01:00
Martin Kroeker 4fbdcfa823
Add trivially optimized dsdot based on sdot 2017-11-24 20:02:28 +01:00
Martin Kroeker 1bb6a96ebc
Add trivially optimized dsdot based on sdot 2017-11-24 20:01:42 +01:00
Martin Kroeker 6bd163f37a
Add trivially optimized dsdot based on sdot 2017-11-24 20:00:23 +01:00
Martin Kroeker f0333333d1
Add trivially optimized dsdot based on sdot 2017-11-24 19:59:28 +01:00
Andrew e89b979b2c fix spurious compiler warning fix (no code change) 2017-11-24 18:39:04 +01:00
Andrew 7e9b29b9b8 fix spurious compiler warning (no code change) 2017-11-24 18:36:37 +01:00
Martin Kroeker 6157d0902a
Merge pull request #1358 from martin-frbg/unused_vars
Clean up spurious unused variables in the kernels
2017-11-15 11:31:43 +01:00
Martin Kroeker 3fea849bbf
Remove unused variables from Haswell dtrmm and Bulldozer dtrsm 2017-11-14 23:35:10 +01:00
Martin Kroeker 8f177621bc
Remove unused variables at0...at3 from ?symv_U 2017-11-14 23:32:25 +01:00
Martin Kroeker 5f402b7759
Remove unused (loop?) variable j from the gemv_n_4 implementations 2017-11-14 23:29:42 +01:00
Martin Kroeker 65bf0a343c
Remove unused variable btpr 2017-11-14 23:25:50 +01:00
Martin Kroeker acf3d34bc5
Silence an unused variable warning with a cast
l2 cache size is not universally needed to assign default unrolling limits, but neither putting its declaration inside an ifdef nor cloning it into all ifdef sections that need it really makes sense here.
2017-11-14 23:23:44 +01:00
Martin Kroeker ab87ee6b48 Merge pull request #1329 from martin-frbg/dsdot
(Trivial) optimized dsdot implementation for HASWELL
2017-10-25 19:13:38 +02:00
Martin Kroeker a07807caac Eliminate loop code when called as/from dsdot 2017-10-25 16:45:41 +02:00
Ashwin Sekhar T K a0128aa489 ARM64: Convert all labels to local labels
While debugging/profiling applications using perf or other tools, the
kernels appear scattered in the profile reports. This is because the labels
within the kernels are not local and each label is shown as a separate
function.

To avoid this, all the labels within the kernels are changed to local
labels.
2017-10-24 11:40:05 +00:00
Martin Kroeker 0e2cf102e1 Fix 32bit HASWELL 2017-10-24 10:07:44 +02:00
Martin Kroeker 5e3e91d0fc Split the microkernel workload into chunks of 32 floats for dsdot mode to limit loss of precision 2017-10-22 18:18:51 +02:00
Martin Kroeker 28c3fa8950 Add dsdot 2017-10-16 23:29:03 +02:00
Martin Kroeker 8ac87c1cb6 Implement DSDOT with unchanged sdot microkernels 2017-10-16 23:27:51 +02:00
Martin Kroeker c7a8512d12 Cmake fixes for DYNAMIC_ARCH builds and whitespace in path names (#1323)
* prebuild.cmake: Put quotes around path names that may contain whitespace
(Copied from alexkaratakis' PR #1295)
* kernel/CMakeLists.txt: Fix common_lapack header inclusion and DYNAMIC_ARCH generation of ?neg_tcopy and ?laswp_ncopy files
* lapack/CMakeLists.txt: Use correct template for ?laswp_(plus,minus) functions
2017-10-09 23:34:18 +02:00
Martin Kroeker 97ecd4996a Merge pull request #1319 from martin-frbg/issue601
Fix out-of-bounds memory accesses exposed by xccblat3 testcase
2017-10-08 23:31:06 +02:00
Martin Kroeker 1eb43cccad Merge pull request #1317 from martin-frbg/power8-asm
Save and restore VSX registers
2017-10-08 23:30:46 +02:00
Martin Kroeker 9d92f526dd Comment out a code block that performs out-of-bounds memory accesses
...and does not appear to be needed even when it stays within the bounds of the array
2017-10-06 23:51:32 +02:00
Martin Kroeker 514d237257 Merge pull request #1279 from xsacha/develop
CMake improvements
2017-10-06 21:13:45 +02:00
Martin Kroeker f96afd94b0 Fix out-of-bounds accesses where the data should be zero anyway 2017-10-01 01:06:39 +02:00
Martin Kroeker 9c017a2218 Save and restore VSX registers 2017-09-28 12:17:09 +02:00
Shivraj Patil e3d844b062 Added mips I6500 core
Signed-off-by: Shivraj Patil <shivraj.patil@imgtec.com>
2017-09-22 11:57:43 +05:30
Martin Kroeker 46c9357c72 Merge pull request #1288 from quickwritereader/develop
Optimized standard Blas Level-1,2 (excluding nrm2 functions) for z13 (double precision). Issue 884
2017-09-09 23:47:17 +02:00
Abdurrauf 1cfdb2295d Optimized standard Blas Level-1,2 (excluding nrm2 functions) for z13 (double precision) 2017-09-06 16:41:08 +04:00
Sacha Refshauge 47ebce4d1a Clean up, fix old typos. Simplify arch usages. Move system arch check to earlier position. 2017-08-21 00:37:29 +10:00
Sacha Refshauge 69b560751c Improvements to previous commit (cross-compile).
Fix typos and bad if statements discovered in 0.2.20.
2017-08-20 22:50:31 +10:00
Sacha Refshauge 11911fd941 Add kernel/Makefile.LA to CMake 2017-08-20 00:59:14 +10:00
Isuru Fernando d3b677fe87 Add commonobjs 2017-08-07 23:12:40 +05:30
Isuru Fernando 505b218829 Merge remote-tracking branch 'upstream/develop' into dyn 2017-08-06 19:07:00 +05:30
Isuru Fernando d9346930dd Merge remote-tracking branch 'upstream/develop' into develop 2017-08-04 07:57:55 +05:30
Ashwin Sekhar T K 4899d67f7d THUDNERX2T99: Fix clang compilation 2017-08-02 11:28:45 -07:00
Isuru Fernando 1d1854032b Add missing EXCAVATOR 2017-08-02 19:03:04 +05:30
Isuru Fernando 2c51a990ac Fix extra whitespaces. CMake parser macro fails with it
TODO: Fix the parser macro to strip trailing whitespaces
2017-08-02 18:26:57 +05:30
Isuru Fernando 7892434572 Add hemm3m and symm3m objects 2017-08-02 18:24:54 +05:30
Isuru Fernando d798487213 Fixes for dynamic_arch. almost there 2017-08-02 17:25:49 +05:30
Isuru Fernando 251715d9ef configure kernel_core.h 2017-08-01 23:23:55 +05:30
Isuru Fernando 50deeb49b7 configure setparam 2017-08-01 22:32:47 +05:30
Isuru Fernando 4260215adf Support DYNAMIC_ARCH with cmake 2017-08-01 22:25:52 +05:30
Isuru Fernando d245caa49a Support out-of-source build 2017-08-01 15:16:14 +05:30
Isuru Fernando ca17b4b75c Fix complex support for MSVC headers 2017-07-28 11:50:29 +05:30
Isuru Fernando dc24914415 check compiler is msvc instead of msvc 2017-07-28 11:49:39 +05:30
Zhang Xianyi d5ef0dee9a Merge pull request #1226 from ashwinyes/develop_arm_clang_ual_fix
arm: Fix clang compilation for ARMv7
2017-07-10 20:04:42 +08:00
Zhang Xianyi 4239dd65ce Merge branch 'develop' into develop_arm_softfp 2017-07-10 20:02:36 +08:00
Ashwin Sekhar T K f02d535fde arm: Fix clang compilation for ARMv7
clang is not recognizing some pre-UAL VFP mnemonics like fnmacs, fnmacd,
fnmuls and fnmuld. Replaced them with equivalent UAL mnemonics which are
vmls.f32, vmls.f64, vnmul.f32 and vnmul.f64 respectively.
2017-07-07 12:35:58 +05:30
Zhang Xianyi a6515bb858 Merge pull request #1218 from m-brow/power9
Optimise loads on Power9 LE
2017-07-03 13:48:29 +08:00
Ashwin Sekhar T K 37efb5bc1d arm: Remove unnecessary files/code
Since softfp code has been added to all required vfp kernels,
the code for auto detection of abi is no longer required.

The option to force softfp ABI on make command line by giving
ARM_SOFTFP_ABI=1 is retained. But there is no need to give this option
anymore.

Also the newly added C versions of 4x4/4x2 gemm/trmm kernels are removed.
These are longer required. Moreover these kernels has bugs.
2017-07-02 03:06:36 +05:30
Ashwin Sekhar T K 97d671eb61 arm: add softfp support in zgemm/ztrmm vfp kernels 2017-07-02 02:54:32 +05:30
Ashwin Sekhar T K 305cd2e8b4 arm: add softfp support in cgemm/ctrmm vfp kernels 2017-07-02 02:42:32 +05:30
Ashwin Sekhar T K 09bc6ebe5b arm: add softfp support in dgemm/dtrmm vfp kernels 2017-07-02 02:24:38 +05:30
Ashwin Sekhar T K 872a11a2bf arm: add softfp support in sgemm/strmm vfp kernels 2017-07-02 02:23:48 +05:30
Ashwin Sekhar T K eda9e8632a generic: Bug fixes in generic 4x2 and 4x4 gemm kernels 2017-07-02 02:00:48 +05:30
Ashwin Sekhar T K 8f83d3f961 arm: add softfp support in vfp gemv kernels 2017-07-02 01:03:31 +05:30
Ashwin Sekhar T K 83bd547517 arm: add softfp support in kernel/arm/swap_vfp.S 2017-07-01 20:37:40 +05:30
Ashwin Sekhar T K e25f4c01d6 arm: add softfp support in kernel/arm/nrm2_vfp*.S 2017-07-01 19:57:28 +05:30
Ashwin Sekhar T K 54915ce343 arm: add softfp support in kernel/arm/*dot_vfp.S 2017-06-30 23:46:02 +05:30
Ashwin Sekhar T K 0150fabdb6 arm: add softfp support in kernel/arm/rot_vfp.S 2017-06-30 21:52:32 +05:30
Ashwin Sekhar T K 4f0773f07d arm: add softfp support in kernel/arm/axpy_vfp.S 2017-06-30 20:25:59 +05:30
Ashwin Sekhar T K aa5edebc80 arm: add softfp support in kernel/arm/asum_vfp.S 2017-06-30 18:21:05 +05:30
Ashwin Sekhar T K 89924b3d5b arm: Use assembly implementations based on the ARM abi
In case of softfp abi, assembly implementations of only those APIs are
used which doesnt have a floating point argument or return value.

In case of hard abi, all assembly implementations are used.
2017-06-30 18:21:05 +05:30
Ashwin Sekhar T K da7f0ff425 generic: add some generic gemm and trmm kernels
Added generic 4x4 and 4x2 gemm kernels
Added generic 4x2 trmm kernel
2017-06-30 18:21:05 +05:30
Zhang Xianyi 482015f8d6 Merge branch 'arm_soft_fp_abi' into develop 2017-06-23 11:35:25 +08:00
Matt Brown bd831a03a8 Optimise sscal for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 17:02:46 +10:00
Matt Brown edc97918f8 Optimise srot for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 17:02:35 +10:00
Matt Brown e0034de22d Optimise sdot for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 17:02:19 +10:00
Matt Brown 32c7fe6bff Optimise sasum for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 17:02:10 +10:00
Matt Brown 19bdf9d52b Optimise casum for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 17:00:07 +10:00
Matt Brown 4f09030fdc Optimise cswap for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 16:59:53 +10:00
Matt Brown 6f4eca5ea4 Optimise sswap for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 16:59:13 +10:00
Matt Brown be55f96cbd Optimise scopy for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 16:59:13 +10:00
Matt Brown 96dd0ef4f7 Optimise ccopy for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 16:58:59 +10:00
Alan Modra dc40bc7368 Power8 inline assembly tweaks
Further fixes on top of 9e2f316ed.  Writing some doco for gcc on
inline assembly woke me up to some more errors.

- dgemv_kernel_4x4 asm did not mention *ap as a memory input, and
  *y is both read and write.
- sasum_kernel_32 and casum_kernel_16 did not use %x for a vsx insn
  operand, a problem if the "=f" sum output was ever allocated a vsx
  reg in the altivec set.  This might be possible with inlining and
  future gcc optimisation.
2017-04-04 23:13:54 +09:30
Zhang Xianyi b5c96fcfcd Support ARM SOFTFP ABI for saxpy, sdot, snrm2, sscal, sgemv, sger. 2017-03-20 17:39:25 +08:00
Denis Steckelmacher c9ff735da6 Add ZEN support (tested for auto-detected static backend) 2017-03-19 15:32:50 +01:00
Martin Kroeker cd135e2b59 Merge pull request #1130 from quickwritereader/develop
Blas 3 for single precision
2017-03-15 10:00:52 +01:00
Martin Kroeker a6efabf155 Replace gnu _real_ , _imag_ extensions in initializers 2017-03-13 00:38:37 +01:00
Abdurrauf 08786c4b95 strmm and ctrmm 2017-03-13 01:23:16 +04:00
Zhang Xianyi 90e02ccf68 Support ARM softfp ABI for sgemm on ARMV7.
make ARM_SOFTFP_ABI=1
2017-03-06 22:16:13 +08:00
Abdurrauf 82e80fa82b initial strmm(sgemm). not tuned yet 2017-03-06 04:27:40 +04:00
Martin Kroeker d1fe040d9b Merge pull request #1110 from quickwritereader/develop
Conventional usage of the register save area.
2017-03-01 23:08:07 +01:00
Abdurrauf 411982715c conventional usage of the register save area 2017-03-01 20:39:39 +04:00
Abdurrauf e831d6924e changed to conventional register save area 2017-03-01 03:13:21 +04:00
Martin Kroeker ffc1d6c468 Merge pull request #1108 from ashwinyes/develop_20170203_thunderx2t99
Optimized Implementations for ThunderX2T99
2017-02-28 16:02:19 +01:00
Ashwin Sekhar T K 67473d09dd THUNDERX2T99: Bug Fixes in D/Z NRM2 and ZGEMM 2017-02-28 01:11:38 -08:00
Ashwin Sekhar T K 19ba133383 THUNDERX2T99: Add Optimized ZGEMM Implementation 2017-02-28 05:31:41 +00:00
Abdurrauf 0d96b0e2a7 Merge branch 'z13' into develop 2017-02-26 06:17:33 +04:00
Abdurrauf 848cb27b1e ztrmm kernel. 2017-02-26 06:14:12 +04:00
Martin Kroeker dc34a0da96 Merge pull request #915 from mdong/small_fix_for_icc
remove input from clobbered list
2017-02-23 20:00:22 +01:00
Ashwin Sekhar T K a3935f0dfb THUNDERX2T99: Add Optimized D/Z NRM2 Implementation 2017-02-23 10:02:15 -08:00
Ashwin Sekhar T K 738628e9a8 ARM64: Remove unused code 2017-02-21 21:42:32 -08:00
Ashwin Sekhar T K ab3ffab96a THUNDERX2T99: Add Optimized C/Z DOT Implementation 2017-02-21 03:40:59 -08:00
Ashwin Sekhar T K f036be9ce2 THUNDERX2T99: Add Optimized SDOT Implementation 2017-02-21 03:24:32 -08:00
Ashwin Sekhar T K faba876fda THUNDERX2T99: Bug fix in C/Z IAMAX 2017-02-19 23:11:50 -08:00
Ashwin Sekhar T K 172a62d73e THUNDERX2T99: Add Optimized C/Z IAMAX Implementation 2017-02-17 03:06:32 -08:00
Ashwin Sekhar T K 228c75a69c THUNDERX2T99: Add parallel SCNRM2 Implementation 2017-02-14 04:10:06 -08:00
Martin Kroeker 9e2f316ede Power8 inline assembly fixes
Quoting patch author amodra from #1078
Lots of issues here.
- The vsx regs weren't listed as clobbered.
- Poor choice of vsx regs, which along with the lack of clobbers led to
  trashing v0..v21 and fr14..fr23.  Ideally you'd let gcc choose all
  temp vsx regs, but asms currently have a limit of 30 i/o parms.
- Other regs were clobbered unnecessarily, seemingly in an attempt to
  clobber inputs, with gcc-7 complaining about the clobber of r2.
  (Changed inputs should be also listed as outputs or as an i/o.)
- "r" constraint used instead of "b" for gprs used in insns where the
  r0 encoding means zero rather than r0.
- There were unused asm inputs too.
- All memory was clobbered rather than hooking up memory outputs with
  proper memory constraints, and that and the lack of proper memory
  input constraints meant the asms needed to be volatile and their
  containing function noinline.
- Some parameters were being passed unnecessarily via memory.
- When a copy of a
2017-02-13 23:38:50 +01:00
Ashwin Sekhar T K 8e89668f62 THUNDERX2T99: Fix bug in SNRM2 2017-02-07 02:14:33 -08:00
Ashwin Sekhar T K f63deae9de THUNDERX2T99: Add Optimized S/D IAMAX Implementation 2017-02-07 01:35:55 -08:00
Martin Kroeker 60eea75409 Merge pull request #1076 from ashwinyes/develop_20170130_thunderx2t99
More optimized implementations for ThunderX2T99
2017-02-04 17:25:43 +01:00
Ashwin Sekhar T K 071a830e8b THUNDERX2T99: Add optimized S/D/C/Z SWAP Implementations 2017-02-03 03:55:06 -08:00
Ashwin Sekhar T K d09f88192c THUNDERX2T99: Add optimized S/D/C/Z COPY Implementations 2017-02-02 15:26:38 +05:30
Ashwin Sekhar T K e58233460a THUDNERX2T99: Add optimized D/C/Z ASUM Implementations 2017-02-02 15:26:22 +05:30
Ashwin Sekhar T K 99bd2892bf THUNDERX2T99: Add optimized CASUM Implementation 2017-01-30 17:44:32 +05:30
Ashwin Sekhar T K ff6f572f2e THUNDERX2T99: Rename labels in for DDOT and SNRM2 2017-01-30 17:44:32 +05:30
Ashwin Sekhar T K e0dc5f58c5 THUNDERX2T99: Remove Duplicate Code 2017-01-30 17:44:32 +05:30
Ashwin Sekhar T K 2757b49767 THUNDERX2T99: Add Optimized CGEMM Implementation 2017-01-30 17:44:26 +05:30
Zhang Xianyi ff41e13385 Merge pull request #1074 from ashwinyes/develop_20170116_thunderx2t99_sgemm
Add more THUNDERX2T99 Optimized APIs
2017-01-25 22:17:05 +08:00
Ashwin Sekhar T K 907e286eb6 THUNDERX2T99: Add threaded SNRM2 Implementation 2017-01-24 21:39:29 +05:30
Ashwin Sekhar T K cde3aee08b ARM64: Rename kernel files to have consistent naming 2017-01-24 14:53:34 +05:30
Ashwin Sekhar T K ee6ea7e988 THUNDERX2T99: Add Optimized CNRM2 Implementation 2017-01-24 10:23:32 +05:30
Ashwin Sekhar T K ca0b36b012 THUNDERX2T99: Add Optimized SNRM2 Implementation 2017-01-24 10:23:21 +05:30
Ashwin Sekhar T K d0a79ca6e0 THUNDERX2T99: Add threaded DDOT Implementation 2017-01-19 11:11:42 +05:30
Ashwin Sekhar T K 0c07003ccf THUNDERX2T99: Add Optimized DDOT Implementation 2017-01-19 11:11:07 +05:30
Ashwin Sekhar T K f33fcedb30 THUNDERX2T99: Improve SGEMM 2017-01-19 11:11:07 +05:30
Ashwin Sekhar T K 0f1d6e8b39 THUNDERX2T99: Improve DGEMM 2017-01-19 11:11:07 +05:30
Ashwin Sekhar T K 981064acc6 THUNDERX2T99: Add Optimized DAXPY Implementation 2017-01-19 11:10:57 +05:30
Shivraj Patil a4d97d980f Added rot functions.
Signed-off-by: Shivraj Patil <shivraj.patil@imgtec.com>
2017-01-17 12:15:07 +05:30
Ashwin Sekhar T K f279ff4789 THUNDERX2T99: Add Optimized SGEMM Implementation 2017-01-16 21:44:33 +05:30
Ashwin Sekhar T K 759f37feba ARM64: Let target VULCAN inherit THUNDERX2T99 properties 2017-01-16 21:44:19 +05:30
Zhang Xianyi 0863a0d4b4 Merge pull request #1061 from ashwinyes/develop_aarch64_vulcan_thunderx_patch
Add new targets for ARM64
2017-01-16 13:20:10 +08:00
Werner Saar 28e2fab33e prepared kernel/setparam-ref.c for UNROLL values, that are not a power of two 2017-01-11 11:56:50 +01:00
Ashwin Sekhar T K 4b55fae337 ARM64: Add Cavium THUNDERX2T99 Target 2017-01-11 11:18:40 +05:30
Andrew Pinski 95649dee28 THUNDERX: Add optimized version of daxpy
This is better for single core but does not change anything for multiple cores
2017-01-11 11:18:36 +05:30
Andrew Pinski 8fdb0655e9 THUNDERX: Add an optimized version of ddot 2017-01-10 15:01:37 +05:30
Andrew Pinski fb200c7245 ARM64: Add Cavium THUNDERX Target 2017-01-10 15:01:37 +05:30
Ashwin Sekhar T K 0b8e876d89 VULCAN: Add optimized DGEMM implementation 2017-01-10 15:01:37 +05:30
Ashwin Sekhar T K 4713e7c47f ARM64: Add the VULCAN Target 2017-01-10 15:01:17 +05:30
Ashwin Sekhar T K 6085386b10 CORTEXA57: Add assembly kernels for copy routines 2017-01-10 15:01:05 +05:30
kaustubh 1480f3df71 Add msa optimization for AXPY, COPY, SCALE, SWAP
Signed-off-by: kaustubh <kaustubh.raste@imgtec.com>
2017-01-09 18:27:23 +05:30
kaustubh 88afb3bc94 Add msa optimization for AXPY, COPY, SCALE, SWAP
Signed-off-by: kaustubh <kaustubh.raste@imgtec.com>
2017-01-09 18:22:09 +05:30
Zhang Xianyi b678471d65 Merge branch 'z13' into develop
Conflicts:
	CONTRIBUTORS.md
2017-01-09 05:52:42 -05:00
Zhang Xianyi 864e202afd Add USE_TRMM=1 for IBM z13 in kernel/Makefile.L3 2017-01-09 05:48:09 -05:00
Abdurrauf 6418667818 dtrmm and dgemm for z13 2017-01-04 19:32:33 +04:00
Shivraj Patil a9bf8a781a Added prefetch to CGEMV and ZGEMV.
Signed-off-by: Shivraj Patil <shivraj.patil@imgtec.com>
2016-12-27 11:33:51 +05:30
kaustubh 5f93aa5f87 Updated data prefetch in TRSM, ASUM, DOT functions
Signed-off-by: kaustubh <kaustubh.raste@imgtec.com>
2016-12-14 14:05:11 +05:30
kaustubh 9db451acd0 Updated data prefetch in TRSM, ASUM, DOT functions
Signed-off-by: kaustubh <kaustubh.raste@imgtec.com>
2016-12-13 14:02:14 +05:30
kaustubh 3eaff85191 Updated data prefetch in TRSM, ASUM, DOT functions
Signed-off-by: kaustubh <kaustubh.raste@imgtec.com>
2016-12-13 11:41:17 +05:30
kaustubh 00abce3b93 Add data prefetch in DOT and ASUM functions
Signed-off-by: kaustubh <kaustubh.raste@imgtec.com>
2016-11-22 11:21:03 +05:30
Andrew becf8bc7a0 remove dead code 2016-10-31 12:46:56 +01:00
kaustubh f3419e634c SGEMM, DGEMM, CGEMM, ZGEMM functions data prefetch
Signed-off-by: kaustubh <kaustubh.raste@imgtec.com>
2016-10-17 18:29:38 +05:30
Zhang Xianyi 7472c79ea6 Merge pull request #984 from ksraste/develop
STRSM, DTRSM functions data prefetch
2016-10-17 11:33:16 +08:00
kaustubh 90e2321ac3 STRSM, DTRSM functions data prefetch
Signed-off-by: kaustubh <kaustubh.raste@imgtec.com>
2016-10-14 16:41:28 +05:30
Martin Kroeker 4998e19869 Change file comments to work around clang 3.9 assembler bug 2016-10-13 16:51:08 +02:00
Martin Kroeker 91610f3835 Update zdot_msa.c 2016-10-05 18:59:09 +02:00
Martin Kroeker 6e22ecf102 Update zdot.c 2016-10-05 18:58:03 +02:00
Martin Kroeker 6221d6df5f Update zdot.c 2016-10-05 18:57:14 +02:00
Martin Kroeker 16446d1d23 Remove explicit include of complex.h 2016-09-29 23:45:56 +02:00
Martin Kroeker a6e9e0b94b Remove explicit include of complex.h 2016-09-29 23:43:28 +02:00
Martin Kroeker 3178e4fea0 Remove explicit include of complex.h 2016-09-29 23:41:43 +02:00
Martin Kroeker 95c245ddb0 Remove explicit include of complex.h 2016-09-29 23:40:36 +02:00
Martin Kroeker 4b1b27347f Remove explicit include of complex.h 2016-09-29 23:39:35 +02:00
Shivraj Patil 54747fe24a DGEMM function split and data prefech
Signed-off-by: Shivraj Patil <shivraj.patil@imgtec.com>
2016-09-22 17:25:46 +05:30
Zhang Xianyi 515bc56ea9 Refs #946. Use nrm2 reference implementation for Power8. 2016-08-18 18:59:43 -07:00
Zhang Xianyi ae70b916f4 Refs #929. Deal with zero and NaNs for scale. 2016-08-18 10:24:42 -07:00
Shivraj Patil 9687437928 MIPS n32 ABI and build time mips simd support check
Signed-off-by: Shivraj Patil <shivraj.patil@imgtec.com>
2016-08-10 17:44:22 +05:30
Shivraj Patil d1c6469283 MIPS n32 ABI support, MSA support detection and rename ARCH, ARCHFLAGS
Signed-off-by: Shivraj Patil <shivraj.patil@imgtec.com>
2016-08-08 11:58:01 +05:30
Ashwin Sekhar T K c54a29bb48 Cortex A57: Improvements to DGEMM 8x4 kernel 2016-07-26 10:58:21 +05:30
Shivraj Patil beb1d076a4 Added MSA optimization for GEMV_N, GEMV_T, ASUM, DOT functions
Signed-off-by: Shivraj Patil <shivraj.patil@imgtec.com>
2016-07-15 18:38:25 +05:30
Zhang Xianyi 8a592ee386 Merge pull request #924 from ashwinyes/develop_aarch64_improvements_20160714
Improvements to Aarch64 kernels
2016-07-14 15:47:55 -04:00
Ashwin Sekhar T K 0a5ff9f9f9 Improvements to TRMM and GEMM kernels 2016-07-14 13:56:04 +05:30
Ashwin Sekhar T K 8a40f1355e Improvements to GEMV kernels 2016-07-14 13:50:38 +05:30
Ashwin Sekhar T K 78782485b6 Improvements to COPY and IAMAX kernels 2016-07-14 13:49:34 +05:30
Shivraj Patil 57df7956ee Added CGEMM, ZGEMM, STRMM, DTRMM, CTRMM, ZTRMM. Updated macros in SGEMM, DGEMM, STRMM.
Signed-off-by: Shivraj Patil <shivraj.patil@imgtec.com>
2016-06-28 17:51:10 +05:30
Zhang Xianyi 4a30a2584a Merge pull request #897 from ksraste/develop
STRSM optimized for MSA
2016-06-27 10:04:18 -04:00
mdong 098d8ec5d6 remove input from clobbered list 2016-06-24 16:37:58 -04:00
Werner Saar f04af36ad0 Merge pull request #898 from wernsaar/develop
added experimental support for optimized lapack fortran functions
2016-05-31 14:13:52 +02:00
Kaustubh Raste 011431b9d7 STRSM optimized for MSA
Signed-off-by: Kaustubh Raste <kaustubh.raste@imgtec.com>
2016-05-31 10:17:23 +05:30
Kaustubh Raste c8a7860eb3 STRSM optimized
Signed-off-by: Kaustubh Raste <kaustubh.raste@imgtec.com>
2016-05-30 21:17:00 +05:30
Zhang Xianyi 2daad2bcb5 Merge pull request #893 from biddisco/develop
Replace CMAKE_SOURCE_DIR/CMAKE_BINARY_DIR with PROJECT_SOURCE_DIR/PRO…
2016-05-30 14:52:58 +08:00
John Biddiscombe 053044ae4d Replace CMAKE_SOURCE_DIR/CMAKE_BINARY_DIR with PROJECT_SOURCE_DIR/PROJECT_BINARY_DIR
If OpenBLAS is built using add_subdirectory(OpenBlas) as part of another project
then the paths set by CMAKE_XXX_DIR are relative to the parent project
and not the OpenBLAS project.
2016-05-25 09:13:28 +02:00
Aleksey Kuleshov fca66262c4 mips64/axpy: fix error when INCY == 0 2016-05-23 13:30:27 +03:00
Werner Saar 412bcd187a optimized dtrsm_logic_LT_16x4_power8.S and dtrsm_macros_LT_16x4_power8.S 2016-05-23 11:20:41 +02:00
Werner Saar bd06b246cc Merge pull request #890 from wernsaar/develop
optimized dtrsm_kernel_LT for POWER8
2016-05-22 16:01:35 +02:00
Werner Saar 8b140220c8 optimized dtrsm_kernel_LT for POWER8 2016-05-22 15:20:04 +02:00
Werner Saar 8fb5a1aaff added optimized dtrsm_LT kernel for POWER8 2016-05-22 13:09:05 +02:00
Kaustubh Raste ad9f317870 STRSM optimization for MIPS P5600 and I6400 using MSA
Signed-off-by: Kaustubh Raste <kaustubh.raste@imgtec.com>
2016-05-20 10:59:03 +05:30
Shivraj Patil c4ba40e308 SGEMM optimization for MIPS P5600 and I6400 using MSA. Unrolled k loop in DGEMM kernel function
Signed-off-by: Shivraj Patil <shivraj.patil@imgtec.com>
2016-05-19 11:04:42 +05:30
Zhang Xianyi 7a19065369 Merge pull request #878 from ksraste/develop
DTRSM bug fix for MIPS P5600 and I6400
2016-05-19 11:16:43 +08:00
Werner Saar 6a2bde7a2d optimized dgemm and dgetrf for POWER8 2016-05-17 14:45:27 +02:00
Kaustubh Raste d7cbc7ac13 DTRSM bug fix for MIPS P5600 and I6400
Signed-off-by: Kaustubh Raste <kaustubh.raste@imgtec.com>
2016-05-17 15:48:02 +05:30
Werner Saar 88011f625d Merge pull request #876 from wernsaar/develop
optimized dgemm on power8 for 20 threads
2016-05-16 14:52:40 +02:00
Werner Saar 8310d4d3f7 optimized dgemm for 20 threads 2016-05-16 14:14:25 +02:00
Kaustubh Raste edb5980c13 DTRSM optimization for MIPS P5600 and I6400 using MSA
Signed-off-by: Kaustubh Raste <kaustubh.raste@imgtec.com>
2016-05-09 15:15:26 +05:30
Shivraj Patil 085cf236c2 conflict resolved by syncing with 'xianyi:develop'
Signed-off-by: Shivraj Patil <shivraj.patil@imgtec.com>
2016-05-04 11:07:14 +05:30
Shivraj Patil b7b3d8ec8e DGEMM optimization for MIPS P5600 and I6400 using MSA
Signed-off-by: Shivraj Patil <shivraj.patil@imgtec.com>
2016-05-03 14:42:26 +05:30
Zhang Xianyi cd7af5260a Merge pull request #847 from sva-img/develop
MIPS P5600(32 bit) and I6400(64 bit) cores support added.
2016-04-29 11:44:36 -04:00
Werner Saar 56948dbf0f optimized dgemm for POWER8 2016-04-29 12:52:47 +02:00
Werner Saar 0d0c6f7d7d optimized dgemm for POWER8 2016-04-27 14:01:08 +02:00
Werner Saar 298b13bba4 updated some kernel files for EXCAVATOR 2016-04-25 10:36:23 +02:00
Werner Saar 78b05f6476 bugfix for EXCAVATOR and DYNAMIC_ARCH 2016-04-25 10:13:30 +02:00
Werner Saar a3da10662f added sgemm_tcopy_8_power8.S 2016-04-23 10:04:41 +02:00
Werner Saar d46f07bb4e added cgemm_tcopy_8_power8.S 2016-04-23 07:37:18 +02:00
Werner Saar 879a51165f Optimized zgemm and tested zgemm again 2016-04-22 13:07:12 +02:00
Shivraj Patil 2c3dfe2bf3 MIPS P5600(32 bit) and I6400(64 bit) cores support added.
Seperated mips and mips64 files.
Configurations support for mips 32 bit.

Signed-off-by: Shivraj Patil <shivraj.patil@imgtec.com>
2016-04-22 14:03:18 +05:30
Werner Saar 9276c9012f Optimized sgemm and dgemm and tested again. 2016-04-21 11:37:57 +02:00
wernsaar 6fbca2a4a1 Merge pull request #845 from wernsaar/develop
optimized sgemm for power8
2016-04-20 13:44:22 +02:00
Werner Saar 0001260f4b optimized sgemm 2016-04-20 13:06:38 +02:00
Werner Saar 3c6294ca3d added optimized sgemm_tcopy for power8 2016-04-19 16:08:54 +02:00
Zhang Xianyi dd43661cfd Init IBM z system (s390x) porting. 2016-04-15 18:02:24 -04:00
Zhang Xianyi f24d5307cf Refs #834. Fix zgemv config bug on Steamroller. 2016-04-12 22:26:11 +08:00
Werner Saar 8037d78eed bugfix for arm scal.c and zscal.c 2016-04-11 11:21:36 +02:00
wernsaar 0a4276bc2f Merge pull request #837 from wernsaar/develop
updated zgemm- and ztrmm-kernel for POWER8
2016-04-08 11:13:27 +02:00
Werner Saar e173c51c04 updated zgemm- and ztrmm-kernel for POWER8 2016-04-08 09:05:37 +02:00
Werner Saar 9c42f0374a Updated cgemm- and sgemm-kernel for POWER8 SMP 2016-04-07 15:08:15 +02:00
Zhang Xianyi d4380c1fe4 Refs xianyi/OpenBLAS-CI#10 , Fix sdot for scipy test_iterative.test_convergence test failure on AMD bulldozer and piledriver. 2016-04-07 01:44:18 +08:00
Werner Saar a51102e9b7 bugfixes for sgemm- and cgemm-kernel 2016-04-06 11:15:21 +02:00
Werner Saar c5b1fbcb2e updated optimized cgemm- and ctrmm-kernel for POWER8 2016-04-04 09:12:08 +02:00
Werner Saar d4c0330967 updated cgemm- and ctrmm-kernel for POWER8 2016-04-03 14:30:49 +02:00
Werner Saar 6a9bbfc227 updated sgemm- and strmm-kernel for POWER8 2016-04-02 17:16:36 +02:00
Werner Saar 68a69c5b50 added optimized dgemv_n kernel for POWER8 2016-03-30 11:10:53 +02:00
Werner Saar c2464a7c4a added optimized casum kernel for POWER8 2016-03-28 14:12:08 +02:00
Werner Saar 294f933869 added optimized zasum kernel for POWER8 2016-03-28 13:37:32 +02:00
Werner Saar f59c9bd6ef added optimized sasum kernel for POWER8 2016-03-28 12:44:25 +02:00
Werner Saar c53be46d78 added optimized dasum kernel for POWER8 2016-03-28 12:17:15 +02:00
Werner Saar 659ed16591 added otimized cswap and zswap kernels for POWER8 2016-03-27 18:31:37 +02:00
Werner Saar 35c98a3556 added optimized zscal kernel for POWER8 2016-03-27 16:31:50 +02:00
Werner Saar f1a5dd06c5 added optimized sscal kernel for POWER8 2016-03-27 11:05:56 +02:00
wernsaar e125a3dc33 Merge pull request #824 from wernsaar/develop
added optimized drot-kernel and srot-kernel for POWER8
2016-03-27 10:43:17 +02:00
Werner Saar 35f1f21a7f added drot- and srot-kernel optimimized for POWER8 2016-03-27 08:57:11 +02:00
Zhang Xianyi 7b4b7179ba Merge pull request #819 from ashwinyes/develop_20160324_fixes_optimizations
Cortex-A57: Fixes and Optimizations
2016-03-27 00:04:20 -04:00
Werner Saar 3d9a50e841 added optimized sswap kernel for POWER8 2016-03-25 17:34:55 +01:00
Werner Saar 828c849b44 added optimized ccopy kernel for POWER8 2016-03-25 16:54:25 +01:00
Werner Saar ecc0bc9813 added optimized scopy kernel for POWER8 2016-03-25 16:06:56 +01:00
Werner Saar 12f209b7b0 added optimized zswap kernel for POWER8 2016-03-25 15:27:34 +01:00
Werner Saar 7316a87930 added optimized dswap kernel for POWER8 2016-03-25 14:35:43 +01:00
Werner Saar 0bff057a87 added optimized dcopy kernel for POWER8 2016-03-25 13:03:02 +01:00
Werner Saar 1e6cf9808c added optimized dscal kernel for POWER8 2016-03-25 09:42:08 +01:00
Ashwin Sekhar T K 278511ad2d Cortex-A57: Fix clang compilation errors 2016-03-24 10:42:04 +05:30
Ashwin Sekhar T K 3b5ffb49d3 Cortex-A57: Improve DGEMM 8x4 Implementation 2016-03-24 10:25:18 +05:30
Werner Saar 55eda3813b added optimized zaxpy kernel for POWER8 2016-03-23 11:20:23 +01:00
Werner Saar 0664ba4c97 added optimized daxpy kernel for POWER8 2016-03-22 14:50:03 +01:00
Werner Saar 11c44dede1 added optimized sdot kernel for POWER8 2016-03-21 13:18:23 +01:00
Werner Saar 9e4584d069 added optimized zdot kernel for POWER8 2016-03-21 10:12:07 +01:00
Werner Saar cd9fafc054 ddot for POWER8: updated licence information 2016-03-20 11:19:27 +01:00
Werner Saar 84b92e6373 added optimized ddot kernel for POWER8 2016-03-20 11:06:06 +01:00
wernsaar c279a53ed8 Merge pull request #806 from wernsaar/develop
adding optimized single precision blas level3 kernels for POWER8
2016-03-18 12:46:16 +01:00
Werner Saar e1df5a6e23 fixed sgemm- and strmm-kernel 2016-03-18 12:12:03 +01:00
Werner Saar 5c658f8746 add optimized cgemm- and ctrmm-kernel for POWER8 2016-03-18 08:17:25 +01:00
Ashwin Sekhar T K 5ac02f6dc7 Optimize Dgemm 4x4 for Cortex A57 2016-03-14 19:35:23 +05:30
Ashwin Sekhar T K 7aa1ad4923 Functional Assembly Kernels for CortexA57
Adding functional (non-optimized) kernels for Cortex-A57
with the following layouts.
SGEMM - 16x4, 8x8
CGEMM - 8x4
DGEMM - 8x4, 4x8
2016-03-14 19:33:21 +05:30
Werner Saar dcd15b546c BUGFIX: KERNEL.POWER8 2016-03-14 14:36:59 +01:00
Werner Saar 96284ab295 added sgemm- and strmm-kernel for POWER8 2016-03-14 13:52:44 +01:00
Werner Saar faa5e2e5e3 FIX: forgot the add the files cgemv_n_4.c and cgemv_t_4.c 2016-03-10 11:10:38 +01:00
Werner Saar fdf291be30 Added optimized cgemv_n and cgemv_t kernels for bulldozer, piledriver and steamroller 2016-03-10 09:42:07 +01:00
Werner Saar c99cc41cbd Added optimized zgemv_n kernel for bulldozer, piledriver and steamroller 2016-03-09 14:02:03 +01:00
Werner Saar acdff55a6a Bugfix for ztrmv 2016-03-07 09:39:34 +01:00
Zhang Xianyi 7d6b68eb4a Refs #786. Revert to default assembly kernel. 2016-03-07 11:34:58 +08:00
Werner Saar cd5241d0cf modified KERNEL for power, to use the generic DSDOT-KERNEL 2016-03-06 09:07:24 +01:00
Zhang Xianyi 8c43d7fa5f Merge remote-tracking branch 'origin/power8' into develop
Refs #774
2016-03-05 06:03:19 -05:00
Werner Saar 085f215257 Modified assembly label name, so that they are hidden.
Added license informations.
2016-03-05 10:27:27 +01:00
Zhang Xianyi 8f758eeff9 Refs #786. avoid old assembly c/zgemv kernels. 2016-03-05 08:32:03 +08:00
Werner Saar 0afc76fd65 enabled gemm_beta assembly kernels 2016-03-04 15:01:15 +01:00
Werner Saar 91e1c5080c modified configuration, to use power6 sgemm kernel for power8 2016-03-04 13:38:57 +01:00
Werner Saar 73f04c2c72 enabled hemv assemly function for power8 2016-03-04 13:20:50 +01:00
Werner Saar 3e633152c6 enabled symv assembly kernels on power8 2016-03-04 13:08:18 +01:00
Werner Saar d5130ce7e3 enabled gemv assembly on power8 2016-03-04 12:53:31 +01:00
Werner Saar 4824b88fcb enabled all level1 assembly kernels for power8 2016-03-04 12:35:25 +01:00
Werner Saar b752858d6c added dgemm-, dtrmm-, zgemm- and ztrmm-kernel for power8 2016-03-01 07:33:56 +01:00
Zhang Xianyi efa4f5c936 Refs #695 #783. Replace default x86_64 cgemv_t
asm kernel by C kernel.
2016-03-01 11:18:56 +08:00
Zhang Xianyi 74b0672223 Fix c/zaxpyc kernel bug on Cortex-A57. 2016-02-23 22:47:53 +00:00
Zhang Xianyi 6e7be06e07 Refs JuliaLang/julia#5728. Fix gemv performance bug on Haswell Mac OSX.
On Mac OS X, it should use .align 4 (equal to .align 16 on Linux).
I didn't get the performance benefit from .align. Thus, I deleted it.
2016-02-19 17:56:07 -05:00
Zhang Xianyi d06b92906a Add gemm3m building for CMake. 2016-02-12 05:02:51 +08:00
Zhang Xianyi 962376664d Refs #768. Swap the result of zdot x87 fp kernel. 2016-02-02 09:15:02 +08:00
Zhang Xianyi c44ff4d648 Refs #714. avoid compiling warnings. 2016-01-28 04:38:07 +08:00
Werner Saar 63a7d7fb24 updated gemv_n_vfpv3.S for armv7 2016-01-25 15:00:13 +01:00
Werner Saar b4ede558a5 updated nrm2 kernel for armv7 2016-01-25 11:55:25 +01:00
Werner Saar de3e2d4349 updated trmm kernels for armv7 2016-01-25 11:08:56 +01:00
Werner Saar a0e51e96f1 updated gemm kernels for armv7 2016-01-25 10:46:10 +01:00
Werner Saar c2891330bc updated KERNEL.ARMV6 2016-01-24 17:12:07 +01:00
Werner Saar ceaa931e48 updated gemv kernel for armv6 2016-01-24 16:31:19 +01:00
Werner Saar eaa63165df updated cgemv and zgemv kernels for armv6 2016-01-24 14:42:38 +01:00
Werner Saar c65357c566 updated trmm_kernels for armv6 2016-01-24 13:03:33 +01:00
Werner Saar e63e9f9f26 updated gemm_kernels for armv6 2016-01-24 11:55:50 +01:00
Werner Saar aafd3ab60e updated cdot and zdot on arm 2016-01-24 10:56:49 +01:00
Werner Saar d2f84c9c8a Ref #740: updated nrm2_vfp.S 2016-01-23 17:47:58 +01:00
Werner Saar ca32253f32 Ref #740: updated asum_vfp.S and iamax_vfp.S 2016-01-23 14:44:34 +01:00
Werner Saar 9066d1f982 Ref #750 and Ref #740 : bugfix for sdot, dsdot and ddot on arm 2016-01-23 11:59:51 +01:00
Werner Saar 692d9c881c Ref #740: simple solution to clear floating point register on arm 2016-01-17 15:37:12 +01:00
Zhang Xianyi 3602a2cd1f #736 Revert #733 patch to fix bus error on ARM. 2016-01-12 22:19:58 +00:00
Zhang Xianyi e3e20e2242 Merge pull request #733 from yuyichao/arm-asm
Do not use vsub to clear the register values
2016-01-05 19:35:12 -06:00
Yichao Yu 594b9f4c73 Do not use vsub to clear the register values since it doesn't work with non-normal numbers. 2016-01-05 16:54:05 +00:00
Werner Saar c8f2c5d636 added optimized trsm_kernels 2016-01-05 13:05:05 +01:00
Ashwin Sekhar T K 318f0949c3 lapack-test fixes in nrm2 kernels for Cortex A57 2015-11-23 13:43:36 +05:30
Ashwin Sekhar T K 98965da2e8 lapack-test fixes for Cortex A57 2015-11-20 01:15:04 +05:30
Ashwin Sekhar T K c99c43d51e Optimized trmm kernels for CORTEXA57 2015-11-09 14:15:54 +05:30
Ashwin Sekhar T K 1397b47197 Optimized zgemm kernel for CORTEXA57 2015-11-09 14:15:53 +05:30
Ashwin Sekhar T K 45f78963ac Optimized cgemm kernel for CORTEXA57
Also, add a generic ztrmm 4x4 kernel
2015-11-09 14:15:53 +05:30
Ashwin Sekhar T K 402443bf9c Optimized dgemm kernel for CORTEXA57 2015-11-09 14:15:53 +05:30
Ashwin Sekhar T K 19fdbee291 Improve the sgemm kernel for CORTEXA57 2015-11-09 14:15:53 +05:30
Ashwin Sekhar T K 3b0cdfab1e Optimized gemv kernels for CORTEXA57
Co-Authored-By: Ralph Campbell <ralph.campbell@broadcom.com>
2015-11-09 14:15:52 +05:30
Ashwin Sekhar T K 46efa6a1da Optimized swap kernels for CORTEXA57
Co-Authored-By: Ralph Campbell <ralph.campbell@broadcom.com>
2015-11-09 14:15:52 +05:30
Ashwin Sekhar T K ea1465cdf8 Optimized scal kernels for CORTEXA57
Co-Authored-By: Ralph Campbell <ralph.campbell@broadcom.com>
2015-11-09 14:15:52 +05:30
Ashwin Sekhar T K fb4be3b3eb Optimized rot kernels for CORTEXA57
Co-Authored-By: Ralph Campbell <ralph.campbell@broadcom.com>
2015-11-09 14:15:52 +05:30
Ashwin Sekhar T K 6c2f4ddbcd Optimized nrm2 kernels for CORTEXA57 2015-11-09 14:15:51 +05:30
Ashwin Sekhar T K 870c4d49c0 Optimized dot kernels for CORTEXA57
Co-Authored-By: Ralph Campbell <ralph.campbell@broadcom.com>
2015-11-09 14:15:51 +05:30
Ashwin Sekhar T K cd7684097c Optimized copy kernels for CORTEXA57
Co-Authored-By: Ralph Campbell <ralph.campbell@broadcom.com>
2015-11-09 14:15:51 +05:30
Ashwin Sekhar T K 2690b71b1f Optimized axpy kernels for CORTEXA57
Co-Authored-By: Ralph Campbell <ralph.campbell@broadcom.com>
2015-11-09 14:15:51 +05:30
Ashwin Sekhar T K 3e4acedf0e Optimized asum kernels for CORTEXA57
Co-Authored-By: Ralph Campbell <ralph.campbell@broadcom.com>
2015-11-09 14:15:51 +05:30
Ashwin Sekhar T K 2610752dbb Optimized iamax kernels for CORTEXA57
Co-Authored-By: Ralph Campbell <ralph.campbell@broadcom.com>
2015-11-09 14:15:50 +05:30
Ashwin Sekhar T K dbb213655e Optimized amax kernels for CORTEXA57
Co-Authored-By: Ralph Campbell <ralph.campbell@broadcom.com>
2015-11-09 14:15:50 +05:30
Ashwin Sekhar T K f2f8a0fe8b Adding arm64 target CORTEXA57
Co-Authored-By: Ralph Campbell <ralph.campbell@broadcom.com>
2015-11-09 14:15:50 +05:30
Ralph Campbell c053559ed9 Minor C code fixes in kernel/arm 2015-11-09 14:15:49 +05:30
Ralph Campbell 55e4332f00 Remove duplicate -D args in kernel/Makefile.L1 2015-11-09 14:15:48 +05:30
Zhang Xianyi 3e8d6ea74f Init POWER8 kernels by POWER6. 2015-11-03 12:34:23 +08:00
Zhang Xianyi 69363622a8 Fix DYNAMIC_ARCH=1 bug. 2015-10-27 05:10:40 +08:00
Zhang Xianyi 53b6023a6c Fix cmake bug on MSVC 32-bit. 2015-10-26 14:52:13 -05:00
Zhang Xianyi 309875de3c Fix cmake bug on x86 32-bit.
e.g. Build 32-bit on 64-bit Linux.
cmake -DBINARY=32
2015-10-27 02:54:53 +08:00
Zhang Xianyi 8fade093aa Fixed cmake bug on Visual Studio. 2015-10-20 14:37:22 -05:00
Zhang Xianyi 96f0bbe067 Fixed cmake bug on haswell. 2015-10-21 02:24:54 +08:00
Zhang Xianyi d8392c1245 Fixe cmake config bugs. 2015-10-20 04:30:55 +08:00
Zhang Xianyi 94b125255f Merge branch 'develop' into cmake
Conflicts:
	driver/others/memory.c
2015-10-13 04:46:08 +08:00
Martin Koehler 711ca33bc6 Improved Ximatcopy when lda==ldb.
The Ximatcopy functions create a copy of the input matrix
although they seem to work inplace. The new routines
XIMATCOPY_K_YY perform the operations inplace if the leading
dimension does not change.
2015-09-07 14:36:16 +02:00
Zhang Xianyi 7df0820160 Use C kernels for s/dgemv on x86. 2015-08-19 08:07:47 -05:00
Zhang Xianyi f874465bb8 Use cmake to build OpenBLAS GENERIC Target on MSVC x86 64-bit.
Disable CBLAS and LAPACK.
2015-08-10 14:10:44 -05:00
Zhang Xianyi 898fc7552a Merge pull request #612 from ibmsoe/ppc64le
ppc64le platform support (ELF ABI v2)
2015-08-04 16:58:24 -05:00
Zhang Xianyi ab0a0a75fc Merge branch 'develop' into cmake 2015-08-03 23:59:01 -05:00
Zhang Xianyi 1cf2b10224 Use pure C generic target on x86 and x86_64.
make TARGET=GENERIC

?gemm3m is unimplemented on generic target.
2015-08-03 23:55:56 -05:00
Zhang Xianyi 7ac7e147d4 Fixed cmake building bugs on Linux. Disable LAPACK by default. 2015-08-04 04:37:05 +08:00
Matthew Brandyberry 7ba4fe5afb ppc64le platform support (ELF ABI v2) 2015-07-21 22:20:19 -05:00
Zhang Xianyi dcd5ba4443 Merge branch 'cmake' of https://github.com/hpanderson/OpenBLAS into hpanderson_cmake 2015-07-22 04:06:39 +08:00
Werner Saar e7c969e164 added optimized dtrmm_kernel for haswell 2015-06-13 16:16:29 +02:00
Werner Saar 9bd962f655 modified haswell parameter dgemm_unroll_n 2015-06-13 10:28:27 +02:00
Werner Saar 24f58c8bb1 added optimized cscal and zscal kernels for steamroller 2015-05-18 12:40:07 +02:00
Werner Saar 95b1faf667 added optimized cscal and zscal kernels for steamroller and piledriver 2015-05-18 10:50:57 +02:00
Werner Saar 2d9e406050 added optimized cscal kernel for sandybridge 2015-05-18 08:46:06 +02:00
Werner Saar 59083e3ce1 added optimized cscal kernel for bulldozer 2015-05-18 07:33:52 +02:00
wernsaar 685be40339 Merge pull request #571 from wernsaar/develop
added optimized cscal and zscal functions
2015-05-17 14:09:14 +02:00
Werner Saar 31c9e399e9 added optimized cscal kernel for haswell 2015-05-17 13:44:09 +02:00
Werner Saar 7de6bb9889 added optimized zscal kernel for bulldozer 2015-05-17 11:45:19 +02:00
Werner Saar d63034303b added optimized zscal kernel for haswell 2015-05-16 16:41:45 +02:00
Zhang Xianyi 51ff17d46e Add AMD Excavator target. 2015-05-13 16:16:30 -05:00
Werner Saar 18e90ee2e3 bugfix: added static to functions 2015-05-13 13:31:26 +02:00
Werner Saar e00cccc41e added optimized dscal kernel for piledriver 2015-05-13 13:05:35 +02:00
Werner Saar 73f09bf64f optimized dscal kernel for increment != 1 2015-05-13 12:14:39 +02:00
Werner Saar 02e772c7e4 added optimized dscal kernel for haswell 2015-05-12 17:19:58 +02:00
Werner Saar 7aee913991 added optimized dscal kernel for sandybridge 2015-05-12 16:27:43 +02:00
Werner Saar e50a933037 added optimized dscal kernel for bulldozer 2015-05-12 12:28:44 +02:00
Werner Saar 133c11a156 updated dgemv_n kernel for nehalem 2015-04-30 14:38:06 +02:00
Werner Saar 30f52d53df optimized dgemv_n kernel for haswell 2015-04-30 12:11:39 +02:00
Werner Saar 5e83d80725 optimized dger kernel for sandybridge 2015-04-28 16:58:11 +02:00
Werner Saar b2e1797dc6 added optimized sger kernel for sandybridge 2015-04-28 15:33:38 +02:00
Werner Saar e216f686cb optimized saxpy and daxpy for sandybridge 2015-04-28 10:18:32 +02:00
Werner Saar fc0e0391f3 bugfixes: replaced int with BLASLONG 2015-04-24 14:30:44 +02:00
Werner Saar c22068c406 optimized sdot.c for increments != 1 2015-04-24 13:13:20 +02:00
Werner Saar dee100d0e4 optimized saxpy.c for increments != 1 2015-04-24 11:52:59 +02:00
Werner Saar 0273966abb optimized daxpy kernel for increments != 1 2015-04-24 11:39:17 +02:00
Werner Saar 3a67daa954 optimized ddot.c for increments != 1 2015-04-24 10:56:55 +02:00
Werner Saar b4f2153dcd added optimized ssymv kernels for sandybridge 2015-04-23 12:19:24 +02:00
Werner Saar 1c4b0eeae3 added optimized ssymv kernels for haswell 2015-04-23 10:23:13 +02:00
Werner Saar 1bec9abb9a added optimized dsymv kernels for sandybridge 2015-04-22 12:09:43 +02:00
Werner Saar 3814bf60d3 added optimized dsymv kernels for haswell 2015-04-22 10:42:50 +02:00
Werner Saar 6d0db0151f added optimized zaxpy-kernels 2015-04-16 11:19:37 +02:00
Zhang Xianyi 37b9033c90 Merge pull request #543 from jeromerobert/develop
Fix a buffer overflow with MAX_STACK_ALLOC size in dgemv_t
2015-04-15 11:18:14 -05:00
Werner Saar 13889515b3 added optimized caxpy-kernel for sandybridge 2015-04-15 16:29:25 +02:00
Werner Saar 248c9340c3 added optimized caxpy-kernel for haswell 2015-04-15 15:16:31 +02:00
Werner Saar e9f33b4ca7 added optimized caxpy-kernel for steamroller 2015-04-15 13:49:23 +02:00
Werner Saar f5d847122a updated caxpy_microk_bulldozer-2.c and caxpy.c 2015-04-15 11:59:38 +02:00
Jerome Robert a4c96eca67 Fix a buffer overflow with MAX_STACK_ALLOC size in dgemv_t
Refs #478, #482, 9798481, fd9fd42
2015-04-15 11:46:48 +02:00
Werner Saar baa0363ea2 add optimized ddot-kernel for piledriver 2015-04-14 15:09:13 +02:00
Werner Saar 34ba66606a add optimized daxpy-kernel for piledriver 2015-04-14 14:23:29 +02:00
Werner Saar f615dc7603 added optimized saxpy kernel for steamroller 2015-04-14 09:09:39 +02:00
Werner Saar 331c417637 optimized saxpy for piledriver 2015-04-14 08:34:11 +02:00
Werner Saar d7a17ad85d optimized sdot-kernel for pilediver 2015-04-13 13:19:21 +02:00
Werner Saar d35f6c63c2 add optimized daxpy-kernel for steamroller 2015-04-13 12:22:43 +02:00
Werner Saar 166d76e864 added optimized sdot-kernel for steamroller 2015-04-11 08:48:18 +02:00
Werner Saar f9f127d838 added optimized ddot kernel for steamroller 2015-04-10 16:18:03 +02:00
wernsaar 62231ab337 Merge pull request #538 from wernsaar/develop
Added optimized cdot- and zdot-kernels
2015-04-10 16:03:37 +02:00
Werner Saar 3119def9a7 updated cdot and zdot 2015-04-10 11:10:31 +02:00
Werner Saar 33b332372a add optimized cdot- and zdot-kernel for sandybridge 2015-04-10 09:37:26 +02:00
Werner Saar fd838c75bc add optimized cdot- and zdot-kernel for haswell 2015-04-09 15:13:52 +02:00
Werner Saar b57a60dac8 updated cdot and zdot for piledriver 2015-04-09 10:33:46 +02:00
Werner Saar 5c51163972 added optimized cdot- and zdot-kernel for steamroller 2015-04-09 09:45:23 +02:00
Werner Saar 9299d8cfd6 added optimized cdot- and zdot-kernels for bulldozer 2015-04-08 16:29:55 +02:00
Zhang Xianyi 0a3d3b945d Refs #535. Fix the wrong vector instruction in sgemm sandy bridge kernel. 2015-04-08 03:55:49 +08:00
Werner Saar 60c6dec6e6 updated some lines for bulldozer 2015-04-06 18:47:16 +02:00
Werner Saar 47898cca35 added optimized saxpy- and daxpy-kernel for sandybridge 2015-04-06 16:05:16 +02:00
Werner Saar 53bb924287 added optimized saxpy- and daxpy-kernel for haswell 2015-04-06 12:33:16 +02:00
Werner Saar a901b065d3 added optimized ddot-kernel for sandybridge 2015-04-05 20:19:38 +02:00
Werner Saar 3937e2a0a0 add optimized sdot-kernel for sandybridge 2015-04-05 19:47:05 +02:00
Werner Saar 9707d608d5 removed double definition line 2015-04-05 18:35:34 +02:00
Werner Saar 701b9d7556 added optimized sdot- and ddot-kernel for HASWELL 2015-04-05 17:57:53 +02:00
Zhang Xianyi e5b96e55a7 Fix build bug for ARM64. 2015-03-24 15:27:17 -05:00
Hank Anderson 84d90d6ed8 Fixed some compiler errors/warnings for clang. 2015-02-25 11:52:25 -06:00
Hank Anderson 518e2424a8 Fixed bad filename for cpuid.S compile. 2015-02-25 11:51:29 -06:00
Zhang Xianyi ea7f9dacf4 Refs #509. Fixed geadd building bug with DYNAMIC_ARCH=1. 2015-02-26 01:47:11 +08:00
Hank Anderson 0d8e227ea7 Changed strategy for setting preprocessor definitions.
Instead of generating separate object files for each permutation of
defines for a source file, GenerateNamedObjects now writes an entirely
new source file and inserts the defines as #define c statements.

This solves a problem I ran into with ar.exe where it was refusing to
link objects that had the same filename despite having different paths.
2015-02-24 12:26:33 -06:00
Hank Anderson 12d1fb2e40 Fixed incorrect object name in kernel CMakeLists.txt 2015-02-24 10:30:16 -06:00
Hank Anderson 1b7f427401 Added conj gemv objects for complex build. 2015-02-23 10:24:31 -06:00
Hank Anderson b2284647a3 More complex objects. 2015-02-23 07:51:05 -06:00
Hank Anderson a6116e5859 Added some more complex-only objects. 2015-02-22 17:49:28 -06:00
Hank Anderson 714638c187 Added some TRMM objects for complex types. 2015-02-19 16:11:51 -06:00
Hank Anderson e27c372e53 Fixed reuse of float_char from parent loop.
Fixed in/it/on/otcopy names.
2015-02-19 13:53:29 -06:00
Hank Anderson f3f2b3d768 Added complex and single netlib-lapack fortran sources to lapack.cmake. 2015-02-19 12:26:11 -06:00
Hank Anderson 9492298048 Added other float types to Makefile.L3. 2015-02-18 13:01:05 -06:00
Hank Anderson 14fd3d35de Added checks for missing defines in kernel. 2015-02-18 10:25:01 -06:00
Hank Anderson cebc07cebd ParseMakefileVars now recursively parses included makefiles. 2015-02-17 22:09:41 -06:00
Hank Anderson 33c5e8db7f Added a helper function for setting the L1 kernel defaults.
Added loop to build objects with different KERNEL defines.
2015-02-17 21:36:23 -06:00
Martin Koehler 39cc6b21d3 Add ATLAS-style ?geadd function 2015-02-16 13:46:20 +01:00
Hank Anderson 4662a0b13a Changed generate functions to iterate through a list of float types.
This will generate obj files for SINGLE/DOUBLE/COMPLEX/DOUBLE COMPLEX.
2015-02-15 17:44:37 -06:00
Hank Anderson 162791e30e Added common objects from kernel Makefile. 2015-02-10 12:42:05 -06:00
Hank Anderson c0624a26be Fixed some dgemm_copy function names. 2015-02-09 14:34:29 -06:00
Hank Anderson 4bfaf1ce66 Removed some list appends I missed. 2015-02-09 12:56:55 -06:00
Hank Anderson e8c39138c6 Removed return value from GenerateNamedObjects.
It sets DBLAS_OBJS directly to save a bunch of list appending in the
CMakeLists.txt files.
2015-02-09 12:28:09 -06:00
Hank Anderson f992799226 Added the rest of Makefile.L3. 2015-02-09 10:47:35 -06:00
Hank Anderson 4c65afcce1 Changed kernel filenames to vars. These will need to be read from KERNEL.
Added some kernel/L3 objects.
2015-02-09 09:52:14 -06:00
Hank Anderson 7fa5c4e2fd Fixed some case issues with ARCH.
Added some kernel and driver/others objects.
2015-02-08 15:29:18 -06:00
Hank Anderson fa0e6a6c93 Added the rest of the L1 kernel makefile. 2015-02-07 21:37:46 -06:00
Hank Anderson 38681fb1c6 Added more kernel files. 2015-02-07 12:54:30 -06:00
Hank Anderson 189fadfde0 Started implementing kernel/Makefile in cmake. 2015-02-05 21:05:11 -06:00
Zhang Xianyi 229ce2ccd1 Add cortex-a9 and cortex-a15 targets. 2015-01-12 08:55:29 +00:00
Zhang Xianyi 41aad0407f Merge pull request #482 from jeromerobert/develop
Allow to do gemv and ger buffer allocation on the stack
2015-01-02 02:26:17 +08:00
Werner Saar ddf983d643 added optimizations for steamroller 2014-12-30 20:14:45 +08:00
Werner Saar 4319769b79 added target processor STEAMROLLER 2014-12-28 20:16:46 +08:00
Jerome Robert e9d9a8eae3 Allow to do gemv and ger buffer allocation on the stack
ger and gemv call blas_memory_alloc/free which in their turn
call blas_lock. blas_lock create thread contention when matrices
are small and the number of thread is high enough. We avoid
call blas_memory_alloc by replacing it with stack allocation.
This can be enabled with:
make -DMAX_STACK_ALLOC=2048
The given size (in byte) must be high enough to avoid thread contention
and small enough to avoid stack overflow.

Fix #478
2014-12-27 14:33:12 +01:00
Werner Saar 587e16fba3 Ref #458: Backport, sandybrigde uses nehalem zgemm kernel 2014-12-22 17:01:18 +01:00
Werner Saar 6261342de3 small optimization on dgemm_kernel for N=1 2014-12-18 20:35:51 +01:00
Werner Saar bc5fff7085 changed inline assembler labels to short form 2014-12-07 12:38:54 +01:00