Martin Kroeker
125343cc88
Drop test for zero incx,incy in armv7 AXPY
...
...to pass the related utest (see #1469 )
2018-04-24 22:39:50 +02:00
Martin Kroeker
8a3b6fa108
Use generic zrot.c on ppc64/POWER6 to work around utest failure from … ( #1535 )
...
* Use generic C implementation of zrot on ppc64/POWER6 to work around utest failure from #1469
2018-04-23 19:05:49 +02:00
Martin Kroeker
9c5518319a
Revert "Fix 32bit HASWELL builds"
2018-04-22 20:20:04 +02:00
Martin Kroeker
2ca0faf495
Merge pull request #1515 from martin-frbg/mipsdot
...
Correct precision of mips dsdot
2018-04-11 08:21:25 +02:00
Martin Kroeker
0fe434598b
Fix precision of mips dsdot
2018-04-10 23:30:59 +02:00
Martin Kroeker
c7b55b6082
Merge pull request #1499 from quickwritereader/develop
...
Implemented missing vsx simd kernels for power8 blas1/2 double. z13 modifications
2018-03-27 21:43:23 +02:00
Martin Kroeker
840e01061f
Merge pull request #1491 from martin-frbg/ddot_mt
...
Add multithreading support for Haswell DDOT
2018-03-27 21:43:05 +02:00
QWR QWR
28ca97015d
power8:Added initial zgemv_(t|n) ,i(d|z)amax,i(d|z)amin,dgemv_t(transposed),zrot
...
z13: improved zgemv_(t|n)_4,zscal,zaxpy
2018-03-27 14:54:41 +00:00
Martin Kroeker
6a6ffaff1e
Merge pull request #1494 from martin-frbg/x86_dsdot
...
Use generic/dot.c instead of the inferior arm/dot.c for x86 DSDOT
2018-03-17 15:26:47 +01:00
Martin Kroeker
28ac9ea5a6
Use generic/dot.c instead of the inferior arm/dot.c for x86 DSDOT
...
to resolve dsdot utest failure seen in #1492
2018-03-17 13:49:15 +01:00
Martin Kroeker
a55694dd5b
Declare dot_compute static to avoid conflicts in multiarch builds
2018-03-16 22:23:36 +01:00
Martin Kroeker
85a41e9cdb
Add multithreading support for Haswell DDOT
...
copied from ashwinyes' implementation in dot_thunderx2t99.c
2018-03-16 16:58:47 +01:00
Martin Kroeker
81215711a2
Re-enable DAXPY microkernels for x86_64
...
as the inaccuracies seen in the original testcase for #1332 appear to be due to an artefact that amplifies the very small rounding differences between FMA and discrete multiply+add
2018-03-04 19:37:03 +01:00
Martin Kroeker
22167170b3
Merge pull request #1477 from quickwritereader/develop
...
Power8 blas3 copy-pack routines
2018-02-28 18:46:54 +01:00
Ashwin Sekhar T K
fa9ca65c0e
ARM64: Fix utest dsdot errors
2018-02-27 10:47:55 +00:00
Martin Kroeker
719b68f077
Merge pull request #1473 from martin-frbg/p2align
...
Replace .align with .p2aligns in dscal.c and the Nehalem microkernels as well
2018-02-27 08:28:20 +01:00
Martin Kroeker
fe9f15f2d8
Merge pull request #1472 from martin-frbg/utest-fixes
...
Fix limited DSDOT precision on arm,aarch64 and zarch
2018-02-26 22:48:07 +01:00
Martin Kroeker
497f0c3d8a
Replace .align with .p2align in the Nehalem microkernels
2018-02-26 20:58:33 +01:00
Martin Kroeker
ea37db828e
Convert .align to .p2align for OSX compatibility
2018-02-26 20:48:03 +01:00
Martin Kroeker
6e70287776
Use generic/dot.c for DSDOT on ARMV5 and above
...
The default arm/dot.c is less precise when used for DSDOT, as shown by utest
2018-02-25 19:57:23 +01:00
Martin Kroeker
58f236ad73
Use generic/dot.c for DSDOT on zarch
2018-02-25 19:52:14 +01:00
Martin Kroeker
e207107150
Use generic/dot.c for DSDOT on z13
...
The implementation in arm/dot.c has lower precision, as shown by the utest for dsdot.
2018-02-25 19:51:25 +01:00
Martin Kroeker
c9d408064a
Use dot.S also for DSDOT on CORTEXA57
2018-02-25 19:48:09 +01:00
Martin Kroeker
288d1a3f6e
Use dot.S also for DSDOT on ARMV8
2018-02-25 19:45:16 +01:00
Martin Kroeker
7c1925acec
Use .p2align instead of .align for compatibility on Sandybridge as well
2018-02-24 19:43:15 +01:00
Martin Kroeker
2359c7c1a9
Use .p2align instead of .align for portability
...
The OSX assembler apparently mishandles the argument to decimal .align, leading to a significant loss of performance
as observed in #730 , #901 and most recently #1470
2018-02-24 17:50:13 +01:00
Martin Kroeker
e7366a4161
Restore the remaining utests ( #1462 )
...
* Restore the remaining utests
* Try fork test on Cygwin and Linux only, it hangs on at least ARMv8/Android as well
* Use generic sswap/dswap kernels for NEHALEM 32bit to fix fault found by the restored swap utest
* Disable zdotu test for MS cl to work around runtime error -1073741819 on AppVeyor for now
(probably coding error in the initialization of the complex numbers or wrong choice of zdotu API)
2018-02-20 10:07:17 +01:00
the mslm
2c0a008281
dgemm_ncopy_4_ save/restore
2018-02-18 01:30:17 +00:00
the mslm
c5425daa6b
power8 ?gemm_tcopy save/restore
2018-02-16 23:36:46 +00:00
Martin Kroeker
b47e6822aa
Enable most assembly kernels in the generic ARMV8 target
...
ref #1439
2018-02-06 11:42:58 +01:00
Abdelrauf
60596a1abc
Merge branch 'develop' into develop
2018-01-31 16:17:04 -08:00
Abdelrauf
afd514c25d
small fix inside ifdef z13mvc . (z13mvc code is not used in production)
2018-01-31 18:30:59 -05:00
Martin Kroeker
f45776ec1f
Merge pull request #1440 from quickwritereader/develop
...
small corrections
2018-01-31 23:48:47 +01:00
Martin Kroeker
e388459a27
Merge pull request #1419 from brada4/develop
...
Initialize unitialized values for repeated calls
2018-01-31 23:48:34 +01:00
Abdelrauf
f653e7a18d
small fix
...
small fix inside ifdef z13mvc . (z13mvc code is not used in production)
2018-01-31 07:49:38 -08:00
the mslm
f946a89432
zscal (case: real alpha=0 ) mikrokernel shift&mem fix , da_i as input reg. small typo fixes
2018-01-26 19:25:27 -08:00
Martin Kroeker
485df77612
Make USE_TRMM depend on TARGET_CORE not TARGET
...
Fixes #1432 (and possibly other DTRMM-related failures on Haswell and related architectures when built with cmake)
2018-01-26 23:20:00 +01:00
Martin Kroeker
e4c71a799a
Merge pull request #1426 from quickwritereader/develop
...
(Z13 ) Blas1 mikrokernels can be inlined by gcc. Refactoring,fixes,tunings
2018-01-20 17:34:54 +01:00
the mslm
2619ad7ea5
Blas1 mikrokernels can be inlined by gcc. Refactoring ( symbolic operan
...
names). Some fixes and tunings
2018-01-19 19:24:35 -08:00
Andrew
e5cc3d72c0
core.IdenticalExpr clang501 checker
2018-01-19 23:17:43 +01:00
Andrew
4938faa822
core.IdenticalExpr clang501 checker
2018-01-19 23:15:58 +01:00
Andrew
9fa986337d
add missing brackets to silence indentation warnings gcc721
2018-01-19 23:11:12 +01:00
Andrew
3eed97f6b9
Initialize values to silence cppcheck
2018-01-12 22:35:00 +01:00
Andrew
13e137fbc9
Initialize uninitialized variables (cppcheck)
2018-01-12 22:33:41 +01:00
Martin Kroeker
3d23f45107
Merge pull request #1415 from quickwritereader/develop
...
(Z systems Z13) small fixes, some (i(dz)amin,i(dz)amax,(dz)dot,(dz)asum) mikrokernels…
2018-01-11 08:35:02 +01:00
Abdelrauf
87669d1c0a
small fixes, some (i(dz)amin,i(dz)amax,(dz)dot,(dz)asum) mikrokernels can be inlined
2018-01-10 20:36:53 -05:00
Martin Kroeker
42285d8e70
Merge pull request #1410 from brada4/develop
...
Address warnings #1357
2018-01-06 20:02:46 +01:00
Andrew
d602b99386
LAPACK helpers in C that need care too
2018-01-02 14:38:50 +01:00
Andrew
4d0b005e5b
Eliminate remaining unused results in kernels (clang5 analyzer)
2018-01-01 20:54:39 +01:00
Martin Kroeker
b81656936f
Merge pull request #1409 from martin-frbg/issue1292-2
...
Tag %1 and %2 as both input and output operands
2017-12-31 20:18:48 +01:00
Martin Kroeker
b973990df2
Tag %1 and %2 as both input and output operands
...
fix from #1292 extended to the other gemv microkernels
2017-12-31 18:03:36 +01:00
Martin Kroeker
1e31124eb0
Merge pull request #1406 from martin-frbg/issue1292
...
Tag %1 and %2 as both input and output
2017-12-30 14:52:03 +01:00
Martin Kroeker
cc9500db41
Merge pull request #1403 from brada4/develop
...
Address few more warnings
2017-12-30 14:51:34 +01:00
Martin Kroeker
723f396a20
Tag %1 and %2 as both input and output
...
The inline assembly modifies its input operands, so mark them as output to avoid surprises with optimization. Fixes #1292
2017-12-29 23:56:41 +01:00
Andrew
03e5ff0687
initialize potentially unitialized variables (clang5)
2017-12-26 09:24:24 +01:00
Andrew
47deec2c1a
fix couple of dead assignment warnings
2017-12-22 00:56:35 +01:00
Martin Kroeker
43c0622e7b
Retire Piledriver/Steamroller/Excavator daxpy microkernels as well
...
related to issue #1332
2017-12-13 18:40:39 +01:00
Martin Kroeker
0623636c98
Use Sandybridge daxpy kernel on Haswell and Zen for now
...
The testcase from #1332 exposes a problem in daxpy_microk_haswell-2.c that is not seen with
any of the other Intel x86_64 microkernels.
2017-12-10 19:24:31 +01:00
Andrew
281a2b952f
warning cleanup ( #1380 )
...
* dead increments in driver/level2
* dead increments in kernel/generic
* part dead increments in kernel/x86_64
2017-12-05 19:54:10 +01:00
Martin Kroeker
8213385ab8
Work around compiler warnings for unused variables in the generic zgemm3m_Xcopy kernels
2017-12-02 22:51:58 +01:00
Martin Kroeker
db00a51e6b
Merge pull request #1371 from martin-frbg/develop
...
Add trivially optimized DSDOT for POWER8
2017-11-29 19:55:21 +01:00
martin
7a4b3cfbf8
Add trivially optimized DSDOT for POWER8
2017-11-28 18:38:07 +01:00
Martin Kroeker
6c77b5f267
Merge pull request #1369 from martin-frbg/dsdot
...
Add optimized dsdot to all other x86_64 kernels that use sdot.c
2017-11-28 18:15:31 +01:00
Andrew
441a9c8385
more dead increments clang4 scan-build deadcode.deadstores
2017-11-26 17:24:08 +01:00
Andrew
1236dbe5a6
Eliminate 2-8 dead increments code
2017-11-26 13:26:11 +01:00
Martin Kroeker
c92cd6d162
Add trivially optimized dsdot based on sdot
2017-11-24 20:05:27 +01:00
Martin Kroeker
cae5d9a20b
Add trivially optimized dsdot based on sdot
2017-11-24 20:04:29 +01:00
Martin Kroeker
3d891c3106
Add trivially optimized dsdot based on sdot
2017-11-24 20:03:40 +01:00
Martin Kroeker
4fbdcfa823
Add trivially optimized dsdot based on sdot
2017-11-24 20:02:28 +01:00
Martin Kroeker
1bb6a96ebc
Add trivially optimized dsdot based on sdot
2017-11-24 20:01:42 +01:00
Martin Kroeker
6bd163f37a
Add trivially optimized dsdot based on sdot
2017-11-24 20:00:23 +01:00
Martin Kroeker
f0333333d1
Add trivially optimized dsdot based on sdot
2017-11-24 19:59:28 +01:00
Andrew
e89b979b2c
fix spurious compiler warning fix (no code change)
2017-11-24 18:39:04 +01:00
Andrew
7e9b29b9b8
fix spurious compiler warning (no code change)
2017-11-24 18:36:37 +01:00
Martin Kroeker
6157d0902a
Merge pull request #1358 from martin-frbg/unused_vars
...
Clean up spurious unused variables in the kernels
2017-11-15 11:31:43 +01:00
Martin Kroeker
3fea849bbf
Remove unused variables from Haswell dtrmm and Bulldozer dtrsm
2017-11-14 23:35:10 +01:00
Martin Kroeker
8f177621bc
Remove unused variables at0...at3 from ?symv_U
2017-11-14 23:32:25 +01:00
Martin Kroeker
5f402b7759
Remove unused (loop?) variable j from the gemv_n_4 implementations
2017-11-14 23:29:42 +01:00
Martin Kroeker
65bf0a343c
Remove unused variable btpr
2017-11-14 23:25:50 +01:00
Martin Kroeker
acf3d34bc5
Silence an unused variable warning with a cast
...
l2 cache size is not universally needed to assign default unrolling limits, but neither putting its declaration inside an ifdef nor cloning it into all ifdef sections that need it really makes sense here.
2017-11-14 23:23:44 +01:00
Martin Kroeker
ab87ee6b48
Merge pull request #1329 from martin-frbg/dsdot
...
(Trivial) optimized dsdot implementation for HASWELL
2017-10-25 19:13:38 +02:00
Martin Kroeker
a07807caac
Eliminate loop code when called as/from dsdot
2017-10-25 16:45:41 +02:00
Ashwin Sekhar T K
a0128aa489
ARM64: Convert all labels to local labels
...
While debugging/profiling applications using perf or other tools, the
kernels appear scattered in the profile reports. This is because the labels
within the kernels are not local and each label is shown as a separate
function.
To avoid this, all the labels within the kernels are changed to local
labels.
2017-10-24 11:40:05 +00:00
Martin Kroeker
0e2cf102e1
Fix 32bit HASWELL
2017-10-24 10:07:44 +02:00
Martin Kroeker
5e3e91d0fc
Split the microkernel workload into chunks of 32 floats for dsdot mode to limit loss of precision
2017-10-22 18:18:51 +02:00
Martin Kroeker
28c3fa8950
Add dsdot
2017-10-16 23:29:03 +02:00
Martin Kroeker
8ac87c1cb6
Implement DSDOT with unchanged sdot microkernels
2017-10-16 23:27:51 +02:00
Martin Kroeker
c7a8512d12
Cmake fixes for DYNAMIC_ARCH builds and whitespace in path names ( #1323 )
...
* prebuild.cmake: Put quotes around path names that may contain whitespace
(Copied from alexkaratakis' PR #1295 )
* kernel/CMakeLists.txt: Fix common_lapack header inclusion and DYNAMIC_ARCH generation of ?neg_tcopy and ?laswp_ncopy files
* lapack/CMakeLists.txt: Use correct template for ?laswp_(plus,minus) functions
2017-10-09 23:34:18 +02:00
Martin Kroeker
97ecd4996a
Merge pull request #1319 from martin-frbg/issue601
...
Fix out-of-bounds memory accesses exposed by xccblat3 testcase
2017-10-08 23:31:06 +02:00
Martin Kroeker
1eb43cccad
Merge pull request #1317 from martin-frbg/power8-asm
...
Save and restore VSX registers
2017-10-08 23:30:46 +02:00
Martin Kroeker
9d92f526dd
Comment out a code block that performs out-of-bounds memory accesses
...
...and does not appear to be needed even when it stays within the bounds of the array
2017-10-06 23:51:32 +02:00
Martin Kroeker
514d237257
Merge pull request #1279 from xsacha/develop
...
CMake improvements
2017-10-06 21:13:45 +02:00
Martin Kroeker
f96afd94b0
Fix out-of-bounds accesses where the data should be zero anyway
2017-10-01 01:06:39 +02:00
Martin Kroeker
9c017a2218
Save and restore VSX registers
2017-09-28 12:17:09 +02:00
Shivraj Patil
e3d844b062
Added mips I6500 core
...
Signed-off-by: Shivraj Patil <shivraj.patil@imgtec.com>
2017-09-22 11:57:43 +05:30
Martin Kroeker
46c9357c72
Merge pull request #1288 from quickwritereader/develop
...
Optimized standard Blas Level-1,2 (excluding nrm2 functions) for z13 (double precision). Issue 884
2017-09-09 23:47:17 +02:00
Abdurrauf
1cfdb2295d
Optimized standard Blas Level-1,2 (excluding nrm2 functions) for z13 (double precision)
2017-09-06 16:41:08 +04:00
Sacha Refshauge
47ebce4d1a
Clean up, fix old typos. Simplify arch usages. Move system arch check to earlier position.
2017-08-21 00:37:29 +10:00
Sacha Refshauge
69b560751c
Improvements to previous commit (cross-compile).
...
Fix typos and bad if statements discovered in 0.2.20.
2017-08-20 22:50:31 +10:00
Sacha Refshauge
11911fd941
Add kernel/Makefile.LA to CMake
2017-08-20 00:59:14 +10:00
Isuru Fernando
d3b677fe87
Add commonobjs
2017-08-07 23:12:40 +05:30
Isuru Fernando
505b218829
Merge remote-tracking branch 'upstream/develop' into dyn
2017-08-06 19:07:00 +05:30
Isuru Fernando
d9346930dd
Merge remote-tracking branch 'upstream/develop' into develop
2017-08-04 07:57:55 +05:30
Ashwin Sekhar T K
4899d67f7d
THUDNERX2T99: Fix clang compilation
2017-08-02 11:28:45 -07:00
Isuru Fernando
1d1854032b
Add missing EXCAVATOR
2017-08-02 19:03:04 +05:30
Isuru Fernando
2c51a990ac
Fix extra whitespaces. CMake parser macro fails with it
...
TODO: Fix the parser macro to strip trailing whitespaces
2017-08-02 18:26:57 +05:30
Isuru Fernando
7892434572
Add hemm3m and symm3m objects
2017-08-02 18:24:54 +05:30
Isuru Fernando
d798487213
Fixes for dynamic_arch. almost there
2017-08-02 17:25:49 +05:30
Isuru Fernando
251715d9ef
configure kernel_core.h
2017-08-01 23:23:55 +05:30
Isuru Fernando
50deeb49b7
configure setparam
2017-08-01 22:32:47 +05:30
Isuru Fernando
4260215adf
Support DYNAMIC_ARCH with cmake
2017-08-01 22:25:52 +05:30
Isuru Fernando
d245caa49a
Support out-of-source build
2017-08-01 15:16:14 +05:30
Isuru Fernando
ca17b4b75c
Fix complex support for MSVC headers
2017-07-28 11:50:29 +05:30
Isuru Fernando
dc24914415
check compiler is msvc instead of msvc
2017-07-28 11:49:39 +05:30
Zhang Xianyi
d5ef0dee9a
Merge pull request #1226 from ashwinyes/develop_arm_clang_ual_fix
...
arm: Fix clang compilation for ARMv7
2017-07-10 20:04:42 +08:00
Zhang Xianyi
4239dd65ce
Merge branch 'develop' into develop_arm_softfp
2017-07-10 20:02:36 +08:00
Ashwin Sekhar T K
f02d535fde
arm: Fix clang compilation for ARMv7
...
clang is not recognizing some pre-UAL VFP mnemonics like fnmacs, fnmacd,
fnmuls and fnmuld. Replaced them with equivalent UAL mnemonics which are
vmls.f32, vmls.f64, vnmul.f32 and vnmul.f64 respectively.
2017-07-07 12:35:58 +05:30
Zhang Xianyi
a6515bb858
Merge pull request #1218 from m-brow/power9
...
Optimise loads on Power9 LE
2017-07-03 13:48:29 +08:00
Ashwin Sekhar T K
37efb5bc1d
arm: Remove unnecessary files/code
...
Since softfp code has been added to all required vfp kernels,
the code for auto detection of abi is no longer required.
The option to force softfp ABI on make command line by giving
ARM_SOFTFP_ABI=1 is retained. But there is no need to give this option
anymore.
Also the newly added C versions of 4x4/4x2 gemm/trmm kernels are removed.
These are longer required. Moreover these kernels has bugs.
2017-07-02 03:06:36 +05:30
Ashwin Sekhar T K
97d671eb61
arm: add softfp support in zgemm/ztrmm vfp kernels
2017-07-02 02:54:32 +05:30
Ashwin Sekhar T K
305cd2e8b4
arm: add softfp support in cgemm/ctrmm vfp kernels
2017-07-02 02:42:32 +05:30
Ashwin Sekhar T K
09bc6ebe5b
arm: add softfp support in dgemm/dtrmm vfp kernels
2017-07-02 02:24:38 +05:30
Ashwin Sekhar T K
872a11a2bf
arm: add softfp support in sgemm/strmm vfp kernels
2017-07-02 02:23:48 +05:30
Ashwin Sekhar T K
eda9e8632a
generic: Bug fixes in generic 4x2 and 4x4 gemm kernels
2017-07-02 02:00:48 +05:30
Ashwin Sekhar T K
8f83d3f961
arm: add softfp support in vfp gemv kernels
2017-07-02 01:03:31 +05:30
Ashwin Sekhar T K
83bd547517
arm: add softfp support in kernel/arm/swap_vfp.S
2017-07-01 20:37:40 +05:30
Ashwin Sekhar T K
e25f4c01d6
arm: add softfp support in kernel/arm/nrm2_vfp*.S
2017-07-01 19:57:28 +05:30
Ashwin Sekhar T K
54915ce343
arm: add softfp support in kernel/arm/*dot_vfp.S
2017-06-30 23:46:02 +05:30
Ashwin Sekhar T K
0150fabdb6
arm: add softfp support in kernel/arm/rot_vfp.S
2017-06-30 21:52:32 +05:30
Ashwin Sekhar T K
4f0773f07d
arm: add softfp support in kernel/arm/axpy_vfp.S
2017-06-30 20:25:59 +05:30
Ashwin Sekhar T K
aa5edebc80
arm: add softfp support in kernel/arm/asum_vfp.S
2017-06-30 18:21:05 +05:30
Ashwin Sekhar T K
89924b3d5b
arm: Use assembly implementations based on the ARM abi
...
In case of softfp abi, assembly implementations of only those APIs are
used which doesnt have a floating point argument or return value.
In case of hard abi, all assembly implementations are used.
2017-06-30 18:21:05 +05:30
Ashwin Sekhar T K
da7f0ff425
generic: add some generic gemm and trmm kernels
...
Added generic 4x4 and 4x2 gemm kernels
Added generic 4x2 trmm kernel
2017-06-30 18:21:05 +05:30
Zhang Xianyi
482015f8d6
Merge branch 'arm_soft_fp_abi' into develop
2017-06-23 11:35:25 +08:00
Matt Brown
bd831a03a8
Optimise sscal for POWER9
...
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 17:02:46 +10:00
Matt Brown
edc97918f8
Optimise srot for POWER9
...
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 17:02:35 +10:00
Matt Brown
e0034de22d
Optimise sdot for POWER9
...
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 17:02:19 +10:00
Matt Brown
32c7fe6bff
Optimise sasum for POWER9
...
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 17:02:10 +10:00
Matt Brown
19bdf9d52b
Optimise casum for POWER9
...
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 17:00:07 +10:00
Matt Brown
4f09030fdc
Optimise cswap for POWER9
...
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 16:59:53 +10:00
Matt Brown
6f4eca5ea4
Optimise sswap for POWER9
...
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 16:59:13 +10:00
Matt Brown
be55f96cbd
Optimise scopy for POWER9
...
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 16:59:13 +10:00
Matt Brown
96dd0ef4f7
Optimise ccopy for POWER9
...
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 16:58:59 +10:00
Alan Modra
dc40bc7368
Power8 inline assembly tweaks
...
Further fixes on top of 9e2f316ed
. Writing some doco for gcc on
inline assembly woke me up to some more errors.
- dgemv_kernel_4x4 asm did not mention *ap as a memory input, and
*y is both read and write.
- sasum_kernel_32 and casum_kernel_16 did not use %x for a vsx insn
operand, a problem if the "=f" sum output was ever allocated a vsx
reg in the altivec set. This might be possible with inlining and
future gcc optimisation.
2017-04-04 23:13:54 +09:30
Zhang Xianyi
b5c96fcfcd
Support ARM SOFTFP ABI for saxpy, sdot, snrm2, sscal, sgemv, sger.
2017-03-20 17:39:25 +08:00
Denis Steckelmacher
c9ff735da6
Add ZEN support (tested for auto-detected static backend)
2017-03-19 15:32:50 +01:00
Martin Kroeker
cd135e2b59
Merge pull request #1130 from quickwritereader/develop
...
Blas 3 for single precision
2017-03-15 10:00:52 +01:00
Martin Kroeker
a6efabf155
Replace gnu _real_ , _imag_ extensions in initializers
2017-03-13 00:38:37 +01:00
Abdurrauf
08786c4b95
strmm and ctrmm
2017-03-13 01:23:16 +04:00
Zhang Xianyi
90e02ccf68
Support ARM softfp ABI for sgemm on ARMV7.
...
make ARM_SOFTFP_ABI=1
2017-03-06 22:16:13 +08:00
Abdurrauf
82e80fa82b
initial strmm(sgemm). not tuned yet
2017-03-06 04:27:40 +04:00
Martin Kroeker
d1fe040d9b
Merge pull request #1110 from quickwritereader/develop
...
Conventional usage of the register save area.
2017-03-01 23:08:07 +01:00
Abdurrauf
411982715c
conventional usage of the register save area
2017-03-01 20:39:39 +04:00
Abdurrauf
e831d6924e
changed to conventional register save area
2017-03-01 03:13:21 +04:00
Martin Kroeker
ffc1d6c468
Merge pull request #1108 from ashwinyes/develop_20170203_thunderx2t99
...
Optimized Implementations for ThunderX2T99
2017-02-28 16:02:19 +01:00
Ashwin Sekhar T K
67473d09dd
THUNDERX2T99: Bug Fixes in D/Z NRM2 and ZGEMM
2017-02-28 01:11:38 -08:00
Ashwin Sekhar T K
19ba133383
THUNDERX2T99: Add Optimized ZGEMM Implementation
2017-02-28 05:31:41 +00:00
Abdurrauf
0d96b0e2a7
Merge branch 'z13' into develop
2017-02-26 06:17:33 +04:00
Abdurrauf
848cb27b1e
ztrmm kernel.
2017-02-26 06:14:12 +04:00
Martin Kroeker
dc34a0da96
Merge pull request #915 from mdong/small_fix_for_icc
...
remove input from clobbered list
2017-02-23 20:00:22 +01:00
Ashwin Sekhar T K
a3935f0dfb
THUNDERX2T99: Add Optimized D/Z NRM2 Implementation
2017-02-23 10:02:15 -08:00
Ashwin Sekhar T K
738628e9a8
ARM64: Remove unused code
2017-02-21 21:42:32 -08:00
Ashwin Sekhar T K
ab3ffab96a
THUNDERX2T99: Add Optimized C/Z DOT Implementation
2017-02-21 03:40:59 -08:00
Ashwin Sekhar T K
f036be9ce2
THUNDERX2T99: Add Optimized SDOT Implementation
2017-02-21 03:24:32 -08:00
Ashwin Sekhar T K
faba876fda
THUNDERX2T99: Bug fix in C/Z IAMAX
2017-02-19 23:11:50 -08:00
Ashwin Sekhar T K
172a62d73e
THUNDERX2T99: Add Optimized C/Z IAMAX Implementation
2017-02-17 03:06:32 -08:00
Ashwin Sekhar T K
228c75a69c
THUNDERX2T99: Add parallel SCNRM2 Implementation
2017-02-14 04:10:06 -08:00
Martin Kroeker
9e2f316ede
Power8 inline assembly fixes
...
Quoting patch author amodra from #1078
Lots of issues here.
- The vsx regs weren't listed as clobbered.
- Poor choice of vsx regs, which along with the lack of clobbers led to
trashing v0..v21 and fr14..fr23. Ideally you'd let gcc choose all
temp vsx regs, but asms currently have a limit of 30 i/o parms.
- Other regs were clobbered unnecessarily, seemingly in an attempt to
clobber inputs, with gcc-7 complaining about the clobber of r2.
(Changed inputs should be also listed as outputs or as an i/o.)
- "r" constraint used instead of "b" for gprs used in insns where the
r0 encoding means zero rather than r0.
- There were unused asm inputs too.
- All memory was clobbered rather than hooking up memory outputs with
proper memory constraints, and that and the lack of proper memory
input constraints meant the asms needed to be volatile and their
containing function noinline.
- Some parameters were being passed unnecessarily via memory.
- When a copy of a
2017-02-13 23:38:50 +01:00
Ashwin Sekhar T K
8e89668f62
THUNDERX2T99: Fix bug in SNRM2
2017-02-07 02:14:33 -08:00
Ashwin Sekhar T K
f63deae9de
THUNDERX2T99: Add Optimized S/D IAMAX Implementation
2017-02-07 01:35:55 -08:00
Martin Kroeker
60eea75409
Merge pull request #1076 from ashwinyes/develop_20170130_thunderx2t99
...
More optimized implementations for ThunderX2T99
2017-02-04 17:25:43 +01:00
Ashwin Sekhar T K
071a830e8b
THUNDERX2T99: Add optimized S/D/C/Z SWAP Implementations
2017-02-03 03:55:06 -08:00
Ashwin Sekhar T K
d09f88192c
THUNDERX2T99: Add optimized S/D/C/Z COPY Implementations
2017-02-02 15:26:38 +05:30
Ashwin Sekhar T K
e58233460a
THUDNERX2T99: Add optimized D/C/Z ASUM Implementations
2017-02-02 15:26:22 +05:30
Ashwin Sekhar T K
99bd2892bf
THUNDERX2T99: Add optimized CASUM Implementation
2017-01-30 17:44:32 +05:30
Ashwin Sekhar T K
ff6f572f2e
THUNDERX2T99: Rename labels in for DDOT and SNRM2
2017-01-30 17:44:32 +05:30
Ashwin Sekhar T K
e0dc5f58c5
THUNDERX2T99: Remove Duplicate Code
2017-01-30 17:44:32 +05:30
Ashwin Sekhar T K
2757b49767
THUNDERX2T99: Add Optimized CGEMM Implementation
2017-01-30 17:44:26 +05:30
Zhang Xianyi
ff41e13385
Merge pull request #1074 from ashwinyes/develop_20170116_thunderx2t99_sgemm
...
Add more THUNDERX2T99 Optimized APIs
2017-01-25 22:17:05 +08:00
Ashwin Sekhar T K
907e286eb6
THUNDERX2T99: Add threaded SNRM2 Implementation
2017-01-24 21:39:29 +05:30
Ashwin Sekhar T K
cde3aee08b
ARM64: Rename kernel files to have consistent naming
2017-01-24 14:53:34 +05:30
Ashwin Sekhar T K
ee6ea7e988
THUNDERX2T99: Add Optimized CNRM2 Implementation
2017-01-24 10:23:32 +05:30
Ashwin Sekhar T K
ca0b36b012
THUNDERX2T99: Add Optimized SNRM2 Implementation
2017-01-24 10:23:21 +05:30
Ashwin Sekhar T K
d0a79ca6e0
THUNDERX2T99: Add threaded DDOT Implementation
2017-01-19 11:11:42 +05:30
Ashwin Sekhar T K
0c07003ccf
THUNDERX2T99: Add Optimized DDOT Implementation
2017-01-19 11:11:07 +05:30
Ashwin Sekhar T K
f33fcedb30
THUNDERX2T99: Improve SGEMM
2017-01-19 11:11:07 +05:30
Ashwin Sekhar T K
0f1d6e8b39
THUNDERX2T99: Improve DGEMM
2017-01-19 11:11:07 +05:30
Ashwin Sekhar T K
981064acc6
THUNDERX2T99: Add Optimized DAXPY Implementation
2017-01-19 11:10:57 +05:30
Shivraj Patil
a4d97d980f
Added rot functions.
...
Signed-off-by: Shivraj Patil <shivraj.patil@imgtec.com>
2017-01-17 12:15:07 +05:30
Ashwin Sekhar T K
f279ff4789
THUNDERX2T99: Add Optimized SGEMM Implementation
2017-01-16 21:44:33 +05:30
Ashwin Sekhar T K
759f37feba
ARM64: Let target VULCAN inherit THUNDERX2T99 properties
2017-01-16 21:44:19 +05:30
Zhang Xianyi
0863a0d4b4
Merge pull request #1061 from ashwinyes/develop_aarch64_vulcan_thunderx_patch
...
Add new targets for ARM64
2017-01-16 13:20:10 +08:00
Werner Saar
28e2fab33e
prepared kernel/setparam-ref.c for UNROLL values, that are not a power of two
2017-01-11 11:56:50 +01:00
Ashwin Sekhar T K
4b55fae337
ARM64: Add Cavium THUNDERX2T99 Target
2017-01-11 11:18:40 +05:30
Andrew Pinski
95649dee28
THUNDERX: Add optimized version of daxpy
...
This is better for single core but does not change anything for multiple cores
2017-01-11 11:18:36 +05:30
Andrew Pinski
8fdb0655e9
THUNDERX: Add an optimized version of ddot
2017-01-10 15:01:37 +05:30
Andrew Pinski
fb200c7245
ARM64: Add Cavium THUNDERX Target
2017-01-10 15:01:37 +05:30
Ashwin Sekhar T K
0b8e876d89
VULCAN: Add optimized DGEMM implementation
2017-01-10 15:01:37 +05:30
Ashwin Sekhar T K
4713e7c47f
ARM64: Add the VULCAN Target
2017-01-10 15:01:17 +05:30
Ashwin Sekhar T K
6085386b10
CORTEXA57: Add assembly kernels for copy routines
2017-01-10 15:01:05 +05:30