Commit Graph

830 Commits

Author SHA1 Message Date
Zhang Xianyi a6515bb858 Merge pull request #1218 from m-brow/power9
Optimise loads on Power9 LE
2017-07-03 13:48:29 +08:00
Zhang Xianyi 482015f8d6 Merge branch 'arm_soft_fp_abi' into develop 2017-06-23 11:35:25 +08:00
Matt Brown bd831a03a8 Optimise sscal for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 17:02:46 +10:00
Matt Brown edc97918f8 Optimise srot for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 17:02:35 +10:00
Matt Brown e0034de22d Optimise sdot for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 17:02:19 +10:00
Matt Brown 32c7fe6bff Optimise sasum for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 17:02:10 +10:00
Matt Brown 19bdf9d52b Optimise casum for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 17:00:07 +10:00
Matt Brown 4f09030fdc Optimise cswap for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 16:59:53 +10:00
Matt Brown 6f4eca5ea4 Optimise sswap for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 16:59:13 +10:00
Matt Brown be55f96cbd Optimise scopy for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 16:59:13 +10:00
Matt Brown 96dd0ef4f7 Optimise ccopy for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
2017-06-14 16:58:59 +10:00
Alan Modra dc40bc7368 Power8 inline assembly tweaks
Further fixes on top of 9e2f316ed.  Writing some doco for gcc on
inline assembly woke me up to some more errors.

- dgemv_kernel_4x4 asm did not mention *ap as a memory input, and
  *y is both read and write.
- sasum_kernel_32 and casum_kernel_16 did not use %x for a vsx insn
  operand, a problem if the "=f" sum output was ever allocated a vsx
  reg in the altivec set.  This might be possible with inlining and
  future gcc optimisation.
2017-04-04 23:13:54 +09:30
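As a rough illustration of the memory-operand point in the commit above (this is not the actual dgemv_kernel_4x4 code), the sketch below marks a location that the asm both reads and writes with a "+m" operand, and a read-only location with an "m" input. The helper name axpy1 is hypothetical. On non-POWER targets the function falls back to plain C.

```c
/* Hypothetical helper, not taken from OpenBLAS: computes *y += a * (*x). */
static void axpy1(double a, const double *x, double *y)
{
#if defined(__GNUC__) && defined(__powerpc64__)
    double t, u;  /* temporaries; the compiler picks the registers */
    __asm__ (
        "lfd   %[t],0(%[xp])\n\t"        /* t = *x */
        "lfd   %[u],0(%[yp])\n\t"        /* u = *y */
        "fmadd %[u],%[a],%[t],%[u]\n\t"  /* u = a * t + u */
        "stfd  %[u],0(%[yp])"            /* *y = u */
        : [t] "=&f" (t), [u] "=&f" (u),  /* early-clobber: written before inputs are done */
          "+m" (*y)                      /* *y is both read and written */
        : [a] "f" (a),
          [xp] "b" (x), [yp] "b" (y),    /* base registers, never r0 */
          "m" (*x)                       /* *x is read */
    );
#else
    /* Portable fallback so the sketch builds on other targets. */
    *y += a * (*x);
#endif
}
```

Because every memory access is declared through an operand, the asm in this sketch needs neither a "memory" clobber nor the volatile qualifier.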
Zhang Xianyi b5c96fcfcd Support ARM SOFTFP ABI for saxpy, sdot, snrm2, sscal, sgemv, sger. 2017-03-20 17:39:25 +08:00
Denis Steckelmacher c9ff735da6 Add ZEN support (tested for auto-detected static backend) 2017-03-19 15:32:50 +01:00
Martin Kroeker cd135e2b59 Merge pull request #1130 from quickwritereader/develop
Blas 3 for single precision
2017-03-15 10:00:52 +01:00
Martin Kroeker a6efabf155 Replace gnu _real_ , _imag_ extensions in initializers 2017-03-13 00:38:37 +01:00
Abdurrauf 08786c4b95 strmm and ctrmm 2017-03-13 01:23:16 +04:00
Zhang Xianyi 90e02ccf68 Support ARM softfp ABI for sgemm on ARMV7.
make ARM_SOFTFP_ABI=1
2017-03-06 22:16:13 +08:00
Abdurrauf 82e80fa82b initial strmm(sgemm). not tuned yet 2017-03-06 04:27:40 +04:00
Martin Kroeker d1fe040d9b Merge pull request #1110 from quickwritereader/develop
Conventional usage of the register save area.
2017-03-01 23:08:07 +01:00
Abdurrauf 411982715c conventional usage of the register save area 2017-03-01 20:39:39 +04:00
Abdurrauf e831d6924e changed to conventional register save area 2017-03-01 03:13:21 +04:00
Martin Kroeker ffc1d6c468 Merge pull request #1108 from ashwinyes/develop_20170203_thunderx2t99
Optimized Implementations for ThunderX2T99
2017-02-28 16:02:19 +01:00
Ashwin Sekhar T K 67473d09dd THUNDERX2T99: Bug Fixes in D/Z NRM2 and ZGEMM 2017-02-28 01:11:38 -08:00
Ashwin Sekhar T K 19ba133383 THUNDERX2T99: Add Optimized ZGEMM Implementation 2017-02-28 05:31:41 +00:00
Abdurrauf 0d96b0e2a7 Merge branch 'z13' into develop 2017-02-26 06:17:33 +04:00
Abdurrauf 848cb27b1e ztrmm kernel. 2017-02-26 06:14:12 +04:00
Martin Kroeker dc34a0da96 Merge pull request #915 from mdong/small_fix_for_icc
remove input from clobbered list
2017-02-23 20:00:22 +01:00
Ashwin Sekhar T K a3935f0dfb THUNDERX2T99: Add Optimized D/Z NRM2 Implementation 2017-02-23 10:02:15 -08:00
Ashwin Sekhar T K 738628e9a8 ARM64: Remove unused code 2017-02-21 21:42:32 -08:00
Ashwin Sekhar T K ab3ffab96a THUNDERX2T99: Add Optimized C/Z DOT Implementation 2017-02-21 03:40:59 -08:00
Ashwin Sekhar T K f036be9ce2 THUNDERX2T99: Add Optimized SDOT Implementation 2017-02-21 03:24:32 -08:00
Ashwin Sekhar T K faba876fda THUNDERX2T99: Bug fix in C/Z IAMAX 2017-02-19 23:11:50 -08:00
Ashwin Sekhar T K 172a62d73e THUNDERX2T99: Add Optimized C/Z IAMAX Implementation 2017-02-17 03:06:32 -08:00
Ashwin Sekhar T K 228c75a69c THUNDERX2T99: Add parallel SCNRM2 Implementation 2017-02-14 04:10:06 -08:00
Martin Kroeker 9e2f316ede Power8 inline assembly fixes
Quoting patch author amodra from #1078
Lots of issues here.
- The vsx regs weren't listed as clobbered.
- Poor choice of vsx regs, which along with the lack of clobbers led to
  trashing v0..v21 and fr14..fr23.  Ideally you'd let gcc choose all
  temp vsx regs, but asms currently have a limit of 30 i/o parms.
- Other regs were clobbered unnecessarily, seemingly in an attempt to
  clobber inputs, with gcc-7 complaining about the clobber of r2.
  (Changed inputs should be also listed as outputs or as an i/o.)
- "r" constraint used instead of "b" for gprs used in insns where the
  r0 encoding means zero rather than r0.
- There were unused asm inputs too.
- All memory was clobbered rather than hooking up memory outputs with
  proper memory constraints, and that and the lack of proper memory
  input constraints meant the asms needed to be volatile and their
  containing function noinline.
- Some parameters were being passed unnecessarily via memory.
- When a copy of a
2017-02-13 23:38:50 +01:00
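Two of the points quoted above, the "b" constraint and proper memory input constraints, can be sketched in a few lines. This is only an illustration under assumptions, not code from the patch: the helper name load_double is hypothetical, and the function falls back to plain C on non-POWER targets.

```c
/* Hypothetical helper, not taken from OpenBLAS: returns *p. */
static double load_double(const double *p)
{
#if defined(__GNUC__) && defined(__powerpc64__)
    double v;
    __asm__ (
        "lfd %[v],0(%[base])"   /* v = *(double *)(base + 0) */
        : [v] "=f" (v)
        : [base] "b" (p),       /* "b" excludes r0, which lfd would treat as zero */
          "m" (*p)              /* tells gcc that the asm reads *p */
    );
    return v;
#else
    /* Portable fallback so the sketch builds on other targets. */
    return *p;
#endif
}
```

With "r" in place of "b", gcc would be free to put the pointer in r0, and the load would then use address zero as its base instead of the pointer value.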
Ashwin Sekhar T K 8e89668f62 THUNDERX2T99: Fix bug in SNRM2 2017-02-07 02:14:33 -08:00
Ashwin Sekhar T K f63deae9de THUNDERX2T99: Add Optimized S/D IAMAX Implementation 2017-02-07 01:35:55 -08:00
Martin Kroeker 60eea75409 Merge pull request #1076 from ashwinyes/develop_20170130_thunderx2t99
More optimized implementations for ThunderX2T99
2017-02-04 17:25:43 +01:00
Ashwin Sekhar T K 071a830e8b THUNDERX2T99: Add optimized S/D/C/Z SWAP Implementations 2017-02-03 03:55:06 -08:00
Ashwin Sekhar T K d09f88192c THUNDERX2T99: Add optimized S/D/C/Z COPY Implementations 2017-02-02 15:26:38 +05:30
Ashwin Sekhar T K e58233460a THUNDERX2T99: Add optimized D/C/Z ASUM Implementations 2017-02-02 15:26:22 +05:30
Ashwin Sekhar T K 99bd2892bf THUNDERX2T99: Add optimized CASUM Implementation 2017-01-30 17:44:32 +05:30
Ashwin Sekhar T K ff6f572f2e THUNDERX2T99: Rename labels in for DDOT and SNRM2 2017-01-30 17:44:32 +05:30
Ashwin Sekhar T K e0dc5f58c5 THUNDERX2T99: Remove Duplicate Code 2017-01-30 17:44:32 +05:30
Ashwin Sekhar T K 2757b49767 THUNDERX2T99: Add Optimized CGEMM Implementation 2017-01-30 17:44:26 +05:30
Zhang Xianyi ff41e13385 Merge pull request #1074 from ashwinyes/develop_20170116_thunderx2t99_sgemm
Add more THUNDERX2T99 Optimized APIs
2017-01-25 22:17:05 +08:00
Ashwin Sekhar T K 907e286eb6 THUNDERX2T99: Add threaded SNRM2 Implementation 2017-01-24 21:39:29 +05:30
Ashwin Sekhar T K cde3aee08b ARM64: Rename kernel files to have consistent naming 2017-01-24 14:53:34 +05:30
Ashwin Sekhar T K ee6ea7e988 THUNDERX2T99: Add Optimized CNRM2 Implementation 2017-01-24 10:23:32 +05:30