Commit Graph

135 Commits

Author SHA1 Message Date
CodesWithWolves
d2bda3b56a Remove Unnecessary/Erroneous Reads In sgemm_tcopy_16.S COPY1x8 Macro
Some code appears to have leaked in when copying from the COPY2x8
macro above: we read 8 bytes into each of d4-d7 directly after
reading 4 bytes into each of s4-s7. These 32 bytes in d4-d7 are unused and can
possibly overrun the boundary of allocated memory -- Valgrind detected
this for a 128,1 copy, which is what drew my attention to it.

Additionally, there is no need to update the addresses stored in A0-A7
as the only possible paths after running this macro will overwrite A0-7
if looping to the next 8 rows, or overwrite A0-3 if moving to 4 rows --
in which case A4-7 are unused.
2021-03-31 15:44:25 -04:00
Martin Kroeker
b716c0ef01 Add workaround for NVIDIA HPC 2021-01-12 16:51:35 +01:00
Martin Kroeker
2efa3b70dc Add workaround for NVIDIA HPC 2021-01-12 16:49:39 +01:00
Martin Kroeker
49959d4f1c Add workaround for NVIDIA HPC 2021-01-12 16:47:15 +01:00
Martin Kroeker
0f27a03607 Add workaround for NVIDIA HPC mishandling of the asm DOT kernels 2021-01-12 16:39:35 +01:00
Martin Kroeker
c2a8ebfe69 Add workaround for NVIDIA HPC mishandling of the asm DOT kernels 2021-01-12 16:38:51 +01:00
Ashwin Sekhar T K
1b2508362b arm64: Fix nrm2 for input vectors with Inf
Fix double precision nrm2 kernels returning NaN when the
input vectors contain Inf/-Inf.
2021-01-01 02:49:37 -08:00
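One way a scaled nrm2 loop can turn Inf into NaN is sketched below in Python. It models the textbook scale/sum-of-squares recurrence, not the arm64 assembly this commit touches. The names `nrm2_scaled` and `nrm2_inf_safe` are hypothetical, and the early return is an assumed remedy, not necessarily the one the commit applies.

```python
import math

def nrm2_scaled(x):
    """Textbook scaled sum-of-squares norm with no special handling of Inf."""
    scale, ssq = 0.0, 1.0
    for v in x:
        a = abs(v)
        if math.isnan(a):
            return math.nan          # propagate NaN inputs unchanged
        if a == 0.0:
            continue
        if scale < a:
            # New largest magnitude: rescale the running sum to it.
            r = scale / a
            ssq = 1.0 + ssq * r * r
            scale = a
        else:
            # If scale and a are both inf, r is inf / inf, which is NaN.
            r = a / scale
            ssq += r * r
    return scale * math.sqrt(ssq)

def nrm2_inf_safe(x):
    """Assumed remedy: any Inf/-Inf entry makes the norm Inf."""
    if any(math.isinf(v) for v in x):
        return math.inf
    return nrm2_scaled(x)
```

A single Inf survives the plain recurrence; it is the second Inf that triggers the inf / inf division, so a test vector needs at least two infinite entries to show the problem.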
Martin Kroeker
8631e2976a Temporarily revert to the old nrm2 kernels 2020-12-21 07:45:13 +01:00
Martin Kroeker
2768bc1764 Temporarily revert to the old nrm2 kernels 2020-12-21 07:42:51 +01:00
Martin Kroeker
6f4698ee1f Temporarily revert to the old nrm2 kernel 2020-12-21 07:41:18 +01:00
Martin Kroeker
e1b7123bbe Merge pull request #2867 from Qiyu8/usimd-floatdot
Optimize the performance of dot by using universal intrinsics in X86/ARM
2020-10-10 12:10:25 +02:00
User User-User
d2333e7842 aarch64 fix std=c18 compilation 2020-10-03 18:00:34 +03:00
Qiyu8
60e6c68e38 Adapt to ARM architecture 2020-09-29 16:36:14 +08:00
Martin Kroeker
775a87242d Rename KERNEL.SILICON to KERNEL.VORTEX 2020-09-03 08:44:20 +02:00
Martin Kroeker
80794fe8fd Create KERNEL.SILICON 2020-09-02 22:56:58 +02:00
Ashwin Sekhar T K
4e1be0e481 ARM64: Add THUNDERX3T110 Target 2020-07-26 23:32:24 -07:00
ZhangDanfeng
bc6fd20a40 fix INIT8x4
Signed-off-by: ZhangDanfeng <467688405@qq.com>
2020-06-10 01:01:16 +08:00
ZhangDanfeng
9b7877ccf1 sgemm copy source init
Signed-off-by: ZhangDanfeng <467688405@qq.com>
2020-06-04 02:10:45 +08:00
ZhangDanfeng
f82fa802d1 Insert prefetch
Signed-off-by: ZhangDanfeng <467688405@qq.com>
2020-06-04 02:08:48 +08:00
张丹枫
9df79ae9a3 update sgemm and strmm kernel selecting strategy 2020-05-20 22:26:58 +08:00
张丹枫
a1fc6041cd use general register to speedup 2020-05-20 22:26:58 +08:00
张丹枫
edb423d772 align general register using to strmm_kernel_8x8 2020-05-20 22:26:58 +08:00
zhangdanfeng
0e6eb8c247 sgemm kernel use sgemm_kernel_8x8_cortexa53
Signed-off-by: zhangdanfeng <zhangdanfeng@cloudwalk.cn>
2020-05-20 22:26:58 +08:00
zhangdanfeng
d475db29c6 optimized for cortex-a53
Signed-off-by: zhangdanfeng <zhangdanfeng@cloudwalk.cn>
2020-05-20 22:26:58 +08:00
Ashwin Sekhar T K
8353cb245a ARM64: Improve DAXPY for ThunderX2
Improve performance of DAXPY for ThunderX2
when the vector fits in L1 Cache.
2020-05-07 09:22:50 -07:00
Martin Kroeker
144be81ca1 fix initialization to zero in the NEON SGEMM_BETA kernel as well 2020-03-31 16:53:56 +02:00
Martin Kroeker
07cdd5d05c Fix zero initialization for beta=0 case
use immediate initialization instead of multiplication in case register content is a NaN
2020-03-31 00:21:02 +02:00
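The reason multiplication is not enough for beta=0 can be shown with a small Python model of the `C := beta * C` step. This is only an illustration of the IEEE 754 behaviour, not the kernel code; the function names are hypothetical.

```python
import math

def scale_by_multiply(c, beta):
    """C := beta * C by plain multiplication (NaN and Inf survive beta == 0)."""
    return [beta * v for v in c]

def scale_with_zero_branch(c, beta):
    """C := beta * C, but store literal zeros when beta == 0."""
    if beta == 0.0:
        # Never read the old contents: they may be uninitialised garbage.
        return [0.0] * len(c)
    return [beta * v for v in c]

# Stand-in for a C buffer whose previous contents are arbitrary.
stale = [1.0, math.nan, math.inf]
multiplied = scale_by_multiply(stale, 0.0)
zeroed = scale_with_zero_branch(stale, 0.0)
```

Here `multiplied` still holds NaN in its last two slots, because both `0.0 * nan` and `0.0 * inf` evaluate to NaN, while `zeroed` is all zeros regardless of what the buffer held before.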
s00548429
bec7923a0d Fix the functional bugs for zamax. 2020-03-09 15:36:50 +08:00
Ali Saidi
c623a965f9 Add Neoverse-N1 core
The implementation is a hybrid of the ARMV8 one with some of the
improved TX2 routines along with specifying -march=v8.2-a
2020-02-29 03:22:04 +00:00
Martin Kroeker
e57b11acca Add preliminary support for EMAG8180 2020-02-19 19:00:28 +01:00
Martin Kroeker
456ee2e1f0 Merge pull request #2357 from chenxuqiang/dgemm_beta_zero
kernel/arm64/dgemm_beta.S: add beta == zero branch
2020-01-02 22:28:36 +01:00
shengyang
80db5f11e1 update 2020-01-02 11:01:57 +08:00
chenxuqiang
52de4cc8fd kernel/arm64/dgemm_beta.S: add beta == zero branch
Added a beta == zero branch, which avoids loading the C matrix.

Signed by: Xuqiang Chen <chenxuqiang3@hisilicon.com>
2020-01-01 21:50:45 -05:00
Martin Kroeker
44028581cc Merge pull request #2355 from Zeyiii/dev-zeyi2
Use arm neon instructions to optimize sgemm_beta operation
2020-01-01 22:14:16 +01:00
Martin Kroeker
86ab939936 Merge pull request #2354 from ZuoQ3/develop
[WIP] Use arm neon instructions to optimize tcopy operation
2020-01-01 22:13:37 +01:00
shengyang
8d84403205 Use arm neon instructions to optimize ncopy operation
modified: KERNEL.ARMV8
modified: KERNEL.TSV110
new file: sgemm_ncopy_4.S
2019-12-31 17:06:35 +08:00
w00421467
0833a4846a Use arm neon instructions to optimize sgemm_beta operation 2019-12-31 10:42:03 +08:00
zq
50f7fc1401 [WIP] Use arm neon instructions to optimize tcopy operation 2019-12-31 10:21:23 +08:00
w00421467
3ccf8885ac prefetching for dgemm_beta 2019-12-30 11:45:49 +08:00
w00421467
b7cc69ee62 declare DGEMM_BETA in KERNEL.ARMV8 rather than the generic KERNEL 2019-12-20 10:11:50 +08:00
w00421467
aeef942c4f use arm neon instructions to optimize gemm beta operation 2019-12-17 10:00:13 +08:00
Martin Kroeker
85ccdce8c4 Remove the IOS fallbacks to generic C kernels 2019-10-25 23:02:37 +02:00
Martin Kroeker
a448884a63 Remove automatic label postfixes from macro included only once 2019-10-08 08:37:50 +02:00
Martin Kroeker
3a2df19db6 Fix accidental duplication of jump instruction 2019-10-08 08:09:26 +02:00
Martin Kroeker
56837e9d92 Make local labels in macro compatible with the xcode assembler
... which does not perform the automatic numbering on instantiation that the _@ suffix signifies
2019-10-04 14:53:23 +02:00
Martin Kroeker
3e3ccb9011 Add ARM64 implementations of ?sum
as trivial copies of the respective ?asum kernels with the fabs calls removed
2019-03-30 22:13:36 +01:00
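The relationship between ?asum and ?sum described in this commit can be modelled in a few lines of Python. This is a sketch of the scalar logic only, not the ARM64 assembly; the function names and the early return for non-positive `n` or `inc_x` are assumptions made for the illustration.

```python
def asum_kernel(n, x, inc_x):
    """Sum of absolute values of n elements of x, taken with stride inc_x."""
    if n <= 0 or inc_x <= 0:
        return 0.0
    total = 0.0
    for i in range(n):
        total += abs(x[i * inc_x])
    return total

def sum_kernel(n, x, inc_x):
    """Same loop as asum_kernel with the abs() call removed."""
    if n <= 0 or inc_x <= 0:
        return 0.0
    total = 0.0
    for i in range(n):
        total += x[i * inc_x]
    return total
```

For `x = [1.0, -2.0, 3.0, -4.0]` with unit stride, the two functions differ only in whether the negative entries cancel or add to the total.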
maomao194313
783ba8058f HiSilicon tsv110 CPUs optimization branch
Add HiSilicon tsv110 CPUs optimization branch
2019-03-04 16:30:50 +08:00
Martin Kroeker
7639f2e1f0 Rewrite the conditional for OSX to fix cmake parsing on others
The Makefile variable parser in utils.cmake currently does not handle conditionals. Having the definitions for non-OSX last will at least make cmake builds work again on non-OSX platforms.
2018-12-06 14:04:27 +01:00
Martin Kroeker
6ba30e270d Fix typo that broke CNRM2 on ARMV8 since 0.3.0
must have happened in my #1449
2018-12-06 13:42:25 +01:00
Renato Golin
310ea55f29 Simplifying ARMv8 build parameters
ARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode
(which is not right because TX2 is ARMv8.1) as well as requiring a few
redundancies in the defines, making it harder to maintain and understand
what core has what. A few other minor issues were also fixed.

Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX,
ThunderX2, and XGene.

Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester.

A summary:
 * Removed TX2 code from ARMv8 build, to make sure it is compatible with
   all ARMv8 cores, not just v8.1. Also, the TX2 code has actually
   harmed performance on big cores.
 * Commoned up ARMv8 architectures' defines in params.h, to make sure
   that all will benefit from ARMv8 settings, in addition to their own.
 * Adding a few more cores, using ARMv8's include strategy, to benefit
   from compiler optimisations using mtune. Also updated cache
   information from the manuals, making sure we set good conservative
   values by default. Removed Vulcan, as it's an alias to TX2.
 * Auto-detecting most of those cores, but also updating the forced
   compilation in getarch.c, to make sure the parameters are the same
   whether compiled natively or forced arch.

Benefits:
 * ARMv8 build is now guaranteed to work on all ARMv8 cores
 * Improved performance for ARMv8 builds on some cores (A72, Falkor,
   ThunderX1 and 2: up to 11%) over current develop
 * Improved performance for *all* cores compared to develop branch
   before TX2's patch (9% ~ 36%)
 * ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than
   current develop's branch and 8% faster than develop before TX2 patches

Issues:
 * Regression from current develop branch for A53 (-12%) and A57 (-3%)
   with ARMv8 builds, but still faster than before TX2's commit (+15%
   and +24% respectively). This can be improved with a simplification of
   TX2's code, to be done in future patches. At least the code is
   guaranteed to be ARMv8.0 now.

Comments:
 * CortexA57 builds are unchanged on A57 hardware from develop's branch,
   which makes sense, as it's untouched.
 * CortexA72 builds improve over A57 on A72 hardware, even if they're
   using the same includes, due to new compiler tuning in the makefile.
2018-11-19 16:41:49 +00:00