OpenBLAS

Author	SHA1	Message	Date
Jia-Chen	302f22693a	MOD: optimize normal DGEMM on ARMV8 cortex-A53 & cortex-A55	2021-11-18 21:14:43 +08:00
Martin Kroeker	22bf5c27ba	Add basic support for the Fujitsu A64FX (#3415) * Add initial support for Fujitsu A64FX as generic ARMV8	2021-10-18 15:00:19 +02:00
Martin Kroeker	8c20ca345a	Use Neoverse's current mix of ThunderX2 kernels for Vortex as well	2021-10-06 11:06:43 +02:00
Martin Kroeker	90cc944625	Move alphaI to x22 to leave x18 unused (reserved on OSX)	2021-09-17 09:53:18 +02:00
Martin Kroeker	590fbff06e	move alpha to x19/x20 to leave x18 unused for OSX	2021-09-17 09:42:17 +02:00
Martin Kroeker	380940271b	Move temp to x21 to leave x18 unused (reserved on OSX)	2021-09-17 09:28:19 +02:00
Martin Kroeker	7d75177446	Move temp to x21 to leave x18 unused (reserved on OSX)	2021-09-17 09:24:11 +02:00
Martin Kroeker	0a4ac4b585	Use x21 for I to leave x18 unused (reserved on OSX)	2021-09-17 09:19:51 +02:00
Martin Kroeker	7d4a221579	Remove unused TEMP2 and reshuffle to leave x18 unused (reserved on OSX)	2021-09-17 09:18:25 +02:00
User User-User	39ef0880ae	copy conf	2021-06-19 21:49:58 +02:00
Gilles Gouaillardet	9d292d37b2	arm64: add the missing d9 register to the clobber list Refs. numpy/numpy#18422 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2021-06-14 17:01:28 +09:00
CodesWithWolves	d2bda3b56a	Remove Unnecessary/Erroneous Reads In sgemm_tcopy_16.S COPY1x8 Macro There appears to have been some code leak when copying from the COPY2x8 macro above where we're reading 8 bytes into d4-d7 directly after reading 4 bytes into s4-s7. These 32 bytes in d4-7 are unused and can possibly overrun the boundary of allocated memory -- Valgrind detected this which is what dragged my attention to it for a 128,1 copy. Additionally, there is no need to update the addresses stored in A0-A7 as the only possible paths after running this macro will overwrite A0-7 if looping to the next 8 rows, or overwrite A0-3 if moving to 4 rows -- in which case A4-7 are unused.	2021-03-31 15:44:25 -04:00
Martin Kroeker	b716c0ef01	Add workaround for NVIDIA HPC	2021-01-12 16:51:35 +01:00
Martin Kroeker	2efa3b70dc	Add workaround for NVIDIA HPC	2021-01-12 16:49:39 +01:00
Martin Kroeker	49959d4f1c	Add workaround for NVIDIA HPC	2021-01-12 16:47:15 +01:00
Martin Kroeker	0f27a03607	Add workaround for NVIDIA HPC mishandling of the asm DOT kernels	2021-01-12 16:39:35 +01:00
Martin Kroeker	c2a8ebfe69	Add workaround for NVIDIA HPC mishandling of the asm DOT kernels	2021-01-12 16:38:51 +01:00
Ashwin Sekhar T K	1b2508362b	arm64: Fix nrm2 for input vectors with Inf Fix double precision nrm2 kernels returning NaN when the input vectors contain Inf/-Inf.	2021-01-01 02:49:37 -08:00
Martin Kroeker	8631e2976a	Temporarily revert to the old nrm2 kernels	2020-12-21 07:45:13 +01:00
Martin Kroeker	2768bc1764	Temporarily revert to the old nrm2 kernels	2020-12-21 07:42:51 +01:00
Martin Kroeker	6f4698ee1f	Temporarily revert to the old nrm2 kernel	2020-12-21 07:41:18 +01:00
Martin Kroeker	e1b7123bbe	Merge pull request #2867 from Qiyu8/usimd-floatdot Optimize the performance of dot by using universal intrinsics in X86/ARM	2020-10-10 12:10:25 +02:00
User User-User	d2333e7842	aarch64 fix std=c18 compilation	2020-10-03 18:00:34 +03:00
Qiyu8	60e6c68e38	Adapt ARM architect	2020-09-29 16:36:14 +08:00
Martin Kroeker	775a87242d	Rename KERNEL.SILICON to KERNEL.VORTEX	2020-09-03 08:44:20 +02:00
Martin Kroeker	80794fe8fd	Create KERNEL.SILICON	2020-09-02 22:56:58 +02:00
Ashwin Sekhar T K	4e1be0e481	ARM64: Add THUNDERX3T110 Target	2020-07-26 23:32:24 -07:00
ZhangDanfeng	bc6fd20a40	fix INIT8x4 Signed-off-by: ZhangDanfeng <467688405@qq.com>	2020-06-10 01:01:16 +08:00
ZhangDanfeng	9b7877ccf1	sgemm copy source init Signed-off-by: ZhangDanfeng <467688405@qq.com>	2020-06-04 02:10:45 +08:00
ZhangDanfeng	f82fa802d1	Insert prefetch Signed-off-by: ZhangDanfeng <467688405@qq.com>	2020-06-04 02:08:48 +08:00
张丹枫	9df79ae9a3	update sgemm and strmm kernel selecting strategy	2020-05-20 22:26:58 +08:00
张丹枫	a1fc6041cd	use general register to speedup	2020-05-20 22:26:58 +08:00
张丹枫	edb423d772	align general register using to strmm_kernel_8x8	2020-05-20 22:26:58 +08:00
zhangdanfeng	0e6eb8c247	sgemm kernel use sgemm_kernel_8x8_cortexa53 Signed-off-by: zhangdanfeng <zhangdanfeng@cloudwalk.cn>	2020-05-20 22:26:58 +08:00
zhangdanfeng	d475db29c6	optimized for cortex-a53 Signed-off-by: zhangdanfeng <zhangdanfeng@cloudwalk.cn>	2020-05-20 22:26:58 +08:00
Ashwin Sekhar T K	8353cb245a	ARM64: Improve DAXPY for ThunderX2 Improve performance of DAXPY for ThunderX2 when the vector fits in L1 Cache.	2020-05-07 09:22:50 -07:00
Martin Kroeker	144be81ca1	fix initialization to zero in the NEON SGEMM_BETA kernel as well	2020-03-31 16:53:56 +02:00
Martin Kroeker	07cdd5d05c	Fix zero initialization for beta=0 case use immediate initialization instead of multiplication in case register content is a NaN	2020-03-31 00:21:02 +02:00
s00548429	bec7923a0d	Fix the functional bugs for zamax.	2020-03-09 15:36:50 +08:00
Ali Saidi	c623a965f9	Add Neoverse-N1 core The implementation is a hybrid of the ARMV8 one with some of the improved TX2 routines along with specifying -march=v8.2-a	2020-02-29 03:22:04 +00:00
Martin Kroeker	e57b11acca	Add preliminary support for EMAG8180	2020-02-19 19:00:28 +01:00
Martin Kroeker	456ee2e1f0	Merge pull request #2357 from chenxuqiang/dgemm_beta_zero kernel/arm64/dgemm_beta.S: add beta == zero branch	2020-01-02 22:28:36 +01:00
shengyang	80db5f11e1	update	2020-01-02 11:01:57 +08:00
chenxuqiang	52de4cc8fd	kernel/arm64/dgemm_beta.S: add beta == zero branch added beta == zero branch, and no need to load C matrix. Signed by: Xuqiang Chen <chenxuqiang3@hisilicon.com>	2020-01-01 21:50:45 -05:00
Martin Kroeker	44028581cc	Merge pull request #2355 from Zeyiii/dev-zeyi2 Use arm neon instructions to optimize sgemm_beta operation	2020-01-01 22:14:16 +01:00
Martin Kroeker	86ab939936	Merge pull request #2354 from ZuoQ3/develop [WIP] Use arm neon instructions to optimize tcopy operation	2020-01-01 22:13:37 +01:00
shengyang	8d84403205	Use arm neon instructions to optimize ncopy operation modified: KERNEL.ARMV8 modified: KERNEL.TSV110 new file: sgemm_ncopy_4.S	2019-12-31 17:06:35 +08:00
w00421467	0833a4846a	Use arm neon instructions to optimize sgemm_beta operation	2019-12-31 10:42:03 +08:00
zq	50f7fc1401	[WIP] Use arm neon instructions to optimize tcopy operation	2019-12-31 10:21:23 +08:00
w00421467	3ccf8885ac	prefetching for dgemm_beta	2019-12-30 11:45:49 +08:00

146 Commits