OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Bine Brank	ce329ab686	add sve zhemm copy routines	2022-01-03 15:56:05 +01:00
Bine Brank	0140373802	add sve ztrmm	2022-01-02 19:15:33 +01:00
Bine Brank	f7b6912868	ztrmm sve copy kernels	2021-12-30 21:00:16 +01:00
Bine Brank	40b14e4957	fix zgemm kernel	2021-12-29 11:42:04 +01:00
Bine Brank	6ec4aab875	zgemm sve copy routines	2021-12-26 17:05:46 +01:00
Bine Brank	878064f394	sve zgemm kernel	2021-12-26 08:44:05 +01:00
Bine Brank	683a7548bf	added macros for sve zgemm kernels	2021-12-25 11:46:41 +01:00
Martin Kroeker	7b146e590c	fix function typecast	2021-12-24 20:01:52 +01:00
Martin Kroeker	e9a0e52201	fix function typecast	2021-12-24 20:00:50 +01:00
Martin Kroeker	d1ee6ff73f	fix function typecasts	2021-12-21 18:45:28 +01:00
Bine Brank	e3c9947c0f	prepare kernel for sve zgemm	2021-12-21 11:19:27 +01:00
gxw	8d9b9c6b2a	loongarch64: Optimize dgemm_kernel	2021-12-21 09:33:06 +08:00
Wu Zhigang	92b7b949dd	fix bug in zscal function memset can not be used in zscal because of the stride parameters. Signed-off-by: Wu Zhigang <zhigang.wu@starfivetech.com>	2021-12-15 01:23:30 -08:00
Martin Kroeker	b0a590f4fe	Merge pull request #3475 from wjc404/optimize-A53-dgemm optimize cgemm on ARM cortex A53 & cortex A55	2021-12-12 19:09:08 +01:00
Martin Kroeker	f4d1f0333b	Merge pull request #3474 from rafaelcfsousa/rafael/cmake_power Add CMake support for Power	2021-12-12 19:08:27 +01:00
Jia-Chen	b610d2de37	optimize cgemm on ARM cortex A53 & cortex A55	2021-12-12 17:22:52 +08:00
Martin Kroeker	697e2752d7	Merge pull request #3464 from binebrank/arm_sve_sgemm Add sgemm part for Arm SVE	2021-12-11 20:35:22 +01:00
Bine Brank	a8f62a347b	fix UNROLL_MN and add to targets for SVE	2021-12-11 16:37:23 +01:00
Bine Brank	774267fdac	adjust Makefile.L3 for SVE	2021-12-11 16:35:08 +01:00
Rafael Cardoso Fernandes Sousa	23a7561353	Fix error cmake (small kernels)	2021-12-09 09:57:39 -06:00
Martin Kroeker	5378046abd	roll back DGEMM kernels to 4x8 when compiling for DYNAMIC_ARCH	2021-12-06 19:43:54 +01:00
Bine Brank	a1fea1fe2a	sgemm v2x8 SVE kernel	2021-12-05 18:47:29 +01:00
Bine Brank	abe1ce3434	strmm sve v1x8 kernel	2021-12-05 14:03:08 +01:00
Martin Kroeker	54d321d742	Merge pull request #3466 from rafaelcfsousa/rafael/small_matrix_p10 [POWER] Add small matrix for sgemm/dgemm on Power10	2021-12-03 12:12:20 +01:00
Martin Kroeker	0882db30a2	Merge pull request #3455 from cenewcombe/develop Fix unsafe read during final iteration of zsymv_L_sse2.S	2021-12-03 10:01:20 +01:00
Bine Brank	0de36f7b5c	trmm sve copy fucntions for single precision	2021-11-29 21:25:05 +01:00
Rafael Cardoso Fernandes Sousa	c78fdcc80d	[POWER] Add support for SMALL_MATRIX_OPT	2021-11-28 12:41:16 -06:00
Bine Brank	86ae89bf33	add sgemm kernel and copy functions for sgemm and ssymm	2021-11-28 18:12:47 +01:00
Martin Kroeker	454edd741c	Merge pull request #3425 from binebrank/arm_sve_dgemm Add dgemm kernel for arm64 SVE	2021-11-26 16:14:55 +01:00
Martin Kroeker	bcfbdc81b2	Merge pull request #3459 from rafaelcfsousa/fix_cmake Fix issues when building OpenBLAS with cmake	2021-11-26 15:19:24 +01:00
Bine Brank	1af73ce38e	Adapt CMake for SVE	2021-11-26 10:35:01 +01:00
Martin Kroeker	e7fca060db	Merge pull request #3457 from wjc404/optimize-A53-dgemm MOD: optimize zgemm on cortex-A53/cortex-A55	2021-11-26 10:30:47 +01:00
Jia-Chen	5c1cd5e0c2	MOD: add comments to a53 zgemm kernel	2021-11-25 22:48:48 +08:00
Rafael Cardoso Fernandes Sousa	d5c9353f1b	Modify the order that cmake set the KERNEL variables (generic now is fallback)	2021-11-24 20:08:35 -06:00
Jia-Chen	9f59b19fcd	MOD: optimize zgemm on cortex-A53/cortex-A55	2021-11-24 21:51:45 +08:00
Bine Brank	531a28b6a0	removed unused code (compiler warnings)	2021-11-22 10:12:34 +01:00
Bine Brank	9b9cb90bb1	modify Makefile for SVE copy	2021-11-22 09:54:20 +01:00
Bine Brank	9388f05a3c	configure SVE Makefile	2021-11-21 18:33:43 +01:00
Bine Brank	b58d4f31ab	some clean-up & commentary	2021-11-21 14:56:27 +01:00
Martin Kroeker	b7df500106	Add generic mips32 target	2021-11-20 17:31:51 +01:00
Bine Brank	e6ed4be02e	symm SVE copy rutines	2021-11-20 16:35:29 +01:00
Caroline Newcombe	feeb8283a5	Fix unsafe read during final iteration of zsymv_L_sse2.S	2021-11-19 14:29:32 -06:00
Jia-Chen	302f22693a	MOD: optimize normal DGEMM on ARMV8 cortex-A53 & cortex-A55	2021-11-18 21:14:43 +08:00
Bine Brank	3c7eed0e53	add remaining trmm copy rutines for SVE	2021-11-14 16:00:10 +01:00
Bine Brank	7d996b1c36	dtrmm_utcopy sve function	2021-11-13 18:48:53 +01:00
Bine Brank	ab7917910d	add v2x8 kernel + fix sve dtrmm	2021-11-07 20:37:51 +01:00
Bine Brank	7093372e32	add ARMV8SVE target	2021-11-01 22:53:21 +01:00
Bine Brank	a8fbdbac34	fix sve dgemm kernel + sve dtrmm	2021-10-31 10:24:25 +01:00
Bine Brank	746b4f0f17	added SVE ncopy and tcopy	2021-10-30 12:11:44 +02:00
Bine Brank	1a10d3e09d	add sve dgemm prototype	2021-10-27 16:37:18 +02:00
Martin Kroeker	22bf5c27ba	Add basic support for the Fujitsu A64FX (#3415) * Add initial support for Fujitsu A64FX as generic ARMV8	2021-10-18 15:00:19 +02:00
Wangyang Guo	63a103ba6e	sbgemm: spr: disable small matrix path by default	2021-10-17 19:08:03 -07:00
Wangyang Guo	82194ea9d2	sbgemm: spr: implement otcopy_16	2021-10-17 19:08:03 -07:00
Wangyang Guo	8632380a96	sbgemm: spr: reuse ncopy_16 from cooperlake as incopy	2021-10-17 19:08:03 -07:00
Wangyang Guo	6bc8204ce5	sbgemm: spr: optimization for tmp_c buffer	2021-10-17 19:08:03 -07:00
Wangyang Guo	f018aa342a	sbgemm: spr: kernel handle alpha != 1.0	2021-10-17 19:08:03 -07:00
Wangyang Guo	a52456b168	sbgemm: spr: oncopy: use tile load/store instead	2021-10-17 19:08:03 -07:00
Wangyang Guo	f2485352a6	sbgemm: spr: only load A once in tail_k handling	2021-10-17 19:08:03 -07:00
Wangyang Guo	9ab33228bb	sbgemm: spr: process k2 and odd k at the same time	2021-10-17 19:08:03 -07:00
Wangyang Guo	10d52646e2	sbgemm: spr: oncopy: avoid handling too much pointer at a time	2021-10-17 19:08:03 -07:00
Wangyang Guo	88154ed02d	sbgemm: spr: reduce tile conf loading by seperate tail k handling	2021-10-17 19:08:03 -07:00
Wangyang Guo	a70bfb52d5	sbgemm: spr: kernel works for NN case when alpha is 1.0	2021-10-17 19:08:03 -07:00
Wangyang Guo	6051c86741	sbgemm: spr: kernel works for m32 in NN case	2021-10-17 19:08:03 -07:00
Wangyang Guo	d0b253ac6e	sbgemm: spr: implement oncopy_16	2021-10-17 19:08:03 -07:00
Wangyang Guo	1d48b7cb16	sbgemm: spr: add dummy source files	2021-10-17 19:08:03 -07:00
Wangyang Guo	3dc6052c7e	initial support for Sapphire Rapids platform	2021-10-12 01:30:40 -07:00
Martin Kroeker	8c20ca345a	Use Neoverse's current mix of ThunderX2 kernels for Vortex as well	2021-10-06 11:06:43 +02:00
Martin Kroeker	8e4c209002	Merge pull request #3398 from kavanabhat/aix_p10_gnuas Big Endian Changes for Power10 kernels	2021-10-05 18:59:47 +02:00
kavanabhat	9cc95e5657	AIX changes for P10 with GNU Compiler	2021-10-01 05:18:35 -05:00
kavanabhat	fe3c778c51	AIX changes for P10 with GNU Compiler	2021-09-30 06:06:27 -05:00
Wangyang Guo	ee5ca8a328	x86_64: BFLOAT16: fix build warning	2021-09-28 18:30:06 +08:00
Martin Kroeker	90cc944625	Move alphaI to x22 to leave x18 unused (reserved on OSX)	2021-09-17 09:53:18 +02:00
Martin Kroeker	590fbff06e	move alpha to x19/x20 to leave x18 unused for OSX	2021-09-17 09:42:17 +02:00
Martin Kroeker	380940271b	Move temp to x21 to leave x18 unused (reserved on OSX)	2021-09-17 09:28:19 +02:00
Martin Kroeker	7d75177446	Move temp to x21 to leave x18 unused (reserved on OSX)	2021-09-17 09:24:11 +02:00
Martin Kroeker	0a4ac4b585	Use x21 for I to leave x18 unused (reserved on OSX)	2021-09-17 09:19:51 +02:00
Martin Kroeker	7d4a221579	Remove unused TEMP2 and reshuffle to leave x18 unused (reserved on OSX)	2021-09-17 09:18:25 +02:00
Martin Kroeker	d3a9c7ef7f	Merge pull request #3382 from rafaelcfsousa/rafael/cwarnings [POWER] Remove unused variable warnings.	2021-09-17 09:15:16 +02:00
Martin Kroeker	8dfa61a61c	Initialize abs_mask1 with itself to silence a gcc warning	2021-09-15 22:11:35 +02:00
Martin Kroeker	99aa10b3ff	Initialize abs_mask1 with itself to silence a gcc warning actual initialization is via the _mm_cmpeq_ep18, which I've seen claimed to be the fastest way to set an xmm register to all 1s	2021-09-15 22:10:43 +02:00
Rafael Cardoso Fernandes Sousa	b751edf624	Fix unused variable warnings on Power	2021-09-15 13:36:07 -05:00
Martin Kroeker	80346b8813	Merge pull request #3379 from martin-frbg/issue3369-2 Add casts to fix compiler warnings for SkylakeX sasum/dasum	2021-09-15 07:18:57 +02:00
Martin Kroeker	ce036a2fc0	Add casts	2021-09-14 21:41:53 +02:00
Martin Kroeker	ddf106f769	Add dedicated entries for BFLOAT16 kernels	2021-09-14 16:17:18 +02:00
Martin Kroeker	af8843875a	Merge pull request #3376 from martin-frbg/issue3370 Fix a few harmless compiler warnings	2021-09-12 00:01:31 +02:00
Martin Kroeker	0925dfe2c9	One instance of kernel_4x1 is used even on SKX	2021-09-11 15:30:19 +02:00
Martin Kroeker	7d873a329f	Add ifdefs around conditionally used functions	2021-09-11 14:38:47 +02:00
Martin Kroeker	ef24712030	Move a conditionally used variable	2021-09-11 14:37:44 +02:00
Martin Kroeker	d17238599b	Add casts	2021-09-11 13:38:28 +02:00
Wangyang Guo	59a1114d03	sbgemm: cooperlake: tuning for small matrix	2021-09-07 21:30:46 +08:00
Wangyang Guo	682d66555d	sbgemm: cooperlake: implement ncopy_16	2021-09-07 21:30:46 +08:00
Wangyang Guo	beccb83b16	sbgemm: cooperlake: add n24 kernel for tcopy_4	2021-09-07 21:30:46 +08:00
Wangyang Guo	5fcacad32b	sbgemm: cooperlake: implement tcopy_4	2021-09-07 21:30:46 +08:00
Wangyang Guo	bb1c4fa5bd	sbgemm: cooperlake: prefetch A & B	2021-09-07 21:30:46 +08:00
Wangyang Guo	7a2d1601ec	sbgemm: cooperlake: unroll core loop by 2	2021-09-07 21:30:46 +08:00
Wangyang Guo	45fdf951b6	sbgemm: cooperlake: reorder ptr increase for performance	2021-09-07 21:30:46 +08:00
Wangyang Guo	cece3541ab	sbgemm: cooperlake: fix bug in m64n12	2021-09-07 21:30:46 +08:00
Wangyang Guo	9df0953cde	sbgemm: cooperlake: kernel works for NN	2021-09-07 21:30:45 +08:00
Wangyang Guo	2ec9f3a8aa	sbgemm: cooperlake: change kernel size to 16x4	2021-09-07 21:30:45 +08:00
Wangyang Guo	ef8f5fecc8	sbgemm: cooperlake: implement sbgemm_tcopy_32	2021-09-07 21:30:45 +08:00
Wangyang Guo	4c294336e6	sbgemm: cooperlake: add dummy source files	2021-09-07 21:30:45 +08:00
Martin Kroeker	f1e3305974	Add workaround for Windows10 macro name clash	2021-09-01 21:36:50 +02:00
Wangyang Guo	619588fbab	sbgemm: remove unnecessary b0 files	2021-08-30 17:55:01 +08:00
Wangyang Guo	f39301935c	sbgemm: cooperlake: make sure hot buffer aligned to 64	2021-08-30 17:40:30 +08:00
Wangyang Guo	7d27b182fc	sbgemm: cooperlake: enable SBGEMM by small matrix path	2021-08-30 17:40:30 +08:00
Wangyang Guo	1d83ca4bca	Small Matrix: support BFLOAT16 data type	2021-08-30 17:40:20 +08:00
Martin Kroeker	bec9d9f63d	Merge pull request #3335 from guowangy/small-matrix-latest Add GEMM optimization for small matrix and single/double kernel for skylakex	2021-08-29 22:33:33 +02:00
Wangyang Guo	dbbb39199f	sgemv: skylakex: fix build warning	2021-08-25 07:13:00 +00:00
Wangyang Guo	e9acb46431	sgemv: skylakex: bug fix for sgemv_t kernel in corner case	2021-08-25 07:07:27 +00:00
Wangyang Guo	f9dba63c28	Small Matrix: skylakex: remove unnecessary b0 source files	2021-08-13 03:28:44 +00:00
Wangyang Guo	989e6bbdd3	Small Matrix: reduce generic kernel source files	2021-08-13 03:17:38 +00:00
Martin Kroeker	04255be948	Merge pull request #3344 from gxw-loongson/develop Delete the macro instruction "li" and use "li.d" instead	2021-08-12 15:16:46 +02:00
gxw	a7bc8ec1f1	Delete the macro instruction "li" and use "li.d" instead Change-Id: Icff7981e2eb7df29ba5af1f8eb5be8443c67450f	2021-08-12 17:02:54 +08:00
Rajalakshmi Srinivasaraghavan	b06880c2cd	POWER10: Improving dasum performance Unrolling a loop in dasum micro code to help in improving POWER10 performance.	2021-08-10 22:06:04 -05:00
Wangyang Guo	44d0032f3b	Small Matrix: skylakex: fix build error in old compiler	2021-08-05 04:43:47 +00:00
Chen, Guobing	5d86becdae	Add all SBGEMM kernels for IA AVX512-BF16 based platforms Added all SBGEMM kernels including NN/NT/TN/TT for both ColMajor and RowMajor, based on AVX512-BF16 ISA set on IA. Signed-off-by: Chen, Guobing <guobing.chen@intel.com>	2021-08-05 11:11:29 +08:00
Wangyang Guo	fee5abd84b	Small Matrix: support cmake build	2021-08-04 08:50:15 +00:00
Wangyang Guo	478d1086c1	Small Matrix: support DYNAMIC_ARCH build	2021-08-04 03:12:41 +00:00
Wangyang Guo	6b58bca18b	Small Matrix: disable low performance default kernel	2021-08-03 06:49:03 +00:00
Wangyang Guo	fa777f5517	Small Matrix: skylakex: add DGEMM_SMALL_M_PERMIT and tune for TN kernel	2021-08-02 07:06:54 +00:00
Wangyang Guo	8592c21af4	Small Matrix: skylakex: dgemm nn: fix typo in idx load	2021-08-02 07:06:54 +00:00
Wangyang Guo	3e79f6d89a	Small Matrix: skylakex: add dgemm tn kernel	2021-08-02 07:06:54 +00:00
Wangyang Guo	323d7da4f7	Small Matrix: skylakex: add dgemm tt kernel	2021-08-02 07:06:54 +00:00
Wangyang Guo	f57fc932ac	Small Matrix: skylakex: add dgemm nt kernel	2021-08-02 07:06:54 +00:00
Wangyang Guo	91ec21202b	Small Matrix: skylakex: add dgemm nn kernel	2021-08-02 07:06:54 +00:00
Wangyang Guo	72e070539c	Small Matrix: skylakex: add sgemm tt kernel	2021-08-02 07:06:54 +00:00
Wangyang Guo	02c6e764f2	Small Matrix: skylakex: add SGEMM_SMALL_M_PERMIT and tune for TN kernel	2021-08-02 07:06:54 +00:00
Wangyang Guo	5dc7c3c8e5	Small Matrix: add GEMM_SMALL_MATRIX_PERMIT to tune small matrics case	2021-08-02 07:06:54 +00:00
Wangyang Guo	642c393879	Small Matrix: skylakex: add sgemm tn kernel	2021-08-02 07:06:54 +00:00
Wangyang Guo	ae3f5c737c	Small Matrix: skylakex: sgemm nt: optimize for M < 12	2021-08-02 07:06:54 +00:00
Wangyang Guo	0d72d75bf9	Small Matrix: skylakex: add sgemm nt kernel	2021-08-02 07:06:54 +00:00
Wangyang Guo	ca7682e3a3	Small Matrix: skylakex: sgemm nn: fix n6 conflicts with n4	2021-08-02 07:06:54 +00:00
Wangyang Guo	9967e61abb	Small Matrix: skylakex: sgemm nn: fix error when beta not zero	2021-08-02 07:06:54 +00:00
Wangyang Guo	a87736346f	Small Matrix: skylakex: sgemm nn: add n6 to improve performance	2021-08-02 07:06:54 +00:00
Wangyang Guo	4c9d9940fd	Small Matrix: skylakex: sgemm nn: reduce store 4 N at a time	2021-08-02 07:06:54 +00:00
Wangyang Guo	13b32f69b7	Small Matrix: skylakex: sgemm nn: reduce store 4 M at a time	2021-08-02 07:06:54 +00:00
Wangyang Guo	3d8c6d9607	Small Matrix: skylakex: sgemm nn: clean up unused code	2021-08-02 07:06:54 +00:00
Wangyang Guo	49b61a3f30	Small Matrix: skylakex: sgemm_nn: optimize for M <= 8	2021-08-02 07:06:54 +00:00
Wangyang Guo	f88470323b	Optimize M < 16 using AVX512 mask	2021-08-02 07:06:54 +00:00
Wangyang Guo	9186456a12	small matrix: SkylakeX: add SGEMM NN kernel	2021-08-02 07:06:54 +00:00
Xianyi Zhang	6022e5629c	Refs #2587 fix small matrix c/zgemm bug.	2021-08-02 07:06:54 +00:00
Xianyi Zhang	57ed58cefe	Refs #2587 Add small matrix optimization reference kernel for c/zgemm.	2021-08-02 07:06:54 +00:00
Xianyi Zhang	17d32a4a82	Change a1b0 gemm to b0 gemm.	2021-08-02 07:06:54 +00:00
Xianyi Zhang	59cb5de46b	Refs #2587 Fix typos.	2021-08-02 07:06:54 +00:00
Xianyi Zhang	be3349405d	Add alpha=1.0 beta=0.0 for small gemm.	2021-08-02 07:01:47 +00:00
Xianyi Zhang	0a2077901c	Add small marix optimization kernel interface. make SMALL_MATRIX_OPT=1	2021-08-02 07:01:47 +00:00
gxw	0b8f7c8c10	Add cmake support for LOONGARCH64	2021-08-02 10:00:41 +08:00
gxw	af0a69f355	Add support for LOONGARCH64	2021-07-27 15:29:12 +08:00
Martin Kroeker	49bbf330ca	Empirical workaround for numpy SVD NaN problem from issue 3318	2021-07-18 22:19:19 +02:00
Martin Kroeker	5b4b385ecf	Temporarily disable the SkylakeX sgemv_t microkernel due to LAPACK testsuite failures	2021-07-14 20:50:14 +02:00
User User-User	39ef0880ae	copy conf	2021-06-19 21:49:58 +02:00
Martin Kroeker	c4b464cac6	Merge pull request #3273 from austinpagan/sbgemm_gcc10_fix Power10: Fix for SBGEMM	2021-06-15 22:58:48 +02:00
Gordon Fossum	e6dd44d989	Power10: Fix for SBGEMM While testing bfloat16 sbgemm kernel, there are some failures for odd value inputs due to updating result for additional bytes.	2021-06-15 13:07:47 -05:00
Gilles Gouaillardet	9d292d37b2	arm64: add the missing d9 register to the clobber list Refs. numpy/numpy#18422 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2021-06-14 17:01:28 +09:00
Martin Kroeker	2e8ff4a781	Merge pull request #3266 from martin-frbg/powerparam Remove spurious casts from PPC parameters and fix compilation for older targets	2021-06-10 18:05:47 +02:00
Martin Kroeker	dbba381dc3	Merge pull request #3260 from intelmy/sgemv_t_opt Optimized sgemv_t for small N based on AVX512	2021-06-10 16:08:24 +02:00
Martin Kroeker	efdbdd8f82	Add prefetch values for power3	2021-06-10 11:20:29 +02:00
Martin Kroeker	3906ef3b0f	Add prefetch values for power3	2021-06-10 11:19:40 +02:00
Martin Kroeker	8adf0971d8	Add prefetch values for power3	2021-06-10 11:18:22 +02:00
Martin Kroeker	08e2e60762	Add prefetch values for power3	2021-06-10 11:17:33 +02:00
Martin Kroeker	fb9e678235	Fix caxpy/zaxpy for big-endian	2021-06-10 11:15:48 +02:00
Martin Kroeker	dc4fcb48df	Fix inverted conditional for caxpy/zaxpy	2021-06-10 11:14:03 +02:00
Martin Kroeker	7a48247761	fix c/zrot and sgemv for POWER5	2021-06-10 11:11:56 +02:00
Rajalakshmi Srinivasaraghavan	cbb70438df	POWER10: Fixes for sbgemm kernel While testing bfloat16 sbgemm kernel, there are some failures for odd value inputs due to array access beyond the boundary.	2021-06-09 12:20:09 -05:00
Ma, Yu	706a08d4a0	Optimized sgemv_t for small N based on AVX512	2021-06-08 15:08:28 -04:00
Zhaofeng Li	590be3fae3	riscv64: Add Makefile	2021-06-07 22:55:56 +00:00
Zhaofeng Li	3521cd48cb	RISCV64_GENERIC: Use generic kernel for DSDOT for better precision The implementation in `riscv64/dot.c` fails the `test_dsdot` test, and the generic kernel seems to have better precision. Tested on SiFive FU740 (HiFive Unmatched) and QEMU. Also see #1469.	2021-06-07 22:50:23 +00:00
Zhaofeng Li	1e0192a5cc	riscv64/imin: Fix wrong comparison Same as #1990.	2021-06-07 22:49:39 +00:00
Martin Kroeker	5f677e782e	Merge pull request #3196 from guowangy/skylakex-gemm-batch-k GEMM: skylake: improve the performance when m is small	2021-05-22 19:25:28 +02:00
Martin Kroeker	02087a62e7	Merge pull request #3205 from intelmy/sgemv_n_opt optimize on sgemv_n for small n	2021-05-17 17:49:01 +02:00
Martin Kroeker	4ecf631f95	Merge pull request #3228 from martin-frbg/issue3226 filter out -mavx flag on Sandybridge zgemm/ztrmm kernels	2021-05-15 09:06:12 +02:00
Martin Kroeker	310b76aad7	Merge pull request #3231 from martin-frbg/issue3227 Support compilation with pre-C99 versions of MSVC	2021-05-14 23:28:06 +02:00
Martin Kroeker	c4da892ba0	Only filter out -mavx on Sandybridge ZGEMM/ZTRMM kernels	2021-05-14 23:19:10 +02:00
Martin Kroeker	8b90e5f202	Drop redundant inclusion of complex.h	2021-05-14 15:06:44 +02:00
Martin Kroeker	bd60fb6ffc	filter out -mavx flag on zgemm kernels as it can cause problems with older gcc	2021-05-13 23:05:00 +02:00
Martin Kroeker	37ea8702ee	Merge pull request #3192 from damonyu1989/develop Update the intrinsic api to the offical name.	2021-05-11 16:00:45 +02:00
Martin Kroeker	c0ca63ea46	Fix missing conditionals for non-SKX kernels	2021-05-05 14:55:36 +02:00
pnp	3d4ccd2a13	fix for build error	2021-04-30 12:25:33 -04:00
pnp	c59652f0ce	optimize on sgemv_n for small n	2021-04-30 12:14:58 -04:00
Wangyang Guo	aa7b3dc3db	GEMM: skylake: improve the performance when m is small	2021-04-28 13:56:06 +00:00
damonyu	ceb44bef14	update the intrinsic api to the offical name.	2021-04-27 11:12:29 +08:00
Martin Kroeker	3d511f0e66	replace spurious avx512 requirement with fma check	2021-04-26 21:55:30 +02:00
Rajalakshmi Srinivasaraghavan	2379abaa5e	POWER10: Improve dgemm performance This patch uses vector pair pointer for input load operation which helps to generate power10 lxvp instructions.	2021-04-13 22:30:06 -05:00
Rajalakshmi Srinivasaraghavan	55bb9f639a	POWER10: Optimized zgemv This patch makes use of Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1 for zgemv_n and zgemv_t.	2021-04-10 19:00:24 -05:00
Martin Kroeker	2dfb24730d	Use "old" compute(24) function with clang due to register limitations	2021-04-06 19:58:32 +02:00
Martin Kroeker	147e0a75fd	Merge pull request #3170 from CodesWithWolves/sgemm_tcopy_16-invalid-read Remove Unnecessary/Erroneous Adds/Reads In sgemm_tcopy_16.S COPY1x8 Macro	2021-04-03 19:49:47 +02:00
Rajalakshmi Srinivasaraghavan	2dbcddd83d	POWER10: Adding check for little endian This patch makes sure that recent POWER10 patches are used only for little endian.	2021-03-31 21:32:42 -05:00
CodesWithWolves	d2bda3b56a	Remove Unnecessary/Erroneous Reads In sgemm_tcopy_16.S COPY1x8 Macro There appears to have been some code leak when copying from the COPY2x8 macro above where we're reading 8 bytes into d4-d7 directly after reading 4 bytes into s4-s7. These 32 bytes in d4-7 are unused and can possibly overrun the boundary of allocated memory -- Valgrind detected this which is what dragged my attention to it for a 128,1 copy. Additionally, there is no need to update the addresses stored in A0-A7 as the only possible paths after running this macro will overwrite A0-7 if looping to the next 8 rows, or overwrite A0-3 if moving to 4 rows -- in which case A4-7 are unused.	2021-03-31 15:44:25 -04:00
Martin Kroeker	bdd6e3a153	Merge pull request #3157 from martin-frbg/issue3020-final Add workaround for LAPACK testsuite failures with the NVIDIA HPC compiler on PPC	2021-03-19 15:23:12 +01:00
Martin Kroeker	7b8f580941	Merge pull request #3156 from martin-frbg/omatcopy_d Move x86_64 DOMATCOPY_RT back to the C implementation	2021-03-19 15:22:48 +01:00
Martin Kroeker	86c5a0013f	Add workaround for LAPACK testsuite failures with the NVIDIA HPC compiler	2021-03-19 11:47:58 +01:00
Martin Kroeker	ef85c22474	Add workaround for LAPACK test failures with the NVIDIA HPC compiler	2021-03-19 11:46:25 +01:00
Martin Kroeker	d3555d2e50	Add workaround for LAPACK test failures with the NVIDIA HPC compiler	2021-03-19 11:44:31 +01:00
Martin Kroeker	0f5e86a0d9	Remove premature entry for DOMATCOPY_RT	2021-03-18 21:53:50 +01:00
Martin Kroeker	7b294a99fd	Move common.h back to the top of the file so that SKYLAKEX (from config.h) is defined in time	2021-03-18 21:28:19 +01:00
Martin Kroeker	0934568d9c	Move includes under the ifdef for compilers w/o intrinsics support	2021-03-12 12:42:05 +01:00
Rajalakshmi Srinivasaraghavan	09d47af2c0	Optimize zscal function for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-03-10 17:15:33 -06:00
Martin Kroeker	ef0238ba2b	Merge pull request #3130 from martin-frbg/issue3128 Replace spurious AVX512 requirement in the Haswell srot microkernel with an AVX2/FMA3 guard	2021-03-06 19:15:53 +01:00
Martin Kroeker	a9f6f7ad39	Remove spurious AVX512 requirement and add AVX2/FMA3 guard	2021-03-06 14:35:49 +01:00
Rajalakshmi Srinivasaraghavan	41646ed006	Optimize s/dasum function for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-03-05 16:22:36 -06:00
Rajalakshmi Srinivasaraghavan	0571c3187b	POWER10: Rename mma builtins The LLVM and GCC teams agreed to rename the __builtin_mma_assemble_pair and __builtin_mma_disassemble_pair built-ins to __builtin_vsx_assemble_pair and __builtin_vsx_disassemble_pair respectively. This patch is to make corresponding changes in dgemm kernel. Also made changes in inputs to those builtins to avoid some potential typecasting issues. Reference gcc commit id:77ef995c1fbcab76a2a69b9f4700bcfd005d8e62	2021-02-26 20:56:34 -06:00
Martin Kroeker	292d1af1a0	Update omatcopy_rt.c	2021-02-24 09:34:14 +01:00
Martin Kroeker	325b398e3c	Update omatcopy_rt.c	2021-02-24 09:13:12 +01:00
Martin Kroeker	6f5667b4d4	Enable optimized S/D OMATCOPY_RT	2021-02-24 09:03:41 +01:00
Martin Kroeker	cceeee7806	Add optimized omatcopy_rt	2021-02-24 09:00:54 +01:00
Martin Kroeker	0a4546b742	Typo fix	2021-02-23 13:14:35 +01:00
Martin Kroeker	b1eed27a54	Replace naive omatcopy_rt with 4x4 blocked implementation as suggested by MigMuc in issue 2532	2021-02-22 21:35:42 +01:00
Martin Kroeker	47691c031f	Use Haswell optimizations for Zen as well	2021-02-11 09:26:15 +01:00
Martin Kroeker	ce7ddd8921	Use Haswell optimizations for Zen as well	2021-02-11 09:25:36 +01:00
Martin Kroeker	950c047b49	Use Haswell optimizations for Zen as well	2021-02-11 09:24:51 +01:00
Martin Kroeker	46509953a9	Use Haswell optimizations for Zen as well	2021-02-11 09:24:16 +01:00
Martin Kroeker	db348dcff2	Enable optimized srot/drot kernels from Haswell	2021-02-11 09:23:05 +01:00
Rajalakshmi Srinivasaraghavan	2056ffc227	Optimize cscal function for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-01-29 13:51:43 -06:00
Rajalakshmi Srinivasaraghavan	3ede843d50	Optimize s/dscal function for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-01-24 07:48:28 -06:00
Martin Kroeker	69a5558203	Merge pull request #3059 from Guobing-Chen/BF16_gemm Initial code for Cooperlake BF16 GEMM kernel	2021-01-23 19:08:05 +01:00
Martin Kroeker	d6905403e3	Merge pull request #3068 from alexhenrie/scan-build scan-build fixes	2021-01-23 19:06:29 +01:00
Rajalakshmi Srinivasaraghavan	439b93f6d2	Optimize s/drot function for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-01-21 13:24:45 -06:00
Rajalakshmi Srinivasaraghavan	eff7c9166e	Optimize cdot function for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-01-15 13:40:34 -06:00
Alex Henrie	202fc9e8ed	Fix uninitialized argument value in dasum_k	2021-01-14 19:40:31 -07:00
Martin Kroeker	e378b24487	Merge pull request #3067 from albertziegenhagel/fix-generic-cmake Fix building "generic" TRMM kernel with CMake	2021-01-14 21:35:19 +01:00
Albert Ziegenhagel	e3f4063683	Fix building "generic" TRMM kernel with CMake The CMake "TARGET_CORE" variables stores the "generic" target name in all lowercase letters, but gets compared to an all uppercase string, which results in the wrong TRMM kernel being selected. This commit converts the TARGET_CORE to all uppercase before comparing its value to make sure case mismatches are not an issue in the future anymore.	2021-01-14 10:00:49 +01:00
Martin Kroeker	b716c0ef01	Add workaround for NVIDIA HPC	2021-01-12 16:51:35 +01:00
Martin Kroeker	2efa3b70dc	Add workaround for NVIDIA HPC	2021-01-12 16:49:39 +01:00
Martin Kroeker	49959d4f1c	Add workaround for NVIDIA HPC	2021-01-12 16:47:15 +01:00
Martin Kroeker	0f27a03607	Add workaround for NVIDIA HPC mishandling of the asm DOT kernels	2021-01-12 16:39:35 +01:00
Martin Kroeker	c2a8ebfe69	Add workaround for NVIDIA HPC mishandling of the asm DOT kernels	2021-01-12 16:38:51 +01:00
Martin Kroeker	43aac5bacc	Support NVIDIA HPC compiler	2021-01-12 16:36:12 +01:00
Chen, Guobing	b0beb0b1ca	Initial code for Cooperlake BF16 GEMM kernel	2021-01-11 02:15:21 +08:00
Rajalakshmi Srinivasaraghavan	601b711c78	Optimize swap function for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-01-08 08:01:36 -06:00
Ashwin Sekhar T K	1b2508362b	arm64: Fix nrm2 for input vectors with Inf Fix double precision nrm2 kernels returning NaN when the input vectors contain Inf/-Inf.	2021-01-01 02:49:37 -08:00
Martin Kroeker	3559c5d7a2	Merge pull request #3048 from martin-frbg/issue2998 Temporarily revert to the old NRM2 kernels for ThunderX2/3 and NeoverseN1	2020-12-21 13:30:08 +01:00
Martin Kroeker	8631e2976a	Temporarily revert to the old nrm2 kernels	2020-12-21 07:45:13 +01:00
Martin Kroeker	2768bc1764	Temporarily revert to the old nrm2 kernels	2020-12-21 07:42:51 +01:00
Martin Kroeker	6f4698ee1f	Temporarily revert to the old nrm2 kernel	2020-12-21 07:41:18 +01:00
Martin Kroeker	114eb159a4	Disable FMA intrinsics in the srot kernel when the compiler is PGI/NVIDIA	2020-12-19 22:15:58 +01:00
Martin Kroeker	005cce5507	Amend SkylakeX options to support the NVIDIA compiler	2020-12-19 22:11:49 +01:00
Xianyi Zhang	a3cac9cca0	Update sgemm kernel 1x4 for C910.	2020-12-18 11:53:23 +08:00
Martin Kroeker	c73d8ee40d	Conditionally add -mfma to compiler options where needed	2020-12-17 11:34:05 +01:00
Rajalakshmi Srinivasaraghavan	2fb11f873b	POWER10: Improve copy performance This patch aligns the stores to 32 byte boundary for scopy and dcopy before entering into vector pair loop. For ccopy, changed the store instructions to stxv to improve performance of unaligned cases.	2020-12-13 10:41:45 -06:00
Martin Kroeker	043128cbe5	Merge pull request #3029 from RajalakshmiSR/axpyp10 POWER10: Improve axpy performance	2020-12-10 22:49:28 +01:00
Martin Kroeker	3331ca492d	Merge pull request #3021 from austinpagan/trsm_p10 POWER: Added special unrolled vectorized versions of "Solve" for specific si…	2020-12-10 19:42:54 +01:00
Rajalakshmi Srinivasaraghavan	346e30a46a	POWER10: Improve axpy performance This patch aligns the stores to 32 byte boundary for saxpy and daxpy before entering into vector pair loop. Fox caxpy, changed the store instructions to stxv to improve performance of unaligned cases.	2020-12-10 11:51:42 -06:00
gxw	4b548857d6	Add msa support for loongson 1. Using core loongson3r3 and loongson3r4 for loongson 2. Add DYNAMIC_ARCH for loongson Change-Id: I1c6b54dbeca3a0cc31d1222af36a7e9bd6ab54c1	2020-12-09 10:28:46 +08:00
Martin Kroeker	7f11e33e8d	Merge pull request #3025 from TiredNotTear/develop MIPS: Fix two bugs	2020-12-08 09:39:27 +01:00
Martin Kroeker	53e0837809	Merge pull request #3022 from jinboson/develop Fix test errors reported by cblas_cgemm & cblas_ctrmm	2020-12-07 08:09:11 +01:00
Hao Chen	ad38bd0e89	Fix failed cgemv and zgemv test case after using msa optimization The cgemv and zgemv test case will call cgemv_n/t_msa.c zgemv_n/t_msa.c files in MIPS environment. When the macro CONJ is defined, the calculation result will be wrong due to the wrong definition of OP2. This patch updates the value of OP2 and passes the corresponding test.	2020-12-07 10:25:01 +08:00
Hao Chen	47b639cc9b	Fix failed sswap and dswap case by using msa optimization The swap test case will call sswap_msa.c and dswap_msa.c files in MIPS environmnet. When inc_x or inc_y is equal to zero, the calculation result of the two functions will be wrong. This patch adds the processing of inc_x or inc_y equal to zero, and the swap test case has passed.	2020-12-07 10:24:49 +08:00
Martin Kroeker	b660008c7e	Work around DOT and SWAP test failures	2020-12-06 19:15:37 +01:00
Martin Kroeker	f8346603cf	Fix compilation with SolarisStudio	2020-12-06 19:14:16 +01:00
Jin Bo	65de6f5957	Fix test errors reported by cblas_cgemm & cblas_ctrmm The file cgemm_kernel_8x4_msa.c holds the MSA optimization codes of cblas_cgemm and cblas_ctrmm. It defines two macros: CGEMM_SCALE_1X2 and CGEMM_TRMM_SCALE_1X2. The pc1 array index in the two macros should be 0 and 1.	2020-12-05 15:08:17 +08:00

2023 Commits