OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Wangyang Guo	4c294336e6	sbgemm: cooperlake: add dummy source files	2021-09-07 21:30:45 +08:00
Martin Kroeker	f1e3305974	Add workaround for Windows10 macro name clash	2021-09-01 21:36:50 +02:00
Wangyang Guo	619588fbab	sbgemm: remove unnecessary b0 files	2021-08-30 17:55:01 +08:00
Wangyang Guo	f39301935c	sbgemm: cooperlake: make sure hot buffer aligned to 64	2021-08-30 17:40:30 +08:00
Wangyang Guo	7d27b182fc	sbgemm: cooperlake: enable SBGEMM by small matrix path	2021-08-30 17:40:30 +08:00
Wangyang Guo	1d83ca4bca	Small Matrix: support BFLOAT16 data type	2021-08-30 17:40:20 +08:00
Martin Kroeker	bec9d9f63d	Merge pull request #3335 from guowangy/small-matrix-latest Add GEMM optimization for small matrix and single/double kernel for skylakex	2021-08-29 22:33:33 +02:00
Wangyang Guo	dbbb39199f	sgemv: skylakex: fix build warning	2021-08-25 07:13:00 +00:00
Wangyang Guo	e9acb46431	sgemv: skylakex: bug fix for sgemv_t kernel in corner case	2021-08-25 07:07:27 +00:00
Wangyang Guo	f9dba63c28	Small Matrix: skylakex: remove unnecessary b0 source files	2021-08-13 03:28:44 +00:00
Wangyang Guo	989e6bbdd3	Small Matrix: reduce generic kernel source files	2021-08-13 03:17:38 +00:00
Martin Kroeker	04255be948	Merge pull request #3344 from gxw-loongson/develop Delete the macro instruction "li" and use "li.d" instead	2021-08-12 15:16:46 +02:00
gxw	a7bc8ec1f1	Delete the macro instruction "li" and use "li.d" instead Change-Id: Icff7981e2eb7df29ba5af1f8eb5be8443c67450f	2021-08-12 17:02:54 +08:00
Rajalakshmi Srinivasaraghavan	b06880c2cd	POWER10: Improving dasum performance Unrolling a loop in dasum micro code to help in improving POWER10 performance.	2021-08-10 22:06:04 -05:00
Wangyang Guo	44d0032f3b	Small Matrix: skylakex: fix build error in old compiler	2021-08-05 04:43:47 +00:00
Chen, Guobing	5d86becdae	Add all SBGEMM kernels for IA AVX512-BF16 based platforms Added all SBGEMM kernels including NN/NT/TN/TT for both ColMajor and RowMajor, based on AVX512-BF16 ISA set on IA. Signed-off-by: Chen, Guobing <guobing.chen@intel.com>	2021-08-05 11:11:29 +08:00
Wangyang Guo	fee5abd84b	Small Matrix: support cmake build	2021-08-04 08:50:15 +00:00
Wangyang Guo	478d1086c1	Small Matrix: support DYNAMIC_ARCH build	2021-08-04 03:12:41 +00:00
Wangyang Guo	6b58bca18b	Small Matrix: disable low performance default kernel	2021-08-03 06:49:03 +00:00
Wangyang Guo	fa777f5517	Small Matrix: skylakex: add DGEMM_SMALL_M_PERMIT and tune for TN kernel	2021-08-02 07:06:54 +00:00
Wangyang Guo	8592c21af4	Small Matrix: skylakex: dgemm nn: fix typo in idx load	2021-08-02 07:06:54 +00:00
Wangyang Guo	3e79f6d89a	Small Matrix: skylakex: add dgemm tn kernel	2021-08-02 07:06:54 +00:00
Wangyang Guo	323d7da4f7	Small Matrix: skylakex: add dgemm tt kernel	2021-08-02 07:06:54 +00:00
Wangyang Guo	f57fc932ac	Small Matrix: skylakex: add dgemm nt kernel	2021-08-02 07:06:54 +00:00
Wangyang Guo	91ec21202b	Small Matrix: skylakex: add dgemm nn kernel	2021-08-02 07:06:54 +00:00
Wangyang Guo	72e070539c	Small Matrix: skylakex: add sgemm tt kernel	2021-08-02 07:06:54 +00:00
Wangyang Guo	02c6e764f2	Small Matrix: skylakex: add SGEMM_SMALL_M_PERMIT and tune for TN kernel	2021-08-02 07:06:54 +00:00
Wangyang Guo	5dc7c3c8e5	Small Matrix: add GEMM_SMALL_MATRIX_PERMIT to tune small matrics case	2021-08-02 07:06:54 +00:00
Wangyang Guo	642c393879	Small Matrix: skylakex: add sgemm tn kernel	2021-08-02 07:06:54 +00:00
Wangyang Guo	ae3f5c737c	Small Matrix: skylakex: sgemm nt: optimize for M < 12	2021-08-02 07:06:54 +00:00
Wangyang Guo	0d72d75bf9	Small Matrix: skylakex: add sgemm nt kernel	2021-08-02 07:06:54 +00:00
Wangyang Guo	ca7682e3a3	Small Matrix: skylakex: sgemm nn: fix n6 conflicts with n4	2021-08-02 07:06:54 +00:00
Wangyang Guo	9967e61abb	Small Matrix: skylakex: sgemm nn: fix error when beta not zero	2021-08-02 07:06:54 +00:00
Wangyang Guo	a87736346f	Small Matrix: skylakex: sgemm nn: add n6 to improve performance	2021-08-02 07:06:54 +00:00
Wangyang Guo	4c9d9940fd	Small Matrix: skylakex: sgemm nn: reduce store 4 N at a time	2021-08-02 07:06:54 +00:00
Wangyang Guo	13b32f69b7	Small Matrix: skylakex: sgemm nn: reduce store 4 M at a time	2021-08-02 07:06:54 +00:00
Wangyang Guo	3d8c6d9607	Small Matrix: skylakex: sgemm nn: clean up unused code	2021-08-02 07:06:54 +00:00
Wangyang Guo	49b61a3f30	Small Matrix: skylakex: sgemm_nn: optimize for M <= 8	2021-08-02 07:06:54 +00:00
Wangyang Guo	f88470323b	Optimize M < 16 using AVX512 mask	2021-08-02 07:06:54 +00:00
Wangyang Guo	9186456a12	small matrix: SkylakeX: add SGEMM NN kernel	2021-08-02 07:06:54 +00:00
Xianyi Zhang	6022e5629c	Refs #2587 fix small matrix c/zgemm bug.	2021-08-02 07:06:54 +00:00
Xianyi Zhang	57ed58cefe	Refs #2587 Add small matrix optimization reference kernel for c/zgemm.	2021-08-02 07:06:54 +00:00
Xianyi Zhang	17d32a4a82	Change a1b0 gemm to b0 gemm.	2021-08-02 07:06:54 +00:00
Xianyi Zhang	59cb5de46b	Refs #2587 Fix typos.	2021-08-02 07:06:54 +00:00
Xianyi Zhang	be3349405d	Add alpha=1.0 beta=0.0 for small gemm.	2021-08-02 07:01:47 +00:00
Xianyi Zhang	0a2077901c	Add small marix optimization kernel interface. make SMALL_MATRIX_OPT=1	2021-08-02 07:01:47 +00:00
gxw	0b8f7c8c10	Add cmake support for LOONGARCH64	2021-08-02 10:00:41 +08:00
gxw	af0a69f355	Add support for LOONGARCH64	2021-07-27 15:29:12 +08:00
Martin Kroeker	49bbf330ca	Empirical workaround for numpy SVD NaN problem from issue 3318	2021-07-18 22:19:19 +02:00
Martin Kroeker	5b4b385ecf	Temporarily disable the SkylakeX sgemv_t microkernel due to LAPACK testsuite failures	2021-07-14 20:50:14 +02:00
User User-User	39ef0880ae	copy conf	2021-06-19 21:49:58 +02:00
Martin Kroeker	c4b464cac6	Merge pull request #3273 from austinpagan/sbgemm_gcc10_fix Power10: Fix for SBGEMM	2021-06-15 22:58:48 +02:00
Gordon Fossum	e6dd44d989	Power10: Fix for SBGEMM While testing bfloat16 sbgemm kernel, there are some failures for odd value inputs due to updating result for additional bytes.	2021-06-15 13:07:47 -05:00
Gilles Gouaillardet	9d292d37b2	arm64: add the missing d9 register to the clobber list Refs. numpy/numpy#18422 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2021-06-14 17:01:28 +09:00
Martin Kroeker	2e8ff4a781	Merge pull request #3266 from martin-frbg/powerparam Remove spurious casts from PPC parameters and fix compilation for older targets	2021-06-10 18:05:47 +02:00
Martin Kroeker	dbba381dc3	Merge pull request #3260 from intelmy/sgemv_t_opt Optimized sgemv_t for small N based on AVX512	2021-06-10 16:08:24 +02:00
Martin Kroeker	efdbdd8f82	Add prefetch values for power3	2021-06-10 11:20:29 +02:00
Martin Kroeker	3906ef3b0f	Add prefetch values for power3	2021-06-10 11:19:40 +02:00
Martin Kroeker	8adf0971d8	Add prefetch values for power3	2021-06-10 11:18:22 +02:00
Martin Kroeker	08e2e60762	Add prefetch values for power3	2021-06-10 11:17:33 +02:00
Martin Kroeker	fb9e678235	Fix caxpy/zaxpy for big-endian	2021-06-10 11:15:48 +02:00
Martin Kroeker	dc4fcb48df	Fix inverted conditional for caxpy/zaxpy	2021-06-10 11:14:03 +02:00
Martin Kroeker	7a48247761	fix c/zrot and sgemv for POWER5	2021-06-10 11:11:56 +02:00
Rajalakshmi Srinivasaraghavan	cbb70438df	POWER10: Fixes for sbgemm kernel While testing bfloat16 sbgemm kernel, there are some failures for odd value inputs due to array access beyond the boundary.	2021-06-09 12:20:09 -05:00
Ma, Yu	706a08d4a0	Optimized sgemv_t for small N based on AVX512	2021-06-08 15:08:28 -04:00
Zhaofeng Li	590be3fae3	riscv64: Add Makefile	2021-06-07 22:55:56 +00:00
Zhaofeng Li	3521cd48cb	RISCV64_GENERIC: Use generic kernel for DSDOT for better precision The implementation in `riscv64/dot.c` fails the `test_dsdot` test, and the generic kernel seems to have better precision. Tested on SiFive FU740 (HiFive Unmatched) and QEMU. Also see #1469.	2021-06-07 22:50:23 +00:00
Zhaofeng Li	1e0192a5cc	riscv64/imin: Fix wrong comparison Same as #1990.	2021-06-07 22:49:39 +00:00
Martin Kroeker	5f677e782e	Merge pull request #3196 from guowangy/skylakex-gemm-batch-k GEMM: skylake: improve the performance when m is small	2021-05-22 19:25:28 +02:00
Martin Kroeker	02087a62e7	Merge pull request #3205 from intelmy/sgemv_n_opt optimize on sgemv_n for small n	2021-05-17 17:49:01 +02:00
Martin Kroeker	4ecf631f95	Merge pull request #3228 from martin-frbg/issue3226 filter out -mavx flag on Sandybridge zgemm/ztrmm kernels	2021-05-15 09:06:12 +02:00
Martin Kroeker	310b76aad7	Merge pull request #3231 from martin-frbg/issue3227 Support compilation with pre-C99 versions of MSVC	2021-05-14 23:28:06 +02:00
Martin Kroeker	c4da892ba0	Only filter out -mavx on Sandybridge ZGEMM/ZTRMM kernels	2021-05-14 23:19:10 +02:00
Martin Kroeker	8b90e5f202	Drop redundant inclusion of complex.h	2021-05-14 15:06:44 +02:00
Martin Kroeker	bd60fb6ffc	filter out -mavx flag on zgemm kernels as it can cause problems with older gcc	2021-05-13 23:05:00 +02:00
Martin Kroeker	37ea8702ee	Merge pull request #3192 from damonyu1989/develop Update the intrinsic api to the offical name.	2021-05-11 16:00:45 +02:00
Martin Kroeker	c0ca63ea46	Fix missing conditionals for non-SKX kernels	2021-05-05 14:55:36 +02:00
pnp	3d4ccd2a13	fix for build error	2021-04-30 12:25:33 -04:00
pnp	c59652f0ce	optimize on sgemv_n for small n	2021-04-30 12:14:58 -04:00
Wangyang Guo	aa7b3dc3db	GEMM: skylake: improve the performance when m is small	2021-04-28 13:56:06 +00:00
damonyu	ceb44bef14	update the intrinsic api to the offical name.	2021-04-27 11:12:29 +08:00
Martin Kroeker	3d511f0e66	replace spurious avx512 requirement with fma check	2021-04-26 21:55:30 +02:00
Rajalakshmi Srinivasaraghavan	2379abaa5e	POWER10: Improve dgemm performance This patch uses vector pair pointer for input load operation which helps to generate power10 lxvp instructions.	2021-04-13 22:30:06 -05:00
Rajalakshmi Srinivasaraghavan	55bb9f639a	POWER10: Optimized zgemv This patch makes use of Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1 for zgemv_n and zgemv_t.	2021-04-10 19:00:24 -05:00
Martin Kroeker	2dfb24730d	Use "old" compute(24) function with clang due to register limitations	2021-04-06 19:58:32 +02:00
Martin Kroeker	147e0a75fd	Merge pull request #3170 from CodesWithWolves/sgemm_tcopy_16-invalid-read Remove Unnecessary/Erroneous Adds/Reads In sgemm_tcopy_16.S COPY1x8 Macro	2021-04-03 19:49:47 +02:00
Rajalakshmi Srinivasaraghavan	2dbcddd83d	POWER10: Adding check for little endian This patch makes sure that recent POWER10 patches are used only for little endian.	2021-03-31 21:32:42 -05:00
CodesWithWolves	d2bda3b56a	Remove Unnecessary/Erroneous Reads In sgemm_tcopy_16.S COPY1x8 Macro There appears to have been some code leak when copying from the COPY2x8 macro above where we're reading 8 bytes into d4-d7 directly after reading 4 bytes into s4-s7. These 32 bytes in d4-7 are unused and can possibly overrun the boundary of allocated memory -- Valgrind detected this which is what dragged my attention to it for a 128,1 copy. Additionally, there is no need to update the addresses stored in A0-A7 as the only possible paths after running this macro will overwrite A0-7 if looping to the next 8 rows, or overwrite A0-3 if moving to 4 rows -- in which case A4-7 are unused.	2021-03-31 15:44:25 -04:00
Martin Kroeker	bdd6e3a153	Merge pull request #3157 from martin-frbg/issue3020-final Add workaround for LAPACK testsuite failures with the NVIDIA HPC compiler on PPC	2021-03-19 15:23:12 +01:00
Martin Kroeker	7b8f580941	Merge pull request #3156 from martin-frbg/omatcopy_d Move x86_64 DOMATCOPY_RT back to the C implementation	2021-03-19 15:22:48 +01:00
Martin Kroeker	86c5a0013f	Add workaround for LAPACK testsuite failures with the NVIDIA HPC compiler	2021-03-19 11:47:58 +01:00
Martin Kroeker	ef85c22474	Add workaround for LAPACK test failures with the NVIDIA HPC compiler	2021-03-19 11:46:25 +01:00
Martin Kroeker	d3555d2e50	Add workaround for LAPACK test failures with the NVIDIA HPC compiler	2021-03-19 11:44:31 +01:00
Martin Kroeker	0f5e86a0d9	Remove premature entry for DOMATCOPY_RT	2021-03-18 21:53:50 +01:00
Martin Kroeker	7b294a99fd	Move common.h back to the top of the file so that SKYLAKEX (from config.h) is defined in time	2021-03-18 21:28:19 +01:00
Martin Kroeker	0934568d9c	Move includes under the ifdef for compilers w/o intrinsics support	2021-03-12 12:42:05 +01:00
Rajalakshmi Srinivasaraghavan	09d47af2c0	Optimize zscal function for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-03-10 17:15:33 -06:00
Martin Kroeker	ef0238ba2b	Merge pull request #3130 from martin-frbg/issue3128 Replace spurious AVX512 requirement in the Haswell srot microkernel with an AVX2/FMA3 guard	2021-03-06 19:15:53 +01:00
Martin Kroeker	a9f6f7ad39	Remove spurious AVX512 requirement and add AVX2/FMA3 guard	2021-03-06 14:35:49 +01:00
Rajalakshmi Srinivasaraghavan	41646ed006	Optimize s/dasum function for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-03-05 16:22:36 -06:00
Rajalakshmi Srinivasaraghavan	0571c3187b	POWER10: Rename mma builtins The LLVM and GCC teams agreed to rename the __builtin_mma_assemble_pair and __builtin_mma_disassemble_pair built-ins to __builtin_vsx_assemble_pair and __builtin_vsx_disassemble_pair respectively. This patch is to make corresponding changes in dgemm kernel. Also made changes in inputs to those builtins to avoid some potential typecasting issues. Reference gcc commit id:77ef995c1fbcab76a2a69b9f4700bcfd005d8e62	2021-02-26 20:56:34 -06:00
Martin Kroeker	292d1af1a0	Update omatcopy_rt.c	2021-02-24 09:34:14 +01:00
Martin Kroeker	325b398e3c	Update omatcopy_rt.c	2021-02-24 09:13:12 +01:00
Martin Kroeker	6f5667b4d4	Enable optimized S/D OMATCOPY_RT	2021-02-24 09:03:41 +01:00
Martin Kroeker	cceeee7806	Add optimized omatcopy_rt	2021-02-24 09:00:54 +01:00
Martin Kroeker	0a4546b742	Typo fix	2021-02-23 13:14:35 +01:00
Martin Kroeker	b1eed27a54	Replace naive omatcopy_rt with 4x4 blocked implementation as suggested by MigMuc in issue 2532	2021-02-22 21:35:42 +01:00
Martin Kroeker	47691c031f	Use Haswell optimizations for Zen as well	2021-02-11 09:26:15 +01:00
Martin Kroeker	ce7ddd8921	Use Haswell optimizations for Zen as well	2021-02-11 09:25:36 +01:00
Martin Kroeker	950c047b49	Use Haswell optimizations for Zen as well	2021-02-11 09:24:51 +01:00
Martin Kroeker	46509953a9	Use Haswell optimizations for Zen as well	2021-02-11 09:24:16 +01:00
Martin Kroeker	db348dcff2	Enable optimized srot/drot kernels from Haswell	2021-02-11 09:23:05 +01:00
Rajalakshmi Srinivasaraghavan	2056ffc227	Optimize cscal function for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-01-29 13:51:43 -06:00
Rajalakshmi Srinivasaraghavan	3ede843d50	Optimize s/dscal function for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-01-24 07:48:28 -06:00
Martin Kroeker	69a5558203	Merge pull request #3059 from Guobing-Chen/BF16_gemm Initial code for Cooperlake BF16 GEMM kernel	2021-01-23 19:08:05 +01:00
Martin Kroeker	d6905403e3	Merge pull request #3068 from alexhenrie/scan-build scan-build fixes	2021-01-23 19:06:29 +01:00
Rajalakshmi Srinivasaraghavan	439b93f6d2	Optimize s/drot function for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-01-21 13:24:45 -06:00
Rajalakshmi Srinivasaraghavan	eff7c9166e	Optimize cdot function for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-01-15 13:40:34 -06:00
Alex Henrie	202fc9e8ed	Fix uninitialized argument value in dasum_k	2021-01-14 19:40:31 -07:00
Martin Kroeker	e378b24487	Merge pull request #3067 from albertziegenhagel/fix-generic-cmake Fix building "generic" TRMM kernel with CMake	2021-01-14 21:35:19 +01:00
Albert Ziegenhagel	e3f4063683	Fix building "generic" TRMM kernel with CMake The CMake "TARGET_CORE" variables stores the "generic" target name in all lowercase letters, but gets compared to an all uppercase string, which results in the wrong TRMM kernel being selected. This commit converts the TARGET_CORE to all uppercase before comparing its value to make sure case mismatches are not an issue in the future anymore.	2021-01-14 10:00:49 +01:00
Martin Kroeker	b716c0ef01	Add workaround for NVIDIA HPC	2021-01-12 16:51:35 +01:00
Martin Kroeker	2efa3b70dc	Add workaround for NVIDIA HPC	2021-01-12 16:49:39 +01:00
Martin Kroeker	49959d4f1c	Add workaround for NVIDIA HPC	2021-01-12 16:47:15 +01:00
Martin Kroeker	0f27a03607	Add workaround for NVIDIA HPC mishandling of the asm DOT kernels	2021-01-12 16:39:35 +01:00
Martin Kroeker	c2a8ebfe69	Add workaround for NVIDIA HPC mishandling of the asm DOT kernels	2021-01-12 16:38:51 +01:00
Martin Kroeker	43aac5bacc	Support NVIDIA HPC compiler	2021-01-12 16:36:12 +01:00
Chen, Guobing	b0beb0b1ca	Initial code for Cooperlake BF16 GEMM kernel	2021-01-11 02:15:21 +08:00
Rajalakshmi Srinivasaraghavan	601b711c78	Optimize swap function for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2021-01-08 08:01:36 -06:00
Ashwin Sekhar T K	1b2508362b	arm64: Fix nrm2 for input vectors with Inf Fix double precision nrm2 kernels returning NaN when the input vectors contain Inf/-Inf.	2021-01-01 02:49:37 -08:00
Martin Kroeker	3559c5d7a2	Merge pull request #3048 from martin-frbg/issue2998 Temporarily revert to the old NRM2 kernels for ThunderX2/3 and NeoverseN1	2020-12-21 13:30:08 +01:00
Martin Kroeker	8631e2976a	Temporarily revert to the old nrm2 kernels	2020-12-21 07:45:13 +01:00
Martin Kroeker	2768bc1764	Temporarily revert to the old nrm2 kernels	2020-12-21 07:42:51 +01:00
Martin Kroeker	6f4698ee1f	Temporarily revert to the old nrm2 kernel	2020-12-21 07:41:18 +01:00
Martin Kroeker	114eb159a4	Disable FMA intrinsics in the srot kernel when the compiler is PGI/NVIDIA	2020-12-19 22:15:58 +01:00
Martin Kroeker	005cce5507	Amend SkylakeX options to support the NVIDIA compiler	2020-12-19 22:11:49 +01:00
Xianyi Zhang	a3cac9cca0	Update sgemm kernel 1x4 for C910.	2020-12-18 11:53:23 +08:00
Martin Kroeker	c73d8ee40d	Conditionally add -mfma to compiler options where needed	2020-12-17 11:34:05 +01:00
Rajalakshmi Srinivasaraghavan	2fb11f873b	POWER10: Improve copy performance This patch aligns the stores to 32 byte boundary for scopy and dcopy before entering into vector pair loop. For ccopy, changed the store instructions to stxv to improve performance of unaligned cases.	2020-12-13 10:41:45 -06:00
Martin Kroeker	043128cbe5	Merge pull request #3029 from RajalakshmiSR/axpyp10 POWER10: Improve axpy performance	2020-12-10 22:49:28 +01:00
Martin Kroeker	3331ca492d	Merge pull request #3021 from austinpagan/trsm_p10 POWER: Added special unrolled vectorized versions of "Solve" for specific si…	2020-12-10 19:42:54 +01:00
Rajalakshmi Srinivasaraghavan	346e30a46a	POWER10: Improve axpy performance This patch aligns the stores to 32 byte boundary for saxpy and daxpy before entering into vector pair loop. Fox caxpy, changed the store instructions to stxv to improve performance of unaligned cases.	2020-12-10 11:51:42 -06:00
gxw	4b548857d6	Add msa support for loongson 1. Using core loongson3r3 and loongson3r4 for loongson 2. Add DYNAMIC_ARCH for loongson Change-Id: I1c6b54dbeca3a0cc31d1222af36a7e9bd6ab54c1	2020-12-09 10:28:46 +08:00
Martin Kroeker	7f11e33e8d	Merge pull request #3025 from TiredNotTear/develop MIPS: Fix two bugs	2020-12-08 09:39:27 +01:00
Martin Kroeker	53e0837809	Merge pull request #3022 from jinboson/develop Fix test errors reported by cblas_cgemm & cblas_ctrmm	2020-12-07 08:09:11 +01:00
Hao Chen	ad38bd0e89	Fix failed cgemv and zgemv test case after using msa optimization The cgemv and zgemv test case will call cgemv_n/t_msa.c zgemv_n/t_msa.c files in MIPS environment. When the macro CONJ is defined, the calculation result will be wrong due to the wrong definition of OP2. This patch updates the value of OP2 and passes the corresponding test.	2020-12-07 10:25:01 +08:00
Hao Chen	47b639cc9b	Fix failed sswap and dswap case by using msa optimization The swap test case will call sswap_msa.c and dswap_msa.c files in MIPS environmnet. When inc_x or inc_y is equal to zero, the calculation result of the two functions will be wrong. This patch adds the processing of inc_x or inc_y equal to zero, and the swap test case has passed.	2020-12-07 10:24:49 +08:00
Martin Kroeker	b660008c7e	Work around DOT and SWAP test failures	2020-12-06 19:15:37 +01:00
Martin Kroeker	f8346603cf	Fix compilation with SolarisStudio	2020-12-06 19:14:16 +01:00
Jin Bo	65de6f5957	Fix test errors reported by cblas_cgemm & cblas_ctrmm The file cgemm_kernel_8x4_msa.c holds the MSA optimization codes of cblas_cgemm and cblas_ctrmm. It defines two macros: CGEMM_SCALE_1X2 and CGEMM_TRMM_SCALE_1X2. The pc1 array index in the two macros should be 0 and 1.	2020-12-05 15:08:17 +08:00
Gordon Fossum	213c0e7abb	Added special unrolled vectorized versions of "Solve" for specific sizes, in DTRSM and STRSM, to improve performance in Power9 and Power10.	2020-12-04 17:07:06 -06:00
Martin Kroeker	441c08c9ff	Merge pull request #3016 from xiegengxin/complex-asum Improve the performance of zasum and casum with AVX512 intrinsic	2020-12-04 22:07:16 +01:00
Gengxin Xie	0cb7a403b2	fix error declare function blas_level1_thread_with_return_value	2020-12-02 09:51:52 +08:00
Gengxin Xie	b766c1e9bb	Improve the performance of zasum and casum with AVX512 intrinsic	2020-12-01 16:49:26 +08:00
Rajalakshmi Srinivasaraghavan	7d46e31de1	POWER10: Optimize dgemv_n Handling as 4x8 with vector pairs gives better performance than existing code in POWER10.	2020-11-29 15:28:28 -06:00
Martin Kroeker	f1bf040b25	Merge pull request #2988 from xiegengxin/smp-asum Improve the performance of dasum and sasum when SMP is defined	2020-11-22 12:24:13 +01:00
Xianyi Zhang	7037849498	Merge branch 'develop' into risc-v	2020-11-22 16:04:50 +08:00
Martin Kroeker	7e9cb39a25	Merge pull request #2981 from Qiyu8/fix-sum Fix sum optimize issues	2020-11-16 08:40:46 +01:00
Gengxin Xie	d6e7e05bb3	Improve the performance of dasum and sasum when SMP is defined	2020-11-13 14:20:52 +08:00
Qiyu8	ae0b1dea19	modify system.cmake to enable fma flag	2020-11-13 10:20:24 +08:00
Qiyu8	e0dac6b53b	fix the CI failure of target specific option mismatch	2020-11-12 20:31:03 +08:00
Qiyu8	e5c2ceb675	fix the CI failure of lack the head	2020-11-12 17:35:17 +08:00
Qiyu8	a87e537b8c	modify macro	2020-11-11 15:53:48 +08:00
Qiyu8	5bc0a7583f	only FMA3 and vector larger than 128 have positive effects.	2020-11-11 15:18:01 +08:00
Qiyu8	8c0b206d4c	Optimize the performance of rot by using universal intrinsics	2020-11-11 14:33:12 +08:00
Qiyu8	c4c591ac5a	fix sum optimize issues	2020-11-10 16:16:38 +08:00
Xianyi Zhang	fc35b72ae1	Refs #2899 Merge branch 'openblas-open-910' of git://github.com/damonyu1989/OpenBLAS into damonyu1989-openblas-open-910	2020-11-10 09:38:04 +08:00
Xianyi Zhang	913cc9a4ca	Merge branch 'develop' into risc-v	2020-11-10 09:18:25 +08:00
Martin Kroeker	ff16329cb7	Merge pull request #2972 from xiegengxin/rot-intrinsic Improve the performance of rot by using AVX512 and AVX2 intrinsic	2020-11-08 22:43:00 +01:00
Martin Kroeker	110c7a6de0	Merge pull request #2979 from RajalakshmiSR/dot_power10 Optimize sdot/ddot for POWER10	2020-11-08 10:19:34 +01:00
Rajalakshmi Srinivasaraghavan	6e364981a8	Optimize sdot/ddot for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2020-11-07 15:21:58 -06:00
Martin Kroeker	b976a0bf40	Remove previous workaround for compiler flags related to cpu capabilities in x86_64 DYNAMIC_ARCH builds	2020-11-07 20:39:56 +01:00
Martin Kroeker	ff74319ea5	Merge pull request #2977 from martin-frbg/issue2976 Fix macro name used in ifdef for POWERPC/PGI	2020-11-07 14:41:34 +01:00
Martin Kroeker	28d2dfe2b3	Fix macro name used in ifdef	2020-11-07 12:17:49 +01:00
Gengxin Xie	725ffbf041	fix typo	2020-11-05 16:25:17 +08:00
Gengxin Xie	d9ba49165a	Improve the performance of rot by using AVX512 and AVX2 intrinsic	2020-11-05 15:12:36 +08:00
Rajalakshmi Srinivasaraghavan	dd7a9cc5bf	POWER10: Change dgemm unroll factors Changing the unroll factors for dgemm to 8 shows improved performance with POWER10 MMA feature. Also made some minor changes in sgemm for edge cases.	2020-10-31 18:28:57 -05:00
Rajalakshmi Srinivasaraghavan	b435491885	Optimize caxpy for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2020-10-29 14:57:51 -05:00
Chen, Guobing	a7b1f9b1bb	Implementation of BF16 based gemv 1. Add a new API -- sbgemv to support bfloat16 based gemv 2. Implement a generic kernel for sbgemv 3. Implement an avx512-bf16 based kernel for sbgemv Signed-off-by: Chen, Guobing <guobing.chen@intel.com>	2020-10-29 02:08:23 +08:00
Martin Kroeker	67f39ad813	Merge pull request #2939 from thrasibule/Makefile_cleanup reuse variables defined in Makefile.system	2020-10-28 09:38:40 +01:00
Rajalakshmi Srinivasaraghavan	c24ba8b1dd	Optimize saxpy for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2020-10-26 13:24:59 -05:00
Martin Kroeker	6f9460f0f6	Merge pull request #2937 from martin-frbg/pwr-buffersz Increase and unify BUFFERSIZE on POWER;fix gcc inline warning	2020-10-23 07:15:32 +02:00
Guillaume Horel	1917a4e7b8	reuse variables defined in Makefile.system	2020-10-22 22:04:25 -04:00
Martin Kroeker	34c3c407ef	label always_inline function as inline to silence a gcc warning	2020-10-22 22:14:26 +02:00
Martin Kroeker	2e48d560ba	Fix compiler version check	2020-10-22 16:23:29 +02:00
Rajalakshmi Srinivasaraghavan	ad745c0bae	Optimize scopy/ccopy for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores. Also reorganized all variants of copy functions to make use of same kernel.	2020-10-21 09:53:45 -05:00
İsmail Dönmez	4a1d00f589	Fix build with -Werror=return-type dgemm_tcopy_16_skylakex.c CNAME function should return an int, add a return 0 similar to other files.	2020-10-21 08:43:39 +02:00
Bart Oldeman	b073d759d0	x86_64: clobber all xmm registers after vzeroupper As observed using GCC 10 using -march=native -ftree-vectorize on Knights Landing, it is now smart enough to find clobbers inside non-inlined static functions. In particular, sgemv counted on a kernel to preserve the whole %ymm2 register (since it was not in the clobber list), but the top part was destroyed by vzeroupper. This caused many tests to fail. This patch makes sure all xmm (and ymm/zmm by extension) registers are listed as clobbered to avoid this happening, as most kernels already did correctly in fact.	2020-10-20 02:16:47 +00:00
Martin Kroeker	dc6e44c3f8	Merge pull request #2916 from martin-frbg/issue2911 Clean up duplicate definitions in POWER8 kernels and fix power10 option passing	2020-10-19 23:33:31 +02:00
Martin Kroeker	a61c086408	Fix spurious trailing whitespace in comment	2020-10-19 09:12:12 +02:00
Bart Oldeman	03e781b766	sgemm_direct_skylakex: fix `75eeb26` regression. The `#if defined(SKYLAKEX) \|\| defined (COOPERLAKE)` from that commit was before #include "common.h" so caused the compiled function to be empty, returning garbage results for qualifying sgemm's on those architectures. Closes #2914	2020-10-18 19:58:07 +00:00
Martin Kroeker	f1a4071d8c	Clean up STACKSIZE redefinition	2020-10-18 19:41:43 +02:00
Martin Kroeker	97cf10062f	Clean up STACKSIZE redefinition	2020-10-18 19:39:18 +02:00
Martin Kroeker	17e288e18d	Clean up STACKSIZE redefinition	2020-10-18 19:37:04 +02:00
Martin Kroeker	c1422f3e46	Clean up STACKSIZE redefinition	2020-10-18 19:31:01 +02:00
Martin Kroeker	d85b24e103	Clean up STACKSIZE redefinition	2020-10-18 19:29:45 +02:00
Zhang Xianyi	d7ba7679b6	Merge branch 'develop' into risc-v	2020-10-16 23:27:38 +08:00
Martin Kroeker	df70667043	fix core list for sse/sse2	2020-10-16 09:55:48 +02:00
Martin Kroeker	f071d1207a	add sse2	2020-10-15 22:10:32 +02:00
Martin Kroeker	dc6cefd2f5	Expressly enable -msse for 32bit DYNAMIC_ARCH kernels	2020-10-15 20:16:15 +02:00
Martin Kroeker	c339c40c01	Silence a redefinition warning	2020-10-15 19:08:12 +02:00
Martin Kroeker	10379fc83b	Use ifdef instead of if	2020-10-15 19:05:37 +02:00
Martin Kroeker	4c25910da0	Merge pull request #2896 from martin-frbg/intrin-double Add compiler flag for SSE4 where available	2020-10-15 11:12:35 +02:00
damonyu	ef8e7d0279	Add the support for RISC-V Vector. Change-Id: Iae7800a32f5af3903c330882cdf6f292d885f266	2020-10-15 16:09:02 +08:00
Martin Kroeker	ae6ac83991	Revert "add double precision SSE"	2020-10-15 08:37:02 +02:00
Qiyu8	4fac91ef37	adapt arm platform	2020-10-15 11:08:10 +08:00
Qiyu8	bfdf4b56da	Add double precision universal intrinsics for X86/ARM	2020-10-15 10:29:42 +08:00
Martin Kroeker	ebf0470fc2	add sse4.1 for DYNAMIC_ARCH kernels	2020-10-14 20:34:33 +02:00
Martin Kroeker	c9c3ae07af	Add double precision operations	2020-10-14 18:10:45 +02:00
Martin Kroeker	756802df61	Merge pull request #2890 from martin-frbg/s-d-sum Revert special handling of Windows xNRM2 and enable C+intrinsics kern…	2020-10-14 09:02:03 +02:00
Rajalakshmi Srinivasaraghavan	0826d68f93	POWER10: Change the packing format for bfloat16 As the new MMA instructions need the inputs in 4x2 order for bfloat16, changing the format in copy/packing code. This avoids permute instructions in the gemm kernel inner loop.	2020-10-13 16:05:10 -05:00
Rajalakshmi Srinivasaraghavan	b5d30b390d	Fix build issues with bfloat16 This patch fixes compilation errors due to recent renaming from SH to SB with BUILD_BFLOAT16.	2020-10-13 11:00:22 -05:00
Martin Kroeker	fecedc9c69	Add -mssse3	2020-10-13 11:55:41 +02:00
Martin Kroeker	0eacbca85f	Add Haswell and Zen to temporary sse3 whitelist	2020-10-13 11:42:39 +02:00
Martin Kroeker	6999086a2b	whitelist SANDYBRIDGE for SSE3	2020-10-13 10:32:19 +02:00
Martin Kroeker	8d2df7d066	Revert special handling of Windows xNRM2 and enable C+intrinsics kernel for SSUM/DSUM	2020-10-13 00:14:29 +02:00
Martin Kroeker	08929430cd	Merge pull request #2886 from martin-frbg/issue_2767 Rename "HALF" precision functions (sh prefix) to "BFLOAT16" with "sb" prefix	2020-10-13 00:04:35 +02:00
Martin Kroeker	0c84ffe05f	Merge pull request #2881 from mattip/fninit add fninit to reset fpu registers before assembler routines	2020-10-12 23:50:41 +02:00
Matti Picus	403eb513a0	use emms instead, add WIN guards	2020-10-12 18:15:01 +03:00
Qiyu8	0ed1f07660	Optimize the performance of sum by using universal intrinsics	2020-10-12 19:48:53 +08:00
Martin Kroeker	3aecafad80	Change "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-12 00:00:55 +02:00
Martin Kroeker	756062afa5	Rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:56:17 +02:00
Martin Kroeker	2061f7fdff	Rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:54:53 +02:00
Martin Kroeker	dc8a1afa63	Rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:53:50 +02:00
Martin Kroeker	fd94236042	Rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:42:07 +02:00
Martin Kroeker	68ce719fac	Rename shdot_microk_cooperlake.c to sbdot_microk_cooperlake.c	2020-10-11 23:41:13 +02:00
Martin Kroeker	d7dd9b396c	Rename shdot.c to sbdot.c	2020-10-11 23:40:43 +02:00
Martin Kroeker	9ae80490e0	rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:39:42 +02:00
Martin Kroeker	d314d1f49f	Rename shgemm_kernel_power10.c to sbgemm_kernel_power10.c	2020-10-11 23:37:38 +02:00
Martin Kroeker	c589c3e2a1	Merge pull request #2882 from martin-frbg/issue2709 Use generic C for (D/Z)NRM2 on Windows x86_64	2020-10-11 22:22:30 +02:00
Martin Kroeker	ec638a82bf	Merge pull request #2852 from martin-frbg/issue2588-cmake Support building only a subset of variable types	2020-10-11 22:21:33 +02:00
Martin Kroeker	6b6adf8a4a	Allow compiling only a subset of kernels for specific variable types	2020-10-11 14:52:09 +02:00
Martin Kroeker	ac653c94f3	Merge branch 'develop' into issue2588-cmake	2020-10-11 13:57:07 +02:00
Martin Kroeker	7a53128481	Add whitelist of DYNAMIC_ARCH kernels for which -msse3 needs to be enabled	2020-10-11 01:06:46 +02:00
Martin Kroeker	e1b7123bbe	Merge pull request #2867 from Qiyu8/usimd-floatdot Optimize the performance of dot by using universal intrinsics in X86/ARM	2020-10-10 12:10:25 +02:00
Qiyu8	f32d34a015	add sse3 compiler flag	2020-10-10 10:36:15 +08:00
Martin Kroeker	7812486091	Use generic C for D/Z nrm2 kernels on Windows to work around fpu exception bug	2020-10-06 21:33:16 +02:00
Matti Picus	a5b164946c	add fninit to reset fpu registers before assembler routines	2020-10-05 22:13:25 +03:00
User User-User	d2333e7842	aarch64 fix std=c18 compilation	2020-10-03 18:00:34 +03:00
Qiyu8	60e6c68e38	Adapt ARM architect	2020-09-29 16:36:14 +08:00
Qiyu8	1b1a757f5f	Optimize the performance of dot by using universal intrinsics in X86/ARM	2020-09-28 20:36:53 +08:00
Rajalakshmi Srinivasaraghavan	2df4235e00	Optimize dcopy/zcopy for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores. Tested in simulator and no new failures.	2020-09-27 21:42:32 -05:00
Martin Kroeker	dfbc62ef7e	Support building only a subset of types	2020-09-22 23:25:59 +02:00
Qiyu8	14f7dad3b7	performance improved	2020-09-22 16:52:15 +08:00
Qiyu8	325b539c26	Optimize the performance of daxpy by using universal intrinsics	2020-09-22 10:38:35 +08:00
Marius Hillenbrand	22aa81f3e5	s390x: fix cscal and zscal implementations The implementation of complex scalar * vector multiplication for Z14 makes some LAPACK tests fail because the numerical differences to the reference implementation exceed the threshold (as can be seen by running make lapack-test and replacing kernel/zarch/cscal.c with a generic implementation for comparison). The complex multiplication uses terms of the form a * b + c * d for both real and imaginary parts. The assembly code (and compiler-emitted code as well) uses fused multiply add operations for the second product and sum. The results can be "surprising", for example when both terms in the imaginary part nearly cancel each other out. In that case, the second product contributes more digits to the sum than the first product that has been rounded before. One option is to use separate multiplications (which then round the same way) and a distinct add. Change the code to pursue that path, by (1) requesting the compiler not to contract the operations into FMAs and (2) replacing the assembly kernel with corresponding vectorized C code (where change 1 also applies). Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-21 13:10:05 +02:00
Marius Hillenbrand	f91057cbad	s390x: move common vector definitions and utils into header ... to facilitate reuse beyond gemm_vec.c and avoid code duplication. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-21 11:32:08 +02:00
Rajalakshmi Srinivasaraghavan	be43d2cb96	Optimize daxpy/zaxpy for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores. Tested in simulator and no new failures.	2020-09-17 12:56:28 -05:00
Martin Kroeker	91c84e1c01	Merge pull request #2796 from Guobing-Chen/BF16_dot_coversion_apis Add bfloat16 based dot and conversion with single/double	2020-09-14 15:00:19 +02:00
Martin Kroeker	e72430fe46	Merge pull request #2803 from xiegengxin/AVX2-asum Implementaion of dasum, sasum with AVX2 & AVX512 intrinsic	2020-09-06 18:32:15 +02:00
Chen, Guobing	deaeb6c5b8	Add bfloat16 based dot and conversion with single/double 1. Added bfloat16 based dot as new API: shdot 2. Implemented generic kernel and cooperlake-specific (AVX512-BF16) kernel for shdot 3. Added 4 conversion APIs for bfloat16 data type <=> single/double: shstobf16 shdtobf16 sbf16tos dbf16tod shstobf16 -- convert single float array to bfloat16 array shdtobf16 -- convert double float array to bfloat16 array sbf16tos -- convert bfloat16 array to single float array dbf16tod -- convert bfloat16 array to double float array 4. Implemented generic kernels for all 4 conversion APIs, and cooperlake-specific kernel for shstobf16 and shdtobf16 5. Update level1 thread facilitate functions and macros to support multi-threading for these new APIs 6. Fix Cooperlake platform detection/specify issue when under dynamic-arch building 7. Change the typedef of bfloat16 from unsigned short to more strict uint16_t Signed-off-by: Chen, Guobing <guobing.chen@intel.com>	2020-09-04 02:31:25 +08:00
Martin Kroeker	775a87242d	Rename KERNEL.SILICON to KERNEL.VORTEX	2020-09-03 08:44:20 +02:00
Gengxin Xie	1b0f17eeed	align to 64, using SSE when input size is small	2020-09-03 14:25:54 +08:00
Martin Kroeker	80794fe8fd	Create KERNEL.SILICON	2020-09-02 22:56:58 +02:00
Marius Hillenbrand	2ee5b899ce	s390x: enable S/DGEMM block with explicit loop unrolling + interleaving with clang The code for SGEMM 16x4 and DGEMM 8x4 blocks on z14 and z15 uses explicit unrolling and interleaving to improve performance. The code employs an empty inline asm statement with operands that constrain the compiler's instruction scheduling and thereby enforce proper overlapping of load and compute phases. Fix an ifdef to apply that for clang builds, as well. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-02 13:49:31 +02:00
Marius Hillenbrand	87e5bbd887	s390x: avoid variable-length arrays in struct for asm operands ... since it is not required and clang does not support that gcc extension. Instead, use a variable-length array directly for these operands. Note that, while the actual inline assembly code does not directly use these memory operands, they serve to inform the compiler that it cannot reorder reads or writes to/from the input and output data across the inline asm statements. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-02 13:49:31 +02:00
Marius Hillenbrand	b9b3265ec8	s390x: avoid inline assembly for vector loads for clang ... since clang does not support the instruction format for inline assembly and also it is not required for current versions of clang. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-02 13:49:30 +02:00
Marius Hillenbrand	a1616a0b86	s390x: replace nop with "nop 0" in inline assembly ... as a bandaid for building with clang until LLVM's internal assembler supports nops without operand. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-02 13:49:30 +02:00
Marius Hillenbrand	60ef193258	s390x: use "lghi" for immediate values to fix build with clang Some of the kernels written in assembly utilize a "load address" instruction for loading an immediate value into a register. That is both unnecessarily complex and LLVM's assembler does not understand that specific syntax. Thus, replace with the appropriate "load immediate" instruction, which is also clearer to read. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-02 13:49:30 +02:00
Gengxin Xie	448152cdd8	define __AVX2__ to ensure the haswell code compiled with avx2	2020-08-31 14:39:08 +08:00
Gengxin Xie	cb3c190a3a	Implementaion of dasum, sasum with AVX2 & AVX512 intrinsic	2020-08-31 11:44:08 +08:00
Rajalakshmi Srinivasaraghavan	317ff27cda	POWER10: Avoid setting accumulators to zero in gemm kernels For the first iteration, it is better to use xvfger instead of xvfgerpp builtins which helps to avoid setting accumulators to zero. This helps to reduce few instructions.	2020-08-28 10:42:54 -05:00
Martin Kroeker	b2053239fc	Fix mssing dummy parameter (imag part of alpha) of zdot_thread_function	2020-08-23 15:08:16 +02:00
Martin Kroeker	9ee21a0a39	Merge pull request #2780 from Guobing-Chen/CPL_build_support Enable COOPERLAKE build target	2020-08-20 19:54:29 +02:00
Martin Kroeker	6f4dc7445d	Fix typo	2020-08-19 16:36:55 +02:00
Martin Kroeker	81fbe8d088	-march=cooperlake only available in gcc >= 10	2020-08-19 16:10:15 +02:00
Martin Kroeker	75eeb265d7	[WIP] Refactor the driver code for direct SGEMM (#2782 ) Move "direct SGEMM" functionality out of the SkylakeX SGEMM kernel and make it available (on x86_64 targets only for now) in DYNAMIC_ARCH builds * Add sgemm_direct targets in the kernel Makefile.L3 and CMakeLists.txt * Add direct_sgemm functions to the gotoblas struct in common_param.h * Move sgemm_direct_performant helper to separate file * Update gemm.c to macros for sgemm_direct to support dynamic_arch naming via common_s,h * (Conditionally) add sgemm_direct functions in setparam-ref.c	2020-08-19 14:51:09 +02:00
Chen, Guobing	e740c4873d	Enable COOPERLAKE build target Enable new build target platform -- COOPERLAKE. This target platform supports all the SKYLAKEX supported ISAs + avx512bf16. So all the SKYLAKEX specific kernels/drivers and related code are now extended to be also active on COOPERLAKE. Besides, new BF16 related kernels are active under this target.	2020-08-13 06:18:00 +08:00
Martin Kroeker	cbbe38bb88	Merge pull request #2772 from mhillenibm/s390x_gemm_tuning s390x: GEMM tuning for z14	2020-08-11 18:14:09 +02:00
Marius Hillenbrand	07c334e7be	s390x: Factor out small block sizes for SGEMM/DGEMM on z14 For small register blockings that are too small to fill up vector registers with column vectors, we currently use a generic code block. Replace that with instantiations of the generic code as individual functions, so that the compiler can optimize each one separately. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-08-11 12:56:39 +02:00
Marius Hillenbrand	e2828e30aa	s390x: Optimize SGEMM/DGEMM blocks for z14 with explicit loop unrolling/interleaving Improve performance of SGEMM and DGEMM on z14 and z15 by unrolling and interleaving the inner loop of the SGEMM 16x4 and DGEMM 8x4 blocks. Specifically, we explicitly interleave vector register loads and computation of two iterations. Note that this change only adds one C function, since SGEMM 16x4 and DGEMM 8x4 actually map to the same C code: they both hold intermediate results in a 4x4 grid of vector registers, and the C implementation is built around that. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-08-11 12:55:42 +02:00
Rajalakshmi Srinivasaraghavan	475b5c95b9	Remove extra symbol in Makefile While trying out different unroll values, noted that make failed due to this extra symbol.	2020-08-07 15:27:44 -05:00
Martin Kroeker	81dcfdcf39	Multiply by 2 instead of left-shifting a potentially negative number fixes GCC ubsan warning in the BLAS tests	2020-08-02 18:29:56 +02:00
Martin Kroeker	0ef4b3f1f2	Multiply instead of doing a left shift of a potentially negative number fixes GCC ubsan report in the BLAS tests	2020-08-02 18:27:40 +02:00
Martin Kroeker	aa53a8a5cb	Multiply by two instead of left-shifting one place fixes GCC ubsan report of "left shift of negative value -2" in the BLAS tests	2020-08-02 18:25:09 +02:00
Martin Kroeker	aa3a1e7d8c	Multiply by two rather than left shift by one place fixes GCC ubsan report of "left shift of negative value -2" in the BLAS tests	2020-08-02 18:22:31 +02:00
Rajalakshmi Srinivasaraghavan	f77b6a83f4	dgemv optimization for POWER10 Making use of new vector pair POWER10 instructions in dgemv_n and dgemv_t. Also adding a new block 4x128 to make use of Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1. Tested on simulator and there are no new test failures.	2020-07-29 18:59:32 -05:00
Rajalakshmi Srinivasaraghavan	d557584b71	Fix compilation issues with clang on POWER As gcc defaults to -malign-power, removing that option. Also adding -fno-integrated-as to use GNU assembler for powerpc assembly optimization files. Fixed other compilation errors reported in dgemv_t.c file.	2020-07-27 14:11:07 -05:00
Ashwin Sekhar T K	4e1be0e481	ARM64: Add THUNDERX3T110 Target	2020-07-26 23:32:24 -07:00
Rajalakshmi Srinivasaraghavan	9be2688c78	Fix to store results in correct order for POWER10 GEMM kernels There is a recent compiler change in __builtin_mma_disassemble_acc() which affects the order of storing result in POWER10. Also removing new LDFLAG -mno-power10-stub as it is handled by linker automatically.	2020-07-24 23:08:11 -05:00
Martin Kroeker	6a2a60038c	Merge pull request #2720 from martin-frbg/issue2694 WIP Further fixes for 32bit POWER8	2020-07-24 23:19:45 +02:00
Martin Kroeker	251a09ec90	Typo fix	2020-07-24 16:04:58 +00:00
Martin Kroeker	95d37e1575	Regroup the 32 and 64bit sections and restore 64bit CAXPY	2020-07-24 10:13:46 +00:00
Martin Kroeker	3523bb778e	Merge pull request #2721 from martin-frbg/p8align Fix alignment errors in the power8 saxpy kernel	2020-07-24 11:06:20 +02:00
Martin Kroeker	bf1f0734ff	Use OPENBLAS_MAKE_COMPLEX_FLOAT on PPC only	2020-07-23 20:40:13 +00:00
Martin Kroeker	ca3561cab9	Add ifdefs around call to altivec microkernel	2020-07-23 18:30:42 +00:00
Martin Kroeker	21072e502a	Typo fix	2020-07-23 17:34:56 +00:00
Martin Kroeker	7c6e56b5df	Rewrite assignment to complex for better portability	2020-07-23 17:10:59 +02:00
Martin Kroeker	661c6bfa5a	Exclude altivec code paths if the compiler does not support them	2020-07-23 17:08:20 +02:00
Martin Kroeker	0033f8be0d	Use vec_vsx_ld/st to fix misaligned accesses flagged by asan	2020-07-16 23:32:54 +02:00
Martin Kroeker	f308e741b2	remove debug output and revert changes to cdot and crot	2020-07-15 10:00:07 +02:00
Martin Kroeker	da17abec87	fix trailing whitespace	2020-07-14 18:20:03 +02:00
Martin Kroeker	f8c2697701	Use POWER6 GEMM, TRMM and DTRSM on 32bit POWER8	2020-07-14 18:11:19 +02:00
Martin Kroeker	b144423f0f	Do not define USE_TRMM for 32bit POWER8	2020-07-14 18:10:12 +02:00
Martin Kroeker	ed7e155c35	Merge branch 'develop' into aix	2020-07-07 18:52:06 +02:00
EGuesnet	634e1305f9	Update cgemm_kernel_8x4_power8.S	2020-06-30 15:16:39 +02:00
Martin Kroeker	28d69e0097	Merge pull request #2687 from martin-frbg/utfbom Strip UTF8 byte order marker from source files	2020-06-26 22:53:09 +02:00
Martin Kroeker	c2467c9619	Merge pull request #2686 from RajalakshmiSR/p10_shgemm powerpc: Optimized SHGEMM kernel for POWER10	2020-06-26 22:52:45 +02:00
Martin Kroeker	d199c2787d	Merge pull request #2680 from kavanabhat/aix_makefile_fix Fix for #2671	2020-06-26 11:27:28 +02:00
Martin Kroeker	e30ad0e521	Strip UTF8 byte order marker from source	2020-06-26 09:00:43 +02:00
Rajalakshmi Srinivasaraghavan	d23419accc	powerpc: Optimized SHGEMM kernel for POWER10 This patch introduces new optimized version of SHGEMM kernel using power10 Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1. This patch makes use of new POWER10 compute instructions for matrix multiplication operation. Tested on simulator and there are no new test failures.	2020-06-25 22:19:08 -05:00
Martin Kroeker	c854ef5471	Fix variable names in conditional	2020-06-25 13:29:52 +02:00
Martin Kroeker	c0afc11742	Fix POWERPC builds on AIX (gcc/gfortran 7) 1. macro preprocessing for POWER8 and later kernels only 2. default buffer size used by AIX version of m4 is too small	2020-06-25 13:12:36 +02:00
Gordon Fossum	bb2f52844b	powerpc: Optimized ZGEMM kernel for POWER10 This patch introduces new optimized version of ZGEMM kernel using power10 Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1. This patch makes use of new POWER10 compute instructions for matrix multiplication operation. Tested on simulator and there are no new test failures. Cycles count reduced by 30-50% compared to POWER9 version depending on M/N/K sizes.	2020-06-24 14:50:12 -05:00
Rajalakshmi Srinivasaraghavan	571eadb880	powerpc: Optimized SGEMM/DGEMM/CGEMM for POWER10 This patch introduces new optimized version of SGEMM, CGEMM and DGEMM using power10 Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1. This patch makes use of new POWER10 compute instructions for matrix multiplication operation. Tested on simulator and there are no new test failures. Cycles count reduced by 30-50% compared to POWER9 version depending on M/N/K sizes. MMA GCC patch for reference: https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=8ee2640bfdc62f835ec9740278f948034bc7d9f1	2020-06-24 14:48:15 -05:00
Kavana Bhat	df4ade070f	Fix for #2671	2020-06-24 04:25:47 -05:00
Martin Kroeker	93592d1260	Merge pull request #2675 from wjc404/develop AVX512 DGEMM TCOPY_16 Function	2020-06-23 09:29:02 +02:00
wjc404	086d87a302	AVX512 dgemm tcopy_16 function	2020-06-20 00:07:43 +08:00
Rajalakshmi Srinivasaraghavan	9fe930f205	powerpc: Add support for future processor This is the initial patch to support build infrastructure for POWER10 architecture.	2020-06-11 15:47:20 -05:00
ZhangDanfeng	bc6fd20a40	fix INIT8x4 Signed-off-by: ZhangDanfeng <467688405@qq.com>	2020-06-10 01:01:16 +08:00
Martin Kroeker	89091e6b64	Merge pull request #2645 from martin-frbg/misc_fixes Miscellaneous fixes	2020-06-07 19:44:50 +02:00
Martin Kroeker	c3574ffe53	Merge pull request #2646 from wjc404/develop Optimize AVX512 parallel DGEMM performance	2020-06-07 13:18:22 +02:00
wjc404	0e3ac4a06b	Add files via upload	2020-06-06 14:56:57 +08:00
Martin Kroeker	7f60fb6b91	Delete spurious copy of common_param.h	2020-06-05 10:04:16 +02:00
ZhangDanfeng	9b7877ccf1	sgemm copy source init Signed-off-by: ZhangDanfeng <467688405@qq.com>	2020-06-04 02:10:45 +08:00
ZhangDanfeng	f82fa802d1	Insert prefetch Signed-off-by: ZhangDanfeng <467688405@qq.com>	2020-06-04 02:08:48 +08:00
Martin Kroeker	b1ee81228a	Change complex DOT and ROT to generic kernels and switch CGEMM in response to test failures seen in #2628 and BLAS-Tester	2020-06-03 09:13:29 +02:00
张丹枫	9df79ae9a3	update sgemm and strmm kernel selecting strategy	2020-05-20 22:26:58 +08:00
张丹枫	a1fc6041cd	use general register to speedup	2020-05-20 22:26:58 +08:00
张丹枫	edb423d772	align general register using to strmm_kernel_8x8	2020-05-20 22:26:58 +08:00
zhangdanfeng	0e6eb8c247	sgemm kernel use sgemm_kernel_8x8_cortexa53 Signed-off-by: zhangdanfeng <zhangdanfeng@cloudwalk.cn>	2020-05-20 22:26:58 +08:00
zhangdanfeng	d475db29c6	optimized for cortex-a53 Signed-off-by: zhangdanfeng <zhangdanfeng@cloudwalk.cn>	2020-05-20 22:26:58 +08:00
Marius Hillenbrand	89fe17f20e	s390x: Use new sgemm kernel also for DGEMM and DTRMM on Z14 Apply our new GEMM kernel implementation, written in C with vector intrinsics, also for DGEMM and DTRMM on Z14 and newer (i.e., architectures with FP32 SIMD instructions). As a result, we gain around 10% in performance on z15, in addition to improving maintainability. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-20 10:23:35 +02:00
Marius Hillenbrand	bdd795ed03	s390x/GEMM: replace 0-init with peeled first iteration ... since it gains another ~2% of SGEMM and DGEMM performance on z15; also, the code just called for that cleanup. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-20 10:23:35 +02:00
Marius Hillenbrand	2840432e49	s390x: improvise vector alignment hints for older compilers Introduce inline assembly so that we can employ vector loads with alignment hints on older compilers (pre gcc-9), since these are still used in distributions such as RHEL 8 and Ubuntu 18.04 LTS. Informing the hardware about alignment can speed up vector loads. For that purpose, we can encode hints about 8-byte or 16-byte alignment of the memory operand into the opcodes. gcc-9 and newer automatically emit such hints, where applicable. Add a bit of inline assembly that achieves the same for older compilers. Since an older binutils may not know about the additional operand for the hints, we explicitly encode the opcode in hex. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-14 15:36:03 +02:00
Marius Hillenbrand	1b0b4349a1	s390x/Z14: Change register blocking for SGEMM to 16x4 Change register blocking for SGEMM (and STRMM) on z14 from 8x4 to 16x4 by adjusting SGEMM_DEFAULT_UNROLL_M and choosing the appropriate copy implementations. Actually make KERNEL.Z14 more flexible, so that the change in param.h suffices. As a result, performance for SGEMM improves by around 30% on z15. On z14, FP SIMD instructions can operate on float-sized scalars in vector registers, while z13 could do that for double-sized scalars only. Thus, we can double the amount of elements of C that are held in registers in an SGEMM kernel. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 15:59:51 +02:00
Marius Hillenbrand	71b6eaf459	s390x: Use new sgemm kernel also for strmm on Z14 and newer Employ the newly added GEMM kernel also for STRMM on Z14. The implementation in C with vector intrinsics exploits FP32 SIMD operations and thereby gains performance over the existing assembly code. Extend the implementation for handling triangular matrix multiplication, accordingly. As added benefit, the more flexible C code enables us to adjust register blocking in the subsequent commit. Tested via make -C test / ctest / utest and by a couple of additional unit tests that exercise blocking. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 15:59:51 +02:00
Marius Hillenbrand	43c0d4f312	s390x: Add vectorized sgemm kernel for Z14 and newer Add a new GEMM kernel implementation to exploit the FP32 SIMD operations introduced with z14 and employ it for SGEMM on z14 and newer architectures. The SIMD extensions introduced with z13 support operations on double-sized scalars in vector registers. Thus, the existing SGEMM code would extend floats to doubles before operating on them. z14 extended SIMD support to operations on 32-bit floats. By employing these instructions, we can operate on twice the number of scalars per instruction (four floats in each vector registers) and avoid the conversion operations. The code is written in C with explicit vectorization. In experiments, this kernel improves performance on z14 and z15 by around 2x over the current implementation in assembly. The flexibilty of the C code paves the way for adjustments in subsequent commits. Tested via make -C test / ctest / utest and by a couple of additional unit tests that exercise blocking (e.g., partial register blocks with fewer than UNROLL_M rows and/or fewer than UNROLL_N columns). Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 15:59:51 +02:00
Martin Kroeker	2271c3506b	Work around excessive LAPACK test failures on Skylake-X Something in the plain C parts of x86_64 cscal.c and zscal.c appears to be miscompiled by both gfortran9 and ifort when compiling for skylakex-avx512, even when the optimized Haswell microkernel is not in use.	2020-05-09 23:49:18 +02:00
Rajalakshmi Srinivasaraghavan	bd9ff820bc	Fix cmake compilation issue - POWER9 This patch removes extra space in the sgemmotcopy filename thereby allowing it to create entry in kernel/Makefile created by cmake.	2020-05-08 20:31:56 -05:00
Ashwin Sekhar T K	8353cb245a	ARM64: Improve DAXPY for ThunderX2 Improve performance of DAXPY for ThunderX2 when the vector fits in L1 Cache.	2020-05-07 09:22:50 -07:00
Martin Kroeker	90dba9f716	Duplicate earlier Clang 9.0.0 workaround for corresponding Apple Clang version As discussed on the original PR #2329, the "Apple Clang 11.0.3" that appears to be based the same LLVM release produces the same miscompilation of this file.	2020-05-05 10:44:50 +02:00
Martin Kroeker	5dd14e3d48	Make building the bfloat16 functions conditional on option BUILD_HALF (#2590 ) * make building the bfloat16 BLAS functions conditional on BUILD_HALF * pass the BUILD_HALF option to gensymbol * Pass BUILD_HALF as a compiler define for dynamic_arch builds	2020-05-01 09:58:30 +02:00
Martin Kroeker	06208c8d01	Limit this fix to ELFv2 builds	2020-04-22 14:16:40 +02:00
Martin Kroeker	f5c4c28b98	Work around POWER8BE bugs on FreeBSD (ELFv2) for #2299	2020-04-21 17:17:17 +02:00
Martin Kroeker	fa42588e1f	Merge pull request #2565 from martin-frbg/mips24k Support MIPS32 24K family as P5600	2020-04-20 17:13:53 +02:00
Martin Kroeker	e55ec82bb9	Delete KERNEL.1004K	2020-04-19 15:44:30 +02:00
Martin Kroeker	7353ea5afc	Delete KERNEL.24K	2020-04-19 15:44:19 +02:00
Martin Kroeker	6a04efb122	Rename KERNEL files to include MIPS prefix	2020-04-19 15:43:54 +02:00
Martin Kroeker	d712ea724c	Add MIPS24K support	2020-04-18 21:10:18 +02:00
Rajalakshmi Srinivasaraghavan	22bb50fb81	cmake fixes	2020-04-17 13:35:17 -05:00
Rajalakshmi Srinivasaraghavan	67cc4b9e16	Fix warnings in clang and export symbol	2020-04-15 19:15:23 -05:00
Rajalakshmi Srinivasaraghavan	a87793e03c	Fix DYNAMIC_ARCH compilation errors	2020-04-15 09:09:50 -05:00
Rajalakshmi Srinivasaraghavan	ff010f496e	Build shgemm for all architecture	2020-04-14 20:38:53 -05:00
Rajalakshmi Srinivasaraghavan	7eb55504b1	RFC : Add half precision gemm for bfloat16 in OpenBLAS This patch adds support for bfloat16 data type matrix multiplication kernel. For architectures that don't support bfloat16, it is defined as unsigned short (2 bytes). Default unroll sizes can be changed as per architecture as done for SGEMM and for now 8 and 4 are used for M and N. Size of ncopy/tcopy can be changed as per architecture requirement and for now, size 2 is used. Added shgemm in kernel/power/KERNEL.POWER9 and tested in powerpc64le and powerpc64. For reference, added a small test compare_sgemm_shgemm.c to compare sgemm and shgemm output. This patch does not cover OpenBLAS test, benchmark and lapack tests for shgemm. Complex type implementation can be discussed and added once this is approved.	2020-04-14 14:55:08 -05:00
Martin Kroeker	5b0093b5fe	Convert aligned moves to unaligned should have no performance impact on reasonably modern cpus and fixes occasional crashes in actual user code.	2020-04-13 14:58:52 +02:00
Martin Kroeker	e9bfa2291a	Fix parameter overflow	2020-04-12 19:47:02 +02:00
gxw	8d07cf9b67	Fix compilation problem on loongson platform Using "make TARGET=GENERIC" on loongson platform will get the following error messages: "make[1]: *** No rule to make target 'sgemm_incopy.o', needed by 'libs'" Add kernel/mips64/KERNEL.generic to slove the problem.	2020-04-09 19:28:15 +08:00
Martin Kroeker	806f89166e	Make ARMV7 compile with xcode and add a CI job for it (#2537 ) * Add an ARMV7 iOS build on Travis * thread_local appears to be unavailable on ARMV7 iOS * Add no-thumb option for ARMV7 IOS build to get it to accept DMB ISH * Make local labels in macros of nrm2_vfpv3.S compatible with the xcode assembler	2020-04-02 10:30:37 +02:00
Martin Kroeker	c6af9bbb32	Merge pull request #2534 from martin-frbg/issue2496 Fix zero initialization for beta=0 case	2020-03-31 20:53:13 +02:00

... 5 6 7 8 9 ...

2023 Commits