OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Martin Kroeker	d85b24e103	Clean up STACKSIZE redefinition	2020-10-18 19:29:45 +02:00
Zhang Xianyi	d7ba7679b6	Merge branch 'develop' into risc-v	2020-10-16 23:27:38 +08:00
Martin Kroeker	df70667043	fix core list for sse/sse2	2020-10-16 09:55:48 +02:00
Martin Kroeker	f071d1207a	add sse2	2020-10-15 22:10:32 +02:00
Martin Kroeker	dc6cefd2f5	Expressly enable -msse for 32bit DYNAMIC_ARCH kernels	2020-10-15 20:16:15 +02:00
Martin Kroeker	c339c40c01	Silence a redefinition warning	2020-10-15 19:08:12 +02:00
Martin Kroeker	10379fc83b	Use ifdef instead of if	2020-10-15 19:05:37 +02:00
Martin Kroeker	4c25910da0	Merge pull request #2896 from martin-frbg/intrin-double Add compiler flag for SSE4 where available	2020-10-15 11:12:35 +02:00
damonyu	ef8e7d0279	Add the support for RISC-V Vector. Change-Id: Iae7800a32f5af3903c330882cdf6f292d885f266	2020-10-15 16:09:02 +08:00
Martin Kroeker	ae6ac83991	Revert "add double precision SSE"	2020-10-15 08:37:02 +02:00
Qiyu8	4fac91ef37	adapt arm platform	2020-10-15 11:08:10 +08:00
Qiyu8	bfdf4b56da	Add double precision universal intrinsics for X86/ARM	2020-10-15 10:29:42 +08:00
Martin Kroeker	ebf0470fc2	add sse4.1 for DYNAMIC_ARCH kernels	2020-10-14 20:34:33 +02:00
Martin Kroeker	c9c3ae07af	Add double precision operations	2020-10-14 18:10:45 +02:00
Martin Kroeker	756802df61	Merge pull request #2890 from martin-frbg/s-d-sum Revert special handling of Windows xNRM2 and enable C+intrinsics kern…	2020-10-14 09:02:03 +02:00
Rajalakshmi Srinivasaraghavan	0826d68f93	POWER10: Change the packing format for bfloat16 As the new MMA instructions need the inputs in 4x2 order for bfloat16, changing the format in copy/packing code. This avoids permute instructions in the gemm kernel inner loop.	2020-10-13 16:05:10 -05:00
Rajalakshmi Srinivasaraghavan	b5d30b390d	Fix build issues with bfloat16 This patch fixes compilation errors due to recent renaming from SH to SB with BUILD_BFLOAT16.	2020-10-13 11:00:22 -05:00
Martin Kroeker	fecedc9c69	Add -mssse3	2020-10-13 11:55:41 +02:00
Martin Kroeker	0eacbca85f	Add Haswell and Zen to temporary sse3 whitelist	2020-10-13 11:42:39 +02:00
Martin Kroeker	6999086a2b	whitelist SANDYBRIDGE for SSE3	2020-10-13 10:32:19 +02:00
Martin Kroeker	8d2df7d066	Revert special handling of Windows xNRM2 and enable C+intrinsics kernel for SSUM/DSUM	2020-10-13 00:14:29 +02:00
Martin Kroeker	08929430cd	Merge pull request #2886 from martin-frbg/issue_2767 Rename "HALF" precision functions (sh prefix) to "BFLOAT16" with "sb" prefix	2020-10-13 00:04:35 +02:00
Martin Kroeker	0c84ffe05f	Merge pull request #2881 from mattip/fninit add fninit to reset fpu registers before assembler routines	2020-10-12 23:50:41 +02:00
Matti Picus	403eb513a0	use emms instead, add WIN guards	2020-10-12 18:15:01 +03:00
Qiyu8	0ed1f07660	Optimize the performance of sum by using universal intrinsics	2020-10-12 19:48:53 +08:00
Martin Kroeker	3aecafad80	Change "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-12 00:00:55 +02:00
Martin Kroeker	756062afa5	Rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:56:17 +02:00
Martin Kroeker	2061f7fdff	Rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:54:53 +02:00
Martin Kroeker	dc8a1afa63	Rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:53:50 +02:00
Martin Kroeker	fd94236042	Rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:42:07 +02:00
Martin Kroeker	68ce719fac	Rename shdot_microk_cooperlake.c to sbdot_microk_cooperlake.c	2020-10-11 23:41:13 +02:00
Martin Kroeker	d7dd9b396c	Rename shdot.c to sbdot.c	2020-10-11 23:40:43 +02:00
Martin Kroeker	9ae80490e0	rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:39:42 +02:00
Martin Kroeker	d314d1f49f	Rename shgemm_kernel_power10.c to sbgemm_kernel_power10.c	2020-10-11 23:37:38 +02:00
Martin Kroeker	c589c3e2a1	Merge pull request #2882 from martin-frbg/issue2709 Use generic C for (D/Z)NRM2 on Windows x86_64	2020-10-11 22:22:30 +02:00
Martin Kroeker	ec638a82bf	Merge pull request #2852 from martin-frbg/issue2588-cmake Support building only a subset of variable types	2020-10-11 22:21:33 +02:00
Martin Kroeker	6b6adf8a4a	Allow compiling only a subset of kernels for specific variable types	2020-10-11 14:52:09 +02:00
Martin Kroeker	ac653c94f3	Merge branch 'develop' into issue2588-cmake	2020-10-11 13:57:07 +02:00
Martin Kroeker	7a53128481	Add whitelist of DYNAMIC_ARCH kernels for which -msse3 needs to be enabled	2020-10-11 01:06:46 +02:00
Martin Kroeker	e1b7123bbe	Merge pull request #2867 from Qiyu8/usimd-floatdot Optimize the performance of dot by using universal intrinsics in X86/ARM	2020-10-10 12:10:25 +02:00
Qiyu8	f32d34a015	add sse3 compiler flag	2020-10-10 10:36:15 +08:00
Martin Kroeker	7812486091	Use generic C for D/Z nrm2 kernels on Windows to work around fpu exception bug	2020-10-06 21:33:16 +02:00
Matti Picus	a5b164946c	add fninit to reset fpu registers before assembler routines	2020-10-05 22:13:25 +03:00
User User-User	d2333e7842	aarch64 fix std=c18 compilation	2020-10-03 18:00:34 +03:00
Qiyu8	60e6c68e38	Adapt ARM architect	2020-09-29 16:36:14 +08:00
Qiyu8	1b1a757f5f	Optimize the performance of dot by using universal intrinsics in X86/ARM	2020-09-28 20:36:53 +08:00
Rajalakshmi Srinivasaraghavan	2df4235e00	Optimize dcopy/zcopy for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores. Tested in simulator and no new failures.	2020-09-27 21:42:32 -05:00
Martin Kroeker	dfbc62ef7e	Support building only a subset of types	2020-09-22 23:25:59 +02:00
Qiyu8	14f7dad3b7	performance improved	2020-09-22 16:52:15 +08:00
Qiyu8	325b539c26	Optimize the performance of daxpy by using universal intrinsics	2020-09-22 10:38:35 +08:00
Marius Hillenbrand	22aa81f3e5	s390x: fix cscal and zscal implementations The implementation of complex scalar * vector multiplication for Z14 makes some LAPACK tests fail because the numerical differences to the reference implementation exceed the threshold (as can be seen by running make lapack-test and replacing kernel/zarch/cscal.c with a generic implementation for comparison). The complex multiplication uses terms of the form a * b + c * d for both real and imaginary parts. The assembly code (and compiler-emitted code as well) uses fused multiply add operations for the second product and sum. The results can be "surprising", for example when both terms in the imaginary part nearly cancel each other out. In that case, the second product contributes more digits to the sum than the first product that has been rounded before. One option is to use separate multiplications (which then round the same way) and a distinct add. Change the code to pursue that path, by (1) requesting the compiler not to contract the operations into FMAs and (2) replacing the assembly kernel with corresponding vectorized C code (where change 1 also applies). Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-21 13:10:05 +02:00
Marius Hillenbrand	f91057cbad	s390x: move common vector definitions and utils into header ... to facilitate reuse beyond gemm_vec.c and avoid code duplication. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-21 11:32:08 +02:00
Rajalakshmi Srinivasaraghavan	be43d2cb96	Optimize daxpy/zaxpy for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores. Tested in simulator and no new failures.	2020-09-17 12:56:28 -05:00
Martin Kroeker	91c84e1c01	Merge pull request #2796 from Guobing-Chen/BF16_dot_coversion_apis Add bfloat16 based dot and conversion with single/double	2020-09-14 15:00:19 +02:00
Martin Kroeker	e72430fe46	Merge pull request #2803 from xiegengxin/AVX2-asum Implementaion of dasum, sasum with AVX2 & AVX512 intrinsic	2020-09-06 18:32:15 +02:00
Chen, Guobing	deaeb6c5b8	Add bfloat16 based dot and conversion with single/double 1. Added bfloat16 based dot as new API: shdot 2. Implemented generic kernel and cooperlake-specific (AVX512-BF16) kernel for shdot 3. Added 4 conversion APIs for bfloat16 data type <=> single/double: shstobf16 shdtobf16 sbf16tos dbf16tod shstobf16 -- convert single float array to bfloat16 array shdtobf16 -- convert double float array to bfloat16 array sbf16tos -- convert bfloat16 array to single float array dbf16tod -- convert bfloat16 array to double float array 4. Implemented generic kernels for all 4 conversion APIs, and cooperlake-specific kernel for shstobf16 and shdtobf16 5. Update level1 thread facilitate functions and macros to support multi-threading for these new APIs 6. Fix Cooperlake platform detection/specify issue when under dynamic-arch building 7. Change the typedef of bfloat16 from unsigned short to more strict uint16_t Signed-off-by: Chen, Guobing <guobing.chen@intel.com>	2020-09-04 02:31:25 +08:00
Martin Kroeker	775a87242d	Rename KERNEL.SILICON to KERNEL.VORTEX	2020-09-03 08:44:20 +02:00
Gengxin Xie	1b0f17eeed	align to 64, using SSE when input size is small	2020-09-03 14:25:54 +08:00
Martin Kroeker	80794fe8fd	Create KERNEL.SILICON	2020-09-02 22:56:58 +02:00
Marius Hillenbrand	2ee5b899ce	s390x: enable S/DGEMM block with explicit loop unrolling + interleaving with clang The code for SGEMM 16x4 and DGEMM 8x4 blocks on z14 and z15 uses explicit unrolling and interleaving to improve performance. The code employs an empty inline asm statement with operands that constrain the compiler's instruction scheduling and thereby enforce proper overlapping of load and compute phases. Fix an ifdef to apply that for clang builds, as well. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-02 13:49:31 +02:00
Marius Hillenbrand	87e5bbd887	s390x: avoid variable-length arrays in struct for asm operands ... since it is not required and clang does not support that gcc extension. Instead, use a variable-length array directly for these operands. Note that, while the actual inline assembly code does not directly use these memory operands, they serve to inform the compiler that it cannot reorder reads or writes to/from the input and output data across the inline asm statements. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-02 13:49:31 +02:00
Marius Hillenbrand	b9b3265ec8	s390x: avoid inline assembly for vector loads for clang ... since clang does not support the instruction format for inline assembly and also it is not required for current versions of clang. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-02 13:49:30 +02:00
Marius Hillenbrand	a1616a0b86	s390x: replace nop with "nop 0" in inline assembly ... as a bandaid for building with clang until LLVM's internal assembler supports nops without operand. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-02 13:49:30 +02:00
Marius Hillenbrand	60ef193258	s390x: use "lghi" for immediate values to fix build with clang Some of the kernels written in assembly utilize a "load address" instruction for loading an immediate value into a register. That is both unnecessarily complex and LLVM's assembler does not understand that specific syntax. Thus, replace with the appropriate "load immediate" instruction, which is also clearer to read. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-02 13:49:30 +02:00
Gengxin Xie	448152cdd8	define __AVX2__ to ensure the haswell code compiled with avx2	2020-08-31 14:39:08 +08:00
Gengxin Xie	cb3c190a3a	Implementaion of dasum, sasum with AVX2 & AVX512 intrinsic	2020-08-31 11:44:08 +08:00
Rajalakshmi Srinivasaraghavan	317ff27cda	POWER10: Avoid setting accumulators to zero in gemm kernels For the first iteration, it is better to use xvfger instead of xvfgerpp builtins which helps to avoid setting accumulators to zero. This helps to reduce few instructions.	2020-08-28 10:42:54 -05:00
Martin Kroeker	b2053239fc	Fix mssing dummy parameter (imag part of alpha) of zdot_thread_function	2020-08-23 15:08:16 +02:00
Martin Kroeker	9ee21a0a39	Merge pull request #2780 from Guobing-Chen/CPL_build_support Enable COOPERLAKE build target	2020-08-20 19:54:29 +02:00
Martin Kroeker	6f4dc7445d	Fix typo	2020-08-19 16:36:55 +02:00
Martin Kroeker	81fbe8d088	-march=cooperlake only available in gcc >= 10	2020-08-19 16:10:15 +02:00
Martin Kroeker	75eeb265d7	[WIP] Refactor the driver code for direct SGEMM (#2782 ) Move "direct SGEMM" functionality out of the SkylakeX SGEMM kernel and make it available (on x86_64 targets only for now) in DYNAMIC_ARCH builds * Add sgemm_direct targets in the kernel Makefile.L3 and CMakeLists.txt * Add direct_sgemm functions to the gotoblas struct in common_param.h * Move sgemm_direct_performant helper to separate file * Update gemm.c to macros for sgemm_direct to support dynamic_arch naming via common_s,h * (Conditionally) add sgemm_direct functions in setparam-ref.c	2020-08-19 14:51:09 +02:00
Chen, Guobing	e740c4873d	Enable COOPERLAKE build target Enable new build target platform -- COOPERLAKE. This target platform supports all the SKYLAKEX supported ISAs + avx512bf16. So all the SKYLAKEX specific kernels/drivers and related code are now extended to be also active on COOPERLAKE. Besides, new BF16 related kernels are active under this target.	2020-08-13 06:18:00 +08:00
Martin Kroeker	cbbe38bb88	Merge pull request #2772 from mhillenibm/s390x_gemm_tuning s390x: GEMM tuning for z14	2020-08-11 18:14:09 +02:00
Marius Hillenbrand	07c334e7be	s390x: Factor out small block sizes for SGEMM/DGEMM on z14 For small register blockings that are too small to fill up vector registers with column vectors, we currently use a generic code block. Replace that with instantiations of the generic code as individual functions, so that the compiler can optimize each one separately. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-08-11 12:56:39 +02:00
Marius Hillenbrand	e2828e30aa	s390x: Optimize SGEMM/DGEMM blocks for z14 with explicit loop unrolling/interleaving Improve performance of SGEMM and DGEMM on z14 and z15 by unrolling and interleaving the inner loop of the SGEMM 16x4 and DGEMM 8x4 blocks. Specifically, we explicitly interleave vector register loads and computation of two iterations. Note that this change only adds one C function, since SGEMM 16x4 and DGEMM 8x4 actually map to the same C code: they both hold intermediate results in a 4x4 grid of vector registers, and the C implementation is built around that. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-08-11 12:55:42 +02:00
Rajalakshmi Srinivasaraghavan	475b5c95b9	Remove extra symbol in Makefile While trying out different unroll values, noted that make failed due to this extra symbol.	2020-08-07 15:27:44 -05:00
Martin Kroeker	81dcfdcf39	Multiply by 2 instead of left-shifting a potentially negative number fixes GCC ubsan warning in the BLAS tests	2020-08-02 18:29:56 +02:00
Martin Kroeker	0ef4b3f1f2	Multiply instead of doing a left shift of a potentially negative number fixes GCC ubsan report in the BLAS tests	2020-08-02 18:27:40 +02:00
Martin Kroeker	aa53a8a5cb	Multiply by two instead of left-shifting one place fixes GCC ubsan report of "left shift of negative value -2" in the BLAS tests	2020-08-02 18:25:09 +02:00
Martin Kroeker	aa3a1e7d8c	Multiply by two rather than left shift by one place fixes GCC ubsan report of "left shift of negative value -2" in the BLAS tests	2020-08-02 18:22:31 +02:00
Rajalakshmi Srinivasaraghavan	f77b6a83f4	dgemv optimization for POWER10 Making use of new vector pair POWER10 instructions in dgemv_n and dgemv_t. Also adding a new block 4x128 to make use of Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1. Tested on simulator and there are no new test failures.	2020-07-29 18:59:32 -05:00
Rajalakshmi Srinivasaraghavan	d557584b71	Fix compilation issues with clang on POWER As gcc defaults to -malign-power, removing that option. Also adding -fno-integrated-as to use GNU assembler for powerpc assembly optimization files. Fixed other compilation errors reported in dgemv_t.c file.	2020-07-27 14:11:07 -05:00
Ashwin Sekhar T K	4e1be0e481	ARM64: Add THUNDERX3T110 Target	2020-07-26 23:32:24 -07:00
Rajalakshmi Srinivasaraghavan	9be2688c78	Fix to store results in correct order for POWER10 GEMM kernels There is a recent compiler change in __builtin_mma_disassemble_acc() which affects the order of storing result in POWER10. Also removing new LDFLAG -mno-power10-stub as it is handled by linker automatically.	2020-07-24 23:08:11 -05:00
Martin Kroeker	6a2a60038c	Merge pull request #2720 from martin-frbg/issue2694 WIP Further fixes for 32bit POWER8	2020-07-24 23:19:45 +02:00
Martin Kroeker	251a09ec90	Typo fix	2020-07-24 16:04:58 +00:00
Martin Kroeker	95d37e1575	Regroup the 32 and 64bit sections and restore 64bit CAXPY	2020-07-24 10:13:46 +00:00
Martin Kroeker	3523bb778e	Merge pull request #2721 from martin-frbg/p8align Fix alignment errors in the power8 saxpy kernel	2020-07-24 11:06:20 +02:00
Martin Kroeker	bf1f0734ff	Use OPENBLAS_MAKE_COMPLEX_FLOAT on PPC only	2020-07-23 20:40:13 +00:00
Martin Kroeker	ca3561cab9	Add ifdefs around call to altivec microkernel	2020-07-23 18:30:42 +00:00
Martin Kroeker	21072e502a	Typo fix	2020-07-23 17:34:56 +00:00
Martin Kroeker	7c6e56b5df	Rewrite assignment to complex for better portability	2020-07-23 17:10:59 +02:00
Martin Kroeker	661c6bfa5a	Exclude altivec code paths if the compiler does not support them	2020-07-23 17:08:20 +02:00
Martin Kroeker	0033f8be0d	Use vec_vsx_ld/st to fix misaligned accesses flagged by asan	2020-07-16 23:32:54 +02:00
Martin Kroeker	f308e741b2	remove debug output and revert changes to cdot and crot	2020-07-15 10:00:07 +02:00
Martin Kroeker	da17abec87	fix trailing whitespace	2020-07-14 18:20:03 +02:00
Martin Kroeker	f8c2697701	Use POWER6 GEMM, TRMM and DTRSM on 32bit POWER8	2020-07-14 18:11:19 +02:00
Martin Kroeker	b144423f0f	Do not define USE_TRMM for 32bit POWER8	2020-07-14 18:10:12 +02:00
Martin Kroeker	ed7e155c35	Merge branch 'develop' into aix	2020-07-07 18:52:06 +02:00
EGuesnet	634e1305f9	Update cgemm_kernel_8x4_power8.S	2020-06-30 15:16:39 +02:00
Martin Kroeker	28d69e0097	Merge pull request #2687 from martin-frbg/utfbom Strip UTF8 byte order marker from source files	2020-06-26 22:53:09 +02:00
Martin Kroeker	c2467c9619	Merge pull request #2686 from RajalakshmiSR/p10_shgemm powerpc: Optimized SHGEMM kernel for POWER10	2020-06-26 22:52:45 +02:00
Martin Kroeker	d199c2787d	Merge pull request #2680 from kavanabhat/aix_makefile_fix Fix for #2671	2020-06-26 11:27:28 +02:00
Martin Kroeker	e30ad0e521	Strip UTF8 byte order marker from source	2020-06-26 09:00:43 +02:00
Rajalakshmi Srinivasaraghavan	d23419accc	powerpc: Optimized SHGEMM kernel for POWER10 This patch introduces new optimized version of SHGEMM kernel using power10 Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1. This patch makes use of new POWER10 compute instructions for matrix multiplication operation. Tested on simulator and there are no new test failures.	2020-06-25 22:19:08 -05:00
Martin Kroeker	c854ef5471	Fix variable names in conditional	2020-06-25 13:29:52 +02:00
Martin Kroeker	c0afc11742	Fix POWERPC builds on AIX (gcc/gfortran 7) 1. macro preprocessing for POWER8 and later kernels only 2. default buffer size used by AIX version of m4 is too small	2020-06-25 13:12:36 +02:00
Gordon Fossum	bb2f52844b	powerpc: Optimized ZGEMM kernel for POWER10 This patch introduces new optimized version of ZGEMM kernel using power10 Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1. This patch makes use of new POWER10 compute instructions for matrix multiplication operation. Tested on simulator and there are no new test failures. Cycles count reduced by 30-50% compared to POWER9 version depending on M/N/K sizes.	2020-06-24 14:50:12 -05:00
Rajalakshmi Srinivasaraghavan	571eadb880	powerpc: Optimized SGEMM/DGEMM/CGEMM for POWER10 This patch introduces new optimized version of SGEMM, CGEMM and DGEMM using power10 Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1. This patch makes use of new POWER10 compute instructions for matrix multiplication operation. Tested on simulator and there are no new test failures. Cycles count reduced by 30-50% compared to POWER9 version depending on M/N/K sizes. MMA GCC patch for reference: https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=8ee2640bfdc62f835ec9740278f948034bc7d9f1	2020-06-24 14:48:15 -05:00
Kavana Bhat	df4ade070f	Fix for #2671	2020-06-24 04:25:47 -05:00
Martin Kroeker	93592d1260	Merge pull request #2675 from wjc404/develop AVX512 DGEMM TCOPY_16 Function	2020-06-23 09:29:02 +02:00
wjc404	086d87a302	AVX512 dgemm tcopy_16 function	2020-06-20 00:07:43 +08:00
Rajalakshmi Srinivasaraghavan	9fe930f205	powerpc: Add support for future processor This is the initial patch to support build infrastructure for POWER10 architecture.	2020-06-11 15:47:20 -05:00
ZhangDanfeng	bc6fd20a40	fix INIT8x4 Signed-off-by: ZhangDanfeng <467688405@qq.com>	2020-06-10 01:01:16 +08:00
Martin Kroeker	89091e6b64	Merge pull request #2645 from martin-frbg/misc_fixes Miscellaneous fixes	2020-06-07 19:44:50 +02:00
Martin Kroeker	c3574ffe53	Merge pull request #2646 from wjc404/develop Optimize AVX512 parallel DGEMM performance	2020-06-07 13:18:22 +02:00
wjc404	0e3ac4a06b	Add files via upload	2020-06-06 14:56:57 +08:00
Martin Kroeker	7f60fb6b91	Delete spurious copy of common_param.h	2020-06-05 10:04:16 +02:00
ZhangDanfeng	9b7877ccf1	sgemm copy source init Signed-off-by: ZhangDanfeng <467688405@qq.com>	2020-06-04 02:10:45 +08:00
ZhangDanfeng	f82fa802d1	Insert prefetch Signed-off-by: ZhangDanfeng <467688405@qq.com>	2020-06-04 02:08:48 +08:00
Martin Kroeker	b1ee81228a	Change complex DOT and ROT to generic kernels and switch CGEMM in response to test failures seen in #2628 and BLAS-Tester	2020-06-03 09:13:29 +02:00
张丹枫	9df79ae9a3	update sgemm and strmm kernel selecting strategy	2020-05-20 22:26:58 +08:00
张丹枫	a1fc6041cd	use general register to speedup	2020-05-20 22:26:58 +08:00
张丹枫	edb423d772	align general register using to strmm_kernel_8x8	2020-05-20 22:26:58 +08:00
zhangdanfeng	0e6eb8c247	sgemm kernel use sgemm_kernel_8x8_cortexa53 Signed-off-by: zhangdanfeng <zhangdanfeng@cloudwalk.cn>	2020-05-20 22:26:58 +08:00
zhangdanfeng	d475db29c6	optimized for cortex-a53 Signed-off-by: zhangdanfeng <zhangdanfeng@cloudwalk.cn>	2020-05-20 22:26:58 +08:00
Marius Hillenbrand	89fe17f20e	s390x: Use new sgemm kernel also for DGEMM and DTRMM on Z14 Apply our new GEMM kernel implementation, written in C with vector intrinsics, also for DGEMM and DTRMM on Z14 and newer (i.e., architectures with FP32 SIMD instructions). As a result, we gain around 10% in performance on z15, in addition to improving maintainability. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-20 10:23:35 +02:00
Marius Hillenbrand	bdd795ed03	s390x/GEMM: replace 0-init with peeled first iteration ... since it gains another ~2% of SGEMM and DGEMM performance on z15; also, the code just called for that cleanup. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-20 10:23:35 +02:00
Marius Hillenbrand	2840432e49	s390x: improvise vector alignment hints for older compilers Introduce inline assembly so that we can employ vector loads with alignment hints on older compilers (pre gcc-9), since these are still used in distributions such as RHEL 8 and Ubuntu 18.04 LTS. Informing the hardware about alignment can speed up vector loads. For that purpose, we can encode hints about 8-byte or 16-byte alignment of the memory operand into the opcodes. gcc-9 and newer automatically emit such hints, where applicable. Add a bit of inline assembly that achieves the same for older compilers. Since an older binutils may not know about the additional operand for the hints, we explicitly encode the opcode in hex. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-14 15:36:03 +02:00
Marius Hillenbrand	1b0b4349a1	s390x/Z14: Change register blocking for SGEMM to 16x4 Change register blocking for SGEMM (and STRMM) on z14 from 8x4 to 16x4 by adjusting SGEMM_DEFAULT_UNROLL_M and choosing the appropriate copy implementations. Actually make KERNEL.Z14 more flexible, so that the change in param.h suffices. As a result, performance for SGEMM improves by around 30% on z15. On z14, FP SIMD instructions can operate on float-sized scalars in vector registers, while z13 could do that for double-sized scalars only. Thus, we can double the amount of elements of C that are held in registers in an SGEMM kernel. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 15:59:51 +02:00
Marius Hillenbrand	71b6eaf459	s390x: Use new sgemm kernel also for strmm on Z14 and newer Employ the newly added GEMM kernel also for STRMM on Z14. The implementation in C with vector intrinsics exploits FP32 SIMD operations and thereby gains performance over the existing assembly code. Extend the implementation for handling triangular matrix multiplication, accordingly. As added benefit, the more flexible C code enables us to adjust register blocking in the subsequent commit. Tested via make -C test / ctest / utest and by a couple of additional unit tests that exercise blocking. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 15:59:51 +02:00
Marius Hillenbrand	43c0d4f312	s390x: Add vectorized sgemm kernel for Z14 and newer Add a new GEMM kernel implementation to exploit the FP32 SIMD operations introduced with z14 and employ it for SGEMM on z14 and newer architectures. The SIMD extensions introduced with z13 support operations on double-sized scalars in vector registers. Thus, the existing SGEMM code would extend floats to doubles before operating on them. z14 extended SIMD support to operations on 32-bit floats. By employing these instructions, we can operate on twice the number of scalars per instruction (four floats in each vector registers) and avoid the conversion operations. The code is written in C with explicit vectorization. In experiments, this kernel improves performance on z14 and z15 by around 2x over the current implementation in assembly. The flexibilty of the C code paves the way for adjustments in subsequent commits. Tested via make -C test / ctest / utest and by a couple of additional unit tests that exercise blocking (e.g., partial register blocks with fewer than UNROLL_M rows and/or fewer than UNROLL_N columns). Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 15:59:51 +02:00
Martin Kroeker	2271c3506b	Work around excessive LAPACK test failures on Skylake-X Something in the plain C parts of x86_64 cscal.c and zscal.c appears to be miscompiled by both gfortran9 and ifort when compiling for skylakex-avx512, even when the optimized Haswell microkernel is not in use.	2020-05-09 23:49:18 +02:00
Rajalakshmi Srinivasaraghavan	bd9ff820bc	Fix cmake compilation issue - POWER9 This patch removes extra space in the sgemmotcopy filename thereby allowing it to create entry in kernel/Makefile created by cmake.	2020-05-08 20:31:56 -05:00
Ashwin Sekhar T K	8353cb245a	ARM64: Improve DAXPY for ThunderX2 Improve performance of DAXPY for ThunderX2 when the vector fits in L1 Cache.	2020-05-07 09:22:50 -07:00
Martin Kroeker	90dba9f716	Duplicate earlier Clang 9.0.0 workaround for corresponding Apple Clang version As discussed on the original PR #2329, the "Apple Clang 11.0.3" that appears to be based the same LLVM release produces the same miscompilation of this file.	2020-05-05 10:44:50 +02:00
Martin Kroeker	5dd14e3d48	Make building the bfloat16 functions conditional on option BUILD_HALF (#2590 ) * make building the bfloat16 BLAS functions conditional on BUILD_HALF * pass the BUILD_HALF option to gensymbol * Pass BUILD_HALF as a compiler define for dynamic_arch builds	2020-05-01 09:58:30 +02:00
Martin Kroeker	06208c8d01	Limit this fix to ELFv2 builds	2020-04-22 14:16:40 +02:00
Martin Kroeker	f5c4c28b98	Work around POWER8BE bugs on FreeBSD (ELFv2) for #2299	2020-04-21 17:17:17 +02:00
Martin Kroeker	fa42588e1f	Merge pull request #2565 from martin-frbg/mips24k Support MIPS32 24K family as P5600	2020-04-20 17:13:53 +02:00
Martin Kroeker	e55ec82bb9	Delete KERNEL.1004K	2020-04-19 15:44:30 +02:00
Martin Kroeker	7353ea5afc	Delete KERNEL.24K	2020-04-19 15:44:19 +02:00
Martin Kroeker	6a04efb122	Rename KERNEL files to include MIPS prefix	2020-04-19 15:43:54 +02:00
Martin Kroeker	d712ea724c	Add MIPS24K support	2020-04-18 21:10:18 +02:00
Rajalakshmi Srinivasaraghavan	22bb50fb81	cmake fixes	2020-04-17 13:35:17 -05:00
Rajalakshmi Srinivasaraghavan	67cc4b9e16	Fix warnings in clang and export symbol	2020-04-15 19:15:23 -05:00
Rajalakshmi Srinivasaraghavan	a87793e03c	Fix DYNAMIC_ARCH compilation errors	2020-04-15 09:09:50 -05:00
Rajalakshmi Srinivasaraghavan	ff010f496e	Build shgemm for all architecture	2020-04-14 20:38:53 -05:00
Rajalakshmi Srinivasaraghavan	7eb55504b1	RFC : Add half precision gemm for bfloat16 in OpenBLAS This patch adds support for bfloat16 data type matrix multiplication kernel. For architectures that don't support bfloat16, it is defined as unsigned short (2 bytes). Default unroll sizes can be changed as per architecture as done for SGEMM and for now 8 and 4 are used for M and N. Size of ncopy/tcopy can be changed as per architecture requirement and for now, size 2 is used. Added shgemm in kernel/power/KERNEL.POWER9 and tested in powerpc64le and powerpc64. For reference, added a small test compare_sgemm_shgemm.c to compare sgemm and shgemm output. This patch does not cover OpenBLAS test, benchmark and lapack tests for shgemm. Complex type implementation can be discussed and added once this is approved.	2020-04-14 14:55:08 -05:00
Martin Kroeker	5b0093b5fe	Convert aligned moves to unaligned should have no performance impact on reasonably modern cpus and fixes occasional crashes in actual user code.	2020-04-13 14:58:52 +02:00
Martin Kroeker	e9bfa2291a	Fix parameter overflow	2020-04-12 19:47:02 +02:00
gxw	8d07cf9b67	Fix compilation problem on loongson platform Using "make TARGET=GENERIC" on loongson platform will get the following error messages: "make[1]: *** No rule to make target 'sgemm_incopy.o', needed by 'libs'" Add kernel/mips64/KERNEL.generic to slove the problem.	2020-04-09 19:28:15 +08:00
Martin Kroeker	806f89166e	Make ARMV7 compile with xcode and add a CI job for it (#2537 ) * Add an ARMV7 iOS build on Travis * thread_local appears to be unavailable on ARMV7 iOS * Add no-thumb option for ARMV7 IOS build to get it to accept DMB ISH * Make local labels in macros of nrm2_vfpv3.S compatible with the xcode assembler	2020-04-02 10:30:37 +02:00
Martin Kroeker	c6af9bbb32	Merge pull request #2534 from martin-frbg/issue2496 Fix zero initialization for beta=0 case	2020-03-31 20:53:13 +02:00
Martin Kroeker	144be81ca1	fix initialization to zero in the NEON SGEMM_BETA kernel as well	2020-03-31 16:53:56 +02:00
Martin Kroeker	07cdd5d05c	Fix zero initialization for beta=0 case use immediate initialization instead of multiplication in case register content is a NaN	2020-03-31 00:21:02 +02:00
Martin Kroeker	567d2760e6	Merge pull request #2520 from wjc404/develop Fix avx512 sgemm performance bug when ldc is a multiple of 1024	2020-03-30 20:15:59 +02:00
wjc404	b8307768e2	Add files via upload	2020-03-21 05:42:10 +08:00
Martin Kroeker	af8a619e1f	Merge pull request #2517 from wjc404/develop Temporary fix for SKX STRSM	2020-03-17 10:12:53 +01:00
wjc404	62b9608986	Update KERNEL.SKYLAKEX	2020-03-17 12:52:55 +08:00
Martin Kroeker	a1b181cea2	Merge pull request #2516 from wjc404/develop AVX2 STRSM kernels	2020-03-16 21:58:34 +01:00
wjc404	cdc0e9011e	Update KERNEL.ZEN	2020-03-16 16:39:37 +00:00
wjc404	fa049d49c2	AVX2 STRSM kernel	2020-03-17 00:34:08 +08:00
s00548429	bec7923a0d	Fix the functional bugs for zamax.	2020-03-09 15:36:50 +08:00
Rajalakshmi Srinivasaraghavan	2afc074803	Fix DYNAMIC_ARCH build for POWER9 Setting DYNAMIC_ARCH=1 on POWER9 does not build POWER9 files due to some compiler version checks. This patch fixes some of the macros that are used to check compiler version. On fixing those checks, there are some new make failures related to icamin, icamax, isamin, isamax and caxpy files on POWER9. This patch fixes those failures as well.	2020-03-03 12:35:10 -06:00
Martin Kroeker	4f371b0fbf	Use POWER8 kernels on big-endian POWER9 for now	2020-03-01 23:45:58 +01:00
Martin Kroeker	ea8eec5d17	Merge pull request #2422 from wjc404/develop Adjust SkylakeX GEMM3M parameters, add an AVX512 STRMM kernel and fix performance bugs in AVX2 s/c/z GEMM	2020-02-29 19:07:35 +01:00
Ali Saidi	c623a965f9	Add Neoverse-N1 core The implementation is a hybird of the ARMV8 one with some of the improved TX2 rountines along with specifying -march=v8.2-a	2020-02-29 03:22:04 +00:00
wjc404	dd22eb7621	Update cgemm_kernel_8x2_haswell.c	2020-02-27 22:26:15 +08:00
wjc404	2352331e60	Update zgemm_kernel_4x2_haswell.c	2020-02-27 22:25:19 +08:00
Xianyi Zhang	265ab484c8	Change default RISC-V 64-bit corename to RISCV64_GENERIC e.g. make CC=riscv64-unknown-linux-gnu-gcc FC=riscv64-unknown-linux-gnu-gfortran TARGET=RISCV64_GENERIC HOSTCC=gcc	2020-02-27 14:46:15 +08:00
Xianyi Zhang	44020a42a4	Fixed compile bug for RV64.	2020-02-27 14:29:42 +08:00
Xianyi Zhang	4aa2d89217	Merge branch 'develop' into risc-v	2020-02-27 13:53:49 +08:00
wjc404	1b980001dd	Update zgemm_kernel_4x2_haswell.c	2020-02-26 18:38:12 +08:00
wjc404	2515e1152f	Update cgemm_kernel_8x2_haswell.c	2020-02-26 18:36:54 +08:00
Martin Kroeker	ddcbed6690	Merge pull request #2437 from martin-frbg/issue2434 [WIP] Add support for Ampere EMAG8180 ARMV8 cpu	2020-02-25 18:42:52 +01:00
wjc404	903854c168	Add files via upload	2020-02-22 23:40:02 +08:00
wjc404	a2ff577a30	Update KERNEL.ZEN	2020-02-22 23:39:43 +08:00
wjc404	97a32cb0a5	Update KERNEL.HASWELL	2020-02-22 23:39:20 +08:00
Martin Kroeker	07454bf4d5	Add proper defaults for IxMIN/IxMAX kernels the fallbacks from Makefile.L1 assume a combined source for absolute value and non-absolute (with ifdef USE_ABS) but here we have separate implementations	2020-02-21 11:58:15 +01:00
Martin Kroeker	4046985913	Add proper defaults for IxMIN/IxMAX kernels the fallbacks from Makefile.L1 assume a combined source for absolute value and non-absolute (with ifdef USE_ABS) but here we have separate implementations	2020-02-21 11:55:52 +01:00
Martin Kroeker	e57b11acca	Add preliminary support for EMAG8180	2020-02-19 19:00:28 +01:00
Martin Kroeker	0b39cf95b0	Fix endianness conditionals	2020-02-19 18:09:54 +01:00
Martin Kroeker	9f39f0a2c3	Specify ismin/ismax assembly kernels for POWER8 directly to fix utest failure in new ismin test - Makefile.L1 defaults look wrong	2020-02-17 19:55:39 +01:00
Martin Liska	aeea14ee40	Come up with LOAD_AND_COMPARE_TO_MXX macro in iamax_sse.S.	2020-02-17 09:01:53 +01:00
Martin Liska	18bcc36a69	Fix implementation of iamax_sse.S as reported in #2116 . The was a typo in iamax_sse.S where one of the comparison was cmpeqps instead of cmpeqss. That misdetected index for sequences where the minimum value was 0.	2020-02-17 09:01:53 +01:00
Martin Liska	0e7f43c898	Add missing USE_MIN in kernel/CMakeLists.txt.	2020-02-17 09:01:53 +01:00
wjc404	f566787e6e	Update KERNEL.SKYLAKEX	2020-02-16 22:58:44 +08:00
wjc404	e3368cbf18	AVX512 STRMM kernel	2020-02-16 22:58:00 +08:00
Martin Kroeker	cafdd999b8	Update caxpy_power8.S	2020-02-13 22:44:09 +01:00
Martin Kroeker	92ca92a46c	Update caxpy_power8.S	2020-02-13 21:24:54 +01:00
Martin Kroeker	486c35c5dc	Update icamin_power8.S	2020-02-13 18:38:43 +01:00
Martin Kroeker	5ba3699f41	Update isamin_power8.S	2020-02-13 00:00:32 +01:00
Martin Kroeker	8eefa530cd	Update isamax_power8.S	2020-02-12 23:59:50 +01:00
Martin Kroeker	de40d47edf	Update isamin_power8.S	2020-02-12 23:57:48 +01:00
Martin Kroeker	7c162b8a21	Update isamax_power8.S	2020-02-12 23:56:57 +01:00
Martin Kroeker	0544cbc806	Fix syntax of endianness conditional	2020-02-12 20:00:29 +01:00
Martin Kroeker	120d20731f	Fix syntax of endianness conditional	2020-02-12 19:58:42 +01:00
Martin Kroeker	dc345d84df	Fix syntax of endianness conditional and add gcc version check for workaround	2020-02-12 19:56:52 +01:00
Bart Oldeman	7ea5e07d1c	Fix inline asm in dscal: mark x, x1 as clobbered. Fixes #2408 The leaq instructions in dscal_kernel_inc_8 modify x and x1 so they must be declared as input/output constraints, otherwise the compiler may assume the corresponding registers are not modified.	2020-02-12 14:11:44 +00:00
Martin Kroeker	7e5cbb6f35	Fix bad conditional syntax that caused spurious application of USE_TRMM	2020-02-10 21:17:39 +01:00
wjc404	3447d04eaf	Update dgemm_kernel_16x2_skylakex.c	2020-02-06 02:14:10 +00:00
wjc404	8b5cdcc64c	Update sgemm_kernel_8x4_haswell.c	2020-02-06 01:47:46 +00:00
wjc404	4e00d96a78	Update dgemm_kernel_16x2_skylakex.c	2020-02-06 01:46:36 +00:00
wjc404	096da2f51a	Update dgemm_kernel_16x2_skylakex.c	2020-02-05 13:36:57 +08:00
wjc404	081b188529	Update KERNEL.SKYLAKEX	2020-02-03 21:38:08 +08:00
wjc404	8019e70211	AVX512 16x2 DGEMM kernel	2020-02-03 21:32:56 +08:00
Qiyu8	ff42e68652	Optimize genenal Gemm Beta	2020-01-20 11:49:42 +08:00
Martin Kroeker	70f45749b9	Merge pull request #2367 from wjc404/develop Improve paralleled SGEMM performance on SKYLAKEX CPUs	2020-01-15 21:13:43 +01:00
wjc404	e5dcdeb550	Update sgemm_direct_skylakex.c	2020-01-13 16:59:23 +08:00
wjc404	952cc2ba38	Update sgemm_kernel_16x4_skylakex_2.c	2020-01-13 16:58:54 +08:00
wjc404	feaafbedd3	make skylakex sgemm code more friendly for readers BTW some kernels were adjusted to improve performance	2020-01-13 16:28:41 +08:00
Martin Kroeker	b36018be6d	Merge pull request #2365 from wjc404/develop Fix SKYLAKEX STRMM issues	2020-01-09 23:23:09 +01:00
wjc404	3a100b2797	Update KERNEL.SKYLAKEX	2020-01-09 13:48:41 +08:00
Martin Kroeker	38742d5547	Merge pull request #2361 from wjc404/develop Optimize AVX2 SGEMM & STRMM	2020-01-08 16:20:28 +01:00
wjc404	bd4c032f52	Update sgemm_kernel_8x4_haswell.c	2020-01-07 11:22:46 +08:00
wjc404	9dc9b7b95e	Update sgemm_kernel_8x4_haswell.c	2020-01-06 20:11:36 +08:00
wjc404	92b10212de	optimize AVX2 SGEMM	2020-01-06 12:11:21 +08:00
wjc404	b73bf01378	optimize AVX2 SGEMM	2020-01-06 12:09:14 +08:00
wjc404	eb3c9f1db9	optimize AVX2 SGEMM	2020-01-06 12:07:02 +08:00
Martin Kroeker	456ee2e1f0	Merge pull request #2357 from chenxuqiang/dgemm_beta_zero kernel/arm64/dgemm_beta.S: add beta == zero branch	2020-01-02 22:28:36 +01:00
shengyang	80db5f11e1	update	2020-01-02 11:01:57 +08:00
chenxuqiang	52de4cc8fd	kernel/arm64/dgemm_beta.S: add beta == zero branch added beta == zero branch, and no need to load C matrix. Signed by: Xuqiang Chen <chenxuqiang3@hisilicon.com>	2020-01-01 21:50:45 -05:00
Martin Kroeker	44028581cc	Merge pull request #2355 from Zeyiii/dev-zeyi2 Use arm neon instructions to optimize sgemm_beta operation	2020-01-01 22:14:16 +01:00
Martin Kroeker	86ab939936	Merge pull request #2354 from ZuoQ3/develop [WIP] Use arm neon instructions to optimize tcopy operation	2020-01-01 22:13:37 +01:00
Martin Kroeker	6c85cb1869	Merge pull request #2352 from wjc404/develop AVX2 ZGEMM3M kernel	2019-12-31 18:08:10 +01:00
Martin Kroeker	995768bbc5	Merge pull request #2351 from Zeyiii/develop prefetching for dgemm_beta	2019-12-31 18:07:37 +01:00
int_13h	96ad579428	add in runtime cpu detection for zarch (#2349 ) add in runtime cpu detection for zarch	2019-12-31 18:03:27 +01:00
shengyang	8d84403205	Use arm neon instructions to optimize ncopy operation modified: KERNEL.ARMV8 modified: KERNEL.TSV110 new file: sgemm_ncopy_4.S	2019-12-31 17:06:35 +08:00
w00421467	0833a4846a	Use arm neon instructions to optimize sgemm_beta operation	2019-12-31 10:42:03 +08:00
zq	50f7fc1401	[WIP] Use arm neon instructions to optimize tcopy operation	2019-12-31 10:21:23 +08:00
w00421467	d1b53806be	Merge remote-tracking branch 'pub/develop' into develop	2019-12-31 10:13:24 +08:00
wjc404	a0f0a802fc	Update zgemm3m_kernel_4x4_haswell.c	2019-12-30 17:33:42 +08:00
wjc404	700fe5b5ee	Add files via upload	2019-12-30 17:18:59 +08:00
wjc404	f60840c420	Update KERNEL.ZEN	2019-12-30 16:04:23 +08:00
wjc404	109e18cd96	Update KERNEL.HASWELL	2019-12-30 16:03:24 +08:00
wjc404	ae1579be13	Create zgemm3m_kernel_4x4_haswell.c	2019-12-30 16:02:51 +08:00
w00421467	3ccf8885ac	prefetching for dgemm_beta	2019-12-30 11:45:49 +08:00
wjc404	cd765f094b	Update cgemm3m_kernel_8x4_haswell.c	2019-12-27 18:23:29 +08:00
wjc404	3a66c8cac1	Update KERNEL.ZEN	2019-12-27 18:04:08 +08:00
wjc404	ed9af2f7da	Update KERNEL.HASWELL	2019-12-27 18:01:38 +08:00
wjc404	5fd1edead9	Create cgemm3m_kernel_8x4_haswell.c	2019-12-27 18:00:55 +08:00
wjc404	eeecd623d8	Update cgemm_kernel_8x2_haswell.c	2019-12-24 00:40:16 +08:00
wjc404	2cd9306bb5	Update KERNEL.ZEN	2019-12-23 23:42:30 +08:00
wjc404	c418c81224	Update KERNEL.HASWELL	2019-12-23 23:41:44 +08:00
wjc404	025741f16a	Fast Haswell CGEMM kernel	2019-12-23 23:40:03 +08:00
wjc404	f41d52665d	Fast Haswell ZGEMM kernel	2019-12-21 14:37:06 +08:00
wjc404	d573d24de7	Fast Haswell ZGEMM kernel	2019-12-21 14:35:15 +08:00
w00421467	b7cc69ee62	declare DGEMM_BETA in KERNEL.ARMV8 rather than the generic KERNEL	2019-12-20 10:11:50 +08:00

... 3 4 5 6 7 ...

1728 Commits