OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Gordon Fossum	213c0e7abb	Added special unrolled vectorized versions of "Solve" for specific sizes, in DTRSM and STRSM, to improve performance in Power9 and Power10.	2020-12-04 17:07:06 -06:00
Martin Kroeker	441c08c9ff	Merge pull request #3016 from xiegengxin/complex-asum Improve the performance of zasum and casum with AVX512 intrinsic	2020-12-04 22:07:16 +01:00
Gengxin Xie	0cb7a403b2	fix error declare function blas_level1_thread_with_return_value	2020-12-02 09:51:52 +08:00
Gengxin Xie	b766c1e9bb	Improve the performance of zasum and casum with AVX512 intrinsic	2020-12-01 16:49:26 +08:00
Rajalakshmi Srinivasaraghavan	7d46e31de1	POWER10: Optimize dgemv_n Handling as 4x8 with vector pairs gives better performance than existing code in POWER10.	2020-11-29 15:28:28 -06:00
Martin Kroeker	f1bf040b25	Merge pull request #2988 from xiegengxin/smp-asum Improve the performance of dasum and sasum when SMP is defined	2020-11-22 12:24:13 +01:00
Xianyi Zhang	7037849498	Merge branch 'develop' into risc-v	2020-11-22 16:04:50 +08:00
Martin Kroeker	7e9cb39a25	Merge pull request #2981 from Qiyu8/fix-sum Fix sum optimize issues	2020-11-16 08:40:46 +01:00
Gengxin Xie	d6e7e05bb3	Improve the performance of dasum and sasum when SMP is defined	2020-11-13 14:20:52 +08:00
Qiyu8	ae0b1dea19	modify system.cmake to enable fma flag	2020-11-13 10:20:24 +08:00
Qiyu8	e0dac6b53b	fix the CI failure of target specific option mismatch	2020-11-12 20:31:03 +08:00
Qiyu8	e5c2ceb675	fix the CI failure of lack the head	2020-11-12 17:35:17 +08:00
Qiyu8	a87e537b8c	modify macro	2020-11-11 15:53:48 +08:00
Qiyu8	5bc0a7583f	only FMA3 and vector larger than 128 have positive effects.	2020-11-11 15:18:01 +08:00
Qiyu8	8c0b206d4c	Optimize the performance of rot by using universal intrinsics	2020-11-11 14:33:12 +08:00
Qiyu8	c4c591ac5a	fix sum optimize issues	2020-11-10 16:16:38 +08:00
Xianyi Zhang	fc35b72ae1	Refs #2899 Merge branch 'openblas-open-910' of git://github.com/damonyu1989/OpenBLAS into damonyu1989-openblas-open-910	2020-11-10 09:38:04 +08:00
Xianyi Zhang	913cc9a4ca	Merge branch 'develop' into risc-v	2020-11-10 09:18:25 +08:00
Martin Kroeker	ff16329cb7	Merge pull request #2972 from xiegengxin/rot-intrinsic Improve the performance of rot by using AVX512 and AVX2 intrinsic	2020-11-08 22:43:00 +01:00
Martin Kroeker	110c7a6de0	Merge pull request #2979 from RajalakshmiSR/dot_power10 Optimize sdot/ddot for POWER10	2020-11-08 10:19:34 +01:00
Rajalakshmi Srinivasaraghavan	6e364981a8	Optimize sdot/ddot for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2020-11-07 15:21:58 -06:00
Martin Kroeker	b976a0bf40	Remove previous workaround for compiler flags related to cpu capabilities in x86_64 DYNAMIC_ARCH builds	2020-11-07 20:39:56 +01:00
Martin Kroeker	ff74319ea5	Merge pull request #2977 from martin-frbg/issue2976 Fix macro name used in ifdef for POWERPC/PGI	2020-11-07 14:41:34 +01:00
Martin Kroeker	28d2dfe2b3	Fix macro name used in ifdef	2020-11-07 12:17:49 +01:00
Gengxin Xie	725ffbf041	fix typo	2020-11-05 16:25:17 +08:00
Gengxin Xie	d9ba49165a	Improve the performance of rot by using AVX512 and AVX2 intrinsic	2020-11-05 15:12:36 +08:00
Rajalakshmi Srinivasaraghavan	dd7a9cc5bf	POWER10: Change dgemm unroll factors Changing the unroll factors for dgemm to 8 shows improved performance with POWER10 MMA feature. Also made some minor changes in sgemm for edge cases.	2020-10-31 18:28:57 -05:00
Rajalakshmi Srinivasaraghavan	b435491885	Optimize caxpy for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2020-10-29 14:57:51 -05:00
Chen, Guobing	a7b1f9b1bb	Implementation of BF16 based gemv 1. Add a new API -- sbgemv to support bfloat16 based gemv 2. Implement a generic kernel for sbgemv 3. Implement an avx512-bf16 based kernel for sbgemv Signed-off-by: Chen, Guobing <guobing.chen@intel.com>	2020-10-29 02:08:23 +08:00
Martin Kroeker	67f39ad813	Merge pull request #2939 from thrasibule/Makefile_cleanup reuse variables defined in Makefile.system	2020-10-28 09:38:40 +01:00
Rajalakshmi Srinivasaraghavan	c24ba8b1dd	Optimize saxpy for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	2020-10-26 13:24:59 -05:00
Martin Kroeker	6f9460f0f6	Merge pull request #2937 from martin-frbg/pwr-buffersz Increase and unify BUFFERSIZE on POWER;fix gcc inline warning	2020-10-23 07:15:32 +02:00
Guillaume Horel	1917a4e7b8	reuse variables defined in Makefile.system	2020-10-22 22:04:25 -04:00
Martin Kroeker	34c3c407ef	label always_inline function as inline to silence a gcc warning	2020-10-22 22:14:26 +02:00
Martin Kroeker	2e48d560ba	Fix compiler version check	2020-10-22 16:23:29 +02:00
Rajalakshmi Srinivasaraghavan	ad745c0bae	Optimize scopy/ccopy for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores. Also reorganized all variants of copy functions to make use of same kernel.	2020-10-21 09:53:45 -05:00
İsmail Dönmez	4a1d00f589	Fix build with -Werror=return-type dgemm_tcopy_16_skylakex.c CNAME function should return an int, add a return 0 similar to other files.	2020-10-21 08:43:39 +02:00
Bart Oldeman	b073d759d0	x86_64: clobber all xmm registers after vzeroupper As observed using GCC 10 using -march=native -ftree-vectorize on Knights Landing, it is now smart enough to find clobbers inside non-inlined static functions. In particular, sgemv counted on a kernel to preserve the whole %ymm2 register (since it was not in the clobber list), but the top part was destroyed by vzeroupper. This caused many tests to fail. This patch makes sure all xmm (and ymm/zmm by extension) registers are listed as clobbered to avoid this happening, as most kernels already did correctly in fact.	2020-10-20 02:16:47 +00:00
Martin Kroeker	dc6e44c3f8	Merge pull request #2916 from martin-frbg/issue2911 Clean up duplicate definitions in POWER8 kernels and fix power10 option passing	2020-10-19 23:33:31 +02:00
Martin Kroeker	a61c086408	Fix spurious trailing whitespace in comment	2020-10-19 09:12:12 +02:00
Bart Oldeman	03e781b766	sgemm_direct_skylakex: fix `75eeb26` regression. The `#if defined(SKYLAKEX) || defined (COOPERLAKE)` from that commit was before #include "common.h" so caused the compiled function to be empty, returning garbage results for qualifying sgemm's on those architectures. Closes #2914	2020-10-18 19:58:07 +00:00
Martin Kroeker	f1a4071d8c	Clean up STACKSIZE redefinition	2020-10-18 19:41:43 +02:00
Martin Kroeker	97cf10062f	Clean up STACKSIZE redefinition	2020-10-18 19:39:18 +02:00
Martin Kroeker	17e288e18d	Clean up STACKSIZE redefinition	2020-10-18 19:37:04 +02:00
Martin Kroeker	c1422f3e46	Clean up STACKSIZE redefinition	2020-10-18 19:31:01 +02:00
Martin Kroeker	d85b24e103	Clean up STACKSIZE redefinition	2020-10-18 19:29:45 +02:00
Zhang Xianyi	d7ba7679b6	Merge branch 'develop' into risc-v	2020-10-16 23:27:38 +08:00
Martin Kroeker	df70667043	fix core list for sse/sse2	2020-10-16 09:55:48 +02:00
Martin Kroeker	f071d1207a	add sse2	2020-10-15 22:10:32 +02:00
Martin Kroeker	dc6cefd2f5	Expressly enable -msse for 32bit DYNAMIC_ARCH kernels	2020-10-15 20:16:15 +02:00
Martin Kroeker	c339c40c01	Silence a redefinition warning	2020-10-15 19:08:12 +02:00
Martin Kroeker	10379fc83b	Use ifdef instead of if	2020-10-15 19:05:37 +02:00
Martin Kroeker	4c25910da0	Merge pull request #2896 from martin-frbg/intrin-double Add compiler flag for SSE4 where available	2020-10-15 11:12:35 +02:00
damonyu	ef8e7d0279	Add the support for RISC-V Vector. Change-Id: Iae7800a32f5af3903c330882cdf6f292d885f266	2020-10-15 16:09:02 +08:00
Martin Kroeker	ae6ac83991	Revert "add double precision SSE"	2020-10-15 08:37:02 +02:00
Qiyu8	4fac91ef37	adapt arm platform	2020-10-15 11:08:10 +08:00
Qiyu8	bfdf4b56da	Add double precision universal intrinsics for X86/ARM	2020-10-15 10:29:42 +08:00
Martin Kroeker	ebf0470fc2	add sse4.1 for DYNAMIC_ARCH kernels	2020-10-14 20:34:33 +02:00
Martin Kroeker	c9c3ae07af	Add double precision operations	2020-10-14 18:10:45 +02:00
Martin Kroeker	756802df61	Merge pull request #2890 from martin-frbg/s-d-sum Revert special handling of Windows xNRM2 and enable C+intrinsics kern…	2020-10-14 09:02:03 +02:00
Rajalakshmi Srinivasaraghavan	0826d68f93	POWER10: Change the packing format for bfloat16 As the new MMA instructions need the inputs in 4x2 order for bfloat16, changing the format in copy/packing code. This avoids permute instructions in the gemm kernel inner loop.	2020-10-13 16:05:10 -05:00
Rajalakshmi Srinivasaraghavan	b5d30b390d	Fix build issues with bfloat16 This patch fixes compilation errors due to recent renaming from SH to SB with BUILD_BFLOAT16.	2020-10-13 11:00:22 -05:00
Martin Kroeker	fecedc9c69	Add -mssse3	2020-10-13 11:55:41 +02:00
Martin Kroeker	0eacbca85f	Add Haswell and Zen to temporary sse3 whitelist	2020-10-13 11:42:39 +02:00
Martin Kroeker	6999086a2b	whitelist SANDYBRIDGE for SSE3	2020-10-13 10:32:19 +02:00
Martin Kroeker	8d2df7d066	Revert special handling of Windows xNRM2 and enable C+intrinsics kernel for SSUM/DSUM	2020-10-13 00:14:29 +02:00
Martin Kroeker	08929430cd	Merge pull request #2886 from martin-frbg/issue_2767 Rename "HALF" precision functions (sh prefix) to "BFLOAT16" with "sb" prefix	2020-10-13 00:04:35 +02:00
Martin Kroeker	0c84ffe05f	Merge pull request #2881 from mattip/fninit add fninit to reset fpu registers before assembler routines	2020-10-12 23:50:41 +02:00
Matti Picus	403eb513a0	use emms instead, add WIN guards	2020-10-12 18:15:01 +03:00
Qiyu8	0ed1f07660	Optimize the performance of sum by using universal intrinsics	2020-10-12 19:48:53 +08:00
Martin Kroeker	3aecafad80	Change "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-12 00:00:55 +02:00
Martin Kroeker	756062afa5	Rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:56:17 +02:00
Martin Kroeker	2061f7fdff	Rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:54:53 +02:00
Martin Kroeker	dc8a1afa63	Rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:53:50 +02:00
Martin Kroeker	fd94236042	Rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:42:07 +02:00
Martin Kroeker	68ce719fac	Rename shdot_microk_cooperlake.c to sbdot_microk_cooperlake.c	2020-10-11 23:41:13 +02:00
Martin Kroeker	d7dd9b396c	Rename shdot.c to sbdot.c	2020-10-11 23:40:43 +02:00
Martin Kroeker	9ae80490e0	rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:39:42 +02:00
Martin Kroeker	d314d1f49f	Rename shgemm_kernel_power10.c to sbgemm_kernel_power10.c	2020-10-11 23:37:38 +02:00
Martin Kroeker	c589c3e2a1	Merge pull request #2882 from martin-frbg/issue2709 Use generic C for (D/Z)NRM2 on Windows x86_64	2020-10-11 22:22:30 +02:00
Martin Kroeker	ec638a82bf	Merge pull request #2852 from martin-frbg/issue2588-cmake Support building only a subset of variable types	2020-10-11 22:21:33 +02:00
Martin Kroeker	6b6adf8a4a	Allow compiling only a subset of kernels for specific variable types	2020-10-11 14:52:09 +02:00
Martin Kroeker	ac653c94f3	Merge branch 'develop' into issue2588-cmake	2020-10-11 13:57:07 +02:00
Martin Kroeker	7a53128481	Add whitelist of DYNAMIC_ARCH kernels for which -msse3 needs to be enabled	2020-10-11 01:06:46 +02:00
Martin Kroeker	e1b7123bbe	Merge pull request #2867 from Qiyu8/usimd-floatdot Optimize the performance of dot by using universal intrinsics in X86/ARM	2020-10-10 12:10:25 +02:00
Qiyu8	f32d34a015	add sse3 compiler flag	2020-10-10 10:36:15 +08:00
Martin Kroeker	7812486091	Use generic C for D/Z nrm2 kernels on Windows to work around fpu exception bug	2020-10-06 21:33:16 +02:00
Matti Picus	a5b164946c	add fninit to reset fpu registers before assembler routines	2020-10-05 22:13:25 +03:00
User User-User	d2333e7842	aarch64 fix std=c18 compilation	2020-10-03 18:00:34 +03:00
Qiyu8	60e6c68e38	Adapt ARM architect	2020-09-29 16:36:14 +08:00
Qiyu8	1b1a757f5f	Optimize the performance of dot by using universal intrinsics in X86/ARM	2020-09-28 20:36:53 +08:00
Rajalakshmi Srinivasaraghavan	2df4235e00	Optimize dcopy/zcopy for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores. Tested in simulator and no new failures.	2020-09-27 21:42:32 -05:00
Martin Kroeker	dfbc62ef7e	Support building only a subset of types	2020-09-22 23:25:59 +02:00
Qiyu8	14f7dad3b7	performance improved	2020-09-22 16:52:15 +08:00
Qiyu8	325b539c26	Optimize the performance of daxpy by using universal intrinsics	2020-09-22 10:38:35 +08:00
Marius Hillenbrand	22aa81f3e5	s390x: fix cscal and zscal implementations The implementation of complex scalar * vector multiplication for Z14 makes some LAPACK tests fail because the numerical differences to the reference implementation exceed the threshold (as can be seen by running make lapack-test and replacing kernel/zarch/cscal.c with a generic implementation for comparison). The complex multiplication uses terms of the form a * b + c * d for both real and imaginary parts. The assembly code (and compiler-emitted code as well) uses fused multiply add operations for the second product and sum. The results can be "surprising", for example when both terms in the imaginary part nearly cancel each other out. In that case, the second product contributes more digits to the sum than the first product that has been rounded before. One option is to use separate multiplications (which then round the same way) and a distinct add. Change the code to pursue that path, by (1) requesting the compiler not to contract the operations into FMAs and (2) replacing the assembly kernel with corresponding vectorized C code (where change 1 also applies). Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-21 13:10:05 +02:00
Marius Hillenbrand	f91057cbad	s390x: move common vector definitions and utils into header ... to facilitate reuse beyond gemm_vec.c and avoid code duplication. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-21 11:32:08 +02:00
Rajalakshmi Srinivasaraghavan	be43d2cb96	Optimize daxpy/zaxpy for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores. Tested in simulator and no new failures.	2020-09-17 12:56:28 -05:00
Martin Kroeker	91c84e1c01	Merge pull request #2796 from Guobing-Chen/BF16_dot_coversion_apis Add bfloat16 based dot and conversion with single/double	2020-09-14 15:00:19 +02:00
Martin Kroeker	e72430fe46	Merge pull request #2803 from xiegengxin/AVX2-asum Implementation of dasum, sasum with AVX2 & AVX512 intrinsic	2020-09-06 18:32:15 +02:00
Chen, Guobing	deaeb6c5b8	Add bfloat16 based dot and conversion with single/double 1. Added bfloat16 based dot as new API: shdot 2. Implemented generic kernel and cooperlake-specific (AVX512-BF16) kernel for shdot 3. Added 4 conversion APIs for bfloat16 data type <=> single/double: shstobf16 shdtobf16 sbf16tos dbf16tod shstobf16 -- convert single float array to bfloat16 array shdtobf16 -- convert double float array to bfloat16 array sbf16tos -- convert bfloat16 array to single float array dbf16tod -- convert bfloat16 array to double float array 4. Implemented generic kernels for all 4 conversion APIs, and cooperlake-specific kernel for shstobf16 and shdtobf16 5. Update level1 thread facilitate functions and macros to support multi-threading for these new APIs 6. Fix Cooperlake platform detection/specify issue when under dynamic-arch building 7. Change the typedef of bfloat16 from unsigned short to more strict uint16_t Signed-off-by: Chen, Guobing <guobing.chen@intel.com>	2020-09-04 02:31:25 +08:00
Martin Kroeker	775a87242d	Rename KERNEL.SILICON to KERNEL.VORTEX	2020-09-03 08:44:20 +02:00
Gengxin Xie	1b0f17eeed	align to 64, using SSE when input size is small	2020-09-03 14:25:54 +08:00
Martin Kroeker	80794fe8fd	Create KERNEL.SILICON	2020-09-02 22:56:58 +02:00
Marius Hillenbrand	2ee5b899ce	s390x: enable S/DGEMM block with explicit loop unrolling + interleaving with clang The code for SGEMM 16x4 and DGEMM 8x4 blocks on z14 and z15 uses explicit unrolling and interleaving to improve performance. The code employs an empty inline asm statement with operands that constrain the compiler's instruction scheduling and thereby enforce proper overlapping of load and compute phases. Fix an ifdef to apply that for clang builds, as well. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-02 13:49:31 +02:00
Marius Hillenbrand	87e5bbd887	s390x: avoid variable-length arrays in struct for asm operands ... since it is not required and clang does not support that gcc extension. Instead, use a variable-length array directly for these operands. Note that, while the actual inline assembly code does not directly use these memory operands, they serve to inform the compiler that it cannot reorder reads or writes to/from the input and output data across the inline asm statements. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-02 13:49:31 +02:00
Marius Hillenbrand	b9b3265ec8	s390x: avoid inline assembly for vector loads for clang ... since clang does not support the instruction format for inline assembly and also it is not required for current versions of clang. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-02 13:49:30 +02:00
Marius Hillenbrand	a1616a0b86	s390x: replace nop with "nop 0" in inline assembly ... as a bandaid for building with clang until LLVM's internal assembler supports nops without operand. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-02 13:49:30 +02:00
Marius Hillenbrand	60ef193258	s390x: use "lghi" for immediate values to fix build with clang Some of the kernels written in assembly utilize a "load address" instruction for loading an immediate value into a register. That is both unnecessarily complex and LLVM's assembler does not understand that specific syntax. Thus, replace with the appropriate "load immediate" instruction, which is also clearer to read. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-02 13:49:30 +02:00
Gengxin Xie	448152cdd8	define __AVX2__ to ensure the haswell code compiled with avx2	2020-08-31 14:39:08 +08:00
Gengxin Xie	cb3c190a3a	Implementation of dasum, sasum with AVX2 & AVX512 intrinsic	2020-08-31 11:44:08 +08:00
Rajalakshmi Srinivasaraghavan	317ff27cda	POWER10: Avoid setting accumulators to zero in gemm kernels For the first iteration, it is better to use xvfger instead of xvfgerpp builtins which helps to avoid setting accumulators to zero. This helps to reduce few instructions.	2020-08-28 10:42:54 -05:00
Martin Kroeker	b2053239fc	Fix missing dummy parameter (imag part of alpha) of zdot_thread_function	2020-08-23 15:08:16 +02:00
Martin Kroeker	9ee21a0a39	Merge pull request #2780 from Guobing-Chen/CPL_build_support Enable COOPERLAKE build target	2020-08-20 19:54:29 +02:00
Martin Kroeker	6f4dc7445d	Fix typo	2020-08-19 16:36:55 +02:00
Martin Kroeker	81fbe8d088	-march=cooperlake only available in gcc >= 10	2020-08-19 16:10:15 +02:00
Martin Kroeker	75eeb265d7	[WIP] Refactor the driver code for direct SGEMM (#2782) Move "direct SGEMM" functionality out of the SkylakeX SGEMM kernel and make it available (on x86_64 targets only for now) in DYNAMIC_ARCH builds * Add sgemm_direct targets in the kernel Makefile.L3 and CMakeLists.txt * Add direct_sgemm functions to the gotoblas struct in common_param.h * Move sgemm_direct_performant helper to separate file * Update gemm.c to macros for sgemm_direct to support dynamic_arch naming via common_s.h * (Conditionally) add sgemm_direct functions in setparam-ref.c	2020-08-19 14:51:09 +02:00
Chen, Guobing	e740c4873d	Enable COOPERLAKE build target Enable new build target platform -- COOPERLAKE. This target platform supports all the SKYLAKEX supported ISAs + avx512bf16. So all the SKYLAKEX specific kernels/drivers and related code are now extended to be also active on COOPERLAKE. Besides, new BF16 related kernels are active under this target.	2020-08-13 06:18:00 +08:00
Martin Kroeker	cbbe38bb88	Merge pull request #2772 from mhillenibm/s390x_gemm_tuning s390x: GEMM tuning for z14	2020-08-11 18:14:09 +02:00
Marius Hillenbrand	07c334e7be	s390x: Factor out small block sizes for SGEMM/DGEMM on z14 For small register blockings that are too small to fill up vector registers with column vectors, we currently use a generic code block. Replace that with instantiations of the generic code as individual functions, so that the compiler can optimize each one separately. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-08-11 12:56:39 +02:00
Marius Hillenbrand	e2828e30aa	s390x: Optimize SGEMM/DGEMM blocks for z14 with explicit loop unrolling/interleaving Improve performance of SGEMM and DGEMM on z14 and z15 by unrolling and interleaving the inner loop of the SGEMM 16x4 and DGEMM 8x4 blocks. Specifically, we explicitly interleave vector register loads and computation of two iterations. Note that this change only adds one C function, since SGEMM 16x4 and DGEMM 8x4 actually map to the same C code: they both hold intermediate results in a 4x4 grid of vector registers, and the C implementation is built around that. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-08-11 12:55:42 +02:00
Rajalakshmi Srinivasaraghavan	475b5c95b9	Remove extra symbol in Makefile While trying out different unroll values, noted that make failed due to this extra symbol.	2020-08-07 15:27:44 -05:00
Martin Kroeker	81dcfdcf39	Multiply by 2 instead of left-shifting a potentially negative number fixes GCC ubsan warning in the BLAS tests	2020-08-02 18:29:56 +02:00
Martin Kroeker	0ef4b3f1f2	Multiply instead of doing a left shift of a potentially negative number fixes GCC ubsan report in the BLAS tests	2020-08-02 18:27:40 +02:00
Martin Kroeker	aa53a8a5cb	Multiply by two instead of left-shifting one place fixes GCC ubsan report of "left shift of negative value -2" in the BLAS tests	2020-08-02 18:25:09 +02:00
Martin Kroeker	aa3a1e7d8c	Multiply by two rather than left shift by one place fixes GCC ubsan report of "left shift of negative value -2" in the BLAS tests	2020-08-02 18:22:31 +02:00
Rajalakshmi Srinivasaraghavan	f77b6a83f4	dgemv optimization for POWER10 Making use of new vector pair POWER10 instructions in dgemv_n and dgemv_t. Also adding a new block 4x128 to make use of Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1. Tested on simulator and there are no new test failures.	2020-07-29 18:59:32 -05:00
Rajalakshmi Srinivasaraghavan	d557584b71	Fix compilation issues with clang on POWER As gcc defaults to -malign-power, removing that option. Also adding -fno-integrated-as to use GNU assembler for powerpc assembly optimization files. Fixed other compilation errors reported in dgemv_t.c file.	2020-07-27 14:11:07 -05:00
Ashwin Sekhar T K	4e1be0e481	ARM64: Add THUNDERX3T110 Target	2020-07-26 23:32:24 -07:00
Rajalakshmi Srinivasaraghavan	9be2688c78	Fix to store results in correct order for POWER10 GEMM kernels There is a recent compiler change in __builtin_mma_disassemble_acc() which affects the order of storing result in POWER10. Also removing new LDFLAG -mno-power10-stub as it is handled by linker automatically.	2020-07-24 23:08:11 -05:00
Martin Kroeker	6a2a60038c	Merge pull request #2720 from martin-frbg/issue2694 WIP Further fixes for 32bit POWER8	2020-07-24 23:19:45 +02:00
Martin Kroeker	251a09ec90	Typo fix	2020-07-24 16:04:58 +00:00
Martin Kroeker	95d37e1575	Regroup the 32 and 64bit sections and restore 64bit CAXPY	2020-07-24 10:13:46 +00:00
Martin Kroeker	3523bb778e	Merge pull request #2721 from martin-frbg/p8align Fix alignment errors in the power8 saxpy kernel	2020-07-24 11:06:20 +02:00
Martin Kroeker	bf1f0734ff	Use OPENBLAS_MAKE_COMPLEX_FLOAT on PPC only	2020-07-23 20:40:13 +00:00
Martin Kroeker	ca3561cab9	Add ifdefs around call to altivec microkernel	2020-07-23 18:30:42 +00:00
Martin Kroeker	21072e502a	Typo fix	2020-07-23 17:34:56 +00:00
Martin Kroeker	7c6e56b5df	Rewrite assignment to complex for better portability	2020-07-23 17:10:59 +02:00
Martin Kroeker	661c6bfa5a	Exclude altivec code paths if the compiler does not support them	2020-07-23 17:08:20 +02:00
Martin Kroeker	0033f8be0d	Use vec_vsx_ld/st to fix misaligned accesses flagged by asan	2020-07-16 23:32:54 +02:00
Martin Kroeker	f308e741b2	remove debug output and revert changes to cdot and crot	2020-07-15 10:00:07 +02:00
Martin Kroeker	da17abec87	fix trailing whitespace	2020-07-14 18:20:03 +02:00
Martin Kroeker	f8c2697701	Use POWER6 GEMM, TRMM and DTRSM on 32bit POWER8	2020-07-14 18:11:19 +02:00
Martin Kroeker	b144423f0f	Do not define USE_TRMM for 32bit POWER8	2020-07-14 18:10:12 +02:00
Martin Kroeker	ed7e155c35	Merge branch 'develop' into aix	2020-07-07 18:52:06 +02:00
EGuesnet	634e1305f9	Update cgemm_kernel_8x4_power8.S	2020-06-30 15:16:39 +02:00
Martin Kroeker	28d69e0097	Merge pull request #2687 from martin-frbg/utfbom Strip UTF8 byte order marker from source files	2020-06-26 22:53:09 +02:00
Martin Kroeker	c2467c9619	Merge pull request #2686 from RajalakshmiSR/p10_shgemm powerpc: Optimized SHGEMM kernel for POWER10	2020-06-26 22:52:45 +02:00
Martin Kroeker	d199c2787d	Merge pull request #2680 from kavanabhat/aix_makefile_fix Fix for #2671	2020-06-26 11:27:28 +02:00
Martin Kroeker	e30ad0e521	Strip UTF8 byte order marker from source	2020-06-26 09:00:43 +02:00
Rajalakshmi Srinivasaraghavan	d23419accc	powerpc: Optimized SHGEMM kernel for POWER10 This patch introduces new optimized version of SHGEMM kernel using power10 Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1. This patch makes use of new POWER10 compute instructions for matrix multiplication operation. Tested on simulator and there are no new test failures.	2020-06-25 22:19:08 -05:00
Martin Kroeker	c854ef5471	Fix variable names in conditional	2020-06-25 13:29:52 +02:00
Martin Kroeker	c0afc11742	Fix POWERPC builds on AIX (gcc/gfortran 7) 1. macro preprocessing for POWER8 and later kernels only 2. default buffer size used by AIX version of m4 is too small	2020-06-25 13:12:36 +02:00
Gordon Fossum	bb2f52844b	powerpc: Optimized ZGEMM kernel for POWER10 This patch introduces new optimized version of ZGEMM kernel using power10 Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1. This patch makes use of new POWER10 compute instructions for matrix multiplication operation. Tested on simulator and there are no new test failures. Cycles count reduced by 30-50% compared to POWER9 version depending on M/N/K sizes.	2020-06-24 14:50:12 -05:00
Rajalakshmi Srinivasaraghavan	571eadb880	powerpc: Optimized SGEMM/DGEMM/CGEMM for POWER10 This patch introduces new optimized version of SGEMM, CGEMM and DGEMM using power10 Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1. This patch makes use of new POWER10 compute instructions for matrix multiplication operation. Tested on simulator and there are no new test failures. Cycles count reduced by 30-50% compared to POWER9 version depending on M/N/K sizes. MMA GCC patch for reference: https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=8ee2640bfdc62f835ec9740278f948034bc7d9f1	2020-06-24 14:48:15 -05:00
Kavana Bhat	df4ade070f	Fix for #2671	2020-06-24 04:25:47 -05:00
Martin Kroeker	93592d1260	Merge pull request #2675 from wjc404/develop AVX512 DGEMM TCOPY_16 Function	2020-06-23 09:29:02 +02:00
wjc404	086d87a302	AVX512 dgemm tcopy_16 function	2020-06-20 00:07:43 +08:00
Rajalakshmi Srinivasaraghavan	9fe930f205	powerpc: Add support for future processor This is the initial patch to support build infrastructure for POWER10 architecture.	2020-06-11 15:47:20 -05:00
ZhangDanfeng	bc6fd20a40	fix INIT8x4 Signed-off-by: ZhangDanfeng <467688405@qq.com>	2020-06-10 01:01:16 +08:00
Martin Kroeker	89091e6b64	Merge pull request #2645 from martin-frbg/misc_fixes Miscellaneous fixes	2020-06-07 19:44:50 +02:00
Martin Kroeker	c3574ffe53	Merge pull request #2646 from wjc404/develop Optimize AVX512 parallel DGEMM performance	2020-06-07 13:18:22 +02:00
wjc404	0e3ac4a06b	Add files via upload	2020-06-06 14:56:57 +08:00
Martin Kroeker	7f60fb6b91	Delete spurious copy of common_param.h	2020-06-05 10:04:16 +02:00
ZhangDanfeng	9b7877ccf1	sgemm copy source init Signed-off-by: ZhangDanfeng <467688405@qq.com>	2020-06-04 02:10:45 +08:00
ZhangDanfeng	f82fa802d1	Insert prefetch Signed-off-by: ZhangDanfeng <467688405@qq.com>	2020-06-04 02:08:48 +08:00
Martin Kroeker	b1ee81228a	Change complex DOT and ROT to generic kernels and switch CGEMM in response to test failures seen in #2628 and BLAS-Tester	2020-06-03 09:13:29 +02:00
张丹枫	9df79ae9a3	update sgemm and strmm kernel selecting strategy	2020-05-20 22:26:58 +08:00
张丹枫	a1fc6041cd	use general register to speedup	2020-05-20 22:26:58 +08:00
张丹枫	edb423d772	align general register using to strmm_kernel_8x8	2020-05-20 22:26:58 +08:00
zhangdanfeng	0e6eb8c247	sgemm kernel use sgemm_kernel_8x8_cortexa53 Signed-off-by: zhangdanfeng <zhangdanfeng@cloudwalk.cn>	2020-05-20 22:26:58 +08:00
zhangdanfeng	d475db29c6	optimized for cortex-a53 Signed-off-by: zhangdanfeng <zhangdanfeng@cloudwalk.cn>	2020-05-20 22:26:58 +08:00
Marius Hillenbrand	89fe17f20e	s390x: Use new sgemm kernel also for DGEMM and DTRMM on Z14 Apply our new GEMM kernel implementation, written in C with vector intrinsics, also for DGEMM and DTRMM on Z14 and newer (i.e., architectures with FP32 SIMD instructions). As a result, we gain around 10% in performance on z15, in addition to improving maintainability. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-20 10:23:35 +02:00
Marius Hillenbrand	bdd795ed03	s390x/GEMM: replace 0-init with peeled first iteration ... since it gains another ~2% of SGEMM and DGEMM performance on z15; also, the code just called for that cleanup. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-20 10:23:35 +02:00
Marius Hillenbrand	2840432e49	s390x: improvise vector alignment hints for older compilers Introduce inline assembly so that we can employ vector loads with alignment hints on older compilers (pre gcc-9), since these are still used in distributions such as RHEL 8 and Ubuntu 18.04 LTS. Informing the hardware about alignment can speed up vector loads. For that purpose, we can encode hints about 8-byte or 16-byte alignment of the memory operand into the opcodes. gcc-9 and newer automatically emit such hints, where applicable. Add a bit of inline assembly that achieves the same for older compilers. Since an older binutils may not know about the additional operand for the hints, we explicitly encode the opcode in hex. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-14 15:36:03 +02:00
Marius Hillenbrand	1b0b4349a1	s390x/Z14: Change register blocking for SGEMM to 16x4 Change register blocking for SGEMM (and STRMM) on z14 from 8x4 to 16x4 by adjusting SGEMM_DEFAULT_UNROLL_M and choosing the appropriate copy implementations. Actually make KERNEL.Z14 more flexible, so that the change in param.h suffices. As a result, performance for SGEMM improves by around 30% on z15. On z14, FP SIMD instructions can operate on float-sized scalars in vector registers, while z13 could do that for double-sized scalars only. Thus, we can double the amount of elements of C that are held in registers in an SGEMM kernel. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 15:59:51 +02:00
Marius Hillenbrand	71b6eaf459	s390x: Use new sgemm kernel also for strmm on Z14 and newer Employ the newly added GEMM kernel also for STRMM on Z14. The implementation in C with vector intrinsics exploits FP32 SIMD operations and thereby gains performance over the existing assembly code. Extend the implementation for handling triangular matrix multiplication, accordingly. As added benefit, the more flexible C code enables us to adjust register blocking in the subsequent commit. Tested via make -C test / ctest / utest and by a couple of additional unit tests that exercise blocking. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 15:59:51 +02:00
Marius Hillenbrand	43c0d4f312	s390x: Add vectorized sgemm kernel for Z14 and newer Add a new GEMM kernel implementation to exploit the FP32 SIMD operations introduced with z14 and employ it for SGEMM on z14 and newer architectures. The SIMD extensions introduced with z13 support operations on double-sized scalars in vector registers. Thus, the existing SGEMM code would extend floats to doubles before operating on them. z14 extended SIMD support to operations on 32-bit floats. By employing these instructions, we can operate on twice the number of scalars per instruction (four floats in each vector register) and avoid the conversion operations. The code is written in C with explicit vectorization. In experiments, this kernel improves performance on z14 and z15 by around 2x over the current implementation in assembly. The flexibility of the C code paves the way for adjustments in subsequent commits. Tested via make -C test / ctest / utest and by a couple of additional unit tests that exercise blocking (e.g., partial register blocks with fewer than UNROLL_M rows and/or fewer than UNROLL_N columns). Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 15:59:51 +02:00
Martin Kroeker	2271c3506b	Work around excessive LAPACK test failures on Skylake-X Something in the plain C parts of x86_64 cscal.c and zscal.c appears to be miscompiled by both gfortran9 and ifort when compiling for skylakex-avx512, even when the optimized Haswell microkernel is not in use.	2020-05-09 23:49:18 +02:00
Rajalakshmi Srinivasaraghavan	bd9ff820bc	Fix cmake compilation issue - POWER9 This patch removes extra space in the sgemmotcopy filename thereby allowing it to create entry in kernel/Makefile created by cmake.	2020-05-08 20:31:56 -05:00
Ashwin Sekhar T K	8353cb245a	ARM64: Improve DAXPY for ThunderX2 Improve performance of DAXPY for ThunderX2 when the vector fits in L1 Cache.	2020-05-07 09:22:50 -07:00
Martin Kroeker	90dba9f716	Duplicate earlier Clang 9.0.0 workaround for corresponding Apple Clang version As discussed on the original PR #2329, the "Apple Clang 11.0.3" that appears to be based on the same LLVM release produces the same miscompilation of this file.	2020-05-05 10:44:50 +02:00
Martin Kroeker	5dd14e3d48	Make building the bfloat16 functions conditional on option BUILD_HALF (#2590 ) * make building the bfloat16 BLAS functions conditional on BUILD_HALF * pass the BUILD_HALF option to gensymbol * Pass BUILD_HALF as a compiler define for dynamic_arch builds	2020-05-01 09:58:30 +02:00
Martin Kroeker	06208c8d01	Limit this fix to ELFv2 builds	2020-04-22 14:16:40 +02:00
Martin Kroeker	f5c4c28b98	Work around POWER8BE bugs on FreeBSD (ELFv2) for #2299	2020-04-21 17:17:17 +02:00
Martin Kroeker	fa42588e1f	Merge pull request #2565 from martin-frbg/mips24k Support MIPS32 24K family as P5600	2020-04-20 17:13:53 +02:00
Martin Kroeker	e55ec82bb9	Delete KERNEL.1004K	2020-04-19 15:44:30 +02:00
Martin Kroeker	7353ea5afc	Delete KERNEL.24K	2020-04-19 15:44:19 +02:00
Martin Kroeker	6a04efb122	Rename KERNEL files to include MIPS prefix	2020-04-19 15:43:54 +02:00
Martin Kroeker	d712ea724c	Add MIPS24K support	2020-04-18 21:10:18 +02:00
Rajalakshmi Srinivasaraghavan	22bb50fb81	cmake fixes	2020-04-17 13:35:17 -05:00
Rajalakshmi Srinivasaraghavan	67cc4b9e16	Fix warnings in clang and export symbol	2020-04-15 19:15:23 -05:00
Rajalakshmi Srinivasaraghavan	a87793e03c	Fix DYNAMIC_ARCH compilation errors	2020-04-15 09:09:50 -05:00
Rajalakshmi Srinivasaraghavan	ff010f496e	Build shgemm for all architecture	2020-04-14 20:38:53 -05:00
Rajalakshmi Srinivasaraghavan	7eb55504b1	RFC : Add half precision gemm for bfloat16 in OpenBLAS This patch adds support for bfloat16 data type matrix multiplication kernel. For architectures that don't support bfloat16, it is defined as unsigned short (2 bytes). Default unroll sizes can be changed as per architecture as done for SGEMM and for now 8 and 4 are used for M and N. Size of ncopy/tcopy can be changed as per architecture requirement and for now, size 2 is used. Added shgemm in kernel/power/KERNEL.POWER9 and tested in powerpc64le and powerpc64. For reference, added a small test compare_sgemm_shgemm.c to compare sgemm and shgemm output. This patch does not cover OpenBLAS test, benchmark and lapack tests for shgemm. Complex type implementation can be discussed and added once this is approved.	2020-04-14 14:55:08 -05:00
Martin Kroeker	5b0093b5fe	Convert aligned moves to unaligned This should have no performance impact on reasonably modern CPUs, and it fixes occasional crashes in actual user code.	2020-04-13 14:58:52 +02:00
Martin Kroeker	e9bfa2291a	Fix parameter overflow	2020-04-12 19:47:02 +02:00
gxw	8d07cf9b67	Fix compilation problem on loongson platform Using "make TARGET=GENERIC" on loongson platform will get the following error messages: "make[1]: *** No rule to make target 'sgemm_incopy.o', needed by 'libs'" Add kernel/mips64/KERNEL.generic to solve the problem.	2020-04-09 19:28:15 +08:00
Martin Kroeker	806f89166e	Make ARMV7 compile with xcode and add a CI job for it (#2537 ) * Add an ARMV7 iOS build on Travis * thread_local appears to be unavailable on ARMV7 iOS * Add no-thumb option for ARMV7 IOS build to get it to accept DMB ISH * Make local labels in macros of nrm2_vfpv3.S compatible with the xcode assembler	2020-04-02 10:30:37 +02:00
Martin Kroeker	c6af9bbb32	Merge pull request #2534 from martin-frbg/issue2496 Fix zero initialization for beta=0 case	2020-03-31 20:53:13 +02:00
Martin Kroeker	144be81ca1	fix initialization to zero in the NEON SGEMM_BETA kernel as well	2020-03-31 16:53:56 +02:00
Martin Kroeker	07cdd5d05c	Fix zero initialization for beta=0 case use immediate initialization instead of multiplication in case register content is a NaN	2020-03-31 00:21:02 +02:00
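Note on 07cdd5d05c: the reasoning behind the beta=0 fix can be illustrated in plain C. This is a minimal sketch, not the actual kernel code; the function names scale_c_naive and scale_c are hypothetical. Under IEEE 754 arithmetic, NaN * 0 is NaN and Inf * 0 is NaN, so producing zeros by multiplication is only safe if the operand is known to be finite. Storing an immediate zero avoids the problem.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical naive version: computes C := beta * C by multiplication only.
 * With beta == 0, any NaN or Inf already present in c[] yields NaN. */
static void scale_c_naive(float *c, size_t n, float beta) {
    for (size_t i = 0; i < n; i++) {
        c[i] *= beta;
    }
}

/* Hypothetical fixed version: for beta == 0, store an immediate zero
 * instead of multiplying, so stale NaN/Inf contents cannot propagate. */
static void scale_c(float *c, size_t n, float beta) {
    if (beta == 0.0f) {
        for (size_t i = 0; i < n; i++) {
            c[i] = 0.0f;
        }
    } else {
        for (size_t i = 0; i < n; i++) {
            c[i] *= beta;
        }
    }
}
```

Calling scale_c on a buffer that contains NaN with beta set to 0 leaves only zeros, whereas scale_c_naive keeps the NaN. This assumes the code is not compiled with options such as -ffast-math that relax IEEE semantics.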
Martin Kroeker	567d2760e6	Merge pull request #2520 from wjc404/develop Fix avx512 sgemm performance bug when ldc is a multiple of 1024	2020-03-30 20:15:59 +02:00
wjc404	b8307768e2	Add files via upload	2020-03-21 05:42:10 +08:00
Martin Kroeker	af8a619e1f	Merge pull request #2517 from wjc404/develop Temporary fix for SKX STRSM	2020-03-17 10:12:53 +01:00
wjc404	62b9608986	Update KERNEL.SKYLAKEX	2020-03-17 12:52:55 +08:00
Martin Kroeker	a1b181cea2	Merge pull request #2516 from wjc404/develop AVX2 STRSM kernels	2020-03-16 21:58:34 +01:00
wjc404	cdc0e9011e	Update KERNEL.ZEN	2020-03-16 16:39:37 +00:00
wjc404	fa049d49c2	AVX2 STRSM kernel	2020-03-17 00:34:08 +08:00
s00548429	bec7923a0d	Fix the functional bugs for zamax.	2020-03-09 15:36:50 +08:00
Rajalakshmi Srinivasaraghavan	2afc074803	Fix DYNAMIC_ARCH build for POWER9 Setting DYNAMIC_ARCH=1 on POWER9 does not build POWER9 files due to some compiler version checks. This patch fixes some of the macros that are used to check compiler version. On fixing those checks, there are some new make failures related to icamin, icamax, isamin, isamax and caxpy files on POWER9. This patch fixes those failures as well.	2020-03-03 12:35:10 -06:00
Martin Kroeker	4f371b0fbf	Use POWER8 kernels on big-endian POWER9 for now	2020-03-01 23:45:58 +01:00
Martin Kroeker	ea8eec5d17	Merge pull request #2422 from wjc404/develop Adjust SkylakeX GEMM3M parameters, add an AVX512 STRMM kernel and fix performance bugs in AVX2 s/c/z GEMM	2020-02-29 19:07:35 +01:00
Ali Saidi	c623a965f9	Add Neoverse-N1 core The implementation is a hybrid of the ARMV8 one with some of the improved TX2 routines along with specifying -march=v8.2-a	2020-02-29 03:22:04 +00:00
wjc404	dd22eb7621	Update cgemm_kernel_8x2_haswell.c	2020-02-27 22:26:15 +08:00
wjc404	2352331e60	Update zgemm_kernel_4x2_haswell.c	2020-02-27 22:25:19 +08:00
Xianyi Zhang	265ab484c8	Change default RISC-V 64-bit corename to RISCV64_GENERIC e.g. make CC=riscv64-unknown-linux-gnu-gcc FC=riscv64-unknown-linux-gnu-gfortran TARGET=RISCV64_GENERIC HOSTCC=gcc	2020-02-27 14:46:15 +08:00
Xianyi Zhang	44020a42a4	Fixed compile bug for RV64.	2020-02-27 14:29:42 +08:00
Xianyi Zhang	4aa2d89217	Merge branch 'develop' into risc-v	2020-02-27 13:53:49 +08:00
wjc404	1b980001dd	Update zgemm_kernel_4x2_haswell.c	2020-02-26 18:38:12 +08:00
wjc404	2515e1152f	Update cgemm_kernel_8x2_haswell.c	2020-02-26 18:36:54 +08:00
Martin Kroeker	ddcbed6690	Merge pull request #2437 from martin-frbg/issue2434 [WIP] Add support for Ampere EMAG8180 ARMV8 cpu	2020-02-25 18:42:52 +01:00
wjc404	903854c168	Add files via upload	2020-02-22 23:40:02 +08:00
wjc404	a2ff577a30	Update KERNEL.ZEN	2020-02-22 23:39:43 +08:00
wjc404	97a32cb0a5	Update KERNEL.HASWELL	2020-02-22 23:39:20 +08:00
Martin Kroeker	07454bf4d5	Add proper defaults for IxMIN/IxMAX kernels the fallbacks from Makefile.L1 assume a combined source for absolute value and non-absolute (with ifdef USE_ABS) but here we have separate implementations	2020-02-21 11:58:15 +01:00
Martin Kroeker	4046985913	Add proper defaults for IxMIN/IxMAX kernels the fallbacks from Makefile.L1 assume a combined source for absolute value and non-absolute (with ifdef USE_ABS) but here we have separate implementations	2020-02-21 11:55:52 +01:00
Martin Kroeker	e57b11acca	Add preliminary support for EMAG8180	2020-02-19 19:00:28 +01:00
Martin Kroeker	0b39cf95b0	Fix endianness conditionals	2020-02-19 18:09:54 +01:00
Martin Kroeker	9f39f0a2c3	Specify ismin/ismax assembly kernels for POWER8 directly to fix utest failure in new ismin test - Makefile.L1 defaults look wrong	2020-02-17 19:55:39 +01:00
Martin Liska	aeea14ee40	Come up with LOAD_AND_COMPARE_TO_MXX macro in iamax_sse.S.	2020-02-17 09:01:53 +01:00
Martin Liska	18bcc36a69	Fix implementation of iamax_sse.S as reported in #2116. There was a typo in iamax_sse.S where one of the comparisons was cmpeqps instead of cmpeqss. That misdetected the index for sequences where the minimum value was 0.	2020-02-17 09:01:53 +01:00
Martin Liska	0e7f43c898	Add missing USE_MIN in kernel/CMakeLists.txt.	2020-02-17 09:01:53 +01:00
wjc404	f566787e6e	Update KERNEL.SKYLAKEX	2020-02-16 22:58:44 +08:00
wjc404	e3368cbf18	AVX512 STRMM kernel	2020-02-16 22:58:00 +08:00
Martin Kroeker	cafdd999b8	Update caxpy_power8.S	2020-02-13 22:44:09 +01:00
Martin Kroeker	92ca92a46c	Update caxpy_power8.S	2020-02-13 21:24:54 +01:00
Martin Kroeker	486c35c5dc	Update icamin_power8.S	2020-02-13 18:38:43 +01:00
Martin Kroeker	5ba3699f41	Update isamin_power8.S	2020-02-13 00:00:32 +01:00
Martin Kroeker	8eefa530cd	Update isamax_power8.S	2020-02-12 23:59:50 +01:00
Martin Kroeker	de40d47edf	Update isamin_power8.S	2020-02-12 23:57:48 +01:00
Martin Kroeker	7c162b8a21	Update isamax_power8.S	2020-02-12 23:56:57 +01:00
Martin Kroeker	0544cbc806	Fix syntax of endianness conditional	2020-02-12 20:00:29 +01:00
Martin Kroeker	120d20731f	Fix syntax of endianness conditional	2020-02-12 19:58:42 +01:00
Martin Kroeker	dc345d84df	Fix syntax of endianness conditional and add gcc version check for workaround	2020-02-12 19:56:52 +01:00
Bart Oldeman	7ea5e07d1c	Fix inline asm in dscal: mark x, x1 as clobbered. Fixes #2408 The leaq instructions in dscal_kernel_inc_8 modify x and x1 so they must be declared as input/output constraints, otherwise the compiler may assume the corresponding registers are not modified.	2020-02-12 14:11:44 +00:00
Martin Kroeker	7e5cbb6f35	Fix bad conditional syntax that caused spurious application of USE_TRMM	2020-02-10 21:17:39 +01:00
wjc404	3447d04eaf	Update dgemm_kernel_16x2_skylakex.c	2020-02-06 02:14:10 +00:00
wjc404	8b5cdcc64c	Update sgemm_kernel_8x4_haswell.c	2020-02-06 01:47:46 +00:00
wjc404	4e00d96a78	Update dgemm_kernel_16x2_skylakex.c	2020-02-06 01:46:36 +00:00
wjc404	096da2f51a	Update dgemm_kernel_16x2_skylakex.c	2020-02-05 13:36:57 +08:00
wjc404	081b188529	Update KERNEL.SKYLAKEX	2020-02-03 21:38:08 +08:00
wjc404	8019e70211	AVX512 16x2 DGEMM kernel	2020-02-03 21:32:56 +08:00
Qiyu8	ff42e68652	Optimize general GEMM beta	2020-01-20 11:49:42 +08:00
Martin Kroeker	70f45749b9	Merge pull request #2367 from wjc404/develop Improve paralleled SGEMM performance on SKYLAKEX CPUs	2020-01-15 21:13:43 +01:00
wjc404	e5dcdeb550	Update sgemm_direct_skylakex.c	2020-01-13 16:59:23 +08:00
wjc404	952cc2ba38	Update sgemm_kernel_16x4_skylakex_2.c	2020-01-13 16:58:54 +08:00
wjc404	feaafbedd3	make skylakex sgemm code more friendly for readers BTW some kernels were adjusted to improve performance	2020-01-13 16:28:41 +08:00
Martin Kroeker	b36018be6d	Merge pull request #2365 from wjc404/develop Fix SKYLAKEX STRMM issues	2020-01-09 23:23:09 +01:00
wjc404	3a100b2797	Update KERNEL.SKYLAKEX	2020-01-09 13:48:41 +08:00
Martin Kroeker	38742d5547	Merge pull request #2361 from wjc404/develop Optimize AVX2 SGEMM & STRMM	2020-01-08 16:20:28 +01:00
wjc404	bd4c032f52	Update sgemm_kernel_8x4_haswell.c	2020-01-07 11:22:46 +08:00
wjc404	9dc9b7b95e	Update sgemm_kernel_8x4_haswell.c	2020-01-06 20:11:36 +08:00
wjc404	92b10212de	optimize AVX2 SGEMM	2020-01-06 12:11:21 +08:00
wjc404	b73bf01378	optimize AVX2 SGEMM	2020-01-06 12:09:14 +08:00
wjc404	eb3c9f1db9	optimize AVX2 SGEMM	2020-01-06 12:07:02 +08:00
Martin Kroeker	456ee2e1f0	Merge pull request #2357 from chenxuqiang/dgemm_beta_zero kernel/arm64/dgemm_beta.S: add beta == zero branch	2020-01-02 22:28:36 +01:00
shengyang	80db5f11e1	update	2020-01-02 11:01:57 +08:00
chenxuqiang	52de4cc8fd	kernel/arm64/dgemm_beta.S: add beta == zero branch Added a beta == zero branch, so there is no need to load the C matrix in that case. Signed by: Xuqiang Chen <chenxuqiang3@hisilicon.com>	2020-01-01 21:50:45 -05:00
Martin Kroeker	44028581cc	Merge pull request #2355 from Zeyiii/dev-zeyi2 Use arm neon instructions to optimize sgemm_beta operation	2020-01-01 22:14:16 +01:00
Martin Kroeker	86ab939936	Merge pull request #2354 from ZuoQ3/develop [WIP] Use arm neon instructions to optimize tcopy operation	2020-01-01 22:13:37 +01:00
Martin Kroeker	6c85cb1869	Merge pull request #2352 from wjc404/develop AVX2 ZGEMM3M kernel	2019-12-31 18:08:10 +01:00
Martin Kroeker	995768bbc5	Merge pull request #2351 from Zeyiii/develop prefetching for dgemm_beta	2019-12-31 18:07:37 +01:00
int_13h	96ad579428	add in runtime cpu detection for zarch (#2349 ) add in runtime cpu detection for zarch	2019-12-31 18:03:27 +01:00
shengyang	8d84403205	Use arm neon instructions to optimize ncopy operation modified: KERNEL.ARMV8 modified: KERNEL.TSV110 new file: sgemm_ncopy_4.S	2019-12-31 17:06:35 +08:00
w00421467	0833a4846a	Use arm neon instructions to optimize sgemm_beta operation	2019-12-31 10:42:03 +08:00
zq	50f7fc1401	[WIP] Use arm neon instructions to optimize tcopy operation	2019-12-31 10:21:23 +08:00
w00421467	d1b53806be	Merge remote-tracking branch 'pub/develop' into develop	2019-12-31 10:13:24 +08:00
wjc404	a0f0a802fc	Update zgemm3m_kernel_4x4_haswell.c	2019-12-30 17:33:42 +08:00
wjc404	700fe5b5ee	Add files via upload	2019-12-30 17:18:59 +08:00
wjc404	f60840c420	Update KERNEL.ZEN	2019-12-30 16:04:23 +08:00
wjc404	109e18cd96	Update KERNEL.HASWELL	2019-12-30 16:03:24 +08:00
wjc404	ae1579be13	Create zgemm3m_kernel_4x4_haswell.c	2019-12-30 16:02:51 +08:00
w00421467	3ccf8885ac	prefetching for dgemm_beta	2019-12-30 11:45:49 +08:00
wjc404	cd765f094b	Update cgemm3m_kernel_8x4_haswell.c	2019-12-27 18:23:29 +08:00
wjc404	3a66c8cac1	Update KERNEL.ZEN	2019-12-27 18:04:08 +08:00
wjc404	ed9af2f7da	Update KERNEL.HASWELL	2019-12-27 18:01:38 +08:00
wjc404	5fd1edead9	Create cgemm3m_kernel_8x4_haswell.c	2019-12-27 18:00:55 +08:00
wjc404	eeecd623d8	Update cgemm_kernel_8x2_haswell.c	2019-12-24 00:40:16 +08:00
wjc404	2cd9306bb5	Update KERNEL.ZEN	2019-12-23 23:42:30 +08:00
wjc404	c418c81224	Update KERNEL.HASWELL	2019-12-23 23:41:44 +08:00
wjc404	025741f16a	Fast Haswell CGEMM kernel	2019-12-23 23:40:03 +08:00
wjc404	f41d52665d	Fast Haswell ZGEMM kernel	2019-12-21 14:37:06 +08:00
wjc404	d573d24de7	Fast Haswell ZGEMM kernel	2019-12-21 14:35:15 +08:00
w00421467	b7cc69ee62	declare DGEMM_BETA in KERNEL.ARMV8 rather than the generic KERNEL	2019-12-20 10:11:50 +08:00
w00421467	aeef942c4f	use arm neon instructions to optimize gemm beta operation	2019-12-17 10:00:13 +08:00
Martin Kroeker	1a6ea8ee6d	Merge pull request #2338 from kavanabhat/aix_mod Changes to build on AIX in POWER8 mode	2019-12-09 17:54:49 +01:00
Kavana Bhat	6baa9b07d7	AIX changes for Power8	2019-12-06 04:33:32 -06:00
Kavana Bhat	3938e59569	AIX changes for Power8	2019-12-04 00:23:46 -06:00
Isuru Fernando	b863b32ac5	Workaround an ICE in clang 9.0.0 This bug is not there in 8.x nor in the 9.0 daily snapshot.	2019-12-01 12:59:46 -06:00
Martin Kroeker	dd04143d4a	Merge pull request #2328 from martin-frbg/ppc9 Fix precompiled kernels on POWER9 and make their use conditional on (old) gcc version	2019-11-30 12:23:57 +01:00
Martin Kroeker	f3a6164bff	Merge pull request #2324 from antonblanchard/power9_segv Fix SEGV in cdot_power9	2019-11-30 00:03:42 +01:00
Martin Kroeker	dedd822d1a	Fix caxpy/caxpyc naming in localentry	2019-11-29 23:56:57 +01:00
Martin Kroeker	2181fb7047	Fix caxpy/caxpyc naming in localentry	2019-11-29 23:54:15 +01:00
Martin Kroeker	a9b62c03f8	Substitute precompiled gcc7 codes only when gcc is older than 9.x	2019-11-29 23:49:50 +01:00
Martin Kroeker	97762234f9	Add variable for gcc >=9 test used in KERNEL.POWER9	2019-11-29 23:47:23 +01:00
wjc404	934e601e93	Update dgemm_kernel_4x8_skylakex_2.c	2019-11-28 19:56:35 +08:00
Anton Blanchard	cf2a8e410c	Fix SEGV in cdot_power9 We were corrupting r2 because the local entry wasn't being setup correctly.	2019-11-26 21:55:04 -07:00
wjc404	eb1e9c8c92	some optimizations	2019-11-26 14:12:20 +08:00
Andreas Arnez	d117dfd505	Change bad usage of "asum" to "sum" in ZARCH versions of ?sum The ZARCH implementations of ?sum contain a cut & paste-error: An inline assembly argument is named "sum", but the assembly references "asum" instead. The mismatch causes a build error. This is fixed.	2019-11-21 13:49:13 +01:00
Martin Kroeker	b09b5be0a4	Merge pull request #2315 from ewanglong/develop revised fix windows compatible for #2313	2019-11-21 05:06:44 +01:00
Wang, Long	bfb5fbdb4d	revised fix windows compatible for #2313 Signed-off-by: Wang, Long <long1.wang@intel.com>	2019-11-21 10:22:58 +08:00
Martin Kroeker	08fa83aba2	Merge pull request #2312 from martin-frbg/power8be Further Power8 big-endian corrections	2019-11-20 15:12:06 +01:00
Wang, Long	1191db1a49	For the sake of Windows compatibility, use "unsigned long long" to ensure 64-bit length Signed-off-by: Wang, Long <long1.wang@intel.com>	2019-11-20 21:30:47 +08:00
Wang, Long	0caf1434c9	Fix the integer overflow issue for large matrix size For a large matrix, e.g. M=N=K and M>1290, int mnk=M*N*K will overflow. This leads to wrong branching to single-threading, and performance degrades significantly. Signed-off-by: Wang, Long <long1.wang@intel.com>	2019-11-20 14:11:17 +08:00
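Note on 0caf1434c9 and 1191db1a49: the M>1290 threshold follows from 32-bit int arithmetic, since 1290^3 = 2,146,689,000 still fits below INT_MAX = 2,147,483,647 while 1291^3 = 2,151,685,171 does not. The following is a minimal sketch of the widened computation, assuming a 32-bit int and non-negative dimensions; the helper names gemm_work and fits_in_int are hypothetical and do not appear in OpenBLAS.

```c
#include <limits.h>

/* Hypothetical helper: compute the M*N*K work estimate in 64-bit
 * arithmetic. "unsigned long long" is at least 64 bits on every
 * conforming C compiler, including on Windows, where "long" is
 * only 32 bits. Assumes m, n and k are non-negative. */
static unsigned long long gemm_work(int m, int n, int k) {
    return (unsigned long long)m * (unsigned long long)n
         * (unsigned long long)k;
}

/* Hypothetical check: returns 1 if the product would still fit in
 * a plain int, 0 if an "int mnk = m * n * k" would overflow. */
static int fits_in_int(int m, int n, int k) {
    return gemm_work(m, n, k) <= (unsigned long long)INT_MAX;
}
```

Each operand is cast before multiplying so that no intermediate product is evaluated in int; casting only the final result would overflow first and widen afterwards.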
Martin Kroeker	cad0d150db	Define alternate kernels for big-endian POWER8	2019-11-17 23:12:10 +01:00
Martin Kroeker	eba0aeb7cd	Fix compilation for big-endian POWER8	2019-11-17 22:58:32 +01:00
Martin Kroeker	0c07c356c1	Define alternate kernels for big-endian PPC440	2019-11-17 19:25:08 +01:00
Martin Kroeker	3e67017ac8	Merge pull request #2309 from martin-frbg/ppc970-be Fix PPC970 big-endian support	2019-11-17 18:22:24 +01:00
Martin Kroeker	b3ac6ee222	Define alternate kernels for big-endian PPC970 The altivec versions of SGEMM and CGEMM fail most tests in LAPACK-TESTING when compiled for big endian, STRSM/CTRSM even cause segfaults. The rot kernels either fail the corresponding utest or lead to failures in LAPACK-TESTING.	2019-11-17 15:19:39 +01:00
Martin Kroeker	71e96163db	Merge pull request #2305 from wjc404/develop AVX512 CGEMM & ZGEMM kernels	2019-11-12 07:38:37 +01:00
wjc404	819e852ae7	AVX512 CGEMM & ZGEMM kernels 96-99% 1-thread performance of MKL2018	2019-11-11 20:04:52 +08:00
Martin Kroeker	4c6a457358	Merge pull request #2300 from wjc404/develop Optimize SGEMM on SKYLAKEX CPUs	2019-11-06 07:27:33 +01:00
wjc404	836c414e22	optimizations of software prefetching	2019-11-05 13:36:56 +08:00
Martin Kroeker	3cd97f1a80	Merge pull request #2301 from martin-frbg/ppc8be Disable IDAMIN/MAX and IZAMIN/MAX optimizations on big-endian POWER8	2019-11-04 22:54:28 +01:00
wjc404	430c11e135	Add files via upload	2019-11-04 20:10:12 +08:00
wjc404	fbacd2605d	optimizations via software prefetches	2019-11-04 19:37:19 +08:00
Martin Kroeker	68597002ea	The assembly microkernel is not safe to use on ELFv1	2019-11-03 22:42:46 +01:00
Martin Kroeker	d2a6285549	The assembly microkernel is not safe to use on ELFv1	2019-11-03 22:41:19 +01:00
Martin Kroeker	d999688d1a	The assembly microkernel is not safe to use on ELFv1	2019-11-03 22:39:06 +01:00
Martin Kroeker	928fe1b28e	The assembly microkernel is not safe to use on ELFv1	2019-11-03 22:37:27 +01:00
wjc404	1df9a2013d	new sgemm kernel for skylakex	2019-11-02 00:00:48 +08:00
Martin Kroeker	85ccdce8c4	Remove the IOS fallbacks to generic C kernels	2019-10-25 23:02:37 +02:00
wjc404	6ff013bae0	native support for icopy_4 90% MKL 1-thread performance.	2019-10-19 03:54:44 +08:00
wjc404	0d669e04bb	Update dgemm_kernel_8x8_skylakex.c	2019-10-18 15:00:17 +08:00
wjc404	17cdd9f9e1	some correction	2019-10-18 14:58:07 +08:00
wjc404	6bcb06fcb1	make further changes to icopy_8 easier	2019-10-18 10:47:31 +08:00
wjc404	b7315f8401	Add files via upload	2019-10-16 19:23:36 +08:00
wjc404	9b19e9e1b0	Update dgemm_kernel_8x8_skylakex.c	2019-10-16 10:14:51 +08:00
wjc404	6bd67ddbab	Update dgemm_kernel_8x8_skylakex.c	2019-10-16 03:20:08 +08:00
wjc404	844629af57	Add files via upload	2019-10-16 02:00:34 +08:00
Martin Kroeker	a448884a63	Remove automatic label postfixes from macro included only once	2019-10-08 08:37:50 +02:00
Martin Kroeker	3a2df19db6	Fix accidental duplication of jump instruction	2019-10-08 08:09:26 +02:00
Martin Kroeker	d2093a40d3	Merge pull request #2277 from martin-frbg/issue2275 Rewrite ARMV8 code to allow cross-compilation for IOS	2019-10-06 23:01:54 +02:00
Martin Kroeker	56837e9d92	Make local labels in macro compatible with the xcode assembler ... which does not perform the automatic numbering on instantiation that the _@ suffix signifies	2019-10-04 14:53:23 +02:00
Martin Kroeker	5e244d80f2	Merge pull request #2271 from quickwritereader/strmm_fix fixed bug power9 strmm . BLAS-TESTER passes	2019-09-29 13:53:45 +02:00
AbdelRauf	ede5efebab	trmm fix	2019-09-29 02:28:34 +00:00
Martin Kroeker	596a22325a	Fix prologue of power9 assembly cdot(c) kernel to provide cdotc	2019-09-27 00:47:18 +02:00
Martin Kroeker	7f58f3ad0e	Fix mis-edits in the gcc-derived power8 caxpy kernel	2019-09-27 00:44:26 +02:00
Martin Kroeker	673e5a0495	Replace several POWER8/9 C kernels with their gcc7-generated assembly versions (#2263 ) * Add gcc7-generated assembly files for POWER8/9 isa/ica-min/max and POWER9 caxpy To work around internal compiler errors encountered when compiling the original C source with gcc 4 and 5, and wrong code generated by gcc 8.3.0 * Use gcc-generated assembly instead of original C sources to work around internal compiler errors encountered with gcc 4.8/5.4 and wrong code generation by gcc 8.3 * Use gcc-generated assembly instead of the original C source to work around internal compiler errors encountered with gcc 4.8 and 5.4, and wrong code generation by gcc 8.3 * Add gcc7-generated assembler version of caxpy for power8 to work around wrong code generated by gcc 8.3 * Handle CONJ define for caxpyc * Handle CONJ define for caxpyc * Add gcc7-generated assembly cdot for POWER9 * Use prebuilt assembly for POWER9 cdot created with gcc 7.3.1 to work around ICE in older gcc versions * Exclude POWER9 from DYNAMIC_ARCH when the gcc version is lower than 6 * Update Makefile.system * Use PROLOGUE macro to ensure correct function name for DYNAMIC_ARCH * Disable POWER9 with old gcc versions	2019-09-22 22:35:22 +02:00
Martin Kroeker	e7c4d6705a	Revert #2051 and replace with a better fix (#2261 ) * Revert #2051 and add a better fix for TARGET=generic with DYNAMIC_ARCH fixes #2257 without breaking #2048 again	2019-09-17 18:56:04 +02:00
Martin Kroeker	f3c314550c	Merge pull request #2243 from quickwritereader/develop possible cgemv,caxpy,cdot fix	2019-08-30 23:06:23 +02:00
AbdelRauf	847c20c9b7	fix uninitialized variables i	2019-08-30 11:14:55 +00:00
AbdelRauf	4c22828812	caxpy and cdot are using vec_vsx_ld	2019-08-30 04:09:15 +00:00
AbdelRauf	e79712d969	cgemv using vec_vsx_ld instead of letting gcc decide	2019-08-30 02:52:04 +00:00
AbdelRauf	be09551cdf	aligned	2019-08-29 23:22:23 +00:00
Martin Kroeker	11c59acfb1	Keep both PGI/SUN and default code paths to avoid breaking Clang/Windows	2019-08-28 18:07:44 +02:00
Martin Kroeker	3a55dca2dc	Make x86_64 zdot compile with PGI and Sun C again broken by #2222 as CREAL,CIMAG do not expand to a valid lvalue with these compilers	2019-08-28 11:35:31 +02:00
Kavana Bhat	3dc6b26eff	AIX changes for Power8	2019-08-20 06:51:35 -05:00
Martin Kroeker	9ef96b32a6	Add multithreading support to the x86_64 zdot kernel (#2222 ) * Add multithreading support copied from the ThunderX2T99 kernel. For #2221	2019-08-15 22:09:12 +02:00
Martin Kroeker	103b32fdb7	Merge pull request #2216 from martin-frbg/issue2214 Remove case-sensitivity in x86 LSAME on (AMD) cpus without CMOV	2019-08-13 13:59:33 +02:00
Martin Kroeker	aef9804089	Fix unwanted case-sensitivity in x86 LSAME for (AMD) processors without CMOV Problem was already noticed some years ago in #238, but back then the problem was only corrected in one of the #ifdef branches. Fixes #2214	2019-08-13 10:19:10 +02:00
Martin Kroeker	dccff2e785	Merge pull request #2206 from martin-frbg/zen-dtrmm Replace vpermpd with vpermilpd in the Haswell DTRMM kernel	2019-08-09 07:55:20 +02:00
Martin Kroeker	5c3458a6e7	Merge pull request #2199 from martin-frbg/zen-dtrsm Replace most vpermpd calls in the Haswell DTRSM_RN kernel	2019-08-09 07:55:02 +02:00
Martin Kroeker	acf6002ab2	Replace most vpermpd calls in the Haswell DTRSM_RN kernel	2019-08-03 12:40:13 +02:00
Martin Kroeker	2dfb804cb9	Replace vpermpd with vpermilpd in the Haswell DTRMM kernel to improve performance on AMD Zen (#2180) applying wjc404's improvement of the DGEMM kernel from #2186	2019-07-28 23:17:28 +02:00
Martin Kroeker	4c153ec9da	Merge pull request #2196 from wjc404/develop Add vbroadcastsd kernel to dgemm_kernel_4x8_haswell.S	2019-07-28 23:11:40 +02:00
wjc404	7eecd8e39c	Add files via upload	2019-07-28 07:39:09 +08:00
Martin Kroeker	7b0b7c11d2	Merge pull request #2190 from martin-frbg/zdot-zen Replace vpermpd with vpermilpd in the Haswell/Zen zdot microkernel	2019-07-23 16:15:08 +02:00
Martin Kroeker	28e96458e5	Replace vpermpd with vpermilpd to improve performance on Zen/Zen2 (as demonstrated by wjc404 in #2180)	2019-07-22 08:28:16 +02:00
wjc404	95fb98f556	Update dgemm_kernel_4x8_haswell.S	2019-07-21 01:10:32 +08:00
wjc404	4801c6d36b	Update dgemm_kernel_4x8_haswell.S	2019-07-21 00:47:45 +08:00
wjc404	9440fa607d	Add files via upload	2019-07-20 22:08:22 +08:00
wjc404	94db259e5b	Add files via upload	2019-07-20 22:04:41 +08:00
wjc404	f49f8047ac	Add files via upload	2019-07-20 14:33:37 +08:00
wjc404	825777faab	Update dgemm_kernel_4x8_haswell.S	2019-07-19 23:58:24 +08:00
wjc404	9c89757562	Add files via upload	2019-07-19 23:47:58 +08:00
wjc404	9b04baeaee	Update dgemm_kernel_4x8_haswell.S	2019-07-17 23:50:03 +08:00
wjc404	8a074b3965	Update dgemm_kernel_4x8_haswell.S	2019-07-17 23:47:30 +08:00
wjc404	211ab03b14	Update dgemm_kernel_4x8_haswell.S	2019-07-17 22:39:15 +08:00
wjc404	1733f927e6	Update dgemm_kernel_4x8_haswell.S	2019-07-17 21:27:41 +08:00
wjc404	182b06d6ad	Update dgemm_kernel_4x8_haswell.S	2019-07-17 17:02:35 +08:00
wjc404	7a9050d681	Update dgemm_kernel_4x8_haswell.S	2019-07-17 00:55:06 +08:00
wjc404	0ba29fd262	Update dgemm_kernel_4x8_haswell.S for zen2 replaced a bunch of vpermpd instructions with vpermilpd and vperm2f128	2019-07-17 00:46:51 +08:00
Martin Kroeker	6b6c9b1441	Merge pull request #2172 from quickwritereader/develop power9 cgemm/ctrmm. new sgemm 8x16	2019-07-01 21:06:02 +02:00
AbdelRauf	a97b301aaa	cgemm/ctrmm power9	2019-07-01 14:07:54 +00:00
Piotr Kubaj	eebfeba768	Fix build on FreeBSD/powerpc64. Signed-off-by: Piotr Kubaj <pkubaj@anongoth.pl>	2019-06-25 10:58:56 +02:00
kavanabhat	a575f1e4c7	Update dtrmm_kernel_16x4_power8.S	2019-06-19 15:27:14 +05:30
AbdelRauf	cdbfb891da	new sgemm 8x16	2019-06-17 15:33:38 +00:00
Martin Kroeker	a17cf36225	Merge pull request #2153 from quickwritereader/develop improved power9 zgemm,sgemm	2019-06-06 07:42:56 +02:00
AbdelRauf	148c4cc5fd	conflict resolve	2019-06-05 20:50:50 +00:00
AbdelRauf	d0c3543c3f	power9 zgemm ztrmm optimized	2019-06-05 20:07:16 +00:00
AbdelRauf	a469b32cf4	sgemm pipeline improved, zgemm rewritten without inner packs, ABI lxvx v20 fixed with vs52	2019-06-04 07:11:30 +00:00
AbdelRauf	8fe794f059	improved zgemm power9 based on power8	2019-05-30 15:31:25 +00:00
Martin Kroeker	74c10b57c6	Use generic kernels for complex (I)AMAX to support softfp	2019-05-30 11:38:11 +02:00
Martin Kroeker	c5495d2056	Ensure correct output for DAMAX with softfp	2019-05-30 11:25:43 +02:00
Martin Kroeker	c70496b108	Separate implementations of AMAX and IAMAX on arm As noted in #1912 and comment on #1942, the combined implementation happens to "do the right thing" on hardfp, but cannot return both value and index on softfp where they would have to share the return register	2019-05-29 15:02:51 +02:00
Martin Kroeker	9ea30f3788	Replace ISMIN and ISAMIN kernels on all x86_64 platforms (#2125 ) * Mark iamax_sse.S as unsuitable for MIN due to issue #2116 * Use iamax.S rather than iamax_sse.S for ISMIN/ISAMIN on all x86_64 as workaround for #2116	2019-05-09 14:42:36 +02:00
Martin Kroeker	6a8b4269b5	Merge pull request #2111 from martin-frbg/issue1955 Disable the SkyLakeX DGEMMIxCOPY kernels as well	2019-05-05 18:08:49 +02:00
Martin Kroeker	b1561ecc68	Disable DGEMMINCOPY as well for now #1955	2019-05-05 15:52:01 +02:00
Martin Kroeker	7ed8431527	Disable the SkyLakeX DGEMMITCOPY kernel as well as a stopgap measure for https://github.com/numpy/numpy/issues/13401 as mentioned in #1955	2019-05-04 22:54:41 +02:00
Martin Kroeker	3f427c0cf9	Merge pull request #2107 from quickwritereader/develop sgemm/strmm kernel for power9	2019-05-02 07:56:57 +02:00
AbdelRauf	47f892198c	conflict resolve	2019-05-01 19:36:22 +00:00
AbdelRauf	628b335e83	Merge branch 'develop' of https://github.com/quickwritereader/OpenBLAS into develop	2019-04-29 08:57:44 +00:00
AbdelRauf	0f105dd8a5	sgemm/strmm	2019-04-29 08:49:50 +00:00
Martin Kroeker	ccfb7ead15	Merge pull request #2072 from martin-frbg/sum Add (C)BLAS extension ?sum	2019-04-23 20:11:36 +02:00
Rashmica Gupta	bcdf1d4917	Add in runtime CPU detection for POWER.	2019-04-09 14:20:16 +10:00
Martin Kroeker	c04a729081	Add ?sum definitions for generic kernel	2019-03-31 13:55:49 +02:00
Martin Kroeker	100d94f94e	Add ?sum	2019-03-31 13:55:05 +02:00
Martin Kroeker	246ca29679	Add ZARCH implementation of ?sum as trivial copies of the respective ?asum kernels with the ABS and vflpsb calls removed	2019-03-30 22:49:05 +01:00
Martin Kroeker	9d717cb5ee	Add x86_64 implementation of ?sum as trivial copy of ?asum with the fabs calls removed	2019-03-30 22:27:04 +01:00
Martin Kroeker	e3bc83f2a8	Add x86 implementation of ?sum as trivial copy of ?asum with the fabs calls removed	2019-03-30 22:26:10 +01:00
Martin Kroeker	70f2a4e0d7	Add SPARC implementation of ?sum as trivial copy of ?asum with the fabs replaced by fmov to preserve code structure	2019-03-30 22:25:06 +01:00
Martin Kroeker	706dfe263b	Add POWER implementation of ?sum as trivial copy of ?asum with the fabs replaced by fmr to preserve code structure	2019-03-30 22:23:42 +01:00
Martin Kroeker	688fa9201c	Add MIPS64 implementation of ?sum as trivial copy of ?asum with the fabs replaced by mov to preserve code structure	2019-03-30 22:22:15 +01:00
Martin Kroeker	cdbe0f0235	Add MIPS implementation of ?sum as trivial copy of ?asum with the fabs calls removed	2019-03-30 22:20:14 +01:00
Martin Kroeker	f8b82bc6dc	Add ia64 implementation of ?sum as trivial copy of asum with the fabs calls removed	2019-03-30 22:18:03 +01:00
Martin Kroeker	3e3ccb9011	Add ARM64 implementations of ?sum as trivial copies of the respective ?asum kernels with the fabs calls removed	2019-03-30 22:13:36 +01:00
Martin Kroeker	94ab4e6fb2	Add ARM implementations of ?sum (trivial copies of the respective ?asum with the fabs calls removed)	2019-03-30 22:11:38 +01:00
Martin Kroeker	c3cfc6986b	Add implementations of ssum/dsum and csum/zsum as trivial copies of asum/zsasum with the fabs calls replaced by fmov to preserve code structure	2019-03-30 22:05:11 +01:00
Martin Kroeker	b9f4943a14	Add ?sum	2019-03-30 22:01:13 +01:00
Martin Kroeker	32c7063cb0	Merge pull request #2061 from martin-frbg/martin-frbg-patch-1 Disable the AVX512 DGEMM kernel (again)	2019-03-30 21:21:38 +01:00
Martin Kroeker	7c51cc8527	Merge branch 'develop' into develop	2019-03-29 19:36:29 +01:00
AbdelRauf	853a18bc17	power9 makefile. dgemm based on power8 kernel with following changes : 32x unrolled 16x4 kernel and 8x4 kernel using (lxv stxv butterfly rank1 update). improvement from 17 to 22-23gflops. dtrmm cases were added into dgemm itself	2019-03-29 15:49:40 +00:00
Martin Kroeker	e608d4f7fe	Disable the AVX512 DGEMM kernel (again) Due to as yet unresolved errors seen in #1955 and #2029	2019-03-13 22:10:28 +01:00
Martin Kroeker	03d7110900	Merge pull request #2042 from maomao194313/develop add TARGET support for HiSilicon tsv110 CPUs	2019-03-12 22:57:39 +01:00
Martin Kroeker	f18ab6c17b	Merge pull request #2051 from martin-frbg/issue2048 Make TARGET=GENERIC compatible with DYNAMIC_ARCH=1	2019-03-09 16:39:35 +01:00
Martin Kroeker	5b95534afc	Make TARGET=GENERIC compatible with DYNAMIC_ARCH=1 for issue #2048	2019-03-09 11:21:16 +01:00
Celelibi	b7f59da42d	Fix crash in sgemm SSE/nano kernel on x86_64 Fix bug #2047. Signed-off-by: Celelibi <celelibi@gmail.com>	2019-03-07 16:55:13 +01:00
maomao194313	783ba8058f	HiSilicon tsv110 CPUs optimization branch add HiSilicon tsv110 CPUs optimization branch	2019-03-04 16:30:50 +08:00
Andrew	6eee1beac5	move fix to right place	2019-02-24 20:41:02 +02:00
Martin Kroeker	e12cdf58ef	Merge pull request #2024 from martin-frbg/gcc9fixes4 Fix inline assembly constraints in Bulldozer TRSM kernels	2019-02-17 11:49:15 +01:00
Martin Kroeker	1860c9456d	Merge pull request #2023 from martin-frbg/gcc9fixes3 Fix inline assembly constraints in various x86_64 GEMVN kernels	2019-02-17 11:48:57 +01:00
Martin Kroeker	f9bb76d29a	Fix inline assembly constraints in Bulldozer TRSM kernels rework indices to allow marking i,as and bs as both input and output (marked operand n1 as well for simplicity). For #2009	2019-02-16 20:06:48 +01:00
Martin Kroeker	efb9038f72	Fix inline assembly constraints	2019-02-16 18:46:17 +01:00
Martin Kroeker	e976557d29	Fix inline assembly constraints rework indices to allow marking argument lda as input and output.	2019-02-16 18:36:39 +01:00
Martin Kroeker	9d8be15789	Fix inline assembly constraints rework indices to allow marking argument lda4 as input and output. For #2009	2019-02-16 18:24:11 +01:00
Martin Kroeker	d752799a0f	Merge pull request #2021 from martin-frbg/gcc9fixes2 Fix wrong constraints in inline assembly of Haswell DTRSM kernel	2019-02-16 18:05:40 +01:00
Martin Kroeker	c26c0b77a7	Fix wrong constraints in inline assembly for #2009	2019-02-15 15:08:16 +01:00
Martin Kroeker	1c6da2d03c	Merge pull request #2019 from martin-frbg/gcc9fixes Fix unannounced modification of input operand 8 (lda4) in Haswell GEMVN microkernel	2019-02-15 15:02:54 +01:00
Martin Kroeker	4255a58cd2	Rename operands to put lda on the input/output constraint list	2019-02-15 10:10:04 +01:00
Martin Kroeker	46e415b140	Save and restore input argument 8 (lda4) Fixes miscompilation with gcc9 -ftree-vectorize (related to issue #2009)	2019-02-14 22:43:18 +01:00
Bart Oldeman	69a97ca7b9	dgemv_kernel_4x4(Haswell): add missing clobbers for xmm0,xmm1,xmm2,xmm3 This fixes a crash in dblat2 when OpenBLAS is compiled using -march=znver1 -ftree-vectorize -O2 See also: https://github.com/easybuilders/easybuild-easyconfigs/issues/7180	2019-02-14 16:27:58 +00:00
Martin Kroeker	056917d616	Merge pull request #2013 from martin-frbg/issue2011 Fix invalid memory access in PPC gemm_beta	2019-02-14 09:29:34 +01:00
Martin Kroeker	718efcec6f	Fix out-of-bounds memory access in gemm_beta Fixes #2011 (as suggested by davemq), assuming typo by K.Goto	2019-02-13 22:08:37 +01:00
Martin Kroeker	f9d67bb5e8	Fix out-of-bounds memory access in gemm_beta Fixes #2011 (as suggested by davemq) presuming typo by K.Goto	2019-02-13 22:06:41 +01:00
Martin Kroeker	76bb74fcd4	Merge pull request #2012 from maamountki/z14 [ZARCH] Many improvements	2019-02-13 20:15:56 +01:00
maamountki	0a54c98b9d	[ZARCH] Modify constraints	2019-02-13 21:06:25 +02:00
maamountki	bec54ae366	[ZARCH] Fix caxpy	2019-02-13 12:54:35 +02:00
Martin Kroeker	ab1630f9fa	Fix declaration of arguments in inline assembly Argument 0 is modified so should be input and output	2019-02-12 16:14:02 +01:00
Martin Kroeker	b824fa70eb	Fix declaration of assembly arguments in SSYMV and DSYMV microkernels Arguments 0 and 1 are both input and output	2019-02-12 16:00:18 +01:00
Martin Kroeker	91481a3e4e	Fix declaration of input arguments in inline assembly Argument 0 is modified as it doubles as a counter	2019-02-12 15:51:43 +01:00
Martin Kroeker	dc6ac9eab0	Fix declaration of input arguments in the x86_64 s/dGEMV_T and s/dGEMV_N kernels Arguments 0 and 1 need to be tagged as both input and output	2019-02-12 15:33:48 +01:00
maamountki	f583674109	[ZARCH] Fix cgemv_t_4	2019-02-12 13:12:28 +02:00
maamountki	77fe70019f	[ZARCH] Fix constraints and source code formatting	2019-02-11 16:01:13 +02:00
maamountki	7039770165	[ZARCH] Undo the last commit	2019-02-06 20:11:44 +02:00
maamountki	11a43e8116	[ZARCH] Set alignment hint for vl/vst	2019-02-05 19:17:08 +02:00
maamountki	61526480f9	[ZARCH] Fix copy constraint	2019-02-05 07:51:19 +02:00
maamountki	81daf6bc38	[ZARCH] Format source code, Fix constraints	2019-02-05 07:30:38 +02:00
Martin Kroeker	729e925174	Merge pull request #1996 from quickwritereader/develop NBMAX=4096 for gemvn, added sgemvn 8x8 for future	2019-02-04 16:52:04 +01:00
Ubuntu	498ac98581	Note for unused kernels	2019-02-04 15:41:56 +00:00
Ubuntu	cd9ea45463	NBMAX=4096 for gemvn, added sgemvn 8x8 for future	2019-02-04 06:57:11 +00:00
Martin Kroeker	f9c5023e04	Merge pull request #1994 from quickwritereader/develop sgemv cgemv pairs	2019-02-01 21:04:47 +01:00
Ubuntu	4abc375a91	sgemv cgemv pairs	2019-02-01 13:45:00 +00:00
Martin Kroeker	874df65491	Fix incorrect sgemv results for IBM z14 part of PR #1993 that was inadvertently misplaced into the toplevel directory	2019-02-01 12:58:59 +01:00
Martin Kroeker	877023e1e1	Fix precision of zarch DSDOT from patch provided by aarnez in #991	2019-01-31 21:22:26 +01:00
Martin Kroeker	265142edd5	Fix typo in the zarch min/max kernels from patch provided by aarnez in #991	2019-01-31 21:21:40 +01:00
Martin Kroeker	885a3c4350	USE_TRMM on Z14 from patch provided by aarnez in #991	2019-01-31 21:18:09 +01:00
maamountki	82124729af	Merge branch 'develop' into z14	2019-01-31 19:36:41 +02:00
maamountki	29416cb5a3	[ZARCH] Add Z13 version for max/min functions	2019-01-31 19:11:11 +02:00
maamountki	48b9b94f7f	[ZARCH] Improve loading performance for camax/icamax	2019-01-31 18:52:11 +02:00
Martin Kroeker	86a824c97f	Fix wrong comparison that made IMIN identical to IMAX as reported by aarnez in #1990	2019-01-31 15:27:21 +01:00
Martin Kroeker	808410c2c7	Fix wrong comparison that made IMIN identical to IMAX as suggested in #1990	2019-01-31 15:25:15 +01:00
maamountki	fcd814a8d2	[ZARCH] Fix bug in max/min functions	2019-01-29 17:59:38 +02:00
maamountki	dc4d3bccd5	[ZARCH] Fix icamax/icamin	2019-01-29 03:47:49 +02:00
maamountki	c7143c1019	[ZARCH] Fix iamax/imax single precision	2019-01-28 17:52:23 +02:00
maamountki	04873bb174	[ZARCH] Undo the last commit	2019-01-28 17:32:24 +02:00
maamountki	c8ef9fb220	[ZARCH] Fix bug in iamax/iamin/imax/imin	2019-01-28 17:16:18 +02:00
maamountki	b111829226	[ZARCH] Update max/min functions	2019-01-21 15:56:04 +02:00
Martin Kroeker	32b0f1168e	Fix declaration of input arguments in the Sandybridge GER microkernels (#1967) * Tag arguments 0 and 1 as both input and output	2019-01-18 08:11:39 +01:00
Martin Kroeker	b495e54310	Fix declaration of input arguments in the x86_64 SCAL microkernels (#1966) * Tag arguments 0 and 1 as both input and output (see #1964)	2019-01-18 08:11:07 +01:00
Martin Kroeker	d5e6940253	Fix declaration of input arguments in the x86_64 microkernels for DOT and AXPY (#1965) * Tag operands 0 and 1 as both input and output For #1964 (basically a continuation of coding problems first seen in #1292)	2019-01-17 23:20:32 +01:00
Ubuntu	43a4572038	crot fix	2019-01-17 14:45:31 +00:00
Abdelrauf	a034e65512	Merge branch 'develop' into develop	2019-01-16 19:25:13 +04:00
Ubuntu	8c3386be87	Added missing Blas1 single fp {saxpy, caxpy, cdot, crot(refactored version of srot), isamax, isamin, icamax, icamin}, Fixed idamin, icamin choosing the first occurrence index of equal minimals	2019-01-16 15:16:21 +00:00
maamountki	b815a04c87	[ZARCH] fix a bug in max/min functions	2019-01-15 21:04:22 +02:00
maamountki	1a7925b3a3	[ZARCH] Update dgemv_n_4.c	2019-01-11 17:43:11 +02:00
maamountki	406f835f00	[ZARCH] update cgemv_n_4.c	2019-01-11 17:39:17 +02:00
maamountki	621dedb37b	[ZARCH] Update cgemv_t_4.c	2019-01-11 17:37:11 +02:00
maamountki	b731e8246f	Update sgemv_t_4.c	2019-01-11 17:14:04 +02:00
maamountki	ecc31b743f	Update dgemv_t_4.c	2019-01-11 17:13:02 +02:00
maamountki	5d89d6b143	[ZARCH] fix sgemv_n_4.c	2019-01-11 17:08:24 +02:00
maamountki	67432b23c2	[ZARCH] fix cgemv_n_4.c	2019-01-11 16:44:46 +02:00
maamountki	be66f5d5c2	[ZARCH] fix data prefetch type in sdot	2019-01-09 16:50:07 +02:00
maamountki	c2ffef8156	[ZARCH] fix data prefetch type in ddot	2019-01-09 16:49:44 +02:00
maamountki	e7455f500c	[ZARCH] fix dsdot.c	2019-01-09 16:33:54 +02:00
maamountki	3eafcfa650	[ZARCH] fix cgemv_n_4.c	2019-01-09 07:43:45 +02:00
maamountki	94cd946b96	[ZARCH] fix cgemv_n_4.c	2019-01-04 17:45:56 +02:00
maamountki	1aa840a0a2	[ZARCH] fix sgemv_t_4.c	2019-01-04 01:38:18 +02:00
Arjan van de Ven	795285c587	Fix thinko in skylake beta handling casting ints is cheaper but it has a rounding, not memory casing effect, resulting in invalid outcome	2018-12-24 18:49:50 +00:00

2023 Commits