OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Chen, Guobing	deaeb6c5b8	Add bfloat16 based dot and conversion with single/double 1. Added bfloat16 based dot as new API: shdot 2. Implemented generic kernel and cooperlake-specific (AVX512-BF16) kernel for shdot 3. Added 4 conversion APIs for bfloat16 data type <=> single/double: shstobf16 shdtobf16 sbf16tos dbf16tod shstobf16 -- convert single float array to bfloat16 array shdtobf16 -- convert double float array to bfloat16 array sbf16tos -- convert bfloat16 array to single float array dbf16tod -- convert bfloat16 array to double float array 4. Implemented generic kernels for all 4 conversion APIs, and cooperlake-specific kernel for shstobf16 and shdtobf16 5. Update level1 thread facilitate functions and macros to support multi-threading for these new APIs 6. Fix Cooperlake platform detection/specify issue when under dynamic-arch building 7. Change the typedef of bfloat16 from unsigned short to more strict uint16_t Signed-off-by: Chen, Guobing <guobing.chen@intel.com>	2020-09-04 02:31:25 +08:00
Martin Kroeker	775a87242d	Rename KERNEL.SILICON to KERNEL.VORTEX	2020-09-03 08:44:20 +02:00
Gengxin Xie	1b0f17eeed	align to 64, using SSE when input size is small	2020-09-03 14:25:54 +08:00
Martin Kroeker	80794fe8fd	Create KERNEL.SILICON	2020-09-02 22:56:58 +02:00
Marius Hillenbrand	2ee5b899ce	s390x: enable S/DGEMM block with explicit loop unrolling + interleaving with clang The code for SGEMM 16x4 and DGEMM 8x4 blocks on z14 and z15 uses explicit unrolling and interleaving to improve performance. The code employs an empty inline asm statement with operands that constrain the compiler's instruction scheduling and thereby enforce proper overlapping of load and compute phases. Fix an ifdef to apply that for clang builds, as well. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-02 13:49:31 +02:00
Marius Hillenbrand	87e5bbd887	s390x: avoid variable-length arrays in struct for asm operands ... since it is not required and clang does not support that gcc extension. Instead, use a variable-length array directly for these operands. Note that, while the actual inline assembly code does not directly use these memory operands, they serve to inform the compiler that it cannot reorder reads or writes to/from the input and output data across the inline asm statements. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-02 13:49:31 +02:00
Marius Hillenbrand	b9b3265ec8	s390x: avoid inline assembly for vector loads for clang ... since clang does not support the instruction format for inline assembly and also it is not required for current versions of clang. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-02 13:49:30 +02:00
Marius Hillenbrand	a1616a0b86	s390x: replace nop with "nop 0" in inline assembly ... as a bandaid for building with clang until LLVM's internal assembler supports nops without operand. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-02 13:49:30 +02:00
Marius Hillenbrand	60ef193258	s390x: use "lghi" for immediate values to fix build with clang Some of the kernels written in assembly utilize a "load address" instruction for loading an immediate value into a register. That is both unnecessarily complex and LLVM's assembler does not understand that specific syntax. Thus, replace with the appropriate "load immediate" instruction, which is also clearer to read. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-02 13:49:30 +02:00
Gengxin Xie	448152cdd8	define __AVX2__ to ensure the haswell code compiled with avx2	2020-08-31 14:39:08 +08:00
Gengxin Xie	cb3c190a3a	Implementaion of dasum, sasum with AVX2 & AVX512 intrinsic	2020-08-31 11:44:08 +08:00
Rajalakshmi Srinivasaraghavan	317ff27cda	POWER10: Avoid setting accumulators to zero in gemm kernels For the first iteration, it is better to use xvfger instead of xvfgerpp builtins which helps to avoid setting accumulators to zero. This helps to reduce few instructions.	2020-08-28 10:42:54 -05:00
Martin Kroeker	b2053239fc	Fix mssing dummy parameter (imag part of alpha) of zdot_thread_function	2020-08-23 15:08:16 +02:00
Martin Kroeker	9ee21a0a39	Merge pull request #2780 from Guobing-Chen/CPL_build_support Enable COOPERLAKE build target	2020-08-20 19:54:29 +02:00
Martin Kroeker	6f4dc7445d	Fix typo	2020-08-19 16:36:55 +02:00
Martin Kroeker	81fbe8d088	-march=cooperlake only available in gcc >= 10	2020-08-19 16:10:15 +02:00
Martin Kroeker	75eeb265d7	[WIP] Refactor the driver code for direct SGEMM (#2782 ) Move "direct SGEMM" functionality out of the SkylakeX SGEMM kernel and make it available (on x86_64 targets only for now) in DYNAMIC_ARCH builds * Add sgemm_direct targets in the kernel Makefile.L3 and CMakeLists.txt * Add direct_sgemm functions to the gotoblas struct in common_param.h * Move sgemm_direct_performant helper to separate file * Update gemm.c to macros for sgemm_direct to support dynamic_arch naming via common_s,h * (Conditionally) add sgemm_direct functions in setparam-ref.c	2020-08-19 14:51:09 +02:00
Chen, Guobing	e740c4873d	Enable COOPERLAKE build target Enable new build target platform -- COOPERLAKE. This target platform supports all the SKYLAKEX supported ISAs + avx512bf16. So all the SKYLAKEX specific kernels/drivers and related code are now extended to be also active on COOPERLAKE. Besides, new BF16 related kernels are active under this target.	2020-08-13 06:18:00 +08:00
Martin Kroeker	cbbe38bb88	Merge pull request #2772 from mhillenibm/s390x_gemm_tuning s390x: GEMM tuning for z14	2020-08-11 18:14:09 +02:00
Marius Hillenbrand	07c334e7be	s390x: Factor out small block sizes for SGEMM/DGEMM on z14 For small register blockings that are too small to fill up vector registers with column vectors, we currently use a generic code block. Replace that with instantiations of the generic code as individual functions, so that the compiler can optimize each one separately. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-08-11 12:56:39 +02:00
Marius Hillenbrand	e2828e30aa	s390x: Optimize SGEMM/DGEMM blocks for z14 with explicit loop unrolling/interleaving Improve performance of SGEMM and DGEMM on z14 and z15 by unrolling and interleaving the inner loop of the SGEMM 16x4 and DGEMM 8x4 blocks. Specifically, we explicitly interleave vector register loads and computation of two iterations. Note that this change only adds one C function, since SGEMM 16x4 and DGEMM 8x4 actually map to the same C code: they both hold intermediate results in a 4x4 grid of vector registers, and the C implementation is built around that. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-08-11 12:55:42 +02:00
Rajalakshmi Srinivasaraghavan	475b5c95b9	Remove extra symbol in Makefile While trying out different unroll values, noted that make failed due to this extra symbol.	2020-08-07 15:27:44 -05:00
Martin Kroeker	81dcfdcf39	Multiply by 2 instead of left-shifting a potentially negative number fixes GCC ubsan warning in the BLAS tests	2020-08-02 18:29:56 +02:00
Martin Kroeker	0ef4b3f1f2	Multiply instead of doing a left shift of a potentially negative number fixes GCC ubsan report in the BLAS tests	2020-08-02 18:27:40 +02:00
Martin Kroeker	aa53a8a5cb	Multiply by two instead of left-shifting one place fixes GCC ubsan report of "left shift of negative value -2" in the BLAS tests	2020-08-02 18:25:09 +02:00
Martin Kroeker	aa3a1e7d8c	Multiply by two rather than left shift by one place fixes GCC ubsan report of "left shift of negative value -2" in the BLAS tests	2020-08-02 18:22:31 +02:00
Rajalakshmi Srinivasaraghavan	f77b6a83f4	dgemv optimization for POWER10 Making use of new vector pair POWER10 instructions in dgemv_n and dgemv_t. Also adding a new block 4x128 to make use of Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1. Tested on simulator and there are no new test failures.	2020-07-29 18:59:32 -05:00
Rajalakshmi Srinivasaraghavan	d557584b71	Fix compilation issues with clang on POWER As gcc defaults to -malign-power, removing that option. Also adding -fno-integrated-as to use GNU assembler for powerpc assembly optimization files. Fixed other compilation errors reported in dgemv_t.c file.	2020-07-27 14:11:07 -05:00
Ashwin Sekhar T K	4e1be0e481	ARM64: Add THUNDERX3T110 Target	2020-07-26 23:32:24 -07:00
Rajalakshmi Srinivasaraghavan	9be2688c78	Fix to store results in correct order for POWER10 GEMM kernels There is a recent compiler change in __builtin_mma_disassemble_acc() which affects the order of storing result in POWER10. Also removing new LDFLAG -mno-power10-stub as it is handled by linker automatically.	2020-07-24 23:08:11 -05:00
Martin Kroeker	6a2a60038c	Merge pull request #2720 from martin-frbg/issue2694 WIP Further fixes for 32bit POWER8	2020-07-24 23:19:45 +02:00
Martin Kroeker	251a09ec90	Typo fix	2020-07-24 16:04:58 +00:00
Martin Kroeker	95d37e1575	Regroup the 32 and 64bit sections and restore 64bit CAXPY	2020-07-24 10:13:46 +00:00
Martin Kroeker	3523bb778e	Merge pull request #2721 from martin-frbg/p8align Fix alignment errors in the power8 saxpy kernel	2020-07-24 11:06:20 +02:00
Martin Kroeker	bf1f0734ff	Use OPENBLAS_MAKE_COMPLEX_FLOAT on PPC only	2020-07-23 20:40:13 +00:00
Martin Kroeker	ca3561cab9	Add ifdefs around call to altivec microkernel	2020-07-23 18:30:42 +00:00
Martin Kroeker	21072e502a	Typo fix	2020-07-23 17:34:56 +00:00
Martin Kroeker	7c6e56b5df	Rewrite assignment to complex for better portability	2020-07-23 17:10:59 +02:00
Martin Kroeker	661c6bfa5a	Exclude altivec code paths if the compiler does not support them	2020-07-23 17:08:20 +02:00
Martin Kroeker	0033f8be0d	Use vec_vsx_ld/st to fix misaligned accesses flagged by asan	2020-07-16 23:32:54 +02:00
Martin Kroeker	f308e741b2	remove debug output and revert changes to cdot and crot	2020-07-15 10:00:07 +02:00
Martin Kroeker	da17abec87	fix trailing whitespace	2020-07-14 18:20:03 +02:00
Martin Kroeker	f8c2697701	Use POWER6 GEMM, TRMM and DTRSM on 32bit POWER8	2020-07-14 18:11:19 +02:00
Martin Kroeker	b144423f0f	Do not define USE_TRMM for 32bit POWER8	2020-07-14 18:10:12 +02:00
Martin Kroeker	ed7e155c35	Merge branch 'develop' into aix	2020-07-07 18:52:06 +02:00
EGuesnet	634e1305f9	Update cgemm_kernel_8x4_power8.S	2020-06-30 15:16:39 +02:00
Martin Kroeker	28d69e0097	Merge pull request #2687 from martin-frbg/utfbom Strip UTF8 byte order marker from source files	2020-06-26 22:53:09 +02:00
Martin Kroeker	c2467c9619	Merge pull request #2686 from RajalakshmiSR/p10_shgemm powerpc: Optimized SHGEMM kernel for POWER10	2020-06-26 22:52:45 +02:00
Martin Kroeker	d199c2787d	Merge pull request #2680 from kavanabhat/aix_makefile_fix Fix for #2671	2020-06-26 11:27:28 +02:00
Martin Kroeker	e30ad0e521	Strip UTF8 byte order marker from source	2020-06-26 09:00:43 +02:00
Rajalakshmi Srinivasaraghavan	d23419accc	powerpc: Optimized SHGEMM kernel for POWER10 This patch introduces new optimized version of SHGEMM kernel using power10 Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1. This patch makes use of new POWER10 compute instructions for matrix multiplication operation. Tested on simulator and there are no new test failures.	2020-06-25 22:19:08 -05:00
Martin Kroeker	c854ef5471	Fix variable names in conditional	2020-06-25 13:29:52 +02:00
Martin Kroeker	c0afc11742	Fix POWERPC builds on AIX (gcc/gfortran 7) 1. macro preprocessing for POWER8 and later kernels only 2. default buffer size used by AIX version of m4 is too small	2020-06-25 13:12:36 +02:00
Gordon Fossum	bb2f52844b	powerpc: Optimized ZGEMM kernel for POWER10 This patch introduces new optimized version of ZGEMM kernel using power10 Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1. This patch makes use of new POWER10 compute instructions for matrix multiplication operation. Tested on simulator and there are no new test failures. Cycles count reduced by 30-50% compared to POWER9 version depending on M/N/K sizes.	2020-06-24 14:50:12 -05:00
Rajalakshmi Srinivasaraghavan	571eadb880	powerpc: Optimized SGEMM/DGEMM/CGEMM for POWER10 This patch introduces new optimized version of SGEMM, CGEMM and DGEMM using power10 Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1. This patch makes use of new POWER10 compute instructions for matrix multiplication operation. Tested on simulator and there are no new test failures. Cycles count reduced by 30-50% compared to POWER9 version depending on M/N/K sizes. MMA GCC patch for reference: https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=8ee2640bfdc62f835ec9740278f948034bc7d9f1	2020-06-24 14:48:15 -05:00
Kavana Bhat	df4ade070f	Fix for #2671	2020-06-24 04:25:47 -05:00
Martin Kroeker	93592d1260	Merge pull request #2675 from wjc404/develop AVX512 DGEMM TCOPY_16 Function	2020-06-23 09:29:02 +02:00
wjc404	086d87a302	AVX512 dgemm tcopy_16 function	2020-06-20 00:07:43 +08:00
Rajalakshmi Srinivasaraghavan	9fe930f205	powerpc: Add support for future processor This is the initial patch to support build infrastructure for POWER10 architecture.	2020-06-11 15:47:20 -05:00
ZhangDanfeng	bc6fd20a40	fix INIT8x4 Signed-off-by: ZhangDanfeng <467688405@qq.com>	2020-06-10 01:01:16 +08:00
Martin Kroeker	89091e6b64	Merge pull request #2645 from martin-frbg/misc_fixes Miscellaneous fixes	2020-06-07 19:44:50 +02:00
Martin Kroeker	c3574ffe53	Merge pull request #2646 from wjc404/develop Optimize AVX512 parallel DGEMM performance	2020-06-07 13:18:22 +02:00
wjc404	0e3ac4a06b	Add files via upload	2020-06-06 14:56:57 +08:00
Martin Kroeker	7f60fb6b91	Delete spurious copy of common_param.h	2020-06-05 10:04:16 +02:00
ZhangDanfeng	9b7877ccf1	sgemm copy source init Signed-off-by: ZhangDanfeng <467688405@qq.com>	2020-06-04 02:10:45 +08:00
ZhangDanfeng	f82fa802d1	Insert prefetch Signed-off-by: ZhangDanfeng <467688405@qq.com>	2020-06-04 02:08:48 +08:00
Martin Kroeker	b1ee81228a	Change complex DOT and ROT to generic kernels and switch CGEMM in response to test failures seen in #2628 and BLAS-Tester	2020-06-03 09:13:29 +02:00
张丹枫	9df79ae9a3	update sgemm and strmm kernel selecting strategy	2020-05-20 22:26:58 +08:00
张丹枫	a1fc6041cd	use general register to speedup	2020-05-20 22:26:58 +08:00
张丹枫	edb423d772	align general register using to strmm_kernel_8x8	2020-05-20 22:26:58 +08:00
zhangdanfeng	0e6eb8c247	sgemm kernel use sgemm_kernel_8x8_cortexa53 Signed-off-by: zhangdanfeng <zhangdanfeng@cloudwalk.cn>	2020-05-20 22:26:58 +08:00
zhangdanfeng	d475db29c6	optimized for cortex-a53 Signed-off-by: zhangdanfeng <zhangdanfeng@cloudwalk.cn>	2020-05-20 22:26:58 +08:00
Marius Hillenbrand	89fe17f20e	s390x: Use new sgemm kernel also for DGEMM and DTRMM on Z14 Apply our new GEMM kernel implementation, written in C with vector intrinsics, also for DGEMM and DTRMM on Z14 and newer (i.e., architectures with FP32 SIMD instructions). As a result, we gain around 10% in performance on z15, in addition to improving maintainability. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-20 10:23:35 +02:00
Marius Hillenbrand	bdd795ed03	s390x/GEMM: replace 0-init with peeled first iteration ... since it gains another ~2% of SGEMM and DGEMM performance on z15; also, the code just called for that cleanup. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-20 10:23:35 +02:00
Marius Hillenbrand	2840432e49	s390x: improvise vector alignment hints for older compilers Introduce inline assembly so that we can employ vector loads with alignment hints on older compilers (pre gcc-9), since these are still used in distributions such as RHEL 8 and Ubuntu 18.04 LTS. Informing the hardware about alignment can speed up vector loads. For that purpose, we can encode hints about 8-byte or 16-byte alignment of the memory operand into the opcodes. gcc-9 and newer automatically emit such hints, where applicable. Add a bit of inline assembly that achieves the same for older compilers. Since an older binutils may not know about the additional operand for the hints, we explicitly encode the opcode in hex. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-14 15:36:03 +02:00
Marius Hillenbrand	1b0b4349a1	s390x/Z14: Change register blocking for SGEMM to 16x4 Change register blocking for SGEMM (and STRMM) on z14 from 8x4 to 16x4 by adjusting SGEMM_DEFAULT_UNROLL_M and choosing the appropriate copy implementations. Actually make KERNEL.Z14 more flexible, so that the change in param.h suffices. As a result, performance for SGEMM improves by around 30% on z15. On z14, FP SIMD instructions can operate on float-sized scalars in vector registers, while z13 could do that for double-sized scalars only. Thus, we can double the amount of elements of C that are held in registers in an SGEMM kernel. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 15:59:51 +02:00
Marius Hillenbrand	71b6eaf459	s390x: Use new sgemm kernel also for strmm on Z14 and newer Employ the newly added GEMM kernel also for STRMM on Z14. The implementation in C with vector intrinsics exploits FP32 SIMD operations and thereby gains performance over the existing assembly code. Extend the implementation for handling triangular matrix multiplication, accordingly. As added benefit, the more flexible C code enables us to adjust register blocking in the subsequent commit. Tested via make -C test / ctest / utest and by a couple of additional unit tests that exercise blocking. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 15:59:51 +02:00
Marius Hillenbrand	43c0d4f312	s390x: Add vectorized sgemm kernel for Z14 and newer Add a new GEMM kernel implementation to exploit the FP32 SIMD operations introduced with z14 and employ it for SGEMM on z14 and newer architectures. The SIMD extensions introduced with z13 support operations on double-sized scalars in vector registers. Thus, the existing SGEMM code would extend floats to doubles before operating on them. z14 extended SIMD support to operations on 32-bit floats. By employing these instructions, we can operate on twice the number of scalars per instruction (four floats in each vector registers) and avoid the conversion operations. The code is written in C with explicit vectorization. In experiments, this kernel improves performance on z14 and z15 by around 2x over the current implementation in assembly. The flexibilty of the C code paves the way for adjustments in subsequent commits. Tested via make -C test / ctest / utest and by a couple of additional unit tests that exercise blocking (e.g., partial register blocks with fewer than UNROLL_M rows and/or fewer than UNROLL_N columns). Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 15:59:51 +02:00
Martin Kroeker	2271c3506b	Work around excessive LAPACK test failures on Skylake-X Something in the plain C parts of x86_64 cscal.c and zscal.c appears to be miscompiled by both gfortran9 and ifort when compiling for skylakex-avx512, even when the optimized Haswell microkernel is not in use.	2020-05-09 23:49:18 +02:00
Rajalakshmi Srinivasaraghavan	bd9ff820bc	Fix cmake compilation issue - POWER9 This patch removes extra space in the sgemmotcopy filename thereby allowing it to create entry in kernel/Makefile created by cmake.	2020-05-08 20:31:56 -05:00
Ashwin Sekhar T K	8353cb245a	ARM64: Improve DAXPY for ThunderX2 Improve performance of DAXPY for ThunderX2 when the vector fits in L1 Cache.	2020-05-07 09:22:50 -07:00
Martin Kroeker	90dba9f716	Duplicate earlier Clang 9.0.0 workaround for corresponding Apple Clang version As discussed on the original PR #2329, the "Apple Clang 11.0.3" that appears to be based the same LLVM release produces the same miscompilation of this file.	2020-05-05 10:44:50 +02:00
Martin Kroeker	5dd14e3d48	Make building the bfloat16 functions conditional on option BUILD_HALF (#2590 ) * make building the bfloat16 BLAS functions conditional on BUILD_HALF * pass the BUILD_HALF option to gensymbol * Pass BUILD_HALF as a compiler define for dynamic_arch builds	2020-05-01 09:58:30 +02:00
Martin Kroeker	06208c8d01	Limit this fix to ELFv2 builds	2020-04-22 14:16:40 +02:00
Martin Kroeker	f5c4c28b98	Work around POWER8BE bugs on FreeBSD (ELFv2) for #2299	2020-04-21 17:17:17 +02:00
Martin Kroeker	fa42588e1f	Merge pull request #2565 from martin-frbg/mips24k Support MIPS32 24K family as P5600	2020-04-20 17:13:53 +02:00
Martin Kroeker	e55ec82bb9	Delete KERNEL.1004K	2020-04-19 15:44:30 +02:00
Martin Kroeker	7353ea5afc	Delete KERNEL.24K	2020-04-19 15:44:19 +02:00
Martin Kroeker	6a04efb122	Rename KERNEL files to include MIPS prefix	2020-04-19 15:43:54 +02:00
Martin Kroeker	d712ea724c	Add MIPS24K support	2020-04-18 21:10:18 +02:00
Rajalakshmi Srinivasaraghavan	22bb50fb81	cmake fixes	2020-04-17 13:35:17 -05:00
Rajalakshmi Srinivasaraghavan	67cc4b9e16	Fix warnings in clang and export symbol	2020-04-15 19:15:23 -05:00
Rajalakshmi Srinivasaraghavan	a87793e03c	Fix DYNAMIC_ARCH compilation errors	2020-04-15 09:09:50 -05:00
Rajalakshmi Srinivasaraghavan	ff010f496e	Build shgemm for all architecture	2020-04-14 20:38:53 -05:00
Rajalakshmi Srinivasaraghavan	7eb55504b1	RFC : Add half precision gemm for bfloat16 in OpenBLAS This patch adds support for bfloat16 data type matrix multiplication kernel. For architectures that don't support bfloat16, it is defined as unsigned short (2 bytes). Default unroll sizes can be changed as per architecture as done for SGEMM and for now 8 and 4 are used for M and N. Size of ncopy/tcopy can be changed as per architecture requirement and for now, size 2 is used. Added shgemm in kernel/power/KERNEL.POWER9 and tested in powerpc64le and powerpc64. For reference, added a small test compare_sgemm_shgemm.c to compare sgemm and shgemm output. This patch does not cover OpenBLAS test, benchmark and lapack tests for shgemm. Complex type implementation can be discussed and added once this is approved.	2020-04-14 14:55:08 -05:00
Martin Kroeker	5b0093b5fe	Convert aligned moves to unaligned should have no performance impact on reasonably modern cpus and fixes occasional crashes in actual user code.	2020-04-13 14:58:52 +02:00
Martin Kroeker	e9bfa2291a	Fix parameter overflow	2020-04-12 19:47:02 +02:00
gxw	8d07cf9b67	Fix compilation problem on loongson platform Using "make TARGET=GENERIC" on loongson platform will get the following error messages: "make[1]: *** No rule to make target 'sgemm_incopy.o', needed by 'libs'" Add kernel/mips64/KERNEL.generic to slove the problem.	2020-04-09 19:28:15 +08:00
Martin Kroeker	806f89166e	Make ARMV7 compile with xcode and add a CI job for it (#2537 ) * Add an ARMV7 iOS build on Travis * thread_local appears to be unavailable on ARMV7 iOS * Add no-thumb option for ARMV7 IOS build to get it to accept DMB ISH * Make local labels in macros of nrm2_vfpv3.S compatible with the xcode assembler	2020-04-02 10:30:37 +02:00
Martin Kroeker	c6af9bbb32	Merge pull request #2534 from martin-frbg/issue2496 Fix zero initialization for beta=0 case	2020-03-31 20:53:13 +02:00
Martin Kroeker	144be81ca1	fix initialization to zero in the NEON SGEMM_BETA kernel as well	2020-03-31 16:53:56 +02:00
Martin Kroeker	07cdd5d05c	Fix zero initialization for beta=0 case use immediate initialization instead of multiplication in case register content is a NaN	2020-03-31 00:21:02 +02:00
Martin Kroeker	567d2760e6	Merge pull request #2520 from wjc404/develop Fix avx512 sgemm performance bug when ldc is a multiple of 1024	2020-03-30 20:15:59 +02:00
wjc404	b8307768e2	Add files via upload	2020-03-21 05:42:10 +08:00
Martin Kroeker	af8a619e1f	Merge pull request #2517 from wjc404/develop Temporary fix for SKX STRSM	2020-03-17 10:12:53 +01:00
wjc404	62b9608986	Update KERNEL.SKYLAKEX	2020-03-17 12:52:55 +08:00
Martin Kroeker	a1b181cea2	Merge pull request #2516 from wjc404/develop AVX2 STRSM kernels	2020-03-16 21:58:34 +01:00
wjc404	cdc0e9011e	Update KERNEL.ZEN	2020-03-16 16:39:37 +00:00
wjc404	fa049d49c2	AVX2 STRSM kernel	2020-03-17 00:34:08 +08:00
s00548429	bec7923a0d	Fix the functional bugs for zamax.	2020-03-09 15:36:50 +08:00
Rajalakshmi Srinivasaraghavan	2afc074803	Fix DYNAMIC_ARCH build for POWER9 Setting DYNAMIC_ARCH=1 on POWER9 does not build POWER9 files due to some compiler version checks. This patch fixes some of the macros that are used to check compiler version. On fixing those checks, there are some new make failures related to icamin, icamax, isamin, isamax and caxpy files on POWER9. This patch fixes those failures as well.	2020-03-03 12:35:10 -06:00
Martin Kroeker	4f371b0fbf	Use POWER8 kernels on big-endian POWER9 for now	2020-03-01 23:45:58 +01:00
Martin Kroeker	ea8eec5d17	Merge pull request #2422 from wjc404/develop Adjust SkylakeX GEMM3M parameters, add an AVX512 STRMM kernel and fix performance bugs in AVX2 s/c/z GEMM	2020-02-29 19:07:35 +01:00
Ali Saidi	c623a965f9	Add Neoverse-N1 core The implementation is a hybird of the ARMV8 one with some of the improved TX2 rountines along with specifying -march=v8.2-a	2020-02-29 03:22:04 +00:00
wjc404	dd22eb7621	Update cgemm_kernel_8x2_haswell.c	2020-02-27 22:26:15 +08:00
wjc404	2352331e60	Update zgemm_kernel_4x2_haswell.c	2020-02-27 22:25:19 +08:00
Xianyi Zhang	265ab484c8	Change default RISC-V 64-bit corename to RISCV64_GENERIC e.g. make CC=riscv64-unknown-linux-gnu-gcc FC=riscv64-unknown-linux-gnu-gfortran TARGET=RISCV64_GENERIC HOSTCC=gcc	2020-02-27 14:46:15 +08:00
Xianyi Zhang	44020a42a4	Fixed compile bug for RV64.	2020-02-27 14:29:42 +08:00
Xianyi Zhang	4aa2d89217	Merge branch 'develop' into risc-v	2020-02-27 13:53:49 +08:00
wjc404	1b980001dd	Update zgemm_kernel_4x2_haswell.c	2020-02-26 18:38:12 +08:00
wjc404	2515e1152f	Update cgemm_kernel_8x2_haswell.c	2020-02-26 18:36:54 +08:00
Martin Kroeker	ddcbed6690	Merge pull request #2437 from martin-frbg/issue2434 [WIP] Add support for Ampere EMAG8180 ARMV8 cpu	2020-02-25 18:42:52 +01:00
wjc404	903854c168	Add files via upload	2020-02-22 23:40:02 +08:00
wjc404	a2ff577a30	Update KERNEL.ZEN	2020-02-22 23:39:43 +08:00
wjc404	97a32cb0a5	Update KERNEL.HASWELL	2020-02-22 23:39:20 +08:00
Martin Kroeker	07454bf4d5	Add proper defaults for IxMIN/IxMAX kernels the fallbacks from Makefile.L1 assume a combined source for absolute value and non-absolute (with ifdef USE_ABS) but here we have separate implementations	2020-02-21 11:58:15 +01:00
Martin Kroeker	4046985913	Add proper defaults for IxMIN/IxMAX kernels the fallbacks from Makefile.L1 assume a combined source for absolute value and non-absolute (with ifdef USE_ABS) but here we have separate implementations	2020-02-21 11:55:52 +01:00
Martin Kroeker	e57b11acca	Add preliminary support for EMAG8180	2020-02-19 19:00:28 +01:00
Martin Kroeker	0b39cf95b0	Fix endianness conditionals	2020-02-19 18:09:54 +01:00
Martin Kroeker	9f39f0a2c3	Specify ismin/ismax assembly kernels for POWER8 directly to fix utest failure in new ismin test - Makefile.L1 defaults look wrong	2020-02-17 19:55:39 +01:00
Martin Liska	aeea14ee40	Come up with LOAD_AND_COMPARE_TO_MXX macro in iamax_sse.S.	2020-02-17 09:01:53 +01:00
Martin Liska	18bcc36a69	Fix implementation of iamax_sse.S as reported in #2116 . The was a typo in iamax_sse.S where one of the comparison was cmpeqps instead of cmpeqss. That misdetected index for sequences where the minimum value was 0.	2020-02-17 09:01:53 +01:00
Martin Liska	0e7f43c898	Add missing USE_MIN in kernel/CMakeLists.txt.	2020-02-17 09:01:53 +01:00
wjc404	f566787e6e	Update KERNEL.SKYLAKEX	2020-02-16 22:58:44 +08:00
wjc404	e3368cbf18	AVX512 STRMM kernel	2020-02-16 22:58:00 +08:00
Martin Kroeker	cafdd999b8	Update caxpy_power8.S	2020-02-13 22:44:09 +01:00
Martin Kroeker	92ca92a46c	Update caxpy_power8.S	2020-02-13 21:24:54 +01:00
Martin Kroeker	486c35c5dc	Update icamin_power8.S	2020-02-13 18:38:43 +01:00
Martin Kroeker	5ba3699f41	Update isamin_power8.S	2020-02-13 00:00:32 +01:00
Martin Kroeker	8eefa530cd	Update isamax_power8.S	2020-02-12 23:59:50 +01:00
Martin Kroeker	de40d47edf	Update isamin_power8.S	2020-02-12 23:57:48 +01:00
Martin Kroeker	7c162b8a21	Update isamax_power8.S	2020-02-12 23:56:57 +01:00
Martin Kroeker	0544cbc806	Fix syntax of endianness conditional	2020-02-12 20:00:29 +01:00
Martin Kroeker	120d20731f	Fix syntax of endianness conditional	2020-02-12 19:58:42 +01:00
Martin Kroeker	dc345d84df	Fix syntax of endianness conditional and add gcc version check for workaround	2020-02-12 19:56:52 +01:00
Bart Oldeman	7ea5e07d1c	Fix inline asm in dscal: mark x, x1 as clobbered. Fixes #2408 The leaq instructions in dscal_kernel_inc_8 modify x and x1 so they must be declared as input/output constraints, otherwise the compiler may assume the corresponding registers are not modified.	2020-02-12 14:11:44 +00:00
Martin Kroeker	7e5cbb6f35	Fix bad conditional syntax that caused spurious application of USE_TRMM	2020-02-10 21:17:39 +01:00
wjc404	3447d04eaf	Update dgemm_kernel_16x2_skylakex.c	2020-02-06 02:14:10 +00:00
wjc404	8b5cdcc64c	Update sgemm_kernel_8x4_haswell.c	2020-02-06 01:47:46 +00:00
wjc404	4e00d96a78	Update dgemm_kernel_16x2_skylakex.c	2020-02-06 01:46:36 +00:00
wjc404	096da2f51a	Update dgemm_kernel_16x2_skylakex.c	2020-02-05 13:36:57 +08:00
wjc404	081b188529	Update KERNEL.SKYLAKEX	2020-02-03 21:38:08 +08:00
wjc404	8019e70211	AVX512 16x2 DGEMM kernel	2020-02-03 21:32:56 +08:00
Qiyu8	ff42e68652	Optimize genenal Gemm Beta	2020-01-20 11:49:42 +08:00
Martin Kroeker	70f45749b9	Merge pull request #2367 from wjc404/develop Improve paralleled SGEMM performance on SKYLAKEX CPUs	2020-01-15 21:13:43 +01:00
wjc404	e5dcdeb550	Update sgemm_direct_skylakex.c	2020-01-13 16:59:23 +08:00
wjc404	952cc2ba38	Update sgemm_kernel_16x4_skylakex_2.c	2020-01-13 16:58:54 +08:00
wjc404	feaafbedd3	make skylakex sgemm code more friendly for readers BTW some kernels were adjusted to improve performance	2020-01-13 16:28:41 +08:00
Martin Kroeker	b36018be6d	Merge pull request #2365 from wjc404/develop Fix SKYLAKEX STRMM issues	2020-01-09 23:23:09 +01:00
wjc404	3a100b2797	Update KERNEL.SKYLAKEX	2020-01-09 13:48:41 +08:00
Martin Kroeker	38742d5547	Merge pull request #2361 from wjc404/develop Optimize AVX2 SGEMM & STRMM	2020-01-08 16:20:28 +01:00
wjc404	bd4c032f52	Update sgemm_kernel_8x4_haswell.c	2020-01-07 11:22:46 +08:00
wjc404	9dc9b7b95e	Update sgemm_kernel_8x4_haswell.c	2020-01-06 20:11:36 +08:00
wjc404	92b10212de	optimize AVX2 SGEMM	2020-01-06 12:11:21 +08:00
wjc404	b73bf01378	optimize AVX2 SGEMM	2020-01-06 12:09:14 +08:00
wjc404	eb3c9f1db9	optimize AVX2 SGEMM	2020-01-06 12:07:02 +08:00
Martin Kroeker	456ee2e1f0	Merge pull request #2357 from chenxuqiang/dgemm_beta_zero kernel/arm64/dgemm_beta.S: add beta == zero branch	2020-01-02 22:28:36 +01:00
shengyang	80db5f11e1	update	2020-01-02 11:01:57 +08:00
chenxuqiang	52de4cc8fd	kernel/arm64/dgemm_beta.S: add beta == zero branch added beta == zero branch, and no need to load C matrix. Signed by: Xuqiang Chen <chenxuqiang3@hisilicon.com>	2020-01-01 21:50:45 -05:00
Martin Kroeker	44028581cc	Merge pull request #2355 from Zeyiii/dev-zeyi2 Use arm neon instructions to optimize sgemm_beta operation	2020-01-01 22:14:16 +01:00
Martin Kroeker	86ab939936	Merge pull request #2354 from ZuoQ3/develop [WIP] Use arm neon instructions to optimize tcopy operation	2020-01-01 22:13:37 +01:00
Martin Kroeker	6c85cb1869	Merge pull request #2352 from wjc404/develop AVX2 ZGEMM3M kernel	2019-12-31 18:08:10 +01:00
Martin Kroeker	995768bbc5	Merge pull request #2351 from Zeyiii/develop prefetching for dgemm_beta	2019-12-31 18:07:37 +01:00
int_13h	96ad579428	add in runtime cpu detection for zarch (#2349 ) add in runtime cpu detection for zarch	2019-12-31 18:03:27 +01:00
shengyang	8d84403205	Use arm neon instructions to optimize ncopy operation modified: KERNEL.ARMV8 modified: KERNEL.TSV110 new file: sgemm_ncopy_4.S	2019-12-31 17:06:35 +08:00
w00421467	0833a4846a	Use arm neon instructions to optimize sgemm_beta operation	2019-12-31 10:42:03 +08:00
zq	50f7fc1401	[WIP] Use arm neon instructions to optimize tcopy operation	2019-12-31 10:21:23 +08:00
w00421467	d1b53806be	Merge remote-tracking branch 'pub/develop' into develop	2019-12-31 10:13:24 +08:00
wjc404	a0f0a802fc	Update zgemm3m_kernel_4x4_haswell.c	2019-12-30 17:33:42 +08:00
wjc404	700fe5b5ee	Add files via upload	2019-12-30 17:18:59 +08:00
wjc404	f60840c420	Update KERNEL.ZEN	2019-12-30 16:04:23 +08:00
wjc404	109e18cd96	Update KERNEL.HASWELL	2019-12-30 16:03:24 +08:00
wjc404	ae1579be13	Create zgemm3m_kernel_4x4_haswell.c	2019-12-30 16:02:51 +08:00
w00421467	3ccf8885ac	prefetching for dgemm_beta	2019-12-30 11:45:49 +08:00
wjc404	cd765f094b	Update cgemm3m_kernel_8x4_haswell.c	2019-12-27 18:23:29 +08:00
wjc404	3a66c8cac1	Update KERNEL.ZEN	2019-12-27 18:04:08 +08:00
wjc404	ed9af2f7da	Update KERNEL.HASWELL	2019-12-27 18:01:38 +08:00
wjc404	5fd1edead9	Create cgemm3m_kernel_8x4_haswell.c	2019-12-27 18:00:55 +08:00
wjc404	eeecd623d8	Update cgemm_kernel_8x2_haswell.c	2019-12-24 00:40:16 +08:00
wjc404	2cd9306bb5	Update KERNEL.ZEN	2019-12-23 23:42:30 +08:00
wjc404	c418c81224	Update KERNEL.HASWELL	2019-12-23 23:41:44 +08:00
wjc404	025741f16a	Fast Haswell CGEMM kernel	2019-12-23 23:40:03 +08:00
wjc404	f41d52665d	Fast Haswell ZGEMM kernel	2019-12-21 14:37:06 +08:00
wjc404	d573d24de7	Fast Haswell ZGEMM kernel	2019-12-21 14:35:15 +08:00
w00421467	b7cc69ee62	declare DGEMM_BETA in KERNEL.ARMV8 rather than the generic KERNEL	2019-12-20 10:11:50 +08:00
w00421467	aeef942c4f	use arm neon instructions to optimize gemm beta operation	2019-12-17 10:00:13 +08:00
Martin Kroeker	1a6ea8ee6d	Merge pull request #2338 from kavanabhat/aix_mod Changes to build on AIX in POWER8 mode	2019-12-09 17:54:49 +01:00
Kavana Bhat	6baa9b07d7	AIX changes for Power8	2019-12-06 04:33:32 -06:00
Kavana Bhat	3938e59569	AIX changes for Power8	2019-12-04 00:23:46 -06:00
Isuru Fernando	b863b32ac5	Workaround an ICE in clang 9.0.0 This bug is not there in 8.x nor in the 9.0 daily snapshot.	2019-12-01 12:59:46 -06:00
Martin Kroeker	dd04143d4a	Merge pull request #2328 from martin-frbg/ppc9 Fix precompiled kernels on POWER9 and make their use conditional on (old) gcc version	2019-11-30 12:23:57 +01:00
Martin Kroeker	f3a6164bff	Merge pull request #2324 from antonblanchard/power9_segv Fix SEGV in cdot_power9	2019-11-30 00:03:42 +01:00
Martin Kroeker	dedd822d1a	Fix caxpy/caxpyc naming in localentry	2019-11-29 23:56:57 +01:00
Martin Kroeker	2181fb7047	Fix caxpy/caxpyc naming in localentry	2019-11-29 23:54:15 +01:00
Martin Kroeker	a9b62c03f8	Substitute precompiled gcc7 codes only when gcc is older than 9.x	2019-11-29 23:49:50 +01:00
Martin Kroeker	97762234f9	Add variable for gcc >=9 test used in KERNEL.POWER9	2019-11-29 23:47:23 +01:00
wjc404	934e601e93	Update dgemm_kernel_4x8_skylakex_2.c	2019-11-28 19:56:35 +08:00
Anton Blanchard	cf2a8e410c	Fix SEGV in cdot_power9 We were corrupting r2 because the local entry wasn't being setup correctly.	2019-11-26 21:55:04 -07:00
wjc404	eb1e9c8c92	some optimizations	2019-11-26 14:12:20 +08:00
Andreas Arnez	d117dfd505	Change bad usage of "asum" to "sum" in ZARCH versions of ?sum The ZARCH implementations of ?sum contain a cut & paste-error: An inline assembly argument is named "sum", but the assembly references "asum" instead. The mismatch causes a build error. This is fixed.	2019-11-21 13:49:13 +01:00
Martin Kroeker	b09b5be0a4	Merge pull request #2315 from ewanglong/develop revised fix windows compatible for #2313	2019-11-21 05:06:44 +01:00
Wang, Long	bfb5fbdb4d	revised fix windows compatible for #2313 Signed-off-by: Wang, Long <long1.wang@intel.com>	2019-11-21 10:22:58 +08:00
Martin Kroeker	08fa83aba2	Merge pull request #2312 from martin-frbg/power8be Further Power8 big-endian corrections	2019-11-20 15:12:06 +01:00
Wang, Long	1191db1a49	For the sake of windows compatible, used "unsigned long long" to ensure 64-bit length Signed-off-by: Wang, Long <long1.wang@intel.com>	2019-11-20 21:30:47 +08:00
Wang, Long	0caf1434c9	Fix the integer overflow issue for large matrix size For large matrix, e.g. M=N=K, and M>1290, int mnk=MNK will overflow. This will lead to wrong branching to single-threading. The performance is downgraded significantly. Signed-off-by: Wang, Long <long1.wang@intel.com>	2019-11-20 14:11:17 +08:00
Martin Kroeker	cad0d150db	Define alternate kernels for big-endian POWER8	2019-11-17 23:12:10 +01:00
Martin Kroeker	eba0aeb7cd	Fix compilation for big-endian POWER8	2019-11-17 22:58:32 +01:00
Martin Kroeker	0c07c356c1	Define alternate kernels for big-endian PPC440	2019-11-17 19:25:08 +01:00
Martin Kroeker	3e67017ac8	Merge pull request #2309 from martin-frbg/ppc970-be Fix PPC970 big-endian support	2019-11-17 18:22:24 +01:00
Martin Kroeker	b3ac6ee222	Define alternate kernels for big-endian PPC970 The altivec versions of SGEMM and CGEMM fail most test in LAPACK-TESTING when compiled for big endian, STRSM/CTRSM even cause segfaults. The rot kernels either fail the corresponding utest or lead to failures in LAPACK-TESTING.	2019-11-17 15:19:39 +01:00
Martin Kroeker	71e96163db	Merge pull request #2305 from wjc404/develop AVX512 CGEMM & ZGEMM kernels	2019-11-12 07:38:37 +01:00
wjc404	819e852ae7	AVX512 CGEMM & ZGEMM kernels 96-99% 1-thread performance of MKL2018	2019-11-11 20:04:52 +08:00
Martin Kroeker	4c6a457358	Merge pull request #2300 from wjc404/develop Optimize SGEMM on SKYLAKEX CPUs	2019-11-06 07:27:33 +01:00
wjc404	836c414e22	optimizations of software prefetching	2019-11-05 13:36:56 +08:00
Martin Kroeker	3cd97f1a80	Merge pull request #2301 from martin-frbg/ppc8be Disable IDAMIN/MAX and IZAMIN/MAX optimizations on big-endian POWER8	2019-11-04 22:54:28 +01:00
wjc404	430c11e135	Add files via upload	2019-11-04 20:10:12 +08:00
wjc404	fbacd2605d	optimizations via software prefetches	2019-11-04 19:37:19 +08:00
Martin Kroeker	68597002ea	The assembly microkernel is not safe to use on ELFv1	2019-11-03 22:42:46 +01:00
Martin Kroeker	d2a6285549	The assembly microkernel is not safe to use on ELFv1	2019-11-03 22:41:19 +01:00
Martin Kroeker	d999688d1a	The assembly microkernel is not safe to use on ELFv1	2019-11-03 22:39:06 +01:00
Martin Kroeker	928fe1b28e	The assembly microkernel is not safe to use on ELFv1	2019-11-03 22:37:27 +01:00
wjc404	1df9a2013d	new sgemm kernel for skylakex	2019-11-02 00:00:48 +08:00
Martin Kroeker	85ccdce8c4	Remove the IOS fallbacks to generic C kernels	2019-10-25 23:02:37 +02:00
wjc404	6ff013bae0	native support for icopy_4 90% MKL 1-thread performance.	2019-10-19 03:54:44 +08:00
wjc404	0d669e04bb	Update dgemm_kernel_8x8_skylakex.c	2019-10-18 15:00:17 +08:00
wjc404	17cdd9f9e1	some correction	2019-10-18 14:58:07 +08:00
wjc404	6bcb06fcb1	make further changes to icopy_8 easier	2019-10-18 10:47:31 +08:00
wjc404	b7315f8401	Add files via upload	2019-10-16 19:23:36 +08:00
wjc404	9b19e9e1b0	Update dgemm_kernel_8x8_skylakex.c	2019-10-16 10:14:51 +08:00
wjc404	6bd67ddbab	Update dgemm_kernel_8x8_skylakex.c	2019-10-16 03:20:08 +08:00
wjc404	844629af57	Add files via upload	2019-10-16 02:00:34 +08:00
Martin Kroeker	a448884a63	Remove automatic label postfixes from macro included only once	2019-10-08 08:37:50 +02:00
Martin Kroeker	3a2df19db6	Fix accidental duplication of jump instruction	2019-10-08 08:09:26 +02:00
Martin Kroeker	d2093a40d3	Merge pull request #2277 from martin-frbg/issue2275 Rewrite ARMV8 code to allow cross-compilation for IOS	2019-10-06 23:01:54 +02:00
Martin Kroeker	56837e9d92	Make local labels in macro compatible with the xcode assembler ... which does not perform the automatic numbering on instantiation that the _@ suffix signifies	2019-10-04 14:53:23 +02:00
Martin Kroeker	5e244d80f2	Merge pull request #2271 from quickwritereader/strmm_fix fixed bug power9 strmm . BLAS-TESTER passes	2019-09-29 13:53:45 +02:00
AbdelRauf	ede5efebab	trmm fix	2019-09-29 02:28:34 +00:00
Martin Kroeker	596a22325a	Fix prologue of power9 assembly cdot(c) kernel to provide cdotc	2019-09-27 00:47:18 +02:00
Martin Kroeker	7f58f3ad0e	Fix mis-edits in the gcc-derived power8 caxpy kernel	2019-09-27 00:44:26 +02:00
Martin Kroeker	673e5a0495	Replace several POWER8/9 C kernels with their gcc7-generated assembly versions (#2263 ) * Add gcc7-generated assembly files for POWER8/9 isa/ica-min/max and POWER9 caxpy To work around internal compiler errors encountered when compiling the original C source with gcc 4 and 5, and wrong code generated by gcc 8.3.0 * Use gcc-generated assembly instead of original C sources to work around internal compiler errors encountered with gcc 4.8/5.4 and wrong code generation by gcc 8.3 * Use gcc-generated assembly instead of the original C source to work around internal compiler errors encountered with gcc 4.8 and 5.4, and wrong code generation by gcc 8.3 * Add gcc7-generated assembler version of caxpy for power8 to work around wrong code generated by gcc 8.3 * Handle CONJ define for caxpyc * Handle CONJ define for caxpyc * Add gcc7-generated assembly cdot for POWER9 * Use prebuilt assembly for POWER9 cdot created with gcc 7.3.1 to work around ICE in older gcc versions * Exclude POWER9 from DYNAMIC_ARCH when gcc versions is lower than 6 * Update Makefile.system * Use PROLOGUE macro to ensure correct function name for DYNAMIC_ARCH * Disable POWER9 with old gcc versions	2019-09-22 22:35:22 +02:00
Martin Kroeker	e7c4d6705a	Revert #2051 and replace with a better fix (#2261 ) * Revert #2051 and add a better fix for TARGET=generic with DYNAMIC_ARCH fixes #2257 without breaking #2048 again	2019-09-17 18:56:04 +02:00
Martin Kroeker	f3c314550c	Merge pull request #2243 from quickwritereader/develop possible cgemv,caxpy,cdot fix	2019-08-30 23:06:23 +02:00
AbdelRauf	847c20c9b7	fix uninitialized variables i	2019-08-30 11:14:55 +00:00
AbdelRauf	4c22828812	caxpy and cdot are using vec_vsx_ld	2019-08-30 04:09:15 +00:00
AbdelRauf	e79712d969	cgemv using vec_vsx_ld instead of letting gcc to decide	2019-08-30 02:52:04 +00:00
AbdelRauf	be09551cdf	aligned	2019-08-29 23:22:23 +00:00
Martin Kroeker	11c59acfb1	Keep both PGI/SUN and default code paths to avoid breaking Clang/WIndows	2019-08-28 18:07:44 +02:00
Martin Kroeker	3a55dca2dc	Make x86_64 zdot compile with PGI and Sun C again broken by #2222 as CREAL,CIMAG do not expand to a valid lvalue with these compilers	2019-08-28 11:35:31 +02:00
Kavana Bhat	3dc6b26eff	AIX changes for Power8	2019-08-20 06:51:35 -05:00
Martin Kroeker	9ef96b32a6	Add multithreading support to the x86_64 zdot kernel (#2222 ) * Add multithreading support copied from the ThunderX2T99 kernel. For #2221	2019-08-15 22:09:12 +02:00
Martin Kroeker	103b32fdb7	Merge pull request #2216 from martin-frbg/issue2214 Remove case-sensitivity in x86 LSAME on (AMD) cpus without CMOV	2019-08-13 13:59:33 +02:00
Martin Kroeker	aef9804089	Fix unwanted case-sensitivity in x86 LSAME for (AMD) processors without CMOV Problem was already noticed some years ago in #238, but back then the problem was only corrected in one of the #ifdef branches. Fixes #2214	2019-08-13 10:19:10 +02:00
Martin Kroeker	dccff2e785	Merge pull request #2206 from martin-frbg/zen-dtrmm Replace vpermpd with vpermilpd in the Haswell DTRMM kernel	2019-08-09 07:55:20 +02:00
Martin Kroeker	5c3458a6e7	Merge pull request #2199 from martin-frbg/zen-dtrsm Replace most vpermpd calls in the Haswell DTRSM_RN kernel	2019-08-09 07:55:02 +02:00
Martin Kroeker	acf6002ab2	Replace most vpermpd calls in the Haswell DTRSM_RN kernel	2019-08-03 12:40:13 +02:00
Martin Kroeker	2dfb804cb9	Replace vpermpd with vpermilpd in the Haswell DTRMM kernel to improve performance on AMD Zen (#2180) applying wjc404's improvement of the DGEMM kernel from #2186	2019-07-28 23:17:28 +02:00
Martin Kroeker	4c153ec9da	Merge pull request #2196 from wjc404/develop Add vbroadcastsd kernel to dgemm_kernel_4x8_haswell.S	2019-07-28 23:11:40 +02:00
wjc404	7eecd8e39c	Add files via upload	2019-07-28 07:39:09 +08:00
Martin Kroeker	7b0b7c11d2	Merge pull request #2190 from martin-frbg/zdot-zen Replace vpermpd with vpermilpd in the Haswell/Zen zdot microkernel	2019-07-23 16:15:08 +02:00
Martin Kroeker	28e96458e5	Replace vpermpd with vpermilpd to improve performance on Zen/Zen2 (as demonstrated by wjc404 in #2180)	2019-07-22 08:28:16 +02:00
wjc404	95fb98f556	Update dgemm_kernel_4x8_haswell.S	2019-07-21 01:10:32 +08:00
wjc404	4801c6d36b	Update dgemm_kernel_4x8_haswell.S	2019-07-21 00:47:45 +08:00
wjc404	9440fa607d	Add files via upload	2019-07-20 22:08:22 +08:00
wjc404	94db259e5b	Add files via upload	2019-07-20 22:04:41 +08:00
wjc404	f49f8047ac	Add files via upload	2019-07-20 14:33:37 +08:00
wjc404	825777faab	Update dgemm_kernel_4x8_haswell.S	2019-07-19 23:58:24 +08:00
wjc404	9c89757562	Add files via upload	2019-07-19 23:47:58 +08:00
wjc404	9b04baeaee	Update dgemm_kernel_4x8_haswell.S	2019-07-17 23:50:03 +08:00
wjc404	8a074b3965	Update dgemm_kernel_4x8_haswell.S	2019-07-17 23:47:30 +08:00
wjc404	211ab03b14	Update dgemm_kernel_4x8_haswell.S	2019-07-17 22:39:15 +08:00
wjc404	1733f927e6	Update dgemm_kernel_4x8_haswell.S	2019-07-17 21:27:41 +08:00
wjc404	182b06d6ad	Update dgemm_kernel_4x8_haswell.S	2019-07-17 17:02:35 +08:00
wjc404	7a9050d681	Update dgemm_kernel_4x8_haswell.S	2019-07-17 00:55:06 +08:00
wjc404	0ba29fd262	Update dgemm_kernel_4x8_haswell.S for zen2 replaced a bunch of vpermpd instructions with vpermilpd and vperm2f128	2019-07-17 00:46:51 +08:00
Martin Kroeker	6b6c9b1441	Merge pull request #2172 from quickwritereader/develop power9 cgemm/ctrmm. new sgemm 8x16	2019-07-01 21:06:02 +02:00
AbdelRauf	a97b301aaa	cgemm/ctrmm power9	2019-07-01 14:07:54 +00:00
Piotr Kubaj	eebfeba768	Fix build on FreeBSD/powerpc64. Signed-off-by: Piotr Kubaj <pkubaj@anongoth.pl>	2019-06-25 10:58:56 +02:00
kavanabhat	a575f1e4c7	Update dtrmm_kernel_16x4_power8.S	2019-06-19 15:27:14 +05:30
AbdelRauf	cdbfb891da	new sgemm 8x16	2019-06-17 15:33:38 +00:00
Martin Kroeker	a17cf36225	Merge pull request #2153 from quickwritereader/develop improved power9 zgemm,sgemm	2019-06-06 07:42:56 +02:00
AbdelRauf	148c4cc5fd	conflict resolve	2019-06-05 20:50:50 +00:00
AbdelRauf	d0c3543c3f	power9 zgemm ztrmm optimized	2019-06-05 20:07:16 +00:00
AbdelRauf	a469b32cf4	sgemm pipeline improved, zgemm rewritten without inner packs, ABI lxvx v20 fixed with vs52	2019-06-04 07:11:30 +00:00
AbdelRauf	8fe794f059	improved zgemm power9 based on power8	2019-05-30 15:31:25 +00:00
Martin Kroeker	74c10b57c6	Use generic kernels for complex (I)AMAX to support softfp	2019-05-30 11:38:11 +02:00
Martin Kroeker	c5495d2056	Ensure correct output for DAMAX with softfp	2019-05-30 11:25:43 +02:00
Martin Kroeker	c70496b108	Separate implementations of AMAX and IAMAX on arm As noted in #1912 and comment on #1942, the combined implementation happens to "do the right thing" on hardfp, but cannot return both value and index on softfp where they would have to share the return register	2019-05-29 15:02:51 +02:00
Martin Kroeker	9ea30f3788	Replace ISMIN and ISAMIN kernels on all x86_64 platforms (#2125 ) * Mark iamax_sse.S as unsuitable for MIN due to issue #2116 * Use iamax.S rather than iamax_sse.S for ISMIN/ISAMIN on all x86_64 as workaround for #2116	2019-05-09 14:42:36 +02:00
Martin Kroeker	6a8b4269b5	Merge pull request #2111 from martin-frbg/issue1955 Disable the SkyLakeX DGEMMIxCOPY kernels as well	2019-05-05 18:08:49 +02:00
Martin Kroeker	b1561ecc68	Disable DGEMMINCOPY as well for now #1955	2019-05-05 15:52:01 +02:00
Martin Kroeker	7ed8431527	Disable the SkyLakeX DGEMMITCOPY kernel as well as a stopgap measure for https://github.com/numpy/numpy/issues/13401 as mentioned in #1955	2019-05-04 22:54:41 +02:00
Martin Kroeker	3f427c0cf9	Merge pull request #2107 from quickwritereader/develop sgemm/strmm kernel for power9	2019-05-02 07:56:57 +02:00
AbdelRauf	47f892198c	conflict resolve	2019-05-01 19:36:22 +00:00
AbdelRauf	628b335e83	Merge branch 'develop' of https://github.com/quickwritereader/OpenBLAS into develop	2019-04-29 08:57:44 +00:00
AbdelRauf	0f105dd8a5	sgemm/strmm	2019-04-29 08:49:50 +00:00
Martin Kroeker	ccfb7ead15	Merge pull request #2072 from martin-frbg/sum Add (C)BLAS extension ?sum	2019-04-23 20:11:36 +02:00
Rashmica Gupta	bcdf1d4917	Add in runtime CPU detection for POWER.	2019-04-09 14:20:16 +10:00
Martin Kroeker	c04a729081	Add ?sum definitions for generic kernel	2019-03-31 13:55:49 +02:00
Martin Kroeker	100d94f94e	Add ?sum	2019-03-31 13:55:05 +02:00
Martin Kroeker	246ca29679	Add ZARCH implementation of ?sum as trivial copies of the respective ?asum kernels with the ABS and vflpsb calls removed	2019-03-30 22:49:05 +01:00
Martin Kroeker	9d717cb5ee	Add x86_64 implementation of ?sum as trivial copy of ?asum with the fabs calls removed	2019-03-30 22:27:04 +01:00
Martin Kroeker	e3bc83f2a8	Add x86 implementation of ?sum as trivial copy of ?asum with the fabs calls removed	2019-03-30 22:26:10 +01:00
Martin Kroeker	70f2a4e0d7	Add SPARC implementation of ?sum as trivial copy of ?asum with the fabs replaced by fmov to preserve code structure	2019-03-30 22:25:06 +01:00
Martin Kroeker	706dfe263b	Add POWER implementation of ?sum as trivial copy of ?asum with the fabs replaced by fmr to preserve code structure	2019-03-30 22:23:42 +01:00
Martin Kroeker	688fa9201c	Add MIPS64 implementation of ?sum as trivial copy of ?asum with the fabs replaced by mov to preserve code structure	2019-03-30 22:22:15 +01:00
Martin Kroeker	cdbe0f0235	Add MIPS implementation of ?sum as trivial copy of ?asum with the fabs calls removed	2019-03-30 22:20:14 +01:00
Martin Kroeker	f8b82bc6dc	Add ia64 implementation of ?sum as trivial copy of asum with the fabs calls removed	2019-03-30 22:18:03 +01:00
Martin Kroeker	3e3ccb9011	Add ARM64 implementations of ?sum as trivial copies of the respective ?asum kernels with the fabs calls removed	2019-03-30 22:13:36 +01:00
Martin Kroeker	94ab4e6fb2	Add ARM implementations of ?sum (trivial copies of the respective ?asum with the fabs calls removed)	2019-03-30 22:11:38 +01:00
Martin Kroeker	c3cfc6986b	Add implementations of ssum/dsum and csum/zsum as trivial copies of asum/zsasum with the fabs calls replaced by fmov to preserve code structure	2019-03-30 22:05:11 +01:00
Martin Kroeker	b9f4943a14	Add ?sum	2019-03-30 22:01:13 +01:00
Martin Kroeker	32c7063cb0	Merge pull request #2061 from martin-frbg/martin-frbg-patch-1 Disable the AVX512 DGEMM kernel (again)	2019-03-30 21:21:38 +01:00
Martin Kroeker	7c51cc8527	Merge branch 'develop' into develop	2019-03-29 19:36:29 +01:00
AbdelRauf	853a18bc17	power9 makefile. dgemm based on power8 kernel with following changes : 32x unrolled 16x4 kernel and 8x4 kernel using (lxv stxv butterfly rank1 update). improvement from 17 to 22-23gflops. dtrmm cases were added into dgemm itself	2019-03-29 15:49:40 +00:00
Martin Kroeker	e608d4f7fe	Disable the AVX512 DGEMM kernel (again) Due to as yet unresolved errors seen in #1955 and #2029	2019-03-13 22:10:28 +01:00
Martin Kroeker	03d7110900	Merge pull request #2042 from maomao194313/develop add TARGET support for HiSilicon tsv110 CPUs	2019-03-12 22:57:39 +01:00
Martin Kroeker	f18ab6c17b	Merge pull request #2051 from martin-frbg/issue2048 Make TARGET=GENERIC compatible with DYNAMIC_ARCH=1	2019-03-09 16:39:35 +01:00
Martin Kroeker	5b95534afc	Make TARGET=GENERIC compatible with DYNAMIC_ARCH=1 for issue #2048	2019-03-09 11:21:16 +01:00
Celelibi	b7f59da42d	Fix crash in sgemm SSE/nano kernel on x86_64 Fix bug #2047. Signed-off-by: Celelibi <celelibi@gmail.com>	2019-03-07 16:55:13 +01:00
maomao194313	783ba8058f	HiSilicon tsv110 CPUs optimization branch add HiSilicon tsv110 CPUs optimization branch	2019-03-04 16:30:50 +08:00
Andrew	6eee1beac5	move fix to right place	2019-02-24 20:41:02 +02:00
Martin Kroeker	e12cdf58ef	Merge pull request #2024 from martin-frbg/gcc9fixes4 Fix inline assembly constraints in Bulldozer TRSM kernels	2019-02-17 11:49:15 +01:00
Martin Kroeker	1860c9456d	Merge pull request #2023 from martin-frbg/gcc9fixes3 Fix inline assembly constraints in various x86_64 GEMVN kernels	2019-02-17 11:48:57 +01:00
Martin Kroeker	f9bb76d29a	Fix inline assembly constraints in Bulldozer TRSM kernels rework indices to allow marking i,as and bs as both input and output (marked operand n1 as well for simplicity). For #2009	2019-02-16 20:06:48 +01:00
Martin Kroeker	efb9038f72	Fix inline assembly constraints	2019-02-16 18:46:17 +01:00
Martin Kroeker	e976557d29	Fix inline assembly constraints rework indices to allow marking argument lda as input and output.	2019-02-16 18:36:39 +01:00
Martin Kroeker	9d8be15789	Fix inline assembly constraints rework indices to allow marking argument lda4 as input and output. For #2009	2019-02-16 18:24:11 +01:00
Martin Kroeker	d752799a0f	Merge pull request #2021 from martin-frbg/gcc9fixes2 Fix wrong constraints in inline assembly of Haswell DTRSM kernel	2019-02-16 18:05:40 +01:00
Martin Kroeker	c26c0b77a7	Fix wrong constraints in inline assembly for #2009	2019-02-15 15:08:16 +01:00
Martin Kroeker	1c6da2d03c	Merge pull request #2019 from martin-frbg/gcc9fixes Fix unannounced modification of input operand 8 (lda4) in Haswell GEMVN microkernel	2019-02-15 15:02:54 +01:00
Martin Kroeker	4255a58cd2	Rename operands to put lda on the input/output constraint list	2019-02-15 10:10:04 +01:00
Martin Kroeker	46e415b140	Save and restore input argument 8 (lda4) Fixes miscompilation with gcc9 -ftree-vectorize (related to issue #2009)	2019-02-14 22:43:18 +01:00
Bart Oldeman	69a97ca7b9	dgemv_kernel_4x4(Haswell): add missing clobbers for xmm0,xmm1,xmm2,xmm3 This fixes a crash in dblat2 when OpenBLAS is compiled using -march=znver1 -ftree-vectorize -O2 See also: https://github.com/easybuilders/easybuild-easyconfigs/issues/7180	2019-02-14 16:27:58 +00:00
Martin Kroeker	056917d616	Merge pull request #2013 from martin-frbg/issue2011 Fix invalid memory access in PPC gemm_beta	2019-02-14 09:29:34 +01:00
Martin Kroeker	718efcec6f	Fix out-of-bounds memory access in gemm_beta Fixes #2011 (as suggested by davemq), assuming typo by K.Goto	2019-02-13 22:08:37 +01:00
Martin Kroeker	f9d67bb5e8	Fix out-of-bounds memory access in gemm_beta Fixes #2011 (as suggested by davemq) presuming typo by K.Goto	2019-02-13 22:06:41 +01:00
Martin Kroeker	76bb74fcd4	Merge pull request #2012 from maamountki/z14 [ZARCH] Many improvements	2019-02-13 20:15:56 +01:00
maamountki	0a54c98b9d	[ZARCH] Modify constraints	2019-02-13 21:06:25 +02:00
maamountki	bec54ae366	[ZARCH] Fix caxpy	2019-02-13 12:54:35 +02:00
Martin Kroeker	ab1630f9fa	Fix declaration of arguments in inline assembly Argument 0 is modified so should be input and output	2019-02-12 16:14:02 +01:00
Martin Kroeker	b824fa70eb	Fix declaration of assembly arguments in SSYMV and DSYMV microkernels Arguments 0 and 1 are both input and output	2019-02-12 16:00:18 +01:00
Martin Kroeker	91481a3e4e	Fix declaration of input arguments in inline assembly Argument 0 is modified as it doubles as a counter	2019-02-12 15:51:43 +01:00
Martin Kroeker	dc6ac9eab0	Fix declaration of input arguments in the x86_64 s/dGEMV_T and s/dGEMV_N kernels Arguments 0 and 1 need to be tagged as both input and output	2019-02-12 15:33:48 +01:00
maamountki	f583674109	[ZARCH] Fix cgemv_t_4	2019-02-12 13:12:28 +02:00
maamountki	77fe70019f	[ZARCH] Fix constraints and source code formatting	2019-02-11 16:01:13 +02:00
maamountki	7039770165	[ZARCH] Undo the last commit	2019-02-06 20:11:44 +02:00
maamountki	11a43e8116	[ZARCH] Set alignment hint for vl/vst	2019-02-05 19:17:08 +02:00
maamountki	61526480f9	[ZARCH] Fix copy constraint	2019-02-05 07:51:19 +02:00
maamountki	81daf6bc38	[ZARCH] Format source code, Fix constraints	2019-02-05 07:30:38 +02:00
Martin Kroeker	729e925174	Merge pull request #1996 from quickwritereader/develop NBMAX=4096 for gemvn, added sgemvn 8x8 for future	2019-02-04 16:52:04 +01:00
Ubuntu	498ac98581	Note for unused kernels	2019-02-04 15:41:56 +00:00
Ubuntu	cd9ea45463	NBMAX=4096 for gemvn, added sgemvn 8x8 for future	2019-02-04 06:57:11 +00:00
Martin Kroeker	f9c5023e04	Merge pull request #1994 from quickwritereader/develop sgemv cgemv pairs	2019-02-01 21:04:47 +01:00
Ubuntu	4abc375a91	sgemv cgemv pairs	2019-02-01 13:45:00 +00:00
Martin Kroeker	874df65491	Fix incorrect sgemv results for IBM z14 part of PR #1993 that was inadvertently misplaced into the toplevel directory	2019-02-01 12:58:59 +01:00
Martin Kroeker	877023e1e1	Fix precision of zarch DSDOT from patch provided by aarnez in #991	2019-01-31 21:22:26 +01:00
Martin Kroeker	265142edd5	Fix typo in the zarch min/max kernels from patch provided by aarnez in #991	2019-01-31 21:21:40 +01:00
Martin Kroeker	885a3c4350	USE_TRMM on Z14 from patch provided by aarnez in #991	2019-01-31 21:18:09 +01:00
maamountki	82124729af	Merge branch 'develop' into z14	2019-01-31 19:36:41 +02:00
maamountki	29416cb5a3	[ZARCH] Add Z13 version for max/min functions	2019-01-31 19:11:11 +02:00
maamountki	48b9b94f7f	[ZARCH] Improve loading performance for camax/icamax	2019-01-31 18:52:11 +02:00
Martin Kroeker	86a824c97f	Fix wrong comparison that made IMIN identical to IMAX as reported by aarnez in #1990	2019-01-31 15:27:21 +01:00
Martin Kroeker	808410c2c7	Fix wrong comparison that made IMIN identical to IMAX as suggested in #1990	2019-01-31 15:25:15 +01:00
maamountki	fcd814a8d2	[ZARCH] Fix bug in max/min functions	2019-01-29 17:59:38 +02:00
maamountki	dc4d3bccd5	[ZARCH] Fix icamax/icamin	2019-01-29 03:47:49 +02:00
maamountki	c7143c1019	[ZARCH] Fix iamax/imax single precision	2019-01-28 17:52:23 +02:00
maamountki	04873bb174	[ZARCH] Undo the last commit	2019-01-28 17:32:24 +02:00
maamountki	c8ef9fb220	[ZARCH] Fix bug in iamax/iamin/imax/imin	2019-01-28 17:16:18 +02:00
maamountki	b111829226	[ZARCH] Update max/min functions	2019-01-21 15:56:04 +02:00
Martin Kroeker	32b0f1168e	Fix declaration of input arguments in the Sandybridge GER microkernels (#1967 ) * Tag arguments 0 and 1 as both input and output	2019-01-18 08:11:39 +01:00
Martin Kroeker	b495e54310	Fix declaration of input arguments in the x86_64 SCAL microkernels (#1966 ) * Tag arguments 0 and 1 as both input and output (see #1964)	2019-01-18 08:11:07 +01:00
Martin Kroeker	d5e6940253	Fix declaration of input arguments in the x86_64 microkernels for DOT and AXPY (#1965 ) * Tag operands 0 and 1 as both input and output For #1964 (basically a continuation of coding problems first seen in #1292)	2019-01-17 23:20:32 +01:00
Ubuntu	43a4572038	crot fix	2019-01-17 14:45:31 +00:00
Abdelrauf	a034e65512	Merge branch 'develop' into develop	2019-01-16 19:25:13 +04:00
Ubuntu	8c3386be87	Added missing Blas1 single fp {saxpy, caxpy, cdot, crot(refactored version of srot),isamax ,isamin, icamax, icamin}, Fixed idamin,icamin choosing the first occurance index of equal minimals	2019-01-16 15:16:21 +00:00
maamountki	b815a04c87	[ZARCH] fix a bug in max/min functions	2019-01-15 21:04:22 +02:00
maamountki	1a7925b3a3	[ZARCH] Update dgemv_n_4.c	2019-01-11 17:43:11 +02:00
maamountki	406f835f00	[ZARCH] update cgemv_n_4.c	2019-01-11 17:39:17 +02:00
maamountki	621dedb37b	[ZARCH] Update cgemv_t_4.c	2019-01-11 17:37:11 +02:00
maamountki	b731e8246f	Update sgemv_t_4.c	2019-01-11 17:14:04 +02:00
maamountki	ecc31b743f	Update dgemv_t_4.c	2019-01-11 17:13:02 +02:00
maamountki	5d89d6b143	[ZARCH] fix sgemv_n_4.c	2019-01-11 17:08:24 +02:00
maamountki	67432b23c2	[ZARCH] fix cgemv_n_4.c	2019-01-11 16:44:46 +02:00
maamountki	be66f5d5c2	[ZARCH] fix data prefetch type in sdot	2019-01-09 16:50:07 +02:00
maamountki	c2ffef8156	[ZARCH] fix data prefetch type in ddot	2019-01-09 16:49:44 +02:00
maamountki	e7455f500c	[ZARCH] fix dsdot.c	2019-01-09 16:33:54 +02:00
maamountki	3eafcfa650	[ZARCH] fix cgemv_n_4.c	2019-01-09 07:43:45 +02:00
maamountki	94cd946b96	[ZARCH] fix cgemv_n_4.c	2019-01-04 17:45:56 +02:00
maamountki	1aa840a0a2	[ZARCH] fix sgemv_t_4.c	2019-01-04 01:38:18 +02:00
Arjan van de Ven	795285c587	Fix thinko in skylake beta handling casting ints is cheaper but it has a rounding, not memory casing effect, resulting in invalid outcome	2018-12-24 18:49:50 +00:00
Arjan van de Ven	d321448a63	dgemm: use dgemm_ncopy_8_skylakex.c also for Haswell The dgemm_ncopy_8_skylakex.c code is not avx512 specific and gives a nice performance boost for medium sized matrices	2018-12-16 23:09:22 +00:00
Arjan van de Ven	c43331ad0a	dgemm: Use the skylakex beta function also for haswell it's more efficient for certain tall/skinny matrices	2018-12-16 23:09:17 +00:00
Martin Kroeker	c4e23dd016	Update Makefile	2018-12-16 18:14:40 +01:00
Martin Kroeker	cfc4acc221	typo	2018-12-16 16:19:51 +01:00
Martin Kroeker	545c2b1bbb	Add -mavx2 on Haswell only if the compiler supports it	2018-12-16 13:09:19 +01:00
Arjan van de Ven	69d206440a	Make the skylakex/haswell sgemm code compile and run even with compilers without avx2 support	2018-12-16 00:19:41 +00:00
Martin Kroeker	3843e3e017	use -maxv2 on haswell	2018-12-15 23:30:31 +01:00
Martin Kroeker	fbcb14a74b	should be core-avx2	2018-12-15 20:18:59 +01:00
Martin Kroeker	2a3190dc76	fix elseifeq and use older option core2-avx for compatibility	2018-12-15 20:17:44 +01:00
Martin Kroeker	1ebe5c0f49	Add -march=haswell to HASWELL part of DYNAMIC_ARCH build	2018-12-15 19:35:35 +01:00
Arjan van de Ven	0586899a10	Use sgemm_ncopy_4_skylakex.c also for Haswell sgemm_ncopy_4_skylakex.c uses SSE transpose operations where the real perf win happens; this also works great for Haswell. This gives double digit percentage gains on small and skinny matrices	2018-12-15 13:49:19 +00:00
Arjan van de Ven	00dc09ad19	Use the skylake sgemm beta code also for haswell with a few small changes it's possible to use the skylake sgemm code also for haswell, this gives a modest gain (10% range) for smallish matrixes but does wonders for very skinny matrixes	2018-12-15 13:49:13 +00:00
Arjan van de Ven	cdc668d82b	Add a "sgemm direct" mode for small matrixes OpenBLAS has a fancy algorithm for copying the input data while laying it out in a more CPU friendly memory layout. This is great for large matrixes; the cost of the copy is easily ammortized by the gains from the better memory layout. But for small matrixes (on CPUs that can do efficient unaligned loads) this copy can be a net loss. This patch adds (for SKYLAKEX initially) a "sgemm direct" mode, that bypasses the whole copy machinary for ALPHA=1/BETA=0/... standard arguments, for small matrixes only. What is small? For the non-threaded case this has been measured to be in the MNK = 28 * 512 * 512 range, while in the threaded case it's less, around MNK = 1 * 512 * 512	2018-12-13 13:47:31 +00:00
Martin Kroeker	87718807f0	Merge pull request #1910 from martin-frbg/issue1909 Fix for DYNAMIC_ARCH builds made on a AVX512-capable host	2018-12-12 14:56:25 +01:00
Martin Kroeker	51aec8e96b	make sure the added march=skylake-avx512 does not cause problems on Windows	2018-12-11 22:47:32 +01:00
Martin Kroeker	06f7d78d70	Add -march=skylake-avx512 to SkylakeX part of DYNAMIC_ARCH builds	2018-12-11 21:10:38 +01:00
Martin Kroeker	7639f2e1f0	Rewrite the conditional for OSX to fix cmake parsing on others The Makefile variable parser in utils.cmake currently does not handle conditionals. Having the definitions for non-OSX last will at least make cmake builds work again on non-OSX platforms.	2018-12-06 14:04:27 +01:00
Martin Kroeker	2fc712469d	Avoid creating spurious non-suffixed c/zgemm_kernels Plain cgemm_kernel and zgemm_kernel are not used anywhere, only cgemm_kernel_b etc. Needlessly building them (without any define like NN, CN, etc.) just happened to work on most platforms, but not on arm64. See #1870	2018-12-06 13:56:06 +01:00
Martin Kroeker	6ba30e270d	Fix typo that broke CNRM2 on ARMV8 since 0.3.0 must have happened in my #1449	2018-12-06 13:42:25 +01:00
Martin Kroeker	701ea88347	Use p2align instead of align for OSX compatibility fixes #1902	2018-12-03 13:06:43 +01:00
Martin Kroeker	6c7b691083	Really revert xDOT changes from 1832 neglected to rebase #1892 on merging	2018-11-30 21:32:01 +01:00
Martin Kroeker	5f4c550c27	Merge pull request #1892 from martin-frbg/mipsdot revert MIPS64 xDOT kernel changes from #1832	2018-11-30 21:28:21 +01:00
Martin Kroeker	95a5542e3c	Revert DOT kernel changes from #1834 as the failures seen on Loongson3A appear to be limited to DSDOT/SDSDOT (i.e. my hackish "fix" from #1684)	2018-11-30 11:16:24 +01:00
Martin Kroeker	7a2e1bc804	Use generic kernel for DSDOT/SDSDOT as discussed in #1834	2018-11-30 10:57:09 +01:00
Martin Kroeker	35653e38b3	Merge pull request #1834 from fengrl/develop register push/pop command change	2018-11-30 10:48:46 +01:00
Andrew	19c4bdd8b3	Add return value so that freebsd system clang does not err out	2018-11-25 21:35:01 +01:00
Renato Golin	310ea55f29	Simplifying ARMv8 build parameters ARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode (which is not right because TX2 is ARMv8.1) as well as requiring a few redundancies in the defines, making it harder to maintain and understand what core has what. A few other minor issues were also fixed. Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX, ThunderX2, and XGene. Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester. A summary: * Removed TX2 code from ARMv8 build, to make sure it is compatible with all ARMv8 cores, not just v8.1. Also, the TX2 code has actually harmed performance on big cores. * Commoned up ARMv8 architectures' defines in params.h, to make sure that all will benefit from ARMv8 settings, in addition to their own. * Adding a few more cores, using ARMv8's include strategy, to benefit from compiler optimisations using mtune. Also updated cache information from the manuals, making sure we set good conservative values by default. Removed Vulcan, as it's an alias to TX2. * Auto-detecting most of those cores, but also updating the forced compilation in getarch.c, to make sure the parameters are the same whether compiled natively or forced arch. Benefits: * ARMv8 build is now guaranteed to work on all ARMv8 cores * Improved performance for ARMv8 builds on some cores (A72, Falkor, ThunderX1 and 2: up to 11%) over current develop * Improved performance for all cores comparing to develop branch before TX2's patch (9% ~ 36%) * ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than current develop's branch and 8% faster than deveop before tx2 patches Issues: * Regression from current develop branch for A53 (-12%) and A57 (-3%) with ARMv8 builds, but still faster than before TX2's commit (+15% and +24% respectively). This can be improved with a simplification of TX2's code, to be done in future patches. At least the code is guaranteed to be ARMv8.0 now. Comments: * CortexA57 builds are unchanged on A57 hardware from develop's branch, which makes sense, as it's untouched. * CortexA72 builds improve over A57 on A72 hardware, even if they're using the same includes due to new compiler tunning in the makefile.	2018-11-19 16:41:49 +00:00
fengruilin	43bb386b10	fix dot problem on 64bit mips	2018-11-15 11:11:59 +08:00
Arjan van de Ven	dcc5d6291e	skylakex: Make the sgemm/dgemm beta code robust for a N=0 or M=0 case in the threading code there are cases where N or M can become 0, and the optimized beta code did not handle this well, leading to a crash during the audit for the crash a few edge conditions on the if statements were found and fixed as well	2018-11-01 01:42:09 +00:00
fengrl	2d8064174c	register push/pop command change 64bit push/pop register command should be used. Otherwise, data will lost.	2018-10-26 17:55:15 +08:00
Ashwin Sekhar T K	d5aeff636f	ARM64: Enable DYNAMIC_ARCH Enable DYNAMIC_ARCH feature on ARM64. This patch uses the cpuid feature in linux kernel to detect the core type at runtime (https://www.kernel.org/doc/Documentation/arm64/cpu-feature-registers.txt). If this feature is missing in kernel, then the user should use the OPENBLAS_CORETYPE env variable to select the desired core type.	2018-10-22 01:49:35 -07:00
Ashwin Sekhar T K	e7b66cd36e	ARM64: Fix DYNAMIC_ARCH compilation for cores which dont use GEMM3M	2018-10-22 01:45:51 -07:00
Ashwin Sekhar T K	d50abc8903	ARM64: Move parameters from parameter.c to param.h Remove the runtime setting of P, Q, R parameters for targets ARMV8, THUNDERX2T99. Instead set them as constants in param.h at compile time.	2018-10-22 01:45:51 -07:00
Ashwin Sekhar T K	351a0c777c	ARM64: Remove XGENE1 references Remove XGENE1 target as the implementation for the same is incomplete. Moreover whoever wishes to use on XGENE1 can use the generic ARMV8 target as there are no XGENE1 specific optimizations in OpenBLAS.	2018-10-22 01:45:51 -07:00
Ashwin Sekhar T K	21f46a1cf2	ARM64: Use THUNDERX2T99 Neon Kernels for ARMV8 Currently the generic ARMV8 target uses C implementations for many routines. Replace these with the neon implementations written for THUNDERX2T99 target which are upto 6x faster for certain routines.	2018-10-17 10:44:37 -07:00
Ashwin Sekhar T K	caf339412f	ARM64: Remove dependency of THUNDERX2T99 Makefile on CORTEXA57 Makefile	2018-10-17 08:02:40 -07:00
Ashwin Sekhar T K	8001fdcd2a	ARM64: Remove dependency of THUNDERX Makefile on ARMV8 Makefile	2018-10-17 08:02:16 -07:00
Ashwin Sekhar T K	162e312832	ARM64: Remove dependency of CORTEXA57 Makefile on ARMV8 Makefile	2018-10-17 08:01:45 -07:00
Ashwin Sekhar T K	c3d93caa8d	ARM64: Remove dependency of XGENE1 Makefile on ARMV8 Makefile	2018-10-17 08:01:27 -07:00
Arjan van de Ven	55b244ca0d	enable the SGEMM/SKX C based kernel In QA the final bug was found so now the sklyakex sgemm C based kernel can be activated....	2018-10-12 09:30:35 +00:00
Arjan van de Ven	d4bad73834	Add a C+intrinsics version of the SGEMM/skylakex kernel for most sizes this is 1.2x to 1.4x faster than the current code	2018-10-10 01:49:22 +00:00
Arjan van de Ven	582c589727	dgemm/skylakex: replace discrete mul/add with fma very minor gains since it's not super hot code, but general principles	2018-10-06 23:13:26 +00:00
Arjan van de Ven	adbf6afa25	Add vector optimizations for ncopy as well for dgemm/skylakex	2018-10-06 21:18:12 +00:00
Arjan van de Ven	32bec8afbb	add a skylakex optimized dgemm beta function	2018-10-06 16:36:26 +00:00
Arjan van de Ven	20c5d668fe	dgemm/avx512 simplify and speed up the 4x4 kernel	2018-10-06 14:12:32 +00:00
Arjan van de Ven	6d43c51ccf	undo slow dgemm/skylake microoptimization the compare is more costly than the work	2018-10-06 14:00:37 +00:00
Arjan van de Ven	d74dc39b0f	Add optimized *copy versions for skylakex Add optimized n/t copy versions for skylakex; in the patch the tcopy is also rewritten using intrinsics; the ncopy file will be worked on in a future commit	2018-10-06 13:51:44 +00:00
Arjan van de Ven	66b43affbc	Add a 24x8 kernel to the skylakex dgemm implementation Minor gains for small matrixes, but at 512x512 and above the gain gets more significant.	2018-10-05 13:22:21 +00:00
Arjan van de Ven	1938819c25	skylake dgemm: Add a 16x8 kernel The next step for the avx512 dgemm code is adding a 16x8 kernel. In the 8x8 kernel, each FMA has a matching load (the broadcast); in the 16x8 kernel we can reuse this load for 2 FMAs, which in turn reduces pressure on the load ports of the CPU and gives a nice performance boost (in the 25% range).	2018-10-05 13:11:35 +00:00
Martin Kroeker	b7496c3638	Function name needs to be CNAME, set from outside to allow suffixing for dynamic_arch	2018-10-04 19:14:59 +02:00
Arjan van de Ven	45fe8cb0c5	Create a AVX512 enabled version of DGEMM This patch adds dgemm_kernel_4x8_skylakex.c which is * dgemm_kernel_4x8_haswell.s converted to C + intrinsics * 8x8 support added * 8x8 kernel implemented using AVX512 Performance is a work in progress, but already shows a 10% - 20% increase for a wide range of matrix sizes.	2018-10-03 14:45:25 +00:00
Martin Kroeker	544b069e85	Merge pull request #1780 from martin-frbg/issue1774-2 Convert fldmia/fstmia instructions to UAL syntax for clang7	2018-09-29 09:27:47 +02:00
Martin Kroeker	9b2a7ad40d	Convert fldmia/fstmia instructions to UAL syntax for clang7 second part of fix for #1774, containing files missed in #1775	2018-09-28 23:05:15 +02:00
fengruilin	6fc85a6359	test_axpy work error on LOONGSON3A platform #1777	2018-09-26 15:14:04 +08:00
Martin Kroeker	7e5df34e6a	Convert fldmia/fstmia instructions to UAL syntax for clang7 fixes #1774	2018-09-25 09:41:58 +02:00
Andrew	1e531701b7	fix small typo	2018-09-09 16:52:25 +02:00
Martin Kroeker	ba4f433321	Merge pull request #1749 from martin-frbg/issue1531 Fix ARMV8 cross-compilation for IOS	2018-09-07 11:02:01 +02:00
Martin Kroeker	1cb7b9015e	Conditional compilation of assembly files that IOS does not like	2018-09-04 11:06:51 +02:00
Martin Kroeker	a4bd41e9f2	Fix paths to C kernels for nrm2	2018-09-04 10:51:19 +02:00
Martin Kroeker	e11126b26a	Merge pull request #1745 from martin-frbg/issue1743 Set USE_TRMM for all ZARCH variants to fix TRMM faults with zarch-gen…	2018-08-29 07:43:58 +02:00
Martin Kroeker	f3fd44a731	Set USE_TRMM for all ZARCH variants to fix TRMM faults with zarch-generic fixes #1743	2018-08-28 21:34:07 +02:00
maamountki	e6c0e39492	Optimize Zgemv	2018-08-13 12:23:40 +03:00
Martin Kroeker	375dff54fc	Merge pull request #1733 from fenrus75/dsymv Add an AVX512 enabled DSYMV (L) function	2018-08-12 18:18:36 +02:00
Martin Kroeker	a5f165275a	Merge pull request #1732 from fenrus75/dgemv Add an AVX512 enabled DGEMV (n) function	2018-08-12 18:17:42 +02:00
Martin Kroeker	8c13aa495a	Merge pull request #1730 from fenrus75/fix-sdot Fix typo in sdot function	2018-08-12 18:17:01 +02:00
Arjan van de Ven	9bec34cb67	Add an AVX512 enabled DSYMV (L) function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-11 17:46:24 +00:00
Arjan van de Ven	87bebdbd8a	Add an AVX512 enabled DGEMV (n) function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-11 17:38:12 +00:00
Arjan van de Ven	36add7570a	Fix typo in sdot function it looks like my previous pull request was short the final commit; fix a typo in sdot	2018-08-11 17:16:45 +00:00
Arjan van de Ven	cacacc8007	Add an AVX512 enabled DSCAL function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-11 17:14:57 +00:00
Martin Kroeker	1a00ef3d27	Merge pull request #1725 from fenrus75/axpy Add a AVX512 enabled SAXPY/DAXPY functions	2018-08-11 11:01:20 +02:00
Arjan van de Ven	2e99873ff7	Add a AVX512 enabled SAXPY/DAXPY functions written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-10 02:58:32 +00:00
Arjan van de Ven	00abaa865b	Add an AVX512 enabled SDOT function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-10 02:33:43 +00:00
Arjan van de Ven	7932ff3ea9	Add an AVX512 enabled DDOT function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	2018-08-09 03:55:52 +00:00
maamountki	23229011db	[ZARCH] Z14 support, BLAS 1/2 single precision implementations, Some missing double precision implementations, Gemv optimization	2018-08-06 18:20:40 +03:00
Martin Kroeker	4e103c822c	typo fix	2018-07-16 12:56:39 +02:00
Martin Kroeker	d2142760e0	Fix precision problem in DSDOT	2018-07-15 17:11:40 +02:00
Martin Kroeker	2fbfc64da8	Use C kernels for default c/zAXPY, xROT, c/zSWAP	2018-07-15 17:09:55 +02:00
Martin Kroeker	ba8388cee0	Merge pull request #1651 from martin-frbg/avx512-nodgemm Disable the 16x2 DTRMM kernel on SkylakeX as well	2018-06-30 17:48:03 +02:00
Martin Kroeker	6e54b0a027	Disable the 16x2 DTRMM kernel on SkylakeX as well	2018-06-30 17:31:06 +02:00
Martin Kroeker	40c8cbc3bf	Merge pull request #1650 from martin-frbg/avx512-nodgemm Disable the AVX512 DGEMM kernel for now	2018-06-30 13:05:46 +02:00
Martin Kroeker	f0a8dc2eec	Disable the AVX512 DGEMM kernel for now due to #1643	2018-06-30 11:34:48 +02:00
Martin Kroeker	b83e4c60c7	Remove premature exit for INC_X or INC_Y zero	2018-06-26 20:46:42 +02:00
Martin Kroeker	e344db269b	Remove premature exit for INC_X or INC_Y zero	2018-06-26 20:45:57 +02:00
Martin Kroeker	545b82efd3	Remove premature exit for INC_X or INC_Y zero	2018-06-26 20:45:00 +02:00
Martin Kroeker	e322a951fe	Remove premature exit for INC_X or INC_Y zero	2018-06-26 20:44:13 +02:00
Martin Kroeker	c628c6fa59	Merge pull request #1612 from oon3m0oo/cpus Fixed a few more unnecessary calls to num_cpu_avail.	2018-06-14 16:51:31 +02:00
Martin Kroeker	6f71c0fce4	Return a somewhat sane default value for L2 cache size if cpuid retur… (#1611 ) * Return a somewhat sane default value for L2 cache size if cpuid returned something unexpected Fixes #1610, the KVM hypervisor on Google Chromebooks returning zero for CPUID 0x80000006, causing DYNAMIC_ARCH builds of OpenBLAS to hang	2018-06-11 13:26:19 +02:00
Craig Donner	c2545b0fd6	Fixed a few more unnecessary calls to num_cpu_avail. I don't have as many benchmarks for these as for gemm, but it should still make a difference for small matrices.	2018-06-11 10:17:16 +01:00
Arjan van de Ven	89372e0993	Use AVX512 also for DGEMM this required switching to the generic gemm_beta code (which is faster anyway on SKX) for both DGEMM and SGEMM Performance for the not-retuned version is in the 30% range	2018-06-03 22:17:27 +00:00
Martin Kroeker	0023515733	Typo fix (misplaced parenthesis)	2018-06-03 13:22:59 +02:00
Arjan van de Ven	99c7bba8e4	Initial support for SkylakeX / AVX512 This patch adds the basic infrastructure for adding the SkylakeX (Intel Skylake server) target. The SkylakeX target will use the AVX512 (AVX512VL level) instruction set, which brings 2 basic things: 1) 512 bit wide SIMD (2x width of AVX2) 2) 32 SIMD registers (2x the number on AVX2) This initial patch only contains a trivial transofrmation of the Haswell SGEMM kernel to AVX512VL; more will follow later but this patch aims to get the infrastructure in place for this "later". Full performance tuning has not been done yet; with more registers and wider SIMD it's in theory possible to retune the kernels but even without that there's an interesting enough performance increase (30-40% range) with just this change.	2018-06-03 07:58:52 +00:00
Martin Kroeker	8562d5787a	Merge pull request #1583 from martin-frbg/issue1575 Handle INCX=0,INCY=0 case	2018-05-31 21:55:26 +02:00
Martin Kroeker	7df8c4f76f	typo fix	2018-05-31 17:23:08 +02:00
Martin Kroeker	2fc748bf72	Restore optimized swap kernel now that we have a proper fix	2018-05-31 13:41:12 +02:00
Martin Kroeker	d1b7be14aa	Handle INCX=0,INCY=0 case Fixes #1575 (sswap/dswap failing the swap utest on x86) as suggested by atsampson.	2018-05-31 12:52:04 +02:00
Martin Kroeker	961d25e9c7	Use the new zrot.c on POWER8 for crot as well fixes #1571 (the old zrot.S assembly does not handle incx=0 correctly)	2018-05-23 22:54:39 +02:00
Martin Kroeker	f5959f2543	Merge pull request #1567 from martin-frbg/mipstrmm Revert " Switch mips32 target to USE_TRMM to fix complex TRMM"	2018-05-17 20:50:23 +02:00
Martin Kroeker	82012b960b	Revert " Switch mips32 target to USE_TRMM to fix complex TRMM" ... as it was just a silly workaround for the issue seen in #1563, caused by #1419	2018-05-17 20:30:03 +02:00
Martin Kroeker	8dd3515fa2	Merge pull request #1565 from martin-frbg/mipstypo Remove extraneous brace from previous commit of mips dsdot fix	2018-05-17 20:22:58 +02:00
Martin Kroeker	95f7f0229c	Remove extraneous brace from previous commit	2018-05-17 18:43:59 +02:00
Martin Kroeker	5082fe4306	Merge pull request #1564 from martin-frbg/issue1563 Revert changes from PR#1419	2018-05-17 14:04:13 +02:00
Martin Kroeker	7a7619af6d	Revert changes from PR#1419 at least one of these changes apparently is an oversimplification, leading to TRMM breakage on some platforms as observed in #1563	2018-05-17 11:40:08 +02:00
Martin Kroeker	893b535540	Use correct data type for initializers of v2f64, v4f32 Fixes #1561	2018-05-15 14:42:12 +02:00
Martin Kroeker	018f2dad27	Switch mips32 target to USE_TRMM to fix complex TRMM	2018-05-02 20:25:32 +02:00
Martin Kroeker	9d5098dbc9	Add MIPS 1004K target (Mediatek MT7621 SOC)	2018-05-02 20:20:44 +02:00
Martin Kroeker	954f1832de	Merge pull request #1540 from martin-frbg/mips32-zasum Fix typo in MIPS P5600 complex ASUM code selection	2018-04-25 23:23:00 +02:00
Martin Kroeker	941ad280a8	Fix typo in MIPS P5600 complex ASUM code selection	2018-04-25 22:50:10 +02:00
Martin Kroeker	1da365312a	Merge pull request #1538 from martin-frbg/arm7utest Fix handling of zero INCX, INCY in ArmV7 AXPY and ROT	2018-04-25 08:38:58 +02:00
Martin Kroeker	2d0929fa7c	Move the test for zero incx,incy in ARMV7 ROT to pass the related utest (see #1469)	2018-04-24 22:43:00 +02:00
Martin Kroeker	125343cc88	Drop test for zero incx,incy in armv7 AXPY ...to pass the related utest (see #1469)	2018-04-24 22:39:50 +02:00
Martin Kroeker	8a3b6fa108	Use generic zrot.c on ppc64/POWER6 to work around utest failure from … (#1535 ) * Use generic C implementation of zrot on ppc64/POWER6 to work around utest failure from #1469	2018-04-23 19:05:49 +02:00
Martin Kroeker	9c5518319a	Revert "Fix 32bit HASWELL builds"	2018-04-22 20:20:04 +02:00
Jerry Zhao	0ee395db35	Fixed TRMM and SYMM for RISCV	2018-04-18 18:03:32 -07:00
Jerry Zhao	c167a3d6f4	Added RISCV build	2018-04-16 14:08:31 -07:00
Martin Kroeker	2ca0faf495	Merge pull request #1515 from martin-frbg/mipsdot Correct precision of mips dsdot	2018-04-11 08:21:25 +02:00
Martin Kroeker	0fe434598b	Fix precision of mips dsdot	2018-04-10 23:30:59 +02:00
Martin Kroeker	c7b55b6082	Merge pull request #1499 from quickwritereader/develop Implemented missing vsx simd kernels for power8 blas1/2 double. z13 modifications	2018-03-27 21:43:23 +02:00
Martin Kroeker	840e01061f	Merge pull request #1491 from martin-frbg/ddot_mt Add multithreading support for Haswell DDOT	2018-03-27 21:43:05 +02:00
QWR QWR	28ca97015d	power8:Added initial zgemv_(t\|n) ,i(d\|z)amax,i(d\|z)amin,dgemv_t(transposed),zrot z13: improved zgemv_(t\|n)_4,zscal,zaxpy	2018-03-27 14:54:41 +00:00
Martin Kroeker	6a6ffaff1e	Merge pull request #1494 from martin-frbg/x86_dsdot Use generic/dot.c instead of the inferior arm/dot.c for x86 DSDOT	2018-03-17 15:26:47 +01:00
Martin Kroeker	28ac9ea5a6	Use generic/dot.c instead of the inferior arm/dot.c for x86 DSDOT to resolve dsdot utest failure seen in #1492	2018-03-17 13:49:15 +01:00
Martin Kroeker	a55694dd5b	Declare dot_compute static to avoid conflicts in multiarch builds	2018-03-16 22:23:36 +01:00
Martin Kroeker	85a41e9cdb	Add multithreading support for Haswell DDOT copied from ashwinyes' implementation in dot_thunderx2t99.c	2018-03-16 16:58:47 +01:00
Martin Kroeker	81215711a2	Re-enable DAXPY microkernels for x86_64 as the inaccuracies seen in the original testcase for #1332 appear to be due to an artefact that amplifies the very small rounding differences between FMA and discrete multiply+add	2018-03-04 19:37:03 +01:00
Martin Kroeker	22167170b3	Merge pull request #1477 from quickwritereader/develop Power8 blas3 copy-pack routines	2018-02-28 18:46:54 +01:00
Ashwin Sekhar T K	fa9ca65c0e	ARM64: Fix utest dsdot errors	2018-02-27 10:47:55 +00:00
Martin Kroeker	719b68f077	Merge pull request #1473 from martin-frbg/p2align Replace .align with .p2aligns in dscal.c and the Nehalem microkernels as well	2018-02-27 08:28:20 +01:00
Martin Kroeker	fe9f15f2d8	Merge pull request #1472 from martin-frbg/utest-fixes Fix limited DSDOT precision on arm,aarch64 and zarch	2018-02-26 22:48:07 +01:00
Martin Kroeker	497f0c3d8a	Replace .align with .p2align in the Nehalem microkernels	2018-02-26 20:58:33 +01:00
Martin Kroeker	ea37db828e	Convert .align to .p2align for OSX compatibility	2018-02-26 20:48:03 +01:00
Martin Kroeker	6e70287776	Use generic/dot.c for DSDOT on ARMV5 and above The default arm/dot.c is less precise when used for DSDOT, as shown by utest	2018-02-25 19:57:23 +01:00
Martin Kroeker	58f236ad73	Use generic/dot.c for DSDOT on zarch	2018-02-25 19:52:14 +01:00
Martin Kroeker	e207107150	Use generic/dot.c for DSDOT on z13 The implementation in arm/dot.c has lower precision, as shown by the utest for dsdot.	2018-02-25 19:51:25 +01:00
Martin Kroeker	c9d408064a	Use dot.S also for DSDOT on CORTEXA57	2018-02-25 19:48:09 +01:00
Martin Kroeker	288d1a3f6e	Use dot.S also for DSDOT on ARMV8	2018-02-25 19:45:16 +01:00
Martin Kroeker	7c1925acec	Use .p2align instead of .align for compatibility on Sandybridge as well	2018-02-24 19:43:15 +01:00
Martin Kroeker	2359c7c1a9	Use .p2align instead of .align for portability The OSX assembler apparently mishandles the argument to decimal .align, leading to a significant loss of performance as observed in #730, #901 and most recently #1470	2018-02-24 17:50:13 +01:00
Martin Kroeker	e7366a4161	Restore the remaining utests (#1462 ) * Restore the remaining utests * Try fork test on Cygwin and Linux only, it hangs on at least ARMv8/Android as well * Use generic sswap/dswap kernels for NEHALEM 32bit to fix fault found by the restored swap utest * Disable zdotu test for MS cl to work around runtime error -1073741819 on AppVeyor for now (probably coding error in the initialization of the complex numbers or wrong choice of zdotu API)	2018-02-20 10:07:17 +01:00
the mslm	2c0a008281	dgemm_ncopy_4_ save/restore	2018-02-18 01:30:17 +00:00
the mslm	c5425daa6b	power8 ?gemm_tcopy save/restore	2018-02-16 23:36:46 +00:00
Martin Kroeker	b47e6822aa	Enable most assembly kernels in the generic ARMV8 target ref #1439	2018-02-06 11:42:58 +01:00
Abdelrauf	60596a1abc	Merge branch 'develop' into develop	2018-01-31 16:17:04 -08:00
Abdelrauf	afd514c25d	small fix inside ifdef z13mvc . (z13mvc code is not used in production)	2018-01-31 18:30:59 -05:00
Martin Kroeker	f45776ec1f	Merge pull request #1440 from quickwritereader/develop small corrections	2018-01-31 23:48:47 +01:00
Martin Kroeker	e388459a27	Merge pull request #1419 from brada4/develop Initialize unitialized values for repeated calls	2018-01-31 23:48:34 +01:00
Abdelrauf	f653e7a18d	small fix small fix inside ifdef z13mvc . (z13mvc code is not used in production)	2018-01-31 07:49:38 -08:00
the mslm	f946a89432	zscal (case: real alpha=0 ) mikrokernel shift&mem fix , da_i as input reg. small typo fixes	2018-01-26 19:25:27 -08:00
Martin Kroeker	485df77612	Make USE_TRMM depend on TARGET_CORE not TARGET Fixes #1432 (and possibly other DTRMM-related failures on Haswell and related architectures when built with cmake)	2018-01-26 23:20:00 +01:00
Martin Kroeker	e4c71a799a	Merge pull request #1426 from quickwritereader/develop (Z13 ) Blas1 mikrokernels can be inlined by gcc. Refactoring,fixes,tunings	2018-01-20 17:34:54 +01:00
the mslm	2619ad7ea5	Blas1 mikrokernels can be inlined by gcc. Refactoring ( symbolic operan names). Some fixes and tunings	2018-01-19 19:24:35 -08:00
Andrew	e5cc3d72c0	core.IdenticalExpr clang501 checker	2018-01-19 23:17:43 +01:00
Andrew	4938faa822	core.IdenticalExpr clang501 checker	2018-01-19 23:15:58 +01:00
Andrew	9fa986337d	add missing brackets to silence indentation warnings gcc721	2018-01-19 23:11:12 +01:00
Andrew	3eed97f6b9	Initialize values to silence cppcheck	2018-01-12 22:35:00 +01:00
Andrew	13e137fbc9	Initialize uninitialized variables (cppcheck)	2018-01-12 22:33:41 +01:00
Martin Kroeker	3d23f45107	Merge pull request #1415 from quickwritereader/develop (Z systems Z13) small fixes, some (i(dz)amin,i(dz)amax,(dz)dot,(dz)asum) mikrokernels…	2018-01-11 08:35:02 +01:00
Abdelrauf	87669d1c0a	small fixes, some (i(dz)amin,i(dz)amax,(dz)dot,(dz)asum) mikrokernels can be inlined	2018-01-10 20:36:53 -05:00
Martin Kroeker	42285d8e70	Merge pull request #1410 from brada4/develop Address warnings #1357	2018-01-06 20:02:46 +01:00
Andrew	d602b99386	LAPACK helpers in C that need care too	2018-01-02 14:38:50 +01:00
Andrew	4d0b005e5b	Eliminate remaining unused results in kernels (clang5 analyzer)	2018-01-01 20:54:39 +01:00
Martin Kroeker	b81656936f	Merge pull request #1409 from martin-frbg/issue1292-2 Tag %1 and %2 as both input and output operands	2017-12-31 20:18:48 +01:00
Martin Kroeker	b973990df2	Tag %1 and %2 as both input and output operands fix from #1292 extended to the other gemv microkernels	2017-12-31 18:03:36 +01:00
Martin Kroeker	1e31124eb0	Merge pull request #1406 from martin-frbg/issue1292 Tag %1 and %2 as both input and output	2017-12-30 14:52:03 +01:00
Martin Kroeker	cc9500db41	Merge pull request #1403 from brada4/develop Address few more warnings	2017-12-30 14:51:34 +01:00
Martin Kroeker	723f396a20	Tag %1 and %2 as both input and output The inline assembly modifies its input operands, so mark them as output to avoid surprises with optimization. Fixes #1292	2017-12-29 23:56:41 +01:00
Andrew	03e5ff0687	initialize potentially unitialized variables (clang5)	2017-12-26 09:24:24 +01:00
Andrew	47deec2c1a	fix couple of dead assignment warnings	2017-12-22 00:56:35 +01:00
Martin Kroeker	43c0622e7b	Retire Piledriver/Steamroller/Excavator daxpy microkernels as well related to issue #1332	2017-12-13 18:40:39 +01:00
Martin Kroeker	0623636c98	Use Sandybridge daxpy kernel on Haswell and Zen for now The testcase from #1332 exposes a problem in daxpy_microk_haswell-2.c that is not seen with any of the other Intel x86_64 microkernels.	2017-12-10 19:24:31 +01:00
Andrew	281a2b952f	warning cleanup (#1380 ) * dead increments in driver/level2 * dead increments in kernel/generic * part dead increments in kernel/x86_64	2017-12-05 19:54:10 +01:00
Martin Kroeker	8213385ab8	Work around compiler warnings for unused variables in the generic zgemm3m_Xcopy kernels	2017-12-02 22:51:58 +01:00
Martin Kroeker	db00a51e6b	Merge pull request #1371 from martin-frbg/develop Add trivially optimized DSDOT for POWER8	2017-11-29 19:55:21 +01:00
martin	7a4b3cfbf8	Add trivially optimized DSDOT for POWER8	2017-11-28 18:38:07 +01:00
Martin Kroeker	6c77b5f267	Merge pull request #1369 from martin-frbg/dsdot Add optimized dsdot to all other x86_64 kernels that use sdot.c	2017-11-28 18:15:31 +01:00
Andrew	441a9c8385	more dead increments clang4 scan-build deadcode.deadstores	2017-11-26 17:24:08 +01:00
Andrew	1236dbe5a6	Eliminate 2-8 dead increments code	2017-11-26 13:26:11 +01:00
Martin Kroeker	c92cd6d162	Add trivially optimized dsdot based on sdot	2017-11-24 20:05:27 +01:00
Martin Kroeker	cae5d9a20b	Add trivially optimized dsdot based on sdot	2017-11-24 20:04:29 +01:00
Martin Kroeker	3d891c3106	Add trivially optimized dsdot based on sdot	2017-11-24 20:03:40 +01:00
Martin Kroeker	4fbdcfa823	Add trivially optimized dsdot based on sdot	2017-11-24 20:02:28 +01:00
Martin Kroeker	1bb6a96ebc	Add trivially optimized dsdot based on sdot	2017-11-24 20:01:42 +01:00
Martin Kroeker	6bd163f37a	Add trivially optimized dsdot based on sdot	2017-11-24 20:00:23 +01:00
Martin Kroeker	f0333333d1	Add trivially optimized dsdot based on sdot	2017-11-24 19:59:28 +01:00
Andrew	e89b979b2c	fix spurious compiler warning fix (no code change)	2017-11-24 18:39:04 +01:00
Andrew	7e9b29b9b8	fix spurious compiler warning (no code change)	2017-11-24 18:36:37 +01:00
Martin Kroeker	6157d0902a	Merge pull request #1358 from martin-frbg/unused_vars Clean up spurious unused variables in the kernels	2017-11-15 11:31:43 +01:00
Martin Kroeker	3fea849bbf	Remove unused variables from Haswell dtrmm and Bulldozer dtrsm	2017-11-14 23:35:10 +01:00
Martin Kroeker	8f177621bc	Remove unused variables at0...at3 from ?symv_U	2017-11-14 23:32:25 +01:00
Martin Kroeker	5f402b7759	Remove unused (loop?) variable j from the gemv_n_4 implementations	2017-11-14 23:29:42 +01:00
Martin Kroeker	65bf0a343c	Remove unused variable btpr	2017-11-14 23:25:50 +01:00
Martin Kroeker	acf3d34bc5	Silence an unused variable warning with a cast l2 cache size is not universally needed to assign default unrolling limits, but neither putting its declaration inside an ifdef nor cloning it into all ifdef sections that need it really makes sense here.	2017-11-14 23:23:44 +01:00
Martin Kroeker	ab87ee6b48	Merge pull request #1329 from martin-frbg/dsdot (Trivial) optimized dsdot implementation for HASWELL	2017-10-25 19:13:38 +02:00
Martin Kroeker	a07807caac	Eliminate loop code when called as/from dsdot	2017-10-25 16:45:41 +02:00
Ashwin Sekhar T K	a0128aa489	ARM64: Convert all labels to local labels While debugging/profiling applications using perf or other tools, the kernels appear scattered in the profile reports. This is because the labels within the kernels are not local and each label is shown as a separate function. To avoid this, all the labels within the kernels are changed to local labels.	2017-10-24 11:40:05 +00:00
Martin Kroeker	0e2cf102e1	Fix 32bit HASWELL	2017-10-24 10:07:44 +02:00
Martin Kroeker	5e3e91d0fc	Split the microkernel workload into chunks of 32 floats for dsdot mode to limit loss of precision	2017-10-22 18:18:51 +02:00
Martin Kroeker	28c3fa8950	Add dsdot	2017-10-16 23:29:03 +02:00
Martin Kroeker	8ac87c1cb6	Implement DSDOT with unchanged sdot microkernels	2017-10-16 23:27:51 +02:00
Martin Kroeker	c7a8512d12	Cmake fixes for DYNAMIC_ARCH builds and whitespace in path names (#1323 ) * prebuild.cmake: Put quotes around path names that may contain whitespace (Copied from alexkaratakis' PR #1295) * kernel/CMakeLists.txt: Fix common_lapack header inclusion and DYNAMIC_ARCH generation of ?neg_tcopy and ?laswp_ncopy files * lapack/CMakeLists.txt: Use correct template for ?laswp_(plus,minus) functions	2017-10-09 23:34:18 +02:00
Martin Kroeker	97ecd4996a	Merge pull request #1319 from martin-frbg/issue601 Fix out-of-bounds memory accesses exposed by xccblat3 testcase	2017-10-08 23:31:06 +02:00

... 10 11 12 13 14 ...

2023 Commits