OpenBLAS

Author	SHA1	Message	Date
Martin Kroeker	f1bf040b25	Merge pull request #2988 from xiegengxin/smp-asum Improve the performance of dasum and sasum when SMP is defined	2020-11-22 12:24:13 +01:00
Gengxin Xie	d6e7e05bb3	Improve the performance of dasum and sasum when SMP is defined	2020-11-13 14:20:52 +08:00
Qiyu8	a87e537b8c	modify macro	2020-11-11 15:53:48 +08:00
Qiyu8	5bc0a7583f	only FMA3 and vector larger than 128 have positive effects.	2020-11-11 15:18:01 +08:00
Qiyu8	8c0b206d4c	Optimize the performance of rot by using universal intrinsics	2020-11-11 14:33:12 +08:00
Martin Kroeker	ff16329cb7	Merge pull request #2972 from xiegengxin/rot-intrinsic Improve the performance of rot by using AVX512 and AVX2 intrinsic	2020-11-08 22:43:00 +01:00
Gengxin Xie	725ffbf041	fix typo	2020-11-05 16:25:17 +08:00
Gengxin Xie	d9ba49165a	Improve the performance of rot by using AVX512 and AVX2 intrinsic	2020-11-05 15:12:36 +08:00
Chen, Guobing	a7b1f9b1bb	Implementation of BF16 based gemv 1. Add a new API -- sbgemv to support bfloat16 based gemv 2. Implement a generic kernel for sbgemv 3. Implement an avx512-bf16 based kernel for sbgemv Signed-off-by: Chen, Guobing <guobing.chen@intel.com>	2020-10-29 02:08:23 +08:00
İsmail Dönmez	4a1d00f589	Fix build with -Werror=return-type dgemm_tcopy_16_skylakex.c CNAME function should return an int, add a return 0 similar to other files.	2020-10-21 08:43:39 +02:00
Bart Oldeman	b073d759d0	x86_64: clobber all xmm registers after vzeroupper As observed using GCC 10 using -march=native -ftree-vectorize on Knights Landing, it is now smart enough to find clobbers inside non-inlined static functions. In particular, sgemv counted on a kernel to preserve the whole %ymm2 register (since it was not in the clobber list), but the top part was destroyed by vzeroupper. This caused many tests to fail. This patch makes sure all xmm (and ymm/zmm by extension) registers are listed as clobbered to avoid this happening, as most kernels already did correctly in fact.	2020-10-20 02:16:47 +00:00
Bart Oldeman	03e781b766	sgemm_direct_skylakex: fix `75eeb26` regression. The `#if defined(SKYLAKEX) \|\| defined (COOPERLAKE)` from that commit was before #include "common.h" so caused the compiled function to be empty, returning garbage results for qualifying sgemm's on those architectures. Closes #2914	2020-10-18 19:58:07 +00:00
Martin Kroeker	c339c40c01	Silence a redefinition warning	2020-10-15 19:08:12 +02:00
Qiyu8	bfdf4b56da	Add double precision universal intrinsics for X86/ARM	2020-10-15 10:29:42 +08:00
Martin Kroeker	756802df61	Merge pull request #2890 from martin-frbg/s-d-sum Revert special handling of Windows xNRM2 and enable C+intrinsics kern…	2020-10-14 09:02:03 +02:00
Martin Kroeker	8d2df7d066	Revert special handling of Windows xNRM2 and enable C+intrinsics kernel for SSUM/DSUM	2020-10-13 00:14:29 +02:00
Martin Kroeker	08929430cd	Merge pull request #2886 from martin-frbg/issue_2767 Rename "HALF" precision functions (sh prefix) to "BFLOAT16" with "sb" prefix	2020-10-13 00:04:35 +02:00
Martin Kroeker	0c84ffe05f	Merge pull request #2881 from mattip/fninit add fninit to reset fpu registers before assembler routines	2020-10-12 23:50:41 +02:00
Matti Picus	403eb513a0	use emms instead, add WIN guards	2020-10-12 18:15:01 +03:00
Martin Kroeker	dc8a1afa63	Rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:53:50 +02:00
Martin Kroeker	fd94236042	Rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:42:07 +02:00
Martin Kroeker	68ce719fac	Rename shdot_microk_cooperlake.c to sbdot_microk_cooperlake.c	2020-10-11 23:41:13 +02:00
Martin Kroeker	d7dd9b396c	Rename shdot.c to sbdot.c	2020-10-11 23:40:43 +02:00
Martin Kroeker	7812486091	Use generic C for D/Z nrm2 kernels on Windows to work around fpu exception bug	2020-10-06 21:33:16 +02:00
Matti Picus	a5b164946c	add fninit to reset fpu registers before assembler routines	2020-10-05 22:13:25 +03:00
Qiyu8	14f7dad3b7	performance improved	2020-09-22 16:52:15 +08:00
Qiyu8	325b539c26	Optimize the performance of daxpy by using universal intrinsics	2020-09-22 10:38:35 +08:00
Martin Kroeker	91c84e1c01	Merge pull request #2796 from Guobing-Chen/BF16_dot_coversion_apis Add bfloat16 based dot and conversion with single/double	2020-09-14 15:00:19 +02:00
Martin Kroeker	e72430fe46	Merge pull request #2803 from xiegengxin/AVX2-asum Implementaion of dasum, sasum with AVX2 & AVX512 intrinsic	2020-09-06 18:32:15 +02:00
Chen, Guobing	deaeb6c5b8	Add bfloat16 based dot and conversion with single/double 1. Added bfloat16 based dot as new API: shdot 2. Implemented generic kernel and cooperlake-specific (AVX512-BF16) kernel for shdot 3. Added 4 conversion APIs for bfloat16 data type <=> single/double: shstobf16 shdtobf16 sbf16tos dbf16tod shstobf16 -- convert single float array to bfloat16 array shdtobf16 -- convert double float array to bfloat16 array sbf16tos -- convert bfloat16 array to single float array dbf16tod -- convert bfloat16 array to double float array 4. Implemented generic kernels for all 4 conversion APIs, and cooperlake-specific kernel for shstobf16 and shdtobf16 5. Update level1 thread facilitate functions and macros to support multi-threading for these new APIs 6. Fix Cooperlake platform detection/specify issue when under dynamic-arch building 7. Change the typedef of bfloat16 from unsigned short to more strict uint16_t Signed-off-by: Chen, Guobing <guobing.chen@intel.com>	2020-09-04 02:31:25 +08:00
Gengxin Xie	1b0f17eeed	align to 64, using SSE when input size is small	2020-09-03 14:25:54 +08:00
Gengxin Xie	448152cdd8	define __AVX2__ to ensure the haswell code compiled with avx2	2020-08-31 14:39:08 +08:00
Gengxin Xie	cb3c190a3a	Implementaion of dasum, sasum with AVX2 & AVX512 intrinsic	2020-08-31 11:44:08 +08:00
Martin Kroeker	b2053239fc	Fix mssing dummy parameter (imag part of alpha) of zdot_thread_function	2020-08-23 15:08:16 +02:00
Martin Kroeker	9ee21a0a39	Merge pull request #2780 from Guobing-Chen/CPL_build_support Enable COOPERLAKE build target	2020-08-20 19:54:29 +02:00
Martin Kroeker	75eeb265d7	[WIP] Refactor the driver code for direct SGEMM (#2782 ) Move "direct SGEMM" functionality out of the SkylakeX SGEMM kernel and make it available (on x86_64 targets only for now) in DYNAMIC_ARCH builds * Add sgemm_direct targets in the kernel Makefile.L3 and CMakeLists.txt * Add direct_sgemm functions to the gotoblas struct in common_param.h * Move sgemm_direct_performant helper to separate file * Update gemm.c to macros for sgemm_direct to support dynamic_arch naming via common_s,h * (Conditionally) add sgemm_direct functions in setparam-ref.c	2020-08-19 14:51:09 +02:00
Chen, Guobing	e740c4873d	Enable COOPERLAKE build target Enable new build target platform -- COOPERLAKE. This target platform supports all the SKYLAKEX supported ISAs + avx512bf16. So all the SKYLAKEX specific kernels/drivers and related code are now extended to be also active on COOPERLAKE. Besides, new BF16 related kernels are active under this target.	2020-08-13 06:18:00 +08:00
Martin Kroeker	81dcfdcf39	Multiply by 2 instead of left-shifting a potentially negative number fixes GCC ubsan warning in the BLAS tests	2020-08-02 18:29:56 +02:00
Martin Kroeker	0ef4b3f1f2	Multiply instead of doing a left shift of a potentially negative number fixes GCC ubsan report in the BLAS tests	2020-08-02 18:27:40 +02:00
Martin Kroeker	aa53a8a5cb	Multiply by two instead of left-shifting one place fixes GCC ubsan report of "left shift of negative value -2" in the BLAS tests	2020-08-02 18:25:09 +02:00
Martin Kroeker	aa3a1e7d8c	Multiply by two rather than left shift by one place fixes GCC ubsan report of "left shift of negative value -2" in the BLAS tests	2020-08-02 18:22:31 +02:00
Martin Kroeker	e30ad0e521	Strip UTF8 byte order marker from source	2020-06-26 09:00:43 +02:00
Martin Kroeker	93592d1260	Merge pull request #2675 from wjc404/develop AVX512 DGEMM TCOPY_16 Function	2020-06-23 09:29:02 +02:00
wjc404	086d87a302	AVX512 dgemm tcopy_16 function	2020-06-20 00:07:43 +08:00
Martin Kroeker	c3574ffe53	Merge pull request #2646 from wjc404/develop Optimize AVX512 parallel DGEMM performance	2020-06-07 13:18:22 +02:00
wjc404	0e3ac4a06b	Add files via upload	2020-06-06 14:56:57 +08:00
Martin Kroeker	2271c3506b	Work around excessive LAPACK test failures on Skylake-X Something in the plain C parts of x86_64 cscal.c and zscal.c appears to be miscompiled by both gfortran9 and ifort when compiling for skylakex-avx512, even when the optimized Haswell microkernel is not in use.	2020-05-09 23:49:18 +02:00
Martin Kroeker	90dba9f716	Duplicate earlier Clang 9.0.0 workaround for corresponding Apple Clang version As discussed on the original PR #2329, the "Apple Clang 11.0.3" that appears to be based the same LLVM release produces the same miscompilation of this file.	2020-05-05 10:44:50 +02:00
Martin Kroeker	5b0093b5fe	Convert aligned moves to unaligned should have no performance impact on reasonably modern cpus and fixes occasional crashes in actual user code.	2020-04-13 14:58:52 +02:00
Martin Kroeker	567d2760e6	Merge pull request #2520 from wjc404/develop Fix avx512 sgemm performance bug when ldc is a multiple of 1024	2020-03-30 20:15:59 +02:00

1 2 3 4 5 ...

625 Commits