OpenBLAS

Author	SHA1	Message	Date
Zhang Xianyi	d7ba7679b6	Merge branch 'develop' into risc-v	2020-10-16 23:27:38 +08:00
Martin Kroeker	df70667043	fix core list for sse/sse2	2020-10-16 09:55:48 +02:00
Martin Kroeker	f071d1207a	add sse2	2020-10-15 22:10:32 +02:00
Martin Kroeker	dc6cefd2f5	Expressly enable -msse for 32bit DYNAMIC_ARCH kernels	2020-10-15 20:16:15 +02:00
Martin Kroeker	c339c40c01	Silence a redefinition warning	2020-10-15 19:08:12 +02:00
Martin Kroeker	10379fc83b	Use ifdef instead of if	2020-10-15 19:05:37 +02:00
Martin Kroeker	4c25910da0	Merge pull request #2896 from martin-frbg/intrin-double Add compiler flag for SSE4 where available	2020-10-15 11:12:35 +02:00
Martin Kroeker	ae6ac83991	Revert "add double precision SSE"	2020-10-15 08:37:02 +02:00
Qiyu8	4fac91ef37	adapt arm platform	2020-10-15 11:08:10 +08:00
Qiyu8	bfdf4b56da	Add double precision universal intrinsics for X86/ARM	2020-10-15 10:29:42 +08:00
Martin Kroeker	ebf0470fc2	add sse4.1 for DYNAMIC_ARCH kernels	2020-10-14 20:34:33 +02:00
Martin Kroeker	c9c3ae07af	Add double precision operations	2020-10-14 18:10:45 +02:00
Martin Kroeker	756802df61	Merge pull request #2890 from martin-frbg/s-d-sum Revert special handling of Windows xNRM2 and enable C+intrinsics kern…	2020-10-14 09:02:03 +02:00
Rajalakshmi Srinivasaraghavan	0826d68f93	POWER10: Change the packing format for bfloat16 As the new MMA instructions need the inputs in 4x2 order for bfloat16, changing the format in copy/packing code. This avoids permute instructions in the gemm kernel inner loop.	2020-10-13 16:05:10 -05:00
Rajalakshmi Srinivasaraghavan	b5d30b390d	Fix build issues with bfloat16 This patch fixes compilation errors due to recent renaming from SH to SB with BUILD_BFLOAT16.	2020-10-13 11:00:22 -05:00
Martin Kroeker	fecedc9c69	Add -mssse3	2020-10-13 11:55:41 +02:00
Martin Kroeker	0eacbca85f	Add Haswell and Zen to temporary sse3 whitelist	2020-10-13 11:42:39 +02:00
Martin Kroeker	6999086a2b	whitelist SANDYBRIDGE for SSE3	2020-10-13 10:32:19 +02:00
Martin Kroeker	8d2df7d066	Revert special handling of Windows xNRM2 and enable C+intrinsics kernel for SSUM/DSUM	2020-10-13 00:14:29 +02:00
Martin Kroeker	08929430cd	Merge pull request #2886 from martin-frbg/issue_2767 Rename "HALF" precision functions (sh prefix) to "BFLOAT16" with "sb" prefix	2020-10-13 00:04:35 +02:00
Martin Kroeker	0c84ffe05f	Merge pull request #2881 from mattip/fninit add fninit to reset fpu registers before assembler routines	2020-10-12 23:50:41 +02:00
Matti Picus	403eb513a0	use emms instead, add WIN guards	2020-10-12 18:15:01 +03:00
Qiyu8	0ed1f07660	Optimize the performance of sum by using universal intrinsics	2020-10-12 19:48:53 +08:00
Martin Kroeker	3aecafad80	Change "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-12 00:00:55 +02:00
Martin Kroeker	756062afa5	Rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:56:17 +02:00
Martin Kroeker	2061f7fdff	Rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:54:53 +02:00
Martin Kroeker	dc8a1afa63	Rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:53:50 +02:00
Martin Kroeker	fd94236042	Rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:42:07 +02:00
Martin Kroeker	68ce719fac	Rename shdot_microk_cooperlake.c to sbdot_microk_cooperlake.c	2020-10-11 23:41:13 +02:00
Martin Kroeker	d7dd9b396c	Rename shdot.c to sbdot.c	2020-10-11 23:40:43 +02:00
Martin Kroeker	9ae80490e0	rename "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-11 23:39:42 +02:00
Martin Kroeker	d314d1f49f	Rename shgemm_kernel_power10.c to sbgemm_kernel_power10.c	2020-10-11 23:37:38 +02:00
Martin Kroeker	c589c3e2a1	Merge pull request #2882 from martin-frbg/issue2709 Use generic C for (D/Z)NRM2 on Windows x86_64	2020-10-11 22:22:30 +02:00
Martin Kroeker	ec638a82bf	Merge pull request #2852 from martin-frbg/issue2588-cmake Support building only a subset of variable types	2020-10-11 22:21:33 +02:00
Martin Kroeker	6b6adf8a4a	Allow compiling only a subset of kernels for specific variable types	2020-10-11 14:52:09 +02:00
Martin Kroeker	ac653c94f3	Merge branch 'develop' into issue2588-cmake	2020-10-11 13:57:07 +02:00
Martin Kroeker	7a53128481	Add whitelist of DYNAMIC_ARCH kernels for which -msse3 needs to be enabled	2020-10-11 01:06:46 +02:00
Martin Kroeker	e1b7123bbe	Merge pull request #2867 from Qiyu8/usimd-floatdot Optimize the performance of dot by using universal intrinsics in X86/ARM	2020-10-10 12:10:25 +02:00
Qiyu8	f32d34a015	add sse3 compiler flag	2020-10-10 10:36:15 +08:00
Martin Kroeker	7812486091	Use generic C for D/Z nrm2 kernels on Windows to work around fpu exception bug	2020-10-06 21:33:16 +02:00
Matti Picus	a5b164946c	add fninit to reset fpu registers before assembler routines	2020-10-05 22:13:25 +03:00
User User-User	d2333e7842	aarch64 fix std=c18 compilation	2020-10-03 18:00:34 +03:00
Qiyu8	60e6c68e38	Adapt ARM architect	2020-09-29 16:36:14 +08:00
Qiyu8	1b1a757f5f	Optimize the performance of dot by using universal intrinsics in X86/ARM	2020-09-28 20:36:53 +08:00
Rajalakshmi Srinivasaraghavan	2df4235e00	Optimize dcopy/zcopy for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores. Tested in simulator and no new failures.	2020-09-27 21:42:32 -05:00
Martin Kroeker	dfbc62ef7e	Support building only a subset of types	2020-09-22 23:25:59 +02:00
Qiyu8	14f7dad3b7	performance improved	2020-09-22 16:52:15 +08:00
Qiyu8	325b539c26	Optimize the performance of daxpy by using universal intrinsics	2020-09-22 10:38:35 +08:00
Marius Hillenbrand	22aa81f3e5	s390x: fix cscal and zscal implementations The implementation of complex scalar * vector multiplication for Z14 makes some LAPACK tests fail because the numerical differences to the reference implementation exceed the threshold (as can be seen by running make lapack-test and replacing kernel/zarch/cscal.c with a generic implementation for comparison). The complex multiplication uses terms of the form a * b + c * d for both real and imaginary parts. The assembly code (and compiler-emitted code as well) uses fused multiply add operations for the second product and sum. The results can be "surprising", for example when both terms in the imaginary part nearly cancel each other out. In that case, the second product contributes more digits to the sum than the first product that has been rounded before. One option is to use separate multiplications (which then round the same way) and a distinct add. Change the code to pursue that path, by (1) requesting the compiler not to contract the operations into FMAs and (2) replacing the assembly kernel with corresponding vectorized C code (where change 1 also applies). Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-21 13:10:05 +02:00
Marius Hillenbrand	f91057cbad	s390x: move common vector definitions and utils into header ... to facilitate reuse beyond gemm_vec.c and avoid code duplication. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-21 11:32:08 +02:00
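
The rounding effect described in the s390x cscal/zscal fix (22aa81f3e5) can be reproduced with a small sketch. This is not the code from that commit, which is vectorized C under kernel/zarch. It is a Python illustration in which the fused multiply-add is emulated with exact rational arithmetic, and the helper names `fma`, `imag_separate` and `imag_contracted` are hypothetical. The inputs are chosen so that the two products in the imaginary part cancel exactly in real arithmetic.

```python
from fractions import Fraction


def fma(a: float, b: float, c: float) -> float:
    """Emulate a fused multiply-add: compute a*b + c exactly, round once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def imag_separate(a: float, b: float, c: float, d: float) -> float:
    """Imaginary part of (a+bi)*(c+di): both products are rounded, then added."""
    return a * d + b * c


def imag_contracted(a: float, b: float, c: float, d: float) -> float:
    """Same expression, but the second product and the sum are fused.

    The first product a*d is rounded to double precision; the second
    product b*c enters the sum unrounded, so it carries extra digits.
    """
    return fma(b, c, a * d)


# x*x needs more than 53 bits, so the rounded product drops a 2**-60 term.
x = 1.0 + 2.0 ** -30

# (x - xi) * (x + xi) is purely real in exact arithmetic.
a, b = x, -x
c, d = x, x

print("separate products:", imag_separate(a, b, c, d))
print("contracted (FMA): ", imag_contracted(a, b, c, d))
```

With separate multiplications both products round the same way and cancel, while the contracted form leaves behind the rounding error of the first product. In C, contraction is typically controlled through a compiler option or pragma; which mechanism the commit uses is not stated in the log.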

1526 Commits