OpenBLAS

Author	SHA1	Message	Date
Wangyang Guo	2e44ca0136	sbgemm: add missing cblas_sbgemm definition	2021-08-30 17:40:30 +08:00
Wangyang Guo	1d83ca4bca	Small Matrix: support BFLOAT16 data type	2021-08-30 17:40:20 +08:00
Wangyang Guo	c17d6dacb2	Small Matrix: skip compile in unimplemented data type	2021-08-05 05:46:13 +00:00
Wangyang Guo	aa50185647	Small Matrix: better handle with GEMM3M macro	2021-08-05 02:45:53 +00:00
Wangyang Guo	478d1086c1	Small Matrix: support DYNAMIC_ARCH build	2021-08-04 03:12:41 +00:00
Wangyang Guo	5dc7c3c8e5	Small Matrix: add GEMM_SMALL_MATRIX_PERMIT to tune small matrix cases	2021-08-02 07:06:54 +00:00
Xianyi Zhang	6022e5629c	Refs #2587 fix small matrix c/zgemm bug.	2021-08-02 07:06:54 +00:00
Xianyi Zhang	57ed58cefe	Refs #2587 Add small matrix optimization reference kernel for c/zgemm.	2021-08-02 07:06:54 +00:00
Xianyi Zhang	17d32a4a82	Change a1b0 gemm to b0 gemm.	2021-08-02 07:06:54 +00:00
Xianyi Zhang	4271cfcc6f	Fix gemm interface bug for small matrix.	2021-08-02 07:06:51 +00:00
Xianyi Zhang	be3349405d	Add alpha=1.0 beta=0.0 for small gemm.	2021-08-02 07:01:47 +00:00
Xianyi Zhang	0a2077901c	Add small matrix optimization kernel interface. make SMALL_MATRIX_OPT=1	2021-08-02 07:01:47 +00:00
Martin Kroeker	1dea57ab25	Revert PR #3250 (shortcut without buffer allocation) as it is unsafe on some x86_64	2021-07-14 20:32:57 +02:00
Martin Kroeker	7bb59fceb7	Clean up some warnings	2021-07-11 16:00:29 +02:00
Martin Kroeker	4ed99c2ce3	Merge pull request #3292 from martin-frbg/syrk_limit Add lower limit for multithreading in xSYRK	2021-07-07 20:46:28 +02:00
Martin Kroeker	8186963d8c	Add lower limit for multithreading	2021-07-04 17:00:26 +02:00
Martin Kroeker	726c44242b	Add lower threshold for multithreading	2021-07-01 17:41:05 +02:00
Martin Kroeker	1b5620b66e	Add lower threshold for multithreading in ?potrf and ?potri	2021-06-26 23:47:41 +02:00
Martin Kroeker	baf03a0937	Merge pull request #3252 from martin-frbg/more_shortcuts Further shortcuts for (small) cases that do not need buffer allocation	2021-06-15 16:14:20 +02:00
Martin Kroeker	7aab5e826c	Merge pull request #3250 from martin-frbg/gemv-shortcut Add shortcut for small-size S/D GEMV_N with increments of one	2021-06-15 14:50:14 +02:00
Martin Kroeker	f84197c1a7	Add shortcuts for (small) cases that do not need expensive buffer allocation	2021-05-29 22:28:00 +02:00
Martin Kroeker	734bd265a8	revert symv changes for now	2021-05-29 15:40:03 +02:00
Martin Kroeker	1217eb910d	Fix copy-paste errors in variables used	2021-05-28 09:38:48 +02:00
Martin Kroeker	d6d7a6685d	Add shortcuts for (small) cases that do not need expensive buffer allocation	2021-05-27 22:39:18 +02:00
Martin Kroeker	f0e7345fb8	Add shortcut for small-size gemv_n with increments of one	2021-05-26 22:02:34 +02:00
Martin Kroeker	03297ff9f0	Add fast path for small xSYR with INCX==1	2021-05-22 20:41:18 +02:00
Gordon Fossum	8b599836db	Add error message token for SBGEMM in gemm.c	2021-05-04 13:55:02 -05:00
Martin Kroeker	904b221f03	Add cast to prevent overflow of intermediate result	2021-05-01 14:47:22 +02:00
Martin Kroeker	c5fb91f1bc	Fix division by zero in the non-x86 codepath	2021-04-29 09:47:18 +02:00
Harmen Stoppels	ec6b354c32	use /usr/bin/env perl	2021-02-24 14:07:20 +01:00
Martin Kroeker	bd906e3410	fix copy-paste error in build rules for cblas_crotg and cblas_zrotg	2021-01-30 16:46:25 +01:00
Alex Henrie	f1bf2603e6	Remove dead assignment to dflag in rotmg functions	2021-01-14 19:40:32 -07:00
Alex Henrie	6f32991eae	Don't define the mode variable when not needed in gemm functions	2021-01-14 19:40:31 -07:00
Martin Kroeker	a8f249458d	Build CBLAS interfaces for CROTG and ZROTG as well	2021-01-13 00:29:38 +01:00
Martin Kroeker	ac3e2a3fdd	Add CBLAS interfaces for csrot and zdrot	2021-01-12 23:22:00 +01:00
Martin Kroeker	857afcc41d	Use ifeq instead of ifdef for user-definable build options	2020-11-22 16:31:44 +01:00
Chen, Guobing	a7b1f9b1bb	Implementation of BF16-based gemv: (1) add a new API, sbgemv, to support bfloat16-based gemv; (2) implement a generic kernel for sbgemv; (3) implement an avx512-bf16 based kernel for sbgemv. Signed-off-by: Chen, Guobing <guobing.chen@intel.com>	2020-10-29 02:08:23 +08:00
Martin Kroeker	6a1f3e40af	Remove debug printout of object list	2020-10-26 21:37:04 +01:00
Rajalakshmi Srinivasaraghavan	b5d30b390d	Fix build issues with bfloat16. This patch fixes compilation errors due to the recent renaming from SH to SB with BUILD_BFLOAT16.	2020-10-13 11:00:22 -05:00
Martin Kroeker	1e7eb7b7a9	Fix typos in currently unused sections	2020-10-13 09:17:15 +02:00
Martin Kroeker	052f31bc3c	Change "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-12 00:02:16 +02:00
Martin Kroeker	0f7d73ff6d	Allow supporting only a subset of variable types	2020-10-11 14:53:26 +02:00
Martin Kroeker	b475b4bd0d	Support building only a subset of types	2020-09-22 23:25:04 +02:00
Chen, Guobing	deaeb6c5b8	Add bfloat16-based dot and conversion with single/double: (1) added bfloat16-based dot as a new API, shdot; (2) implemented a generic kernel and a Cooperlake-specific (AVX512-BF16) kernel for shdot; (3) added 4 conversion APIs between the bfloat16 data type and single/double: shstobf16 (convert a single float array to a bfloat16 array), shdtobf16 (convert a double float array to a bfloat16 array), sbf16tos (convert a bfloat16 array to a single float array), dbf16tod (convert a bfloat16 array to a double float array); (4) implemented generic kernels for all 4 conversion APIs, and Cooperlake-specific kernels for shstobf16 and shdtobf16; (5) updated level1 threading helper functions and macros to support multi-threading for these new APIs; (6) fixed a Cooperlake platform detection/specification issue when building with dynamic-arch; (7) changed the typedef of bfloat16 from unsigned short to the stricter uint16_t. Signed-off-by: Chen, Guobing <guobing.chen@intel.com>	2020-09-04 02:31:25 +08:00
Martin Kroeker	75eeb265d7	[WIP] Refactor the driver code for direct SGEMM (#2782). Move "direct SGEMM" functionality out of the SkylakeX SGEMM kernel and make it available (on x86_64 targets only for now) in DYNAMIC_ARCH builds: add sgemm_direct targets in the kernel Makefile.L3 and CMakeLists.txt; add direct_sgemm functions to the gotoblas struct in common_param.h; move the sgemm_direct_performant helper to a separate file; update gemm.c to use macros for sgemm_direct to support dynamic_arch naming via common_s.h; (conditionally) add sgemm_direct functions in setparam-ref.c.	2020-08-19 14:51:09 +02:00
Martin Kroeker	fee361ae64	fix another source of NO_CBLAS=0 surprise	2020-08-11 13:27:19 +02:00
Ashwin Sekhar T K	4e1be0e481	ARM64: Add THUNDERX3T110 Target	2020-07-26 23:32:24 -07:00
Martin Kroeker	5dd14e3d48	Make building the bfloat16 functions conditional on option BUILD_HALF (#2590): make building the bfloat16 BLAS functions conditional on BUILD_HALF; pass the BUILD_HALF option to gensymbol; pass BUILD_HALF as a compiler define for dynamic_arch builds.	2020-05-01 09:58:30 +02:00
Martin Kroeker	2db5178e2d	enable cblas interfaces to GEMM3M in CMAKE builds	2020-04-22 11:01:28 +02:00
Rajalakshmi Srinivasaraghavan	7eb55504b1	RFC: Add half-precision gemm for bfloat16 in OpenBLAS. This patch adds support for a bfloat16 data type matrix multiplication kernel. For architectures that don't support bfloat16, it is defined as unsigned short (2 bytes). Default unroll sizes can be changed per architecture, as is done for SGEMM; for now, 8 and 4 are used for M and N. The size of ncopy/tcopy can be changed per architecture requirements; for now, size 2 is used. Added shgemm in kernel/power/KERNEL.POWER9 and tested on powerpc64le and powerpc64. For reference, added a small test, compare_sgemm_shgemm.c, to compare sgemm and shgemm output. This patch does not cover OpenBLAS test, benchmark, and LAPACK tests for shgemm. The complex type implementation can be discussed and added once this is approved.	2020-04-14 14:55:08 -05:00
Martin Kroeker	8229c163b7	Use runtime check for AVX512 (sgemm_direct) capability when using DYNAMIC_ARCH	2020-03-26 21:12:56 +01:00
Martin Kroeker	6a14b34c20	Avoid calling DIRECT codepath in DYNAMIC_ARCH on non-SKX	2020-03-22 14:33:16 +01:00
Martin Kroeker	d65e9a2bbd	Merge pull request #2253 from thrasibule/xerbla fix error messages	2020-01-18 20:39:04 +01:00
Guillaume Horel	2463938879	fix error message	2019-09-11 10:35:25 -04:00
Guillaume Horel	5d6525c87c	more bugfix	2019-09-10 17:30:57 -04:00
Guillaume Horel	459bb9291d	fix error codes	2019-09-10 17:10:33 -04:00
Guillaume Horel	5997b6b491	bugfix	2019-09-08 11:14:49 -04:00
Guillaume Horel	7ec7b999a5	add missing file	2019-09-08 11:14:49 -04:00
Guillaume Horel	af9ac0898a	fix Makefile	2019-09-08 11:14:49 -04:00
Guillaume Horel	9b2f0323d6	update Makefile	2019-09-08 11:14:49 -04:00
Guillaume Horel	ea747cf933	start working on ?trtrs	2019-09-08 11:14:49 -04:00
luz.paz	daf2fec12d	Misc. typo fixes. Found via `codespell -q 3 -w -L ith,als,dum,nd,amin,nto,wis,ba -S ./relapack,./kernel,./lapack-netlib`	2019-04-29 17:03:56 -04:00
Martin Kroeker	268c28db7d	Merge pull request #2095 from martin-frbg/trsm Correct length of name string in xerbla call	2019-04-28 09:55:25 +02:00
Martin Kroeker	0bd956fd21	Correct length of name string in xerbla call	2019-04-27 22:49:04 +02:00
Martin Kroeker	79cfc24a62	Add interface for ?sum (derived from ?asum)	2019-03-30 21:59:18 +01:00
Martin Kroeker	c19a449096	Merge pull request #2071 from martin-frbg/issue2068 Provide CBLAS interfaces to I?MIN and I?MAX	2019-03-30 14:54:28 +01:00
Martin Kroeker	3d1e36d4cb	Build CBLAS interfaces for I?MIN and I?MAX	2019-03-30 12:38:41 +01:00
Martin Kroeker	e29b0cfcc4	Allow multithreading TRMV again. Revert workaround introduced for issue #1332 as the actual cause appears to be my incorrect fix from #1262 (see #1388).	2019-02-19 21:03:30 +01:00
Martin Kroeker	8533aca964	Avoid penalizing tall skinny matrices	2019-01-23 10:03:00 +01:00
Martin Kroeker	cda81cfae0	Shift transition to multithreading towards larger matrix sizes. See #1886 and JuliaRobotics issue 500. trsm benchmarks on Haswell and Zen showed that with these values performance is roughly doubled for matrix sizes between 8x8 and 14x14, and still 10 to 20 percent better near the new cutoff at 32x32.	2019-01-19 00:10:01 +01:00
Arjan van de Ven	cdc668d82b	Add a "sgemm direct" mode for small matrices. OpenBLAS has a fancy algorithm for copying the input data while laying it out in a more CPU-friendly memory layout. This is great for large matrices; the cost of the copy is easily amortized by the gains from the better memory layout. But for small matrices (on CPUs that can do efficient unaligned loads) this copy can be a net loss. This patch adds (for SKYLAKEX initially) a "sgemm direct" mode that bypasses the whole copy machinery for ALPHA=1/BETA=0/... standard arguments, for small matrices only. What is small? For the non-threaded case this has been measured to be in the MNK = 28 * 512 * 512 range, while in the threaded case it's less, around MNK = 1 * 512 * 512.	2018-12-13 13:47:31 +00:00
Martin Kroeker	5393759a98	Merge pull request #1869 from martin-frbg/axpy0 Handle special case INCX=0,INCY=0 in the axpy interface	2018-11-25 20:52:49 +01:00
Martin Kroeker	c171b8ad13	Handle special case INCX=0,INCY=0 in the axpy interface	2018-11-13 13:57:18 +01:00
Martin Kroeker	96d2f2c9b2	Merge pull request #1831 from brada4/hemv disable threading in C/ZSWAP copying from S/DSWAP	2018-11-07 08:49:21 +01:00
Andrew	2992e3886a	disable threading in C/ZSWAP copying from S/DSWAP	2018-10-22 23:21:49 +03:00
Martin Kroeker	e3c262e5cf	Merge pull request #1825 from brada4/hemv Delay _hemv threading in attempt to address #1820	2018-10-21 20:34:05 +02:00
Andrew	a293bdcd5e	re-arrange new code for readability	2018-10-20 21:37:53 +03:00
Andrew	c7bbf9c987	Attempt to tame _hemv threading #1820	2018-10-20 11:13:29 +03:00
Ashwin Sekhar T K	21f46a1cf2	ARM64: Use THUNDERX2T99 Neon Kernels for ARMV8. Currently the generic ARMV8 target uses C implementations for many routines. Replace these with the Neon implementations written for the THUNDERX2T99 target, which are up to 6x faster for certain routines.	2018-10-17 10:44:37 -07:00
Martin Kroeker	b991570210	Merge pull request #1762 from martin-frbg/issue1710-2 Add explicit casts to silence compiler warnings	2018-09-19 18:16:21 +02:00
Martin Kroeker	f3c262156e	Add an explicit cast to silence a warning for #1710	2018-09-13 14:24:29 +02:00
Martin Kroeker	30f5a69ab8	Add explicit cast to silence a warning for #1710	2018-09-13 14:23:31 +02:00
Martin Kroeker	4a553e8678	Merge pull request #1713 from martin-frbg/issue1710 Introduce blasabs macro and use it to switch between abs and labs for INTERFACE64	2018-08-04 23:51:31 +02:00
Martin Kroeker	165f00c159	fabs -> fabsl	2018-08-04 20:14:51 +02:00
Martin Kroeker	933896a1d0	Use blasabs to switch between abs and labs as needed for INTERFACE64	2018-08-04 20:06:49 +02:00
Steven G. Johnson	a4e321400b	fabs -> fabsl. Fixes two calls that were using `fabs` on a `long double` argument rather than `fabsl`, which looks like it is doing an unintentional truncation to `double` precision.	2018-08-03 13:00:10 -04:00
Martin Kroeker	9cf22b7d91	Build cblas_iXamin interfaces	2018-06-23 13:27:30 +02:00
Craig Donner	c2545b0fd6	Fixed a few more unnecessary calls to num_cpu_avail. I don't have as many benchmarks for these as for gemm, but it should still make a difference for small matrices.	2018-06-11 10:17:16 +01:00
Craig Donner	66316b9f4c	Improve performance of GEMM for small matrices when SMP is defined. Always checking num_cpu_avail() regardless of whether threading will actually be used adds noticeable overhead for small matrices. Most other uses of num_cpu_avail() do so only if threading will be used, so do the same here.	2018-06-07 15:29:13 +01:00
Martin Kroeker	e8880c1699	Use a single thread for small input size copies. daxpy improvement from #27, see #1560	2018-06-07 10:26:55 +02:00
Martin Kroeker	1d27fa8507	Merge pull request #1539 from martin-frbg/ztrmv-1332 Disable multithreading in ztrmv	2018-04-27 23:10:21 +02:00
Martin Kroeker	a8ed428bab	Disable multithreading in ztrmv. BLAS-Tester shows that the same problem exists as with DTRMV (issue #1332)	2018-04-25 22:35:46 +02:00
Martin Kroeker	809fd0d451	Rewrite ROTMG to address cases not covered by the netlib algorithm (#1480): rewrite ROTMG based on the new implementation in GONUM, which follows the algorithm proposed by Tim Hopkins (see issue 1452 for the reference); correct the ROTMG utest for issue1452 and add another from gonum; also correct the transposition of expected and observed values in error messages.	2018-03-04 17:39:56 +01:00
Martin Kroeker	72f14a0363	Fix conditionals in the rescaling against GAMSQ	2018-02-18 12:54:52 +01:00
Martin Kroeker	798f1595d5	Fix condition in both second scaling loops	2018-02-18 12:37:09 +01:00
Martin Kroeker	0464aa6784	Remove debug printfs	2018-02-09 23:06:50 +01:00
Martin Kroeker	55840f0bc9	Keep the flag handling separate from the scaling loops. Fixes #1452 and is more in line with how ATLAS does it. The earlier fix from #356 only moved the bug elsewhere, but we will never want the iterative rescaling to change the dflag setting and variable associations with each cycle.	2018-02-09 23:00:03 +01:00
Andrew	47deec2c1a	fix a couple of dead assignment warnings	2017-12-22 00:56:35 +01:00
Martin Kroeker	38763ec4f3	Disable multithreading for trmv as a (hopefully temporary) workaround for #1332	2017-12-03 22:40:54 +01:00
Martin Kroeker	9251a2efde	Merge pull request #1359 from brada4/develop Eliminate mode variable where not needed in syrk interface	2017-11-18 23:47:17 +01:00

338 Commits
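Commit a4e321400b (fabs -> fabsl) concerns calls that passed a `long double` to `fabs`. In C, `<math.h>` declares `fabs` as taking and returning `double`. A `long double` argument is therefore silently converted to `double` before the absolute value is taken. `fabsl` keeps the full `long double` precision. The following standalone sketch illustrates the difference. It is not OpenBLAS code, and the function names `abs_via_fabs`, `abs_via_fabsl` and `compare_abs` are hypothetical.

```c
#include <math.h>
#include <stdio.h>

/* Buggy pattern (hypothetical helper): fabs() is declared as double fabs(double),
   so the long double argument is converted to double first. */
long double abs_via_fabs(long double x) {
    return fabs(x);
}

/* Fixed pattern (hypothetical helper): fabsl() takes and returns long double,
   so no precision is lost. */
long double abs_via_fabsl(long double x) {
    return fabsl(x);
}

/* Prints both results for a value that needs more than double's 53-bit
   significand. Whether the two lines differ depends on how wide the
   platform's long double is. */
void compare_abs(void) {
    long double x = -(1.0L + 0x1p-60L);
    printf("fabs : %.21Lg\n", abs_via_fabs(x));
    printf("fabsl: %.21Lg\n", abs_via_fabsl(x));
}
```

On a target where `long double` is wider than `double` (for example, the 80-bit x87 format used by GCC on x86-64 Linux), calling `compare_abs()` from `main` should print two different values, because the `fabs` path rounds the input to exactly 1. On targets where `long double` has the same format as `double`, both lines print the same value.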