OpenBLAS

Author	SHA1	Message	Date
Rajalakshmi Srinivasaraghavan	b5d30b390d	Fix build issues with bfloat16 This patch fixes compilation errors due to recent renaming from SH to SB with BUILD_BFLOAT16.	2020-10-13 11:00:22 -05:00
Martin Kroeker	006c7f6671	Change "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-12 00:06:06 +02:00
Martin Kroeker	85154c2e18	Change "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-12 00:05:05 +02:00
Martin Kroeker	887e00fd7f	Adapt for supporting only a subset of variable types	2020-10-11 14:58:57 +02:00
Martin Kroeker	886a8e3190	Adapt for supporting only a subset of variable types	2020-10-11 14:57:32 +02:00
Martin Kroeker	ac653c94f3	Merge branch 'develop' into issue2588-cmake	2020-10-11 13:57:07 +02:00
Martin Kroeker	f032d8966e	Merge pull request #2874 from Flamefire/memory_fixes Avoid out of bounds access on invalid memory free	2020-10-04 15:16:51 +02:00
Martin Kroeker	f6e4cf2f9d	Merge pull request #2876 from Flamefire/omp_fork_fix Lazily reinit threads after a fork in OMP mode	2020-10-03 22:52:17 +02:00
User User-User	d2333e7842	aarch64 fix std=c18 compilation	2020-10-03 18:00:34 +03:00
Alexander Grund	3094fc6c83	Lazily reinit threads after a fork in OMP mode This initializes the per-thread memory buffers which get cleared/released on a fork via pthread_atfork. Not doing so leads to each thread calling blas_memory_alloc on almost every execution, which slows down the code significantly as the threads race for the memory allocation, using locks to serialize it.	2020-10-01 15:41:42 +02:00
Alexander Grund	3c05f54df8	Avoid out of bounds access on invalid memory free	2020-10-01 10:48:45 +02:00
Alexander Grund	dee7c49938	Fix TABs and trailing space	2020-10-01 10:43:16 +02:00
Martin Kroeker	896bbd55e1	Add support for building only selected variable types	2020-09-26 23:25:55 +02:00
Martin Kroeker	357bff06b5	Add BUILD_vartype defines	2020-09-22 23:24:22 +02:00
Martin Kroeker	988a6f429e	Add BUILD_vartype defines	2020-09-22 23:23:33 +02:00
Martin Kroeker	e5e2fbd593	Support building only selected types	2020-09-22 23:21:30 +02:00
Martin Kroeker	3287848c8f	Support building only selected types	2020-09-22 23:20:51 +02:00
y00512012	06cf73a239	fix a bug of trmm	2020-09-22 16:47:10 +08:00
Martin Kroeker	ddec244a5a	Merge pull request #2838 from austinpagan/gordon_trmm Adding performance patch for trmm, just like trsm (#2836)	2020-09-15 21:17:48 +02:00
fossum	dfeca46098	Adding performance patch for trmm, just like #2836	2020-09-15 08:59:50 -05:00
fossum	274d6e015b	Fixing a performance bug in trsm_[LR].c.	2020-09-14 13:10:48 -05:00
Martin Kroeker	91c84e1c01	Merge pull request #2796 from Guobing-Chen/BF16_dot_coversion_apis Add bfloat16 based dot and conversion with single/double	2020-09-14 15:00:19 +02:00
Marius Hillenbrand	a55fe06f25	s390x/DYNAMIC_ARCH: define a HW_CAP flag to support slightly older glibc versions Enable building DYNAMIC_ARCH support with older versions of glibc that do not know about the hwcap flag HWCAP_S390_VXE yet. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-08 19:34:18 +02:00
Marius Hillenbrand	4f34bcfb5e	s390x/DYNAMIC_ARCH: pass supported arch levels from Makefile to run-time code ... instead of duplicating the (old) mechanism from the Makefile that aimed to derive supported architecture generations from the gcc version. To enable builds with DYNAMIC_ARCH with older compiler releases, the Makefile and drivers/other/dynamic_arch.c need a common view of the architecture support built into the library. We follow the notation from x86 when used with DYNAMIC_LIST, where defines DYN_<ARCH NAME> denote support for a given generation to be built in. Since there are far fewer architecture generations in OpenBLAS for s390x, that does not bloat command lines too much. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-08 19:34:18 +02:00
Martin Kroeker	330044d821	Fix potential domain error in sqrt	2020-09-05 09:44:33 +02:00
Chen, Guobing	deaeb6c5b8	Add bfloat16 based dot and conversion with single/double 1. Added bfloat16 based dot as new API: shdot. 2. Implemented generic kernel and cooperlake-specific (AVX512-BF16) kernel for shdot. 3. Added 4 conversion APIs for bfloat16 data type <=> single/double: shstobf16 (convert single float array to bfloat16 array), shdtobf16 (convert double float array to bfloat16 array), sbf16tos (convert bfloat16 array to single float array), dbf16tod (convert bfloat16 array to double float array). 4. Implemented generic kernels for all 4 conversion APIs, and cooperlake-specific kernels for shstobf16 and shdtobf16. 5. Updated level1 thread facilitate functions and macros to support multi-threading for these new APIs. 6. Fixed Cooperlake platform detection/specification issue under dynamic-arch builds. 7. Changed the typedef of bfloat16 from unsigned short to the stricter uint16_t. Signed-off-by: Chen, Guobing <guobing.chen@intel.com>	2020-09-04 02:31:25 +08:00
Chen, Guobing	0c1c903f1e	Fix OMP num specify issue In the current code, no matter what number of threads is specified, all available CPUs are used when invoking OMP, which leads to very bad performance if the workload is small while the number of available CPUs is large. A lot of time is wasted on inter-thread sync. Fix this issue by actually using the number specified by the variable 'num' from the calling API. Signed-off-by: Chen, Guobing <guobing.chen@intel.com>	2020-08-24 02:45:54 +08:00
Chen, Guobing	e740c4873d	Enable COOPERLAKE build target Enable new build target platform -- COOPERLAKE. This target platform supports all the SKYLAKEX supported ISAs + avx512bf16. So all the SKYLAKEX specific kernels/drivers and related code are now extended to be also active on COOPERLAKE. Besides, new BF16 related kernels are active under this target.	2020-08-13 06:18:00 +08:00
Martin Kroeker	60cd5e55fc	Protect against inadvertent activation of USE_CUDA	2020-08-01 12:31:39 +02:00
Martin Kroeker	7c02f4b1f7	Merge pull request #2744 from martin-frbg/issue2738 Add AMD Renoir/Matisse cpu autodetection and preliminary support for Zen3	2020-07-28 19:32:04 +02:00
Martin Kroeker	12918358aa	Add AMD Renoir/Matisse and preliminary support for Zen3 as Zen2; also support AMD family 22 Jaguar/Puma as Bobcat	2020-07-28 13:53:17 +00:00
Ashwin Sekhar T K	4e1be0e481	ARM64: Add THUNDERX3T110 Target	2020-07-26 23:32:24 -07:00
Martin Kroeker	ce45af8151	Update conditional for atomics to use HAVE_C11	2020-07-18 17:09:56 +00:00
Martin Kroeker	6f38de06d2	Update conditional for atomics to use HAVE_C11	2020-07-18 17:09:01 +00:00
Martin Kroeker	09eb9d2584	Update conditional for atomics to HAVE_C11	2020-07-18 17:07:38 +00:00
Martin Kroeker	791e046744	Update conditional for atomics to use HAVE_C11	2020-07-18 17:05:59 +00:00
Martin Kroeker	94bab9d1f9	Update conditional for atomics to use HAVE_C11	2020-07-18 17:03:31 +00:00
Rajalakshmi Srinivasaraghavan	af1e140e35	Change minimum gcc version for POWER10 As the MMA patches for POWER10 are backported to gcc10.2, changing the minimum gcc version needed to build OpenBLAS for POWER10.	2020-07-09 21:46:06 -05:00
Rajalakshmi Srinivasaraghavan	45d819ca82	Changing mcpu option as power10 As compiler enabled mcpu option as power10, changing it from future.	2020-07-07 11:25:20 -05:00
Martin Kroeker	584ef8d4ae	Add support for Comet Lake H & S	2020-06-27 14:36:37 +02:00
Matthew Treinish	f37e941d52	Add support to driver/others/dynamic.c too	2020-06-25 11:56:49 -04:00
User User-User	e6b9275034	address vs2019 C4293	2020-06-24 09:12:23 +03:00
Martin Kroeker	6eaeb01263	Merge pull request #2658 from RajalakshmiSR/p10 powerpc: Add support for future processor	2020-06-23 00:02:37 +02:00
Martin Kroeker	007d9f97d7	Make gotoblas_corename report the name of the selected TARGET rather than its aliases	2020-06-13 19:25:28 +02:00
Rajalakshmi Srinivasaraghavan	9fe930f205	powerpc: Add support for future processor This is the initial patch to support build infrastructure for POWER10 architecture.	2020-06-11 15:47:20 -05:00
Marius Hillenbrand	0dbe61a612	s390x: choose SIMD kernels at run-time based on OS and compiler support Extend and simplify the run-time detection for dynamic architecture support for z to check HW_CAP and only use SIMD features if advertised by the OS. While at it, also honor the env variable LD_HWCAP_MASK and do not use the CPU features masked there. Note that we can only use the SIMD features on z13 or newer (i.e., Vector Facility or Vector-Enhancements Facilities) when the operating system supports properly context-switching the vector registers. The OS advertises that support as a bit in the HW_CAP value in the auxiliary vector. While all recent Linux kernels have that support, we should maintain compatibility with older versions that may still be in use. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 11:01:16 +02:00
Marius Hillenbrand	8c338616f9	s390x: gate dynamic arch detection on gcc version and add generic When building OpenBLAS with DYNAMIC_ARCH=1 on s390x (aka zarch), make sure to include support for systems without the facilities introduced with z13 (i.e., zarch_generic). Adjust runtime detection to fall back to that generic code when running on an unknown platform other than Z13 through Z15. When detecting a Z13 or newer system, add a check for gcc support for the architecture-specific features before selecting the respective kernel. Fall back to Z13 or generic code if that support is missing. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 11:01:16 +02:00
Martin Kroeker	5dd14e3d48	Make building the bfloat16 functions conditional on option BUILD_HALF (#2590) * Make building the bfloat16 BLAS functions conditional on BUILD_HALF * Pass the BUILD_HALF option to gensymbol * Pass BUILD_HALF as a compiler define for dynamic_arch builds	2020-05-01 09:58:30 +02:00
Martin Kroeker	f4248af26e	Fix compiler warnings	2020-04-28 10:43:12 +02:00
Rajalakshmi Srinivasaraghavan	7eb55504b1	RFC : Add half precision gemm for bfloat16 in OpenBLAS This patch adds support for bfloat16 data type matrix multiplication kernel. For architectures that don't support bfloat16, it is defined as unsigned short (2 bytes). Default unroll sizes can be changed as per architecture as done for SGEMM and for now 8 and 4 are used for M and N. Size of ncopy/tcopy can be changed as per architecture requirement and for now, size 2 is used. Added shgemm in kernel/power/KERNEL.POWER9 and tested in powerpc64le and powerpc64. For reference, added a small test compare_sgemm_shgemm.c to compare sgemm and shgemm output. This patch does not cover OpenBLAS test, benchmark and lapack tests for shgemm. Complex type implementation can be discussed and added once this is approved.	2020-04-14 14:55:08 -05:00
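
Several entries in this log (7eb55504b1, deaeb6c5b8) concern the bfloat16 type and its conversion to and from single precision. As a rough illustration of what such a conversion involves, here is a minimal sketch in C. It is not taken from the OpenBLAS kernels: the function names float_to_bf16, bf16_to_float, floats_to_bf16 and bf16_to_floats are hypothetical, and the round-to-nearest-even rounding is an assumption about one reasonable choice, not a statement about what the library does. Only the uint16_t typedef comes from the commit messages above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The log states bfloat16 is typedef'd as uint16_t. */
typedef uint16_t bfloat16;

/* Hypothetical helper: convert one float to bfloat16.
 * bfloat16 keeps the upper 16 bits of an IEEE-754 binary32 value.
 * Rounding mode (round-to-nearest-even) is an assumption of this sketch. */
static bfloat16 float_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* safe type punning */
    if ((bits & 0x7fffffffu) > 0x7f800000u) {
        /* NaN: truncate and force a mantissa bit so it stays a NaN */
        return (bfloat16)((bits >> 16) | 0x0040u);
    }
    uint32_t lsb = (bits >> 16) & 1u;        /* lowest bit that survives */
    bits += 0x7fffu + lsb;                   /* round to nearest, ties to even */
    return (bfloat16)(bits >> 16);
}

/* Hypothetical helper: widen one bfloat16 back to float (exact). */
static float bf16_to_float(bfloat16 h) {
    uint32_t bits = (uint32_t)h << 16;       /* lower 16 mantissa bits are zero */
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical array wrappers, mirroring the "array to array" wording
 * of the conversion APIs described in the log. */
static void floats_to_bf16(size_t n, const float *in, bfloat16 *out) {
    for (size_t i = 0; i < n; i++) out[i] = float_to_bf16(in[i]);
}

static void bf16_to_floats(size_t n, const bfloat16 *in, float *out) {
    for (size_t i = 0; i < n; i++) out[i] = bf16_to_float(in[i]);
}
```

Widening is always exact, while narrowing discards 16 mantissa bits, so a round trip only reproduces values that fit in bfloat16's 8 significand bits.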

522 Commits