OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Martin Kroeker	ddec244a5a	Merge pull request #2838 from austinpagan/gordon_trmm Adding performance patch for trmm, just like trsm (#2836)	2020-09-15 21:17:48 +02:00
fossum	dfeca46098	Adding performance patch for trmm, just like #2836	2020-09-15 08:59:50 -05:00
fossum	274d6e015b	Fixing a performance bug in trsm_[LR].c.	2020-09-14 13:10:48 -05:00
Martin Kroeker	91c84e1c01	Merge pull request #2796 from Guobing-Chen/BF16_dot_coversion_apis Add bfloat16 based dot and conversion with single/double	2020-09-14 15:00:19 +02:00
Marius Hillenbrand	a55fe06f25	s390x/DYNAMIC_ARCH: define a HW_CAP flag to support slightly older glibc versions Enable building DYNAMIC_ARCH support with older versions of glibc that do not know about the hwcap flag HWCAP_S390_VXE yet. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-08 19:34:18 +02:00
Marius Hillenbrand	4f34bcfb5e	s390x/DYNAMIC_ARCH: pass supported arch levels from Makefile to run-time code ... instead of duplicating the (old) mechanism from the Makefile that aimed to derive supported architecture generations from the gcc version. To enable builds with DYNAMIC_ARCH with older compiler releases, the Makefile and drivers/other/dynamic_arch.c need a common view of the architecture support built into the library. We follow the notation from x86 when used with DYNAMIC_LIST, where defines DYN_<ARCH NAME> denote support for a given generation to be built in. Since there are far fewer architecture generations in OpenBLAS for s390x, that does not bloat command lines too much. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-08 19:34:18 +02:00
Martin Kroeker	330044d821	Fix potential domain error in sqrt	2020-09-05 09:44:33 +02:00
Chen, Guobing	deaeb6c5b8	Add bfloat16 based dot and conversion with single/double 1. Added bfloat16 based dot as new API: shdot 2. Implemented generic kernel and cooperlake-specific (AVX512-BF16) kernel for shdot 3. Added 4 conversion APIs for bfloat16 data type <=> single/double: shstobf16 shdtobf16 sbf16tos dbf16tod shstobf16 -- convert single float array to bfloat16 array shdtobf16 -- convert double float array to bfloat16 array sbf16tos -- convert bfloat16 array to single float array dbf16tod -- convert bfloat16 array to double float array 4. Implemented generic kernels for all 4 conversion APIs, and cooperlake-specific kernel for shstobf16 and shdtobf16 5. Update level1 thread facilitate functions and macros to support multi-threading for these new APIs 6. Fix Cooperlake platform detection/specify issue when under dynamic-arch building 7. Change the typedef of bfloat16 from unsigned short to more strict uint16_t Signed-off-by: Chen, Guobing <guobing.chen@intel.com>	2020-09-04 02:31:25 +08:00
Chen, Guobing	0c1c903f1e	Fix OMP num specify issue In current code, no matter what number of threads specified, all available CPU count is used when invoking OMP, which leads to very bad performance if the workload is small while all available CPUs are big. Lots of time are wasted on inter-thread sync. Fix this issue by really using the number specified by the variable 'num' from calling API. Signed-off-by: Chen, Guobing <guobing.chen@intel.com>	2020-08-24 02:45:54 +08:00
Chen, Guobing	e740c4873d	Enable COOPERLAKE build target Enable new build target platform -- COOPERLAKE. This target platform supports all the SKYLAKEX supported ISAs + avx512bf16. So all the SKYLAKEX specific kernels/drivers and related code are now extended to be also active on COOPERLAKE. Besides, new BF16 related kernels are active under this target.	2020-08-13 06:18:00 +08:00
Martin Kroeker	60cd5e55fc	Protect against inadvertent activation of USE_CUDA	2020-08-01 12:31:39 +02:00
Martin Kroeker	7c02f4b1f7	Merge pull request #2744 from martin-frbg/issue2738 Add AMD Renoir/Matisse cpu autodetection and preliminary support for Zen3	2020-07-28 19:32:04 +02:00
Martin Kroeker	12918358aa	Add AMD Renoir/Matisse and preliminary support for Zen3 as Zen2 also support AMD family 22 Jaguar/Puma as Bobcat	2020-07-28 13:53:17 +00:00
Ashwin Sekhar T K	4e1be0e481	ARM64: Add THUNDERX3T110 Target	2020-07-26 23:32:24 -07:00
Martin Kroeker	ce45af8151	Update conditional for atomics to use HAVE_C11	2020-07-18 17:09:56 +00:00
Martin Kroeker	6f38de06d2	Update conditional for atomics to use HAVE_C11	2020-07-18 17:09:01 +00:00
Martin Kroeker	09eb9d2584	Update conditional for atomics to HAVE_C11	2020-07-18 17:07:38 +00:00
Martin Kroeker	791e046744	Update conditional for atomics to use HAVE_C11	2020-07-18 17:05:59 +00:00
Martin Kroeker	94bab9d1f9	Update conditional for atomics to use HAVE_C11	2020-07-18 17:03:31 +00:00
Rajalakshmi Srinivasaraghavan	af1e140e35	Change minimum gcc version for POWER10 As the MMA patches for POWER10 are backported to gcc10.2, changing the minimum gcc version needed to build OpenBLAS for POWER10.	2020-07-09 21:46:06 -05:00
Rajalakshmi Srinivasaraghavan	45d819ca82	Changing mcpu option as power10 As compiler enabled mcpu option as power10, changing it from future.	2020-07-07 11:25:20 -05:00
Martin Kroeker	584ef8d4ae	Add support for Comet Lake H & S	2020-06-27 14:36:37 +02:00
Matthew Treinish	f37e941d52	Add support to driver/others/dynamic.c too	2020-06-25 11:56:49 -04:00
User User-User	e6b9275034	address vs2019 C4293	2020-06-24 09:12:23 +03:00
Martin Kroeker	6eaeb01263	Merge pull request #2658 from RajalakshmiSR/p10 powerpc: Add support for future processor	2020-06-23 00:02:37 +02:00
Martin Kroeker	007d9f97d7	Make gotoblas_corename report the name of the selected TARGET rather than its aliases	2020-06-13 19:25:28 +02:00
Rajalakshmi Srinivasaraghavan	9fe930f205	powerpc: Add support for future processor This is the initial patch to support build infrastructure for POWER10 architecture.	2020-06-11 15:47:20 -05:00
Marius Hillenbrand	0dbe61a612	s390x: choose SIMD kernels at run-time based on OS and compiler support Extend and simplify the run-time detection for dynamic architecture support for z to check HW_CAP and only use SIMD features if advertised by the OS. While at it, also honor the env variable LD_HWCAP_MASK and do not use the CPU features masked there. Note that we can only use the SIMD features on z13 or newer (i.e., Vector Facility or Vector-Enhancements Facilities) when the operating system supports properly context-switching the vector registers. The OS advertises that support as a bit in the HW_CAP value in the auxiliary vector. While all recent Linux kernels have that support, we should maintain compatibility with older versions that may still be in use. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 11:01:16 +02:00
Marius Hillenbrand	8c338616f9	s390x: gate dynamic arch detection on gcc version and add generic When building OpenBLAS with DYNAMIC_ARCH=1 on s390x (aka zarch), make sure to include support for systems without the facilities introduced with z13 (i.e., zarch_generic). Adjust runtime detection to fallback to that generic code when running on a unknown platform other than Z13 through Z15. When detecting a Z13 or newer system, add a check for gcc support for the architecture-specific features before selecting the respective kernel. Fallback to Z13 or generic code, in case. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 11:01:16 +02:00
Martin Kroeker	5dd14e3d48	Make building the bfloat16 functions conditional on option BUILD_HALF (#2590) * make building the bfloat16 BLAS functions conditional on BUILD_HALF * pass the BUILD_HALF option to gensymbol * Pass BUILD_HALF as a compiler define for dynamic_arch builds	2020-05-01 09:58:30 +02:00
Martin Kroeker	f4248af26e	Fix compiler warnings	2020-04-28 10:43:12 +02:00
Rajalakshmi Srinivasaraghavan	7eb55504b1	RFC : Add half precision gemm for bfloat16 in OpenBLAS This patch adds support for bfloat16 data type matrix multiplication kernel. For architectures that don't support bfloat16, it is defined as unsigned short (2 bytes). Default unroll sizes can be changed as per architecture as done for SGEMM and for now 8 and 4 are used for M and N. Size of ncopy/tcopy can be changed as per architecture requirement and for now, size 2 is used. Added shgemm in kernel/power/KERNEL.POWER9 and tested in powerpc64le and powerpc64. For reference, added a small test compare_sgemm_shgemm.c to compare sgemm and shgemm output. This patch does not cover OpenBLAS test, benchmark and lapack tests for shgemm. Complex type implementation can be discussed and added once this is approved.	2020-04-14 14:55:08 -05:00
Martin Kroeker	f41600e66f	Add a read barrier in the traversing of the buffer list Needed on systems with weak memory ordering - the inferior, partially working fix from #2544 was already removed in #2551	2020-04-13 12:34:02 +02:00
Martin Kroeker	2a28448a96	Add safeguards for sufficient BUFFER_SIZE	2020-04-12 19:45:36 +02:00
Sharvil Nanavati	7b4773b24d	Add API to set thread affinity on Linux. Issue: #2545	2020-04-08 12:49:35 -07:00
Martin Kroeker	69f277f8ee	Add another memory barrier for ARM and a multicore test run on ThunderX to help detect such issues (#2544) * Add another memory barrier in memory.c to prevent races in memory slot allocation * Add an all-core test on Drone.io's ThunderX platform and modify dgemm_tester to use all 96 cores	2020-04-08 11:04:51 +02:00
Martin Kroeker	806f89166e	Make ARMV7 compile with xcode and add a CI job for it (#2537) * Add an ARMV7 iOS build on Travis * thread_local appears to be unavailable on ARMV7 iOS * Add no-thumb option for ARMV7 IOS build to get it to accept DMB ISH * Make local labels in macros of nrm2_vfpv3.S compatible with the xcode assembler	2020-04-02 10:30:37 +02:00
Martin Kroeker	ad9e53154d	Merge pull request #2484 from RajalakshmiSR/power-dynamic Fix DYNAMIC_ARCH build for POWER9	2020-03-04 08:06:06 +01:00
Martin Kroeker	e6edb7431f	Merge pull request #2466 from AGSaidi/acq-rel-1 Switch blas_server to use acq/rel semantics	2020-03-04 07:59:31 +01:00
Martin Kroeker	d68e4ba59b	Fix cut/paste glitch	2020-03-03 21:37:48 +01:00
Martin Kroeker	635c9e4e09	Restore initializers for mutex and conditional	2020-03-03 21:04:12 +01:00
Rajalakshmi Srinivasaraghavan	2afc074803	Fix DYNAMIC_ARCH build for POWER9 Setting DYNAMIC_ARCH=1 on POWER9 does not build POWER9 files due to some compiler version checks. This patch fixes some of the macros that are used to check compiler version. On fixing those checks, there are some new make failures related to icamin, icamax, isamin, isamax and caxpy files on POWER9. This patch fixes those failures as well.	2020-03-03 12:35:10 -06:00
Ali Saidi	43c2e845ab	Switch blas_server to use acq/rel semantics Heavy-weight locking isn't required to pass the work queue pointer between threads and simple atomic acquire/release semantics can be used instead. This is especially important as pthread_mutex_lock() isn't fair. We've observed substantial variation in runtime because of the unfairness of these locks which completely goes away with this implementation. The locks themselves are left to provide a portable way for idling threads to sleep/wakeup after many unsuccessful iterations waiting.	2020-03-02 02:52:49 +00:00
Martin Kroeker	2e6963259b	Merge pull request #2471 from AGSaidi/l3-fix-2 Fix barriers in level3_thread	2020-03-01 19:41:07 +01:00
Ali Saidi	97ce6bbce2	Fix barriers in level3_thread	2020-02-29 17:45:17 +00:00
Ali Saidi	c623a965f9	Add Neoverse-N1 core The implementation is a hybrid of the ARMV8 one with some of the improved TX2 routines along with specifying -march=v8.2-a	2020-02-29 03:22:04 +00:00
Martin Kroeker	4c5fac5a2b	Typo fix	2020-02-24 20:15:04 +01:00
Martin Kroeker	9b732696c6	Add DYNAMIC_ARCH support for ARMV8 EMAG8180	2020-02-24 19:20:00 +01:00
Martin Kroeker	cb6ef49857	Merge pull request #2407 from susilehtola/patch-2 Patch out instances of Z15 in dynamic_zarch.c	2020-02-11 13:04:44 +01:00
Susi Lehtola	5a6bba3061	Patch out instances of Z15 in dynamic_zarch.c There does not appear to be a Z15 kernel yet, causing link errors from the code. This patch fixes the issue.	2020-02-11 15:07:33 +13:00

504 Commits