OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Martin Kroeker	ff16329cb7	Merge pull request #2972 from xiegengxin/rot-intrinsic Improve the performance of rot by using AVX512 and AVX2 intrinsic	2020-11-08 22:43:00 +01:00
Gengxin Xie	d9ba49165a	Improve the performance of rot by using AVX512 and AVX2 intrinsic	2020-11-05 15:12:36 +08:00
Martin Kroeker	aa21cb5217	Merge pull request #2960 from thrasibule/avx2_detection fix avx2 detection	2020-10-31 20:24:21 +01:00
Guillaume Horel	1f564d729b	fix avx2 detection reword commits to make it clearer	2020-10-31 10:00:48 -04:00
Chen, Guobing	a7b1f9b1bb	Implementation of BF16 based gemv 1. Add a new API -- sbgemv to support bfloat16 based gemv 2. Implement a generic kernel for sbgemv 3. Implement an avx512-bf16 based kernel for sbgemv Signed-off-by: Chen, Guobing <guobing.chen@intel.com>	2020-10-29 02:08:23 +08:00
Martin Kroeker	2207a16235	Merge pull request #2952 from martin-frbg/issue2931 Try to read cpu ID from /sys/devices/.../cpu0 if HWCAP_CPUID fails	2020-10-28 09:37:32 +01:00
Martin Kroeker	b937d78a6d	Try to read cpu information from /sys/devices/system/cpu/cpu0 if HWCAP_CPUID fails	2020-10-27 17:51:32 +01:00
Martin Kroeker	fd7da56965	Move definitions that are neither needed nor supported on SUNOS	2020-10-25 12:01:50 +01:00
Martin Kroeker	ff65952e46	Move HAVE_P10_SUPPORT to the build system to be able to include a binutils version check	2020-10-20 00:55:41 +02:00
Rajalakshmi Srinivasaraghavan	b5d30b390d	Fix build issues with bfloat16 This patch fixes compilation errors due to recent renaming from SH to SB with BUILD_BFLOAT16.	2020-10-13 11:00:22 -05:00
Martin Kroeker	006c7f6671	Change "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-12 00:06:06 +02:00
Martin Kroeker	85154c2e18	Change "HALF" and "sh" to "BFLOAT16" and "sb"	2020-10-12 00:05:05 +02:00
Martin Kroeker	887e00fd7f	Adapt for supporting only a subset of variable types	2020-10-11 14:58:57 +02:00
Martin Kroeker	886a8e3190	Adapt for supporting only a subset of variable types	2020-10-11 14:57:32 +02:00
Martin Kroeker	ac653c94f3	Merge branch 'develop' into issue2588-cmake	2020-10-11 13:57:07 +02:00
Martin Kroeker	f032d8966e	Merge pull request #2874 from Flamefire/memory_fixes Avoid out of bounds access on invalid memory free	2020-10-04 15:16:51 +02:00
Martin Kroeker	f6e4cf2f9d	Merge pull request #2876 from Flamefire/omp_fork_fix Lazily reinit threads after a fork in OMP mode	2020-10-03 22:52:17 +02:00
User User-User	d2333e7842	aarch64 fix std=c18 compilation	2020-10-03 18:00:34 +03:00
Alexander Grund	3094fc6c83	Lazily reinit threads after a fork in OMP mode This initializes the per-thread memory buffers which get cleared/released on a fork via pthread_atfork. Not doing so leads to each thread calling blas_memory_alloc on almost every execution which slows down the code significantly as the threads race for the memory allocation using locks to serialize that.	2020-10-01 15:41:42 +02:00
Alexander Grund	3c05f54df8	Avoid out of bounds access on invalid memory free	2020-10-01 10:48:45 +02:00
Alexander Grund	dee7c49938	Fix TABs and trailing space	2020-10-01 10:43:16 +02:00
Martin Kroeker	896bbd55e1	Add support for building only selected variable types	2020-09-26 23:25:55 +02:00
Martin Kroeker	357bff06b5	Add BUILD_vartype defines	2020-09-22 23:24:22 +02:00
Martin Kroeker	988a6f429e	Add BUILD_vartype defines	2020-09-22 23:23:33 +02:00
Martin Kroeker	e5e2fbd593	Support building only selected types	2020-09-22 23:21:30 +02:00
Martin Kroeker	3287848c8f	Support building only selected types	2020-09-22 23:20:51 +02:00
y00512012	06cf73a239	fix a bug of trmm	2020-09-22 16:47:10 +08:00
Martin Kroeker	ddec244a5a	Merge pull request #2838 from austinpagan/gordon_trmm Adding performance patch for trmm, just like trsm (#2836)	2020-09-15 21:17:48 +02:00
fossum	dfeca46098	Adding performance patch for trmm, just like #2836	2020-09-15 08:59:50 -05:00
fossum	274d6e015b	Fixing a performance bug in trsm_[LR].c.	2020-09-14 13:10:48 -05:00
Martin Kroeker	91c84e1c01	Merge pull request #2796 from Guobing-Chen/BF16_dot_coversion_apis Add bfloat16 based dot and conversion with single/double	2020-09-14 15:00:19 +02:00
Marius Hillenbrand	a55fe06f25	s390x/DYNAMIC_ARCH: define a HW_CAP flag to support slightly older glibc versions Enable building DYNAMIC_ARCH support with older versions of glibc that do not know about the hwcap flag HWCAP_S390_VXE yet. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-08 19:34:18 +02:00
Marius Hillenbrand	4f34bcfb5e	s390x/DYNAMIC_ARCH: pass supported arch levels from Makefile to run-time code ... instead of duplicating the (old) mechanism from the Makefile that aimed to derive supported architecture generations from the gcc version. To enable builds with DYNAMIC_ARCH with older compiler releases, the Makefile and drivers/other/dynamic_arch.c need a common view of the architecture support built into the library. We follow the notation from x86 when used with DYNAMIC_LIST, where defines DYN_<ARCH NAME> denote support for a given generation to be built in. Since there are far fewer architecture generations in OpenBLAS for s390x, that does not bloat command lines too much. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-09-08 19:34:18 +02:00
Martin Kroeker	330044d821	Fix potential domain error in sqrt	2020-09-05 09:44:33 +02:00
Chen, Guobing	deaeb6c5b8	Add bfloat16 based dot and conversion with single/double 1. Added bfloat16 based dot as new API: shdot 2. Implemented generic kernel and cooperlake-specific (AVX512-BF16) kernel for shdot 3. Added 4 conversion APIs for bfloat16 data type <=> single/double: shstobf16 shdtobf16 sbf16tos dbf16tod shstobf16 -- convert single float array to bfloat16 array shdtobf16 -- convert double float array to bfloat16 array sbf16tos -- convert bfloat16 array to single float array dbf16tod -- convert bfloat16 array to double float array 4. Implemented generic kernels for all 4 conversion APIs, and cooperlake-specific kernel for shstobf16 and shdtobf16 5. Update level1 thread facilitate functions and macros to support multi-threading for these new APIs 6. Fix Cooperlake platform detection/specify issue when under dynamic-arch building 7. Change the typedef of bfloat16 from unsigned short to more strict uint16_t Signed-off-by: Chen, Guobing <guobing.chen@intel.com>	2020-09-04 02:31:25 +08:00
Chen, Guobing	0c1c903f1e	Fix OMP num specify issue In current code, no matter what number of threads specified, all available CPU count is used when invoking OMP, which leads to very bad performance if the workload is small while all available CPUs are big. Lots of time are wasted on inter-thread sync. Fix this issue by really using the number specified by the variable 'num' from calling API. Signed-off-by: Chen, Guobing <guobing.chen@intel.com>	2020-08-24 02:45:54 +08:00
Chen, Guobing	e740c4873d	Enable COOPERLAKE build target Enable new build target platform -- COOPERLAKE. This target platform supports all the SKYLAKEX supported ISAs + avx512bf16. So all the SKYLAKEX specific kernels/drivers and related code are now extended to be also active on COOPERLAKE. Besides, new BF16 related kernels are active under this target.	2020-08-13 06:18:00 +08:00
Martin Kroeker	60cd5e55fc	Protect against inadvertent activation of USE_CUDA	2020-08-01 12:31:39 +02:00
Martin Kroeker	7c02f4b1f7	Merge pull request #2744 from martin-frbg/issue2738 Add AMD Renoir/Matisse cpu autodetection and preliminary support for Zen3	2020-07-28 19:32:04 +02:00
Martin Kroeker	12918358aa	Add AMD Renoir/Matisse and preliminary support for Zen3 as Zen2 also support AMD family 22 Jaguar/Puma as Bobcat	2020-07-28 13:53:17 +00:00
Ashwin Sekhar T K	4e1be0e481	ARM64: Add THUNDERX3T110 Target	2020-07-26 23:32:24 -07:00
Martin Kroeker	ce45af8151	Update conditional for atomics to use HAVE_C11	2020-07-18 17:09:56 +00:00
Martin Kroeker	6f38de06d2	Update conditional for atomics to use HAVE_C11	2020-07-18 17:09:01 +00:00
Martin Kroeker	09eb9d2584	Update conditional for atomics to HAVE_C11	2020-07-18 17:07:38 +00:00
Martin Kroeker	791e046744	Update conditional for atomics to use HAVE_C11	2020-07-18 17:05:59 +00:00
Martin Kroeker	94bab9d1f9	Update conditional for atomics to use HAVE_C11	2020-07-18 17:03:31 +00:00
Rajalakshmi Srinivasaraghavan	af1e140e35	Change minimum gcc version for POWER10 As the MMA patches for POWER10 are backported to gcc10.2, changing the minimum gcc version needed to build OpenBLAS for POWER10.	2020-07-09 21:46:06 -05:00
Rajalakshmi Srinivasaraghavan	45d819ca82	Changing mcpu option as power10 As compiler enabled mcpu option as power10, changing it from future.	2020-07-07 11:25:20 -05:00
Martin Kroeker	584ef8d4ae	Add support for Comet Lake H & S	2020-06-27 14:36:37 +02:00
Matthew Treinish	f37e941d52	Add support to driver/others/dynamic.c too	2020-06-25 11:56:49 -04:00
User User-User	e6b9275034	address vs2019 C4293	2020-06-24 09:12:23 +03:00
Martin Kroeker	6eaeb01263	Merge pull request #2658 from RajalakshmiSR/p10 powerpc: Add support for future processor	2020-06-23 00:02:37 +02:00
Martin Kroeker	007d9f97d7	Make gotoblas_corename report the name of the selected TARGET rather than its aliases	2020-06-13 19:25:28 +02:00
Rajalakshmi Srinivasaraghavan	9fe930f205	powerpc: Add support for future processor This is the initial patch to support build infrastructure for POWER10 architecture.	2020-06-11 15:47:20 -05:00
Marius Hillenbrand	0dbe61a612	s390x: choose SIMD kernels at run-time based on OS and compiler support Extend and simplify the run-time detection for dynamic architecture support for z to check HW_CAP and only use SIMD features if advertised by the OS. While at it, also honor the env variable LD_HWCAP_MASK and do not use the CPU features masked there. Note that we can only use the SIMD features on z13 or newer (i.e., Vector Facility or Vector-Enhancements Facilities) when the operating system supports properly context-switching the vector registers. The OS advertises that support as a bit in the HW_CAP value in the auxiliary vector. While all recent Linux kernels have that support, we should maintain compatibility with older versions that may still be in use. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 11:01:16 +02:00
Marius Hillenbrand	8c338616f9	s390x: gate dynamic arch detection on gcc version and add generic When building OpenBLAS with DYNAMIC_ARCH=1 on s390x (aka zarch), make sure to include support for systems without the facilities introduced with z13 (i.e., zarch_generic). Adjust runtime detection to fallback to that generic code when running on a unknown platform other than Z13 through Z15. When detecting a Z13 or newer system, add a check for gcc support for the architecture-specific features before selecting the respective kernel. Fallback to Z13 or generic code, in case. Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>	2020-05-12 11:01:16 +02:00
Martin Kroeker	5dd14e3d48	Make building the bfloat16 functions conditional on option BUILD_HALF (#2590 ) * make building the bfloat16 BLAS functions conditional on BUILD_HALF * pass the BUILD_HALF option to gensymbol * Pass BUILD_HALF as a compiler define for dynamic_arch builds	2020-05-01 09:58:30 +02:00
Martin Kroeker	f4248af26e	Fix compiler warnings	2020-04-28 10:43:12 +02:00
Rajalakshmi Srinivasaraghavan	7eb55504b1	RFC : Add half precision gemm for bfloat16 in OpenBLAS This patch adds support for bfloat16 data type matrix multiplication kernel. For architectures that don't support bfloat16, it is defined as unsigned short (2 bytes). Default unroll sizes can be changed as per architecture as done for SGEMM and for now 8 and 4 are used for M and N. Size of ncopy/tcopy can be changed as per architecture requirement and for now, size 2 is used. Added shgemm in kernel/power/KERNEL.POWER9 and tested in powerpc64le and powerpc64. For reference, added a small test compare_sgemm_shgemm.c to compare sgemm and shgemm output. This patch does not cover OpenBLAS test, benchmark and lapack tests for shgemm. Complex type implementation can be discussed and added once this is approved.	2020-04-14 14:55:08 -05:00
Martin Kroeker	f41600e66f	Add a read barrier in the traversing of the buffer list Needed on systems with weak memory ordering - the inferior, partially working fix from #2544 was already removed in #2551	2020-04-13 12:34:02 +02:00
Martin Kroeker	2a28448a96	Add safeguards for sufficient BUFFER_SIZE	2020-04-12 19:45:36 +02:00
Sharvil Nanavati	7b4773b24d	Add API to set thread affinity on Linux. Issue: #2545	2020-04-08 12:49:35 -07:00
Martin Kroeker	69f277f8ee	Add another memory barrier for ARM and a multicore test run on ThunderX to help detect such issues (#2544 ) * Add another memory barrier in memory.c to prevent races in memory slot allocation * Add an all-core test on Drone.io's ThunderX platform and modify dgemm_tester to use all 96 cores	2020-04-08 11:04:51 +02:00
Martin Kroeker	806f89166e	Make ARMV7 compile with xcode and add a CI job for it (#2537 ) * Add an ARMV7 iOS build on Travis * thread_local appears to be unavailable on ARMV7 iOS * Add no-thumb option for ARMV7 IOS build to get it to accept DMB ISH * Make local labels in macros of nrm2_vfpv3.S compatible with the xcode assembler	2020-04-02 10:30:37 +02:00
Martin Kroeker	ad9e53154d	Merge pull request #2484 from RajalakshmiSR/power-dynamic Fix DYNAMIC_ARCH build for POWER9	2020-03-04 08:06:06 +01:00
Martin Kroeker	e6edb7431f	Merge pull request #2466 from AGSaidi/acq-rel-1 Switch blas_server to use acq/rel semantics	2020-03-04 07:59:31 +01:00
Martin Kroeker	d68e4ba59b	Fix cut/paste glitch	2020-03-03 21:37:48 +01:00
Martin Kroeker	635c9e4e09	Restore initializers for mutex and conditional	2020-03-03 21:04:12 +01:00
Rajalakshmi Srinivasaraghavan	2afc074803	Fix DYNAMIC_ARCH build for POWER9 Setting DYNAMIC_ARCH=1 on POWER9 does not build POWER9 files due to some compiler version checks. This patch fixes some of the macros that are used to check compiler version. On fixing those checks, there are some new make failures related to icamin, icamax, isamin, isamax and caxpy files on POWER9. This patch fixes those failures as well.	2020-03-03 12:35:10 -06:00
Ali Saidi	43c2e845ab	Switch blas_server to use acq/rel semantics Heavy-weight locking isn't required to pass the work queue pointer between threads and simple atomic acquire/release semantics can be used instead. This is especially important as pthread_mutex_lock() isn't fair. We've observed substantial variation in runtime because of the unfairness of these locks which completely goes away with this implementation. The locks themselves are left to provide a portable way for idling threads to sleep/wakeup after many unsuccessful iterations waiting.	2020-03-02 02:52:49 +00:00
Martin Kroeker	2e6963259b	Merge pull request #2471 from AGSaidi/l3-fix-2 Fix barriers in level3_thread	2020-03-01 19:41:07 +01:00
Ali Saidi	97ce6bbce2	Fix barriers in level3_thread	2020-02-29 17:45:17 +00:00
Ali Saidi	c623a965f9	Add Neoverse-N1 core The implementation is a hybrid of the ARMV8 one with some of the improved TX2 routines along with specifying -march=v8.2-a	2020-02-29 03:22:04 +00:00
Martin Kroeker	4c5fac5a2b	Typo fix	2020-02-24 20:15:04 +01:00
Martin Kroeker	9b732696c6	Add DYNAMIC_ARCH support for ARMV8 EMAG8180	2020-02-24 19:20:00 +01:00
Martin Kroeker	cb6ef49857	Merge pull request #2407 from susilehtola/patch-2 Patch out instances of Z15 in dynamic_zarch.c	2020-02-11 13:04:44 +01:00
Susi Lehtola	5a6bba3061	Patch out instances of Z15 in dynamic_zarch.c There does not appear to be a Z15 kernel yet, causing link errors from the code. This patch fixes the issue.	2020-02-11 15:07:33 +13:00
Susi Lehtola	dff173e50e	Fix typo in dynamic_zarch.c	2020-02-11 14:46:30 +13:00
wjc404	2f96a2c55b	Update trmm_R.c	2020-02-05 10:15:02 +08:00
wjc404	833bd0f8ff	Update trmm_L.c	2020-02-05 10:09:41 +08:00
wjc404	77b8f49556	Update level3_thread.c	2020-02-04 20:33:08 +08:00
wjc404	1c3e20ce48	Update level3.c	2020-02-04 20:30:23 +08:00
Martin Kroeker	1f62a82789	Merge pull request #2376 from wjc404/develop Fix remaining bugs in parallel GEMM3M	2020-01-23 21:50:19 +01:00
wjc404	e9fb8f62b1	Update level3_gemm3m_thread.c	2020-01-22 17:40:03 +00:00
Martin Kroeker	78100b8093	Free Windows thread memory with MEM_RELEASE rather than MEM_DECOMMIT as suggested by hjmndv in #2370	2020-01-18 15:06:39 +01:00
int_13h	96ad579428	add in runtime cpu detection for zarch (#2349 ) add in runtime cpu detection for zarch	2019-12-31 18:03:27 +01:00
wjc404	4c35b8dbaa	Update gemm3m_level3.c	2019-12-27 18:03:01 +08:00
Jehan	13226e3101	driver: more reasonable thread wait timeout on Windows. It used to be 5ms, which might not be long enough in some cases for the thread to exit well, but then when set to 5000 (5s), it would slow down any program depending on OpenBlas. Let's just set it to 50ms, which is at least 10 times longer than originally, but still reasonable in case of failed thread termination.	2019-12-13 09:52:33 +01:00
Martin Kroeker	a4896b5538	Update DYNAMIC_ARCH support for ARM64 and PPC (#2332 ) * Update DYNAMIC_ARCH list of ARM64 targets for gmake * Update arm64 cpu list for runtime detection * Update DYNAMIC_ARCH list of ARM64 targets for cmake and add POWERPC targets	2019-12-04 11:06:03 +01:00
Martin Kroeker	3518617f5b	Add Intel Goldmont+ cpuid was originally in #2228 but that PR had misplaced the file in the toplevel directory	2019-12-03 08:32:29 +01:00
Martin Kroeker	a4c3668f99	Merge pull request #2321 from martin-frbg/issue2319 Fix race conditions in multithreaded GEMM3M	2019-11-28 09:30:24 +01:00
Martin Kroeker	f95989cbc1	Fix AVX512 capability test (always returning zero) from #2322	2019-11-23 22:38:07 +01:00
Martin Kroeker	f3065a0eed	Fix race conditions in multithreaded GEMM3M by adding barriers (and a mutex lock for the non-OpenMP case) like it was already done for GEMM in level3_thread.c some time ago	2019-11-23 19:54:56 +01:00
Jehan	1f6071590d	Fix usage of TerminateThread() causing critical section corruption. This patch was submitted to the GIMP project by a publisher wishing to keep confidentiality (hence anonymously). I just pass along the patch. Here is the patch explanation which came with: First they remind us what Microsoft documentation says about TerminateThread: > TerminateThread is a dangerous function that should only be used in > the most extreme cases. You should call TerminateThread only if you > know exactly what the target thread is doing, and you control all of > the code that the target thread could possibly be running at the time > of the termination. (https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-terminatethread) Then they say that 5 milliseconds time-out might not be long enough for the thread to exit gracefully. They propose to set it to a much higher value (for instance here 5 seconds). And finally you should always check the return value of WaitForSingleObject(). In particular you want to run TerminateThread() only if WaitForSingleObject() failed, not on success case.	2019-11-20 13:00:49 +01:00
Martin Kroeker	82b75f97e5	Disable the old QCDOC qalloc by default and copy utility functions from memory.c 1. qalloc() appears to have been a special routine written for the PPC440-based QCDOC supercomputer(s) from around 2005, its source does not seem to be readily available. So switch the #if 1 in the code to rely on standard malloc() by default. 2. Utility functions like get_num_procs, get_num_threads that were added to the "normally" used memory.c in the meantime were still missing here.	2019-11-17 19:22:04 +01:00
Martin Kroeker	1b90989662	Add NetBSD to the xBSD conditionals	2019-10-25 12:52:49 +02:00
Martin Kroeker	5f6206fa2d	Simplify OSX/IOS cross-compilation and add a CI test for it (#2279 ) * Add automatic fixups for OSX/IOS cross-compilation * Add OSX/IOS cross-compilation test to Travis CI * Handle platforms that lack hwcap.h by falling back to ARMV8 * Fix PROLOGUE for OSX/IOS	2019-10-08 20:13:14 +02:00
Martin Kroeker	8617d75548	Revert "Avoid taking root of negative number in symv_thread.c"	2019-10-01 23:50:41 +02:00
Sebastian Berg	6355c25dde	Avoid taking root of negative number in symv_thread.c This is similar to fixes in gh-1929, but there was one remaining occurrence of this type of pattern in the driver/level2/*_thread.c files.	2019-09-29 22:03:12 -07:00
Martin Kroeker	673e5a0495	Replace several POWER8/9 C kernels with their gcc7-generated assembly versions (#2263 ) * Add gcc7-generated assembly files for POWER8/9 isa/ica-min/max and POWER9 caxpy To work around internal compiler errors encountered when compiling the original C source with gcc 4 and 5, and wrong code generated by gcc 8.3.0 * Use gcc-generated assembly instead of original C sources to work around internal compiler errors encountered with gcc 4.8/5.4 and wrong code generation by gcc 8.3 * Use gcc-generated assembly instead of the original C source to work around internal compiler errors encountered with gcc 4.8 and 5.4, and wrong code generation by gcc 8.3 * Add gcc7-generated assembler version of caxpy for power8 to work around wrong code generated by gcc 8.3 * Handle CONJ define for caxpyc * Handle CONJ define for caxpyc * Add gcc7-generated assembly cdot for POWER9 * Use prebuilt assembly for POWER9 cdot created with gcc 7.3.1 to work around ICE in older gcc versions * Exclude POWER9 from DYNAMIC_ARCH when gcc versions is lower than 6 * Update Makefile.system * Use PROLOGUE macro to ensure correct function name for DYNAMIC_ARCH * Disable POWER9 with old gcc versions	2019-09-22 22:35:22 +02:00
Andrew	4de545aa7d	address minor warnings from gcc7	2019-09-07 10:21:08 +03:00
Martin Kroeker	bf1430f7d7	Merge pull request #2208 from martin-frbg/munmap-debug Provide more information on mmap/munmap failure	2019-08-09 07:55:35 +02:00
Martin Kroeker	1776ad82c0	Add files via upload	2019-08-09 00:08:11 +02:00
Martin Kroeker	4e2f81cfa1	Provide more information on mmap/munmap failure for #2207	2019-08-08 23:15:35 +02:00
Martin Kroeker	3d36c45116	Add CPUID identification of Intel Ice Lake	2019-08-01 22:52:35 +02:00
Martin Kroeker	21d05a4835	Merge pull request #2140 from martin-frbg/pgi19 Do not try ancient PGI hacks with recent versions of that compiler	2019-05-26 12:39:20 +02:00
Martin Kroeker	1778fd4219	Do not try ancient PGI hacks with recent versions of that compiler should fix #2139	2019-05-22 13:48:27 +02:00
Martin Kroeker	86dda5c2fa	Add option USE_LOCKING for SMP-like locking in USE_THREAD=0 builds	2019-05-15 23:21:20 +02:00
Martin Kroeker	5cabda79d0	Merge pull request #2117 from martin-frbg/issue2114 Fix errors in cpu affinity setup with glibc 2.6	2019-05-07 18:18:16 +02:00
Martin Kroeker	a6a8cc2b7f	Fix errors in cpu enumeration with glibc 2.6 for #2114	2019-05-07 13:34:52 +02:00
Martin Kroeker	a387a23518	Merge pull request #2101 from luzpaz/misc-typos Misc. typo fixes in comments and documentation	2019-05-04 22:28:29 +02:00
Martin Kroeker	b43c8382c8	Correct argument of CPU_ISSET for glibc <2.5 fixes #2104	2019-05-01 10:46:46 +02:00
luz.paz	daf2fec12d	Misc. typo fixes Found via `codespell -q 3 -w -L ith,als,dum,nd,amin,nto,wis,ba -S ./relapack,./kernel,./lapack-netlib`	2019-04-29 17:03:56 -04:00
Jeff Baylor	40e53e52d6	snprintf define consolidated to common.h	2019-04-22 17:01:34 -07:00
Rashmica Gupta	bcdf1d4917	Add in runtime CPU detection for POWER.	2019-04-09 14:20:16 +10:00
Erik M. Bray	8ba9e2a61a	Also call CloseHandle on each thread, as well as on the event so as to not leak thread handles.	2019-03-19 11:21:44 +01:00
Erik M. Bray	4ad694eda1	Fix for #2063 : The DllMain used in Cygwin did not run the thread memory pool cleanup upon THREAD_DETACH which is needed when compiled with USE_TLS=1.	2019-03-19 09:26:50 +01:00
Martin Kroeker	3ce28fb81a	Merge pull request #2055 from martin-frbg/atomid Add CPUID data for Intel Denverton (as Nehalem)	2019-03-12 22:57:07 +01:00
Martin Kroeker	04f2226ea6	Add Intel Denverton	2019-03-12 16:09:55 +01:00
Martin Kroeker	4741ce803b	Merge pull request #2045 from martin-frbg/2033-3 Do not compile in AVX512 check if AVX support is disabled	2019-03-06 22:40:26 +01:00
Martin Kroeker	11cfd0bd75	Do not compile in AVX512 check if AVX support is disabled The xgetbv function depends on NO_AVX being undefined - we could change that too, but that combo is unlikely to work anyway	2019-03-05 16:04:25 +01:00
Martin Kroeker	d7b2c53c0b	Merge pull request #2039 from brada4/meminit Address warning in memory.c	2019-03-05 12:11:15 +01:00
Martin Kroeker	10d841d8b9	Merge pull request #2026 from martin-frbg/trmv_threads Correct range limiting in trmv_thread and re-enable TRMV multithreading	2019-03-04 15:08:31 +01:00
Martin Kroeker	6c83b878f6	Merge pull request #2040 from martin-frbg/locks2002 Restore locking optimizations for OpenMP case	2019-03-04 15:07:14 +01:00
Martin Kroeker	af480b02a4	Restore locking optimizations for OpenMP case restore another accidentally dropped part of #1468 that was missed in #2004 to address performance regression reported in #1461	2019-03-03 14:17:07 +01:00
Andrew	e4a79be6bb	address warning introed with #1814 et al	2019-03-03 09:05:11 +02:00
Martin Kroeker	45333d5793	Fix error introduced during cleanup	2019-02-19 22:16:33 +01:00
Martin Kroeker	78d9910236	Correct range_n limiting same bug as seen in #1388, somehow missed in corresponding PR #1389	2019-02-19 20:59:48 +01:00
Martin Kroeker	03a2bf2602	Fix potential memory leak in cpu enumeration on Linux (#2008 ) * Fix potential memory leak in cpu enumeration with glibc An early return after a failed call to sched_getaffinity would leak the previously allocated cpu_set_t. Wrong calculation of the size argument in that call increased the likelihood of that failure. Fixes #2003	2019-02-10 23:24:45 +01:00
Martin Kroeker	69edc5bbe7	Restore dropped patches in the non-TLS branch of memory.c (#2004 ) * Restore dropped patches in the non-TLS branch of memory.c As discovered in #2002, the reintroduction of the "original" non-TLS version of memory.c as an alternate branch had inadvertently used `ba1f91f` rather than `a8002e2` , thereby dropping the commits for #1450, #1468, #1501, #1504 and #1520.	2019-02-07 20:06:13 +01:00
caiyu	29dc72889f	Add support for Hygon Dhyana	2019-01-16 14:25:19 +08:00
Martin Kroeker	dbc9a060ef	Fix missing braces in support_av() call	2019-01-14 22:41:31 +01:00
Martin Kroeker	21c0f2af7b	Merge pull request #1957 from martin-frbg/issue1954 Move TLS key deletion to openblas_quit	2019-01-10 12:04:08 +01:00
Martin Kroeker	ad2c386d6a	Move TLS key deletion to openblas_quit fixes #1954 (as suggested by thrasibule in that issue)	2019-01-10 00:32:50 +01:00
Martin Kroeker	31ed19e8b9	Add message for SkylakeX and KNL fallbacks to Haswell	2019-01-05 19:41:13 +01:00
Martin Kroeker	e1574fa2b4	Add xcr0 (os support) check	2019-01-05 18:08:02 +01:00
Martin Kroeker	ae1d1f74f7	Query AVX2 and AVX512 capability for runtime cpu selection	2019-01-05 16:55:33 +01:00
Martin Kroeker	8643521127	Merge pull request #1943 from martin-frbg/issue1748 Re-enable loop unrolling in trmv and remove the scary warning	2018-12-30 20:07:01 +01:00
Martin Kroeker	5a720cf9ca	Re-enable loop unrolling in trmv and remove the scary warning fixes #1748 as that half of the fix for #1332 appears to have been an overreaction on my part.	2018-12-30 15:22:37 +01:00
Martin Kroeker	ccd5945d38	Merge pull request #1942 from martin-frbg/issue1720 Delete the pthread key on cleanup in TLS mode	2018-12-30 14:47:05 +01:00
Martin Kroeker	bba1e67269	Delete the pthread key on cleanup in TLS mode to avoid a crash when OpenBLAS was loaded via dlopen and libc tries to clean up the leaked TLS after dlclose Fixes #1720	2018-12-29 21:59:31 +01:00
Martin Kroeker	f343ed65b5	Avoid taking the root of a negative number Fixes #1924 where numpy 1.17+ would report the (transient) FE_INVALID exception raised for the domain error.	2018-12-22 22:30:29 +01:00
Martin Kroeker	0bf6d74e5f	Fix typo in previous commit for arm dynamic arch	2018-12-07 19:37:33 +01:00
Martin Kroeker	2b355592e3	Make sure to use the arm version of dynamic.c in ARM64 DYNAMIC_ARCH cf. #1908	2018-12-07 16:25:55 +01:00
Andrew	2601cd58ab	remove surplus locking code , only enabled w x86, disabled or never enabled on all others	2018-11-30 11:38:19 +01:00
Martin Kroeker	97d7298973	call it OpenBLAS not just version	2018-11-29 11:52:08 +01:00
Martin Kroeker	de0d0ed52f	Improve formatting of config output	2018-11-29 11:28:19 +01:00
Martin Kroeker	816775e309	Add version information to openblas_get_config output	2018-11-29 00:06:44 +01:00
Martin Kroeker	f72fdf525c	Merge pull request #1875 from martin-frbg/issue1851 Serialize accesses to parallelized level3 functions from multiple cal…	2018-11-25 20:53:46 +01:00
Martin Kroeker	113cb00b95	fix missing parenthesis	2018-11-19 21:01:36 +01:00
Martin Kroeker	5192651706	Add CriticalSection handling instead of mutexes for Windows	2018-11-19 17:58:22 +01:00
Martin Kroeker	2e6fae2aad	Serialize accesses to parallelized level3 functions from multiple callers for #1851	2018-11-19 14:02:50 +01:00
Martin Kroeker	368d14f8c8	Fix harmless typo fixes #1872	2018-11-16 14:58:28 +01:00
Martin Kroeker	0427277cef	Allow optimization for small m, large n only if it can be made threadsafe otherwise the introduction of a static array in `8e5a108` to improve #532 breaks concurrent calls from multiple threads as seen in #1844	2018-11-10 15:45:54 +01:00
Arjan van de Ven	5b708e5eb1	sgemm/dgemm: add a way for an arch kernel to specify preferred sizes The current gemm threading code can make very unfortunate choices, for example on my 10 core system a 1024x1024x1024 matrix multiply ends up chunking into blocks of 102... which is not a vector friendly size and performance ends up horrible. This patch adds a helper define where an architecture can specify a preference for size multiples. This is different from existing defines that are minimum sizes and such. The performance increase with this patch for the 1024x1024x1024 sgemm is 2.3x (!!)	2018-11-01 01:43:20 +00:00
Martin Kroeker	f5595d0262	Merge pull request #1843 from martin-frbg/aix_numprocs Add get_num_procs implementation for AIX	2018-10-31 21:25:15 +01:00
Martin Kroeker	326d394a0f	Add get_num_procs implementation for AIX (and copy HAIKU implementation to the non-TLS version of the code as well)	2018-10-31 18:38:22 +01:00
Erik M. Bray	38cf5d9364	ensure that threading has been initialized in the first place before calling openblas_set_num_threads	2018-10-28 21:16:52 +00:00
Ashwin Sekhar T K	d5aeff636f	ARM64: Enable DYNAMIC_ARCH Enable DYNAMIC_ARCH feature on ARM64. This patch uses the cpuid feature in linux kernel to detect the core type at runtime (https://www.kernel.org/doc/Documentation/arm64/cpu-feature-registers.txt). If this feature is missing in kernel, then the user should use the OPENBLAS_CORETYPE env variable to select the desired core type.	2018-10-22 01:49:35 -07:00
Ashwin Sekhar T K	d50abc8903	ARM64: Move parameters from parameter.c to param.h Remove the runtime setting of P, Q, R parameters for targets ARMV8, THUNDERX2T99. Instead set them as constants in param.h at compile time.	2018-10-22 01:45:51 -07:00
Ashwin Sekhar T K	21f46a1cf2	ARM64: Use THUNDERX2T99 Neon Kernels for ARMV8 Currently the generic ARMV8 target uses C implementations for many routines. Replace these with the neon implementations written for THUNDERX2T99 target which are up to 6x faster for certain routines.	2018-10-17 10:44:37 -07:00
Andrew	3439158dea	address #1782 2nd loop	2018-10-03 21:20:50 +02:00
Martin Kroeker	28aa94bf4b	Include thread numbers in failure message from blas_thread_init to aid in debugging cases like #1767	2018-09-22 14:00:15 +02:00
Martin Kroeker	1ad1e79062	Catch inadvertent USE_TLS=0 declaration for #1766	2018-09-19 18:03:43 +02:00
Martin Kroeker	b402626509	Do not use the new TLS code for non-threaded builds even if USE_TLS is set Workaround for #1761 as that exposed a problem in the new code (which was intended to speed up multithreaded code only anyway).	2018-09-16 12:43:36 +02:00
Martin Kroeker	b55690a659	typo fix	2018-08-26 11:31:07 +02:00
Martin Kroeker	b902a40986	Rewrite glibc version check	2018-08-26 11:18:02 +02:00
Martin Kroeker	5991d1a6cd	Update memory.c	2018-08-25 22:12:40 +02:00
Martin Kroeker	b1b743f434	Merge branch 'develop' into interim033	2018-08-25 19:45:19 +02:00
Martin Kroeker	fd42ca462d	Combo of default pre-0.3.1 memory.c and band-aided version of PR1739	2018-08-25 19:35:16 +02:00
Zoltán Mizsei	6463bffd59	Haiku supporting patches	2018-08-02 20:49:14 +02:00
Martin Kroeker	8ef7d4fb54	Merge pull request #1706 from oon3m0oo/develop Fix #1705 where we incorrectly calculate page locations.	2018-08-02 18:53:34 +02:00
Craig Donner	6400868e55	Fix #1705 where we incorrectly calculate page locations. Since we now use an allocation size that isn't a multiple of PAGESIZE, finding the pages for run_bench wasn't terminating properly. Now we detect if we've found enough pages for the allocation and terminate the loop.	2018-08-02 16:21:19 +01:00
Martin Kroeker	66fcdd5be8	Merge pull request #1695 from martin-frbg/issue1692 Unset memory table entry, not just the local pointer to it on shutdown	2018-07-22 16:34:09 +02:00
Martin Kroeker	43ac839c16	Unset memory table entry, not just the temporary pointer to it on shutdown to fix crash with multiple instances of OpenBLAS, #1692	2018-07-22 09:19:19 +02:00
Martin Kroeker	7ba5936ecd	Merge pull request #1688 from martin-frbg/issue1673 Temporarily disable special handling of OPENMP thread memory allocation	2018-07-19 19:03:45 +02:00
Martin Kroeker	b14f44d2ad	Temporarily disable special handling of OPENMP thread memory allocation for issue #1673	2018-07-19 08:57:56 +02:00
Martin Kroeker	36aea5ce2d	Merge pull request #1680 from martin-frbg/snprint Fix wrong redefinitions of snprintf for older MSVC	2018-07-12 14:05:13 +02:00
Martin Kroeker	571e9de2ac	Fix definition of snprintf for MSVC MS _snprintf_s takes an additional argument for the size of the buffer, so is not a direct replacement (utest/ctest.h from which I copied was wrong)	2018-07-12 11:42:25 +02:00
Martin Kroeker	448ed15115	Merge pull request #1678 from martin-frbg/issue1677 Define snprintf for older versions of MSVC	2018-07-12 09:21:34 +02:00
Martin Kroeker	045fb5ea2c	Define snprintf for older versions of MSVC for #1677	2018-07-12 07:30:58 +02:00
Martin Kroeker	4dd70d98d7	Merge pull request #1667 from xianyi/revert-1642-develop Revert "Rewrite &= -> = and simplify the initial blocking phase."	2018-07-04 08:27:21 +02:00
Martin Kroeker	504310eeb9	Merge pull request #1665 from martin-frbg/cpuid-ryzen2 Add cpuid for AMD Ryzen 2	2018-07-04 08:19:40 +02:00
Martin Kroeker	ea1f39518f	Merge pull request #1663 from martin-frbg/issue1641 Double MAX_ALLOCATING_THREADS to fix segfaults with Go and Octave	2018-07-04 08:19:11 +02:00
Martin Kroeker	5f2a3c05cd	Revert "Rewrite &= -> = and simplify the initial blocking phase."	2018-07-03 21:42:28 +02:00
Martin Kroeker	d0ec4325cf	Add cpuid for AMD Ryzen 2	2018-07-03 21:03:24 +02:00
Martin Kroeker	a49203b48c	Double MAX_ALLOCATING_THREADS to fix segfaults with Go and Octave for #1641	2018-07-03 17:35:54 +02:00
Martin Kroeker	9d15a3bd16	Fix typo that broke compilation with DYNAMIC_ARCH and NO_AVX2 fixes 1659	2018-07-02 14:40:41 +02:00
Martin Kroeker	3d3c19717c	Merge pull request #1655 from martin-frbg/issue1641 Fix apparent off-by-one error in calculation of MAX_ALLOCATING_THREADS	2018-07-01 08:41:22 +02:00
Martin Kroeker	4e9c34018e	Fix apparent off-by-one error in calculation of MAX_ALLOCATING_THREADS fixes #1641	2018-06-30 23:57:50 +02:00
Martin Kroeker	750162a05f	Try gradual fallback for cores not in the dynamic core list	2018-06-25 21:02:31 +02:00
Martin Kroeker	e6d93f20f1	Merge pull request #2 from martin-frbg/develop merge develop	2018-06-25 20:48:10 +02:00
Craig Donner	0144068537	Rewrite &= -> = and simplify the initial blocking phase.	2018-06-25 15:08:55 +01:00
Martin Kroeker	1833a67071	Add support for a user-defined list of dynamic targets	2018-06-23 19:42:15 +02:00
Craig Donner	28c28ed275	Fix data races reported by TSAN.	2018-06-21 16:41:02 +01:00
oon3m0oo	a399d00425	Further improvements to memory.c. (#1625) - Compiler TLS is now used only when the compiler supports it - If compiler TLS is unsupported, we use platform-specific TLS - Only one variable (an index) is now in TLS - We only access TLS once per alloc, and never when freeing - Allocation / release info is now stored within the allocation itself, by over-allocating; this saves having external structures do the bookkeeping, and reduces some of the redundant data that was being stored (such as addresses) - We never hit the alloc lock when not using SMP or when using OpenMP (that was my fault) - Now that there are fewer tracking structures I think this is a bit easier to read than before	2018-06-20 22:04:03 +02:00
Martin Kroeker	5a6a2bed9a	Merge pull request #1623 from fenrus75/fast-thread Initialize only the required subset of the jobs array, fix barriers and improve switch ratio on SkylakeX and Haswell. For issue #1622	2018-06-18 09:02:40 +02:00
Martin Kroeker	2d8cc7193a	Support upcoming Intel Cannon Lake CPUs as Skylake X (#1621) * Support upcoming Cannon Lake as Skylake X	2018-06-17 23:38:14 +02:00
Arjan van de Ven	73de17664d	Add missing barriers in gemm scheduler a few places in the gemm scheduler code were missing barriers; the code likely worked OK due to heavy use of volatile / _Atomic but there's no reason to get this incorrect	2018-06-17 17:50:43 +00:00
Arjan van de Ven	d148ec4ea1	Don't use _Atomic for jobs sometimes... The use of _Atomic leads to really bad code generation in the compiler (on x86, you get 2 "mfence" memory barriers around each access with gcc8, despite x86 being ordered and cache coherent). But there's a fallback in the code that just uses volatile which is more than plenty in practice. If we're nervous about cross thread synchronization for these variables, we should make the YIELD function be a compiler/memory barrier instead. performance before (after last commit) Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 10630.0 10.6 0.7% 18112.8 6.2 -0.7% 64 x 64 20374.8 13.0 1.9% 40487.0 6.5 0.4% 65 x 65 141955.2 1.9 -428.3% 146708.8 1.9 -179.2% 80 x 80 178921.1 2.9 -369.6% 186032.7 2.8 -156.6% 96 x 96 205436.2 4.3 -233.4% 224513.1 3.9 -97.0% 112 x 112 244408.2 5.8 -162.7% 262158.7 5.4 -47.1% 128 x 128 321334.5 6.5 -141.3% 333829.0 6.3 -29.2% Performance with this patch (roughly a 2x improvement): Matrix SGEMM cycles MPC DGEMM cycles MPC 48 x 48 10756.0 10.5 -0.5% 18296.7 6.1 -1.7% 64 x 64 20490.0 12.9 1.4% 40615.0 6.5 0.0% 65 x 65 83528.3 3.3 -210.9% 96319.0 2.9 -83.3% 80 x 80 101453.5 5.1 -166.3% 128021.7 4.0 -76.6% 96 x 96 149795.1 5.9 -143.1% 168059.4 5.3 -47.4% 112 x 112 191481.2 7.3 -105.8% 204165.0 6.9 -14.6% 128 x 128 265019.2 7.9 -99.0% 272006.4 7.7 -5.3%	2018-06-17 15:39:15 +00:00

681 Commits