OpenBLAS

Author	SHA1	Message	Date
Martin Kroeker	5192651706	Add CriticalSection handling instead of mutexes for Windows	2018-11-19 17:58:22 +01:00
Martin Kroeker	2e6fae2aad	Serialize accesses to parallelized level3 functions from multiple callers for #1851	2018-11-19 14:02:50 +01:00
Martin Kroeker	368d14f8c8	Fix harmless typo, fixes #1872	2018-11-16 14:58:28 +01:00
Martin Kroeker	0427277cef	Allow optimization for small m, large n only if it can be made threadsafe; otherwise the introduction of a static array in `8e5a108` to improve #532 breaks concurrent calls from multiple threads, as seen in #1844	2018-11-10 15:45:54 +01:00
Arjan van de Ven	5b708e5eb1	sgemm/dgemm: add a way for an arch kernel to specify preferred sizes. The current gemm threading code can make very unfortunate choices: for example, on my 10 core system a 1024x1024x1024 matrix multiply ends up chunking into blocks of 102, which is not a vector-friendly size, and performance ends up horrible. This patch adds a helper define where an architecture can specify a preference for size multiples. This is different from existing defines that are minimum sizes and such. The performance increase with this patch for the 1024x1024x1024 sgemm is 2.3x (!!)	2018-11-01 01:43:20 +00:00
Martin Kroeker	f5595d0262	Merge pull request #1843 from martin-frbg/aix_numprocs Add get_num_procs implementation for AIX	2018-10-31 21:25:15 +01:00
Martin Kroeker	326d394a0f	Add get_num_procs implementation for AIX (and copy HAIKU implementation to the non-TLS version of the code as well)	2018-10-31 18:38:22 +01:00
Erik M. Bray	38cf5d9364	ensure that threading has been initialized in the first place before calling openblas_set_num_threads	2018-10-28 21:16:52 +00:00
Ashwin Sekhar T K	d5aeff636f	ARM64: Enable DYNAMIC_ARCH. Enable the DYNAMIC_ARCH feature on ARM64. This patch uses the cpuid feature in the Linux kernel to detect the core type at runtime (https://www.kernel.org/doc/Documentation/arm64/cpu-feature-registers.txt). If this feature is missing in the kernel, then the user should use the OPENBLAS_CORETYPE env variable to select the desired core type.	2018-10-22 01:49:35 -07:00
Ashwin Sekhar T K	d50abc8903	ARM64: Move parameters from parameter.c to param.h. Remove the runtime setting of P, Q, R parameters for targets ARMV8, THUNDERX2T99. Instead set them as constants in param.h at compile time.	2018-10-22 01:45:51 -07:00
Ashwin Sekhar T K	21f46a1cf2	ARM64: Use THUNDERX2T99 Neon Kernels for ARMV8. Currently the generic ARMV8 target uses C implementations for many routines. Replace these with the Neon implementations written for the THUNDERX2T99 target, which are up to 6x faster for certain routines.	2018-10-17 10:44:37 -07:00
Andrew	3439158dea	address #1782 2nd loop	2018-10-03 21:20:50 +02:00
Martin Kroeker	28aa94bf4b	Include thread numbers in failure message from blas_thread_init to aid in debugging cases like #1767	2018-09-22 14:00:15 +02:00
Martin Kroeker	1ad1e79062	Catch inadvertent USE_TLS=0 declaration for #1766	2018-09-19 18:03:43 +02:00
Martin Kroeker	b402626509	Do not use the new TLS code for non-threaded builds even if USE_TLS is set. Workaround for #1761, as that exposed a problem in the new code (which was intended to speed up multithreaded code only anyway).	2018-09-16 12:43:36 +02:00
Martin Kroeker	b55690a659	typo fix	2018-08-26 11:31:07 +02:00
Martin Kroeker	b902a40986	Rewrite glibc version check	2018-08-26 11:18:02 +02:00
Martin Kroeker	5991d1a6cd	Update memory.c	2018-08-25 22:12:40 +02:00
Martin Kroeker	b1b743f434	Merge branch 'develop' into interim033	2018-08-25 19:45:19 +02:00
Martin Kroeker	fd42ca462d	Combo of default pre-0.3.1 memory.c and band-aided version of PR1739	2018-08-25 19:35:16 +02:00
Zoltán Mizsei	6463bffd59	Haiku supporting patches	2018-08-02 20:49:14 +02:00
Martin Kroeker	8ef7d4fb54	Merge pull request #1706 from oon3m0oo/develop Fix #1705 where we incorrectly calculate page locations.	2018-08-02 18:53:34 +02:00
Craig Donner	6400868e55	Fix #1705 where we incorrectly calculate page locations. Since we now use an allocation size that isn't a multiple of PAGESIZE, finding the pages for run_bench wasn't terminating properly. Now we detect if we've found enough pages for the allocation and terminate the loop.	2018-08-02 16:21:19 +01:00
Martin Kroeker	66fcdd5be8	Merge pull request #1695 from martin-frbg/issue1692 Unset memory table entry, not just the local pointer to it on shutdown	2018-07-22 16:34:09 +02:00
Martin Kroeker	43ac839c16	Unset memory table entry, not just the temporary pointer to it on shutdown to fix crash with multiple instances of OpenBLAS, #1692	2018-07-22 09:19:19 +02:00
Martin Kroeker	7ba5936ecd	Merge pull request #1688 from martin-frbg/issue1673 Temporarily disable special handling of OPENMP thread memory allocation	2018-07-19 19:03:45 +02:00
Martin Kroeker	b14f44d2ad	Temporarily disable special handling of OPENMP thread memory allocation for issue #1673	2018-07-19 08:57:56 +02:00
Martin Kroeker	36aea5ce2d	Merge pull request #1680 from martin-frbg/snprint Fix wrong redefinitions of snprintf for older MSVC	2018-07-12 14:05:13 +02:00
Martin Kroeker	571e9de2ac	Fix definition of snprintf for MSVC. MS _snprintf_s takes an additional argument for the size of the buffer, so it is not a direct replacement (utest/ctest.h, from which I copied, was wrong)	2018-07-12 11:42:25 +02:00
Martin Kroeker	448ed15115	Merge pull request #1678 from martin-frbg/issue1677 Define snprintf for older versions of MSVC	2018-07-12 09:21:34 +02:00
Martin Kroeker	045fb5ea2c	Define snprintf for older versions of MSVC for #1677	2018-07-12 07:30:58 +02:00
Martin Kroeker	4dd70d98d7	Merge pull request #1667 from xianyi/revert-1642-develop Revert "Rewrite &= -> = and simplify the initial blocking phase."	2018-07-04 08:27:21 +02:00
Martin Kroeker	504310eeb9	Merge pull request #1665 from martin-frbg/cpuid-ryzen2 Add cpuid for AMD Ryzen 2	2018-07-04 08:19:40 +02:00
Martin Kroeker	ea1f39518f	Merge pull request #1663 from martin-frbg/issue1641 Double MAX_ALLOCATING_THREADS to fix segfaults with Go and Octave	2018-07-04 08:19:11 +02:00
Martin Kroeker	5f2a3c05cd	Revert "Rewrite &= -> = and simplify the initial blocking phase."	2018-07-03 21:42:28 +02:00
Martin Kroeker	d0ec4325cf	Add cpuid for AMD Ryzen 2	2018-07-03 21:03:24 +02:00
Martin Kroeker	a49203b48c	Double MAX_ALLOCATING_THREADS to fix segfaults with Go and Octave for #1641	2018-07-03 17:35:54 +02:00
Martin Kroeker	9d15a3bd16	Fix typo that broke compilation with DYNAMIC_ARCH and NO_AVX2 fixes 1659	2018-07-02 14:40:41 +02:00
Martin Kroeker	3d3c19717c	Merge pull request #1655 from martin-frbg/issue1641 Fix apparent off-by-one error in calculation of MAX_ALLOCATING_THREADS	2018-07-01 08:41:22 +02:00
Martin Kroeker	4e9c34018e	Fix apparent off-by-one error in calculation of MAX_ALLOCATING_THREADS fixes #1641	2018-06-30 23:57:50 +02:00
Martin Kroeker	750162a05f	Try gradual fallback for cores not in the dynamic core list	2018-06-25 21:02:31 +02:00
Martin Kroeker	e6d93f20f1	Merge pull request #2 from martin-frbg/develop merge develop	2018-06-25 20:48:10 +02:00
Craig Donner	0144068537	Rewrite &= -> = and simplify the initial blocking phase.	2018-06-25 15:08:55 +01:00
Martin Kroeker	1833a67071	Add support for a user-defined list of dynamic targets	2018-06-23 19:42:15 +02:00
Craig Donner	28c28ed275	Fix data races reported by TSAN.	2018-06-21 16:41:02 +01:00
oon3m0oo	a399d00425	Further improvements to memory.c. (#1625)	2018-06-20 22:04:03 +02:00
    - Compiler TLS is now used only when the compiler supports it
    - If compiler TLS is unsupported, we use platform-specific TLS
    - Only one variable (an index) is now in TLS
    - We only access TLS once per alloc, and never when freeing
    - Allocation / release info is now stored within the allocation itself, by over-allocating; this saves having external structures do the bookkeeping, and reduces some of the redundant data that was being stored (such as addresses)
    - We never hit the alloc lock when not using SMP or when using OpenMP (that was my fault)
    - Now that there are fewer tracking structures I think this is a bit easier to read than before
Martin Kroeker	5a6a2bed9a	Merge pull request #1623 from fenrus75/fast-thread Initialize only the required subset of the jobs array, fix barriers and improve switch ratio on SkylakeX and Haswell. For issue #1622	2018-06-18 09:02:40 +02:00
Martin Kroeker	2d8cc7193a	Support upcoming Intel Cannon Lake CPUs as Skylake X (#1621) * Support upcoming Cannon Lake as Skylake X	2018-06-17 23:38:14 +02:00
Arjan van de Ven	73de17664d	Add missing barriers in gemm scheduler. A few places in the gemm scheduler code were missing barriers; the code likely worked OK due to heavy use of volatile / _Atomic, but there's no reason to get this incorrect	2018-06-17 17:50:43 +00:00
Arjan van de Ven	d148ec4ea1	Don't use _Atomic for jobs sometimes...	2018-06-17 15:39:15 +00:00
    The use of _Atomic leads to really bad code generation in the compiler (on x86, you get 2 "mfence" memory barriers around each access with gcc8, despite x86 being ordered and cache coherent). But there's a fallback in the code that just uses volatile which is more than plenty in practice. If we're nervous about cross thread synchronization for these variables, we should make the YIELD function be a compiler/memory barrier instead.

    performance before (after last commit)

    Matrix    SGEMM cycles   MPC          DGEMM cycles   MPC
    48 x 48        10630.0  10.6     0.7%      18112.8   6.2    -0.7%
    64 x 64        20374.8  13.0     1.9%      40487.0   6.5     0.4%
    65 x 65       141955.2   1.9  -428.3%     146708.8   1.9  -179.2%
    80 x 80       178921.1   2.9  -369.6%     186032.7   2.8  -156.6%
    96 x 96       205436.2   4.3  -233.4%     224513.1   3.9   -97.0%
    112 x 112     244408.2   5.8  -162.7%     262158.7   5.4   -47.1%
    128 x 128     321334.5   6.5  -141.3%     333829.0   6.3   -29.2%

    Performance with this patch (roughly a 2x improvement):

    Matrix    SGEMM cycles   MPC          DGEMM cycles   MPC
    48 x 48        10756.0  10.5    -0.5%      18296.7   6.1    -1.7%
    64 x 64        20490.0  12.9     1.4%      40615.0   6.5     0.0%
    65 x 65        83528.3   3.3  -210.9%      96319.0   2.9   -83.3%
    80 x 80       101453.5   5.1  -166.3%     128021.7   4.0   -76.6%
    96 x 96       149795.1   5.9  -143.1%     168059.4   5.3   -47.4%
    112 x 112     191481.2   7.3  -105.8%     204165.0   6.9   -14.6%
    128 x 128     265019.2   7.9   -99.0%     272006.4   7.7    -5.3%
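
Note on commit 5b708e5eb1 (preferred sizes): the message describes a 1024-wide dimension split across 10 threads landing on chunks of about 102, which is not a multiple of any vector width. The sketch below is only an illustration of that arithmetic, not the code from the patch. The names split_columns and round_up_width and the multiple of 16 are assumptions made for this example; the actual define and its value live in the OpenBLAS sources.

```c
/* Hypothetical helper (not from OpenBLAS): round `width` up to the next
 * multiple of `multiple`, but never hand out more than `remaining`. */
static long round_up_width(long remaining, long width, long multiple)
{
    if (multiple <= 1)
        return width;                 /* no preference: keep the even split */
    long rounded = ((width + multiple - 1) / multiple) * multiple;
    return rounded > remaining ? remaining : rounded;
}

/* Hypothetical splitter: divide `n` columns over at most `nthreads` chunks.
 * Each chunk starts from an even share of what is left and is then rounded
 * up to the preferred multiple. Returns the number of chunks written. */
static int split_columns(long n, int nthreads, long multiple, long *widths)
{
    int used = 0;
    long remaining = n;

    while (remaining > 0 && used < nthreads) {
        int left = nthreads - used;                   /* chunks still to fill */
        long width = (remaining + left - 1) / left;   /* ceil(remaining/left) */
        width = round_up_width(remaining, width, multiple);
        widths[used++] = width;
        remaining -= width;
    }
    return used;
}
```

Under these assumptions, split_columns(1024, 10, 1, w) yields chunks of 103 and 102, while split_columns(1024, 10, 16, w) yields four chunks of 112 and six of 96, all multiples of 16. Whether that matches the chunk sizes the real scheduler picks depends on the kernel's preferred size, which this page does not state.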


631 Commits