OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Martin Kroeker	5cabda79d0	Merge pull request #2117 from martin-frbg/issue2114 Fix errors in cpu affinity setup with glibc 2.6	2019-05-07 18:18:16 +02:00
Martin Kroeker	a6a8cc2b7f	Fix errors in cpu enumeration with glibc 2.6 for #2114	2019-05-07 13:34:52 +02:00
Martin Kroeker	a387a23518	Merge pull request #2101 from luzpaz/misc-typos Misc. typo fixes in comments and documentation	2019-05-04 22:28:29 +02:00
Martin Kroeker	b43c8382c8	Correct argument of CPU_ISSET for glibc <2.5 fixes #2104	2019-05-01 10:46:46 +02:00
luz.paz	daf2fec12d	Misc. typo fixes Found via `codespell -q 3 -w -L ith,als,dum,nd,amin,nto,wis,ba -S ./relapack,./kernel,./lapack-netlib`	2019-04-29 17:03:56 -04:00
Jeff Baylor	40e53e52d6	snprintf define consolidated to common.h	2019-04-22 17:01:34 -07:00
Rashmica Gupta	bcdf1d4917	Add in runtime CPU detection for POWER.	2019-04-09 14:20:16 +10:00
Erik M. Bray	8ba9e2a61a	Also call CloseHandle on each thread, as well as on the event so as to not leak thread handles.	2019-03-19 11:21:44 +01:00
Erik M. Bray	4ad694eda1	Fix for #2063 : The DllMain used in Cygwin did not run the thread memory pool cleanup upon THREAD_DETACH which is needed when compiled with USE_TLS=1.	2019-03-19 09:26:50 +01:00
Martin Kroeker	3ce28fb81a	Merge pull request #2055 from martin-frbg/atomid Add CPUID data for Intel Denverton (as Nehalem)	2019-03-12 22:57:07 +01:00
Martin Kroeker	04f2226ea6	Add Intel Denverton	2019-03-12 16:09:55 +01:00
Martin Kroeker	4741ce803b	Merge pull request #2045 from martin-frbg/2033-3 Do not compile in AVX512 check if AVX support is disabled	2019-03-06 22:40:26 +01:00
Martin Kroeker	11cfd0bd75	Do not compile in AVX512 check if AVX support is disabled. The xgetbv function depends on NO_AVX being undefined - we could change that too, but that combo is unlikely to work anyway	2019-03-05 16:04:25 +01:00
Martin Kroeker	d7b2c53c0b	Merge pull request #2039 from brada4/meminit Address warning in memory.c	2019-03-05 12:11:15 +01:00
Martin Kroeker	10d841d8b9	Merge pull request #2026 from martin-frbg/trmv_threads Correct range limiting in trmv_thread and re-enable TRMV multithreading	2019-03-04 15:08:31 +01:00
Martin Kroeker	6c83b878f6	Merge pull request #2040 from martin-frbg/locks2002 Restore locking optimizations for OpenMP case	2019-03-04 15:07:14 +01:00
Martin Kroeker	af480b02a4	Restore locking optimizations for OpenMP case restore another accidentally dropped part of #1468 that was missed in #2004 to address performance regression reported in #1461	2019-03-03 14:17:07 +01:00
Andrew	e4a79be6bb	address warning introed with #1814 et al	2019-03-03 09:05:11 +02:00
Martin Kroeker	45333d5793	Fix error introduced during cleanup	2019-02-19 22:16:33 +01:00
Martin Kroeker	78d9910236	Correct range_n limiting same bug as seen in #1388, somehow missed in corresponding PR #1389	2019-02-19 20:59:48 +01:00
Martin Kroeker	03a2bf2602	Fix potential memory leak in cpu enumeration on Linux (#2008) * Fix potential memory leak in cpu enumeration with glibc An early return after a failed call to sched_getaffinity would leak the previously allocated cpu_set_t. Wrong calculation of the size argument in that call increased the likelihood of that failure. Fixes #2003	2019-02-10 23:24:45 +01:00
Martin Kroeker	69edc5bbe7	Restore dropped patches in the non-TLS branch of memory.c (#2004 ) * Restore dropped patches in the non-TLS branch of memory.c As discovered in #2002, the reintroduction of the "original" non-TLS version of memory.c as an alternate branch had inadvertently used `ba1f91f` rather than `a8002e2` , thereby dropping the commits for #1450, #1468, #1501, #1504 and #1520.	2019-02-07 20:06:13 +01:00
caiyu	29dc72889f	Add support for Hygon Dhyana	2019-01-16 14:25:19 +08:00
Martin Kroeker	dbc9a060ef	Fix missing braces in support_av() call	2019-01-14 22:41:31 +01:00
Martin Kroeker	21c0f2af7b	Merge pull request #1957 from martin-frbg/issue1954 Move TLS key deletion to openblas_quit	2019-01-10 12:04:08 +01:00
Martin Kroeker	ad2c386d6a	Move TLS key deletion to openblas_quit fixes #1954 (as suggested by thrasibule in that issue)	2019-01-10 00:32:50 +01:00
Martin Kroeker	31ed19e8b9	Add message for SkylakeX and KNL fallbacks to Haswell	2019-01-05 19:41:13 +01:00
Martin Kroeker	e1574fa2b4	Add xcr0 (os support) check	2019-01-05 18:08:02 +01:00
Martin Kroeker	ae1d1f74f7	Query AVX2 and AVX512 capability for runtime cpu selection	2019-01-05 16:55:33 +01:00
Martin Kroeker	8643521127	Merge pull request #1943 from martin-frbg/issue1748 Re-enable loop unrolling in trmv and remove the scary warning	2018-12-30 20:07:01 +01:00
Martin Kroeker	5a720cf9ca	Re-enable loop unrolling in trmv and remove the scary warning fixes #1748 as that half of the fix for #1332 appears to have been an overreaction on my part.	2018-12-30 15:22:37 +01:00
Martin Kroeker	ccd5945d38	Merge pull request #1942 from martin-frbg/issue1720 Delete the pthread key on cleanup in TLS mode	2018-12-30 14:47:05 +01:00
Martin Kroeker	bba1e67269	Delete the pthread key on cleanup in TLS mode to avoid a crash when OpenBLAS was loaded via dlopen and libc tries to clean up the leaked TLS after dlclose Fixes #1720	2018-12-29 21:59:31 +01:00
Martin Kroeker	f343ed65b5	Avoid taking the root of a negative number Fixes #1924 where numpy 1.17+ would report the (transient) FE_INVALID exception raised for the domain error.	2018-12-22 22:30:29 +01:00
Martin Kroeker	0bf6d74e5f	Fix typo in previous commit for arm dynamic arch	2018-12-07 19:37:33 +01:00
Martin Kroeker	2b355592e3	Make sure to use the arm version of dynamic.c in ARM64 DYNAMIC_ARCH cf. #1908	2018-12-07 16:25:55 +01:00
Andrew	2601cd58ab	remove surplus locking code , only enabled w x86, disabled or never enabled on all others	2018-11-30 11:38:19 +01:00
Martin Kroeker	97d7298973	call it OpenBLAS not just version	2018-11-29 11:52:08 +01:00
Martin Kroeker	de0d0ed52f	Improve formatting of config output	2018-11-29 11:28:19 +01:00
Martin Kroeker	816775e309	Add version information to openblas_get_config output	2018-11-29 00:06:44 +01:00
Martin Kroeker	f72fdf525c	Merge pull request #1875 from martin-frbg/issue1851 Serialize accesses to parallelized level3 functions from multiple cal…	2018-11-25 20:53:46 +01:00
Martin Kroeker	113cb00b95	fix missing parenthesis	2018-11-19 21:01:36 +01:00
Martin Kroeker	5192651706	Add CriticalSection handling instead of mutexes for Windows	2018-11-19 17:58:22 +01:00
Martin Kroeker	2e6fae2aad	Serialize accesses to parallelized level3 functions from multiple callers for #1851	2018-11-19 14:02:50 +01:00
Martin Kroeker	368d14f8c8	Fix harmless typo fixes #1872	2018-11-16 14:58:28 +01:00
Martin Kroeker	0427277cef	Allow optimization for small m, large n only if it can be made threadsafe otherwise the introduction of a static array in `8e5a108` to improve #532 breaks concurrent calls from multiple threads as seen in #1844	2018-11-10 15:45:54 +01:00
Arjan van de Ven	5b708e5eb1	sgemm/dgemm: add a way for an arch kernel to specify preferred sizes The current gemm threading code can make very unfortunate choices, for example on my 10 core system a 1024x1024x1024 matrix multiply ends up chunking into blocks of 102... which is not a vector friendly size and performance ends up horrible. This patch adds a helper define where an architecture can specify a preference for size multiples. This is different from existing defines that are minimum sizes and such. The performance increase with this patch for the 1024x1024x1024 sgemm is 2.3x (!!)	2018-11-01 01:43:20 +00:00
Martin Kroeker	f5595d0262	Merge pull request #1843 from martin-frbg/aix_numprocs Add get_num_procs implementation for AIX	2018-10-31 21:25:15 +01:00
Martin Kroeker	326d394a0f	Add get_num_procs implementation for AIX (and copy HAIKU implementation to the non-TLS version of the code as well)	2018-10-31 18:38:22 +01:00
Erik M. Bray	38cf5d9364	ensure that threading has been initialized in the first place before calling openblas_set_num_threads	2018-10-28 21:16:52 +00:00
Ashwin Sekhar T K	d5aeff636f	ARM64: Enable DYNAMIC_ARCH Enable DYNAMIC_ARCH feature on ARM64. This patch uses the cpuid feature in linux kernel to detect the core type at runtime (https://www.kernel.org/doc/Documentation/arm64/cpu-feature-registers.txt). If this feature is missing in kernel, then the user should use the OPENBLAS_CORETYPE env variable to select the desired core type.	2018-10-22 01:49:35 -07:00
Ashwin Sekhar T K	d50abc8903	ARM64: Move parameters from parameter.c to param.h Remove the runtime setting of P, Q, R parameters for targets ARMV8, THUNDERX2T99. Instead set them as constants in param.h at compile time.	2018-10-22 01:45:51 -07:00
Ashwin Sekhar T K	21f46a1cf2	ARM64: Use THUNDERX2T99 Neon Kernels for ARMV8 Currently the generic ARMV8 target uses C implementations for many routines. Replace these with the neon implementations written for THUNDERX2T99 target which are up to 6x faster for certain routines.	2018-10-17 10:44:37 -07:00
Andrew	3439158dea	address #1782 2nd loop	2018-10-03 21:20:50 +02:00
Martin Kroeker	28aa94bf4b	Include thread numbers in failure message from blas_thread_init to aid in debugging cases like #1767	2018-09-22 14:00:15 +02:00
Martin Kroeker	1ad1e79062	Catch inadvertent USE_TLS=0 declaration for #1766	2018-09-19 18:03:43 +02:00
Martin Kroeker	b402626509	Do not use the new TLS code for non-threaded builds even if USE_TLS is set Workaround for #1761 as that exposed a problem in the new code (which was intended to speed up multithreaded code only anyway).	2018-09-16 12:43:36 +02:00
Martin Kroeker	b55690a659	typo fix	2018-08-26 11:31:07 +02:00
Martin Kroeker	b902a40986	Rewrite glibc version check	2018-08-26 11:18:02 +02:00
Martin Kroeker	5991d1a6cd	Update memory.c	2018-08-25 22:12:40 +02:00
Martin Kroeker	b1b743f434	Merge branch 'develop' into interim033	2018-08-25 19:45:19 +02:00
Martin Kroeker	fd42ca462d	Combo of default pre-0.3.1 memory.c and band-aided version of PR1739	2018-08-25 19:35:16 +02:00
Zoltán Mizsei	6463bffd59	Haiku supporting patches	2018-08-02 20:49:14 +02:00
Martin Kroeker	8ef7d4fb54	Merge pull request #1706 from oon3m0oo/develop Fix #1705 where we incorrectly calculate page locations.	2018-08-02 18:53:34 +02:00
Craig Donner	6400868e55	Fix #1705 where we incorrectly calculate page locations. Since we now use an allocation size that isn't a multiple of PAGESIZE, finding the pages for run_bench wasn't terminating properly. Now we detect if we've found enough pages for the allocation and terminate the loop.	2018-08-02 16:21:19 +01:00
Martin Kroeker	66fcdd5be8	Merge pull request #1695 from martin-frbg/issue1692 Unset memory table entry, not just the local pointer to it on shutdown	2018-07-22 16:34:09 +02:00
Martin Kroeker	43ac839c16	Unset memory table entry, not just the temporary pointer to it on shutdown to fix crash with multiple instances of OpenBLAS, #1692	2018-07-22 09:19:19 +02:00
Martin Kroeker	7ba5936ecd	Merge pull request #1688 from martin-frbg/issue1673 Temporarily disable special handling of OPENMP thread memory allocation	2018-07-19 19:03:45 +02:00
Martin Kroeker	b14f44d2ad	Temporarily disable special handling of OPENMP thread memory allocation for issue #1673	2018-07-19 08:57:56 +02:00
Martin Kroeker	36aea5ce2d	Merge pull request #1680 from martin-frbg/snprint Fix wrong redefinitions of snprintf for older MSVC	2018-07-12 14:05:13 +02:00
Martin Kroeker	571e9de2ac	Fix definition of snprintf for MSVC MS _snprintf_s takes an additional argument for the size of the buffer, so is not a direct replacement (utest/ctest.h from which I copied was wrong)	2018-07-12 11:42:25 +02:00
Martin Kroeker	448ed15115	Merge pull request #1678 from martin-frbg/issue1677 Define snprintf for older versions of MSVC	2018-07-12 09:21:34 +02:00
Martin Kroeker	045fb5ea2c	Define snprintf for older versions of MSVC for #1677	2018-07-12 07:30:58 +02:00
Martin Kroeker	4dd70d98d7	Merge pull request #1667 from xianyi/revert-1642-develop Revert "Rewrite &= -> = and simplify the initial blocking phase."	2018-07-04 08:27:21 +02:00
Martin Kroeker	504310eeb9	Merge pull request #1665 from martin-frbg/cpuid-ryzen2 Add cpuid for AMD Ryzen 2	2018-07-04 08:19:40 +02:00
Martin Kroeker	ea1f39518f	Merge pull request #1663 from martin-frbg/issue1641 Double MAX_ALLOCATING_THREADS to fix segfaults with Go and Octave	2018-07-04 08:19:11 +02:00
Martin Kroeker	5f2a3c05cd	Revert "Rewrite &= -> = and simplify the initial blocking phase."	2018-07-03 21:42:28 +02:00
Martin Kroeker	d0ec4325cf	Add cpuid for AMD Ryzen 2	2018-07-03 21:03:24 +02:00
Martin Kroeker	a49203b48c	Double MAX_ALLOCATING_THREADS to fix segfaults with Go and Octave for #1641	2018-07-03 17:35:54 +02:00
Martin Kroeker	9d15a3bd16	Fix typo that broke compilation with DYNAMIC_ARCH and NO_AVX2 fixes 1659	2018-07-02 14:40:41 +02:00
Martin Kroeker	3d3c19717c	Merge pull request #1655 from martin-frbg/issue1641 Fix apparent off-by-one error in calculation of MAX_ALLOCATING_THREADS	2018-07-01 08:41:22 +02:00
Martin Kroeker	4e9c34018e	Fix apparent off-by-one error in calculation of MAX_ALLOCATING_THREADS fixes #1641	2018-06-30 23:57:50 +02:00
Martin Kroeker	750162a05f	Try gradual fallback for cores not in the dynamic core list	2018-06-25 21:02:31 +02:00
Martin Kroeker	e6d93f20f1	Merge pull request #2 from martin-frbg/develop merge develop	2018-06-25 20:48:10 +02:00
Craig Donner	0144068537	Rewrite &= -> = and simplify the initial blocking phase.	2018-06-25 15:08:55 +01:00
Martin Kroeker	1833a67071	Add support for a user-defined list of dynamic targets	2018-06-23 19:42:15 +02:00
Craig Donner	28c28ed275	Fix data races reported by TSAN.	2018-06-21 16:41:02 +01:00
oon3m0oo	a399d00425	Further improvements to memory.c. (#1625) - Compiler TLS is now used only when the compiler supports it - If compiler TLS is unsupported, we use platform-specific TLS - Only one variable (an index) is now in TLS - We only access TLS once per alloc, and never when freeing - Allocation / release info is now stored within the allocation itself, by over-allocating; this saves having external structures do the bookkeeping, and reduces some of the redundant data that was being stored (such as addresses) - We never hit the alloc lock when not using SMP or when using OpenMP (that was my fault) - Now that there are fewer tracking structures I think this is a bit easier to read than before	2018-06-20 22:04:03 +02:00
Martin Kroeker	5a6a2bed9a	Merge pull request #1623 from fenrus75/fast-thread Initialize only the required subset of the jobs array, fix barriers and improve switch ratio on SkylakeX and Haswell. For issue #1622	2018-06-18 09:02:40 +02:00
Martin Kroeker	2d8cc7193a	Support upcoming Intel Cannon Lake CPUs as Skylake X (#1621 ) * Support upcoming Cannon Lake as Skylake X	2018-06-17 23:38:14 +02:00
Arjan van de Ven	73de17664d	Add missing barriers in gemm scheduler a few places in the gemm scheduler code were missing barriers; the code likely worked OK due to heavy use of volatile / _Atomic but there's no reason to get this incorrect	2018-06-17 17:50:43 +00:00
Arjan van de Ven	d148ec4ea1	Don't use _Atomic for jobs sometimes...	2018-06-17 15:39:15 +00:00
    The use of _Atomic leads to really bad code generation in the compiler (on x86, you get 2 "mfence" memory barriers around each access with gcc8, despite x86 being ordered and cache coherent). But there's a fallback in the code that just uses volatile which is more than plenty in practice. If we're nervous about cross thread synchronization for these variables, we should make the YIELD function be a compiler/memory barrier instead.

    Performance before (after last commit):

    Matrix       SGEMM cycles   MPC              DGEMM cycles   MPC
    48 x 48           10630.0  10.6      0.7%         18112.8   6.2     -0.7%
    64 x 64           20374.8  13.0      1.9%         40487.0   6.5      0.4%
    65 x 65          141955.2   1.9   -428.3%        146708.8   1.9   -179.2%
    80 x 80          178921.1   2.9   -369.6%        186032.7   2.8   -156.6%
    96 x 96          205436.2   4.3   -233.4%        224513.1   3.9    -97.0%
    112 x 112        244408.2   5.8   -162.7%        262158.7   5.4    -47.1%
    128 x 128        321334.5   6.5   -141.3%        333829.0   6.3    -29.2%

    Performance with this patch (roughly a 2x improvement):

    Matrix       SGEMM cycles   MPC              DGEMM cycles   MPC
    48 x 48           10756.0  10.5     -0.5%         18296.7   6.1     -1.7%
    64 x 64           20490.0  12.9      1.4%         40615.0   6.5      0.0%
    65 x 65           83528.3   3.3   -210.9%         96319.0   2.9    -83.3%
    80 x 80          101453.5   5.1   -166.3%        128021.7   4.0    -76.6%
    96 x 96          149795.1   5.9   -143.1%        168059.4   5.3    -47.4%
    112 x 112        191481.2   7.3   -105.8%        204165.0   6.9    -14.6%
    128 x 128        265019.2   7.9    -99.0%        272006.4   7.7     -5.3%
Arjan van de Ven	9e162146a9	Only initialize the part of the jobs array that will get used	2018-06-17 15:32:03 +00:00
    The jobs array is getting initialized in O(compiled cpus^2) complexity. Distros and people with bigger systems will use pretty high values (128 or 256 or more) for this value, leading to interesting bubbles in performance.

    Baseline (single threaded performance) gets roughly 13 - 15 multiplications per cycle in the interesting range (threading kicks in at 65x65 mult by 65x65). The hardware is capable of 32 multiplications per cycle theoretically.

    Matrix       SGEMM cycles   MPC              DGEMM cycles   MPC
    48 x 48           10703.9  10.6      0.0%         17990.6   6.3      0.0%
    64 x 64           20778.4  12.8      0.0%         40629.2   6.5      0.0%
    65 x 65           26869.9  10.3      0.0%         52545.7   5.3      0.0%
    80 x 80           38104.5  13.5      0.0%         72492.7   7.1      0.0%
    96 x 96           61626.4  14.4      0.0%        113983.8   7.8      0.0%
    112 x 112         91803.8  15.3      0.0%        180987.3   7.8      0.0%
    128 x 128        133161.4  15.8      0.0%        258374.3   8.1      0.0%

    When threading is turned on TARGET=SKYLAKEX F_COMPILER=GFORTRAN SHARED=1 DYNAMIC_THREADS=1 USE_OPENMP=0 NUM_THREADS=128

    Matrix       SGEMM cycles   MPC              DGEMM cycles   MPC
    48 x 48           10725.9  10.5     -0.2%         18134.9   6.2     -0.8%
    64 x 64           20500.6  12.9      1.3%         40929.1   6.5     -0.7%
    65 x 65         2040832.1   0.1  -7495.2%       2097633.6   0.1  -3892.0%
    80 x 80         2063129.1   0.2  -5314.4%       2119925.2   0.2  -2824.3%
    96 x 96         2070374.5   0.4  -3259.6%       2173604.4   0.4  -1806.9%
    112 x 112       2111721.5   0.7  -2169.6%       2263330.8   0.6  -1170.0%
    128 x 128       2276181.5   0.9  -1609.3%       2377228.9   0.9   -820.1%

    There is a deep deep cliff once you hit 65x65

    With this patch

    Matrix       SGEMM cycles   MPC              DGEMM cycles   MPC
    48 x 48           10630.0  10.6      0.7%         18112.8   6.2     -0.7%
    64 x 64           20374.8  13.0      1.9%         40487.0   6.5      0.4%
    65 x 65          141955.2   1.9   -428.3%        146708.8   1.9   -179.2%
    80 x 80          178921.1   2.9   -369.6%        186032.7   2.8   -156.6%
    96 x 96          205436.2   4.3   -233.4%        224513.1   3.9    -97.0%
    112 x 112        244408.2   5.8   -162.7%        262158.7   5.4    -47.1%
    128 x 128        321334.5   6.5   -141.3%        333829.0   6.3    -29.2%

    The cliff is very significantly reduced. (more to follow)
Martin Kroeker	47bf0dba8f	Add build-time option for OMP scheduler; document MULTITHREAD_THRESHOLD range (#1620 ) * Allow choosing the OpenMP scheduler and add range hint for GEMM_MULTITHREAD_THRESHOLD * Amended description of GEMM_MULTITHREAD_THRESHOLD to reflect #742 making it track floating point operations rather than matrix size	2018-06-15 11:25:05 +02:00
Craig Donner	bf40f806ef	Remove the need for most locking in memory.c.	2018-06-14 16:54:58 +01:00
    Using thread local storage for tracking memory allocations means that threads no longer have to lock at all when doing memory allocations / frees. This particularly helps the gemm driver since it does an allocation per invocation. Even without threading at all, this helps, since even calling a lock with no contention has a cost:

    Before this change, no threading:
    ```
    ----------------------------------------------------
    Benchmark          Time           CPU     Iterations
    ----------------------------------------------------
    BM_SGEMM/4       102 ns        102 ns     13504412
    BM_SGEMM/6       175 ns        175 ns      7997580
    BM_SGEMM/8       205 ns        205 ns      6842073
    BM_SGEMM/10      266 ns        266 ns      5294919
    BM_SGEMM/16      478 ns        478 ns      2963441
    BM_SGEMM/20      690 ns        690 ns      2144755
    BM_SGEMM/32     1906 ns       1906 ns       716981
    BM_SGEMM/40     2983 ns       2983 ns       473218
    BM_SGEMM/64     9421 ns       9422 ns       148450
    BM_SGEMM/72    12630 ns      12631 ns       112105
    BM_SGEMM/80    15845 ns      15846 ns        89118
    BM_SGEMM/90    25675 ns      25676 ns        54332
    BM_SGEMM/100   29864 ns      29865 ns        47120
    BM_SGEMM/112   37841 ns      37842 ns        36717
    BM_SGEMM/128   56531 ns      56532 ns        25361
    BM_SGEMM/140   75886 ns      75888 ns        18143
    BM_SGEMM/150   98493 ns      98496 ns        14299
    BM_SGEMM/160  102620 ns     102622 ns        13381
    BM_SGEMM/170  135169 ns     135173 ns        10231
    BM_SGEMM/180  146170 ns     146172 ns         9535
    BM_SGEMM/189  190226 ns     190231 ns         7397
    BM_SGEMM/200  194513 ns     194519 ns         7210
    BM_SGEMM/256  396561 ns     396573 ns         3531
    ```

    with this change:
    ```
    ----------------------------------------------------
    Benchmark          Time           CPU     Iterations
    ----------------------------------------------------
    BM_SGEMM/4        95 ns         95 ns     14500387
    BM_SGEMM/6       166 ns        166 ns      8381763
    BM_SGEMM/8       196 ns        196 ns      7277044
    BM_SGEMM/10      256 ns        256 ns      5515721
    BM_SGEMM/16      463 ns        463 ns      3025197
    BM_SGEMM/20      636 ns        636 ns      2070213
    BM_SGEMM/32     1885 ns       1885 ns       739444
    BM_SGEMM/40     2969 ns       2969 ns       472152
    BM_SGEMM/64     9371 ns       9372 ns       148932
    BM_SGEMM/72    12431 ns      12431 ns       112919
    BM_SGEMM/80    15615 ns      15616 ns        89978
    BM_SGEMM/90    25397 ns      25398 ns        55041
    BM_SGEMM/100   29445 ns      29446 ns        47540
    BM_SGEMM/112   37530 ns      37531 ns        37286
    BM_SGEMM/128   55373 ns      55375 ns        25277
    BM_SGEMM/140   76241 ns      76241 ns        18259
    BM_SGEMM/150  102196 ns     102200 ns        13736
    BM_SGEMM/160  101521 ns     101525 ns        13556
    BM_SGEMM/170  136182 ns     136184 ns        10567
    BM_SGEMM/180  146861 ns     146864 ns         9035
    BM_SGEMM/189  192632 ns     192632 ns         7231
    BM_SGEMM/200  198547 ns     198555 ns         6995
    BM_SGEMM/256  392316 ns     392330 ns         3539
    ```

    Before, when built with USE_THREAD=1, GEMM_MULTITHREAD_THRESHOLD = 4, the cost of small matrix operations was overshadowed by thread locking (look smaller than 32) even when not explicitly spawning threads:
    ```
    ----------------------------------------------------
    Benchmark          Time           CPU     Iterations
    ----------------------------------------------------
    BM_SGEMM/4       328 ns        328 ns      4170562
    BM_SGEMM/6       396 ns        396 ns      3536400
    BM_SGEMM/8       418 ns        418 ns      3330102
    BM_SGEMM/10      491 ns        491 ns      2863047
    BM_SGEMM/16      710 ns        710 ns      2028314
    BM_SGEMM/20      871 ns        871 ns      1581546
    BM_SGEMM/32     2132 ns       2132 ns       657089
    BM_SGEMM/40     3197 ns       3196 ns       437969
    BM_SGEMM/64     9645 ns       9645 ns       144987
    BM_SGEMM/72    35064 ns      32881 ns        50264
    BM_SGEMM/80    37661 ns      35787 ns        42080
    BM_SGEMM/90    36507 ns      36077 ns        40091
    BM_SGEMM/100   32513 ns      31850 ns        48607
    BM_SGEMM/112   41742 ns      41207 ns        37273
    BM_SGEMM/128   67211 ns      65095 ns        21933
    BM_SGEMM/140   68263 ns      67943 ns        19245
    BM_SGEMM/150  121854 ns     115439 ns        10660
    BM_SGEMM/160  116826 ns     115539 ns        10000
    BM_SGEMM/170  126566 ns     122798 ns        11960
    BM_SGEMM/180  130088 ns     127292 ns        11503
    BM_SGEMM/189  120309 ns     116634 ns        13162
    BM_SGEMM/200  114559 ns     110993 ns        10000
    BM_SGEMM/256  217063 ns     207806 ns         6417
    ```

    and after, it's gone (note this includes my other change which reduces calls to num_cpu_avail):
    ```
    ----------------------------------------------------
    Benchmark          Time           CPU     Iterations
    ----------------------------------------------------
    BM_SGEMM/4        95 ns         95 ns     12347650
    BM_SGEMM/6       166 ns        166 ns      8259683
    BM_SGEMM/8       193 ns        193 ns      7162210
    BM_SGEMM/10      258 ns        258 ns      5415657
    BM_SGEMM/16      471 ns        471 ns      2981009
    BM_SGEMM/20      666 ns        666 ns      2148002
    BM_SGEMM/32     1903 ns       1903 ns       738245
    BM_SGEMM/40     2969 ns       2969 ns       473239
    BM_SGEMM/64     9440 ns       9440 ns       148442
    BM_SGEMM/72    37239 ns      33330 ns        46813
    BM_SGEMM/80    57350 ns      55949 ns        32251
    BM_SGEMM/90    36275 ns      36249 ns        42259
    BM_SGEMM/100   31111 ns      31008 ns        45270
    BM_SGEMM/112   43782 ns      40912 ns        34749
    BM_SGEMM/128   67375 ns      64406 ns        22443
    BM_SGEMM/140   76389 ns      67003 ns        21430
    BM_SGEMM/150   72952 ns      71830 ns        19793
    BM_SGEMM/160   97039 ns      96858 ns        11498
    BM_SGEMM/170  123272 ns     122007 ns        11855
    BM_SGEMM/180  126828 ns     126505 ns        11567
    BM_SGEMM/189  115179 ns     114665 ns        11044
    BM_SGEMM/200   89289 ns      87259 ns        16147
    BM_SGEMM/256  226252 ns     222677 ns         7375
    ```

    I've also tested this with ThreadSanitizer and found no data races during execution. I'm not sure why 200 is always faster than its neighbors, we must be hitting some optimal cache size or something.
Martin Kroeker	63f7395fb4	Move some DYNAMIC_ARCH targets to new DYNAMIC_OLDER option	2018-06-09 16:31:38 +02:00
Martin Kroeker	38ad05bd04	Extend loop range to find SkylakeX in force_coretype	2018-06-05 10:26:49 +02:00
Martin Kroeker	8be027e4c6	Update dynamic.c	2018-06-04 14:36:39 +02:00
Martin Kroeker	ac7b6e3e9a	Fix misplaced endif	2018-06-04 08:23:40 +02:00
Martin Kroeker	ef626c6824	typo fix	2018-06-04 00:13:19 +02:00
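
Commit 5b708e5eb1 in the list above describes the GEMM threading code splitting a 1024-wide problem across 10 cores into chunks of about 102, which is not a vector-friendly size, and adds a way for an architecture to state a preferred size multiple. The arithmetic behind that idea can be sketched as below. This is only an illustration under stated assumptions, not the OpenBLAS implementation: the helper names `div_ceil` and `chunk_width` and the multiple of 16 used in the note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: integer division rounded up. */
static size_t div_ceil(size_t a, size_t b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

/* Hypothetical helper: width of the chunk handed to each thread when a
 * dimension of length n is split across nthreads, rounded up to the
 * nearest multiple of `multiple` and capped at n. */
static size_t chunk_width(size_t n, size_t nthreads, size_t multiple)
{
    assert(nthreads > 0 && multiple > 0);
    size_t width = div_ceil(n, nthreads);          /* even split */
    width = div_ceil(width, multiple) * multiple;  /* round up to the multiple */
    return width < n ? width : n;                  /* never exceed the dimension */
}
```

Under these assumptions, `chunk_width(1024, 10, 1)` yields 103 (ceiling division; the commit message quotes the rounded-down 102), while `chunk_width(1024, 10, 16)` yields 112, so each chunk is a whole number of 16-element steps and the last threads receive a shorter chunk or none.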

473 Commits
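
Commit a399d00425 in the list above mentions that allocation and release information is stored within the allocation itself by over-allocating, rather than in external tracking structures. The general technique can be sketched as follows. This is a minimal illustration and not the code in memory.c: the type `alloc_header`, its fields, and the functions `tracked_alloc`, `header_of` and `tracked_free` are all hypothetical names chosen for the example, and it assumes a C11 compiler for `max_align_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical bookkeeping record stored directly in front of each user
 * buffer. The union with max_align_t (C11) keeps the user pointer that
 * follows the header suitably aligned for any object type. */
typedef union {
    struct {
        size_t size; /* bytes the caller asked for */
        int slot;    /* illustrative pool slot index */
    } info;
    max_align_t align;
} alloc_header;

/* Allocate size bytes plus room for the header; return the user pointer. */
static void *tracked_alloc(size_t size, int slot)
{
    if (size > SIZE_MAX - sizeof(alloc_header))
        return NULL; /* header + size would overflow */
    alloc_header *h = malloc(sizeof *h + size);
    if (h == NULL)
        return NULL;
    h->info.size = size;
    h->info.slot = slot;
    return h + 1; /* user data starts right after the header */
}

/* Recover the header from a pointer returned by tracked_alloc. */
static alloc_header *header_of(void *user)
{
    assert(user != NULL);
    return (alloc_header *)user - 1;
}

/* Release a buffer returned by tracked_alloc; no external table lookup. */
static void tracked_free(void *user)
{
    if (user != NULL)
        free(header_of(user));
}
```

The design choice this illustrates is that freeing needs only the user pointer: the bookkeeping is found by stepping back one header, so no shared table has to be searched or locked.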