640 Commits

Author SHA1 Message Date
Rafael Cardoso Fernandes Sousa 0e8b4adf22 Remove unused commented code (#if directive) 2021-09-15 22:18:48 +00:00
Martin Kroeker fa8bf57768
Merge pull request #3380 from martin-frbg/structwarn
Remove extraneous qualifiers from struct definition
2021-09-15 07:19:09 +02:00
Martin Kroeker dd09f0173e
Remove extraneous qualifiers from struct definition 2021-09-14 21:52:26 +02:00
Martin Kroeker 2f8220d757
Add sbgemm 2021-09-14 16:14:43 +02:00
Martin Kroeker 5f6a609253
Add sbgemv 2021-09-14 16:13:57 +02:00
Wangyang Guo 045ed5c91d sbgemm: fix build error in BFLOAT16 disabled 2021-09-07 23:37:08 +08:00
Wangyang Guo 8356a604f0 sbgemm: cooperlake: tuning for block params 2021-09-07 21:30:46 +08:00
Martin Kroeker cd10d1c03b
Fix typo 2021-08-30 14:38:28 +02:00
Martin Kroeker 2db1a99aca
Clean up debug messages 2021-08-30 14:21:25 +02:00
Martin Kroeker 89fc5b8f4f
Fix unmap logic 2021-08-29 19:50:24 +02:00
Martin Kroeker 7fd12a5e69
Add likely() hints for gcc 2021-08-29 13:54:51 +02:00
Martin Kroeker 2ba9a567aa
Fix typo 2021-08-28 17:14:59 +02:00
Martin Kroeker b4b952eece
Add auxiliary tracking space for thread buffer frees too 2021-08-28 17:03:53 +02:00
Martin Kroeker 7d1becc575
Allocate an auxiliary struct when running out of preconfigured threads 2021-08-28 14:18:36 +02:00
Martin Kroeker 898212efcd
Actually add the message to the TLS section 2021-08-02 14:50:14 +02:00
Martin Kroeker 210a1584c5
Rebase source and edit TLS version of the message as well 2021-08-02 14:19:16 +02:00
Martin Kroeker f2a7a67f5a
Improve the "tried to allocate too many buffers" error message 2021-07-31 17:23:40 +02:00
Craig Watson 4d7dfe4845 Include Haiku in processor count checks 2021-07-27 09:00:30 +00:00
JonasZhou 0fca36c8c3 Add cpu detection support for Zhaoxin processors
Signed-off-by: JonasZhou <JonasZhou@zhaoxin.com>
2021-07-12 13:43:45 +08:00
River Dillon 2f6326a630 Remove <linux/unistd.h> 2021-07-10 00:36:07 -07:00
Martin Kroeker 8f22ac552b
Add vendor string Shanghai as successor to Centaur 2021-07-08 18:28:49 +02:00
Martin Kroeker eb2fdd3af0
Recognize newer Zhaoxin/Centaur processors as Nehalem 2021-07-08 12:23:15 +02:00
User User-User 750719528a bugz 2021-06-20 16:40:43 +02:00
User User-User 6423b282a1 dynamic_arch 2021-06-20 14:19:41 +02:00
Martin Kroeker 307c4c0786
Fix typo 2021-06-16 13:41:16 +02:00
Martin Kroeker e83df93975
Work around another recent macro name collision with winnt.h 2021-06-16 12:32:34 +02:00
Martin Kroeker cbfd3c87e1
Recognize Intel Ice Lake SP as Cooper Lake 2021-05-14 20:44:06 +02:00
Martin Kroeker 623d580b4c
Restore __volatile__ keyword 2021-04-16 10:27:32 +02:00
Martin Kroeker 186368ddc3
Fix compilation with CLANG 2021-03-16 16:52:57 +01:00
Martin Kroeker 1a3ad4b670
Fix signatures of the TLS-mode dll_callback and p_process_term functions for Win64 2021-02-22 19:40:36 +01:00
Peter Hawkins dbbf92c1d1 Fix race in blas_thread_shutdown.
blas_server_avail was read without holding server_lock. If multiple threads call blas_thread_shutdown simultaneously, for example, by calling fork(), then they can attempt to shut down multiple times. This can lead to a segmentation fault.
2021-02-18 13:46:50 -05:00
Martin Kroeker cb429d6b12
Merge pull request #3110 from martin-frbg/issue3108
Fix get_num_procs() in the USE_TLS branch for non-glibc systems
2021-02-18 15:45:25 +01:00
Martin Kroeker b0bded3f2f
Fix get_num_procs() in the USE_TLS branch for non-glibc systems 2021-02-18 11:14:05 +01:00
Martin Kroeker e4e5042e38
Recognize Intel Tiger Lake as SkylakeX 2021-02-11 20:17:11 +01:00
Martin Kroeker 0cc36770f1
Merge pull request #3073 from xoviat/embedded
add embedded option
2021-01-31 18:02:41 +01:00
Martin Kroeker eea0c0f2ed
Merge pull request #3085 from alexhenrie/memory_alloc
Fix null pointer check in blas_memory_alloc
2021-01-26 20:11:42 +01:00
Martin Kroeker 0cb9e9fc8d
Remove the VORTEX support bits again for now 2021-01-25 19:02:21 +01:00
Alex Henrie 113840da12 Fix null pointer check in blas_memory_alloc 2021-01-24 22:20:44 -07:00
Martin Kroeker deb2e66bcc
Add DYNAMIC_LIST support for ARM64 2021-01-24 23:18:52 +01:00
xoviat 2e8d6e8690 add functions for embedded 2021-01-23 22:12:17 -06:00
Martin Kroeker b94dab5250
patch to support power10 in builtin_cpu_is was backported to gcc 10.2, so allow that as well 2021-01-20 21:34:36 +01:00
Martin Kroeker 63fa3c3f8f
Require gcc 11 for builtin_cpu_is(power10)
fixes #3074
2021-01-20 15:41:04 +01:00
xoviat b60de4447a add cortex-m platform 2021-01-19 08:57:44 -06:00
Martin Kroeker 2c445be8ba
Merge pull request #3051 from martin-frbg/rocketlake
Add CPUID information for Intel Rocket Lake
2021-01-14 15:56:25 +01:00
Martin Kroeker 6fe0f1fab9
Label get_cpu_ftr as volatile to keep gcc from rearranging the code 2021-01-11 19:05:29 +01:00
Martin Kroeker 17c16f2a71
Implement builtin_cpu_is and limit cpu choices to P8 and P9 for NVIDIA compilers 2020-12-19 23:21:22 +01:00
Martin Kroeker 865676682d
Add Intel Rocket Lake 2020-12-14 22:40:23 +01:00
Martin Kroeker 6232237dba
Make fallback from P10 to P9 conditional on suitable compiler 2020-12-11 23:41:17 +01:00
Martin Kroeker 18d8a67485
Merge pull request #2994 from antonblanchard/power10-fixes
Power10 fixes
2020-12-11 23:37:30 +01:00
Martin Kroeker 83de62c20d
Merge pull request #3026 from martin-frbg/revert747
Revert PR747 - SYRK parameter changes for Haswell and related targets
2020-12-10 16:29:41 +01:00
gxw 4b548857d6 Add msa support for loongson
1. Using core loongson3r3 and loongson3r4 for loongson
2. Add DYNAMIC_ARCH for loongson

Change-Id: I1c6b54dbeca3a0cc31d1222af36a7e9bd6ab54c1
2020-12-09 10:28:46 +08:00
Martin Kroeker a554712439
remove extra/intermediate size step for min_jj introduced in PR747 2020-12-08 21:01:36 +01:00
Martin Kroeker 5d26223f4a
remove extra/intermediate size step of min_jj from PR747 2020-12-08 20:59:56 +01:00
Martin Kroeker bc5b1ddf0d
Merge pull request #3004 from martin-frbg/bsd_getauxval
ARM64 DYNAMIC_ARCH build fix for BSD/OSX
2020-11-23 08:35:12 +01:00
Martin Kroeker e7bf8ced6c
Build fix for systems that do not support getauxval 2020-11-22 20:20:28 +01:00
Martin Kroeker 5fa305172a
Use ifeq instead of ifdef for user-definable options 2020-11-22 16:29:56 +01:00
Martin Kroeker d3ff1f889f
Convert ifndefs to ifneq 2020-11-22 16:27:17 +01:00
Alexander Grund 60005eb47b
Don't overwrite blas_thread_buffer if already set
After a fork it is possible that blas_thread_buffer has already
allocated memory buffers: goto_set_num_threads does allocate those
already and it may be called by num_cpu_avail in case the OpenBLAS
NUM_THREADS differ from the OMP num threads.
This leads to a memory leak which can cause subsequent execution of BLAS
kernels to fail.

Fixes #2993
2020-11-19 14:51:51 +01:00
Anton Blanchard 043f3d6faa POWER10: Use POWER9 as a fallback
If the toolchain is too old, or the mma feature isn't set on a POWER10
fall back to the POWER9 loops.
2020-11-19 21:04:10 +11:00
Martin Kroeker ff16329cb7
Merge pull request #2972 from xiegengxin/rot-intrinsic
Improve the performance of rot by using AVX512 and AVX2 intrinsic
2020-11-08 22:43:00 +01:00
Gengxin Xie d9ba49165a Improve the performance of rot by using AVX512 and AVX2 intrinsic 2020-11-05 15:12:36 +08:00
Martin Kroeker aa21cb5217
Merge pull request #2960 from thrasibule/avx2_detection
fix avx2 detection
2020-10-31 20:24:21 +01:00
Guillaume Horel 1f564d729b fix avx2 detection
reword commits to make it clearer
2020-10-31 10:00:48 -04:00
Chen, Guobing a7b1f9b1bb Implementation of BF16 based gemv
1. Add a new API -- sbgemv to support bfloat16 based gemv
2. Implement a generic kernel for sbgemv
3. Implement an avx512-bf16 based kernel for sbgemv

Signed-off-by: Chen, Guobing <guobing.chen@intel.com>
2020-10-29 02:08:23 +08:00
Martin Kroeker 2207a16235
Merge pull request #2952 from martin-frbg/issue2931
Try to read cpu ID from /sys/devices/.../cpu0 if HWCAP_CPUID fails
2020-10-28 09:37:32 +01:00
Martin Kroeker b937d78a6d
Try to read cpu information from /sys/devices/system/cpu/cpu0 if HWCAP_CPUID fails 2020-10-27 17:51:32 +01:00
Martin Kroeker fd7da56965
Move definitions that are neither needed nor supported on SUNOS 2020-10-25 12:01:50 +01:00
Martin Kroeker ff65952e46
Move HAVE_P10_SUPPORT to the build system
to be able to include a binutils version check
2020-10-20 00:55:41 +02:00
Rajalakshmi Srinivasaraghavan b5d30b390d Fix build issues with bfloat16
This patch fixes compilation errors due to recent renaming from SH to SB
with BUILD_BFLOAT16.
2020-10-13 11:00:22 -05:00
Martin Kroeker 006c7f6671
Change "HALF" and "sh" to "BFLOAT16" and "sb" 2020-10-12 00:06:06 +02:00
Martin Kroeker 85154c2e18
Change "HALF" and "sh" to "BFLOAT16" and "sb" 2020-10-12 00:05:05 +02:00
Martin Kroeker 887e00fd7f
Adapt for supporting only a subset of variable types 2020-10-11 14:58:57 +02:00
Martin Kroeker 886a8e3190
Adapt for supporting only a subset of variable types 2020-10-11 14:57:32 +02:00
Martin Kroeker ac653c94f3
Merge branch 'develop' into issue2588-cmake 2020-10-11 13:57:07 +02:00
Martin Kroeker f032d8966e
Merge pull request #2874 from Flamefire/memory_fixes
Avoid out of bounds access on invalid memory free
2020-10-04 15:16:51 +02:00
Martin Kroeker f6e4cf2f9d
Merge pull request #2876 from Flamefire/omp_fork_fix
Lazily reinit threads after a fork in OMP mode
2020-10-03 22:52:17 +02:00
User User-User d2333e7842 aarch64 fix std=c18 compilation 2020-10-03 18:00:34 +03:00
Alexander Grund 3094fc6c83
Lazily reinit threads after a fork in OMP mode
This initializes the per-thread memory buffers which get
cleared/released on a fork via pthread_at_fork. Not doing so leads to
each thread calling blas_memory_alloc on almost every execution which
slows down the code significantly as the threads race for the memory
allocation using locks to serialize that.
2020-10-01 15:41:42 +02:00
Alexander Grund 3c05f54df8
Avoid out of bounds access on invalid memory free 2020-10-01 10:48:45 +02:00
Alexander Grund dee7c49938
Fix TABs and trailing space 2020-10-01 10:43:16 +02:00
Martin Kroeker 896bbd55e1
Add support for building only selected variable types 2020-09-26 23:25:55 +02:00
Martin Kroeker 357bff06b5
Add BUILD_vartype defines 2020-09-22 23:24:22 +02:00
Martin Kroeker 988a6f429e
Add BUILD_vartype defines 2020-09-22 23:23:33 +02:00
Martin Kroeker e5e2fbd593
Support building only selected types 2020-09-22 23:21:30 +02:00
Martin Kroeker 3287848c8f
Support building only selected types 2020-09-22 23:20:51 +02:00
y00512012 06cf73a239 fix a bug of trmm 2020-09-22 16:47:10 +08:00
Martin Kroeker ddec244a5a
Merge pull request #2838 from austinpagan/gordon_trmm
Adding performance patch for trmm, just like trsm (#2836)
2020-09-15 21:17:48 +02:00
fossum dfeca46098 Adding performance patch for trmm, just like #2836 2020-09-15 08:59:50 -05:00
fossum 274d6e015b Fixing a performance bug in trsm_[LR].c. 2020-09-14 13:10:48 -05:00
Martin Kroeker 91c84e1c01
Merge pull request #2796 from Guobing-Chen/BF16_dot_coversion_apis
Add bfloat16 based dot and conversion with single/double
2020-09-14 15:00:19 +02:00
Marius Hillenbrand a55fe06f25 s390x/DYNAMIC_ARCH: define a HW_CAP flag to support slightly older glibc versions
Enable building DYNAMIC_ARCH support with older versions of glibc that
do not know about the hwcap flag HWCAP_S390_VXE yet.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-08 19:34:18 +02:00
Marius Hillenbrand 4f34bcfb5e s390x/DYNAMIC_ARCH: pass supported arch levels from Makefile to run-time code
... instead of duplicating the (old) mechanism from the Makefile that
aimed to derive supported architecture generations from the gcc
version.

To enable builds with DYNAMIC_ARCH with older compiler releases, the
Makefile and drivers/other/dynamic_arch.c need a common view of the
architecture support built into the library.

We follow the notation from x86 when used with DYNAMIC_LIST, where
defines DYN_<ARCH NAME> denote support for a given generation to be
built in. Since there are far fewer architecture generations in OpenBLAS
for s390x, that does not bloat command lines too much.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-08 19:34:18 +02:00
Martin Kroeker 330044d821
Fix potential domain error in sqrt 2020-09-05 09:44:33 +02:00
Chen, Guobing deaeb6c5b8 Add bfloat16 based dot and conversion with single/double
1. Added bfloat16 based dot as new API: shdot
2. Implemented generic kernel and cooperlake-specific (AVX512-BF16) kernel for shdot
3. Added 4 conversion APIs for bfloat16 data type <=> single/double: shstobf16 shdtobf16 sbf16tos dbf16tod
     shstobf16 -- convert single float array to bfloat16 array
     shdtobf16 -- convert double float array to bfloat16 array
     sbf16tos  -- convert bfloat16 array to single float array
     dbf16tod  -- convert bfloat16 array to double float array
4. Implemented generic kernels for all 4 conversion APIs, and cooperlake-specific kernel for shstobf16 and shdtobf16
5. Update level1 thread facilitate functions and macros to support multi-threading for these new APIs
6. Fix Cooperlake platform detection/specify issue when under dynamic-arch building
7. Change the typedef of bfloat16 from unsigned short to more strict uint16_t

Signed-off-by: Chen, Guobing <guobing.chen@intel.com>
2020-09-04 02:31:25 +08:00
Chen, Guobing 0c1c903f1e Fix OMP num specify issue
In current code, no matter what number of threads specified, all
available CPU count is used when invoking OMP, which leads to very bad
performance if the workload is small while the number of available CPUs is large.
Lots of time is wasted on inter-thread sync. Fix this issue by really
using the number specified by the variable 'num' from calling API.

Signed-off-by: Chen, Guobing <guobing.chen@intel.com>
2020-08-24 02:45:54 +08:00
Chen, Guobing e740c4873d Enable COOPERLAKE build target
Enable new build target platform -- COOPERLAKE. This target platform
supports all the SKYLAKEX supported ISAs + avx512bf16. So all the
SKYLAKEX specific kernels/drivers and related code are now extended
to be also active on COOPERLAKE. Besides, new BF16 related kernels
are active under this target.
2020-08-13 06:18:00 +08:00
Martin Kroeker 60cd5e55fc
Protect against inadvertent activation of USE_CUDA 2020-08-01 12:31:39 +02:00
Martin Kroeker 7c02f4b1f7
Merge pull request #2744 from martin-frbg/issue2738
Add AMD Renoir/Matisse cpu autodetection and preliminary support for Zen3
2020-07-28 19:32:04 +02:00
Martin Kroeker 12918358aa
Add AMD Renoir/Matisse and preliminary support for Zen3 as Zen2
also support AMD family 22 Jaguar/Puma as Bobcat
2020-07-28 13:53:17 +00:00
Ashwin Sekhar T K 4e1be0e481 ARM64: Add THUNDERX3T110 Target 2020-07-26 23:32:24 -07:00