Commit Graph

416 Commits

Author SHA1 Message Date
Martin Kroeker cb429d6b12
Merge pull request #3110 from martin-frbg/issue3108
Fix get_num_procs() in the USE_TLS branch for non-glibc systems
2021-02-18 15:45:25 +01:00
Martin Kroeker b0bded3f2f
Fix get_num_procs() in the USE_TLS branch for non-glibc systems 2021-02-18 11:14:05 +01:00
Martin Kroeker e4e5042e38
Recognize Intel Tiger Lake as SkylakeX 2021-02-11 20:17:11 +01:00
Martin Kroeker 0cc36770f1
Merge pull request #3073 from xoviat/embedded
add embedded option
2021-01-31 18:02:41 +01:00
Martin Kroeker eea0c0f2ed
Merge pull request #3085 from alexhenrie/memory_alloc
Fix null pointer check in blas_memory_alloc
2021-01-26 20:11:42 +01:00
Martin Kroeker 0cb9e9fc8d
Remove the VORTEX support bits again for now 2021-01-25 19:02:21 +01:00
Alex Henrie 113840da12 Fix null pointer check in blas_memory_alloc 2021-01-24 22:20:44 -07:00
Martin Kroeker deb2e66bcc
Add DYNAMIC_LIST support for ARM64 2021-01-24 23:18:52 +01:00
xoviat 2e8d6e8690 add functions for embedded 2021-01-23 22:12:17 -06:00
Martin Kroeker b94dab5250
patch to support power10 in builtin_cpu_is was backported to gcc 10.2, so allow that as well 2021-01-20 21:34:36 +01:00
Martin Kroeker 63fa3c3f8f
Require gcc 11 for builtin_cpu_is(power10)
fixes #3074
2021-01-20 15:41:04 +01:00
xoviat b60de4447a add cortex-m platform 2021-01-19 08:57:44 -06:00
Martin Kroeker 2c445be8ba
Merge pull request #3051 from martin-frbg/rocketlake
Add CPUID information for Intel Rocket Lake
2021-01-14 15:56:25 +01:00
Martin Kroeker 6fe0f1fab9
Label get_cpu_ftr as volatile to keep gcc from rearranging the code 2021-01-11 19:05:29 +01:00
Martin Kroeker 17c16f2a71
Implement builtin_cpu_is and limit cpu choices to P8 and P9 for NVIDIA compilers 2020-12-19 23:21:22 +01:00
Martin Kroeker 865676682d
Add Intel Rocket Lake 2020-12-14 22:40:23 +01:00
Martin Kroeker 6232237dba
Make fallback from P10 to P9 conditional on suitable compiler 2020-12-11 23:41:17 +01:00
Martin Kroeker 18d8a67485
Merge pull request #2994 from antonblanchard/power10-fixes
Power10 fixes
2020-12-11 23:37:30 +01:00
gxw 4b548857d6 Add msa support for loongson
1. Using core loongson3r3 and loongson3r4 for loongson
2. Add DYNAMIC_ARCH for loongson

Change-Id: I1c6b54dbeca3a0cc31d1222af36a7e9bd6ab54c1
2020-12-09 10:28:46 +08:00
Martin Kroeker bc5b1ddf0d
Merge pull request #3004 from martin-frbg/bsd_getauxval
ARM64 DYNAMIC_ARCH build fix for BSD/OSX
2020-11-23 08:35:12 +01:00
Martin Kroeker e7bf8ced6c
Build fix for systems that do not support getauxval 2020-11-22 20:20:28 +01:00
Martin Kroeker 5fa305172a
Use ifeq instead of ifdef for user-definable options 2020-11-22 16:29:56 +01:00
Alexander Grund 60005eb47b
Don't overwrite blas_thread_buffer if already set
After a fork it is possible that blas_thread_buffer has already
allocated memory buffers: goto_set_num_threads does allocate those
already and it may be called by num_cpu_avail in case the OpenBLAS
NUM_THREADS differs from the OMP num threads.
This leads to a memory leak which can cause subsequent execution of BLAS
kernels to fail.

Fixes #2993
2020-11-19 14:51:51 +01:00
Anton Blanchard 043f3d6faa POWER10: Use POWER9 as a fallback
If the toolchain is too old, or the mma feature isn't set on a POWER10,
fall back to the POWER9 loops.
2020-11-19 21:04:10 +11:00
Martin Kroeker ff16329cb7
Merge pull request #2972 from xiegengxin/rot-intrinsic
Improve the performance of rot by using AVX512 and AVX2 intrinsic
2020-11-08 22:43:00 +01:00
Gengxin Xie d9ba49165a Improve the performance of rot by using AVX512 and AVX2 intrinsic 2020-11-05 15:12:36 +08:00
Martin Kroeker aa21cb5217
Merge pull request #2960 from thrasibule/avx2_detection
fix avx2 detection
2020-10-31 20:24:21 +01:00
Guillaume Horel 1f564d729b fix avx2 detection
reword commits to make it clearer
2020-10-31 10:00:48 -04:00
Chen, Guobing a7b1f9b1bb Implementation of BF16 based gemv
1. Add a new API -- sbgemv to support bfloat16 based gemv
2. Implement a generic kernel for sbgemv
3. Implement an avx512-bf16 based kernel for sbgemv

Signed-off-by: Chen, Guobing <guobing.chen@intel.com>
2020-10-29 02:08:23 +08:00
Martin Kroeker 2207a16235
Merge pull request #2952 from martin-frbg/issue2931
Try to read cpu ID from /sys/devices/.../cpu0 if HWCAP_CPUID fails
2020-10-28 09:37:32 +01:00
Martin Kroeker b937d78a6d
Try to read cpu information from /sys/devices/system/cpu/cpu0 if HWCAP_CPUID fails 2020-10-27 17:51:32 +01:00
Martin Kroeker fd7da56965
Move definitions that are neither needed nor supported on SUNOS 2020-10-25 12:01:50 +01:00
Martin Kroeker ff65952e46
Move HAVE_P10_SUPPORT to the build system
to be able to include a binutils version check
2020-10-20 00:55:41 +02:00
Martin Kroeker 85154c2e18
Change "HALF" and "sh" to "BFLOAT16" and "sb" 2020-10-12 00:05:05 +02:00
Martin Kroeker ac653c94f3
Merge branch 'develop' into issue2588-cmake 2020-10-11 13:57:07 +02:00
Martin Kroeker f032d8966e
Merge pull request #2874 from Flamefire/memory_fixes
Avoid out of bounds access on invalid memory free
2020-10-04 15:16:51 +02:00
Martin Kroeker f6e4cf2f9d
Merge pull request #2876 from Flamefire/omp_fork_fix
Lazily reinit threads after a fork in OMP mode
2020-10-03 22:52:17 +02:00
User User-User d2333e7842 aarch64 fix std=c18 compilation 2020-10-03 18:00:34 +03:00
Alexander Grund 3094fc6c83
Lazily reinit threads after a fork in OMP mode
This initializes the per-thread memory buffers which get
cleared/released on a fork via pthread_at_fork. Not doing so leads to
each thread calling blas_memory_alloc on almost every execution which
slows down the code significantly as the threads race for the memory
allocation using locks to serialize that.
2020-10-01 15:41:42 +02:00
Alexander Grund 3c05f54df8
Avoid out of bounds access on invalid memory free 2020-10-01 10:48:45 +02:00
Alexander Grund dee7c49938
Fix TABs and trailing space 2020-10-01 10:43:16 +02:00
Martin Kroeker 896bbd55e1
Add support for building only selected variable types 2020-09-26 23:25:55 +02:00
Martin Kroeker 357bff06b5
Add BUILD_vartype defines 2020-09-22 23:24:22 +02:00
Martin Kroeker 91c84e1c01
Merge pull request #2796 from Guobing-Chen/BF16_dot_coversion_apis
Add bfloat16 based dot and conversion with single/double
2020-09-14 15:00:19 +02:00
Marius Hillenbrand a55fe06f25 s390x/DYNAMIC_ARCH: define a HW_CAP flag to support slightly older glibc versions
Enable building DYNAMIC_ARCH support with older versions of glibc that
do not know about the hwcap flag HWCAP_S390_VXE yet.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-08 19:34:18 +02:00
Marius Hillenbrand 4f34bcfb5e s390x/DYNAMIC_ARCH: pass supported arch levels from Makefile to run-time code
... instead of duplicating the (old) mechanism from the Makefile that
aimed to derive supported architecture generations from the gcc
version.

To enable builds with DYNAMIC_ARCH with older compiler releases, the
Makefile and drivers/other/dynamic_arch.c need a common view of the
architecture support built into the library.

We follow the notation from x86 when used with DYNAMIC_LIST, where
defines DYN_<ARCH NAME> denote support for a given generation to be
built in. Since there are far fewer architecture generations in OpenBLAS
for s390x, that does not bloat command lines too much.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-08 19:34:18 +02:00
Chen, Guobing deaeb6c5b8 Add bfloat16 based dot and conversion with single/double
1. Added bfloat16 based dot as new API: shdot
2. Implemented generic kernel and cooperlake-specific (AVX512-BF16) kernel for shdot
3. Added 4 conversion APIs for bfloat16 data type <=> single/double: shstobf16 shdtobf16 sbf16tos dbf16tod
     shstobf16 -- convert single float array to bfloat16 array
     shdtobf16 -- convert double float array to bfloat16 array
     sbf16tos  -- convert bfloat16 array to single float array
     dbf16tod  -- convert bfloat16 array to double float array
4. Implemented generic kernels for all 4 conversion APIs, and cooperlake-specific kernel for shstobf16 and shdtobf16
5. Update level1 thread facilitate functions and macros to support multi-threading for these new APIs
6. Fix Cooperlake platform detection/specify issue when under dynamic-arch building
7. Change the typedef of bfloat16 from unsigned short to more strict uint16_t

Signed-off-by: Chen, Guobing <guobing.chen@intel.com>
2020-09-04 02:31:25 +08:00
Chen, Guobing 0c1c903f1e Fix OMP num specify issue
In the current code, no matter what number of threads is specified, all
available CPUs are used when invoking OMP, which leads to very bad
performance if the workload is small while the available CPU count is large.
A lot of time is wasted on inter-thread sync. Fix this issue by actually
using the number specified by the variable 'num' from the calling API.

Signed-off-by: Chen, Guobing <guobing.chen@intel.com>
2020-08-24 02:45:54 +08:00
Chen, Guobing e740c4873d Enable COOPERLAKE build target
Enable new build target platform -- COOPERLAKE. This target platform
supports all the SKYLAKEX supported ISAs + avx512bf16. So all the
SKYLAKEX specific kernels/drivers and related code are now extended
to be also active on COOPERLAKE. Besides, new BF16 related kernels
are active under this target.
2020-08-13 06:18:00 +08:00
Martin Kroeker 60cd5e55fc
Protect against inadvertent activation of USE_CUDA 2020-08-01 12:31:39 +02:00