Commit Graph

489 Commits

Author SHA1 Message Date
Wangyang Guo 8356a604f0 sbgemm: cooperlake: tuning for block params 2021-09-07 21:30:46 +08:00
Martin Kroeker cd10d1c03b
Fix typo 2021-08-30 14:38:28 +02:00
Martin Kroeker 2db1a99aca
Clean up debug messages 2021-08-30 14:21:25 +02:00
Martin Kroeker 89fc5b8f4f
Fix unmap logic 2021-08-29 19:50:24 +02:00
Martin Kroeker 7fd12a5e69
Add likely() hints for gcc 2021-08-29 13:54:51 +02:00
Martin Kroeker 2ba9a567aa
Fix typo 2021-08-28 17:14:59 +02:00
Martin Kroeker b4b952eece
Add auxiliary tracking space for thread buffer frees too 2021-08-28 17:03:53 +02:00
Martin Kroeker 7d1becc575
Allocate an auxiliary struct when running out of preconfigured threads 2021-08-28 14:18:36 +02:00
Martin Kroeker 898212efcd
Actually add the message to the TLS section 2021-08-02 14:50:14 +02:00
Martin Kroeker 210a1584c5
Rebase source and edit TLS version of the message as well 2021-08-02 14:19:16 +02:00
Martin Kroeker f2a7a67f5a
Improve the "tried to allocate too many buffers" error message 2021-07-31 17:23:40 +02:00
Craig Watson 4d7dfe4845 Include Haiku in processor count checks 2021-07-27 09:00:30 +00:00
JonasZhou 0fca36c8c3 Add cpu detection support for Zhaoxin processors
Signed-off-by: JonasZhou <JonasZhou@zhaoxin.com>
2021-07-12 13:43:45 +08:00
River Dillon 2f6326a630 Remove <linux/unistd.h> 2021-07-10 00:36:07 -07:00
Martin Kroeker 8f22ac552b
Add vendor string Shanghai as successor to Centaur 2021-07-08 18:28:49 +02:00
Martin Kroeker eb2fdd3af0
Recognize newer Zhaoxin/Centaur processors as Nehalem 2021-07-08 12:23:15 +02:00
User User-User 750719528a bugz 2021-06-20 16:40:43 +02:00
User User-User 6423b282a1 dynamic_arch 2021-06-20 14:19:41 +02:00
Martin Kroeker cbfd3c87e1
Recognize Intel Ice Lake SP as Cooper Lake 2021-05-14 20:44:06 +02:00
Martin Kroeker 623d580b4c
Restore __volatile__ keyword 2021-04-16 10:27:32 +02:00
Martin Kroeker 186368ddc3
Fix compilation with CLANG 2021-03-16 16:52:57 +01:00
Martin Kroeker 1a3ad4b670
Fix signatures of the TLS-mode dll_callback and p_process_term functions for Win64 2021-02-22 19:40:36 +01:00
Peter Hawkins dbbf92c1d1 Fix race in blas_thread_shutdown.
blas_server_avail was read without holding server_lock. If multiple threads call blas_thread_shutdown simultaneously, for example, by calling fork(), then they can attempt to shut down multiple times. This can lead to a segmentation fault.
2021-02-18 13:46:50 -05:00
Martin Kroeker cb429d6b12
Merge pull request #3110 from martin-frbg/issue3108
Fix get_num_procs()  in the USE_TLS branch for non-glibc systems
2021-02-18 15:45:25 +01:00
Martin Kroeker b0bded3f2f
Fix get_num_procs() in the USE_TLS branch for non-glibc systems 2021-02-18 11:14:05 +01:00
Martin Kroeker e4e5042e38
Recognize Intel Tiger Lake as SkylakeX 2021-02-11 20:17:11 +01:00
Martin Kroeker 0cc36770f1
Merge pull request #3073 from xoviat/embedded
add embedded option
2021-01-31 18:02:41 +01:00
Martin Kroeker eea0c0f2ed
Merge pull request #3085 from alexhenrie/memory_alloc
Fix null pointer check in blas_memory_alloc
2021-01-26 20:11:42 +01:00
Martin Kroeker 0cb9e9fc8d
Remove the VORTEX support bits again for now 2021-01-25 19:02:21 +01:00
Alex Henrie 113840da12 Fix null pointer check in blas_memory_alloc 2021-01-24 22:20:44 -07:00
Martin Kroeker deb2e66bcc
Add DYNAMIC_LIST support for ARM64 2021-01-24 23:18:52 +01:00
xoviat 2e8d6e8690 add functions for embedded 2021-01-23 22:12:17 -06:00
Martin Kroeker b94dab5250
patch to support power10 in builtin_cpu_is was backported to gcc 10.2, so allow that as wel 2021-01-20 21:34:36 +01:00
Martin Kroeker 63fa3c3f8f
Require gcc 11 for builtin_cpu_is(power10)
fixes #3074
2021-01-20 15:41:04 +01:00
xoviat b60de4447a add cortex-m platform 2021-01-19 08:57:44 -06:00
Martin Kroeker 2c445be8ba
Merge pull request #3051 from martin-frbg/rocketlake
Add CPUID information for Intel Rocket Lake
2021-01-14 15:56:25 +01:00
Martin Kroeker 6fe0f1fab9
Label get_cpu_ftr as volatile to keep gcc from rearranging the code 2021-01-11 19:05:29 +01:00
Martin Kroeker 17c16f2a71
Implement builtin_cpu_is and limit cpu choices to P8 and P9 for NVIDIA compilers 2020-12-19 23:21:22 +01:00
Martin Kroeker 865676682d
Add Intel Rocket Lake 2020-12-14 22:40:23 +01:00
Martin Kroeker 6232237dba
Make fallback from P10 to P9 conditional on suitable compiler 2020-12-11 23:41:17 +01:00
Martin Kroeker 18d8a67485
Merge pull request #2994 from antonblanchard/power10-fixes
Power10 fixes
2020-12-11 23:37:30 +01:00
gxw 4b548857d6 Add msa support for loongson
1. Using core loongson3r3 and loongson3r4 for loongson
2. Add DYNAMIC_ARCH for loongson

Change-Id: I1c6b54dbeca3a0cc31d1222af36a7e9bd6ab54c1
2020-12-09 10:28:46 +08:00
Martin Kroeker bc5b1ddf0d
Merge pull request #3004 from martin-frbg/bsd_getauxval
ARM64 DYNAMIC_ARCH build fix for BSD/OSX
2020-11-23 08:35:12 +01:00
Martin Kroeker e7bf8ced6c
Build fix for systems that do not support getauxval 2020-11-22 20:20:28 +01:00
Martin Kroeker 5fa305172a
Use ifeq instead of ifdef for user-definable options 2020-11-22 16:29:56 +01:00
Alexander Grund 60005eb47b
Don't overwrite blas_thread_buffer if already set
After a fork it is possible that blas_thread_buffer has already
allocated memory buffers: goto_set_num_threads does allocate those
already and it may be called by num_cpu_avail in case the OpenBLAS
NUM_THREADS differ from the OMP num threads.
This leads to a memory leak which can cause subsequent execution of BLAS
kernels to fail.

Fixes #2993
2020-11-19 14:51:51 +01:00
Anton Blanchard 043f3d6faa POWER10: Use POWER9 as a fallback
If the toolchain is too old, or the mma features isn't set on a POWER10
fall back to the POWER9 loops.
2020-11-19 21:04:10 +11:00
Martin Kroeker ff16329cb7
Merge pull request #2972 from xiegengxin/rot-intrinsic
Improve the performance of rot by using AVX512 and AVX2 intrinsic
2020-11-08 22:43:00 +01:00
Gengxin Xie d9ba49165a Improve the performance of rot by using AVX512 and AVX2 intrinsic 2020-11-05 15:12:36 +08:00
Martin Kroeker aa21cb5217
Merge pull request #2960 from thrasibule/avx2_detection
fix avx2 detection
2020-10-31 20:24:21 +01:00
Guillaume Horel 1f564d729b fix avx2 detection
reword commits to make it clearer
2020-10-31 10:00:48 -04:00
Chen, Guobing a7b1f9b1bb Implementation of BF16 based gemv
1. Add a new API -- sbgemv to support bfloat16 based gemv
2. Implement a generic kernel for sbgemv
3. Implement an avx512-bf16 based kernel for sbgemv

Signed-off-by: Chen, Guobing <guobing.chen@intel.com>
2020-10-29 02:08:23 +08:00
Martin Kroeker 2207a16235
Merge pull request #2952 from martin-frbg/issue2931
Try to read cpu ID from /sys/devices/.../cpu0 if HWCAP_CPUID fails
2020-10-28 09:37:32 +01:00
Martin Kroeker b937d78a6d
Try to read cpu information from /sys/devices/system/cpu/cpu0 if HWCAP_CPUID fails 2020-10-27 17:51:32 +01:00
Martin Kroeker fd7da56965
Move definitions that are neither needed nor supported on SUNOS 2020-10-25 12:01:50 +01:00
Martin Kroeker ff65952e46
Move HAVE_P10_SUPPORT to the build system
to be able to include a binutils version check
2020-10-20 00:55:41 +02:00
Martin Kroeker 85154c2e18
Change "HALF" and "sh" to "BFLOAT16" and "sb" 2020-10-12 00:05:05 +02:00
Martin Kroeker ac653c94f3
Merge branch 'develop' into issue2588-cmake 2020-10-11 13:57:07 +02:00
Martin Kroeker f032d8966e
Merge pull request #2874 from Flamefire/memory_fixes
Avoid out of bounds access on invalid memory free
2020-10-04 15:16:51 +02:00
Martin Kroeker f6e4cf2f9d
Merge pull request #2876 from Flamefire/omp_fork_fix
Lazyly reinit threads after a fork in OMP mode
2020-10-03 22:52:17 +02:00
User User-User d2333e7842 aarch64 fix std=c18 compilation 2020-10-03 18:00:34 +03:00
Alexander Grund 3094fc6c83
Lazyly reinit threads after a fork in OMP mode
This initializes the per-thread memory buffers which get
cleared/released on a fork via pthread_at_fork. Not doing so leads to
each thread calling blas_memory_alloc on almost every execution which
slows down the code significantly as the threads race for the memory
allocation using locks to serialize that.
2020-10-01 15:41:42 +02:00
Alexander Grund 3c05f54df8
Avoid out of bounds access on invalid memory free 2020-10-01 10:48:45 +02:00
Alexander Grund dee7c49938
Fix TABs and trailing space 2020-10-01 10:43:16 +02:00
Martin Kroeker 896bbd55e1
Add support for building only selected variable types 2020-09-26 23:25:55 +02:00
Martin Kroeker 357bff06b5
Add BUILD_vartype defines 2020-09-22 23:24:22 +02:00
Martin Kroeker 91c84e1c01
Merge pull request #2796 from Guobing-Chen/BF16_dot_coversion_apis
Add bfloat16 based dot and conversion with single/double
2020-09-14 15:00:19 +02:00
Marius Hillenbrand a55fe06f25 s390x/DYNAMIC_ARCH: define a HW_CAP flag to support slightly older glibc versions
Enable building DYNAMIC_ARCH support with older versions of glibc that
do not know about the hwcap flag HWCAP_S390_VXE yet.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-08 19:34:18 +02:00
Marius Hillenbrand 4f34bcfb5e s390x/DYNAMIC_ARCH: pass supported arch levels from Makefile to run-time code
... instead of duplicating the (old) mechanism from the Makefile that
aimed to derive supported architecture generations from the gcc
version.

To enable builds with DYNAMIC_ARCH with older compiler releases, the
Makefile and drivers/other/dynamic_arch.c need a common view of the
architecture support built into the library.

We follow the notation from x86 when used with DYNAMIC_LIST, where
defines DYN_<ARCH NAME> denote support for a given generation to be
built in. Since there are far fewer architecture generations in OpenBLAS
for s390x, that does not bloat command lines too much.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-08 19:34:18 +02:00
Chen, Guobing deaeb6c5b8 Add bfloat16 based dot and conversion with single/double
1. Added bfloat16 based dot as new API: shdot
2. Implemented generic kernel and cooperlake-specific (AVX512-BF16) kernel for shdot
3. Added 4 conversion APIs for bfloat16 data type <=> single/double: shstobf16 shdtobf16 sbf16tos dbf16tod
     shstobf16 -- convert single float array to bfloat16 array
     shdtobf16 -- convert double float array to bfloat16 array
     sbf16tos  -- convert bfloat16 array to single float array
     dbf16tod  -- convert bfloat16 array to double float array
4. Implemented generic kernels for all 4 conversion APIs, and cooperlake-specific kernel for shstobf16 and shdtobf16
5. Update level1 thread facilitate functions and macros to support multi-threading for these new APIs
6. Fix Cooperlake platform detection/specify issue when under dynamic-arch building
7. Change the typedef of bfloat16 from unsigned short to more strict uint16_t

Signed-off-by: Chen, Guobing <guobing.chen@intel.com>
2020-09-04 02:31:25 +08:00
Chen, Guobing 0c1c903f1e Fix OMP num specify issue
In current code, no matter what number of threads specified, all
available CPU count is used when invoking OMP, which leads to very bad
performance if the workload is small while all available CPUs are big.
Lots of time are wasted on inter-thread sync. Fix this issue by really
using the number specified by the variable 'num' from calling API.

Signed-off-by: Chen, Guobing <guobing.chen@intel.com>
2020-08-24 02:45:54 +08:00
Chen, Guobing e740c4873d Enable COOPERLAKE build target
Enable new build target platform -- COOPERLAKE. This target platform
supports all the SKYLAKEX supported ISAs + avx512bf16. So all the
SKYLAKEX specific kernels/drivers and related code are now extended
to be also active on COOPERLAKE. Besides, new BF16 related kernels
are active under this target.
2020-08-13 06:18:00 +08:00
Martin Kroeker 60cd5e55fc
Protect against inadvertent activation of USE_CUDA 2020-08-01 12:31:39 +02:00
Martin Kroeker 7c02f4b1f7
Merge pull request #2744 from martin-frbg/issue2738
Add AMD Renoir/Matisse cpu autodetection and preliminary support for Zen3
2020-07-28 19:32:04 +02:00
Martin Kroeker 12918358aa
Add AMD Renoir/Matisse and preliminary support for Zen3 as Zen2
also support AMD family 22 Jaguar/Puma as Bobcat
2020-07-28 13:53:17 +00:00
Ashwin Sekhar T K 4e1be0e481 ARM64: Add THUNDERX3T110 Target 2020-07-26 23:32:24 -07:00
Martin Kroeker 09eb9d2584
Update conditional for atomics to HAVE_C11 2020-07-18 17:07:38 +00:00
Martin Kroeker 791e046744
Update conditional for atomics to use HAVE_C11 2020-07-18 17:05:59 +00:00
Martin Kroeker 94bab9d1f9
Update conditional for atomics to use HAVE_C11 2020-07-18 17:03:31 +00:00
Rajalakshmi Srinivasaraghavan af1e140e35 Change minimum gcc version for POWER10
As the MMA patches for POWER10 are backported to gcc10.2, changing
the minimum gcc version needed to build OpenBLAS for POWER10.
2020-07-09 21:46:06 -05:00
Rajalakshmi Srinivasaraghavan 45d819ca82 Changing mcpu option as power10
As compiler enabled mcpu option as power10, changing it from future.
2020-07-07 11:25:20 -05:00
Martin Kroeker 584ef8d4ae
Add support for Comet Lake H & S 2020-06-27 14:36:37 +02:00
Matthew Treinish f37e941d52
Add support to driver/others/dynamic.c too 2020-06-25 11:56:49 -04:00
User User-User e6b9275034 address vs2019 C4293 2020-06-24 09:12:23 +03:00
Martin Kroeker 6eaeb01263
Merge pull request #2658 from RajalakshmiSR/p10
powerpc: Add support for future processor
2020-06-23 00:02:37 +02:00
Martin Kroeker 007d9f97d7
Make gotoblas_corename report the name of the selected TARGET rather than its aliases 2020-06-13 19:25:28 +02:00
Rajalakshmi Srinivasaraghavan 9fe930f205 powerpc: Add support for future processor
This is the initial patch to support build infrastructure
for POWER10 architecture.
2020-06-11 15:47:20 -05:00
Marius Hillenbrand 0dbe61a612 s390x: choose SIMD kernels at run-time based on OS and compiler support
Extend and simplify the run-time detection for dynamic architecture support for z
to check HW_CAP and only use SIMD features if advertised by the OS.
While at it, also honor the env variable LD_HWCAP_MASK and do not use
the CPU features masked there.

Note that we can only use the SIMD features on z13 or newer (i.e.,
Vector Facility or Vector-Enhancements Facilities) when the operating
system supports properly context-switching the vector registers. The OS
advertises that support as a bit in the HW_CAP value in the auxiliary
vector. While all recent Linux kernels have that support, we should
maintain compatibility with older versions that may still be in use.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-12 11:01:16 +02:00
Marius Hillenbrand 8c338616f9 s390x: gate dynamic arch detection on gcc version and add generic
When building OpenBLAS with DYNAMIC_ARCH=1 on s390x (aka zarch), make
sure to include support for systems without the facilities introduced
with z13 (i.e., zarch_generic). Adjust runtime detection to fallback to
that generic code when running on a unknown platform other than Z13
through Z15.

When detecting a Z13 or newer system, add a check for gcc support for
the architecture-specific features before selecting the respective
kernel. Fallback to Z13 or generic code, in case.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-05-12 11:01:16 +02:00
Martin Kroeker f4248af26e
Fix compiler warnings 2020-04-28 10:43:12 +02:00
Rajalakshmi Srinivasaraghavan 7eb55504b1 RFC : Add half precision gemm for bfloat16 in OpenBLAS
This patch adds support for bfloat16 data type matrix multiplication kernel.
For architectures that don't support bfloat16, it is defined as unsigned short
(2 bytes).  Default unroll sizes can be changed as per architecture as done for
SGEMM and for now 8 and 4 are used for M and N.  Size of ncopy/tcopy can be
changed as per architecture requirement and for now, size 2 is used.

Added shgemm in kernel/power/KERNEL.POWER9 and tested in powerpc64le and
powerpc64.  For reference, added a small test compare_sgemm_shgemm.c to compare
sgemm and shgemm output.

This patch does not cover OpenBLAS test, benchmark and lapack tests for shgemm.
Complex type implementation can be discussed and added once this is approved.
2020-04-14 14:55:08 -05:00
Martin Kroeker f41600e66f
Add a read barrier in the traversing of the buffer list
Needed on systems with weak memory ordering - the inferior, partially working fix from #2544 was already removed in #2551
2020-04-13 12:34:02 +02:00
Martin Kroeker 2a28448a96
Add safeguards for sufficient BUFFER_SIZE 2020-04-12 19:45:36 +02:00
Sharvil Nanavati 7b4773b24d Add API to set thread affinity on Linux.
Issue: #2545
2020-04-08 12:49:35 -07:00
Martin Kroeker 69f277f8ee
Add another memory barrier for ARM and a multicore test run on ThunderX to help detect such issues (#2544)
* Add another memory barrier in memory.c to prevent races in memory slot allocation

* Add an all-core test on Drone.io's ThunderX platform and modify dgemm_tester to use all 96 cores
2020-04-08 11:04:51 +02:00
Martin Kroeker ad9e53154d
Merge pull request #2484 from RajalakshmiSR/power-dynamic
Fix DYNAMIC_ARCH build for POWER9
2020-03-04 08:06:06 +01:00
Martin Kroeker e6edb7431f
Merge pull request #2466 from AGSaidi/acq-rel-1
Switch blas_server to use acq/rel semantics
2020-03-04 07:59:31 +01:00
Martin Kroeker d68e4ba59b
Fix cut/paste glitch 2020-03-03 21:37:48 +01:00
Martin Kroeker 635c9e4e09
Restore initializers for mutex and conditional 2020-03-03 21:04:12 +01:00
Rajalakshmi Srinivasaraghavan 2afc074803 Fix DYNAMIC_ARCH build for POWER9
Setting DYNAMIC_ARCH=1 on POWER9 does not build POWER9 files due to some
compiler version checks.  This patch fixes some of the macros that are used
to check compiler version.  On fixing those checks, there are some new make
failures related to icamin, icamax, isamin, isamax and caxpy files on POWER9.
This patch fixes those failures as well.
2020-03-03 12:35:10 -06:00