Commit Graph

544 Commits

Author SHA1 Message Date
Martin Kroeker 17c16f2a71
Implement builtin_cpu_is and limit cpu choices to P8 and P9 for NVIDIA compilers 2020-12-19 23:21:22 +01:00
Martin Kroeker 6232237dba
Make fallback from P10 to P9 conditional on suitable compiler 2020-12-11 23:41:17 +01:00
Martin Kroeker 18d8a67485
Merge pull request #2994 from antonblanchard/power10-fixes
Power10 fixes
2020-12-11 23:37:30 +01:00
Martin Kroeker 83de62c20d
Merge pull request #3026 from martin-frbg/revert747
Revert PR747 - SYRK parameter changes for Haswell and related targets
2020-12-10 16:29:41 +01:00
gxw 4b548857d6 Add msa support for loongson
1. Using core loongson3r3 and loongson3r4 for loongson
2. Add DYNAMIC_ARCH for loongson

Change-Id: I1c6b54dbeca3a0cc31d1222af36a7e9bd6ab54c1
2020-12-09 10:28:46 +08:00
Martin Kroeker a554712439
remove extra/intermediate size step for min_jj introduced in PR747 2020-12-08 21:01:36 +01:00
Martin Kroeker 5d26223f4a
remove extra/intermediate size step of min_jj from PR747 2020-12-08 20:59:56 +01:00
Martin Kroeker bc5b1ddf0d
Merge pull request #3004 from martin-frbg/bsd_getauxval
ARM64 DYNAMIC_ARCH build fix for BSD/OSX
2020-11-23 08:35:12 +01:00
Martin Kroeker e7bf8ced6c
Build fix for systems that do not support getauxval 2020-11-22 20:20:28 +01:00
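A minimal sketch of the portability pattern this fix is about. It assumes that a Linux-style <sys/auxv.h> with getauxval() is available only on Linux. On other systems (BSD, macOS) the helper returns 0 so that the caller can fall back to another detection path. The helper name query_hwcap is hypothetical and does not come from OpenBLAS.

```c
#if defined(__linux__)
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP (glibc, musl) */
#endif

/* Hypothetical helper: return the AT_HWCAP bitmask where getauxval()
 * exists, or 0 so the caller can use a different detection method
 * (or a conservative generic core) on BSD/macOS. */
static unsigned long query_hwcap(void)
{
#if defined(__linux__) && defined(AT_HWCAP)
    return getauxval(AT_HWCAP);
#else
    return 0UL;
#endif
}
```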
Martin Kroeker 5fa305172a
Use ifeq instead of ifdef for user-definable options 2020-11-22 16:29:56 +01:00
Martin Kroeker d3ff1f889f
Convert ifndefs to ifneq 2020-11-22 16:27:17 +01:00
Alexander Grund 60005eb47b
Don't overwrite blas_thread_buffer if already set
After a fork, blas_thread_buffer may already hold allocated memory
buffers: goto_set_num_threads allocates them, and it may be called by
num_cpu_avail when the OpenBLAS NUM_THREADS differs from the OMP
number of threads.
Overwriting them leaks memory, which can cause subsequent execution of
BLAS kernels to fail.

Fixes #2993
2020-11-19 14:51:51 +01:00
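The fix amounts to "allocate only if not already allocated". Here is a minimal sketch of that guard. The table thread_buffer, MAX_THREADS and BUFFER_SIZE are hypothetical stand-ins for blas_thread_buffer and its real sizing, not OpenBLAS code.

```c
#include <stdlib.h>

#define MAX_THREADS 8          /* hypothetical limit for the sketch */
#define BUFFER_SIZE (1 << 16)  /* hypothetical per-thread buffer size */

/* Hypothetical stand-in for blas_thread_buffer; NULL = not allocated. */
static void *thread_buffer[MAX_THREADS];

/* Allocate buffers for threads [0, nthreads), keeping any buffer that an
 * earlier call already set up. Overwriting a non-NULL pointer would leak
 * the block it pointed to. */
static int init_thread_buffers(int nthreads)
{
    if (nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;
    for (int i = 0; i < nthreads; i++) {
        if (thread_buffer[i] == NULL) {
            thread_buffer[i] = malloc(BUFFER_SIZE);
            if (thread_buffer[i] == NULL)
                return -1;     /* out of memory */
        }
    }
    return 0;
}
```

Calling init_thread_buffers twice with the same count leaves the existing pointers untouched.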
Anton Blanchard 043f3d6faa POWER10: Use POWER9 as a fallback
If the toolchain is too old, or the MMA feature isn't set on a POWER10,
fall back to the POWER9 loops.
2020-11-19 21:04:10 +11:00
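The fallback rule can be sketched as a small selection function. It assumes that the HAVE_P10_SUPPORT macro named in this log is defined only when the compiler and binutils can build the POWER10 kernels. The core_t enum and the function name are hypothetical.

```c
/* Hypothetical core identifiers for the sketch. */
typedef enum { CORE_POWER9, CORE_POWER10 } core_t;

/* On a POWER9-or-newer machine, use the POWER10 (MMA) loops only if they
 * were compiled in AND the CPU reports MMA; otherwise use the POWER9 loops. */
static core_t select_power_core(int cpu_is_power10, int cpu_has_mma)
{
#ifdef HAVE_P10_SUPPORT
    if (cpu_is_power10 && cpu_has_mma)
        return CORE_POWER10;
#else
    (void)cpu_is_power10;   /* toolchain too old: POWER10 kernels absent */
    (void)cpu_has_mma;
#endif
    return CORE_POWER9;
}
```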
Martin Kroeker ff16329cb7
Merge pull request #2972 from xiegengxin/rot-intrinsic
Improve the performance of rot by using AVX512 and AVX2 intrinsics
2020-11-08 22:43:00 +01:00
Gengxin Xie d9ba49165a Improve the performance of rot by using AVX512 and AVX2 intrinsics 2020-11-05 15:12:36 +08:00
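For reference, here is the operation that the vectorized kernels compute, written as a plain scalar loop with unit strides. This is a sketch of BLAS drot semantics, not the AVX2/AVX512 code itself. An AVX2 kernel processes 4 doubles per 256-bit register and an AVX512 kernel processes 8, plus a scalar tail for the remainder.

```c
#include <stddef.h>

/* Scalar reference for BLAS rot (unit strides): apply the plane rotation
 * [ c  s ; -s  c ] to each pair (x[i], y[i]). */
static void drot_ref(size_t n, double *x, double *y, double c, double s)
{
    for (size_t i = 0; i < n; i++) {
        double xi = x[i], yi = y[i];
        x[i] = c * xi + s * yi;
        y[i] = c * yi - s * xi;
    }
}
```

With c = 0 and s = 1 the rotation swaps the vectors and negates the new y, which gives a convenient sanity check.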
Martin Kroeker aa21cb5217
Merge pull request #2960 from thrasibule/avx2_detection
fix avx2 detection
2020-10-31 20:24:21 +01:00
Guillaume Horel 1f564d729b fix avx2 detection
reword commits to make it clearer
2020-10-31 10:00:48 -04:00
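The log does not say exactly what was wrong with the old detection. A typical complete AVX2 check has two parts: the CPU must advertise AVX2 (CPUID leaf 7, EBX bit 5), and the OS must save the YMM register state (OSXSAVE set and XCR0 bits 1 and 2). The following sketch assumes GCC or Clang with <cpuid.h> and returns 0 on non-x86 targets.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang: __get_cpuid, __get_cpuid_count */
#endif

/* Returns 1 only if AVX2 is usable: CPU support AND OS-enabled YMM state. */
static int avx2_usable(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    if (!(ecx & (1u << 27)) || !(ecx & (1u << 28)))   /* OSXSAVE, AVX */
        return 0;

    /* XGETBV with ECX=0 reads XCR0; bits 1 (XMM) and 2 (YMM) must be set. */
    uint32_t xcr0_lo, xcr0_hi;
    __asm__ volatile("xgetbv" : "=a"(xcr0_lo), "=d"(xcr0_hi) : "c"(0));
    (void)xcr0_hi;
    if ((xcr0_lo & 0x6u) != 0x6u)
        return 0;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ebx & (1u << 5)) != 0;                    /* AVX2 */
#else
    return 0;
#endif
}
```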
Chen, Guobing a7b1f9b1bb Implementation of BF16 based gemv
1. Add a new API -- sbgemv to support bfloat16 based gemv
2. Implement a generic kernel for sbgemv
3. Implement an avx512-bf16 based kernel for sbgemv

Signed-off-by: Chen, Guobing <guobing.chen@intel.com>
2020-10-29 02:08:23 +08:00
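A generic sbgemv kernel can be understood as an ordinary single-precision gemv whose A and x elements are widened from bfloat16 on load. The sketch below shows the non-transposed, unit-stride case. It assumes that alpha, beta and y are float while A and x are bfloat16, and it uses the uint16_t typedef mentioned elsewhere in this log. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t bfloat16;   /* raw bfloat16 bits */

/* bfloat16 is the upper 16 bits of an IEEE-754 binary32 value. */
static float bf16_to_float(bfloat16 h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* y = alpha * A * x + beta * y, A is m x n column-major with leading
 * dimension lda; A and x are bfloat16, accumulation is in float. */
static void sbgemv_n_ref(int m, int n, float alpha, const bfloat16 *a,
                         int lda, const bfloat16 *x, float beta, float *y)
{
    for (int i = 0; i < m; i++)
        y[i] = (beta == 0.0f) ? 0.0f : beta * y[i];  /* BLAS: ignore y if beta == 0 */
    for (int j = 0; j < n; j++) {
        float xj = alpha * bf16_to_float(x[j]);
        for (int i = 0; i < m; i++)
            y[i] += bf16_to_float(a[i + (size_t)j * lda]) * xj;
    }
}
```

In this encoding, 1.0, 2.0 and 3.0 correspond to the bfloat16 bit patterns 0x3F80, 0x4000 and 0x4040.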
Martin Kroeker 2207a16235
Merge pull request #2952 from martin-frbg/issue2931
Try to read cpu ID from /sys/devices/.../cpu0 if HWCAP_CPUID fails
2020-10-28 09:37:32 +01:00
Martin Kroeker b937d78a6d
Try to read cpu information from /sys/devices/system/cpu/cpu0 if HWCAP_CPUID fails 2020-10-27 17:51:32 +01:00
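Here is a minimal sketch of such a fallback. It assumes that the Linux arm64 sysfs file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1 holds MIDR_EL1 as a hex string. The log names only the cpu0 directory, so the exact file that OpenBLAS reads may differ. The helper names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical fallback for when HWCAP_CPUID is unavailable: read MIDR_EL1
 * from sysfs. Returns 0 if the file is missing or cannot be parsed. */
static uint64_t read_midr_cpu0(void)
{
    const char *path =
        "/sys/devices/system/cpu/cpu0/regs/identification/midr_el1";
    unsigned long long midr = 0;
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return 0;
    if (fscanf(f, "%llx", &midr) != 1)   /* %llx accepts a leading 0x */
        midr = 0;
    fclose(f);
    return (uint64_t)midr;
}

/* MIDR_EL1 layout: implementer [31:24], part number [15:4]. */
static unsigned midr_implementer(uint64_t midr) { return (unsigned)((midr >> 24) & 0xffu); }
static unsigned midr_partnum(uint64_t midr)     { return (unsigned)((midr >> 4) & 0xfffu); }
```

For example, 0x410fd034 decodes to implementer 0x41 (Arm) and part number 0xd03 (Cortex-A53).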
Martin Kroeker fd7da56965
Move definitions that are neither needed nor supported on SUNOS 2020-10-25 12:01:50 +01:00
Martin Kroeker ff65952e46
Move HAVE_P10_SUPPORT to the build system
to be able to include a binutils version check
2020-10-20 00:55:41 +02:00
Rajalakshmi Srinivasaraghavan b5d30b390d Fix build issues with bfloat16
This patch fixes compilation errors due to recent renaming from SH to SB
with BUILD_BFLOAT16.
2020-10-13 11:00:22 -05:00
Martin Kroeker 006c7f6671
Change "HALF" and "sh" to "BFLOAT16" and "sb" 2020-10-12 00:06:06 +02:00
Martin Kroeker 85154c2e18
Change "HALF" and "sh" to "BFLOAT16" and "sb" 2020-10-12 00:05:05 +02:00
Martin Kroeker 887e00fd7f
Adapt for supporting only a subset of variable types 2020-10-11 14:58:57 +02:00
Martin Kroeker 886a8e3190
Adapt for supporting only a subset of variable types 2020-10-11 14:57:32 +02:00
Martin Kroeker ac653c94f3
Merge branch 'develop' into issue2588-cmake 2020-10-11 13:57:07 +02:00
Martin Kroeker f032d8966e
Merge pull request #2874 from Flamefire/memory_fixes
Avoid out of bounds access on invalid memory free
2020-10-04 15:16:51 +02:00
Martin Kroeker f6e4cf2f9d
Merge pull request #2876 from Flamefire/omp_fork_fix
Lazily reinit threads after a fork in OMP mode
2020-10-03 22:52:17 +02:00
User User-User d2333e7842 aarch64: fix -std=c18 compilation 2020-10-03 18:00:34 +03:00
Alexander Grund 3094fc6c83
Lazily reinit threads after a fork in OMP mode
This initializes the per-thread memory buffers, which are cleared or
released on a fork via pthread_atfork. Without this, each thread calls
blas_memory_alloc on almost every execution, which slows the code down
significantly because the threads contend for a memory allocation that
is serialized with locks.
2020-10-01 15:41:42 +02:00
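The "lazy reinit" pattern can be sketched as follows. A pthread_atfork child handler only marks the buffers as invalid, and the next BLAS call rebuilds them once. All names are hypothetical, and the sketch omits the locking that a real multi-threaded initializer needs.

```c
#include <pthread.h>

static int buffers_ready = 0;  /* hypothetical: per-thread buffers valid? */
static int init_count = 0;     /* counts (re)initializations for the sketch */

static void init_buffers(void)
{
    /* ... allocate the per-thread memory buffers here ... */
    init_count++;
    buffers_ready = 1;
}

/* Runs in the child after fork(): buffers were released, so just mark them
 * invalid instead of rebuilding them eagerly inside the handler. */
static void child_after_fork(void)
{
    buffers_ready = 0;
}

static void register_fork_handler(void)
{
    pthread_atfork(NULL, NULL, child_after_fork);
}

/* Called at the start of each threaded BLAS call: reinitialize once, on
 * demand, so threads do not fall back to locked allocation every time. */
static void ensure_buffers(void)
{
    if (!buffers_ready)
        init_buffers();
}
```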
Alexander Grund 3c05f54df8
Avoid out of bounds access on invalid memory free 2020-10-01 10:48:45 +02:00
Alexander Grund dee7c49938
Fix TABs and trailing space 2020-10-01 10:43:16 +02:00
Martin Kroeker 896bbd55e1
Add support for building only selected variable types 2020-09-26 23:25:55 +02:00
Martin Kroeker 357bff06b5
Add BUILD_vartype defines 2020-09-22 23:24:22 +02:00
Martin Kroeker 988a6f429e
Add BUILD_vartype defines 2020-09-22 23:23:33 +02:00
Martin Kroeker e5e2fbd593
Support building only selected types 2020-09-22 23:21:30 +02:00
Martin Kroeker 3287848c8f
Support building only selected types 2020-09-22 23:20:51 +02:00
y00512012 06cf73a239 fix a bug in trmm 2020-09-22 16:47:10 +08:00
Martin Kroeker ddec244a5a
Merge pull request #2838 from austinpagan/gordon_trmm
Adding performance patch for trmm, just like trsm (#2836)
2020-09-15 21:17:48 +02:00
fossum dfeca46098 Adding performance patch for trmm, just like #2836 2020-09-15 08:59:50 -05:00
fossum 274d6e015b Fixing a performance bug in trsm_[LR].c. 2020-09-14 13:10:48 -05:00
Martin Kroeker 91c84e1c01
Merge pull request #2796 from Guobing-Chen/BF16_dot_coversion_apis
Add bfloat16 based dot and conversion with single/double
2020-09-14 15:00:19 +02:00
Marius Hillenbrand a55fe06f25 s390x/DYNAMIC_ARCH: define a HW_CAP flag to support slightly older glibc versions
Enable building DYNAMIC_ARCH support with older versions of glibc that
do not know about the hwcap flag HWCAP_S390_VXE yet.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-08 19:34:18 +02:00
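The usual way to support headers that predate a hwcap flag is a guarded fallback define. The following sketch assumes that HWCAP_S390_VXE is bit 13 (value 8192), which matches our understanding of the kernel/glibc ABI but should be checked against the target's headers. The helper name is hypothetical.

```c
#if defined(__linux__) && defined(__s390x__)
#include <sys/auxv.h>   /* may or may not define HWCAP_S390_VXE */
#endif

/* Older glibc does not know HWCAP_S390_VXE; supply the bit ourselves
 * (assumed: bit 13, value 8192) so DYNAMIC_ARCH still builds. */
#ifndef HWCAP_S390_VXE
#define HWCAP_S390_VXE 8192
#endif

/* Hypothetical helper: does the AT_HWCAP mask report the vector
 * enhancements facility? */
static int has_vector_enhancements(unsigned long hwcap)
{
    return (hwcap & HWCAP_S390_VXE) != 0;
}
```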
Marius Hillenbrand 4f34bcfb5e s390x/DYNAMIC_ARCH: pass supported arch levels from Makefile to run-time code
... instead of duplicating the (old) mechanism from the Makefile that
aimed to derive supported architecture generations from the gcc
version.

To enable builds with DYNAMIC_ARCH with older compiler releases, the
Makefile and drivers/other/dynamic_arch.c need a common view of the
architecture support built into the library.

We follow the notation from x86 when used with DYNAMIC_LIST, where
defines DYN_<ARCH NAME> denote support for a given generation to be
built in. Since there are far fewer architecture generations in OpenBLAS
for s390x, that does not bloat command lines too much.

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
2020-09-08 19:34:18 +02:00
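Here is a sketch of how run-time code can honor the DYN_<ARCH NAME> defines. It assumes that the Makefile passes, for example, -DDYN_Z13 -DDYN_Z14 only for generations whose kernels are actually built in. The macro spellings, the zcore_t enum and the function name are illustrative and not taken from OpenBLAS.

```c
/* Hypothetical core identifiers for the sketch. */
typedef enum { CORE_GENERIC, CORE_Z13, CORE_Z14 } zcore_t;

/* Consider only generations compiled into the library (DYN_* defines from
 * the Makefile), newest first, then check what the CPU supports:
 * z14 kernels need the vector enhancements facility, z13 needs vectors. */
static zcore_t select_zcore(int cpu_has_vx, int cpu_has_vxe)
{
#ifdef DYN_Z14
    if (cpu_has_vxe)
        return CORE_Z14;
#endif
#ifdef DYN_Z13
    if (cpu_has_vx)
        return CORE_Z13;
#endif
    (void)cpu_has_vx;
    (void)cpu_has_vxe;
    return CORE_GENERIC;
}
```

Because the Makefile and the run-time code read the same defines, neither side has to re-derive supported generations from the gcc version.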
Martin Kroeker 330044d821
Fix potential domain error in sqrt 2020-09-05 09:44:33 +02:00
Chen, Guobing deaeb6c5b8 Add bfloat16 based dot and conversion with single/double
1. Added bfloat16 based dot as new API: shdot
2. Implemented generic kernel and cooperlake-specific (AVX512-BF16) kernel for shdot
3. Added 4 conversion APIs for bfloat16 data type <=> single/double: shstobf16 shdtobf16 sbf16tos dbf16tod
     shstobf16 -- convert single float array to bfloat16 array
     shdtobf16 -- convert double float array to bfloat16 array
     sbf16tos  -- convert bfloat16 array to single float array
     dbf16tod  -- convert bfloat16 array to double float array
4. Implemented generic kernels for all 4 conversion APIs, and cooperlake-specific kernel for shstobf16 and shdtobf16
5. Update level-1 threading helper functions and macros to support multi-threading for these new APIs
6. Fix Cooperlake platform detection/selection issue in DYNAMIC_ARCH builds
7. Change the typedef of bfloat16 from unsigned short to the stricter uint16_t

Signed-off-by: Chen, Guobing <guobing.chen@intel.com>
2020-09-04 02:31:25 +08:00
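Generic versions of the dot and conversion kernels can be sketched in a few lines. Widening bfloat16 to float is exact. Narrowing needs a rounding rule, and the log does not say which one the OpenBLAS kernels use, so this sketch assumes round-to-nearest-even and keeps NaNs quiet. The function names are hypothetical and are not the shdot/shstobf16 entry points.

```c
#include <stdint.h>
#include <string.h>

typedef uint16_t bfloat16;   /* raw bfloat16 bits */

static float bf16_to_float(bfloat16 h)
{
    uint32_t bits = (uint32_t)h << 16;   /* exact widening */
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* float -> bfloat16, assumed round-to-nearest-even; NaN stays a quiet NaN. */
static bfloat16 float_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    if ((bits & 0x7fffffffu) > 0x7f800000u)
        return (bfloat16)((bits >> 16) | 0x0040u);
    uint32_t rounding = 0x7fffu + ((bits >> 16) & 1u);
    return (bfloat16)((bits + rounding) >> 16);
}

/* Reference bfloat16 dot product, accumulated in single precision. */
static float sbdot_ref(int n, const bfloat16 *x, const bfloat16 *y)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += bf16_to_float(x[i]) * bf16_to_float(y[i]);
    return sum;
}
```

A value exactly halfway between two bfloat16 neighbors, such as 1 + 2^-8, rounds to the even neighbor (1.0).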
Chen, Guobing 0c1c903f1e Fix OMP thread number specification issue
In the current code, no matter how many threads are specified, all
available CPUs are used when invoking OMP, which leads to very poor
performance when the workload is small but the number of available CPUs
is large. Much time is wasted on inter-thread synchronization. Fix this
issue by actually using the number specified by the variable 'num' from
the calling API.

Signed-off-by: Chen, Guobing <guobing.chen@intel.com>
2020-08-24 02:45:54 +08:00
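In OpenMP, "use the caller's count" means passing it through the num_threads clause instead of letting the runtime default to all CPUs. The following is a minimal sketch with a hypothetical level-1 style reduction. Without -fopenmp the pragma is ignored, and the loop runs serially with the same result.

```c
/* Hypothetical driver: split the work over exactly `num` threads rather
 * than every available CPU, so small workloads avoid heavy sync costs. */
static double sum_with_threads(const double *x, int n, int num)
{
    double total = 0.0;
    if (num < 1)
        num = 1;
#pragma omp parallel for num_threads(num) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += x[i];
    return total;
}
```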
Chen, Guobing e740c4873d Enable COOPERLAKE build target
Enable new build target platform -- COOPERLAKE. This target platform
supports all the ISAs supported by SKYLAKEX + avx512bf16. So all the
SKYLAKEX specific kernels/drivers and related code are now extended
to be also active on COOPERLAKE. In addition, new BF16-related kernels
are active under this target.
2020-08-13 06:18:00 +08:00
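Because COOPERLAKE is "SKYLAKEX plus avx512bf16", run-time target selection checks for BF16 first and then falls through to SKYLAKEX. On real hardware, AVX512_BF16 is reported in CPUID leaf 7, subleaf 1, EAX bit 5. This sketch takes already-decoded feature flags so that it stays portable. The struct, enum and function names are hypothetical.

```c
/* Hypothetical targets and decoded feature flags for the sketch. */
typedef enum {
    TARGET_GENERIC, TARGET_HASWELL, TARGET_SKYLAKEX, TARGET_COOPERLAKE
} x86_target_t;

struct cpu_features {
    int avx2;
    int avx512;      /* the SKYLAKEX AVX-512 subsets (F, CD, BW, DQ, VL) */
    int avx512bf16;
};

/* Most specific first: COOPERLAKE reuses the SKYLAKEX kernels and adds
 * the BF16 ones, so it must be tested before SKYLAKEX. */
static x86_target_t select_x86_target(struct cpu_features f)
{
    if (f.avx512 && f.avx512bf16)
        return TARGET_COOPERLAKE;
    if (f.avx512)
        return TARGET_SKYLAKEX;
    if (f.avx2)
        return TARGET_HASWELL;
    return TARGET_GENERIC;
}
```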