Commit Graph

118 Commits

Author SHA1 Message Date
Martin Kroeker 8a1710dd0d
don't apply switch_ratio to tail of loop 2024-10-06 20:03:32 +02:00
shivammonaka 9e22d70957 Dynamic locking in Pthread Backend to allow multiple BLAS calls to be executed parallelly 2024-06-07 08:40:17 +05:30
Martin Kroeker db070a9223
add gemm_batch drivers 2024-05-31 18:29:27 +02:00
Martin Kroeker d0794f88dc
add gemm_batch driver 2024-05-29 15:49:20 +02:00
yamazaki-mitsufumi 51ab1903e7 Expanding the scop of 2D thread distribution 2024-04-18 18:20:25 +09:00
shivammonaka d49ebc54e1 Merge branch 'shivam-develop' into shivam-Locks 2024-02-29 11:58:14 +05:30
shivammonaka bc191015e3 Using OpenMP locks with NUM_PARALLEL 2024-02-29 11:47:05 +05:30
Martin Kroeker c4bd4a2e5d
fix improper function prototypes (empty parentheses) 2023-09-30 12:49:24 +02:00
Chris Sidebottom 32f2fafde7 Propagate SWITCH_RATIO to DYNAMIC_ARCH builds
Previously dynamic builds were either using the default SWITCH_RATIO
or one from the higher level architecture; this patch ensures the
dynamic builds can use this parameter as well.
2023-04-17 15:34:12 +01:00
Honglin Zhu 4989e039a5 Define SBGEMM_ALIGN_K for DYNAMIC_ARCH build 2022-10-27 14:10:26 +08:00
Honglin Zhu b00d5b9746 New sbgemm implementation for Neoverse N2
1. Use UZP instructions but not gather load and scatter store instructions to get lower latency.
    2. Padding k to a power of 4.
2022-10-26 15:09:41 +08:00
Wangyang Guo 3dc6052c7e initial support for Sapphire Rapids platform 2021-10-12 01:30:40 -07:00
Martin Kroeker 2f8220d757
Add sbgemm 2021-09-14 16:14:43 +02:00
Martin Kroeker 307c4c0786
Fix typo 2021-06-16 13:41:16 +02:00
Martin Kroeker e83df93975
Work around another recent macro name collision with winnt.h 2021-06-16 12:32:34 +02:00
Martin Kroeker a554712439
remove extra/intermediate size step for min_jj introduced in PR747 2020-12-08 21:01:36 +01:00
Martin Kroeker 5d26223f4a
remove extra/intermediate size step of min_jj from PR747 2020-12-08 20:59:56 +01:00
Martin Kroeker d3ff1f889f
Convert ifndefs to ifneq 2020-11-22 16:27:17 +01:00
Rajalakshmi Srinivasaraghavan b5d30b390d Fix build issues with bfloat16
This patch fixes compilation errors due to recent renaming from SH to SB
with BUILD_BFLOAT16.
2020-10-13 11:00:22 -05:00
Martin Kroeker 006c7f6671
Change "HALF" and "sh" to "BFLOAT16" and "sb" 2020-10-12 00:06:06 +02:00
Martin Kroeker 886a8e3190
Adapt for supporting only a subset of variable types 2020-10-11 14:57:32 +02:00
Martin Kroeker ac653c94f3
Merge branch 'develop' into issue2588-cmake 2020-10-11 13:57:07 +02:00
Martin Kroeker 988a6f429e
Add BUILD_vartype defines 2020-09-22 23:23:33 +02:00
Martin Kroeker e5e2fbd593
Support building only selected types 2020-09-22 23:21:30 +02:00
y00512012 06cf73a239 fix a bug of trmm 2020-09-22 16:47:10 +08:00
Martin Kroeker ddec244a5a
Merge pull request #2838 from austinpagan/gordon_trmm
Adding performance patch for trmm, just like trsm (#2836)
2020-09-15 21:17:48 +02:00
fossum dfeca46098 Adding performance patch for trmm, just like #2836 2020-09-15 08:59:50 -05:00
fossum 274d6e015b Fixing a performance bug in trsm_[LR].c. 2020-09-14 13:10:48 -05:00
Martin Kroeker 330044d821
Fix potentiol domain error in sqrt 2020-09-05 09:44:33 +02:00
Chen, Guobing e740c4873d Enable COOPERLAKE build target
Enable new build target platform -- COOPERLAKE. This target platform
supports all the SKYLAKEX supported ISAs + avx512bf16. So all the
SKYLAKEX specific kernels/drivers and related code are now extended
to be also active on COOPERLAKE. Besides, new BF16 related kernels
are active under this target.
2020-08-13 06:18:00 +08:00
Martin Kroeker ce45af8151
Update conditional for atomics to use HAVE_C11 2020-07-18 17:09:56 +00:00
Martin Kroeker 6f38de06d2
Update conditional for atomics to use HAVE_C11 2020-07-18 17:09:01 +00:00
Martin Kroeker 5dd14e3d48
Make building the bfloat16 functions conditional on option BUILD_HALF (#2590)
* make building the bfloat16 BLAS functions conditional on BUILD_HALF

* pass the BUILD_HALF option to gensymbol

* Pass BUILD_HALF as a compiler define for dynamic_arch builds
2020-05-01 09:58:30 +02:00
Rajalakshmi Srinivasaraghavan 7eb55504b1 RFC : Add half precision gemm for bfloat16 in OpenBLAS
This patch adds support for bfloat16 data type matrix multiplication kernel.
For architectures that don't support bfloat16, it is defined as unsigned short
(2 bytes).  Default unroll sizes can be changed as per architecture as done for
SGEMM and for now 8 and 4 are used for M and N.  Size of ncopy/tcopy can be
changed as per architecture requirement and for now, size 2 is used.

Added shgemm in kernel/power/KERNEL.POWER9 and tested in powerpc64le and
powerpc64.  For reference, added a small test compare_sgemm_shgemm.c to compare
sgemm and shgemm output.

This patch does not cover OpenBLAS test, benchmark and lapack tests for shgemm.
Complex type implementation can be discussed and added once this is approved.
2020-04-14 14:55:08 -05:00
Ali Saidi 97ce6bbce2 Fix barriers in level3_thread 2020-02-29 17:45:17 +00:00
wjc404 2f96a2c55b
Update trmm_R.c 2020-02-05 10:15:02 +08:00
wjc404 833bd0f8ff
Update trmm_L.c 2020-02-05 10:09:41 +08:00
wjc404 77b8f49556
Update level3_thread.c 2020-02-04 20:33:08 +08:00
wjc404 1c3e20ce48
Update level3.c 2020-02-04 20:30:23 +08:00
wjc404 e9fb8f62b1
Update level3_gemm3m_thread.c 2020-01-22 17:40:03 +00:00
wjc404 4c35b8dbaa
Update gemm3m_level3.c 2019-12-27 18:03:01 +08:00
Martin Kroeker f3065a0eed
Fix race conditions in multithreaded GEMM3M
by adding barriers (and a mutex lock for the non-OpenMP case) like it was already done for GEMM in level3_thread.c some time ago
2019-11-23 19:54:56 +01:00
Martin Kroeker f343ed65b5
Avoid taking the root of a negative number
Fixes #1924 where numpy 1.17+ would report the (transient) FE_INVALID exception raised for the domain error.
2018-12-22 22:30:29 +01:00
Martin Kroeker f72fdf525c
Merge pull request #1875 from martin-frbg/issue1851
Serialize accesses to parallelized level3 functions from multiple cal…
2018-11-25 20:53:46 +01:00
Martin Kroeker 113cb00b95
fix missing parenthesis 2018-11-19 21:01:36 +01:00
Martin Kroeker 5192651706
Add CriticalSection handling instead of mutexes for Windows 2018-11-19 17:58:22 +01:00
Martin Kroeker 2e6fae2aad
Serialize accesses to parallelized level3 functions from multiple callers
for #1851
2018-11-19 14:02:50 +01:00
Arjan van de Ven 5b708e5eb1 sgemm/dgemm: add a way for an arch kernel to specify prefered sizes
The current gemm threading code can make very unfortunate choices, for
example on my 10 core system a 1024x1024x1024 matrix multiply ends up
chunking into blocks of 102... which is not a vector friendly size
and performance ends up horrible.

this patch adds a helper define where an architecture can specify
a preference for size multiples.
This is different from existing defines that are minimum sizes and such.

The performance increase with this patch for the 1024x1024x1024 sgemm
is 2.3x (!!)
2018-11-01 01:43:20 +00:00
Martin Kroeker 5f2a3c05cd
Revert "Rewrite &= -> = and simplify the initial blocking phase." 2018-07-03 21:42:28 +02:00
Craig Donner 0144068537 Rewrite &= -> = and simplify the initial blocking phase. 2018-06-25 15:08:55 +01:00