Commit Graph

110 Commits

Author SHA1 Message Date
Chris Sidebottom 32f2fafde7 Propagate SWITCH_RATIO to DYNAMIC_ARCH builds
Previously dynamic builds were either using the default SWITCH_RATIO
or one from the higher level architecture; this patch ensures the
dynamic builds can use this parameter as well.
2023-04-17 15:34:12 +01:00
Honglin Zhu 4989e039a5 Define SBGEMM_ALIGN_K for DYNAMIC_ARCH build 2022-10-27 14:10:26 +08:00
Honglin Zhu b00d5b9746 New sbgemm implementation for Neoverse N2
1. Use UZP instructions but not gather load and scatter store instructions to get lower latency.
    2. Padding k to a power of 4.
2022-10-26 15:09:41 +08:00
Wangyang Guo 3dc6052c7e initial support for Sapphire Rapids platform 2021-10-12 01:30:40 -07:00
Martin Kroeker 2f8220d757
Add sbgemm 2021-09-14 16:14:43 +02:00
Martin Kroeker 307c4c0786
Fix typo 2021-06-16 13:41:16 +02:00
Martin Kroeker e83df93975
Work around another recent macro name collision with winnt.h 2021-06-16 12:32:34 +02:00
Martin Kroeker a554712439
remove extra/intermediate size step for min_jj introduced in PR747 2020-12-08 21:01:36 +01:00
Martin Kroeker 5d26223f4a
remove extra/intermediate size step of min_jj from PR747 2020-12-08 20:59:56 +01:00
Martin Kroeker d3ff1f889f
Convert ifndefs to ifneq 2020-11-22 16:27:17 +01:00
Rajalakshmi Srinivasaraghavan b5d30b390d Fix build issues with bfloat16
This patch fixes compilation errors due to recent renaming from SH to SB
with BUILD_BFLOAT16.
2020-10-13 11:00:22 -05:00
Martin Kroeker 006c7f6671
Change "HALF" and "sh" to "BFLOAT16" and "sb" 2020-10-12 00:06:06 +02:00
Martin Kroeker 886a8e3190
Adapt for supporting only a subset of variable types 2020-10-11 14:57:32 +02:00
Martin Kroeker ac653c94f3
Merge branch 'develop' into issue2588-cmake 2020-10-11 13:57:07 +02:00
Martin Kroeker 988a6f429e
Add BUILD_vartype defines 2020-09-22 23:23:33 +02:00
Martin Kroeker e5e2fbd593
Support building only selected types 2020-09-22 23:21:30 +02:00
y00512012 06cf73a239 fix a bug of trmm 2020-09-22 16:47:10 +08:00
Martin Kroeker ddec244a5a
Merge pull request #2838 from austinpagan/gordon_trmm
Adding performance patch for trmm, just like trsm (#2836)
2020-09-15 21:17:48 +02:00
fossum dfeca46098 Adding performance patch for trmm, just like #2836 2020-09-15 08:59:50 -05:00
fossum 274d6e015b Fixing a performance bug in trsm_[LR].c. 2020-09-14 13:10:48 -05:00
Martin Kroeker 330044d821
Fix potentiol domain error in sqrt 2020-09-05 09:44:33 +02:00
Chen, Guobing e740c4873d Enable COOPERLAKE build target
Enable new build target platform -- COOPERLAKE. This target platform
supports all the SKYLAKEX supported ISAs + avx512bf16. So all the
SKYLAKEX specific kernels/drivers and related code are now extended
to be also active on COOPERLAKE. Besides, new BF16 related kernels
are active under this target.
2020-08-13 06:18:00 +08:00
Martin Kroeker ce45af8151
Update conditional for atomics to use HAVE_C11 2020-07-18 17:09:56 +00:00
Martin Kroeker 6f38de06d2
Update conditional for atomics to use HAVE_C11 2020-07-18 17:09:01 +00:00
Martin Kroeker 5dd14e3d48
Make building the bfloat16 functions conditional on option BUILD_HALF (#2590)
* make building the bfloat16 BLAS functions conditional on BUILD_HALF

* pass the BUILD_HALF option to gensymbol

* Pass BUILD_HALF as a compiler define for dynamic_arch builds
2020-05-01 09:58:30 +02:00
Rajalakshmi Srinivasaraghavan 7eb55504b1 RFC : Add half precision gemm for bfloat16 in OpenBLAS
This patch adds support for bfloat16 data type matrix multiplication kernel.
For architectures that don't support bfloat16, it is defined as unsigned short
(2 bytes).  Default unroll sizes can be changed as per architecture as done for
SGEMM and for now 8 and 4 are used for M and N.  Size of ncopy/tcopy can be
changed as per architecture requirement and for now, size 2 is used.

Added shgemm in kernel/power/KERNEL.POWER9 and tested in powerpc64le and
powerpc64.  For reference, added a small test compare_sgemm_shgemm.c to compare
sgemm and shgemm output.

This patch does not cover OpenBLAS test, benchmark and lapack tests for shgemm.
Complex type implementation can be discussed and added once this is approved.
2020-04-14 14:55:08 -05:00
Ali Saidi 97ce6bbce2 Fix barriers in level3_thread 2020-02-29 17:45:17 +00:00
wjc404 2f96a2c55b
Update trmm_R.c 2020-02-05 10:15:02 +08:00
wjc404 833bd0f8ff
Update trmm_L.c 2020-02-05 10:09:41 +08:00
wjc404 77b8f49556
Update level3_thread.c 2020-02-04 20:33:08 +08:00
wjc404 1c3e20ce48
Update level3.c 2020-02-04 20:30:23 +08:00
wjc404 e9fb8f62b1
Update level3_gemm3m_thread.c 2020-01-22 17:40:03 +00:00
wjc404 4c35b8dbaa
Update gemm3m_level3.c 2019-12-27 18:03:01 +08:00
Martin Kroeker f3065a0eed
Fix race conditions in multithreaded GEMM3M
by adding barriers (and a mutex lock for the non-OpenMP case) like it was already done for GEMM in level3_thread.c some time ago
2019-11-23 19:54:56 +01:00
Martin Kroeker f343ed65b5
Avoid taking the root of a negative number
Fixes #1924 where numpy 1.17+ would report the (transient) FE_INVALID exception raised for the domain error.
2018-12-22 22:30:29 +01:00
Martin Kroeker f72fdf525c
Merge pull request #1875 from martin-frbg/issue1851
Serialize accesses to parallelized level3 functions from multiple cal…
2018-11-25 20:53:46 +01:00
Martin Kroeker 113cb00b95
fix missing parenthesis 2018-11-19 21:01:36 +01:00
Martin Kroeker 5192651706
Add CriticalSection handling instead of mutexes for Windows 2018-11-19 17:58:22 +01:00
Martin Kroeker 2e6fae2aad
Serialize accesses to parallelized level3 functions from multiple callers
for #1851
2018-11-19 14:02:50 +01:00
Arjan van de Ven 5b708e5eb1 sgemm/dgemm: add a way for an arch kernel to specify prefered sizes
The current gemm threading code can make very unfortunate choices, for
example on my 10 core system a 1024x1024x1024 matrix multiply ends up
chunking into blocks of 102... which is not a vector friendly size
and performance ends up horrible.

this patch adds a helper define where an architecture can specify
a preference for size multiples.
This is different from existing defines that are minimum sizes and such.

The performance increase with this patch for the 1024x1024x1024 sgemm
is 2.3x (!!)
2018-11-01 01:43:20 +00:00
Martin Kroeker 5f2a3c05cd
Revert "Rewrite &= -> = and simplify the initial blocking phase." 2018-07-03 21:42:28 +02:00
Craig Donner 0144068537 Rewrite &= -> = and simplify the initial blocking phase. 2018-06-25 15:08:55 +01:00
Arjan van de Ven 73de17664d Add missing barriers in gemm scheduler
a few places in the gemm scheduler code were missing barriers;
the code likely worked OK due to heavy use of volatile / _Atomic
but there's no reason to get this incorrect
2018-06-17 17:50:43 +00:00
Arjan van de Ven d148ec4ea1 Don't use _Atomic for jobs sometimes...
The use of _Atomic leads to really bad code generation in the compiler
(on x86, you get 2 "mfence" memory barriers around each access with gcc8, despite
x86 being ordered and cache coherent). But there's a fallback in the code that
just uses volatile which is more than plenty in practice.

If we're nervous about cross thread synchronization for these variables, we should
make the YIELD function be a compiler/memory barrier instead.

performance before (after last commit)

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10630.0   10.6       0.7%                             18112.8      6.2      -0.7%
  64 x 64               20374.8   13.0       1.9%                             40487.0      6.5       0.4%
  65 x 65              141955.2    1.9    -428.3%                            146708.8      1.9    -179.2%
  80 x 80              178921.1    2.9    -369.6%                            186032.7      2.8    -156.6%
  96 x 96              205436.2    4.3    -233.4%                            224513.1      3.9     -97.0%
 112 x 112             244408.2    5.8    -162.7%                            262158.7      5.4     -47.1%
 128 x 128             321334.5    6.5    -141.3%                            333829.0      6.3     -29.2%

Performance with this patch (roughly a 2x improvement):

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10756.0   10.5      -0.5%                             18296.7      6.1      -1.7%
  64 x 64               20490.0   12.9       1.4%                             40615.0      6.5       0.0%
  65 x 65               83528.3    3.3    -210.9%                             96319.0      2.9     -83.3%
  80 x 80              101453.5    5.1    -166.3%                            128021.7      4.0     -76.6%
  96 x 96              149795.1    5.9    -143.1%                            168059.4      5.3     -47.4%
 112 x 112             191481.2    7.3    -105.8%                            204165.0      6.9     -14.6%
 128 x 128             265019.2    7.9     -99.0%                            272006.4      7.7      -5.3%
2018-06-17 15:39:15 +00:00
Arjan van de Ven 9e162146a9 Only initialize the part of the jobs array that will get used
The jobs array is getting initialized in O(compiled cpus^2) complexity.
Distros and people with bigger systems will use pretty high values
(128 or 256 or more) for this value, leading to interesting bubbles
in performance.

Baseline (single threaded performance) gets roughly 13 - 15 multiplications per cycle
in the interesting range (threading kicks in at 65x65 mult by 65x65).
The hardware is capable of 32 multiplications per cycle theoretically.

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10703.9   10.6       0.0%                             17990.6      6.3       0.0%
  64 x 64               20778.4   12.8       0.0%                             40629.2      6.5       0.0%
  65 x 65               26869.9   10.3       0.0%                             52545.7      5.3       0.0%
  80 x 80               38104.5   13.5       0.0%                             72492.7      7.1       0.0%
  96 x 96               61626.4   14.4       0.0%                            113983.8      7.8       0.0%
 112 x 112              91803.8   15.3       0.0%                            180987.3      7.8       0.0%
 128 x 128             133161.4   15.8       0.0%                            258374.3      8.1       0.0%

When threading is turned on
TARGET=SKYLAKEX F_COMPILER=GFORTRAN  SHARED=1 DYNAMIC_THREADS=1 USE_OPENMP=0  NUM_THREADS=128

  Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10725.9   10.5      -0.2%                             18134.9      6.2      -0.8%
  64 x 64               20500.6   12.9       1.3%                             40929.1      6.5      -0.7%
  65 x 65             2040832.1    0.1   -7495.2%                           2097633.6      0.1   -3892.0%
  80 x 80             2063129.1    0.2   -5314.4%                           2119925.2      0.2   -2824.3%
  96 x 96             2070374.5    0.4   -3259.6%                           2173604.4      0.4   -1806.9%
 112 x 112            2111721.5    0.7   -2169.6%                           2263330.8      0.6   -1170.0%
 128 x 128            2276181.5    0.9   -1609.3%                           2377228.9      0.9    -820.1%

There is a deep deep cliff once you hit 65x65

With this patch

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10630.0   10.6       0.7%                             18112.8      6.2      -0.7%
  64 x 64               20374.8   13.0       1.9%                             40487.0      6.5       0.4%
  65 x 65              141955.2    1.9    -428.3%                            146708.8      1.9    -179.2%
  80 x 80              178921.1    2.9    -369.6%                            186032.7      2.8    -156.6%
  96 x 96              205436.2    4.3    -233.4%                            224513.1      3.9     -97.0%
 112 x 112             244408.2    5.8    -162.7%                            262158.7      5.4     -47.1%
 128 x 128             321334.5    6.5    -141.3%                            333829.0      6.3     -29.2%

The cliff is very significantly reduced.
(more to follow)
2018-06-17 15:32:03 +00:00
Martin Kroeker a91f1587b9
Work around name clash with Windows10's winnt.h
fixes #1503
2018-05-31 13:26:00 +02:00
Zhiyong Dang 3716267124 Change _STDC_VERSION__ to __STDC_VERSION__
Change-Id: Id3fa4e8d9eedd4ef7230df69b611e7f397301a42
2018-05-11 12:15:08 +08:00
Martin Kroeker 6a99fcce94
Use _Atomic instead of volatile for thread safety where C11 is supported
Suggested by dodomorandi in #660
2018-03-10 00:03:49 +01:00
Andrew 11a627c54e remove surplus parentheses to silence clang5 2018-01-01 20:56:26 +01:00
Andrew bfc2a88594 remove unused buffer 2017-12-22 00:55:40 +01:00