193 lines
10 KiB
Markdown
193 lines
10 KiB
Markdown
# Developer manual
|
|
|
|
## Source code layout
|
|
|
|
```
|
|
OpenBLAS/
|
|
├── benchmark Benchmark codes for BLAS
|
|
├── cmake CMakefiles
|
|
├── ctest Test codes for CBLAS interfaces
|
|
├── driver Implemented in C
|
|
│ ├── level2
|
|
│ ├── level3
|
|
│ ├── mapper
|
|
│ └── others Memory management, threading, etc
|
|
├── exports Generate shared library
|
|
├── interface Implement BLAS and CBLAS interfaces (calling driver or kernel)
|
|
│ ├── lapack
|
|
│ └── netlib
|
|
├── kernel Optimized assembly kernels for CPU architectures
|
|
│ ├── alpha Original GotoBLAS kernels for DEC Alpha
|
|
│ ├── arm ARMV5,V6,V7 kernels (including generic C codes used by other architectures)
|
|
│ ├── arm64 ARMV8
|
|
│ ├── generic General kernel codes written in plain C, parts used by many architectures.
|
|
│ ├── ia64 Original GotoBLAS kernels for Intel Itanium
|
|
│ ├── mips
|
|
│ ├── mips64
|
|
│ ├── power
|
|
| ├── riscv64
|
|
| ├── simd Common code for Universal Intrinsics, used by some x86_64 and arm64 kernels
|
|
│ ├── sparc
|
|
│ ├── x86
|
|
│ ├── x86_64
|
|
│ └── zarch
|
|
├── lapack Optimized LAPACK codes (replacing those in regular LAPACK)
|
|
│ ├── getf2
|
|
│ ├── getrf
|
|
│ ├── getrs
|
|
│ ├── laswp
|
|
│ ├── lauu2
|
|
│ ├── lauum
|
|
│ ├── potf2
|
|
│ ├── potrf
|
|
│ ├── trti2
|
|
│ ├── trtri
|
|
│ └── trtrs
|
|
├── lapack-netlib LAPACK codes from netlib reference implementation
|
|
├── reference BLAS Fortran reference implementation (unused)
|
|
├── relapack Elmar Peise's recursive LAPACK (implemented on top of regular LAPACK)
|
|
├── test Test codes for BLAS
|
|
└── utest Regression test
|
|
|
|
```
|
|
|
|
A call tree for `dgemm` looks as follows:
|
|
```
|
|
interface/gemm.c
|
|
│
|
|
driver/level3/level3.c
|
|
│
|
|
gemm assembly kernels at kernel/
|
|
```
|
|
|
|
To find the kernel currently used for a particular supported CPU, please check the corresponding `kernel/$(ARCH)/KERNEL.$(CPU)` file.
|
|
|
|
Here is an example for `kernel/x86_64/KERNEL.HASWELL`:
|
|
```
|
|
...
|
|
DTRMMKERNEL = dtrmm_kernel_4x8_haswell.c
|
|
DGEMMKERNEL = dgemm_kernel_4x8_haswell.S
|
|
...
|
|
```
|
|
According to the above `KERNEL.HASWELL`, OpenBLAS Haswell dgemm kernel file is `dgemm_kernel_4x8_haswell.S`.
|
|
|
|
|
|
## Optimizing GEMM for a given hardware
|
|
|
|
!!! abstract "Read the Goto paper to understand the algorithm"
|
|
|
|
Goto, Kazushige; van de Geijn, Robert A. (2008).
|
|
["Anatomy of High-Performance Matrix Multiplication"](http://delivery.acm.org/10.1145/1360000/1356053/a12-goto.pdf?ip=155.68.162.54&id=1356053&acc=ACTIVE%20SERVICE&key=A79D83B43E50B5B8%2EF070BBE7E45C3F17%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&__acm__=1517932837_edfe766f1e295d9a7830812371e1d173).
|
|
ACM Transactions on Mathematical Software 34 (3): Article 12
|
|
|
|
(The above link is available only to ACM members, but this and many related
|
|
papers is also available on [the pages of van de Geijn's FLAME project](http://www.cs.utexas.edu/~flame/web/FLAMEPublications.html))
|
|
|
|
The `driver/level3/level3.c` is the implementation of Goto's algorithm.
|
|
Meanwhile, you can look at `kernel/generic/gemmkernel_2x2.c`, which is a naive
|
|
`2x2` register blocking `gemm` kernel in C. Then:
|
|
|
|
* Write optimized assembly kernels. Consider instruction pipeline, available registers, memory/cache access.
|
|
* Tune cache block sizes (`Mc`, `Kc`, and `Nc`)
|
|
|
|
Note that not all of the CPU-specific parameters in `param.h` are actively used in algorithms.
|
|
`DNUMOPT` only appears as a scale factor in profiling output of the level3 `syrk` interface code,
|
|
while its counterpart `SNUMOPT` (aliased as `NUMOPT` in `common.h`) is not used anywhere at all.
|
|
|
|
`SYMV_P` is only used in the generic kernels for the `symv` and `chemv`/`zhemv` functions -
|
|
at least some of those are usually overridden by CPU-specific implementations, so if you start
|
|
by cloning the existing implementation for a related CPU you need to check its `KERNEL` file
|
|
to see if tuning `SYMV_P` would have any effect at all.
|
|
|
|
`GEMV_UNROLL` is only used by some older x86-64 kernels, so not all sections in `param.h` define it.
|
|
Similarly, not all of the CPU parameters like L2 or L3 cache sizes are necessarily used in current
|
|
kernels for a given model - by all indications the CPU identification code was imported from some
|
|
other project originally.
|
|
|
|
|
|
## Running OpenBLAS tests
|
|
|
|
We use tests for Netlib BLAS, CBLAS, and LAPACK. In addition, we use
|
|
OpenBLAS-specific regression tests. They can be run with Make:
|
|
|
|
* `make -C test` for BLAS tests
|
|
* `make -C ctest` for CBLAS tests
|
|
* `make -C utest` for OpenBLAS regression tests
|
|
* `make lapack-test` for LAPACK tests
|
|
|
|
We also use the [BLAS-Tester](https://github.com/xianyi/BLAS-Tester) tests for regression testing.
|
|
It is basically the ATLAS test suite adapted for building with OpenBLAS.
|
|
|
|
The project makes use of several Continuous Integration (CI) services
|
|
conveniently interfaced with GitHub to automatically run tests on a number of
|
|
platforms and build configurations.
|
|
|
|
Also note that the test suites included with "numerically heavy" projects like
|
|
Julia, NumPy, SciPy, Octave or QuantumEspresso can be used for regression
|
|
testing, when those projects are built such that they use OpenBLAS.
|
|
|
|
|
|
## Benchmarking
|
|
|
|
A number of benchmarking methods are used by OpenBLAS:
|
|
|
|
- Several simple C benchmarks for performance testing individual BLAS functions
|
|
are available in the `benchmark` folder. They can be run locally through the
|
|
`Makefile` in that directory. And the `benchmark/scripts` subdirectory
|
|
contains similar benchmarks that use OpenBLAS via NumPy, SciPy, Octave and R.
|
|
- On pull requests, a representative set of functions is tested for performance
|
|
regressions with Codspeed; results can be viewed at
|
|
[https://codspeed.io/OpenMathLib/OpenBLAS](https://codspeed.io/OpenMathLib/OpenBLAS).
|
|
- The [OpenMathLib/BLAS-Benchmarks](https://github.com/OpenMathLib/BLAS-Benchmarks) repository
|
|
contains an [Airspeed Velocity](https://github.com/airspeed-velocity/asv/)-based benchmark
|
|
suite which is run on several CPU architectures in cron jobs. Results are published
|
|
to a dashboard: [http://www.openmathlib.org/BLAS-Benchmarks/](http://www.openmathlib.org/BLAS-Benchmarks/).
|
|
|
|
Benchmarking code for BLAS libraries, and specific performance analysis results, can be found
|
|
in a number of places. For example:
|
|
|
|
* [MatlabJuliaMatrixOperationsBenchmark](https://github.com/RoyiAvital/MatlabJuliaMatrixOperationsBenchmark)
|
|
(various matrix operations in Julia and Matlab)
|
|
* [mmperf/mmperf](https://github.com/mmperf/mmperf/) (single-core matrix multiplication)
|
|
|
|
|
|
## Adding autodetection support for a new revision or variant of a supported CPU
|
|
|
|
Especially relevant for x86-64, a new CPU model may be a "refresh" (die shrink and/or different number of cores) within an existing
|
|
model family without significant changes to its instruction set (e.g., Intel Skylake and Kaby Lake still are fundamentally the same architecture as Haswell,
|
|
low end Goldmont etc. are Nehalem). In this case, compilation with the appropriate older `TARGET` will already lead to a satisfactory build.
|
|
|
|
To achieve autodetection of the new model, its CPUID (or an equivalent identifier) needs to be added in the `cpuid_<architecture>.c`
|
|
relevant for its general architecture, with the returned name for the new type set appropriately. For x86, which has the most complex
|
|
`cpuid` file, there are two functions that need to be edited: `get_cpuname()` to return, e.g., `CPUTYPE_HASWELL` and `get_corename()` for the (broader)
|
|
core family returning, e.g., `CORE_HASWELL`.[^1]
|
|
|
|
[^1]:
|
|
This information ends up in the `Makefile.conf` and `config.h` files generated by `getarch`. Failure to
|
|
set either will typically lead to a missing definition of the `GEMM_UNROLL` parameters later in the build,
|
|
as `getarch_2nd` will be unable to find a matching parameter section in `param.h`.
|
|
|
|
For architectures where `DYNAMIC_ARCH` builds are supported, a similar but simpler code section for the corresponding
|
|
runtime detection of the CPU exists in `driver/others/dynamic.c` (for x86), and `driver/others/dynamic_<arch>.c` for other architectures.
|
|
Note that for x86 the CPUID is compared after splitting it into its family, extended family, model and extended model parts, so the single decimal
|
|
number returned by Linux in `/proc/cpuinfo` for the model has to be converted back to hexadecimal before splitting into its constituent
|
|
digits. For example, `142 == 8E` translates to extended model 8, model 14.
|
|
|
|
|
|
## Adding dedicated support for a new CPU model
|
|
|
|
Usually it will be possible to start from an existing model, clone its `KERNEL` configuration file to the new name to use for this
|
|
`TARGET` and eventually replace individual kernels with versions better suited for peculiarities of the new CPU model.
|
|
In addition, it is necessary to add (or clone at first) the corresponding section of `GEMM_UNROLL` parameters in the top-level `param.h`,
|
|
and possibly to add definitions such as `USE_TRMM` (governing whether `TRMM` functions use the respective `GEMM` kernel or a separate source file)
|
|
to the `Makefile`s (and `CMakeLists.txt`) in the kernel directory. The new CPU name needs to be added to `TargetList.txt`,
|
|
and the CPU auto-detection code used by the `getarch` helper program - contained in
|
|
the `cpuid_<architecture>.c` file amended to include the CPUID (or equivalent) information processing required (see preceding section).
|
|
|
|
|
|
## Adding support for an entirely new architecture
|
|
|
|
This endeavour is best started by cloning the entire support structure for 32-bit ARM, and within that the ARMv5 CPU in particular,
|
|
as this is implemented through plain C kernels only. An example providing a convenient "shopping list" can be seen in pull request
|
|
[#1526](https://github.com/OpenMathLib/OpenBLAS/pull/1526).
|