Arjan van de Ven cdc668d82b Add a "sgemm direct" mode for small matrixes OpenBLAS has a fancy algorithm for copying the input data while laying it out in a more CPU friendly memory layout. This is great for large matrixes; the cost of the copy is easily ammortized by the gains from the better memory layout. But for small matrixes (on CPUs that can do efficient unaligned loads) this copy can be a net loss. This patch adds (for SKYLAKEX initially) a "sgemm direct" mode, that bypasses the whole copy machinary for ALPHA=1/BETA=0/... standard arguments, for small matrixes only. What is small? For the non-threaded case this has been measured to be in the MNK = 28 * 512 * 512 range, while in the threaded case it's less, around MNK = 1 * 512 * 512		2018-12-13 13:47:31 +00:00
benchmark	Disable scal to benchmark zgemv separately by default	2018-08-10 01:54:18 +03:00
cmake	Add DYNAMIC_CORE list for ARM64	2018-12-07 17:42:23 +01:00
ctest	Handle special case of gfortran+clang+OpenMP	2018-06-19 20:46:36 +02:00
driver	Fix typo in previous commit for arm dynamic arch	2018-12-07 19:37:33 +01:00
exports	Add `$(LDFLAGS)` to `$(CC)` and `$(FC)` invocations within `exports/Makefile`	2018-09-21 09:19:51 +00:00
interface	Add a "sgemm direct" mode for small matrixes	2018-12-13 13:47:31 +00:00
kernel	Add a "sgemm direct" mode for small matrixes	2018-12-13 13:47:31 +00:00
lapack	Change _STDC_VERSION__ to __STDC_VERSION__	2018-05-11 12:15:08 +08:00
lapack-netlib	Merge pull request #1866 from martin-frbg/issue1859	2018-11-10 19:23:31 +01:00
reference	Remove all trailing whitespace except lapack-netlib	2014-06-27 12:05:18 -07:00
relapack	Add cmake build list file for ReLAPACK	2017-10-12 17:00:00 +02:00
test	Handle special case of gfortran+clang+OpenMP	2018-06-19 20:47:33 +02:00
utest	Fix unknown type name __WAIT_STATUS on RHEL5	2018-10-04 14:37:08 +02:00
.gitignore	Don't change timestamps	2017-08-01 13:43:59 +05:30
.travis.yml	Update .travis.yml	2018-11-11 20:50:38 +01:00
BACKERS.md	Added backers.	2013-09-05 15:39:45 +08:00
CMakeLists.txt	Increment version to 0.3.5.dev	2018-12-02 23:42:33 +01:00
CONTRIBUTORS.md	Optimized standard Blas Level-1,2 (excluding nrm2 functions) for z13 (double precision)	2017-09-06 16:41:08 +04:00
Changelog.txt	Update with the changes from 0.3.4	2018-12-02 23:44:13 +01:00
GotoBLAS_00License.txt	rename documents in GotoBLAS.	2011-01-24 15:57:23 +00:00
GotoBLAS_01Readme.txt	Remove all trailing whitespace except lapack-netlib	2014-06-27 12:05:18 -07:00
GotoBLAS_02QuickInstall.txt	Remove all trailing whitespace except lapack-netlib	2014-06-27 12:05:18 -07:00
GotoBLAS_03FAQ.txt	Remove all trailing whitespace except lapack-netlib	2014-06-27 12:05:18 -07:00
GotoBLAS_04FAQ.txt	rename documents in GotoBLAS.	2011-01-24 15:57:23 +00:00
GotoBLAS_05LargePage.txt	Correct typo /proc/ instead of /pros/	2015-03-20 23:25:11 +01:00
GotoBLAS_06WeirdPerformance.txt	Remove all trailing whitespace except lapack-netlib	2014-06-27 12:05:18 -07:00
LICENSE	Update organization info.	2014-11-25 15:28:58 +08:00
Makefile	Merge pull request #1799 from martin-frbg/issue1796	2018-10-09 08:20:52 +02:00
Makefile.alpha	Remove all trailing whitespace except lapack-netlib	2014-06-27 12:05:18 -07:00
Makefile.arm	arm: Determine the abi from compiler if not specified on command line	2017-06-30 18:20:59 +05:30
Makefile.arm64	Fix two mistakes on Arm64 builds	2018-12-05 18:51:38 +00:00
Makefile.generic	Respect user's LDFLAGS	2013-07-25 14:08:37 -07:00
Makefile.ia64	Remove all trailing whitespace except lapack-netlib	2014-06-27 12:05:18 -07:00
Makefile.install	Use installbsd on AIX	2018-11-01 18:26:08 +01:00
Makefile.mips	MIPS P5600(32 bit) and I6400(64 bit) cores support added.	2016-04-22 14:03:18 +05:30
Makefile.mips64	Import GotoBLAS2 1.13 BSD version codes.	2011-01-24 14:54:24 +00:00
Makefile.power	build: fix libxlmass errors building on Power CPU	2017-05-24 14:51:52 +08:00
Makefile.prebuild	Add mips32r2 api target	2018-05-02 20:17:26 +02:00
Makefile.rule	Increment version to 0.3.5.dev	2018-12-02 23:43:15 +01:00
Makefile.sparc	Remove all trailing whitespace except lapack-netlib	2014-06-27 12:05:18 -07:00
Makefile.system	Merge branch 'develop' into fbsd12	2018-12-03 12:50:14 +01:00
Makefile.tail	Remove all trailing whitespace except lapack-netlib	2014-06-27 12:05:18 -07:00
Makefile.x86	Remove all trailing whitespace except lapack-netlib	2014-06-27 12:05:18 -07:00
Makefile.x86_64	Avoid adding blanket march=skylake-avx512 to dynamic_arch builds	2018-12-11 21:09:26 +01:00
Makefile.zarch	dtrmm and dgemm for z13	2017-01-04 19:32:33 +04:00
README.md	add short blurb about avx512 and needed compiler to README	2018-08-11 17:21:46 +00:00
TargetList.txt	Simplifying ARMv8 build parameters	2018-11-19 16:41:49 +00:00
USAGE.md	Underline importance of NUM_THREADS setting for BUFFER allocation	2018-04-04 22:26:51 +02:00
appveyor.yml	Appveyor: enable building fortran with ninja	2017-12-29 19:58:35 -06:00
c_check	Check availability of immintrin.h in the AVX512 compatibility test	2018-10-04 07:35:30 +02:00
cblas.h	just make CBLAS_LAYOUT an alias of the existing CBLAS_ORDER	2018-09-06 16:54:31 +02:00
common.h	lets fit it in one 4k page	2018-11-06 17:51:24 +00:00
common_alpha.h	add fallback blas_lock implementation	2015-08-16 18:59:17 +02:00
common_arm.h	arm: Determine the abi from compiler if not specified on command line	2017-06-30 18:20:59 +05:30
common_arm64.h	build: LLVM: Add Flang compiler support and enable OpenMP for Clang	2017-05-25 17:03:20 +01:00
common_c.h	Improved Ximatcopy when lda==ldb.	2015-09-07 14:36:16 +02:00
common_d.h	Improved Ximatcopy when lda==ldb.	2015-09-07 14:36:16 +02:00
common_ia64.h	add fallback blas_lock implementation	2015-08-16 18:59:17 +02:00
common_interface.h	Add ATLAS-style ?geadd function	2015-02-16 13:46:20 +01:00
common_lapack.h	Import GotoBLAS2 1.13 BSD version codes.	2011-01-24 14:54:24 +00:00
common_level1.h	Changed _Complex types in common_level1.h to use the typedef.	2015-02-11 11:12:14 -06:00
common_level2.h	Remove all trailing whitespace except lapack-netlib	2014-06-27 12:05:18 -07:00
common_level3.h	Add a "sgemm direct" mode for small matrixes	2018-12-13 13:47:31 +00:00
common_linux.h	Init IBM z system (s390x) porting.	2016-04-15 18:02:24 -04:00
common_macro.h	ARM64: Add the VULCAN Target	2017-01-10 15:01:17 +05:30
common_mips.h	mips: remove incorrect blas_lock implementations	2017-05-05 17:28:03 +01:00
common_mips64.h	Update common_mips64.h	2018-10-09 11:20:16 +08:00
common_param.h	Correct zgeadd_k prototype	2017-11-29 19:57:35 +01:00
common_power.h	optimized dgemm for 20 threads	2016-05-16 14:14:25 +02:00
common_q.h	Import GotoBLAS2 1.13 BSD version codes.	2011-01-24 14:54:24 +00:00
common_reference.h	Update organization info.	2014-11-25 15:28:58 +08:00
common_s.h	Improved Ximatcopy when lda==ldb.	2015-09-07 14:36:16 +02:00
common_sparc.h	add fallback blas_lock implementation	2015-08-16 18:59:17 +02:00
common_stackalloc.h	Avoid declaring arrays of size 0 when making large stack allocations.	2018-06-20 17:03:18 +01:00
common_thread.h	Remove all trailing whitespace except lapack-netlib	2014-06-27 12:05:18 -07:00
common_x.h	Import GotoBLAS2 1.13 BSD version codes.	2011-01-24 14:54:24 +00:00
common_x86.h	Merge pull request #1542 from martin-frbg/quickdiv64	2018-05-02 18:11:50 +02:00
common_x86_64.h	make WMB / MB safer on x86-64	2018-06-17 18:06:24 +00:00
common_z.h	Improved Ximatcopy when lda==ldb.	2015-09-07 14:36:16 +02:00
common_zarch.h	dtrmm and dgemm for z13	2017-01-04 19:32:33 +04:00
cpuid.S	Remove all trailing whitespace except lapack-netlib	2014-06-27 12:05:18 -07:00
cpuid.h	Initial support for SkylakeX / AVX512	2018-06-03 07:58:52 +00:00
cpuid_alpha.c	Remove all trailing whitespace except lapack-netlib	2014-06-27 12:05:18 -07:00
cpuid_arm.c	Fix for issue #1024 : arm-linux-androideabi-g++ Compiler Error in /cpuid_arm.c	2016-12-02 09:28:31 -08:00
cpuid_arm64.c	Fix two mistakes on Arm64 builds	2018-12-05 18:51:38 +00:00
cpuid_ia64.c	Remove all trailing whitespace except lapack-netlib	2014-06-27 12:05:18 -07:00
cpuid_mips.c	Make cpuid_mips compile again and add 1004K cpu	2018-05-02 20:12:25 +02:00
cpuid_mips64.c	Added mips I6500 core	2017-09-22 11:57:43 +05:30
cpuid_power.c	Fix missing parameter in popen call	2018-12-06 18:33:05 +01:00
cpuid_sparc.c	Fix my copypaste blunder with get_corename	2018-02-01 22:06:04 +01:00
cpuid_x86.c	Fix detection of Ryzen2 (missing CORE_ZEN)	2018-10-28 18:36:55 +01:00
cpuid_zarch.c	detect z14 arch on s390x	2018-08-14 12:30:38 +02:00
ctest.c	Haiku supporting patches	2018-08-02 20:49:14 +02:00
ctest1.c	Import GotoBLAS2 1.13 BSD version codes.	2011-01-24 14:54:24 +00:00
ctest2.c	Import GotoBLAS2 1.13 BSD version codes.	2011-01-24 14:54:24 +00:00
f_check	Correct link flags for PGI compiler.	2018-11-21 14:24:56 +13:00
ftest.f	Remove all trailing whitespace except lapack-netlib	2014-06-27 12:05:18 -07:00
ftest2.f	Import GotoBLAS2 1.13 BSD version codes.	2011-01-24 14:54:24 +00:00
ftest3.f	Remove all trailing whitespace except lapack-netlib	2014-06-27 12:05:18 -07:00
gen_config_h.c	Add 64bit support for Microsoft Visual Studio	2017-06-21 13:38:22 -07:00
getarch.c	Simplifying ARMv8 build parameters	2018-11-19 16:41:49 +00:00
getarch_2nd.c	Delete LOCAL_BUFFER_SIZE for other architectures.	2016-04-12 11:49:28 -04:00
l1param.h	Added BULLDOZER target. So far it uses barcelona kernels.	2012-12-07 00:53:31 +08:00
l2param.h	Support AMD Piledriver by bulldozer kernels.	2013-07-06 12:06:43 -03:00
make.inc	(Plain make) build system fixes for AIX	2017-09-18 01:29:21 +02:00
openblas.pc.in	Rename blas.pc.in to openblas.pc.in	2017-02-12 14:34:03 +01:00
openblas_config_template.h	Fix complex support for MSVC headers	2017-07-28 11:50:29 +05:30
param.h	Add a "sgemm direct" mode for small matrixes	2018-12-13 13:47:31 +00:00
quickbuild.32bit	Import GotoBLAS2 1.13 BSD version codes.	2011-01-24 14:54:24 +00:00
quickbuild.64bit	Import GotoBLAS2 1.13 BSD version codes.	2011-01-24 14:54:24 +00:00
quickbuild.win32	Added the tip for Windows.	2012-08-09 20:37:55 +08:00
quickbuild.win64	Refs #63 . delete prefix for mingw64 toolchain.	2014-04-27 13:05:26 +08:00
segfaults.patch	Remove all trailing whitespace except lapack-netlib	2014-06-27 12:05:18 -07:00
symcopy.h	Changed a number of inline calls to use __inline.	2015-02-11 11:13:17 -06:00
version.h	Update organization info.	2014-11-25 15:28:58 +08:00

README.md

OpenBLAS

Introduction

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.

Please read the documentation on the OpenBLAS wiki pages: http://github.com/xianyi/OpenBLAS/wiki.

Binary Packages

We provide official binary packages for the following platform:

Windows x86/x86_64

You can download them from file hosting on sourceforge.net.

Installation from Source

Download the source from the project homepage, http://xianyi.github.com/OpenBLAS/, or check out the code using Git from https://github.com/xianyi/OpenBLAS.git.

Dependencies

Building OpenBLAS requires the following to be installed:

GNU Make
A C compiler, e.g. GCC or Clang
A Fortran compiler (optional, for LAPACK)
IBM MASS (optional, see below)

Normal compile

Simply invoking make (or gmake on BSD) will detect the CPU automatically. To set a specific target CPU, use make TARGET=xxx, e.g. make TARGET=NEHALEM. The full target list is in the file TargetList.txt.

Cross compile

Set CC and FC to point to the cross toolchains, and set HOSTCC to your host C compiler. The target must be specified explicitly when cross compiling.

Examples:

On an x86 box, compile this library for a loongson3a CPU:

make BINARY=64 CC=mips64el-unknown-linux-gnu-gcc FC=mips64el-unknown-linux-gnu-gfortran HOSTCC=gcc TARGET=LOONGSON3A

On an x86 box, compile this library for a loongson3a CPU with loongcc (based on Open64) compiler:

make CC=loongcc FC=loongf95 HOSTCC=gcc TARGET=LOONGSON3A CROSS=1 CROSS_SUFFIX=mips64el-st-linux-gnu- NO_LAPACKE=1 NO_SHARED=1 BINARY=32

Debug version

A debug version can be built using make DEBUG=1.

Compile with MASS support on Power CPU (optional)

The IBM MASS library consists of a set of mathematical functions for C, C++, and Fortran applications that are tuned for optimum performance on POWER architectures. OpenBLAS with MASS requires a 64-bit, little-endian OS on POWER. The library can be installed as shown:

On Ubuntu:

wget -q http://public.dhe.ibm.com/software/server/POWER/Linux/xl-compiler/eval/ppc64le/ubuntu/public.gpg -O- | sudo apt-key add -
echo "deb http://public.dhe.ibm.com/software/server/POWER/Linux/xl-compiler/eval/ppc64le/ubuntu/ trusty main" | sudo tee /etc/apt/sources.list.d/ibm-xl-compiler-eval.list
sudo apt-get update
sudo apt-get install libxlmass-devel.8.1.5

On RHEL/CentOS:

wget http://public.dhe.ibm.com/software/server/POWER/Linux/xl-compiler/eval/ppc64le/rhel7/repodata/repomd.xml.key
sudo rpm --import repomd.xml.key
wget http://public.dhe.ibm.com/software/server/POWER/Linux/xl-compiler/eval/ppc64le/rhel7/ibm-xl-compiler-eval.repo
sudo cp ibm-xl-compiler-eval.repo /etc/yum.repos.d/
sudo yum install libxlmass-devel.8.1.5

After installing the MASS library, compile OpenBLAS with USE_MASS=1. For example, to compile on Power8 with MASS support: make USE_MASS=1 TARGET=POWER8.

Install to a specific directory (optional)

Use PREFIX= when invoking make, for example:

make install PREFIX=your_installation_directory

The default installation directory is /opt/OpenBLAS.

Supported CPUs and Operating Systems

Please read GotoBLAS_01Readme.txt.

Additional supported CPUs

x86/x86-64

Intel Xeon 56xx (Westmere): Used GotoBLAS2 Nehalem codes.
Intel Sandy Bridge: Optimized Level-3 and Level-2 BLAS with AVX on x86-64.
Intel Haswell: Optimized Level-3 and Level-2 BLAS with AVX2 and FMA on x86-64.
Intel Skylake: Optimized Level-3 and Level-2 BLAS with AVX512 and FMA on x86-64.
AMD Bobcat: Used GotoBLAS2 Barcelona codes.
AMD Bulldozer: x86-64 ?GEMM FMA4 kernels. (Thanks to Werner Saar)
AMD PILEDRIVER: Uses Bulldozer codes with some optimizations.
AMD STEAMROLLER: Uses Bulldozer codes with some optimizations.

MIPS64

ICT Loongson 3A: Optimized Level-3 BLAS and part of Level-1,2.
ICT Loongson 3B: Experimental

ARM

ARMv6: Optimized BLAS for vfpv2 and vfpv3-d16 (e.g. BCM2835, Cortex M0+)
ARMv7: Optimized BLAS for vfpv3-d32 (e.g. Cortex A8, A9 and A15)

ARM64

ARMv8: Experimental
ARM Cortex-A57: Experimental

PPC/PPC64

POWER8: Optimized Level-3 BLAS and some Level-1, only with USE_OPENMP=1

IBM zEnterprise System

Z13: Optimized Level-3 BLAS and Level-1,2 (double precision)

Supported OS

GNU/Linux
MinGW or Visual Studio (CMake)/Windows: Please read https://github.com/xianyi/OpenBLAS/wiki/How-to-use-OpenBLAS-in-Microsoft-Visual-Studio.
Darwin/macOS: Experimental. Although GotoBLAS2 supports Darwin, we are not macOS experts.
FreeBSD: Supported by the community. We don't actively test the library on this OS.
OpenBSD: Supported by the community. We don't actively test the library on this OS.
DragonFly BSD: Supported by the community. We don't actively test the library on this OS.
Android: Supported by the community. Please read https://github.com/xianyi/OpenBLAS/wiki/How-to-build-OpenBLAS-for-Android.

Usage

Statically link with libopenblas.a or dynamically link with -lopenblas if OpenBLAS was compiled as a shared library.

Setting the number of threads using environment variables

Environment variables are used to specify a maximum number of threads. For example,

export OPENBLAS_NUM_THREADS=4
export GOTO_NUM_THREADS=4
export OMP_NUM_THREADS=4

The priorities are OPENBLAS_NUM_THREADS > GOTO_NUM_THREADS > OMP_NUM_THREADS.

If you compile this library with USE_OPENMP=1, you should set the OMP_NUM_THREADS environment variable; OpenBLAS ignores OPENBLAS_NUM_THREADS and GOTO_NUM_THREADS when compiled with USE_OPENMP=1.
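
The priority rules above can be sketched in C as follows. This is only an illustration of the documented order, not the code OpenBLAS itself uses: the helper names resolve_num_threads and num_threads_from_env are hypothetical, and treating an empty, non-numeric, or non-positive value as "unset" is an assumption of this sketch.

```c
#include <stdlib.h>

/* Parse a thread-count string. Returns 0 when the string is missing or is
 * not a positive integer (assumption: such values are treated as unset). */
static int parse_threads(const char *s)
{
    if (s == NULL || *s == '\0')
        return 0;
    char *end = NULL;
    long v = strtol(s, &end, 10);
    if (*end != '\0' || v <= 0 || v > (1L << 20))
        return 0;
    return (int)v;
}

/* Hypothetical helper, not part of the OpenBLAS API.
 * Picks the thread limit from the three variable values in the order
 * OPENBLAS_NUM_THREADS > GOTO_NUM_THREADS > OMP_NUM_THREADS.
 * When use_openmp is non-zero, only the OMP_NUM_THREADS value is consulted.
 * Returns fallback when no usable value is found. */
int resolve_num_threads(const char *openblas, const char *gotoblas,
                        const char *omp, int use_openmp, int fallback)
{
    const char *candidates[3] = { openblas, gotoblas, omp };
    int first = use_openmp ? 2 : 0;
    for (int i = first; i < 3; i++) {
        int n = parse_threads(candidates[i]);
        if (n > 0)
            return n;
    }
    return fallback;
}

/* Convenience wrapper that reads the real environment. */
int num_threads_from_env(int use_openmp, int fallback)
{
    return resolve_num_threads(getenv("OPENBLAS_NUM_THREADS"),
                               getenv("GOTO_NUM_THREADS"),
                               getenv("OMP_NUM_THREADS"),
                               use_openmp, fallback);
}
```

With the three export lines shown earlier replaced by different values, say OPENBLAS_NUM_THREADS=4 and OMP_NUM_THREADS=16, num_threads_from_env(0, 1) in this sketch would yield 4, while num_threads_from_env(1, 1), which models a USE_OPENMP=1 build, would yield 16.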

Setting the number of threads at runtime

We provide the following functions to control the number of threads at runtime:

void goto_set_num_threads(int num_threads);
void openblas_set_num_threads(int num_threads);

If you compile this library with USE_OPENMP=1, you should use the above functions too.

Reporting bugs

Please submit an issue in https://github.com/xianyi/OpenBLAS/issues.

Contact

OpenBLAS users mailing list: https://groups.google.com/forum/#!forum/openblas-users
OpenBLAS developers mailing list: https://groups.google.com/forum/#!forum/openblas-dev

Change log

Please see Changelog.txt to view the differences between OpenBLAS and GotoBLAS2 1.13 BSD version.

Troubleshooting

Please read the FAQ first.
Please use GCC version 4.6 and above to compile Sandy Bridge AVX kernels on Linux/MinGW/BSD.
Please use Clang version 3.1 and above to compile the library on Sandy Bridge microarchitecture. Clang 3.0 will generate the wrong AVX binary code.
Please use GCC version 6 or LLVM version 6 and above to compile Skylake AVX512 kernels.
The number of CPUs/cores should be less than or equal to 256. On Linux x86_64 (amd64), there is experimental support for up to 1024 CPUs/cores and 128 NUMA nodes if you build the library with BIGNUMA=1.
OpenBLAS does not set processor affinity by default. On Linux, you can enable processor affinity by commenting out the line NO_AFFINITY=1 in Makefile.rule. However, note that this may cause a conflict with R parallel.
On Loongson 3A, make test may fail with a pthread_create error (EAGAIN). However, the same test case passes when you run it directly from the shell.

Contributing

Check for open issues or open a fresh issue to start a discussion around a feature idea or a bug.
Fork the OpenBLAS repository to start making your changes.
Write a test which shows that the bug was fixed or that the feature works as expected.
Send a pull request. Make sure to add yourself to CONTRIBUTORS.md.

Donation

Please read this wiki page.