Using thread-local storage to track memory allocations means that threads no longer have to take a lock at all when allocating or freeing memory. This particularly helps the gemm driver, since it performs an allocation on every invocation. Even without threading at all this helps, because even acquiring an uncontended lock has a cost.

Before this change, with no threading:
```
----------------------------------------------------
Benchmark             Time           CPU Iterations
----------------------------------------------------
BM_SGEMM/4          102 ns        102 ns   13504412
BM_SGEMM/6          175 ns        175 ns    7997580
BM_SGEMM/8          205 ns        205 ns    6842073
BM_SGEMM/10         266 ns        266 ns    5294919
BM_SGEMM/16         478 ns        478 ns    2963441
BM_SGEMM/20         690 ns        690 ns    2144755
BM_SGEMM/32        1906 ns       1906 ns     716981
BM_SGEMM/40        2983 ns       2983 ns     473218
BM_SGEMM/64        9421 ns       9422 ns     148450
BM_SGEMM/72       12630 ns      12631 ns     112105
BM_SGEMM/80       15845 ns      15846 ns      89118
BM_SGEMM/90       25675 ns      25676 ns      54332
BM_SGEMM/100      29864 ns      29865 ns      47120
BM_SGEMM/112      37841 ns      37842 ns      36717
BM_SGEMM/128      56531 ns      56532 ns      25361
BM_SGEMM/140      75886 ns      75888 ns      18143
BM_SGEMM/150      98493 ns      98496 ns      14299
BM_SGEMM/160     102620 ns     102622 ns      13381
BM_SGEMM/170     135169 ns     135173 ns      10231
BM_SGEMM/180     146170 ns     146172 ns       9535
BM_SGEMM/189     190226 ns     190231 ns       7397
BM_SGEMM/200     194513 ns     194519 ns       7210
BM_SGEMM/256     396561 ns     396573 ns       3531
```

With this change:
```
----------------------------------------------------
Benchmark             Time           CPU Iterations
----------------------------------------------------
BM_SGEMM/4           95 ns         95 ns   14500387
BM_SGEMM/6          166 ns        166 ns    8381763
BM_SGEMM/8          196 ns        196 ns    7277044
BM_SGEMM/10         256 ns        256 ns    5515721
BM_SGEMM/16         463 ns        463 ns    3025197
BM_SGEMM/20         636 ns        636 ns    2070213
BM_SGEMM/32        1885 ns       1885 ns     739444
BM_SGEMM/40        2969 ns       2969 ns     472152
BM_SGEMM/64        9371 ns       9372 ns     148932
BM_SGEMM/72       12431 ns      12431 ns     112919
BM_SGEMM/80       15615 ns      15616 ns      89978
BM_SGEMM/90       25397 ns      25398 ns      55041
BM_SGEMM/100      29445 ns      29446 ns      47540
BM_SGEMM/112      37530 ns      37531 ns      37286
BM_SGEMM/128      55373 ns      55375 ns      25277
BM_SGEMM/140      76241 ns      76241 ns      18259
BM_SGEMM/150     102196 ns     102200 ns      13736
BM_SGEMM/160     101521 ns     101525 ns      13556
BM_SGEMM/170     136182 ns     136184 ns      10567
BM_SGEMM/180     146861 ns     146864 ns       9035
BM_SGEMM/189     192632 ns     192632 ns       7231
BM_SGEMM/200     198547 ns     198555 ns       6995
BM_SGEMM/256     392316 ns     392330 ns       3539
```

Before, when built with USE_THREAD=1 and GEMM_MULTITHREAD_THRESHOLD=4, the cost of small matrix operations was overshadowed by thread locking (look at sizes smaller than 32), even when no threads were explicitly spawned:
```
----------------------------------------------------
Benchmark             Time           CPU Iterations
----------------------------------------------------
BM_SGEMM/4          328 ns        328 ns    4170562
BM_SGEMM/6          396 ns        396 ns    3536400
BM_SGEMM/8          418 ns        418 ns    3330102
BM_SGEMM/10         491 ns        491 ns    2863047
BM_SGEMM/16         710 ns        710 ns    2028314
BM_SGEMM/20         871 ns        871 ns    1581546
BM_SGEMM/32        2132 ns       2132 ns     657089
BM_SGEMM/40        3197 ns       3196 ns     437969
BM_SGEMM/64        9645 ns       9645 ns     144987
BM_SGEMM/72       35064 ns      32881 ns      50264
BM_SGEMM/80       37661 ns      35787 ns      42080
BM_SGEMM/90       36507 ns      36077 ns      40091
BM_SGEMM/100      32513 ns      31850 ns      48607
BM_SGEMM/112      41742 ns      41207 ns      37273
BM_SGEMM/128      67211 ns      65095 ns      21933
BM_SGEMM/140      68263 ns      67943 ns      19245
BM_SGEMM/150     121854 ns     115439 ns      10660
BM_SGEMM/160     116826 ns     115539 ns      10000
BM_SGEMM/170     126566 ns     122798 ns      11960
BM_SGEMM/180     130088 ns     127292 ns      11503
BM_SGEMM/189     120309 ns     116634 ns      13162
BM_SGEMM/200     114559 ns     110993 ns      10000
BM_SGEMM/256     217063 ns     207806 ns       6417
```

And after, that overhead is gone (note that this run also includes my other change, which reduces calls to num_cpu_avail):
```
----------------------------------------------------
Benchmark             Time           CPU Iterations
----------------------------------------------------
BM_SGEMM/4           95 ns         95 ns   12347650
BM_SGEMM/6          166 ns        166 ns    8259683
BM_SGEMM/8          193 ns        193 ns    7162210
BM_SGEMM/10         258 ns        258 ns    5415657
BM_SGEMM/16         471 ns        471 ns    2981009
BM_SGEMM/20         666 ns        666 ns    2148002
BM_SGEMM/32        1903 ns       1903 ns     738245
BM_SGEMM/40        2969 ns       2969 ns     473239
BM_SGEMM/64        9440 ns       9440 ns     148442
BM_SGEMM/72       37239 ns      33330 ns      46813
BM_SGEMM/80       57350 ns      55949 ns      32251
BM_SGEMM/90       36275 ns      36249 ns      42259
BM_SGEMM/100      31111 ns      31008 ns      45270
BM_SGEMM/112      43782 ns      40912 ns      34749
BM_SGEMM/128      67375 ns      64406 ns      22443
BM_SGEMM/140      76389 ns      67003 ns      21430
BM_SGEMM/150      72952 ns      71830 ns      19793
BM_SGEMM/160      97039 ns      96858 ns      11498
BM_SGEMM/170     123272 ns     122007 ns      11855
BM_SGEMM/180     126828 ns     126505 ns      11567
BM_SGEMM/189     115179 ns     114665 ns      11044
BM_SGEMM/200      89289 ns      87259 ns      16147
BM_SGEMM/256     226252 ns     222677 ns       7375
```

I've also tested this with ThreadSanitizer and found no data races during execution. I'm not sure why size 200 is consistently faster than its neighbors; we must be hitting some optimal cache size or something.
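As a rough sketch of the idea (this is not the actual OpenBLAS allocator; the pool size, buffer size, and function names are hypothetical), a thread-local table lets each thread reuse its own buffers without ever touching a shared lock:

```c
#include <stdlib.h>

/* Hypothetical sketch: each thread keeps its own small table of reusable
 * buffers, so allocation bookkeeping never needs a global lock.  The sizes
 * and names below are illustrative only. */
#define POOL_SLOTS  4
#define BUFFER_SIZE (1 << 20)

struct slot {
  void *addr;
  int   in_use;
};

/* C11 thread-local storage; older GCC/Clang builds would use __thread. */
static _Thread_local struct slot pool[POOL_SLOTS];

static void *blas_memory_alloc_local(void) {
  for (int i = 0; i < POOL_SLOTS; i++) {
    if (pool[i].addr != NULL && !pool[i].in_use) {
      pool[i].in_use = 1;                     /* reuse a buffer this thread freed earlier */
      return pool[i].addr;
    }
  }
  for (int i = 0; i < POOL_SLOTS; i++) {
    if (pool[i].addr == NULL) {
      pool[i].addr = malloc(BUFFER_SIZE);     /* first use: allocate lazily */
      pool[i].in_use = (pool[i].addr != NULL);
      return pool[i].addr;
    }
  }
  return NULL;                                /* pool exhausted; a real allocator would fall back */
}

static void blas_memory_free_local(void *addr) {
  for (int i = 0; i < POOL_SLOTS; i++) {
    if (pool[i].addr == addr) {
      pool[i].in_use = 0;                     /* no lock needed: the table is per-thread */
      return;
    }
  }
}
```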
README.md
OpenBLAS
Introduction
OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
Please read the documentation on the OpenBLAS wiki pages: http://github.com/xianyi/OpenBLAS/wiki.
Binary Packages
We provide official binary packages for the following platform:
- Windows x86/x86_64
You can download them from file hosting on sourceforge.net.
Installation from Source
Download from project homepage, http://xianyi.github.com/OpenBLAS/, or check out the code using Git from https://github.com/xianyi/OpenBLAS.git.
Dependencies
Building OpenBLAS requires the following to be installed:
- GNU Make
- A C compiler, e.g. GCC or Clang
- A Fortran compiler (optional, for LAPACK)
- IBM MASS (optional, see below)
Normal compile
Simply invoking make (or gmake on BSD) will detect the CPU automatically.
To set a specific target CPU, use make TARGET=xxx, e.g. make TARGET=NEHALEM.
The full target list is in the file TargetList.txt.
Cross compile
Set CC and FC to point to the cross toolchains, and set HOSTCC to your host C compiler.
The target must be specified explicitly when cross compiling.
Examples:
- On an x86 box, compile this library for a loongson3a CPU:
  make BINARY=64 CC=mips64el-unknown-linux-gnu-gcc FC=mips64el-unknown-linux-gnu-gfortran HOSTCC=gcc TARGET=LOONGSON3A
- On an x86 box, compile this library for a loongson3a CPU with the loongcc (based on Open64) compiler:
  make CC=loongcc FC=loongf95 HOSTCC=gcc TARGET=LOONGSON3A CROSS=1 CROSS_SUFFIX=mips64el-st-linux-gnu- NO_LAPACKE=1 NO_SHARED=1 BINARY=32
Debug version
A debug version can be built using make DEBUG=1.
Compile with MASS support on Power CPU (optional)
The IBM MASS library consists of a set of mathematical functions for C, C++, and Fortran applications that are tuned for optimum performance on POWER architectures. OpenBLAS with MASS requires a 64-bit, little-endian OS on POWER. The library can be installed as shown:
- On Ubuntu:
  wget -q http://public.dhe.ibm.com/software/server/POWER/Linux/xl-compiler/eval/ppc64le/ubuntu/public.gpg -O- | sudo apt-key add -
  echo "deb http://public.dhe.ibm.com/software/server/POWER/Linux/xl-compiler/eval/ppc64le/ubuntu/ trusty main" | sudo tee /etc/apt/sources.list.d/ibm-xl-compiler-eval.list
  sudo apt-get update
  sudo apt-get install libxlmass-devel.8.1.5
- On RHEL/CentOS:
  wget http://public.dhe.ibm.com/software/server/POWER/Linux/xl-compiler/eval/ppc64le/rhel7/repodata/repomd.xml.key
  sudo rpm --import repomd.xml.key
  wget http://public.dhe.ibm.com/software/server/POWER/Linux/xl-compiler/eval/ppc64le/rhel7/ibm-xl-compiler-eval.repo
  sudo cp ibm-xl-compiler-eval.repo /etc/yum.repos.d/
  sudo yum install libxlmass-devel.8.1.5
After installing the MASS library, compile OpenBLAS with USE_MASS=1.
For example, to compile on Power8 with MASS support: make USE_MASS=1 TARGET=POWER8.
Install to a specific directory (optional)
Use PREFIX= when invoking make, for example
make install PREFIX=your_installation_directory
The default installation directory is /opt/OpenBLAS.
Supported CPUs and Operating Systems
Please read GotoBLAS_01Readme.txt.
Additional supported CPUs
x86/x86-64
- Intel Xeon 56xx (Westmere): Used GotoBLAS2 Nehalem codes.
- Intel Sandy Bridge: Optimized Level-3 and Level-2 BLAS with AVX on x86-64.
- Intel Haswell: Optimized Level-3 and Level-2 BLAS with AVX2 and FMA on x86-64.
- AMD Bobcat: Used GotoBLAS2 Barcelona codes.
- AMD Bulldozer: x86-64 ?GEMM FMA4 kernels. (Thanks to Werner Saar)
- AMD PILEDRIVER: Uses Bulldozer codes with some optimizations.
- AMD STEAMROLLER: Uses Bulldozer codes with some optimizations.
MIPS64
- ICT Loongson 3A: Optimized Level-3 BLAS and the part of Level-1,2.
- ICT Loongson 3B: Experimental
ARM
- ARMv6: Optimized BLAS for vfpv2 and vfpv3-d16 (e.g. BCM2835, Cortex M0+)
- ARMv7: Optimized BLAS for vfpv3-d32 (e.g. Cortex A8, A9 and A15)
ARM64
- ARMv8: Experimental
- ARM Cortex-A57: Experimental
PPC/PPC64
- POWER8: Optimized Level-3 BLAS and some Level-1, only with USE_OPENMP=1
IBM zEnterprise System
- Z13: Optimized Level-3 BLAS and Level-1,2 (double precision)
Supported OS
- GNU/Linux
- MinGW or Visual Studio (CMake)/Windows: Please read https://github.com/xianyi/OpenBLAS/wiki/How-to-use-OpenBLAS-in-Microsoft-Visual-Studio.
- Darwin/macOS: Experimental. Although GotoBLAS2 supports Darwin, we are not macOS experts.
- FreeBSD: Supported by the community. We don't actively test the library on this OS.
- OpenBSD: Supported by the community. We don't actively test the library on this OS.
- DragonFly BSD: Supported by the community. We don't actively test the library on this OS.
- Android: Supported by the community. Please read https://github.com/xianyi/OpenBLAS/wiki/How-to-build-OpenBLAS-for-Android.
Usage
Statically link with libopenblas.a or dynamically link with -lopenblas if OpenBLAS was
compiled as a shared library.
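For example, a minimal program calling SGEMM through the CBLAS interface might look like the following (the matrices and the link command in the comment are illustrative, not part of OpenBLAS itself):

```c
/* example.c - minimal OpenBLAS usage sketch; build with something like
 *   gcc example.c -lopenblas -o example
 * (the exact link line depends on where the library is installed). */
#include <stdio.h>
#include <cblas.h>

int main(void) {
  /* Compute C = A * B for small 2x2 row-major matrices. */
  float A[4] = {1, 2, 3, 4};
  float B[4] = {5, 6, 7, 8};
  float C[4] = {0, 0, 0, 0};

  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              2, 2, 2,        /* M, N, K */
              1.0f, A, 2,     /* alpha, A, lda */
              B, 2,           /* B, ldb */
              0.0f, C, 2);    /* beta, C, ldc */

  printf("%f %f\n%f %f\n", C[0], C[1], C[2], C[3]);
  return 0;
}
```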
Setting the number of threads using environment variables
Environment variables are used to specify a maximum number of threads. For example,
export OPENBLAS_NUM_THREADS=4
export GOTO_NUM_THREADS=4
export OMP_NUM_THREADS=4
The priorities are OPENBLAS_NUM_THREADS > GOTO_NUM_THREADS > OMP_NUM_THREADS.
If you compile this library with USE_OPENMP=1, you should set the OMP_NUM_THREADS
environment variable; OpenBLAS ignores OPENBLAS_NUM_THREADS and GOTO_NUM_THREADS when
compiled with USE_OPENMP=1.
Setting the number of threads at runtime
We provide the following functions to control the number of threads at runtime:
void goto_set_num_threads(int num_threads);
void openblas_set_num_threads(int num_threads);
If you compile this library with USE_OPENMP=1, you should use the above functions too.
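For example, a caller could drop to a single thread before operating on small matrices (a sketch; the helper function below is hypothetical):

```c
#include <cblas.h>   /* OpenBLAS's cblas.h also declares openblas_set_num_threads */

/* Hypothetical helper: small problems gain little from extra threads,
 * so restrict OpenBLAS to one thread before the call. */
void small_multiply(const float *A, const float *B, float *C, int n) {
  openblas_set_num_threads(1);
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              n, n, n,
              1.0f, A, n,
              B, n,
              0.0f, C, n);
}
```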
Reporting bugs
Please submit an issue in https://github.com/xianyi/OpenBLAS/issues.
Contact
- OpenBLAS users mailing list: https://groups.google.com/forum/#!forum/openblas-users
- OpenBLAS developers mailing list: https://groups.google.com/forum/#!forum/openblas-dev
Change log
Please see Changelog.txt to view the differences between OpenBLAS and GotoBLAS2 1.13 BSD version.
Troubleshooting
- Please read the FAQ first.
- Please use GCC version 4.6 and above to compile Sandy Bridge AVX kernels on Linux/MinGW/BSD.
- Please use Clang version 3.1 and above to compile the library on Sandy Bridge microarchitecture. Clang 3.0 will generate the wrong AVX binary code.
- The number of CPUs/cores should be less than or equal to 256. On Linux x86_64 (amd64), there is experimental support for up to 1024 CPUs/cores and 128 NUMA nodes if you build the library with BIGNUMA=1.
- OpenBLAS does not set processor affinity by default. On Linux, you can enable processor affinity by commenting out the line NO_AFFINITY=1 in Makefile.rule. However, note that this may cause a conflict with R parallel.
- On Loongson 3A, make test may fail with a pthread_create error (EAGAIN). However, it will be okay when you run the same test case in the shell.
Contributing
- Check for open issues or open a fresh issue to start a discussion around a feature idea or a bug.
- Fork the OpenBLAS repository to start making your changes.
- Write a test which shows that the bug was fixed or that the feature works as expected.
- Send a pull request. Make sure to add yourself to CONTRIBUTORS.md.
Donation
Please read this wiki page.