Craig Donner bf40f806ef Remove the need for most locking in memory.c.
Using thread local storage for tracking memory allocations means that threads
no longer have to lock at all when doing memory allocations / frees. This
particularly helps the gemm driver since it does an allocation per invocation.
Even without threading at all, this helps, since even calling a lock with
no contention has a cost:

Before this change, no threading:
```
----------------------------------------------------
Benchmark             Time           CPU Iterations
----------------------------------------------------
BM_SGEMM/4          102 ns        102 ns   13504412
BM_SGEMM/6          175 ns        175 ns    7997580
BM_SGEMM/8          205 ns        205 ns    6842073
BM_SGEMM/10         266 ns        266 ns    5294919
BM_SGEMM/16         478 ns        478 ns    2963441
BM_SGEMM/20         690 ns        690 ns    2144755
BM_SGEMM/32        1906 ns       1906 ns     716981
BM_SGEMM/40        2983 ns       2983 ns     473218
BM_SGEMM/64        9421 ns       9422 ns     148450
BM_SGEMM/72       12630 ns      12631 ns     112105
BM_SGEMM/80       15845 ns      15846 ns      89118
BM_SGEMM/90       25675 ns      25676 ns      54332
BM_SGEMM/100      29864 ns      29865 ns      47120
BM_SGEMM/112      37841 ns      37842 ns      36717
BM_SGEMM/128      56531 ns      56532 ns      25361
BM_SGEMM/140      75886 ns      75888 ns      18143
BM_SGEMM/150      98493 ns      98496 ns      14299
BM_SGEMM/160     102620 ns     102622 ns      13381
BM_SGEMM/170     135169 ns     135173 ns      10231
BM_SGEMM/180     146170 ns     146172 ns       9535
BM_SGEMM/189     190226 ns     190231 ns       7397
BM_SGEMM/200     194513 ns     194519 ns       7210
BM_SGEMM/256     396561 ns     396573 ns       3531
```
with this change:
```
----------------------------------------------------
Benchmark             Time           CPU Iterations
----------------------------------------------------
BM_SGEMM/4           95 ns         95 ns   14500387
BM_SGEMM/6          166 ns        166 ns    8381763
BM_SGEMM/8          196 ns        196 ns    7277044
BM_SGEMM/10         256 ns        256 ns    5515721
BM_SGEMM/16         463 ns        463 ns    3025197
BM_SGEMM/20         636 ns        636 ns    2070213
BM_SGEMM/32        1885 ns       1885 ns     739444
BM_SGEMM/40        2969 ns       2969 ns     472152
BM_SGEMM/64        9371 ns       9372 ns     148932
BM_SGEMM/72       12431 ns      12431 ns     112919
BM_SGEMM/80       15615 ns      15616 ns      89978
BM_SGEMM/90       25397 ns      25398 ns      55041
BM_SGEMM/100      29445 ns      29446 ns      47540
BM_SGEMM/112      37530 ns      37531 ns      37286
BM_SGEMM/128      55373 ns      55375 ns      25277
BM_SGEMM/140      76241 ns      76241 ns      18259
BM_SGEMM/150     102196 ns     102200 ns      13736
BM_SGEMM/160     101521 ns     101525 ns      13556
BM_SGEMM/170     136182 ns     136184 ns      10567
BM_SGEMM/180     146861 ns     146864 ns       9035
BM_SGEMM/189     192632 ns     192632 ns       7231
BM_SGEMM/200     198547 ns     198555 ns       6995
BM_SGEMM/256     392316 ns     392330 ns       3539
```

Before, when built with USE_THREAD=1 and GEMM_MULTITHREAD_THRESHOLD=4, the cost
of small matrix operations was dominated by thread locking (see sizes smaller
than 32), even when not explicitly spawning threads:
```
----------------------------------------------------
Benchmark             Time           CPU Iterations
----------------------------------------------------
BM_SGEMM/4          328 ns        328 ns    4170562
BM_SGEMM/6          396 ns        396 ns    3536400
BM_SGEMM/8          418 ns        418 ns    3330102
BM_SGEMM/10         491 ns        491 ns    2863047
BM_SGEMM/16         710 ns        710 ns    2028314
BM_SGEMM/20         871 ns        871 ns    1581546
BM_SGEMM/32        2132 ns       2132 ns     657089
BM_SGEMM/40        3197 ns       3196 ns     437969
BM_SGEMM/64        9645 ns       9645 ns     144987
BM_SGEMM/72       35064 ns      32881 ns      50264
BM_SGEMM/80       37661 ns      35787 ns      42080
BM_SGEMM/90       36507 ns      36077 ns      40091
BM_SGEMM/100      32513 ns      31850 ns      48607
BM_SGEMM/112      41742 ns      41207 ns      37273
BM_SGEMM/128      67211 ns      65095 ns      21933
BM_SGEMM/140      68263 ns      67943 ns      19245
BM_SGEMM/150     121854 ns     115439 ns      10660
BM_SGEMM/160     116826 ns     115539 ns      10000
BM_SGEMM/170     126566 ns     122798 ns      11960
BM_SGEMM/180     130088 ns     127292 ns      11503
BM_SGEMM/189     120309 ns     116634 ns      13162
BM_SGEMM/200     114559 ns     110993 ns      10000
BM_SGEMM/256     217063 ns     207806 ns       6417
```
and after, it's gone (note this includes my other change which reduces calls
to num_cpu_avail):
```
----------------------------------------------------
Benchmark             Time           CPU Iterations
----------------------------------------------------
BM_SGEMM/4           95 ns         95 ns   12347650
BM_SGEMM/6          166 ns        166 ns    8259683
BM_SGEMM/8          193 ns        193 ns    7162210
BM_SGEMM/10         258 ns        258 ns    5415657
BM_SGEMM/16         471 ns        471 ns    2981009
BM_SGEMM/20         666 ns        666 ns    2148002
BM_SGEMM/32        1903 ns       1903 ns     738245
BM_SGEMM/40        2969 ns       2969 ns     473239
BM_SGEMM/64        9440 ns       9440 ns     148442
BM_SGEMM/72       37239 ns      33330 ns      46813
BM_SGEMM/80       57350 ns      55949 ns      32251
BM_SGEMM/90       36275 ns      36249 ns      42259
BM_SGEMM/100      31111 ns      31008 ns      45270
BM_SGEMM/112      43782 ns      40912 ns      34749
BM_SGEMM/128      67375 ns      64406 ns      22443
BM_SGEMM/140      76389 ns      67003 ns      21430
BM_SGEMM/150      72952 ns      71830 ns      19793
BM_SGEMM/160      97039 ns      96858 ns      11498
BM_SGEMM/170     123272 ns     122007 ns      11855
BM_SGEMM/180     126828 ns     126505 ns      11567
BM_SGEMM/189     115179 ns     114665 ns      11044
BM_SGEMM/200      89289 ns      87259 ns      16147
BM_SGEMM/256     226252 ns     222677 ns       7375
```

I've also tested this with ThreadSanitizer and found no data races during
execution.  I'm not sure why 200 is always faster than its neighbors; we must
be hitting some optimal cache size or something.
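
As a rough illustration of the technique described above (this is not the code from memory.c), per-thread bookkeeping can be sketched with C11 `_Thread_local`. The names `MAX_SLOTS`, `tls_buffer_acquire` and `tls_buffer_release` are hypothetical and chosen only for this example. Because every thread owns its own slot table, neither function takes a lock.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_SLOTS 8 /* hypothetical per-thread limit, not an OpenBLAS constant */

/* Each thread gets its own zero-initialized copy of this table,
 * so reading and updating it needs no mutex. */
static _Thread_local struct {
    void  *ptr;    /* cached buffer, NULL until first use */
    size_t size;   /* capacity of the cached buffer in bytes */
    int    in_use; /* 1 while the buffer is handed out */
} slots[MAX_SLOTS];

/* Hand out a buffer of at least `size` bytes from the calling thread's
 * table. Returns NULL if size is 0, all slots are busy, or allocation fails. */
void *tls_buffer_acquire(size_t size) {
    if (size == 0) return NULL;
    for (int i = 0; i < MAX_SLOTS; i++) {
        if (slots[i].in_use) continue;
        if (slots[i].size < size) {
            /* Grow (or create) the cached buffer for this slot. */
            void *p = realloc(slots[i].ptr, size);
            if (p == NULL) return NULL;
            slots[i].ptr = p;
            slots[i].size = size;
        }
        slots[i].in_use = 1;
        return slots[i].ptr;
    }
    return NULL;
}

/* Mark a buffer as free again; the memory stays cached for later reuse
 * by the same thread. Unknown pointers are ignored. */
void tls_buffer_release(void *ptr) {
    for (int i = 0; i < MAX_SLOTS; i++) {
        if (slots[i].in_use && slots[i].ptr == ptr) {
            slots[i].in_use = 0;
            return;
        }
    }
}
```

A real implementation also has to free the cached buffers when a thread exits, which this sketch leaves out.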
2018-06-14 16:54:58 +01:00
benchmark Correct index variables used in MFlops calculation 2018-03-27 21:52:29 +02:00
cmake Merge pull request #1607 from martin-frbg/dynarch 2018-06-14 16:52:55 +02:00
ctest Fix compiler warnings in ctest 2017-12-03 18:19:30 +01:00
driver Remove the need for most locking in memory.c. 2018-06-14 16:54:58 +01:00
exports Add -lm for Android. 2018-05-24 21:02:42 +08:00
interface Fixed a few more unnecessary calls to num_cpu_avail. 2018-06-11 10:17:16 +01:00
kernel Merge pull request #1612 from oon3m0oo/cpus 2018-06-14 16:51:31 +02:00
lapack Change _STDC_VERSION__ to __STDC_VERSION__ 2018-05-11 12:15:08 +08:00
lapack-netlib Merge pull request #1585 from martin-frbg/lapack-253 2018-06-01 18:59:33 +02:00
reference Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
relapack Add cmake build list file for ReLAPACK 2017-10-12 17:00:00 +02:00
test Update single and double precision BLAS1 tests from LAPACK 3.8.0 2018-02-18 12:44:14 +01:00
utest Merge pull request #1532 from martin-frbg/utest-cblas 2018-04-20 23:44:15 +02:00
.gitignore Don't change timestamps 2017-08-01 13:43:59 +05:30
.travis.yml Add a BINARY=32 build to macOS 2018-04-07 12:29:57 -07:00
BACKERS.md Added backers. 2013-09-05 15:39:45 +08:00
CMakeLists.txt include CMakePackageConfigHelpers 2018-06-10 15:09:43 +02:00
CONTRIBUTORS.md Optimized standard Blas Level-1,2 (excluding nrm2 functions) for z13 (double precision) 2017-09-06 16:41:08 +04:00
Changelog.txt Update doc for 0.2.20 version. 2017-07-24 11:55:10 +08:00
GotoBLAS_00License.txt rename documents in GotoBLAS. 2011-01-24 15:57:23 +00:00
GotoBLAS_01Readme.txt Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
GotoBLAS_02QuickInstall.txt Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
GotoBLAS_03FAQ.txt Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
GotoBLAS_04FAQ.txt rename documents in GotoBLAS. 2011-01-24 15:57:23 +00:00
GotoBLAS_05LargePage.txt Correct typo /proc/ instead of /pros/ 2015-03-20 23:25:11 +01:00
GotoBLAS_06WeirdPerformance.txt Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
LICENSE Update organization info. 2014-11-25 15:28:58 +08:00
Makefile Move some DYNAMIC_ARCH targets to new DYNAMIC_OLDER option 2018-06-09 16:29:17 +02:00
Makefile.alpha Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
Makefile.arm arm: Determine the abi from compiler if not specified on command line 2017-06-30 18:20:59 +05:30
Makefile.arm64 arm64: Change mtune/mcpu options for THUNDERX2T99 target 2017-07-01 11:17:10 -07:00
Makefile.generic Respect user's LDFLAGS 2013-07-25 14:08:37 -07:00
Makefile.ia64 Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
Makefile.install Move some DYNAMIC_ARCH targets to new DYNAMIC_OLDER option 2018-06-09 16:29:17 +02:00
Makefile.mips MIPS P5600(32 bit) and I6400(64 bit) cores support added. 2016-04-22 14:03:18 +05:30
Makefile.mips64 Import GotoBLAS2 1.13 BSD version codes. 2011-01-24 14:54:24 +00:00
Makefile.power build: fix libxlmass errors building on Power CPU 2017-05-24 14:51:52 +08:00
Makefile.prebuild Add mips32r2 api target 2018-05-02 20:17:26 +02:00
Makefile.rule Move some DYNAMIC_ARCH targets to new DYNAMIC_OLDER option 2018-06-09 16:29:17 +02:00
Makefile.sparc Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
Makefile.system Update OSX deployment target to 10.8 2018-06-14 16:57:58 +02:00
Makefile.tail Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
Makefile.x86 Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
Makefile.x86_64 Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
Makefile.zarch dtrmm and dgemm for z13 2017-01-04 19:32:33 +04:00
README.md Minor changes to wording and formatting in the README 2018-04-04 14:30:32 -07:00
TargetList.txt Initial support for SkylakeX / AVX512 2018-06-03 07:58:52 +00:00
USAGE.md Underline importance of NUM_THREADS setting for BUFFER allocation 2018-04-04 22:26:51 +02:00
appveyor.yml Appveyor: enable building fortran with ninja 2017-12-29 19:58:35 -06:00
c_check Improve AVX512 testcase 2018-06-06 16:49:00 +02:00
cblas.h Fix declaration of cblas_Xdotc_sub and cblas_Xdotu_sub 2017-11-18 18:56:30 +01:00
common.h Revert "Use usleep instead of sched_yield by default" 2018-06-11 17:05:27 +02:00
common_alpha.h add fallback blas_lock implementation 2015-08-16 18:59:17 +02:00
common_arm.h arm: Determine the abi from compiler if not specified on command line 2017-06-30 18:20:59 +05:30
common_arm64.h build: LLVM: Add Flang compiler support and enable OpenMP for Clang 2017-05-25 17:03:20 +01:00
common_c.h Improved Ximatcopy when lda==ldb. 2015-09-07 14:36:16 +02:00
common_d.h Improved Ximatcopy when lda==ldb. 2015-09-07 14:36:16 +02:00
common_ia64.h add fallback blas_lock implementation 2015-08-16 18:59:17 +02:00
common_interface.h Add ATLAS-style ?geadd function 2015-02-16 13:46:20 +01:00
common_lapack.h Import GotoBLAS2 1.13 BSD version codes. 2011-01-24 14:54:24 +00:00
common_level1.h Changed _Complex types in common_level1.h to use the typedef. 2015-02-11 11:12:14 -06:00
common_level2.h Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
common_level3.h Improved Ximatcopy when lda==ldb. 2015-09-07 14:36:16 +02:00
common_linux.h Init IBM z system (s390x) porting. 2016-04-15 18:02:24 -04:00
common_macro.h ARM64: Add the VULCAN Target 2017-01-10 15:01:17 +05:30
common_mips.h mips: remove incorrect blas_lock implementations 2017-05-05 17:28:03 +01:00
common_mips64.h mips: remove incorrect blas_lock implementations 2017-05-05 17:28:03 +01:00
common_param.h Correct zgeadd_k prototype 2017-11-29 19:57:35 +01:00
common_power.h optimized dgemm for 20 threads 2016-05-16 14:14:25 +02:00
common_q.h Import GotoBLAS2 1.13 BSD version codes. 2011-01-24 14:54:24 +00:00
common_reference.h Update organization info. 2014-11-25 15:28:58 +08:00
common_s.h Improved Ximatcopy when lda==ldb. 2015-09-07 14:36:16 +02:00
common_sparc.h add fallback blas_lock implementation 2015-08-16 18:59:17 +02:00
common_stackalloc.h Refs #727. Align stack buffer address on 32-bytes. 2016-02-11 03:52:02 +08:00
common_thread.h Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
common_x.h Import GotoBLAS2 1.13 BSD version codes. 2011-01-24 14:54:24 +00:00
common_x86.h Merge pull request #1542 from martin-frbg/quickdiv64 2018-05-02 18:11:50 +02:00
common_x86_64.h Merge pull request #1542 from martin-frbg/quickdiv64 2018-05-02 18:11:50 +02:00
common_z.h Improved Ximatcopy when lda==ldb. 2015-09-07 14:36:16 +02:00
common_zarch.h dtrmm and dgemm for z13 2017-01-04 19:32:33 +04:00
cpuid.S Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
cpuid.h Initial support for SkylakeX / AVX512 2018-06-03 07:58:52 +00:00
cpuid_alpha.c Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
cpuid_arm.c Fix for issue #1024: arm-linux-androideabi-g++ Compiler Error in /cpuid_arm.c 2016-12-02 09:28:31 -08:00
cpuid_arm64.c ARM64: Enable Auto Detection of ThunderX2T99 2018-04-19 09:05:25 +00:00
cpuid_ia64.c Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
cpuid_mips.c Make cpuid_mips compile again and add 1004K cpu 2018-05-02 20:12:25 +02:00
cpuid_mips64.c Added mips I6500 core 2017-09-22 11:57:43 +05:30
cpuid_power.c added dgemm-, dtrmm-, zgemm- and ztrmm-kernel for power8 2016-03-01 07:33:56 +01:00
cpuid_sparc.c Fix my copypaste blunder with get_corename 2018-02-01 22:06:04 +01:00
cpuid_x86.c Update cpuid_x86.c 2018-06-04 17:10:19 +02:00
cpuid_zarch.c detect CPU on zArch 2017-04-20 21:13:41 +02:00
ctest.c Add support for DragonFly BSD 2018-04-03 16:39:29 -07:00
ctest1.c Import GotoBLAS2 1.13 BSD version codes. 2011-01-24 14:54:24 +00:00
ctest2.c Import GotoBLAS2 1.13 BSD version codes. 2011-01-24 14:54:24 +00:00
f_check Fixes for ifort 2018 2018-05-08 21:55:37 +02:00
ftest.f Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
ftest2.f Import GotoBLAS2 1.13 BSD version codes. 2011-01-24 14:54:24 +00:00
ftest3.f Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
gen_config_h.c Add 64bit support for Microsoft Visual Studio 2017-06-21 13:38:22 -07:00
getarch.c Enable parallel make on MS Windows by default 2018-06-09 17:54:36 +02:00
getarch_2nd.c Delete LOCAL_BUFFER_SIZE for other architectures. 2016-04-12 11:49:28 -04:00
l1param.h Added BULLDOZER target. So far it uses barcelona kernels. 2012-12-07 00:53:31 +08:00
l2param.h Support AMD Piledriver by bulldozer kernels. 2013-07-06 12:06:43 -03:00
make.inc (Plain make) build system fixes for AIX 2017-09-18 01:29:21 +02:00
openblas.pc.in Rename blas.pc.in to openblas.pc.in 2017-02-12 14:34:03 +01:00
openblas_config_template.h Fix complex support for MSVC headers 2017-07-28 11:50:29 +05:30
param.h Initial support for SkylakeX / AVX512 2018-06-03 07:58:52 +00:00
quickbuild.32bit Import GotoBLAS2 1.13 BSD version codes. 2011-01-24 14:54:24 +00:00
quickbuild.64bit Import GotoBLAS2 1.13 BSD version codes. 2011-01-24 14:54:24 +00:00
quickbuild.win32 Added the tip for Windows. 2012-08-09 20:37:55 +08:00
quickbuild.win64 Refs #63. delete prefix for mingw64 toolchain. 2014-04-27 13:05:26 +08:00
segfaults.patch Remove all trailing whitespace except lapack-netlib 2014-06-27 12:05:18 -07:00
symcopy.h Changed a number of inline calls to use __inline. 2015-02-11 11:13:17 -06:00
version.h Update organization info. 2014-11-25 15:28:58 +08:00

README.md

OpenBLAS

Join the chat at https://gitter.im/xianyi/OpenBLAS

Introduction

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.

Please read the documentation on the OpenBLAS wiki pages: http://github.com/xianyi/OpenBLAS/wiki.

Binary Packages

We provide official binary packages for the following platform:

  • Windows x86/x86_64

You can download them from file hosting on sourceforge.net.

Installation from Source

Download from the project homepage, http://xianyi.github.com/OpenBLAS/, or check out the code using Git from https://github.com/xianyi/OpenBLAS.git.

Dependencies

Building OpenBLAS requires the following to be installed:

  • GNU Make
  • A C compiler, e.g. GCC or Clang
  • A Fortran compiler (optional, for LAPACK)
  • IBM MASS (optional, see below)

Normal compile

Simply invoking make (or gmake on BSD) will detect the CPU automatically. To set a specific target CPU, use make TARGET=xxx, e.g. make TARGET=NEHALEM. The full target list is in the file TargetList.txt.

Cross compile

Set CC and FC to point to the cross toolchains, and set HOSTCC to your host C compiler. The target must be specified explicitly when cross compiling.

Examples:

  • On an x86 box, compile this library for a loongson3a CPU:

    make BINARY=64 CC=mips64el-unknown-linux-gnu-gcc FC=mips64el-unknown-linux-gnu-gfortran HOSTCC=gcc TARGET=LOONGSON3A
    
  • On an x86 box, compile this library for a loongson3a CPU with loongcc (based on Open64) compiler:

    make CC=loongcc FC=loongf95 HOSTCC=gcc TARGET=LOONGSON3A CROSS=1 CROSS_SUFFIX=mips64el-st-linux-gnu- NO_LAPACKE=1 NO_SHARED=1 BINARY=32
    

Debug version

A debug version can be built using make DEBUG=1.

Compile with MASS support on Power CPU (optional)

The IBM MASS library consists of a set of mathematical functions for C, C++, and Fortran applications that are tuned for optimum performance on POWER architectures. OpenBLAS with MASS requires a 64-bit, little-endian OS on POWER. The library can be installed as shown:

  • On Ubuntu:

    wget -q http://public.dhe.ibm.com/software/server/POWER/Linux/xl-compiler/eval/ppc64le/ubuntu/public.gpg -O- | sudo apt-key add -
    echo "deb http://public.dhe.ibm.com/software/server/POWER/Linux/xl-compiler/eval/ppc64le/ubuntu/ trusty main" | sudo tee /etc/apt/sources.list.d/ibm-xl-compiler-eval.list
    sudo apt-get update
    sudo apt-get install libxlmass-devel.8.1.5
    
  • On RHEL/CentOS:

    wget http://public.dhe.ibm.com/software/server/POWER/Linux/xl-compiler/eval/ppc64le/rhel7/repodata/repomd.xml.key
    sudo rpm --import repomd.xml.key
    wget http://public.dhe.ibm.com/software/server/POWER/Linux/xl-compiler/eval/ppc64le/rhel7/ibm-xl-compiler-eval.repo
    sudo cp ibm-xl-compiler-eval.repo /etc/yum.repos.d/
    sudo yum install libxlmass-devel.8.1.5
    

After installing the MASS library, compile OpenBLAS with USE_MASS=1. For example, to compile on Power8 with MASS support: make USE_MASS=1 TARGET=POWER8.

Install to a specific directory (optional)

Use PREFIX= when invoking make, for example:

make install PREFIX=your_installation_directory

The default installation directory is /opt/OpenBLAS.

Supported CPUs and Operating Systems

Please read GotoBLAS_01Readme.txt.

Additional supported CPUs

x86/x86-64

  • Intel Xeon 56xx (Westmere): Used GotoBLAS2 Nehalem codes.
  • Intel Sandy Bridge: Optimized Level-3 and Level-2 BLAS with AVX on x86-64.
  • Intel Haswell: Optimized Level-3 and Level-2 BLAS with AVX2 and FMA on x86-64.
  • AMD Bobcat: Used GotoBLAS2 Barcelona codes.
  • AMD Bulldozer: x86-64 ?GEMM FMA4 kernels. (Thanks to Werner Saar)
  • AMD PILEDRIVER: Uses Bulldozer codes with some optimizations.
  • AMD STEAMROLLER: Uses Bulldozer codes with some optimizations.

MIPS64

  • ICT Loongson 3A: Optimized Level-3 BLAS and part of Level-1,2.
  • ICT Loongson 3B: Experimental

ARM

  • ARMv6: Optimized BLAS for vfpv2 and vfpv3-d16 (e.g. BCM2835, Cortex M0+)
  • ARMv7: Optimized BLAS for vfpv3-d32 (e.g. Cortex A8, A9 and A15)

ARM64

  • ARMv8: Experimental
  • ARM Cortex-A57: Experimental

PPC/PPC64

  • POWER8: Optimized Level-3 BLAS and some Level-1, only with USE_OPENMP=1

IBM zEnterprise System

  • Z13: Optimized Level-3 BLAS and Level-1,2 (double precision)

Supported OS

Usage

Statically link with libopenblas.a or dynamically link with -lopenblas if OpenBLAS was compiled as a shared library.

Setting the number of threads using environment variables

Environment variables are used to specify a maximum number of threads. For example,

export OPENBLAS_NUM_THREADS=4
export GOTO_NUM_THREADS=4
export OMP_NUM_THREADS=4

The priorities are OPENBLAS_NUM_THREADS > GOTO_NUM_THREADS > OMP_NUM_THREADS.

If you compile this library with USE_OPENMP=1, you should set the OMP_NUM_THREADS environment variable; OpenBLAS ignores OPENBLAS_NUM_THREADS and GOTO_NUM_THREADS when compiled with USE_OPENMP=1.
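
The precedence rules above can be illustrated with a small sketch. This is not OpenBLAS's own code: the function `resolve_max_threads` and its `built_with_openmp` flag are hypothetical, and falling through to the next variable when a value is unset or not a positive integer is an assumption made for this example.

```c
#include <stdlib.h>

/* Parse a positive integer from a string; returns 0 if the string is
 * NULL, empty, not a clean decimal number, or out of range. */
static int parse_positive_int(const char *value) {
    if (value == NULL || *value == '\0') return 0;
    char *end = NULL;
    long n = strtol(value, &end, 10);
    if (*end != '\0' || n <= 0 || n > 1000000) return 0;
    return (int)n;
}

/* Pick the maximum thread count from the three variable values.
 * Priority: OPENBLAS_NUM_THREADS > GOTO_NUM_THREADS > OMP_NUM_THREADS.
 * A nonzero built_with_openmp models a USE_OPENMP=1 build, where only
 * OMP_NUM_THREADS is consulted. Returns 0 if nothing usable is set. */
int resolve_max_threads(const char *openblas_value, const char *goto_value,
                        const char *omp_value, int built_with_openmp) {
    if (built_with_openmp) return parse_positive_int(omp_value);

    const char *const by_priority[] = { openblas_value, goto_value, omp_value };
    for (size_t i = 0; i < sizeof by_priority / sizeof by_priority[0]; i++) {
        int n = parse_positive_int(by_priority[i]);
        if (n > 0) return n;
    }
    return 0;
}
```

A caller would pass the environment values directly, for example `resolve_max_threads(getenv("OPENBLAS_NUM_THREADS"), getenv("GOTO_NUM_THREADS"), getenv("OMP_NUM_THREADS"), 0)`. Taking strings as parameters rather than calling `getenv` internally keeps the precedence logic easy to test.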

Setting the number of threads at runtime

We provide the following functions to control the number of threads at runtime:

void goto_set_num_threads(int num_threads);
void openblas_set_num_threads(int num_threads);

If you compile this library with USE_OPENMP=1, you should use the above functions too.

Reporting bugs

Please submit an issue at https://github.com/xianyi/OpenBLAS/issues.

Contact

Change log

Please see Changelog.txt to view the differences between OpenBLAS and GotoBLAS2 1.13 BSD version.

Troubleshooting

  • Please read the FAQ first.
  • Please use GCC version 4.6 and above to compile Sandy Bridge AVX kernels on Linux/MinGW/BSD.
  • Please use Clang version 3.1 and above to compile the library on Sandy Bridge microarchitecture. Clang 3.0 will generate the wrong AVX binary code.
  • The number of CPUs/cores should be less than or equal to 256. On Linux x86_64 (amd64), there is experimental support for up to 1024 CPUs/cores and 128 NUMA nodes if you build the library with BIGNUMA=1.
  • OpenBLAS does not set processor affinity by default. On Linux, you can enable processor affinity by commenting out the line NO_AFFINITY=1 in Makefile.rule. However, note that this may cause a conflict with R parallel.
  • On Loongson 3A, make test may fail with a pthread_create error (EAGAIN). However, the same test case passes when you run it directly from the shell.

Contributing

  1. Check for open issues or open a fresh issue to start a discussion around a feature idea or a bug.
  2. Fork the OpenBLAS repository to start making your changes.
  3. Write a test which shows that the bug was fixed or that the feature works as expected.
  4. Send a pull request. Make sure to add yourself to CONTRIBUTORS.md.

Donation

Please read this wiki page.