OpenBLAS/docs/extensions.md

* BLAS-like extensions

| Routine       | Data Types    | Description     |
| ------------- |:------------- | :---------------|
| ?axpby        | s,d,c,z       | like axpy with a multiplier for y    |
| ?gemm3m       | c,z           | gemm3m            |
| ?imatcopy     | s,d,c,z       | in-place transpositon/copying   |
| ?omatcopy     | s,d,c,z       | out-of-place transpositon/copying    |
| ?geadd        | s,d,c,z       | matrix add   |
| ?gemmt        | s,d,c,z       | gemm but only a triangular part updated|

* BLAS-like and Conversion functions for bfloat16 (available when OpenBLAS was compiled with BUILD_BFLOAT16=1)
  * `void cblas_sbstobf16` converts a float array to an array of bfloat16 values by rounding
  * `void cblas_sbdtobf16` converts a double array to an array of bfloat16 values by rounding
  * `void cblas_sbf16tos` converts a bfloat16 array to an array of floats
  * `void cblas_dbf16tod` converts a bfloat16 array to an array of doubles
  * `float cblas_sbdot` computes the dot product of two bfloat16 arrays
  * `void cblas_sbgemv` performs the matrix-vector operations of GEMV with the input matrix and X vector as bfloat16
  * `void cblas_sbgemm` performs the matrix-matrix operations of GEMM with both input arrays containing bfloat16

* Utility functions
  * openblas_get_num_threads
  * openblas_set_num_threads
  * `int openblas_get_num_procs(void)` returns the number of processors available on the system (may include "hyperthreading cores")
  * `int openblas_get_parallel(void)` returns 0 for sequential use, 1 for platform-based threading and 2 for OpenMP-based threading
  * `char * openblas_get_config()` returns the options OpenBLAS was built with, something like `NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Haswell`
  * `int openblas_set_affinity(int thread_index, size_t cpusetsize, cpu_set_t *cpuset)` sets the cpu affinity mask of the given thread to the provided cpuset. (Only available under Linux, with semantics identical to pthread_setaffinity_np)