See #1886 and JuliaRobotics issue 500. trsm benchmarks on Haswell and Zen showed that with these values performance is roughly doubled for matrix sizes between 8x8 and 14x14, and still 10 to 20 percent better near the new cutoff at 32x32.
OpenBLAS has a fancy algorithm for copying the input data while laying
it out in a more CPU friendly memory layout.
This is great for large matrixes; the cost of the copy is easily
ammortized by the gains from the better memory layout.
But for small matrixes (on CPUs that can do efficient unaligned loads) this
copy can be a net loss.
This patch adds (for SKYLAKEX initially) a "sgemm direct" mode, that bypasses
the whole copy machinary for ALPHA=1/BETA=0/... standard arguments,
for small matrixes only.
What is small? For the non-threaded case this has been measured to be
in the M*N*K = 28 * 512 * 512 range, while in the threaded case it's
less, around M*N*K = 1 * 512 * 512
Currently the generic ARMV8 target uses C implementations
for many routines. Replace these with the neon implementations
written for THUNDERX2T99 target which are upto 6x faster for
certain routines.
Fixes two calls that were using `fabs` on a `long double` argument rather than `fabsl`, which looks like it is doing an unintentional truncation to `double` precision.
Always checking num_cpu_avail() regardless of whether threading will actually
be used adds noticeable overhead for small matrices. Most other uses of
num_cpu_avail() do so only if threading will be used, so do the same here.
* Rewrite ROTMG based on the new implementation in GONUM based on the algorithm proposed by Tim Hopkins, see issue 1452 for the reference
* Correct ROTMG utest for issue1452 and add another from gonum, also correct transposition of expected and observed values in error messages
Fixes#1452 and is more in line with how ATLAS does it. The earlier fix from #356 only moved the bug elsewhere, but we will never want the iterative rescaling to change the dflag setting and variable associations with each cycle.