See #1886 and JuliaRobotics issue 500. trsm benchmarks on Haswell and Zen showed that, with these values, performance roughly doubles for matrix sizes between 8x8 and 14x14 and is still 10 to 20 percent better near the new cutoff at 32x32.
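For illustration only, a minimal sketch of what such a size cutoff amounts to; the constant, the kernel names and the dispatch helper below are hypothetical, not the actual OpenBLAS code:

    /* Hypothetical sketch of a trsm size cutoff; all names are illustrative. */
    #define TRSM_SMALL_CUTOFF 32   /* assumed cutoff, per the 32x32 figure above */

    extern void dtrsm_small(int m, int n, const double *a, double *b);   /* tuned tiny-matrix path */
    extern void dtrsm_blocked(int m, int n, const double *a, double *b); /* regular blocked path */

    static void dtrsm_dispatch(int m, int n, const double *a, double *b)
    {
        if (m <= TRSM_SMALL_CUTOFF && n <= TRSM_SMALL_CUTOFF)
            dtrsm_small(m, n, a, b);     /* below the cutoff: cheap unblocked kernel wins */
        else
            dtrsm_blocked(m, n, a, b);   /* at or above the cutoff: standard path */
    }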
Always checking num_cpu_avail() regardless of whether threading will actually
be used adds noticeable overhead for small matrices. Most other call sites
invoke num_cpu_avail() only when threading will be used, so do the same here.
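A rough sketch of the pattern, using a level-3 GEMM-style threshold as an
example; blasint, num_cpu_avail(), SMP_THRESHOLD_MIN and
GEMM_MULTITHREAD_THRESHOLD come from OpenBLAS's common.h, while the helper
function and the exact comparison are illustrative and may differ from the
actual diff:

    #include "common.h"   /* OpenBLAS internals: blasint, num_cpu_avail(), thresholds */

    /* Illustrative helper: decide the thread count for a level-3 call,
       querying num_cpu_avail() only when the problem is large enough to thread. */
    static int choose_nthreads(blasint m, blasint n, blasint k)
    {
      int nthreads = 1;
    #ifdef SMP
      double MNK = (double)m * (double)n * (double)k;

      /* Small problems stay single-threaded and never pay for the query. */
      if (MNK > (double)SMP_THRESHOLD_MIN * (double)GEMM_MULTITHREAD_THRESHOLD)
        nthreads = num_cpu_avail(3);   /* level-3 BLAS */
    #endif
      return nthreads;
    }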
Modify complex CBLAS functions to take void pointers instead of float* or double* arguments (to bring the prototypes in line with netlib and other implementations' cblas.h).
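As an example of the intended change, using cblas_zaxpy as a representative routine; the exact integer type (blasint is OpenBLAS's index type) and const qualifiers follow the respective header and may differ slightly:

    /* Before (old OpenBLAS cblas.h style): complex scalars and vectors as double*.
    void cblas_zaxpy(const blasint n, const double *alpha,
                     const double *x, const blasint incx,
                     double *y, const blasint incy);
    */

    /* After: void* as in netlib's cblas.h, so callers can pass any complex
       representation (double _Complex, std::complex<double>, double[2], ...)
       without casts or incompatible-pointer warnings. */
    void cblas_zaxpy(const blasint n, const void *alpha,
                     const void *x, const blasint incx,
                     void *y, const blasint incy);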