Using 256-bit registers in dscal makes this microkernel consistent with
cscal and zscal, and generally doubles performance if the vector fits in
L1 cache.
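
As a rough illustration of what "256-bit registers" means for dscal, here is a minimal C sketch that scales four doubles per step using the GCC/Clang vector extension. It is not the microkernel from this change: the name `dscal_sketch` is hypothetical, only unit stride is handled, and the special cases a real BLAS dscal covers (for example alpha equal to zero) are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Four doubles packed into one 256-bit vector (GCC/Clang vector extension). */
typedef double v4d __attribute__((vector_size(32)));

/* Hypothetical stand-in for a dscal kernel: x[i] *= alpha, unit stride only. */
static void dscal_sketch(size_t n, double alpha, double *x)
{
    assert(x != NULL || n == 0);

    const v4d valpha = {alpha, alpha, alpha, alpha};
    size_t i = 0;

    /* Main loop: one 256-bit load, multiply and store per iteration. */
    for (; i + 4 <= n; i += 4) {
        v4d v;
        memcpy(&v, x + i, sizeof v);   /* load without assuming alignment */
        v *= valpha;                   /* element-wise multiply */
        memcpy(x + i, &v, sizeof v);   /* store the scaled elements back */
    }

    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < n; i++)
        x[i] *= alpha;
}
```

With a compiler targeting AVX, the vector multiply would typically map to a single 256-bit instruction, which is where the expected gain over 128-bit registers comes from.
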
The OSX assembler apparently mishandles the decimal argument to .align,
leading to a significant loss of performance, as observed in #730, #901
and most recently #1470.
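
One way to sidestep the ambiguity, sketched here under assumptions: in GNU-style assemblers the argument of `.align` is read as a byte count on some targets and as a power of two on others, while `.p2align n` always requests 2^n-byte alignment. The loop below is a hypothetical example, not code from the kernels, and whether this difference is the exact cause of the slowdowns in the issues above is an assumption.

```c
#include <assert.h>

/* Hypothetical helper: counts n down to zero and returns the number of
 * iterations executed. Only the alignment directive matters here. */
static long count_down(long n)
{
    long iters = 0;
    assert(n > 0);
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ volatile(
        ".p2align 4      \n\t"   /* 2^4 = 16-byte alignment for the loop head */
        "1:              \n\t"
        "addq $1, %1     \n\t"   /* iters++ */
        "subq $1, %0     \n\t"   /* n--, sets the zero flag when n reaches 0 */
        "jnz  1b         \n\t"
        : "+r"(n), "+r"(iters)
        :
        : "cc");
#else
    /* Portable fallback so the sketch still builds on other targets. */
    while (n-- > 0)
        iters++;
#endif
    return iters;
}
```
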