1. Modify the algorithm to resolve multithreading failures
2. No memory allocation in sbgemm kernel
3. Optimize when alpha == 1.0f
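
To illustrate items 2 and 3, here is a minimal portable C sketch, not the code from this change. It assumes a column-major layout, bfloat16 inputs stored as the upper 16 bits of an IEEE-754 float, and a float output that is accumulated into. The names `sbgemm_sketch`, `bf16_t`, `bf16_to_float`, and `float_to_bf16_trunc` are hypothetical. The accumulator is a local variable, so nothing is allocated on the heap (which also leaves no shared scratch buffer for threads to contend on), and the scaling multiply is skipped when `alpha == 1.0f`.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical bfloat16 type: the upper 16 bits of an IEEE-754 float. */
typedef uint16_t bf16_t;

/* Widen bfloat16 to float by placing its bits in the high half. */
static float bf16_to_float(bf16_t v) {
    uint32_t bits = (uint32_t)v << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Narrow float to bfloat16 by truncation (no rounding); helper for inputs. */
static bf16_t float_to_bf16_trunc(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (bf16_t)(bits >> 16);
}

/* C (m x n) += alpha * A (m x k) * B (k x n), all column-major.
 * Assumption: no transposition and no beta scaling, to keep the sketch short. */
static void sbgemm_sketch(int m, int n, int k, float alpha,
                          const bf16_t *a, int lda,
                          const bf16_t *b, int ldb,
                          float *c, int ldc) {
    const int unit_alpha = (alpha == 1.0f);  /* decided once, outside the loops */

    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            float acc = 0.0f;  /* local accumulator: no malloc, no shared state */
            for (int p = 0; p < k; p++) {
                acc += bf16_to_float(a[i + p * lda]) *
                       bf16_to_float(b[p + j * ldb]);
            }
            if (unit_alpha) {
                c[i + j * ldc] += acc;          /* fast path: skip the multiply */
            } else {
                c[i + j * ldc] += alpha * acc;  /* general path */
            }
        }
    }
}
```

A caller would convert its inputs with `float_to_bf16_trunc`, zero or preload `c`, and call `sbgemm_sketch`. A tuned kernel would typically provide separate vectorized code paths for the two alpha cases rather than a per-element branch.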