The current implementation uses locks, but each lock's critical section covers only a single variable, so atomic reads and writes with barriers can achieve the same behavior. As with the previous patch, the underlying problem is that pthread_mutex_lock isn't fair: in a tight loop, the thread that last held the lock can keep reacquiring it and starve another thread, even when that thread is about to write the data that would stop the current thread from spinning. On a 64-core Arm system this improves sgesv.goto performance by 20x.
| Name |
|---|
| getf2 |
| getrf |
| getrs |
| laswp |
| lauu2 |
| lauum |
| potf2 |
| potrf |
| trti2 |
| trtri |
| trtrs |
| CMakeLists.txt |
| Makefile |
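
The idea in the commit message above, replacing a mutex that guards a single variable with an atomic store and load plus barriers, can be sketched roughly as follows. This is a minimal illustration using C11 `<stdatomic.h>`, not the code from the patch itself: the names `slot_t`, `slot_publish`, `slot_wait`, and `run_handoff` are hypothetical, and the actual change may use different primitives (for example, the project's own barrier macros).

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a single shared variable that was
 * previously protected by its own mutex. */
typedef struct {
    atomic_long flag;  /* 0 = not ready, 1 = payload is valid */
    long payload;      /* plain data, published before flag is set */
} slot_t;

/* Writer: store the data, then set the flag with release ordering so
 * the payload write becomes visible no later than the flag write. */
static void slot_publish(slot_t *s, long value) {
    s->payload = value;
    atomic_store_explicit(&s->flag, 1, memory_order_release);
}

/* Reader: spin on the flag with acquire ordering. No lock is taken,
 * so the spinning thread cannot keep the writer out. */
static long slot_wait(slot_t *s) {
    while (atomic_load_explicit(&s->flag, memory_order_acquire) == 0) {
        /* busy-wait; a real implementation might yield or pause here */
    }
    return s->payload;
}

static void *producer(void *arg) {
    slot_publish((slot_t *)arg, 42);
    return NULL;
}

/* Runs one producer/consumer handoff and returns the value received. */
long run_handoff(void) {
    slot_t s;
    pthread_t t;
    int rc;

    atomic_init(&s.flag, 0);
    s.payload = 0;

    rc = pthread_create(&t, NULL, producer, &s);
    assert(rc == 0);

    long got = slot_wait(&s);
    pthread_join(t, NULL);
    return got;
}
```

Calling `run_handoff()` returns 42 once the producer thread has published its value. Because the reader never holds a lock while it spins, progress here does not depend on how fairly a mutex hands ownership between threads, which is the starvation scenario the commit message describes.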