This patch adds a matrix multiplication kernel for the bfloat16 data type (shgemm).
On architectures without native bfloat16 support, the type is defined as unsigned
short (2 bytes). The default unroll sizes can be overridden per architecture, as is
done for SGEMM; for now, 8 and 4 are used for M and N. The ncopy/tcopy size can
likewise be changed per architecture; for now, size 2 is used.
Added shgemm in kernel/power/KERNEL.POWER9 and tested it on powerpc64le and
powerpc64. For reference, a small test, compare_sgemm_shgemm.c, was added to
compare sgemm and shgemm output.
This patch does not cover the OpenBLAS tests, benchmarks, or LAPACK tests for shgemm.
A complex-type implementation can be discussed and added once this is approved.
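For illustration, here is a minimal sketch of the storage convention described above: a bfloat16 value kept in an unsigned short is the upper 16 bits of an IEEE-754 float32, which is also all that a comparison test like compare_sgemm_shgemm.c needs in order to check shgemm output against sgemm. The names below (bfloat16_t, bf16_to_float, float_to_bf16) are hypothetical and are not the identifiers used in the patch.

#include <stdint.h>
#include <string.h>

/* Illustrative only: with bfloat16 stored as unsigned short, a value is the
 * upper 16 bits of an IEEE-754 float32. */
typedef unsigned short bfloat16_t;

static float bf16_to_float(bfloat16_t x) {
  uint32_t bits = ((uint32_t)x) << 16;  /* widen back to a float32 bit pattern */
  float f;
  memcpy(&f, &bits, sizeof(f));
  return f;
}

static bfloat16_t float_to_bf16(float f) {
  uint32_t bits;
  memcpy(&bits, &f, sizeof(bits));
  return (bfloat16_t)(bits >> 16);      /* truncation; adequate for a reference test */
}

A comparison test can then convert each shgemm output element with bf16_to_float and check it against the corresponding sgemm result within a bfloat16-sized tolerance.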
The current implementation uses locks, but each lock protects a critical section
of only a single variable, so atomic reads/writes with barriers can be used to
achieve the same behavior.
As with the previous patch, pthread_mutex_lock is not fair, so in a tight loop the
thread that already holds the lock can keep reacquiring it and starve another
thread, even if that thread is about to write the data that would stop the current
thread from spinning.
On a 64-core Arm system this improves performance by 20x on sgesv.goto.
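As a sketch of the pattern described above (not the patch itself), a mutex that only guards the read or write of a single status word can be replaced with GCC's atomic builtins and acquire/release ordering; the variable and function names here are hypothetical.

#include <pthread.h>

static volatile long thread_status;                 /* hypothetical shared flag */
static pthread_mutex_t status_lock = PTHREAD_MUTEX_INITIALIZER;

/* Before: take the lock just to read one variable. */
static long read_status_locked(void) {
  long s;
  pthread_mutex_lock(&status_lock);
  s = thread_status;
  pthread_mutex_unlock(&status_lock);
  return s;
}

/* After: one atomic load with an acquire barrier gives the same visibility
 * guarantee, and a spinning reader can no longer be starved of a lock. */
static long read_status_atomic(void) {
  return __atomic_load_n(&thread_status, __ATOMIC_ACQUIRE);
}

static void write_status_atomic(long s) {
  __atomic_store_n(&thread_status, s, __ATOMIC_RELEASE);
}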
On systems with more than 64 CPUs, blas_quickdivide will sometimes return zero, which creates bogus workloads when the result is used for the stride calculation. This then leads to threads spinning incessantly, waiting for a status change that never happens, as seen in #1497.
This patch also fixes several data races that were found by helgrind and/or tsan while debugging the issue.
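A minimal sketch of the kind of guard this implies; the helper name and call site are illustrative and not taken from the patch.

#include "common.h"   /* OpenBLAS header providing BLASLONG and blas_quickdivide() */

/* Hypothetical helper: blas_quickdivide() rounds down, so with more CPUs than
 * rows/columns to split it can return 0.  Clamping the result keeps every
 * thread's slice non-empty, so no thread spins forever on work that never
 * arrives. */
static BLASLONG clamped_stride(BLASLONG range, BLASLONG nthreads) {
  BLASLONG width = blas_quickdivide(range, nthreads);
  if (width == 0) width = 1;
  return width;
}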
When NUM_THREADS (MAX_CPU_NUMBER) is very large, e.g. 256, the following declaration becomes a problem:
typedef struct {
  volatile BLASLONG working[MAX_CPU_NUMBER][CACHE_LINE_SIZE * DIVIDE_RATE];
} job_t;

job_t job[MAX_CPU_NUMBER];
The job array alone is 8 MB. Thus, we use malloc instead of stack allocation.
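A minimal sketch of the change under these assumptions; the wrapper function and the error path are illustrative, not necessarily what the patch does.

#include <stdio.h>
#include <stdlib.h>
#include "common.h"   /* BLASLONG, MAX_CPU_NUMBER, CACHE_LINE_SIZE, DIVIDE_RATE */

/* Same job_t as quoted above. */
typedef struct {
  volatile BLASLONG working[MAX_CPU_NUMBER][CACHE_LINE_SIZE * DIVIDE_RATE];
} job_t;

static int run_gemm_driver(void) {            /* hypothetical driver wrapper */
  /* Heap allocation instead of "job_t job[MAX_CPU_NUMBER];" on the stack. */
  job_t *job = (job_t *)malloc(MAX_CPU_NUMBER * sizeof(job_t));
  if (job == NULL) {
    fprintf(stderr, "OpenBLAS: allocating the job array failed.\n");
    return 1;
  }
  /* ... threaded GEMM work using job[i].working[...] ... */
  free(job);
  return 0;
}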