Martin Kroeker
c26c0b77a7
Fix wrong constraints in inline assembly
...
for #2009
2019-02-15 15:08:16 +01:00
Martin Kroeker
1c6da2d03c
Merge pull request #2019 from martin-frbg/gcc9fixes
...
Fix unannounced modification of input operand 8 (lda4) in Haswell GEMVN microkernel
2019-02-15 15:02:54 +01:00
Martin Kroeker
4255a58cd2
Rename operands to put lda on the input/output constraint list
2019-02-15 10:10:04 +01:00
Martin Kroeker
46e415b140
Save and restore input argument 8 (lda4)
...
Fixes miscompilation with gcc9 -ftree-vectorize (related to issue #2009 )
2019-02-14 22:43:18 +01:00
Bart Oldeman
69a97ca7b9
dgemv_kernel_4x4(Haswell): add missing clobbers for xmm0,xmm1,xmm2,xmm3
...
This fixes a crash in dblat2 when OpenBLAS is compiled using
-march=znver1 -ftree-vectorize -O2
See also:
https://github.com/easybuilders/easybuild-easyconfigs/issues/7180
2019-02-14 16:27:58 +00:00
Martin Kroeker
ab1630f9fa
Fix declaration of arguments in inline assembly
...
Argument 0 is modified so should be input and output
2019-02-12 16:14:02 +01:00
Martin Kroeker
b824fa70eb
Fix declaration of assembly arguments in SSYMV and DSYMV microkernels
...
Arguments 0 and 1 are both input and output
2019-02-12 16:00:18 +01:00
Martin Kroeker
91481a3e4e
Fix declaration of input arguments in inline assembly
...
Argument 0 is modified as it doubles as a counter
2019-02-12 15:51:43 +01:00
Martin Kroeker
dc6ac9eab0
Fix declaration of input arguments in the x86_64 s/dGEMV_T and s/dGEMV_N kernels
...
Arguments 0 and 1 need to be tagged as both input and output
2019-02-12 15:33:48 +01:00
Martin Kroeker
32b0f1168e
Fix declaration of input arguments in the Sandybridge GER microkernels ( #1967 )
...
* Tag arguments 0 and 1 as both input and output
2019-01-18 08:11:39 +01:00
Martin Kroeker
b495e54310
Fix declaration of input arguments in the x86_64 SCAL microkernels ( #1966 )
...
* Tag arguments 0 and 1 as both input and output (see #1964 )
2019-01-18 08:11:07 +01:00
Martin Kroeker
d5e6940253
Fix declaration of input arguments in the x86_64 microkernels for DOT and AXPY ( #1965 )
...
* Tag operands 0 and 1 as both input and output
For #1964 (basically a continuation of coding problems first seen in #1292 )
2019-01-17 23:20:32 +01:00
Arjan van de Ven
795285c587
Fix thinko in skylake beta handling
...
casting ints is cheaper but it has a rounding, not memory casing effect, resulting in
invalid outcome
2018-12-24 18:49:50 +00:00
Arjan van de Ven
d321448a63
dgemm: use dgemm_ncopy_8_skylakex.c also for Haswell
...
The dgemm_ncopy_8_skylakex.c code is not avx512 specific and gives
a nice performance boost for medium sized matrices
2018-12-16 23:09:22 +00:00
Arjan van de Ven
c43331ad0a
dgemm: Use the skylakex beta function also for haswell
...
it's more efficient for certain tall/skinny matrices
2018-12-16 23:09:17 +00:00
Arjan van de Ven
69d206440a
Make the skylakex/haswell sgemm code compile and run even with compilers without avx2 support
2018-12-16 00:19:41 +00:00
Arjan van de Ven
0586899a10
Use sgemm_ncopy_4_skylakex.c also for Haswell
...
sgemm_ncopy_4_skylakex.c uses SSE transpose operations where the
real perf win happens; this also works great for Haswell.
This gives double digit percentage gains on small and skinny matrices
2018-12-15 13:49:19 +00:00
Arjan van de Ven
00dc09ad19
Use the skylake sgemm beta code also for haswell
...
with a few small changes it's possible to use the skylake sgemm code
also for haswell, this gives a modest gain (10% range) for smallish
matrixes but does wonders for very skinny matrixes
2018-12-15 13:49:13 +00:00
Arjan van de Ven
cdc668d82b
Add a "sgemm direct" mode for small matrixes
...
OpenBLAS has a fancy algorithm for copying the input data while laying
it out in a more CPU friendly memory layout.
This is great for large matrixes; the cost of the copy is easily
ammortized by the gains from the better memory layout.
But for small matrixes (on CPUs that can do efficient unaligned loads) this
copy can be a net loss.
This patch adds (for SKYLAKEX initially) a "sgemm direct" mode, that bypasses
the whole copy machinary for ALPHA=1/BETA=0/... standard arguments,
for small matrixes only.
What is small? For the non-threaded case this has been measured to be
in the M*N*K = 28 * 512 * 512 range, while in the threaded case it's
less, around M*N*K = 1 * 512 * 512
2018-12-13 13:47:31 +00:00
Martin Kroeker
701ea88347
Use p2align instead of align for OSX compatibility
...
fixes #1902
2018-12-03 13:06:43 +01:00
Andrew
19c4bdd8b3
Add return value so that freebsd system clang does not err out
2018-11-25 21:35:01 +01:00
Arjan van de Ven
dcc5d6291e
skylakex: Make the sgemm/dgemm beta code robust for a N=0 or M=0 case
...
in the threading code there are cases where N or M can become 0,
and the optimized beta code did not handle this well, leading
to a crash
during the audit for the crash a few edge conditions on the if statements
were found and fixed as well
2018-11-01 01:42:09 +00:00
Arjan van de Ven
55b244ca0d
enable the SGEMM/SKX C based kernel
...
In QA the final bug was found so now the sklyakex sgemm C based kernel can
be activated....
2018-10-12 09:30:35 +00:00
Arjan van de Ven
d4bad73834
Add a C+intrinsics version of the SGEMM/skylakex kernel
...
for most sizes this is 1.2x to 1.4x faster than the current code
2018-10-10 01:49:22 +00:00
Arjan van de Ven
582c589727
dgemm/skylakex: replace discrete mul/add with fma
...
very minor gains since it's not super hot code, but general principles
2018-10-06 23:13:26 +00:00
Arjan van de Ven
adbf6afa25
Add vector optimizations for ncopy as well for dgemm/skylakex
2018-10-06 21:18:12 +00:00
Arjan van de Ven
32bec8afbb
add a skylakex optimized dgemm beta function
2018-10-06 16:36:26 +00:00
Arjan van de Ven
20c5d668fe
dgemm/avx512 simplify and speed up the 4x4 kernel
2018-10-06 14:12:32 +00:00
Arjan van de Ven
6d43c51ccf
undo slow dgemm/skylake microoptimization
...
the compare is more costly than the work
2018-10-06 14:00:37 +00:00
Arjan van de Ven
d74dc39b0f
Add optimized *copy versions for skylakex
...
Add optimized n/t copy versions for skylakex; in the patch the
tcopy is also rewritten using intrinsics; the ncopy file
will be worked on in a future commit
2018-10-06 13:51:44 +00:00
Arjan van de Ven
66b43affbc
Add a 24x8 kernel to the skylakex dgemm implementation
...
Minor gains for small matrixes, but at 512x512 and above the gain
gets more significant.
2018-10-05 13:22:21 +00:00
Arjan van de Ven
1938819c25
skylake dgemm: Add a 16x8 kernel
...
The next step for the avx512 dgemm code is adding a 16x8 kernel.
In the 8x8 kernel, each FMA has a matching load (the broadcast);
in the 16x8 kernel we can reuse this load for 2 FMAs, which
in turn reduces pressure on the load ports of the CPU and gives
a nice performance boost (in the 25% range).
2018-10-05 13:11:35 +00:00
Martin Kroeker
b7496c3638
Function name needs to be CNAME, set from outside to allow suffixing for dynamic_arch
2018-10-04 19:14:59 +02:00
Arjan van de Ven
45fe8cb0c5
Create a AVX512 enabled version of DGEMM
...
This patch adds dgemm_kernel_4x8_skylakex.c which is
* dgemm_kernel_4x8_haswell.s converted to C + intrinsics
* 8x8 support added
* 8x8 kernel implemented using AVX512
Performance is a work in progress, but already shows a 10% - 20%
increase for a wide range of matrix sizes.
2018-10-03 14:45:25 +00:00
Martin Kroeker
375dff54fc
Merge pull request #1733 from fenrus75/dsymv
...
Add an AVX512 enabled DSYMV (L) function
2018-08-12 18:18:36 +02:00
Martin Kroeker
a5f165275a
Merge pull request #1732 from fenrus75/dgemv
...
Add an AVX512 enabled DGEMV (n) function
2018-08-12 18:17:42 +02:00
Martin Kroeker
8c13aa495a
Merge pull request #1730 from fenrus75/fix-sdot
...
Fix typo in sdot function
2018-08-12 18:17:01 +02:00
Arjan van de Ven
9bec34cb67
Add an AVX512 enabled DSYMV (L) function
...
written in C intrinsics for best readability.
(the same C code works for Haswell as well)
For logistical reasons the code falls back to the existing
haswell AVX2 implementation if the GCC or LLVM compiler is not new enough
2018-08-11 17:46:24 +00:00
Arjan van de Ven
87bebdbd8a
Add an AVX512 enabled DGEMV (n) function
...
written in C intrinsics for best readability.
(the same C code works for Haswell as well)
For logistical reasons the code falls back to the existing
haswell AVX2 implementation if the GCC or LLVM compiler is not new enough
2018-08-11 17:38:12 +00:00
Arjan van de Ven
36add7570a
Fix typo in sdot function
...
it looks like my previous pull request was short the final commit;
fix a typo in sdot
2018-08-11 17:16:45 +00:00
Arjan van de Ven
cacacc8007
Add an AVX512 enabled DSCAL function
...
written in C intrinsics for best readability.
(the same C code works for Haswell as well)
For logistical reasons the code falls back to the existing
haswell AVX2 implementation if the GCC or LLVM compiler is not new enough
2018-08-11 17:14:57 +00:00
Martin Kroeker
1a00ef3d27
Merge pull request #1725 from fenrus75/axpy
...
Add a AVX512 enabled SAXPY/DAXPY functions
2018-08-11 11:01:20 +02:00
Arjan van de Ven
2e99873ff7
Add a AVX512 enabled SAXPY/DAXPY functions
...
written in C intrinsics for best readability.
(the same C code works for Haswell as well)
For logistical reasons the code falls back to the existing
haswell AVX2 implementation if the GCC or LLVM compiler is not new enough
2018-08-10 02:58:32 +00:00
Arjan van de Ven
00abaa865b
Add an AVX512 enabled SDOT function
...
written in C intrinsics for best readability.
(the same C code works for Haswell as well)
For logistical reasons the code falls back to the existing
haswell AVX2 implementation if the GCC or LLVM compiler is not new enough
2018-08-10 02:33:43 +00:00
Arjan van de Ven
7932ff3ea9
Add an AVX512 enabled DDOT function
...
written in C intrinsics for best readability.
(the same C code works for Haswell as well)
For logistical reasons the code falls back to the existing
haswell AVX2 implementation if the GCC or LLVM compiler is not new enough
2018-08-09 03:55:52 +00:00
Martin Kroeker
6e54b0a027
Disable the 16x2 DTRMM kernel on SkylakeX as well
2018-06-30 17:31:06 +02:00
Martin Kroeker
f0a8dc2eec
Disable the AVX512 DGEMM kernel for now
...
due to #1643
2018-06-30 11:34:48 +02:00
Craig Donner
c2545b0fd6
Fixed a few more unnecessary calls to num_cpu_avail.
...
I don't have as many benchmarks for these as for gemm, but it should still
make a difference for small matrices.
2018-06-11 10:17:16 +01:00
Arjan van de Ven
89372e0993
Use AVX512 also for DGEMM
...
this required switching to the generic gemm_beta code (which is faster anyway on SKX)
for both DGEMM and SGEMM
Performance for the not-retuned version is in the 30% range
2018-06-03 22:17:27 +00:00
Arjan van de Ven
99c7bba8e4
Initial support for SkylakeX / AVX512
...
This patch adds the basic infrastructure for adding the SkylakeX (Intel Skylake server)
target. The SkylakeX target will use the AVX512 (AVX512VL level) instruction set,
which brings 2 basic things:
1) 512 bit wide SIMD (2x width of AVX2)
2) 32 SIMD registers (2x the number on AVX2)
This initial patch only contains a trivial transofrmation of the Haswell SGEMM kernel
to AVX512VL; more will follow later but this patch aims to get the infrastructure
in place for this "later".
Full performance tuning has not been done yet; with more registers and wider SIMD
it's in theory possible to retune the kernels but even without that there's an
interesting enough performance increase (30-40% range) with just this change.
2018-06-03 07:58:52 +00:00
Martin Kroeker
840e01061f
Merge pull request #1491 from martin-frbg/ddot_mt
...
Add multithreading support for Haswell DDOT
2018-03-27 21:43:05 +02:00
Martin Kroeker
a55694dd5b
Declare dot_compute static to avoid conflicts in multiarch builds
2018-03-16 22:23:36 +01:00
Martin Kroeker
85a41e9cdb
Add multithreading support for Haswell DDOT
...
copied from ashwinyes' implementation in dot_thunderx2t99.c
2018-03-16 16:58:47 +01:00
Martin Kroeker
81215711a2
Re-enable DAXPY microkernels for x86_64
...
as the inaccuracies seen in the original testcase for #1332 appear to be due to an artefact that amplifies the very small rounding differences between FMA and discrete multiply+add
2018-03-04 19:37:03 +01:00
Martin Kroeker
497f0c3d8a
Replace .align with .p2align in the Nehalem microkernels
2018-02-26 20:58:33 +01:00
Martin Kroeker
ea37db828e
Convert .align to .p2align for OSX compatibility
2018-02-26 20:48:03 +01:00
Martin Kroeker
7c1925acec
Use .p2align instead of .align for compatibility on Sandybridge as well
2018-02-24 19:43:15 +01:00
Martin Kroeker
2359c7c1a9
Use .p2align instead of .align for portability
...
The OSX assembler apparently mishandles the argument to decimal .align, leading to a significant loss of performance
as observed in #730 , #901 and most recently #1470
2018-02-24 17:50:13 +01:00
Martin Kroeker
e388459a27
Merge pull request #1419 from brada4/develop
...
Initialize unitialized values for repeated calls
2018-01-31 23:48:34 +01:00
Andrew
4938faa822
core.IdenticalExpr clang501 checker
2018-01-19 23:15:58 +01:00
Martin Kroeker
42285d8e70
Merge pull request #1410 from brada4/develop
...
Address warnings #1357
2018-01-06 20:02:46 +01:00
Andrew
4d0b005e5b
Eliminate remaining unused results in kernels (clang5 analyzer)
2018-01-01 20:54:39 +01:00
Martin Kroeker
b81656936f
Merge pull request #1409 from martin-frbg/issue1292-2
...
Tag %1 and %2 as both input and output operands
2017-12-31 20:18:48 +01:00
Martin Kroeker
b973990df2
Tag %1 and %2 as both input and output operands
...
fix from #1292 extended to the other gemv microkernels
2017-12-31 18:03:36 +01:00
Martin Kroeker
1e31124eb0
Merge pull request #1406 from martin-frbg/issue1292
...
Tag %1 and %2 as both input and output
2017-12-30 14:52:03 +01:00
Martin Kroeker
723f396a20
Tag %1 and %2 as both input and output
...
The inline assembly modifies its input operands, so mark them as output to avoid surprises with optimization. Fixes #1292
2017-12-29 23:56:41 +01:00
Martin Kroeker
43c0622e7b
Retire Piledriver/Steamroller/Excavator daxpy microkernels as well
...
related to issue #1332
2017-12-13 18:40:39 +01:00
Martin Kroeker
0623636c98
Use Sandybridge daxpy kernel on Haswell and Zen for now
...
The testcase from #1332 exposes a problem in daxpy_microk_haswell-2.c that is not seen with
any of the other Intel x86_64 microkernels.
2017-12-10 19:24:31 +01:00
Andrew
281a2b952f
warning cleanup ( #1380 )
...
* dead increments in driver/level2
* dead increments in kernel/generic
* part dead increments in kernel/x86_64
2017-12-05 19:54:10 +01:00
Martin Kroeker
6c77b5f267
Merge pull request #1369 from martin-frbg/dsdot
...
Add optimized dsdot to all other x86_64 kernels that use sdot.c
2017-11-28 18:15:31 +01:00
Martin Kroeker
c92cd6d162
Add trivially optimized dsdot based on sdot
2017-11-24 20:05:27 +01:00
Martin Kroeker
cae5d9a20b
Add trivially optimized dsdot based on sdot
2017-11-24 20:04:29 +01:00
Martin Kroeker
3d891c3106
Add trivially optimized dsdot based on sdot
2017-11-24 20:03:40 +01:00
Martin Kroeker
4fbdcfa823
Add trivially optimized dsdot based on sdot
2017-11-24 20:02:28 +01:00
Martin Kroeker
1bb6a96ebc
Add trivially optimized dsdot based on sdot
2017-11-24 20:01:42 +01:00
Martin Kroeker
6bd163f37a
Add trivially optimized dsdot based on sdot
2017-11-24 20:00:23 +01:00
Martin Kroeker
f0333333d1
Add trivially optimized dsdot based on sdot
2017-11-24 19:59:28 +01:00
Andrew
e89b979b2c
fix spurious compiler warning fix (no code change)
2017-11-24 18:39:04 +01:00
Andrew
7e9b29b9b8
fix spurious compiler warning (no code change)
2017-11-24 18:36:37 +01:00
Martin Kroeker
6157d0902a
Merge pull request #1358 from martin-frbg/unused_vars
...
Clean up spurious unused variables in the kernels
2017-11-15 11:31:43 +01:00
Martin Kroeker
3fea849bbf
Remove unused variables from Haswell dtrmm and Bulldozer dtrsm
2017-11-14 23:35:10 +01:00
Martin Kroeker
8f177621bc
Remove unused variables at0...at3 from ?symv_U
2017-11-14 23:32:25 +01:00
Martin Kroeker
5f402b7759
Remove unused (loop?) variable j from the gemv_n_4 implementations
2017-11-14 23:29:42 +01:00
Martin Kroeker
a07807caac
Eliminate loop code when called as/from dsdot
2017-10-25 16:45:41 +02:00
Martin Kroeker
5e3e91d0fc
Split the microkernel workload into chunks of 32 floats for dsdot mode to limit loss of precision
2017-10-22 18:18:51 +02:00
Martin Kroeker
28c3fa8950
Add dsdot
2017-10-16 23:29:03 +02:00
Martin Kroeker
8ac87c1cb6
Implement DSDOT with unchanged sdot microkernels
2017-10-16 23:27:51 +02:00
Isuru Fernando
505b218829
Merge remote-tracking branch 'upstream/develop' into dyn
2017-08-06 19:07:00 +05:30
Isuru Fernando
1d1854032b
Add missing EXCAVATOR
2017-08-02 19:03:04 +05:30
Isuru Fernando
2c51a990ac
Fix extra whitespaces. CMake parser macro fails with it
...
TODO: Fix the parser macro to strip trailing whitespaces
2017-08-02 18:26:57 +05:30
Isuru Fernando
ca17b4b75c
Fix complex support for MSVC headers
2017-07-28 11:50:29 +05:30
Denis Steckelmacher
c9ff735da6
Add ZEN support (tested for auto-detected static backend)
2017-03-19 15:32:50 +01:00
Martin Kroeker
a6efabf155
Replace gnu _real_ , _imag_ extensions in initializers
2017-03-13 00:38:37 +01:00
Martin Kroeker
dc34a0da96
Merge pull request #915 from mdong/small_fix_for_icc
...
remove input from clobbered list
2017-02-23 20:00:22 +01:00
Martin Kroeker
4998e19869
Change file comments to work around clang 3.9 assembler bug
2016-10-13 16:51:08 +02:00
Martin Kroeker
16446d1d23
Remove explicit include of complex.h
2016-09-29 23:45:56 +02:00
mdong
098d8ec5d6
remove input from clobbered list
2016-06-24 16:37:58 -04:00
Werner Saar
298b13bba4
updated some kernel files for EXCAVATOR
2016-04-25 10:36:23 +02:00
Zhang Xianyi
f24d5307cf
Refs #834 . Fix zgemv config bug on Steamroller.
2016-04-12 22:26:11 +08:00
Zhang Xianyi
d4380c1fe4
Refs xianyi/OpenBLAS-CI#10 , Fix sdot for scipy test_iterative.test_convergence test failure on AMD bulldozer and piledriver.
2016-04-07 01:44:18 +08:00
Werner Saar
faa5e2e5e3
FIX: forgot the add the files cgemv_n_4.c and cgemv_t_4.c
2016-03-10 11:10:38 +01:00
Werner Saar
fdf291be30
Added optimized cgemv_n and cgemv_t kernels for bulldozer, piledriver and steamroller
2016-03-10 09:42:07 +01:00
Werner Saar
c99cc41cbd
Added optimized zgemv_n kernel for bulldozer, piledriver and steamroller
2016-03-09 14:02:03 +01:00
Werner Saar
acdff55a6a
Bugfix for ztrmv
2016-03-07 09:39:34 +01:00
Zhang Xianyi
7d6b68eb4a
Refs #786 . Revert to default assembly kernel.
2016-03-07 11:34:58 +08:00
Zhang Xianyi
8f758eeff9
Refs #786 . avoid old assembly c/zgemv kernels.
2016-03-05 08:32:03 +08:00
Zhang Xianyi
efa4f5c936
Refs #695 #783 . Replace default x86_64 cgemv_t
...
asm kernel by C kernel.
2016-03-01 11:18:56 +08:00
Zhang Xianyi
6e7be06e07
Refs JuliaLang/julia#5728 . Fix gemv performance bug on Haswell Mac OSX.
...
On Mac OS X, it should use .align 4 (equal to .align 16 on Linux).
I didn't get the performance benefit from .align. Thus, I deleted it.
2016-02-19 17:56:07 -05:00
Zhang Xianyi
962376664d
Refs #768 . Swap the result of zdot x87 fp kernel.
2016-02-02 09:15:02 +08:00
Zhang Xianyi
c44ff4d648
Refs #714 . avoid compiling warnings.
2016-01-28 04:38:07 +08:00
Werner Saar
c8f2c5d636
added optimized trsm_kernels
2016-01-05 13:05:05 +01:00
Zhang Xianyi
69363622a8
Fix DYNAMIC_ARCH=1 bug.
2015-10-27 05:10:40 +08:00
Zhang Xianyi
f874465bb8
Use cmake to build OpenBLAS GENERIC Target on MSVC x86 64-bit.
...
Disable CBLAS and LAPACK.
2015-08-10 14:10:44 -05:00
Zhang Xianyi
ab0a0a75fc
Merge branch 'develop' into cmake
2015-08-03 23:59:01 -05:00
Zhang Xianyi
1cf2b10224
Use pure C generic target on x86 and x86_64.
...
make TARGET=GENERIC
?gemm3m is unimplemented on generic target.
2015-08-03 23:55:56 -05:00
Zhang Xianyi
7ac7e147d4
Fixed cmake building bugs on Linux. Disable LAPACK by default.
2015-08-04 04:37:05 +08:00
Werner Saar
e7c969e164
added optimized dtrmm_kernel for haswell
2015-06-13 16:16:29 +02:00
Werner Saar
9bd962f655
modified haswell parameter dgemm_unroll_n
2015-06-13 10:28:27 +02:00
Werner Saar
24f58c8bb1
added optimized cscal and zscal kernels for steamroller
2015-05-18 12:40:07 +02:00
Werner Saar
95b1faf667
added optimized cscal and zscal kernels for steamroller and piledriver
2015-05-18 10:50:57 +02:00
Werner Saar
2d9e406050
added optimized cscal kernel for sandybridge
2015-05-18 08:46:06 +02:00
Werner Saar
59083e3ce1
added optimized cscal kernel for bulldozer
2015-05-18 07:33:52 +02:00
wernsaar
685be40339
Merge pull request #571 from wernsaar/develop
...
added optimized cscal and zscal functions
2015-05-17 14:09:14 +02:00
Werner Saar
31c9e399e9
added optimized cscal kernel for haswell
2015-05-17 13:44:09 +02:00
Werner Saar
7de6bb9889
added optimized zscal kernel for bulldozer
2015-05-17 11:45:19 +02:00
Werner Saar
d63034303b
added optimized zscal kernel for haswell
2015-05-16 16:41:45 +02:00
Zhang Xianyi
51ff17d46e
Add AMD Excavator target.
2015-05-13 16:16:30 -05:00
Werner Saar
18e90ee2e3
bugfix: added static to functions
2015-05-13 13:31:26 +02:00
Werner Saar
e00cccc41e
added optimized dscal kernel for piledriver
2015-05-13 13:05:35 +02:00
Werner Saar
73f09bf64f
optimized dscal kernel for increment != 1
2015-05-13 12:14:39 +02:00
Werner Saar
02e772c7e4
added optimized dscal kernel for haswell
2015-05-12 17:19:58 +02:00
Werner Saar
7aee913991
added optimized dscal kernel for sandybridge
2015-05-12 16:27:43 +02:00
Werner Saar
e50a933037
added optimized dscal kernel for bulldozer
2015-05-12 12:28:44 +02:00
Werner Saar
133c11a156
updated dgemv_n kernel for nehalem
2015-04-30 14:38:06 +02:00
Werner Saar
30f52d53df
optimized dgemv_n kernel for haswell
2015-04-30 12:11:39 +02:00
Werner Saar
5e83d80725
optimized dger kernel for sandybridge
2015-04-28 16:58:11 +02:00
Werner Saar
b2e1797dc6
added optimized sger kernel for sandybridge
2015-04-28 15:33:38 +02:00
Werner Saar
e216f686cb
optimized saxpy and daxpy for sandybridge
2015-04-28 10:18:32 +02:00
Werner Saar
fc0e0391f3
bugfixes: replaced int with BLASLONG
2015-04-24 14:30:44 +02:00
Werner Saar
c22068c406
optimized sdot.c for increments != 1
2015-04-24 13:13:20 +02:00
Werner Saar
dee100d0e4
optimized saxpy.c for increments != 1
2015-04-24 11:52:59 +02:00
Werner Saar
0273966abb
optimized daxpy kernel for increments != 1
2015-04-24 11:39:17 +02:00
Werner Saar
3a67daa954
optimized ddot.c for increments != 1
2015-04-24 10:56:55 +02:00
Werner Saar
b4f2153dcd
added optimized ssymv kernels for sandybridge
2015-04-23 12:19:24 +02:00
Werner Saar
1c4b0eeae3
added optimized ssymv kernels for haswell
2015-04-23 10:23:13 +02:00
Werner Saar
1bec9abb9a
added optimized dsymv kernels for sandybridge
2015-04-22 12:09:43 +02:00
Werner Saar
3814bf60d3
added optimized dsymv kernels for haswell
2015-04-22 10:42:50 +02:00
Werner Saar
6d0db0151f
added optimized zaxpy-kernels
2015-04-16 11:19:37 +02:00
Zhang Xianyi
37b9033c90
Merge pull request #543 from jeromerobert/develop
...
Fix a buffer overflow with MAX_STACK_ALLOC size in dgemv_t
2015-04-15 11:18:14 -05:00
Werner Saar
13889515b3
added optimized caxpy-kernel for sandybridge
2015-04-15 16:29:25 +02:00
Werner Saar
248c9340c3
added optimized caxpy-kernel for haswell
2015-04-15 15:16:31 +02:00
Werner Saar
e9f33b4ca7
added optimized caxpy-kernel for steamroller
2015-04-15 13:49:23 +02:00
Werner Saar
f5d847122a
updated caxpy_microk_bulldozer-2.c and caxpy.c
2015-04-15 11:59:38 +02:00
Jerome Robert
a4c96eca67
Fix a buffer overflow with MAX_STACK_ALLOC size in dgemv_t
...
Refs #478 , #482 , 9798481
, fd9fd42
2015-04-15 11:46:48 +02:00
Werner Saar
baa0363ea2
add optimized ddot-kernel for piledriver
2015-04-14 15:09:13 +02:00
Werner Saar
34ba66606a
add optimized daxpy-kernel for piledriver
2015-04-14 14:23:29 +02:00
Werner Saar
f615dc7603
added optimized saxpy kernel for steamroller
2015-04-14 09:09:39 +02:00
Werner Saar
331c417637
optimized saxpy for piledriver
2015-04-14 08:34:11 +02:00
Werner Saar
d7a17ad85d
optimized sdot-kernel for pilediver
2015-04-13 13:19:21 +02:00
Werner Saar
d35f6c63c2
add optimized daxpy-kernel for steamroller
2015-04-13 12:22:43 +02:00
Werner Saar
166d76e864
added optimized sdot-kernel for steamroller
2015-04-11 08:48:18 +02:00
Werner Saar
f9f127d838
added optimized ddot kernel for steamroller
2015-04-10 16:18:03 +02:00
wernsaar
62231ab337
Merge pull request #538 from wernsaar/develop
...
Added optimized cdot- and zdot-kernels
2015-04-10 16:03:37 +02:00
Werner Saar
3119def9a7
updated cdot and zdot
2015-04-10 11:10:31 +02:00
Werner Saar
33b332372a
add optimized cdot- and zdot-kernel for sandybridge
2015-04-10 09:37:26 +02:00
Werner Saar
fd838c75bc
add optimized cdot- and zdot-kernel for haswell
2015-04-09 15:13:52 +02:00
Werner Saar
b57a60dac8
updated cdot and zdot for piledriver
2015-04-09 10:33:46 +02:00
Werner Saar
5c51163972
added optimized cdot- and zdot-kernel for steamroller
2015-04-09 09:45:23 +02:00
Werner Saar
9299d8cfd6
added optimized cdot- and zdot-kernels for bulldozer
2015-04-08 16:29:55 +02:00
Zhang Xianyi
0a3d3b945d
Refs #535 . Fix the wrong vector instruction in sgemm sandy bridge kernel.
2015-04-08 03:55:49 +08:00
Werner Saar
60c6dec6e6
updated some lines for bulldozer
2015-04-06 18:47:16 +02:00
Werner Saar
47898cca35
added optimized saxpy- and daxpy-kernel for sandybridge
2015-04-06 16:05:16 +02:00
Werner Saar
53bb924287
added optimized saxpy- and daxpy-kernel for haswell
2015-04-06 12:33:16 +02:00
Werner Saar
a901b065d3
added optimized ddot-kernel for sandybridge
2015-04-05 20:19:38 +02:00
Werner Saar
3937e2a0a0
add optimized sdot-kernel for sandybridge
2015-04-05 19:47:05 +02:00
Werner Saar
9707d608d5
removed double definition line
2015-04-05 18:35:34 +02:00
Werner Saar
701b9d7556
added optimized sdot- and ddot-kernel for HASWELL
2015-04-05 17:57:53 +02:00
Zhang Xianyi
41aad0407f
Merge pull request #482 from jeromerobert/develop
...
Allow to do gemv and ger buffer allocation on the stack
2015-01-02 02:26:17 +08:00
Werner Saar
ddf983d643
added optimizations for steamroller
2014-12-30 20:14:45 +08:00
Werner Saar
4319769b79
added target processor STEAMROLLER
2014-12-28 20:16:46 +08:00
Jerome Robert
e9d9a8eae3
Allow to do gemv and ger buffer allocation on the stack
...
ger and gemv call blas_memory_alloc/free which in their turn
call blas_lock. blas_lock create thread contention when matrices
are small and the number of thread is high enough. We avoid
call blas_memory_alloc by replacing it with stack allocation.
This can be enabled with:
make -DMAX_STACK_ALLOC=2048
The given size (in byte) must be high enough to avoid thread contention
and small enough to avoid stack overflow.
Fix #478
2014-12-27 14:33:12 +01:00
Werner Saar
587e16fba3
Ref #458 : Backport, sandybrigde uses nehalem zgemm kernel
2014-12-22 17:01:18 +01:00
Werner Saar
6261342de3
small optimization on dgemm_kernel for N=1
2014-12-18 20:35:51 +01:00
Werner Saar
bc5fff7085
changed inline assembler labels to short form
2014-12-07 12:38:54 +01:00
Zhang Xianyi
0cf29ba6d2
Fixed a bug of sgemm sandy bridge kernel.
...
Reported by Julia project. JuliaLang/julia#9084
2014-12-03 17:38:41 +08:00
Zhang Xianyi
2fb02626da
Update organization info.
2014-11-25 15:28:58 +08:00
Zhang Xianyi
a85c2785ae
Refs #467 . Added generic kernel file for x86_64.
2014-11-24 15:34:48 +08:00
wernsaar
b7c9566eea
removed obsolete gemv kernel files
2014-09-14 11:00:53 +02:00
wernsaar
6df1b0be81
optimized zgemv_n_microk_sandy-4.c
2014-09-14 10:21:22 +02:00
wernsaar
2ac1e076c1
added optimized zgemv_n kernel for sandybridge
2014-09-14 09:02:05 +02:00
wernsaar
9908b6031c
bugfix in KERNEL.PILEDRIVER
2014-09-13 16:26:53 +02:00
wernsaar
8f100a14f2
optimized cgemv_t kernel for haswell
2014-09-13 16:13:27 +02:00
wernsaar
53b5726b04
added optimized cgemv_t kernel for haswell
2014-09-13 15:14:12 +02:00
wernsaar
1a352b24e6
updated KERNEL.HASWELL
2014-09-13 12:23:27 +02:00
wernsaar
5194818d4b
updated zgemv_t_4.c
2014-09-13 09:48:34 +02:00
wernsaar
8a39cdb1c1
added optimized zgemv_t kernel for haswell
2014-09-13 09:47:07 +02:00
wernsaar
0a1390f2d8
enabled optimized zgemv_t kernel for bulldozer
2014-09-12 17:43:47 +02:00
wernsaar
a8b0812feb
optimized zgemv_t for bulldozer
2014-09-12 17:42:25 +02:00
wernsaar
a0fb68ab42
added optimized zgemv_t kernel for bulldozer
2014-09-12 17:04:22 +02:00
wernsaar
44c11165d5
bugfix in cgemv_t_4.c
2014-09-12 14:12:24 +02:00
wernsaar
564be4eb72
added optimized cgemv_t kernel
2014-09-12 13:38:01 +02:00
wernsaar
107c3ea7d5
added optimized zgemv_t routine
2014-09-12 12:35:20 +02:00
wernsaar
bb8d698335
optimized zgemv_n_microk_haswell-4.c for small size
2014-09-11 13:44:55 +02:00
wernsaar
e0192a6914
bugfix in zgemv_n_4.c
2014-09-11 13:18:00 +02:00
wernsaar
bced4594bb
added optimized zgemv_n kernel
2014-09-11 12:34:57 +02:00
wernsaar
cafba99b6b
bufix in cgemv_n_microk_haswell-4.c
2014-09-11 11:12:44 +02:00
wernsaar
ac8f232b2a
more optimizations
2014-09-11 10:25:48 +02:00
wernsaar
f98e1244c4
optimized cgemv_n_4.c
2014-09-10 19:26:14 +02:00
wernsaar
be95700b30
added optimized cgemv_kernel for haswell
2014-09-10 14:11:24 +02:00
wernsaar
4aa534ae93
added cgemv_n kernel, optimized for small sizes
2014-09-10 13:45:13 +02:00
wernsaar
baa46e4fba
added and tested optimized dgemv_n kernel for haswell
2014-09-09 16:17:45 +02:00
wernsaar
faab7a181d
added optimized dgemv_n kernel for haswell
2014-09-09 15:32:32 +02:00
wernsaar
8109d8232c
optimized dgemv_t kernel for haswell
2014-09-09 14:38:08 +02:00
wernsaar
debc6d1a05
bugfix in KERNEL.HASWELL
2014-09-09 14:04:44 +02:00
wernsaar
e73a0113ec
added optimized gemv kernels
2014-09-09 13:54:55 +02:00
wernsaar
44f2bf9bae
added optimized dgemv_t kernel for haswell
2014-09-09 13:34:22 +02:00
wernsaar
cd34e9701b
removed obsolete files
2014-09-08 19:15:31 +02:00
wernsaar
658939faaa
optimized dgemv_n kernel for small sizes
2014-09-08 15:22:35 +02:00
wernsaar
c4d9d4e5f8
added haswell optimized kernel
2014-09-08 12:25:16 +02:00
wernsaar
7c0a94ff47
bugfix in sgemv_n_microk_haswell-4.c
2014-09-08 10:54:33 +02:00
wernsaar
cbbc80aad3
added optimized sgemv_t kernel for haswell
2014-09-08 10:13:39 +02:00
wernsaar
2be5c7a640
bugfix for windows
2014-09-07 21:48:42 +02:00
wernsaar
80f7786875
enabled optimized sgemv kernels for piledriver
2014-09-07 21:13:57 +02:00
wernsaar
553e275407
optimized sgemv_n kernel for sandybridge
2014-09-07 20:53:30 +02:00
wernsaar
7b3932b3f3
optimized sgemv_n kernel for nehalem
2014-09-07 19:20:08 +02:00
wernsaar
75207b1148
optimized sgemv_n for very small size of m
2014-09-07 18:23:48 +02:00
wernsaar
274828fa50
optimizations for very small sizes
2014-09-07 13:45:03 +02:00
wernsaar
5ae1731fe6
better optimzations for sgemv_t kernel
2014-09-06 21:28:57 +02:00
wernsaar
c8eaf3ae2d
optimized sgemv_t_4 kernel for very small sizes
2014-09-06 19:41:57 +02:00
wernsaar
3a7ab47ee9
optimized sgemv_t
2014-09-06 18:34:25 +02:00
wernsaar
cf5544b417
optimization for small size
2014-09-06 13:17:56 +02:00
wernsaar
d143f84dd2
added optimized sgemv_n kernel for haswell
2014-09-06 12:08:48 +02:00
wernsaar
a64fe9bcc9
added optimized sgemv_n kernel for sandybridge
2014-09-06 08:41:53 +02:00
wernsaar
6df7a88930
optimized sgemv_t for sandybridge
2014-09-05 10:22:50 +02:00
wernsaar
53de943690
bugfix for sgemv_n_4.c
2014-09-04 18:55:52 +02:00
wernsaar
7f910010a0
optimized sgemv_n kernel for small sizes
2014-09-04 13:09:27 +02:00
wernsaar
3a5d8dbff9
optimized sgemv_n_4.c
2014-09-03 15:34:30 +02:00
wernsaar
2a60c6d4b0
optimized sgemv_n for small sizes
2014-09-03 14:48:45 +02:00
wernsaar
0fc560ba23
bugfix for buffer overflow
2014-09-03 10:13:47 +02:00
wernsaar
f3b50dcf5b
removed obsolete instructions from sgemv_t_4.c
2014-09-02 13:35:41 +02:00
wernsaar
93eaba959d
optimized sgemv_t for bulldozer
2014-09-02 12:42:36 +02:00
wernsaar
9570e56965
optimized sgemv_t_4.c for small sizes
2014-09-01 15:11:37 +02:00
wernsaar
bc99faef1b
optimized sgemv_t_4.c for uneven sizes
2014-08-31 14:33:15 +02:00
wernsaar
848c0f16f7
optimized sgemv_t_4.c for small size
2014-08-31 13:23:44 +02:00
wernsaar
53e6dbf6ca
optimized sgemv_t kernel for small sizes
2014-08-30 13:36:27 +02:00
wernsaar
20cd850125
modification for clang compiler
2014-08-27 09:00:20 +02:00
wernsaar
3885eebdb8
added optimized zaxpy bulldozer kernel
2014-08-25 15:52:35 +02:00
wernsaar
ee74445155
added optimized caxpy kernel for bulldozer
2014-08-25 14:53:28 +02:00
wernsaar
9d2ace8bac
added optimized daxpy kernel for bulldozer
2014-08-24 10:57:12 +02:00
wernsaar
b55f997302
added optimized daxpy kernel for nehalem
2014-08-23 17:53:07 +02:00
wernsaar
e45c960c2c
added optimized saxpy kernel for nehalem
2014-08-23 17:15:21 +02:00
wernsaar
ac76b6267f
added optimized dgemv_n kernel for nehalem
2014-08-23 10:40:57 +02:00
wernsaar
f1b96c4846
added optimized ddot kernel for bulldozer
2014-08-22 21:19:29 +02:00
wernsaar
16d6be852d
added optimized ddot kernel for nehalem
2014-08-22 20:34:41 +02:00
wernsaar
95a707ced3
update of KERNEL.BULLDOZER
2014-08-22 17:01:27 +02:00
wernsaar
5d97b0754c
added optimized sdot kernel for nehalem
2014-08-22 17:00:26 +02:00
wernsaar
8a9e868919
added optimized sdot for bulldozer
2014-08-22 14:29:17 +02:00
wernsaar
c8b0645266
added optimized symv_L kernels for nehalem
2014-08-21 14:27:00 +02:00
wernsaar
ec05ff3f64
added optimized ssymv_L kernel for bulldozer
2014-08-21 13:32:06 +02:00
wernsaar
f6f9122660
added optimized dsymv_L kernel for bulldozer
2014-08-21 13:02:53 +02:00
wernsaar
8247f38dc1
added optimized dsymv_U kernel for nehalem
2014-08-20 09:58:04 +02:00
wernsaar
ef6374196d
updated optimized dsymv_U kernel for bulldozer
2014-08-20 09:00:56 +02:00
wernsaar
f824c2b751
updated optimized ssymv_U for bulldozer
2014-08-19 19:25:03 +02:00
wernsaar
4ba4ab623f
added optimized ssymv_U kernel for nehalem
2014-08-19 17:09:45 +02:00
wernsaar
4f39447c05
added optimized ssymv_U kernel for bulldozer
2014-08-18 13:52:24 +02:00
wernsaar
74c9465672
added optimized dsymv_U kernel for bulldozer
2014-08-18 12:18:10 +02:00
wernsaar
11eab4c019
added optimized cgemv_n for haswell
2014-08-14 19:00:30 +02:00
wernsaar
4568d32b6b
added optimized cgemv_t kernel for haswell
2014-08-14 14:10:29 +02:00
wernsaar
c1a6374c6f
optimized zgemv_n kernel for sandybridge
2014-08-13 16:10:03 +02:00
wernsaar
2470129132
added fast return, if m or n < 1
2014-08-13 13:54:19 +02:00
wernsaar
8c582d362d
optimized zgemv_t_microk_haswell-2.c
2014-08-13 13:42:22 +02:00
wernsaar
11e34ddd1b
bugfix for zgemv_n_microk_haswell-2.c
2014-08-13 12:54:18 +02:00
wernsaar
9528f0d9ee
bugfix in zgemv_n_microk_sandy-2.c
2014-08-13 12:18:03 +02:00
wernsaar
b06550519e
added optimized cgemv_t c-kernel
2014-08-12 12:15:41 +02:00
wernsaar
6093ee5363
bugfix in zgemv_n_microk_haswell-2.c
2014-08-12 10:02:25 +02:00
wernsaar
07c66b1960
modified algorithm for better numerical stability
2014-08-12 08:35:42 +02:00
wernsaar
58b075daef
added optimized zgemv_t kernel for haswell
2014-08-11 16:57:52 +02:00
wernsaar
09fcd3a341
add optimized zgemv_t kernel for bulldozer
2014-08-11 14:19:25 +02:00
wernsaar
726ad085cb
added optimized zgemv_t for haswell
2014-08-11 13:10:12 +02:00
wernsaar
6fe416976d
added optimimized zgemv_t c-kernel
2014-08-11 09:13:18 +02:00
wernsaar
dbc2eff029
disabled optimized haswell zgemv_n kernel for windows ( bad rounding )
2014-08-10 11:57:24 +02:00
wernsaar
462b4885ff
added optimized zgemv_n kernel for haswell
2014-08-10 08:39:17 +02:00
wernsaar
aa54fe064c
added zgemv_n c-function
2014-08-07 22:30:20 +02:00
wernsaar
006ef3ea01
added optimized dgemv_t kernel for haswell
2014-08-07 10:08:54 +02:00
wernsaar
60f17628cc
added optimized dgemv_n kernel for haswell
2014-08-07 09:18:02 +02:00
wernsaar
c9bad1403a
added optimized sgemv_t kernel for sandybridge
2014-08-07 07:49:33 +02:00
wernsaar
2f8927376f
enabled optimized nehalem sgemv_t kernel for windows
2014-08-06 16:58:21 +02:00
wernsaar
d945a2b06d
added optimized sgemv_t kernel for nehalem
2014-08-06 16:21:48 +02:00
wernsaar
ca6c8d06ce
enabled optimized sgemv kernels for windows
2014-08-06 14:24:36 +02:00
wernsaar
7aa43c8928
enabled optimized sgemv kernels for windows
2014-08-06 14:06:30 +02:00
wernsaar
891b960854
added optimized sgemv_t kernel for haswell
2014-08-06 13:42:41 +02:00
wernsaar
95a8caa2f3
added optimized sgemv_t kernel
2014-08-06 12:12:17 +02:00
wernsaar
8c05b8105b
bugfix in sgemv_n.c
2014-08-05 20:14:29 +02:00
wernsaar
c80084a98f
changed default x86_64 sgemv_n kernel to sgemv_n.c
2014-08-05 19:42:56 +02:00
wernsaar
2bab92961f
enabled optimized sgemv_n kernels for windows
2014-08-05 14:52:54 +02:00
wernsaar
9175b8bd5f
changed long to blaslong for windows compatibility
2014-08-05 13:28:39 +02:00
wernsaar
793f2d43b0
added optimized sgemv_n kernel for nehalem
2014-08-05 10:50:08 +02:00
wernsaar
a4dde45f87
optimized sgemv_n kernel for sandybridge
2014-08-05 08:53:09 +02:00
wernsaar
7fa7ea3e1e
updated haswell optimized sgmv_n kernel
2014-08-05 08:04:47 +02:00
wernsaar
3fbc13eb65
modified sgemv_n for haswell
2014-08-04 16:22:11 +02:00
wernsaar
db6917303f
added a better optimized sgemv_n kernel for bulldozer and piledriver
2014-08-04 14:29:01 +02:00
wernsaar
5087096711
optimization of sandybridge cgemm-kernel
2014-07-29 19:07:21 +02:00
wernsaar
46bc4fd50c
optimized cgemm kernel for haswell
2014-07-29 08:53:09 +02:00
wernsaar
1cc02b4337
optimized sgemm kernel for haswell
2014-07-28 11:50:01 +02:00
wernsaar
1d33547222
optimized zgemm kernel for haswell
2014-07-27 11:51:42 +02:00
wernsaar
6acbafe45b
added sgemv_n microkernel for haswell
2014-07-20 14:52:25 +02:00
wernsaar
5392d11b04
optimized sgemv_n_microk_sandy.c
2014-07-20 14:08:04 +02:00
wernsaar
c0fe95fb72
added sgemv_n microkernel for sandybridge
2014-07-20 13:17:47 +02:00
wernsaar
d9d4077c93
added sgemv_t microkernel for haswell
2014-07-20 11:30:32 +02:00
wernsaar
02eb72ac42
bugfix in sgemv_t_microk_sandy.c
2014-07-20 10:48:41 +02:00
wernsaar
c06f9986d4
added sgemv_t microkernel for sandybridge
2014-07-20 10:21:08 +02:00
wernsaar
2cce125c79
added optimized sgemv_t for bulldozer and piledriver
2014-07-19 15:48:07 +02:00
wernsaar
b3938fe371
don't use this sgemv_n on Windows
2014-07-19 07:15:34 +02:00
wernsaar
c8a4a56177
performance optimizations for sgemv_n
2014-07-18 11:25:21 +02:00
wernsaar
3c5732615d
added blocked sgemv_n and microkernel for bulldozer and piledriver
2014-07-17 23:15:07 +02:00
wernsaar
880597b301
segment violation in sgemv kernels
2014-07-13 10:46:14 +02:00
wernsaar
13348b2137
removed reference to daxpy_bulldozer kernel (Windows bug in lapack-test)
2014-07-06 16:39:32 +02:00
wernsaar
d5b976f92d
fallback to zgemm_kernel_4x2_sse.S
2014-07-06 11:05:28 +02:00
wernsaar
e0c080a28c
removed reference to zgemm_kernel_4x2_sse3.S (bug in lapack-test)
2014-07-05 16:13:17 +02:00
wernsaar
b079df9ef4
added optimized sdot- and dsdot-kernel, written in C
2014-06-30 14:46:38 +02:00
wernsaar
01a119abfc
enabled SMP for sbmv and zsbmv, but only for 64bit binaries
2014-06-29 20:35:56 +02:00
Zhang Xianyi
99efbbbad5
Fixed #395 . Enable optimized cgemm for Sandybridge. Added optimized sdot kernel.
...
Fixed c/zgemm, zgemv computational error of haswell, piledriver, bullldozer, and
barcelona on Windows.
Merge branch 'develop' of https://github.com/wernsaar/OpenBLAS into wernsaar-develop
Conflicts:
kernel/Makefile.L1
kernel/x86_64/KERNEL
param.h
2014-06-29 10:34:51 +08:00
wernsaar
22e5aee2dd
fixed zgemv bug for older AMD Processors
2014-06-28 19:04:49 +02:00
wernsaar
35d37e124f
bugfix for barcelona zgemv-kernel
2014-06-28 12:36:11 +02:00
wernsaar
d8ba46efdb
bugfix for bulldozer cgemm-, zgemm- and zgemv-kernel
2014-06-28 12:16:20 +02:00
wernsaar
a15f22a1f6
bugfix for piledriver cgemm-, zgemm- and zgemv-kernel
2014-06-28 11:46:58 +02:00
wernsaar
b94ea89f52
bugfix for haswell cgemm- and zgemm-kernel
2014-06-28 10:22:40 +02:00
wernsaar
35f668bb14
bugfix for cgemm_kernel_8x2_sandy.S
2014-06-28 10:01:56 +02:00
Timothy Gu
6c2ead30f0
Remove all trailing whitespace except lapack-netlib
...
Signed-off-by: Timothy Gu <timothygu99@gmail.com>
2014-06-27 12:05:18 -07:00
wernsaar
365e8de346
added optimized cgemm-kernel for SANDYBRIDGE
2014-06-27 13:40:29 +02:00
wernsaar
578d1b6219
added DSDOT definition and enabled optimized sdot kernel
2014-06-27 11:30:29 +02:00
wernsaar
dabab2b5f4
added new optimized sgemm kernel for SANDYBRIGE
2014-06-26 21:42:08 +02:00
wernsaar
aa2709c4e0
enabled optimized dgemm kernel for NEHALEM
2014-06-26 12:22:29 +02:00
wernsaar
a13bcc1716
enabled optimized sgemv kernel for barcelona and piledriver
2014-06-25 13:50:57 +02:00
wernsaar
d2c82d7543
enabled optimized sgemv kernel for HASWELL
2014-06-25 12:56:45 +02:00
wernsaar
0517672dd0
enabled optimized sgemv kernels for nehalem, sandybridge and bulldozer
2014-06-25 12:38:14 +02:00
wernsaar
23203d52c1
Ref #380 : lowered stack usage for haswell kernels
2014-06-19 14:31:52 +02:00
wernsaar
73545a79cd
Ref #380 : lowered stack usage for piledriver and bulldozer kernels
2014-06-19 14:02:14 +02:00
wernsaar
5f3b68b4d4
replaced sgemm and cgemm kernels because lapack bugs
2014-05-10 11:24:07 +02:00
wernsaar
2424af62fd
replaced dgemm-kernel because bug in lapack
2014-05-10 10:52:37 +02:00
wernsaar
793509a3b5
replaced files for sdot, sgemv_n and sgemv_t for bug #348
2014-05-06 15:29:39 +02:00
wernsaar
47b22763f8
reduced stack usage on windows to 16K
2014-04-24 14:09:26 +02:00
Zhang Xianyi
9a557e90da
Refs #340 . Fixed SEGFAULT bug of dgemv_n on OSX.
2014-02-15 23:23:15 +08:00
wangqian
2d557eb1e0
Fixed computational error of dgemv_n.
2014-02-04 21:47:51 +08:00
Zhang Xianyi
05bb391c3a
Refs #330 . Fixed the compatible issue with clang on Mac OSX.
2013-12-16 20:31:17 +08:00
Zhang Xianyi
9b5be29886
Refs #310 . Fixed Segfault bug on nehalem when Julia calling dgeqrt3 on OSX.
...
Please also check JuliaLang/julia#4099
Julia test script:
A=rand(256, 256)
qrfact(A)
I found this was a bug in kernel/x86_64/dgemm_ncopy_8.S.
However, I cannot use gdb with julia. Thus, this is a walkaround fix.
2013-12-12 23:23:04 +08:00
wernsaar
034a5b2083
modified zsymv
2013-12-01 21:07:49 +01:00
wernsaar
27d4234d4d
merged symv
2013-12-01 20:56:02 +01:00
wernsaar
b3254eecaf
Merge remote branch 'origin/haswell' into develop
2013-12-01 18:09:12 +01:00
wernsaar
0b6e13b689
Merge remote branch 'origin/develop' into haswell
2013-12-01 13:38:11 +01:00