Merge branch 'develop'
commit
7e4e195e82
|
|
@ -1,6 +1,21 @@
|
|||
OpenBLAS ChangeLog
|
||||
====================================================================
|
||||
Version 0.2.10
|
||||
Version 0.2.12
|
||||
13-Oct-2014
|
||||
common:
|
||||
* Added CBLAS interface for ?omatcopy and ?imatcopy.
|
||||
* Enable ?gemm3m functions.
|
||||
* Added benchmark for ?gemm3m.
|
||||
* Optimized multithreading lower limits.
|
||||
* Disabled SYMM3M and HEMM3M functions
|
||||
because of segmentation faults.
|
||||
|
||||
x86/x86-64:
|
||||
* Improved axpy and symv performance on AMD Bulldozer.
|
||||
* Improved gemv performance on modern Intel and AMD CPUs.
|
||||
|
||||
====================================================================
|
||||
Version 0.2.11
|
||||
18-Aug-2014
|
||||
common:
|
||||
* Added some benchmark codes.
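The `?gemm3m` entries in the 0.2.12 ChangeLog above refer to complex matrix multiplication in the "3M" style, where a complex product is formed from three real multiplications instead of four. The sketch below shows that idea on a single pair of complex scalars. The helper name `complex_mul_3m` is hypothetical and is not part of OpenBLAS; the actual kernels apply the same identity at the matrix level and may arrange the arithmetic differently.

```c
#include <assert.h>

/* Hypothetical helper (not an OpenBLAS function): computes
   (a + bi) * (c + di) with three real multiplications. */
static void complex_mul_3m(double a, double b, double c, double d,
                           double *re, double *im)
{
    assert(re != 0 && im != 0);

    double t1 = a * c;             /* first real product  */
    double t2 = b * d;             /* second real product */
    double t3 = (a + b) * (c + d); /* third real product  */

    *re = t1 - t2;       /* ac - bd */
    *im = t3 - t1 - t2;  /* (a + b)(c + d) - ac - bd = ad + bc */
}
```

For example, calling it with (1 + 2i) and (3 + 4i) yields -5 + 10i. The trade-off is one fewer multiplication in exchange for extra additions, which can slightly change the rounding behaviour of the imaginary part compared with the conventional four-multiplication form.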
|
||||
|
|
|
|||
|
|
@ -3,7 +3,7 @@
|
|||
#
|
||||
|
||||
# This library's version
|
||||
VERSION = 0.2.11
|
||||
VERSION = 0.2.12
|
||||
|
||||
# If you set the suffix, the library name will be libopenblas_$(LIBNAMESUFFIX).a
|
||||
# and libopenblas_$(LIBNAMESUFFIX).so. Meanwhile, the soname in shared library
|
||||
|
|
|
|||
|
|
@ -339,7 +339,7 @@ FCOMMON_OPT += -m128bit-long-double
|
|||
endif
|
||||
ifeq ($(C_COMPILER), CLANG)
|
||||
EXPRECISION = 1
|
||||
CCOMMON_OPT += -DEXPRECISION
|
||||
CCOMMON_OPT += -DEXPRECISION
|
||||
FCOMMON_OPT += -m128bit-long-double
|
||||
endif
|
||||
endif
|
||||
|
|
@ -350,6 +350,7 @@ ifeq ($(C_COMPILER), INTEL)
|
|||
CCOMMON_OPT += -wd981
|
||||
endif
|
||||
|
||||
|
||||
ifeq ($(USE_OPENMP), 1)
|
||||
# ifeq logical or. GCC or LSB
|
||||
ifeq ($(C_COMPILER), $(filter $(C_COMPILER),GCC LSB))
|
||||
|
|
|
|||
19
README.md
|
|
@ -55,16 +55,23 @@ Please read GotoBLAS_01Readme.txt
|
|||
|
||||
#### x86/x86-64:
|
||||
- **Intel Xeon 56xx (Westmere)**: Used GotoBLAS2 Nehalem codes.
|
||||
- **Intel Sandy Bridge**: Optimized Level-3 BLAS with AVX on x86-64.
|
||||
- **Intel Haswell**: Optimized Level-3 BLAS with AVX on x86-64 (identical to Sandy Bridge).
|
||||
- **Intel Sandy Bridge**: Optimized Level-3 and Level-2 BLAS with AVX on x86-64.
|
||||
- **Intel Haswell**: Optimized Level-3 and Level-2 BLAS with AVX2 and FMA on x86-64.
|
||||
- **AMD Bobcat**: Used GotoBLAS2 Barcelona codes.
|
||||
- **AMD Bulldozer**: x86-64 S/DGEMM AVX kernels. (Thank Werner Saar)
|
||||
- **AMD PILEDRIVER**: Used Bulldozer codes.
|
||||
- **AMD Bulldozer**: x86-64 ?GEMM FMA4 kernels. (Thank Werner Saar)
|
||||
- **AMD PILEDRIVER**: Uses Bulldozer codes with some optimizations.
|
||||
|
||||
#### MIPS64:
|
||||
- **ICT Loongson 3A**: Optimized Level-3 BLAS and the part of Level-1,2.
|
||||
- **ICT Loongson 3B**: Experimental
|
||||
|
||||
#### ARM:
|
||||
- **ARMV6**: Optimized BLAS for vfpv2 and vfpv3-d16 ( e.g. BCM2835, Cortex M0+ )
|
||||
- **ARMV7**: Optimized BLAS for vfpv3-d32 ( e.g. Cortex A8, A9 and A15 )
|
||||
|
||||
#### ARM64:
|
||||
- **ARMV8**: Experimental
|
||||
|
||||
### Support OS:
|
||||
- **GNU/Linux**
|
||||
- **MingWin/Windows**: Please read <https://github.com/xianyi/OpenBLAS/wiki/How-to-use-OpenBLAS-in-Microsoft-Visual-Studio>.
|
||||
|
|
@ -116,8 +123,8 @@ Please see Changelog.txt to obtain the differences between GotoBLAS2 1.13 BSD ve
|
|||
* Please read [Faq](https://github.com/xianyi/OpenBLAS/wiki/Faq) at first.
|
||||
* Please use gcc version 4.6 and above to compile Sandy Bridge AVX kernels on Linux/MingW/BSD.
|
||||
* Please use Clang version 3.1 and above to compile the library on Sandy Bridge microarchitecture. The Clang 3.0 will generate the wrong AVX binary code.
|
||||
* The number of CPUs/Cores should be less than or equal to 256.
|
||||
* On Linux, OpenBLAS sets the processor affinity by default. This may cause [the conflict with R parallel](https://stat.ethz.ch/pipermail/r-sig-hpc/2012-April/001348.html). You can build the library with NO_AFFINITY=1.
|
||||
* The number of CPUs/Cores should be less than or equal to 256. On Linux x86_64 (amd64), there is experimental support for up to 1024 CPUs/Cores and 128 NUMA nodes if you build the library with BIGNUMA=1.
|
||||
* OpenBLAS does not set processor affinity by default. On Linux, you can enable processor affinity by commenting out the line NO_AFFINITY=1 in Makefile.rule, but this may cause [a conflict with R parallel](https://stat.ethz.ch/pipermail/r-sig-hpc/2012-April/001348.html).
|
||||
* On Loongson 3A, make test may fail because of a pthread_create error (the error code is EAGAIN). However, the same test case passes when you run it directly from the shell.
|
||||
|
||||
## Contributing
|
||||
|
|
|
|||
|
|
@ -19,6 +19,7 @@ PENRYN
|
|||
DUNNINGTON
|
||||
NEHALEM
|
||||
SANDYBRIDGE
|
||||
HASWELL
|
||||
ATOM
|
||||
|
||||
b)AMD CPU:
|
||||
|
|
@ -30,6 +31,7 @@ SHANGHAI
|
|||
ISTANBUL
|
||||
BOBCAT
|
||||
BULLDOZER
|
||||
PILEDRIVER
|
||||
|
||||
c)VIA CPU:
|
||||
SSE_GENERIC
|
||||
|
|
@ -59,3 +61,7 @@ ITANIUM2
|
|||
SPARC
|
||||
SPARCV7
|
||||
|
||||
6.ARM CPU:
|
||||
ARMV7
|
||||
ARMV6
|
||||
ARMV5
|
||||
|
|
|
|||
|
|
@ -35,7 +35,10 @@ goto :: slinpack.goto dlinpack.goto clinpack.goto zlinpack.goto \
|
|||
ssyrk.goto dsyrk.goto csyrk.goto zsyrk.goto \
|
||||
ssyr2k.goto dsyr2k.goto csyr2k.goto zsyr2k.goto \
|
||||
sger.goto dger.goto \
|
||||
ssymv.goto dsymv.goto \
|
||||
sdot.goto ddot.goto \
|
||||
saxpy.goto daxpy.goto caxpy.goto zaxpy.goto \
|
||||
ssymv.goto dsymv.goto csymv.goto zsymv.goto \
|
||||
chemv.goto zhemv.goto \
|
||||
chemm.goto zhemm.goto \
|
||||
cherk.goto zherk.goto \
|
||||
cher2k.goto zher2k.goto \
|
||||
|
|
@ -53,7 +56,10 @@ acml :: slinpack.acml dlinpack.acml clinpack.acml zlinpack.acml \
|
|||
ssyrk.acml dsyrk.acml csyrk.acml zsyrk.acml \
|
||||
ssyr2k.acml dsyr2k.acml csyr2k.acml zsyr2k.acml \
|
||||
sger.acml dger.acml \
|
||||
ssymv.acml dsymv.acml \
|
||||
sdot.acml ddot.acml \
|
||||
saxpy.acml daxpy.acml caxpy.acml zaxpy.acml \
|
||||
ssymv.acml dsymv.acml csymv.acml zsymv.acml \
|
||||
chemv.acml zhemv.acml \
|
||||
chemm.acml zhemm.acml \
|
||||
cherk.acml zherk.acml \
|
||||
cher2k.acml zher2k.acml \
|
||||
|
|
@ -71,7 +77,10 @@ atlas :: slinpack.atlas dlinpack.atlas clinpack.atlas zlinpack.atlas \
|
|||
ssyrk.atlas dsyrk.atlas csyrk.atlas zsyrk.atlas \
|
||||
ssyr2k.atlas dsyr2k.atlas csyr2k.atlas zsyr2k.atlas \
|
||||
sger.atlas dger.atlas \
|
||||
ssymv.atlas dsymv.atlas \
|
||||
sdot.atlas ddot.atlas \
|
||||
saxpy.atlas daxpy.atlas caxpy.atlas zaxpy.atlas \
|
||||
ssymv.atlas dsymv.atlas csymv.atlas zsymv.atlas \
|
||||
chemv.atlas zhemv.atlas \
|
||||
chemm.acml zhemm.acml \
|
||||
chemm.atlas zhemm.atlas \
|
||||
cherk.atlas zherk.atlas \
|
||||
|
|
@ -90,7 +99,10 @@ mkl :: slinpack.mkl dlinpack.mkl clinpack.mkl zlinpack.mkl \
|
|||
ssyrk.mkl dsyrk.mkl csyrk.mkl zsyrk.mkl \
|
||||
ssyr2k.mkl dsyr2k.mkl csyr2k.mkl zsyr2k.mkl \
|
||||
sger.mkl dger.mkl \
|
||||
ssymv.mkl dsymv.mkl \
|
||||
sdot.mkl ddot.mkl \
|
||||
saxpy.mkl daxpy.mkl caxpy.mkl zaxpy.mkl \
|
||||
ssymv.mkl dsymv.mkl csymv.mkl zsymv.mkl \
|
||||
chemv.mkl zhemv.mkl \
|
||||
chemm.mkl zhemm.mkl \
|
||||
cherk.mkl zherk.mkl \
|
||||
cher2k.mkl zher2k.mkl \
|
||||
|
|
@ -100,7 +112,12 @@ mkl :: slinpack.mkl dlinpack.mkl clinpack.mkl zlinpack.mkl \
|
|||
spotrf.mkl dpotrf.mkl cpotrf.mkl zpotrf.mkl \
|
||||
ssymm.mkl dsymm.mkl csymm.mkl zsymm.mkl
|
||||
|
||||
all :: goto atlas acml mkl
|
||||
|
||||
goto_3m :: cgemm3m.goto zgemm3m.goto
|
||||
|
||||
mkl_3m :: cgemm3m.mkl zgemm3m.mkl
|
||||
|
||||
all :: goto mkl atlas acml
|
||||
|
||||
##################################### Slinpack ####################################################
|
||||
slinpack.goto : slinpack.$(SUFFIX) ../$(LIBNAME)
|
||||
|
|
@ -732,6 +749,32 @@ dsymv.atlas : dsymv.$(SUFFIX)
|
|||
dsymv.mkl : dsymv.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBMKL) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
##################################### Csymv ####################################################
|
||||
csymv.goto : csymv.$(SUFFIX) ../$(LIBNAME)
|
||||
$(CC) $(CFLAGS) -o $(@F) $^ $(CEXTRALIB) $(EXTRALIB) -lm
|
||||
|
||||
csymv.acml : csymv.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBACML) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
csymv.atlas : csymv.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBATLAS) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
csymv.mkl : csymv.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBMKL) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
##################################### Zsymv ####################################################
|
||||
zsymv.goto : zsymv.$(SUFFIX) ../$(LIBNAME)
|
||||
$(CC) $(CFLAGS) -o $(@F) $^ $(CEXTRALIB) $(EXTRALIB) -lm
|
||||
|
||||
zsymv.acml : zsymv.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBACML) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
zsymv.atlas : zsymv.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBATLAS) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
zsymv.mkl : zsymv.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBMKL) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
##################################### Sgeev ####################################################
|
||||
sgeev.goto : sgeev.$(SUFFIX) ../$(LIBNAME)
|
||||
$(CC) $(CFLAGS) -o $(@F) $^ $(CEXTRALIB) $(EXTRALIB) -lm
|
||||
|
|
@ -896,6 +939,131 @@ zpotrf.atlas : zpotrf.$(SUFFIX)
|
|||
zpotrf.mkl : zpotrf.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBMKL) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
##################################### Chemv ####################################################
|
||||
|
||||
chemv.goto : chemv.$(SUFFIX) ../$(LIBNAME)
|
||||
$(CC) $(CFLAGS) -o $(@F) $^ $(CEXTRALIB) $(EXTRALIB) -lm
|
||||
|
||||
chemv.acml : chemv.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBACML) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
chemv.atlas : chemv.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBATLAS) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
chemv.mkl : chemv.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBMKL) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
##################################### Zhemv ####################################################
|
||||
|
||||
zhemv.goto : zhemv.$(SUFFIX) ../$(LIBNAME)
|
||||
$(CC) $(CFLAGS) -o $(@F) $^ $(CEXTRALIB) $(EXTRALIB) -lm
|
||||
|
||||
zhemv.acml : zhemv.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBACML) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
zhemv.atlas : zhemv.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBATLAS) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
zhemv.mkl : zhemv.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBMKL) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
##################################### Sdot ####################################################
|
||||
sdot.goto : sdot.$(SUFFIX) ../$(LIBNAME)
|
||||
$(CC) $(CFLAGS) -o $(@F) $^ $(CEXTRALIB) $(EXTRALIB) -lm
|
||||
|
||||
sdot.acml : sdot.$(SUFFIX)
|
||||
$(CC) $(CFLAGS) -o $(@F) $^ $(LIBACML) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
sdot.atlas : sdot.$(SUFFIX)
|
||||
$(CC) $(CFLAGS) -o $(@F) $^ $(LIBATLAS) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
sdot.mkl : sdot.$(SUFFIX)
|
||||
$(CC) $(CFLAGS) -o $(@F) $^ $(LIBMKL) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
##################################### Ddot ####################################################
|
||||
ddot.goto : ddot.$(SUFFIX) ../$(LIBNAME)
|
||||
$(CC) $(CFLAGS) -o $(@F) $^ $(CEXTRALIB) $(EXTRALIB) -lm
|
||||
|
||||
ddot.acml : ddot.$(SUFFIX)
|
||||
$(CC) $(CFLAGS) -o $(@F) $^ $(LIBACML) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
ddot.atlas : ddot.$(SUFFIX)
|
||||
$(CC) $(CFLAGS) -o $(@F) $^ $(LIBATLAS) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
ddot.mkl : ddot.$(SUFFIX)
|
||||
$(CC) $(CFLAGS) -o $(@F) $^ $(LIBMKL) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
##################################### Saxpy ####################################################
|
||||
saxpy.goto : saxpy.$(SUFFIX) ../$(LIBNAME)
|
||||
$(CC) $(CFLAGS) -o $(@F) $^ $(CEXTRALIB) $(EXTRALIB) -lm
|
||||
|
||||
saxpy.acml : saxpy.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBACML) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
saxpy.atlas : saxpy.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBATLAS) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
saxpy.mkl : saxpy.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBMKL) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
##################################### Daxpy ####################################################
|
||||
daxpy.goto : daxpy.$(SUFFIX) ../$(LIBNAME)
|
||||
$(CC) $(CFLAGS) -o $(@F) $^ $(CEXTRALIB) $(EXTRALIB) -lm
|
||||
|
||||
daxpy.acml : daxpy.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBACML) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
daxpy.atlas : daxpy.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBATLAS) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
daxpy.mkl : daxpy.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBMKL) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
##################################### Caxpy ####################################################
|
||||
|
||||
caxpy.goto : caxpy.$(SUFFIX) ../$(LIBNAME)
|
||||
$(CC) $(CFLAGS) -o $(@F) $^ $(CEXTRALIB) $(EXTRALIB) -lm
|
||||
|
||||
caxpy.acml : caxpy.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBACML) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
caxpy.atlas : caxpy.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBATLAS) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
caxpy.mkl : caxpy.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBMKL) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
##################################### Zaxpy ####################################################
|
||||
|
||||
zaxpy.goto : zaxpy.$(SUFFIX) ../$(LIBNAME)
|
||||
$(CC) $(CFLAGS) -o $(@F) $^ $(CEXTRALIB) $(EXTRALIB) -lm
|
||||
|
||||
zaxpy.acml : zaxpy.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBACML) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
zaxpy.atlas : zaxpy.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBATLAS) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
zaxpy.mkl : zaxpy.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBMKL) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
|
||||
##################################### Cgemm3m ####################################################
|
||||
|
||||
cgemm3m.goto : cgemm3m.$(SUFFIX) ../$(LIBNAME)
|
||||
$(CC) $(CFLAGS) -o $(@F) $^ $(CEXTRALIB) $(EXTRALIB) -lm
|
||||
|
||||
cgemm3m.mkl : cgemm3m.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBMKL) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
##################################### Zgemm3m ####################################################
|
||||
|
||||
zgemm3m.goto : zgemm3m.$(SUFFIX) ../$(LIBNAME)
|
||||
$(CC) $(CFLAGS) -o $(@F) $^ $(CEXTRALIB) $(EXTRALIB) -lm
|
||||
|
||||
zgemm3m.mkl : zgemm3m.$(SUFFIX)
|
||||
-$(CC) $(CFLAGS) -o $(@F) $^ $(LIBMKL) $(CEXTRALIB) $(EXTRALIB) $(FEXTRALIB)
|
||||
|
||||
|
||||
###################################################################################################
|
||||
|
||||
|
|
@ -1037,6 +1205,12 @@ ssymv.$(SUFFIX) : symv.c
|
|||
dsymv.$(SUFFIX) : symv.c
|
||||
$(CC) $(CFLAGS) -c -UCOMPLEX -DDOUBLE -o $(@F) $^
|
||||
|
||||
csymv.$(SUFFIX) : symv.c
|
||||
$(CC) $(CFLAGS) -c -DCOMPLEX -UDOUBLE -o $(@F) $^
|
||||
|
||||
zsymv.$(SUFFIX) : symv.c
|
||||
$(CC) $(CFLAGS) -c -DCOMPLEX -DDOUBLE -o $(@F) $^
|
||||
|
||||
sgeev.$(SUFFIX) : geev.c
|
||||
$(CC) $(CFLAGS) -c -UCOMPLEX -UDOUBLE -o $(@F) $^
|
||||
|
||||
|
|
@ -1073,8 +1247,35 @@ cpotrf.$(SUFFIX) : potrf.c
|
|||
zpotrf.$(SUFFIX) : potrf.c
|
||||
$(CC) $(CFLAGS) -c -DCOMPLEX -DDOUBLE -o $(@F) $^
|
||||
|
||||
chemv.$(SUFFIX) : hemv.c
|
||||
$(CC) $(CFLAGS) -c -DCOMPLEX -UDOUBLE -o $(@F) $^
|
||||
|
||||
zhemv.$(SUFFIX) : hemv.c
|
||||
$(CC) $(CFLAGS) -c -DCOMPLEX -DDOUBLE -o $(@F) $^
|
||||
|
||||
sdot.$(SUFFIX) : dot.c
|
||||
$(CC) $(CFLAGS) -c -UCOMPLEX -UDOUBLE -o $(@F) $^
|
||||
|
||||
ddot.$(SUFFIX) : dot.c
|
||||
$(CC) $(CFLAGS) -c -UCOMPLEX -DDOUBLE -o $(@F) $^
|
||||
|
||||
saxpy.$(SUFFIX) : axpy.c
|
||||
$(CC) $(CFLAGS) -c -UCOMPLEX -UDOUBLE -o $(@F) $^
|
||||
|
||||
daxpy.$(SUFFIX) : axpy.c
|
||||
$(CC) $(CFLAGS) -c -UCOMPLEX -DDOUBLE -o $(@F) $^
|
||||
|
||||
caxpy.$(SUFFIX) : axpy.c
|
||||
$(CC) $(CFLAGS) -c -DCOMPLEX -UDOUBLE -o $(@F) $^
|
||||
|
||||
zaxpy.$(SUFFIX) : axpy.c
|
||||
$(CC) $(CFLAGS) -c -DCOMPLEX -DDOUBLE -o $(@F) $^
|
||||
|
||||
cgemm3m.$(SUFFIX) : gemm3m.c
|
||||
$(CC) $(CFLAGS) -c -DCOMPLEX -UDOUBLE -o $(@F) $^
|
||||
|
||||
zgemm3m.$(SUFFIX) : gemm3m.c
|
||||
$(CC) $(CFLAGS) -c -DCOMPLEX -DDOUBLE -o $(@F) $^
|
||||
|
||||
|
||||
clean ::
|
||||
|
|
|
|||
|
|
@ -0,0 +1,201 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#include <stdio.h>
|
||||
#include <stdlib.h>
|
||||
#ifdef __CYGWIN32__
|
||||
#include <sys/time.h>
|
||||
#endif
|
||||
#include "common.h"
|
||||
|
||||
|
||||
#undef AXPY
|
||||
|
||||
#ifdef COMPLEX
|
||||
#ifdef DOUBLE
|
||||
#define AXPY BLASFUNC(zaxpy)
|
||||
#else
|
||||
#define AXPY BLASFUNC(caxpy)
|
||||
#endif
|
||||
#else
|
||||
#ifdef DOUBLE
|
||||
#define AXPY BLASFUNC(daxpy)
|
||||
#else
|
||||
#define AXPY BLASFUNC(saxpy)
|
||||
#endif
|
||||
#endif
|
||||
|
||||
#if defined(__WIN32__) || defined(__WIN64__)
|
||||
|
||||
#ifndef DELTA_EPOCH_IN_MICROSECS
|
||||
#define DELTA_EPOCH_IN_MICROSECS 11644473600000000ULL
|
||||
#endif
|
||||
|
||||
int gettimeofday(struct timeval *tv, void *tz){
|
||||
|
||||
FILETIME ft;
|
||||
unsigned __int64 tmpres = 0;
|
||||
static int tzflag;
|
||||
|
||||
if (NULL != tv)
|
||||
{
|
||||
GetSystemTimeAsFileTime(&ft);
|
||||
|
||||
tmpres |= ft.dwHighDateTime;
|
||||
tmpres <<= 32;
|
||||
tmpres |= ft.dwLowDateTime;
|
||||
|
||||
/*converting file time to unix epoch*/
|
||||
tmpres /= 10; /*convert into microseconds*/
|
||||
tmpres -= DELTA_EPOCH_IN_MICROSECS;
|
||||
tv->tv_sec = (long)(tmpres / 1000000UL);
|
||||
tv->tv_usec = (long)(tmpres % 1000000UL);
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
#if !defined(__WIN32__) && !defined(__WIN64__) && !defined(__CYGWIN32__) && 0
|
||||
|
||||
static void *huge_malloc(BLASLONG size){
|
||||
int shmid;
|
||||
void *address;
|
||||
|
||||
#ifndef SHM_HUGETLB
|
||||
#define SHM_HUGETLB 04000
|
||||
#endif
|
||||
|
||||
if ((shmid =shmget(IPC_PRIVATE,
|
||||
(size + HUGE_PAGESIZE) & ~(HUGE_PAGESIZE - 1),
|
||||
SHM_HUGETLB | IPC_CREAT |0600)) < 0) {
|
||||
printf( "Memory allocation failed(shmget).\n");
|
||||
exit(1);
|
||||
}
|
||||
|
||||
address = shmat(shmid, NULL, SHM_RND);
|
||||
|
||||
if ((BLASLONG)address == -1){
|
||||
printf( "Memory allocation failed(shmat).\n");
|
||||
exit(1);
|
||||
}
|
||||
|
||||
shmctl(shmid, IPC_RMID, 0);
|
||||
|
||||
return address;
|
||||
}
|
||||
|
||||
#define malloc huge_malloc
|
||||
|
||||
#endif
|
||||
|
||||
int MAIN__(int argc, char *argv[]){
|
||||
|
||||
FLOAT *x, *y;
|
||||
FLOAT alpha[2] = { 2.0, 2.0 };
|
||||
blasint m, i;
|
||||
blasint inc_x=1,inc_y=1;
|
||||
int loops = 1;
|
||||
int l;
|
||||
char *p;
|
||||
|
||||
int from = 1;
|
||||
int to = 200;
|
||||
int step = 1;
|
||||
|
||||
struct timeval start, stop;
|
||||
double time1,timeg;
|
||||
|
||||
argc--;argv++;
|
||||
|
||||
if (argc > 0) { from = atol(*argv); argc--; argv++;}
|
||||
if (argc > 0) { to = MAX(atol(*argv), from); argc--; argv++;}
|
||||
if (argc > 0) { step = atol(*argv); argc--; argv++;}
|
||||
|
||||
if ((p = getenv("OPENBLAS_LOOPS"))) loops = atoi(p);
|
||||
if ((p = getenv("OPENBLAS_INCX"))) inc_x = atoi(p);
|
||||
if ((p = getenv("OPENBLAS_INCY"))) inc_y = atoi(p);
|
||||
|
||||
fprintf(stderr, "From : %3d To : %3d Step = %3d Inc_x = %d Inc_y = %d Loops = %d\n", from, to, step,inc_x,inc_y,loops);
|
||||
|
||||
if (( x = (FLOAT *)malloc(sizeof(FLOAT) * to * abs(inc_x) * COMPSIZE)) == NULL){
|
||||
fprintf(stderr,"Out of Memory!!\n");exit(1);
|
||||
}
|
||||
|
||||
if (( y = (FLOAT *)malloc(sizeof(FLOAT) * to * abs(inc_y) * COMPSIZE)) == NULL){
|
||||
fprintf(stderr,"Out of Memory!!\n");exit(1);
|
||||
}
|
||||
|
||||
#ifdef linux
|
||||
srandom(getpid());
|
||||
#endif
|
||||
|
||||
fprintf(stderr, " SIZE Flops\n");
|
||||
|
||||
for(m = from; m <= to; m += step)
|
||||
{
|
||||
|
||||
timeg=0;
|
||||
|
||||
fprintf(stderr, " %6d : ", (int)m);
|
||||
|
||||
|
||||
for (l=0; l<loops; l++)
|
||||
{
|
||||
|
||||
for(i = 0; i < m * COMPSIZE * abs(inc_x); i++){
|
||||
x[i] = ((FLOAT) rand() / (FLOAT) RAND_MAX) - 0.5;
|
||||
}
|
||||
|
||||
for(i = 0; i < m * COMPSIZE * abs(inc_y); i++){
|
||||
y[i] = ((FLOAT) rand() / (FLOAT) RAND_MAX) - 0.5;
|
||||
}
|
||||
gettimeofday( &start, (struct timezone *)0);
|
||||
|
||||
AXPY (&m, alpha, x, &inc_x, y, &inc_y );
|
||||
|
||||
gettimeofday( &stop, (struct timezone *)0);
|
||||
|
||||
time1 = (double)(stop.tv_sec - start.tv_sec) + (double)((stop.tv_usec - start.tv_usec)) * 1.e-6;
|
||||
|
||||
timeg += time1;
|
||||
|
||||
}
|
||||
|
||||
timeg /= loops;
|
||||
|
||||
fprintf(stderr,
|
||||
" %10.2f MFlops\n",
|
||||
COMPSIZE * COMPSIZE * 2. * (double)m / timeg * 1.e-6);
|
||||
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
void main(int argc, char *argv[]) __attribute__((weak, alias("MAIN__")));
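The benchmark above averages the wall-clock time of `loops` calls and reports `COMPSIZE * COMPSIZE * 2 * m` floating-point operations per call. A minimal sketch of that arithmetic, pulled out of the loop, is shown here. The function names `elapsed_seconds` and `axpy_mflops` are hypothetical, and `compsize` is assumed to be 1 for real data and 2 for complex data, matching how `COMPSIZE` is used in the benchmark.

```c
#include <assert.h>

/* Elapsed time in seconds between two (seconds, microseconds) samples,
   mirroring the time1 expression in the benchmark loop. */
static double elapsed_seconds(long start_sec, long start_usec,
                              long stop_sec, long stop_usec)
{
    return (double)(stop_sec - start_sec)
         + (double)(stop_usec - start_usec) * 1.e-6;
}

/* MFlops figure as printed by the benchmark: total_seconds is the time
   summed over all loops, so it is averaged before dividing. */
static double axpy_mflops(int compsize, long m, double total_seconds, int loops)
{
    assert(loops > 0 && total_seconds > 0.0);

    double avg = total_seconds / loops; /* same as timeg /= loops */
    return compsize * compsize * 2. * (double)m / avg * 1.e-6;
}
```

With this formula a real AXPY counts 2 operations per element (one multiply, one add) and a complex AXPY counts 8. Note that the microsecond difference may be negative when the seconds field rolls over, which the signed arithmetic handles correctly.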
|
||||
|
|
@ -0,0 +1,195 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#include <stdio.h>
|
||||
#include <stdlib.h>
|
||||
#ifdef __CYGWIN32__
|
||||
#include <sys/time.h>
|
||||
#endif
|
||||
#include "common.h"
|
||||
|
||||
|
||||
#undef DOT
|
||||
|
||||
|
||||
#ifdef DOUBLE
|
||||
#define DOT BLASFUNC(ddot)
|
||||
#else
|
||||
#define DOT BLASFUNC(sdot)
|
||||
#endif
|
||||
|
||||
|
||||
#if defined(__WIN32__) || defined(__WIN64__)
|
||||
|
||||
#ifndef DELTA_EPOCH_IN_MICROSECS
|
||||
#define DELTA_EPOCH_IN_MICROSECS 11644473600000000ULL
|
||||
#endif
|
||||
|
||||
int gettimeofday(struct timeval *tv, void *tz){
|
||||
|
||||
FILETIME ft;
|
||||
unsigned __int64 tmpres = 0;
|
||||
static int tzflag;
|
||||
|
||||
if (NULL != tv)
|
||||
{
|
||||
GetSystemTimeAsFileTime(&ft);
|
||||
|
||||
tmpres |= ft.dwHighDateTime;
|
||||
tmpres <<= 32;
|
||||
tmpres |= ft.dwLowDateTime;
|
||||
|
||||
/*converting file time to unix epoch*/
|
||||
tmpres /= 10; /*convert into microseconds*/
|
||||
tmpres -= DELTA_EPOCH_IN_MICROSECS;
|
||||
tv->tv_sec = (long)(tmpres / 1000000UL);
|
||||
tv->tv_usec = (long)(tmpres % 1000000UL);
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
#if !defined(__WIN32__) && !defined(__WIN64__) && !defined(__CYGWIN32__) && 0
|
||||
|
||||
static void *huge_malloc(BLASLONG size){
|
||||
int shmid;
|
||||
void *address;
|
||||
|
||||
#ifndef SHM_HUGETLB
|
||||
#define SHM_HUGETLB 04000
|
||||
#endif
|
||||
|
||||
if ((shmid =shmget(IPC_PRIVATE,
|
||||
(size + HUGE_PAGESIZE) & ~(HUGE_PAGESIZE - 1),
|
||||
SHM_HUGETLB | IPC_CREAT |0600)) < 0) {
|
||||
printf( "Memory allocation failed(shmget).\n");
|
||||
exit(1);
|
||||
}
|
||||
|
||||
address = shmat(shmid, NULL, SHM_RND);
|
||||
|
||||
if ((BLASLONG)address == -1){
|
||||
printf( "Memory allocation failed(shmat).\n");
|
||||
exit(1);
|
||||
}
|
||||
|
||||
shmctl(shmid, IPC_RMID, 0);
|
||||
|
||||
return address;
|
||||
}
|
||||
|
||||
#define malloc huge_malloc
|
||||
|
||||
#endif
|
||||
|
||||
int MAIN__(int argc, char *argv[]){
|
||||
|
||||
FLOAT *x, *y;
|
||||
FLOAT result;
|
||||
blasint m, i;
|
||||
blasint inc_x=1,inc_y=1;
|
||||
int loops = 1;
|
||||
int l;
|
||||
char *p;
|
||||
|
||||
int from = 1;
|
||||
int to = 200;
|
||||
int step = 1;
|
||||
|
||||
struct timeval start, stop;
|
||||
double time1,timeg;
|
||||
|
||||
argc--;argv++;
|
||||
|
||||
if (argc > 0) { from = atol(*argv); argc--; argv++;}
|
||||
if (argc > 0) { to = MAX(atol(*argv), from); argc--; argv++;}
|
||||
if (argc > 0) { step = atol(*argv); argc--; argv++;}
|
||||
|
||||
if ((p = getenv("OPENBLAS_LOOPS"))) loops = atoi(p);
|
||||
if ((p = getenv("OPENBLAS_INCX"))) inc_x = atoi(p);
|
||||
if ((p = getenv("OPENBLAS_INCY"))) inc_y = atoi(p);
|
||||
|
||||
fprintf(stderr, "From : %3d To : %3d Step = %3d Inc_x = %d Inc_y = %d Loops = %d\n", from, to, step,inc_x,inc_y,loops);
|
||||
|
||||
if (( x = (FLOAT *)malloc(sizeof(FLOAT) * to * abs(inc_x) * COMPSIZE)) == NULL){
|
||||
fprintf(stderr,"Out of Memory!!\n");exit(1);
|
||||
}
|
||||
|
||||
if (( y = (FLOAT *)malloc(sizeof(FLOAT) * to * abs(inc_y) * COMPSIZE)) == NULL){
|
||||
fprintf(stderr,"Out of Memory!!\n");exit(1);
|
||||
}
|
||||
|
||||
#ifdef linux
|
||||
srandom(getpid());
|
||||
#endif
|
||||
|
||||
fprintf(stderr, " SIZE Flops\n");
|
||||
|
||||
for(m = from; m <= to; m += step)
|
||||
{
|
||||
|
||||
timeg=0;
|
||||
|
||||
fprintf(stderr, " %6d : ", (int)m);
|
||||
|
||||
|
||||
for (l=0; l<loops; l++)
|
||||
{
|
||||
|
||||
for(i = 0; i < m * COMPSIZE * abs(inc_x); i++){
|
||||
x[i] = ((FLOAT) rand() / (FLOAT) RAND_MAX) - 0.5;
|
||||
}
|
||||
|
||||
for(i = 0; i < m * COMPSIZE * abs(inc_y); i++){
|
||||
y[i] = ((FLOAT) rand() / (FLOAT) RAND_MAX) - 0.5;
|
||||
}
|
||||
gettimeofday( &start, (struct timezone *)0);
|
||||
|
||||
result = DOT (&m, x, &inc_x, y, &inc_y );
|
||||
|
||||
gettimeofday( &stop, (struct timezone *)0);
|
||||
|
||||
time1 = (double)(stop.tv_sec - start.tv_sec) + (double)((stop.tv_usec - start.tv_usec)) * 1.e-6;
|
||||
|
||||
timeg += time1;
|
||||
|
||||
}
|
||||
|
||||
timeg /= loops;
|
||||
|
||||
fprintf(stderr,
|
||||
" %10.2f MFlops\n",
|
||||
COMPSIZE * COMPSIZE * 2. * (double)m / timeg * 1.e-6);
|
||||
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
void main(int argc, char *argv[]) __attribute__((weak, alias("MAIN__")));
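The Windows `gettimeofday` replacement in these benchmark files converts a `FILETIME` value (100-nanosecond ticks since 1601-01-01) into Unix seconds and microseconds. The following sketch reproduces that conversion in portable C so it can be checked without the Windows headers. The function `filetime_to_unix` is hypothetical, and it takes the two 32-bit halves as plain integers in place of `ft.dwHighDateTime` and `ft.dwLowDateTime`.

```c
#include <assert.h>
#include <stdint.h>

/* Microseconds between 1601-01-01 and 1970-01-01, as in the benchmark source. */
#define DELTA_EPOCH_IN_MICROSECS 11644473600000000ULL

/* Hypothetical portable helper: converts the high and low halves of a
   FILETIME-style tick count into Unix seconds and microseconds.
   Assumes the timestamp is not earlier than 1970-01-01. */
static void filetime_to_unix(uint32_t high, uint32_t low,
                             long *sec, long *usec)
{
    assert(sec != 0 && usec != 0);

    uint64_t ticks = ((uint64_t)high << 32) | low; /* 100 ns units */
    uint64_t micros = ticks / 10;                  /* to microseconds */

    micros -= DELTA_EPOCH_IN_MICROSECS;            /* shift to Unix epoch */
    *sec  = (long)(micros / 1000000ULL);
    *usec = (long)(micros % 1000000ULL);
}
```

Using a fixed-width `uint64_t` here plays the role of `unsigned __int64` in the original, so the shift and subtraction behave the same on any platform.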
|
||||
|
|
@ -142,7 +142,9 @@ int MAIN__(int argc, char *argv[]){
|
|||
if (argc > 0) { to = MAX(atol(*argv), from); argc--; argv++;}
|
||||
if (argc > 0) { step = atol(*argv); argc--; argv++;}
|
||||
|
||||
fprintf(stderr, "From : %3d To : %3d Step = %3d\n", from, to, step);
|
||||
if ((p = getenv("OPENBLAS_TRANS"))) trans=*p;
|
||||
|
||||
fprintf(stderr, "From : %3d To : %3d Step=%d : Trans=%c\n", from, to, step, trans);
|
||||
|
||||
if (( a = (FLOAT *)malloc(sizeof(FLOAT) * to * to * COMPSIZE)) == NULL){
|
||||
fprintf(stderr,"Out of Memory!!\n");exit(1);
|
||||
|
|
|
|||
|
|
@ -0,0 +1,212 @@
|
|||
/***************************************************************************
Copyright (c) 2014, The OpenBLAS Project
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in
the documentation and/or other materials provided with the
distribution.
3. Neither the name of the OpenBLAS project nor the names of
its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*****************************************************************************/

#include <stdio.h>
#include <stdlib.h>
#ifdef __CYGWIN32__
#include <sys/time.h>
#endif
#include "common.h"


#undef GEMM

#ifndef COMPLEX

#ifdef DOUBLE
#define GEMM BLASFUNC(dgemm)
#else
#define GEMM BLASFUNC(sgemm)
#endif

#else

#ifdef DOUBLE
#define GEMM BLASFUNC(zgemm3m)
#else
#define GEMM BLASFUNC(cgemm3m)
#endif

#endif

#if defined(__WIN32__) || defined(__WIN64__)

#ifndef DELTA_EPOCH_IN_MICROSECS
#define DELTA_EPOCH_IN_MICROSECS 11644473600000000ULL
#endif

int gettimeofday(struct timeval *tv, void *tz){

  FILETIME ft;
  unsigned __int64 tmpres = 0;
  static int tzflag;

  if (NULL != tv)
    {
      GetSystemTimeAsFileTime(&ft);

      tmpres |= ft.dwHighDateTime;
      tmpres <<= 32;
      tmpres |= ft.dwLowDateTime;

      /*converting file time to unix epoch*/
      tmpres /= 10; /*convert into microseconds*/
      tmpres -= DELTA_EPOCH_IN_MICROSECS;
      tv->tv_sec = (long)(tmpres / 1000000UL);
      tv->tv_usec = (long)(tmpres % 1000000UL);
    }

  return 0;
}

#endif

#if !defined(__WIN32__) && !defined(__WIN64__) && !defined(__CYGWIN32__) && 0

static void *huge_malloc(BLASLONG size){
  int shmid;
  void *address;

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000
#endif

  if ((shmid =shmget(IPC_PRIVATE,
                     (size + HUGE_PAGESIZE) & ~(HUGE_PAGESIZE - 1),
                     SHM_HUGETLB | IPC_CREAT |0600)) < 0) {
    printf( "Memory allocation failed(shmget).\n");
    exit(1);
  }

  address = shmat(shmid, NULL, SHM_RND);

  if ((BLASLONG)address == -1){
    printf( "Memory allocation failed(shmat).\n");
    exit(1);
  }

  shmctl(shmid, IPC_RMID, 0);

  return address;
}

#define malloc huge_malloc

#endif

int MAIN__(int argc, char *argv[]){

  FLOAT *a, *b, *c;
  FLOAT alpha[] = {1.0, 1.0};
  FLOAT beta [] = {1.0, 1.0};
  char trans='N';
  blasint m, i, j;
  int loops = 1;
  int l;
  char *p;

  int from = 1;
  int to = 200;
  int step = 1;

  struct timeval start, stop;
  double time1,timeg;

  argc--;argv++;

  if (argc > 0) { from = atol(*argv); argc--; argv++;}
  if (argc > 0) { to = MAX(atol(*argv), from); argc--; argv++;}
  if (argc > 0) { step = atol(*argv); argc--; argv++;}

  if ((p = getenv("OPENBLAS_TRANS"))) trans=*p;

  fprintf(stderr, "From : %3d To : %3d Step=%d : Trans=%c\n", from, to, step, trans);

  if (( a = (FLOAT *)malloc(sizeof(FLOAT) * to * to * COMPSIZE)) == NULL){
    fprintf(stderr,"Out of Memory!!\n");exit(1);
  }

  if (( b = (FLOAT *)malloc(sizeof(FLOAT) * to * to * COMPSIZE)) == NULL){
    fprintf(stderr,"Out of Memory!!\n");exit(1);
  }

  if (( c = (FLOAT *)malloc(sizeof(FLOAT) * to * to * COMPSIZE)) == NULL){
    fprintf(stderr,"Out of Memory!!\n");exit(1);
  }

  p = getenv("OPENBLAS_LOOPS");
  if ( p != NULL )
    loops = atoi(p);


#ifdef linux
  srandom(getpid());
#endif

  fprintf(stderr, " SIZE Flops\n");

  for(m = from; m <= to; m += step)
  {

    timeg=0;

    fprintf(stderr, " %6d : ", (int)m);

    for (l=0; l<loops; l++)
    {

      for(j = 0; j < m; j++){
        for(i = 0; i < m * COMPSIZE; i++){
          a[i + j * m * COMPSIZE] = ((FLOAT) rand() / (FLOAT) RAND_MAX) - 0.5;
          b[i + j * m * COMPSIZE] = ((FLOAT) rand() / (FLOAT) RAND_MAX) - 0.5;
          c[i + j * m * COMPSIZE] = ((FLOAT) rand() / (FLOAT) RAND_MAX) - 0.5;
        }
      }

      gettimeofday( &start, (struct timezone *)0);

      GEMM (&trans, &trans, &m, &m, &m, alpha, a, &m, b, &m, beta, c, &m );

      gettimeofday( &stop, (struct timezone *)0);

      time1 = (double)(stop.tv_sec - start.tv_sec) + (double)((stop.tv_usec - start.tv_usec)) * 1.e-6;

      timeg += time1;

    }

    timeg /= loops;
    fprintf(stderr,
            " %10.2f MFlops\n",
            COMPSIZE * COMPSIZE * 2. * (double)m * (double)m * (double)m / timeg * 1.e-6);

  }

  return 0;
}

void main(int argc, char *argv[]) __attribute__((weak, alias("MAIN__")));
|
||||
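The timing and MFlops arithmetic used in the benchmark loop above can be isolated as a minimal sketch. The helper names `elapsed_seconds` and `gemm_mflops` are hypothetical and do not appear in the commit; `compsize` stands in for the `COMPSIZE` macro and is assumed to be 1 for real data and 2 for complex data.

```c
#include <sys/time.h>

/* Hypothetical helper: wall-clock seconds between two gettimeofday() samples,
 * computed the same way as time1 in the benchmark loop. */
double elapsed_seconds(struct timeval start, struct timeval stop)
{
    return (double)(stop.tv_sec - start.tv_sec)
         + (double)(stop.tv_usec - start.tv_usec) * 1.e-6;
}

/* Hypothetical helper: the MFlops figure printed by the benchmark for an
 * m x m x m product. compsize is assumed to be 1 (real) or 2 (complex). */
double gemm_mflops(int compsize, int m, double seconds)
{
    return (double)compsize * (double)compsize * 2.0
         * (double)m * (double)m * (double)m / seconds * 1.e-6;
}
```

The formula is the conventional GEMM operation count. A 3M kernel presumably performs fewer real multiplications than that count assumes, so the printed number is probably best read as an effective rate rather than a count of executed operations.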
110
benchmark/gemv.c
@ -128,6 +128,7 @@ int MAIN__(int argc, char *argv[]){
blasint inc_x=1,inc_y=1;
blasint n=0;
int has_param_n = 0;
int has_param_m = 0;
int loops = 1;
int l;
char *p;
@ -145,29 +146,38 @@ int MAIN__(int argc, char *argv[]){
if (argc > 0) { to = MAX(atol(*argv), from); argc--; argv++;}
if (argc > 0) { step = atol(*argv); argc--; argv++;}


int tomax = to;

if ((p = getenv("OPENBLAS_LOOPS"))) loops = atoi(p);
if ((p = getenv("OPENBLAS_INCX"))) inc_x = atoi(p);
if ((p = getenv("OPENBLAS_INCY"))) inc_y = atoi(p);
if ((p = getenv("OPENBLAS_TRANS"))) trans=*p;
if ((p = getenv("OPENBLAS_PARAM_N"))) {
n = atoi(p);
if ((n>0) && (n<=to)) has_param_n = 1;
if ((n>0)) has_param_n = 1;
if ( n > tomax ) tomax = n;
}
if ( has_param_n == 0 )
if ((p = getenv("OPENBLAS_PARAM_M"))) {
m = atoi(p);
if ((m>0)) has_param_m = 1;
if ( m > tomax ) tomax = m;
}

if ( has_param_n == 1 )
fprintf(stderr, "From : %3d To : %3d Step = %3d Trans = '%c' N = %d Inc_x = %d Inc_y = %d Loops = %d\n", from, to, step,trans,n,inc_x,inc_y,loops);
else
fprintf(stderr, "From : %3d To : %3d Step = %3d Trans = '%c' Inc_x = %d Inc_y = %d Loops = %d\n", from, to, step,trans,inc_x,inc_y,loops);

if (( a = (FLOAT *)malloc(sizeof(FLOAT) * to * to * COMPSIZE)) == NULL){

fprintf(stderr, "From : %3d To : %3d Step = %3d Trans = '%c' Inc_x = %d Inc_y = %d Loops = %d\n", from, to, step,trans,inc_x,inc_y,loops);

if (( a = (FLOAT *)malloc(sizeof(FLOAT) * tomax * tomax * COMPSIZE)) == NULL){
fprintf(stderr,"Out of Memory!!\n");exit(1);
}

if (( x = (FLOAT *)malloc(sizeof(FLOAT) * to * abs(inc_x) * COMPSIZE)) == NULL){
if (( x = (FLOAT *)malloc(sizeof(FLOAT) * tomax * abs(inc_x) * COMPSIZE)) == NULL){
fprintf(stderr,"Out of Memory!!\n");exit(1);
}

if (( y = (FLOAT *)malloc(sizeof(FLOAT) * to * abs(inc_y) * COMPSIZE)) == NULL){
if (( y = (FLOAT *)malloc(sizeof(FLOAT) * tomax * abs(inc_y) * COMPSIZE)) == NULL){
fprintf(stderr,"Out of Memory!!\n");exit(1);
}
@ -177,50 +187,80 @@ int MAIN__(int argc, char *argv[]){
fprintf(stderr, " SIZE Flops\n");

for(m = from; m <= to; m += step)
if (has_param_m == 0)
{

timeg=0;
for(m = from; m <= to; m += step)
{
timeg=0;
if ( has_param_n == 0 ) n = m;
fprintf(stderr, " %6dx%d : ", (int)m,(int)n);
for(j = 0; j < m; j++){
for(i = 0; i < n * COMPSIZE; i++){
a[i + j * m * COMPSIZE] = ((FLOAT) rand() / (FLOAT) RAND_MAX) - 0.5;
}
}

if ( has_param_n == 0 ) n = m;
for (l=0; l<loops; l++)
{

fprintf(stderr, " %6dx%d : ", (int)m,(int)n);
for(i = 0; i < n * COMPSIZE * abs(inc_x); i++){
x[i] = ((FLOAT) rand() / (FLOAT) RAND_MAX) - 0.5;
}

for(j = 0; j < m; j++){
for(i = 0; i < n * COMPSIZE; i++){
a[i + j * m * COMPSIZE] = ((FLOAT) rand() / (FLOAT) RAND_MAX) - 0.5;
}
}
for(i = 0; i < n * COMPSIZE * abs(inc_y); i++){
y[i] = ((FLOAT) rand() / (FLOAT) RAND_MAX) - 0.5;
}
gettimeofday( &start, (struct timezone *)0);
GEMV (&trans, &m, &n, alpha, a, &m, x, &inc_x, beta, y, &inc_y );
gettimeofday( &stop, (struct timezone *)0);
time1 = (double)(stop.tv_sec - start.tv_sec) + (double)((stop.tv_usec - start.tv_usec)) * 1.e-6;
timeg += time1;

}

for (l=0; l<loops; l++)
{
timeg /= loops;

for(i = 0; i < n * COMPSIZE * abs(inc_x); i++){
x[i] = ((FLOAT) rand() / (FLOAT) RAND_MAX) - 0.5;
}
fprintf(stderr, " %10.2f MFlops\n", COMPSIZE * COMPSIZE * 2. * (double)m * (double)n / timeg * 1.e-6);

for(i = 0; i < n * COMPSIZE * abs(inc_y); i++){
y[i] = ((FLOAT) rand() / (FLOAT) RAND_MAX) - 0.5;
}
gettimeofday( &start, (struct timezone *)0);
}
}
else
{

GEMV (&trans, &m, &n, alpha, a, &m, x, &inc_x, beta, y, &inc_y );
for(n = from; n <= to; n += step)
{
timeg=0;
fprintf(stderr, " %6dx%d : ", (int)m,(int)n);
for(j = 0; j < m; j++){
for(i = 0; i < n * COMPSIZE; i++){
a[i + j * m * COMPSIZE] = ((FLOAT) rand() / (FLOAT) RAND_MAX) - 0.5;
}
}

gettimeofday( &stop, (struct timezone *)0);
for (l=0; l<loops; l++)
{

time1 = (double)(stop.tv_sec - start.tv_sec) + (double)((stop.tv_usec - start.tv_usec)) * 1.e-6;
for(i = 0; i < n * COMPSIZE * abs(inc_x); i++){
x[i] = ((FLOAT) rand() / (FLOAT) RAND_MAX) - 0.5;
}

timeg += time1;
for(i = 0; i < n * COMPSIZE * abs(inc_y); i++){
y[i] = ((FLOAT) rand() / (FLOAT) RAND_MAX) - 0.5;
}
gettimeofday( &start, (struct timezone *)0);
GEMV (&trans, &m, &n, alpha, a, &m, x, &inc_x, beta, y, &inc_y );
gettimeofday( &stop, (struct timezone *)0);
time1 = (double)(stop.tv_sec - start.tv_sec) + (double)((stop.tv_usec - start.tv_usec)) * 1.e-6;
timeg += time1;

}
}

timeg /= loops;
timeg /= loops;

fprintf(stderr,
" %10.2f MFlops\n",
COMPSIZE * COMPSIZE * 2. * (double)m * (double)n / timeg * 1.e-6);
fprintf(stderr, " %10.2f MFlops\n", COMPSIZE * COMPSIZE * 2. * (double)m * (double)n / timeg * 1.e-6);

}
}

return 0;
@ -0,0 +1,208 @@
/***************************************************************************
Copyright (c) 2014, The OpenBLAS Project
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in
the documentation and/or other materials provided with the
distribution.
3. Neither the name of the OpenBLAS project nor the names of
its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*****************************************************************************/

#include <stdio.h>
#include <stdlib.h>
#ifdef __CYGWIN32__
#include <sys/time.h>
#endif
#include "common.h"


#undef HEMV


#ifdef DOUBLE
#define HEMV BLASFUNC(zhemv)
#else
#define HEMV BLASFUNC(chemv)
#endif


#if defined(__WIN32__) || defined(__WIN64__)

#ifndef DELTA_EPOCH_IN_MICROSECS
#define DELTA_EPOCH_IN_MICROSECS 11644473600000000ULL
#endif

int gettimeofday(struct timeval *tv, void *tz){

  FILETIME ft;
  unsigned __int64 tmpres = 0;
  static int tzflag;

  if (NULL != tv)
    {
      GetSystemTimeAsFileTime(&ft);

      tmpres |= ft.dwHighDateTime;
      tmpres <<= 32;
      tmpres |= ft.dwLowDateTime;

      /*converting file time to unix epoch*/
      tmpres /= 10; /*convert into microseconds*/
      tmpres -= DELTA_EPOCH_IN_MICROSECS;
      tv->tv_sec = (long)(tmpres / 1000000UL);
      tv->tv_usec = (long)(tmpres % 1000000UL);
    }

  return 0;
}

#endif

#if !defined(__WIN32__) && !defined(__WIN64__) && !defined(__CYGWIN32__) && 0

static void *huge_malloc(BLASLONG size){
  int shmid;
  void *address;

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000
#endif

  if ((shmid =shmget(IPC_PRIVATE,
                     (size + HUGE_PAGESIZE) & ~(HUGE_PAGESIZE - 1),
                     SHM_HUGETLB | IPC_CREAT |0600)) < 0) {
    printf( "Memory allocation failed(shmget).\n");
    exit(1);
  }

  address = shmat(shmid, NULL, SHM_RND);

  if ((BLASLONG)address == -1){
    printf( "Memory allocation failed(shmat).\n");
    exit(1);
  }

  shmctl(shmid, IPC_RMID, 0);

  return address;
}

#define malloc huge_malloc

#endif

int MAIN__(int argc, char *argv[]){

  FLOAT *a, *x, *y;
  FLOAT alpha[] = {1.0, 1.0};
  FLOAT beta [] = {1.0, 1.0};
  char uplo='L';
  blasint m, i, j;
  blasint inc_x=1,inc_y=1;
  int loops = 1;
  int l;
  char *p;

  int from = 1;
  int to = 200;
  int step = 1;

  struct timeval start, stop;
  double time1,timeg;

  argc--;argv++;

  if (argc > 0) { from = atol(*argv); argc--; argv++;}
  if (argc > 0) { to = MAX(atol(*argv), from); argc--; argv++;}
  if (argc > 0) { step = atol(*argv); argc--; argv++;}

  if ((p = getenv("OPENBLAS_LOOPS"))) loops = atoi(p);
  if ((p = getenv("OPENBLAS_INCX"))) inc_x = atoi(p);
  if ((p = getenv("OPENBLAS_INCY"))) inc_y = atoi(p);
  if ((p = getenv("OPENBLAS_UPLO"))) uplo=*p;

  fprintf(stderr, "From : %3d To : %3d Step = %3d Uplo = '%c' Inc_x = %d Inc_y = %d Loops = %d\n", from, to, step,uplo,inc_x,inc_y,loops);

  if (( a = (FLOAT *)malloc(sizeof(FLOAT) * to * to * COMPSIZE)) == NULL){
    fprintf(stderr,"Out of Memory!!\n");exit(1);
  }

  if (( x = (FLOAT *)malloc(sizeof(FLOAT) * to * abs(inc_x) * COMPSIZE)) == NULL){
    fprintf(stderr,"Out of Memory!!\n");exit(1);
  }

  if (( y = (FLOAT *)malloc(sizeof(FLOAT) * to * abs(inc_y) * COMPSIZE)) == NULL){
    fprintf(stderr,"Out of Memory!!\n");exit(1);
  }

#ifdef linux
  srandom(getpid());
#endif

  fprintf(stderr, " SIZE Flops\n");

  for(m = from; m <= to; m += step)
  {

    timeg=0;

    fprintf(stderr, " %6dx%d : ", (int)m,(int)m);

    for(j = 0; j < m; j++){
      for(i = 0; i < m * COMPSIZE; i++){
        a[i + j * m * COMPSIZE] = ((FLOAT) rand() / (FLOAT) RAND_MAX) - 0.5;
      }
    }


    for (l=0; l<loops; l++)
    {

      for(i = 0; i < m * COMPSIZE * abs(inc_x); i++){
        x[i] = ((FLOAT) rand() / (FLOAT) RAND_MAX) - 0.5;
      }

      for(i = 0; i < m * COMPSIZE * abs(inc_y); i++){
        y[i] = ((FLOAT) rand() / (FLOAT) RAND_MAX) - 0.5;
      }
      gettimeofday( &start, (struct timezone *)0);

      HEMV (&uplo, &m, alpha, a, &m, x, &inc_x, beta, y, &inc_y );

      gettimeofday( &stop, (struct timezone *)0);

      time1 = (double)(stop.tv_sec - start.tv_sec) + (double)((stop.tv_usec - start.tv_usec)) * 1.e-6;

      timeg += time1;

    }

    timeg /= loops;

    fprintf(stderr,
            " %10.2f MFlops\n",
            COMPSIZE * COMPSIZE * 2. * (double)m * (double)m / timeg * 1.e-6);

  }

  return 0;
}

void main(int argc, char *argv[]) __attribute__((weak, alias("MAIN__")));
@ -0,0 +1,42 @@
# **********************************************************************************
# Copyright (c) 2014, The OpenBLAS Project
# All rights reserved.
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are
# met:
# 1. Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in
# the documentation and/or other materials provided with the
# distribution.
# 3. Neither the name of the OpenBLAS project nor the names of
# its contributors may be used to endorse or promote products
# derived from this software without specific prior written permission.
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
# USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# **********************************************************************************

set term x11 font sans;
set ylabel "MFlops";
set xlabel "Size";
set grid xtics;
set grid ytics;
set key left;
set timestamp "generated on %Y-%m-%d by `whoami`"
set title "Sgemv\nTRANS=T\nBulldozer"
plot '1-THREAD' smooth bezier, '2-THREADS' smooth bezier, '4-THREADS' smooth bezier;
set output "print.png";
show title;
show plot;
show output;
23
cblas.h
@ -243,8 +243,13 @@ void cblas_dgemm(OPENBLAS_CONST enum CBLAS_ORDER Order, OPENBLAS_CONST enum CBLA
        OPENBLAS_CONST double alpha, OPENBLAS_CONST double *A, OPENBLAS_CONST blasint lda, OPENBLAS_CONST double *B, OPENBLAS_CONST blasint ldb, OPENBLAS_CONST double beta, double *C, OPENBLAS_CONST blasint ldc);
void cblas_cgemm(OPENBLAS_CONST enum CBLAS_ORDER Order, OPENBLAS_CONST enum CBLAS_TRANSPOSE TransA, OPENBLAS_CONST enum CBLAS_TRANSPOSE TransB, OPENBLAS_CONST blasint M, OPENBLAS_CONST blasint N, OPENBLAS_CONST blasint K,
        OPENBLAS_CONST float *alpha, OPENBLAS_CONST float *A, OPENBLAS_CONST blasint lda, OPENBLAS_CONST float *B, OPENBLAS_CONST blasint ldb, OPENBLAS_CONST float *beta, float *C, OPENBLAS_CONST blasint ldc);
void cblas_cgemm3m(OPENBLAS_CONST enum CBLAS_ORDER Order, OPENBLAS_CONST enum CBLAS_TRANSPOSE TransA, OPENBLAS_CONST enum CBLAS_TRANSPOSE TransB, OPENBLAS_CONST blasint M, OPENBLAS_CONST blasint N, OPENBLAS_CONST blasint K,
        OPENBLAS_CONST float *alpha, OPENBLAS_CONST float *A, OPENBLAS_CONST blasint lda, OPENBLAS_CONST float *B, OPENBLAS_CONST blasint ldb, OPENBLAS_CONST float *beta, float *C, OPENBLAS_CONST blasint ldc);
void cblas_zgemm(OPENBLAS_CONST enum CBLAS_ORDER Order, OPENBLAS_CONST enum CBLAS_TRANSPOSE TransA, OPENBLAS_CONST enum CBLAS_TRANSPOSE TransB, OPENBLAS_CONST blasint M, OPENBLAS_CONST blasint N, OPENBLAS_CONST blasint K,
        OPENBLAS_CONST double *alpha, OPENBLAS_CONST double *A, OPENBLAS_CONST blasint lda, OPENBLAS_CONST double *B, OPENBLAS_CONST blasint ldb, OPENBLAS_CONST double *beta, double *C, OPENBLAS_CONST blasint ldc);
void cblas_zgemm3m(OPENBLAS_CONST enum CBLAS_ORDER Order, OPENBLAS_CONST enum CBLAS_TRANSPOSE TransA, OPENBLAS_CONST enum CBLAS_TRANSPOSE TransB, OPENBLAS_CONST blasint M, OPENBLAS_CONST blasint N, OPENBLAS_CONST blasint K,
        OPENBLAS_CONST double *alpha, OPENBLAS_CONST double *A, OPENBLAS_CONST blasint lda, OPENBLAS_CONST double *B, OPENBLAS_CONST blasint ldb, OPENBLAS_CONST double *beta, double *C, OPENBLAS_CONST blasint ldc);


void cblas_ssymm(OPENBLAS_CONST enum CBLAS_ORDER Order, OPENBLAS_CONST enum CBLAS_SIDE Side, OPENBLAS_CONST enum CBLAS_UPLO Uplo, OPENBLAS_CONST blasint M, OPENBLAS_CONST blasint N,
        OPENBLAS_CONST float alpha, OPENBLAS_CONST float *A, OPENBLAS_CONST blasint lda, OPENBLAS_CONST float *B, OPENBLAS_CONST blasint ldb, OPENBLAS_CONST float beta, float *C, OPENBLAS_CONST blasint ldc);
@ -318,6 +323,24 @@ void cblas_caxpby(OPENBLAS_CONST blasint n, OPENBLAS_CONST float *alpha, OPENBLA
void cblas_zaxpby(OPENBLAS_CONST blasint n, OPENBLAS_CONST double *alpha, OPENBLAS_CONST double *x, OPENBLAS_CONST blasint incx,OPENBLAS_CONST double *beta, double *y, OPENBLAS_CONST blasint incy);

void cblas_somatcopy(OPENBLAS_CONST enum CBLAS_ORDER CORDER, OPENBLAS_CONST enum CBLAS_TRANSPOSE CTRANS, OPENBLAS_CONST blasint crows, OPENBLAS_CONST blasint ccols, OPENBLAS_CONST float calpha, OPENBLAS_CONST float *a,
        OPENBLAS_CONST blasint clda, float *b, OPENBLAS_CONST blasint cldb);
void cblas_domatcopy(OPENBLAS_CONST enum CBLAS_ORDER CORDER, OPENBLAS_CONST enum CBLAS_TRANSPOSE CTRANS, OPENBLAS_CONST blasint crows, OPENBLAS_CONST blasint ccols, OPENBLAS_CONST double calpha, OPENBLAS_CONST double *a,
        OPENBLAS_CONST blasint clda, double *b, OPENBLAS_CONST blasint cldb);
void cblas_comatcopy(OPENBLAS_CONST enum CBLAS_ORDER CORDER, OPENBLAS_CONST enum CBLAS_TRANSPOSE CTRANS, OPENBLAS_CONST blasint crows, OPENBLAS_CONST blasint ccols, OPENBLAS_CONST float* calpha, OPENBLAS_CONST float* a,
        OPENBLAS_CONST blasint clda, float*b, OPENBLAS_CONST blasint cldb);
void cblas_zomatcopy(OPENBLAS_CONST enum CBLAS_ORDER CORDER, OPENBLAS_CONST enum CBLAS_TRANSPOSE CTRANS, OPENBLAS_CONST blasint crows, OPENBLAS_CONST blasint ccols, OPENBLAS_CONST double* calpha, OPENBLAS_CONST double* a,
        OPENBLAS_CONST blasint clda, double *b, OPENBLAS_CONST blasint cldb);

void cblas_simatcopy(OPENBLAS_CONST enum CBLAS_ORDER CORDER, OPENBLAS_CONST enum CBLAS_TRANSPOSE CTRANS, OPENBLAS_CONST blasint crows, OPENBLAS_CONST blasint ccols, OPENBLAS_CONST float calpha, float *a,
        OPENBLAS_CONST blasint clda, OPENBLAS_CONST blasint cldb);
void cblas_dimatcopy(OPENBLAS_CONST enum CBLAS_ORDER CORDER, OPENBLAS_CONST enum CBLAS_TRANSPOSE CTRANS, OPENBLAS_CONST blasint crows, OPENBLAS_CONST blasint ccols, OPENBLAS_CONST double calpha, double *a,
        OPENBLAS_CONST blasint clda, OPENBLAS_CONST blasint cldb);
void cblas_cimatcopy(OPENBLAS_CONST enum CBLAS_ORDER CORDER, OPENBLAS_CONST enum CBLAS_TRANSPOSE CTRANS, OPENBLAS_CONST blasint crows, OPENBLAS_CONST blasint ccols, OPENBLAS_CONST float* calpha, float* a,
        OPENBLAS_CONST blasint clda, OPENBLAS_CONST blasint cldb);
void cblas_zimatcopy(OPENBLAS_CONST enum CBLAS_ORDER CORDER, OPENBLAS_CONST enum CBLAS_TRANSPOSE CTRANS, OPENBLAS_CONST blasint crows, OPENBLAS_CONST blasint ccols, OPENBLAS_CONST double* calpha, double* a,
        OPENBLAS_CONST blasint clda, OPENBLAS_CONST blasint cldb);

#ifdef __cplusplus
}
#endif /* __cplusplus */
|
||||
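To make the out-of-place `?omatcopy` prototypes above easier to read, here is a minimal reference sketch for the single-precision, column-major case. It assumes the usual omatcopy convention, B = alpha * op(A), with A stored as `rows` x `cols`. The function name `ref_somatcopy_colmajor` and the plain `int trans` flag are hypothetical; this is not the OpenBLAS implementation and it does not use the CBLAS enums.

```c
#include <stddef.h>

/* Hypothetical reference routine, not OpenBLAS code.
 * Column-major, out-of-place: B = alpha * op(A), where A is rows x cols.
 * trans == 0: op(A) = A,   B is rows x cols with leading dimension ldb.
 * trans != 0: op(A) = A^T, B is cols x rows with leading dimension ldb. */
void ref_somatcopy_colmajor(int trans, int rows, int cols, float alpha,
                            const float *a, int lda, float *b, int ldb)
{
    for (int j = 0; j < cols; j++) {
        for (int i = 0; i < rows; i++) {
            float v = alpha * a[(size_t)i + (size_t)j * (size_t)lda];
            if (trans)
                b[(size_t)j + (size_t)i * (size_t)ldb] = v;
            else
                b[(size_t)i + (size_t)j * (size_t)ldb] = v;
        }
    }
}
```

The `?imatcopy` prototypes differ in that they take a single buffer `a` and both leading dimensions, which suggests the same operation performed in place.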
@ -231,8 +231,12 @@ void cblas_dgemm(enum CBLAS_ORDER Order, enum CBLAS_TRANSPOSE TransA, enum CBLAS
        double alpha, double *A, blasint lda, double *B, blasint ldb, double beta, double *C, blasint ldc);
void cblas_cgemm(enum CBLAS_ORDER Order, enum CBLAS_TRANSPOSE TransA, enum CBLAS_TRANSPOSE TransB, blasint M, blasint N, blasint K,
        float *alpha, float *A, blasint lda, float *B, blasint ldb, float *beta, float *C, blasint ldc);
void cblas_cgemm3m(enum CBLAS_ORDER Order, enum CBLAS_TRANSPOSE TransA, enum CBLAS_TRANSPOSE TransB, blasint M, blasint N, blasint K,
        float *alpha, float *A, blasint lda, float *B, blasint ldb, float *beta, float *C, blasint ldc);
void cblas_zgemm(enum CBLAS_ORDER Order, enum CBLAS_TRANSPOSE TransA, enum CBLAS_TRANSPOSE TransB, blasint M, blasint N, blasint K,
        double *alpha, double *A, blasint lda, double *B, blasint ldb, double *beta, double *C, blasint ldc);
void cblas_zgemm3m(enum CBLAS_ORDER Order, enum CBLAS_TRANSPOSE TransA, enum CBLAS_TRANSPOSE TransB, blasint M, blasint N, blasint K,
        double *alpha, double *A, blasint lda, double *B, blasint ldb, double *beta, double *C, blasint ldc);

void cblas_ssymm(enum CBLAS_ORDER Order, enum CBLAS_SIDE Side, enum CBLAS_UPLO Uplo, blasint M, blasint N,
        float alpha, float *A, blasint lda, float *B, blasint ldb, float beta, float *C, blasint ldc);
@ -306,7 +310,23 @@ void cblas_caxpby(blasint n, float *alpha, float *x, blasint incx,float *beta, f
void cblas_zaxpby(blasint n, double *alpha, double *x, blasint incx,double *beta, double *y, blasint incy);

void cblas_somatcopy( enum CBLAS_ORDER CORDER, enum CBLAS_TRANSPOSE CTRANS, blasint crows, blasint ccols, float calpha, float *a,
        blasint clda, float *b, blasint cldb);
void cblas_domatcopy( enum CBLAS_ORDER CORDER, enum CBLAS_TRANSPOSE CTRANS, blasint crows, blasint ccols, double calpha, double *a,
        blasint clda, double *b, blasint cldb);
void cblas_comatcopy( enum CBLAS_ORDER CORDER, enum CBLAS_TRANSPOSE CTRANS, blasint crows, blasint ccols, void* calpha, void* a,
        blasint clda, void *b, blasint cldb);
void cblas_zomatcopy( enum CBLAS_ORDER CORDER, enum CBLAS_TRANSPOSE CTRANS, blasint crows, blasint ccols, void* calpha, void* a,
        blasint clda, void *b, blasint cldb);

void cblas_simatcopy( enum CBLAS_ORDER CORDER, enum CBLAS_TRANSPOSE CTRANS, blasint crows, blasint ccols, float calpha, float *a,
        blasint clda, blasint cldb);
void cblas_dimatcopy( enum CBLAS_ORDER CORDER, enum CBLAS_TRANSPOSE CTRANS, blasint crows, blasint ccols, double calpha, double *a,
        blasint clda, blasint cldb);
void cblas_cimatcopy( enum CBLAS_ORDER CORDER, enum CBLAS_TRANSPOSE CTRANS, blasint crows, blasint ccols, float* calpha, float* a,
        blasint clda, blasint cldb);
void cblas_zimatcopy( enum CBLAS_ORDER CORDER, enum CBLAS_TRANSPOSE CTRANS, blasint crows, blasint ccols, double* calpha, double* a,
        blasint clda, blasint cldb);
#ifdef __cplusplus
}
#endif /* __cplusplus */
109
common_param.h
@ -435,6 +435,9 @@ BLASLONG (*icamin_k)(BLASLONG, float *, BLASLONG);
int (*chemm_outcopy)(BLASLONG, BLASLONG, float *, BLASLONG, BLASLONG, BLASLONG, float *);
int (*chemm_oltcopy)(BLASLONG, BLASLONG, float *, BLASLONG, BLASLONG, BLASLONG, float *);

int cgemm3m_p, cgemm3m_q, cgemm3m_r;
int cgemm3m_unroll_m, cgemm3m_unroll_n, cgemm3m_unroll_mn;

int (*cgemm3m_kernel)(BLASLONG, BLASLONG, BLASLONG, float, float, float *, float *, float *, BLASLONG);

int (*cgemm3m_incopyb)(BLASLONG, BLASLONG, float *, BLASLONG, float *);
@ -595,6 +598,9 @@ BLASLONG (*izamin_k)(BLASLONG, double *, BLASLONG);
int (*zhemm_outcopy)(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, double *);
int (*zhemm_oltcopy)(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, double *);

int zgemm3m_p, zgemm3m_q, zgemm3m_r;
int zgemm3m_unroll_m, zgemm3m_unroll_n, zgemm3m_unroll_mn;

int (*zgemm3m_kernel)(BLASLONG, BLASLONG, BLASLONG, double, double, double *, double *, double *, BLASLONG);

int (*zgemm3m_incopyb)(BLASLONG, BLASLONG, double *, BLASLONG, double *);
@ -757,6 +763,9 @@ BLASLONG (*ixamin_k)(BLASLONG, xdouble *, BLASLONG);
int (*xhemm_outcopy)(BLASLONG, BLASLONG, xdouble *, BLASLONG, BLASLONG, BLASLONG, xdouble *);
int (*xhemm_oltcopy)(BLASLONG, BLASLONG, xdouble *, BLASLONG, BLASLONG, BLASLONG, xdouble *);

int xgemm3m_p, xgemm3m_q, xgemm3m_r;
int xgemm3m_unroll_m, xgemm3m_unroll_n, xgemm3m_unroll_mn;

int (*xgemm3m_kernel)(BLASLONG, BLASLONG, BLASLONG, xdouble, xdouble, xdouble *, xdouble *, xdouble *, BLASLONG);

int (*xgemm3m_incopyb)(BLASLONG, BLASLONG, xdouble *, BLASLONG, xdouble *);
@ -900,6 +909,27 @@ extern gotoblas_t *gotoblas;
#define XGEMM_UNROLL_N gotoblas -> xgemm_unroll_n
#define XGEMM_UNROLL_MN gotoblas -> xgemm_unroll_mn

#define CGEMM3M_P gotoblas -> cgemm3m_p
#define CGEMM3M_Q gotoblas -> cgemm3m_q
#define CGEMM3M_R gotoblas -> cgemm3m_r
#define CGEMM3M_UNROLL_M gotoblas -> cgemm3m_unroll_m
#define CGEMM3M_UNROLL_N gotoblas -> cgemm3m_unroll_n
#define CGEMM3M_UNROLL_MN gotoblas -> cgemm3m_unroll_mn

#define ZGEMM3M_P gotoblas -> zgemm3m_p
#define ZGEMM3M_Q gotoblas -> zgemm3m_q
#define ZGEMM3M_R gotoblas -> zgemm3m_r
#define ZGEMM3M_UNROLL_M gotoblas -> zgemm3m_unroll_m
#define ZGEMM3M_UNROLL_N gotoblas -> zgemm3m_unroll_n
#define ZGEMM3M_UNROLL_MN gotoblas -> zgemm3m_unroll_mn

#define XGEMM3M_P gotoblas -> xgemm3m_p
#define XGEMM3M_Q gotoblas -> xgemm3m_q
#define XGEMM3M_R gotoblas -> xgemm3m_r
#define XGEMM3M_UNROLL_M gotoblas -> xgemm3m_unroll_m
#define XGEMM3M_UNROLL_N gotoblas -> xgemm3m_unroll_n
#define XGEMM3M_UNROLL_MN gotoblas -> xgemm3m_unroll_mn

#else

#define DTB_ENTRIES DTB_DEFAULT_ENTRIES
@ -972,6 +1002,55 @@ extern gotoblas_t *gotoblas;
|
|||
#define XGEMM_UNROLL_N XGEMM_DEFAULT_UNROLL_N
|
||||
#define XGEMM_UNROLL_MN MAX((XGEMM_UNROLL_M), (XGEMM_UNROLL_N))
|
||||
|
||||
#ifdef CGEMM3M_DEFAULT_UNROLL_N
|
||||
|
||||
#define CGEMM3M_P CGEMM3M_DEFAULT_P
|
||||
#define CGEMM3M_Q CGEMM3M_DEFAULT_Q
|
||||
#define CGEMM3M_R CGEMM3M_DEFAULT_R
|
||||
#define CGEMM3M_UNROLL_M CGEMM3M_DEFAULT_UNROLL_M
|
||||
#define CGEMM3M_UNROLL_N CGEMM3M_DEFAULT_UNROLL_N
|
||||
#define CGEMM3M_UNROLL_MN MAX((CGEMM3M_UNROLL_M), (CGEMM3M_UNROLL_N))
|
||||
|
||||
#else
|
||||
|
||||
#define CGEMM3M_P SGEMM_DEFAULT_P
|
||||
#define CGEMM3M_Q SGEMM_DEFAULT_Q
|
||||
#define CGEMM3M_R SGEMM_DEFAULT_R
|
||||
#define CGEMM3M_UNROLL_M SGEMM_DEFAULT_UNROLL_M
|
||||
#define CGEMM3M_UNROLL_N SGEMM_DEFAULT_UNROLL_N
|
||||
#define CGEMM3M_UNROLL_MN MAX((CGEMM_UNROLL_M), (CGEMM_UNROLL_N))
|
||||
|
||||
#endif
|
||||
|
||||
|
||||
#ifdef ZGEMM3M_DEFAULT_UNROLL_N
|
||||
|
||||
#define ZGEMM3M_P ZGEMM3M_DEFAULT_P
|
||||
#define ZGEMM3M_Q ZGEMM3M_DEFAULT_Q
|
||||
#define ZGEMM3M_R ZGEMM3M_DEFAULT_R
|
||||
#define ZGEMM3M_UNROLL_M ZGEMM3M_DEFAULT_UNROLL_M
|
||||
#define ZGEMM3M_UNROLL_N ZGEMM3M_DEFAULT_UNROLL_N
|
||||
#define ZGEMM3M_UNROLL_MN MAX((ZGEMM_UNROLL_M), (ZGEMM_UNROLL_N))
|
||||
|
||||
#else
|
||||
|
||||
#define ZGEMM3M_P DGEMM_DEFAULT_P
|
||||
#define ZGEMM3M_Q DGEMM_DEFAULT_Q
|
||||
#define ZGEMM3M_R DGEMM_DEFAULT_R
|
||||
#define ZGEMM3M_UNROLL_M DGEMM_DEFAULT_UNROLL_M
|
||||
#define ZGEMM3M_UNROLL_N DGEMM_DEFAULT_UNROLL_N
|
||||
#define ZGEMM3M_UNROLL_MN MAX((ZGEMM_UNROLL_M), (ZGEMM_UNROLL_N))
|
||||
|
||||
#endif
|
||||
|
||||
#define XGEMM3M_P QGEMM_DEFAULT_P
|
||||
#define XGEMM3M_Q QGEMM_DEFAULT_Q
|
||||
#define XGEMM3M_R QGEMM_DEFAULT_R
|
||||
#define XGEMM3M_UNROLL_M QGEMM_DEFAULT_UNROLL_M
|
||||
#define XGEMM3M_UNROLL_N QGEMM_DEFAULT_UNROLL_N
|
||||
#define XGEMM3M_UNROLL_MN MAX((QGEMM_UNROLL_M), (QGEMM_UNROLL_N))
|
||||
|
||||
|
||||
#endif
|
||||
#endif
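The fallback chain in this hunk can be tried in isolation. The following is a minimal sketch, assuming placeholder tuning values (768, 8 and 4 are illustrative, not OpenBLAS's real numbers) and two hypothetical accessor functions. When a target does not define `CGEMM3M_DEFAULT_UNROLL_N`, the 3M parameters take the single-precision real GEMM defaults. As a simplification, the sketch derives `CGEMM3M_UNROLL_MN` from the 3M unroll values in both branches.

```c
/* Placeholder tuning values, for illustration only. */
#define SGEMM_DEFAULT_P        768
#define SGEMM_DEFAULT_UNROLL_M 8
#define SGEMM_DEFAULT_UNROLL_N 4

#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* A target with dedicated 3M tuning would define CGEMM3M_DEFAULT_P,
 * CGEMM3M_DEFAULT_UNROLL_M and CGEMM3M_DEFAULT_UNROLL_N before this point. */
#ifdef CGEMM3M_DEFAULT_UNROLL_N
#define CGEMM3M_P         CGEMM3M_DEFAULT_P
#define CGEMM3M_UNROLL_M  CGEMM3M_DEFAULT_UNROLL_M
#define CGEMM3M_UNROLL_N  CGEMM3M_DEFAULT_UNROLL_N
#else
/* No dedicated tuning: reuse the real single-precision defaults. */
#define CGEMM3M_P         SGEMM_DEFAULT_P
#define CGEMM3M_UNROLL_M  SGEMM_DEFAULT_UNROLL_M
#define CGEMM3M_UNROLL_N  SGEMM_DEFAULT_UNROLL_N
#endif

#define CGEMM3M_UNROLL_MN MAX((CGEMM3M_UNROLL_M), (CGEMM3M_UNROLL_N))

/* Hypothetical accessors so the resolved values can be inspected. */
int cgemm3m_p(void)         { return CGEMM3M_P; }
int cgemm3m_unroll_mn(void) { return CGEMM3M_UNROLL_MN; }
```

Defining the three `CGEMM3M_DEFAULT_*` macros ahead of the `#ifdef` switches the accessors to the dedicated values without touching the rest of the file.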
|
||||
|
||||
|
|
@ -1054,14 +1133,14 @@ extern gotoblas_t *gotoblas;
|
|||
#endif
|
||||
|
||||
#ifdef XDOUBLE
|
||||
#define GEMM3M_UNROLL_M QGEMM_UNROLL_M
|
||||
#define GEMM3M_UNROLL_N QGEMM_UNROLL_N
|
||||
#define GEMM3M_UNROLL_M XGEMM3M_UNROLL_M
|
||||
#define GEMM3M_UNROLL_N XGEMM3M_UNROLL_N
|
||||
#elif defined(DOUBLE)
|
||||
#define GEMM3M_UNROLL_M DGEMM_UNROLL_M
|
||||
#define GEMM3M_UNROLL_N DGEMM_UNROLL_N
|
||||
#define GEMM3M_UNROLL_M ZGEMM3M_UNROLL_M
|
||||
#define GEMM3M_UNROLL_N ZGEMM3M_UNROLL_N
|
||||
#else
|
||||
#define GEMM3M_UNROLL_M SGEMM_UNROLL_M
|
||||
#define GEMM3M_UNROLL_N SGEMM_UNROLL_N
|
||||
#define GEMM3M_UNROLL_M CGEMM3M_UNROLL_M
|
||||
#define GEMM3M_UNROLL_N CGEMM3M_UNROLL_N
|
||||
#endif
|
||||
|
||||
|
||||
|
|
@ -1123,31 +1202,31 @@ extern gotoblas_t *gotoblas;
|
|||
|
||||
#ifndef GEMM3M_P
|
||||
#ifdef XDOUBLE
|
||||
#define GEMM3M_P QGEMM_P
|
||||
#define GEMM3M_P XGEMM3M_P
|
||||
#elif defined(DOUBLE)
|
||||
#define GEMM3M_P DGEMM_P
|
||||
#define GEMM3M_P ZGEMM3M_P
|
||||
#else
|
||||
#define GEMM3M_P SGEMM_P
|
||||
#define GEMM3M_P CGEMM3M_P
|
||||
#endif
|
||||
#endif
|
||||
|
||||
#ifndef GEMM3M_Q
|
||||
#ifdef XDOUBLE
|
||||
#define GEMM3M_Q QGEMM_Q
|
||||
#define GEMM3M_Q XGEMM3M_Q
|
||||
#elif defined(DOUBLE)
|
||||
#define GEMM3M_Q DGEMM_Q
|
||||
#define GEMM3M_Q ZGEMM3M_Q
|
||||
#else
|
||||
#define GEMM3M_Q SGEMM_Q
|
||||
#define GEMM3M_Q CGEMM3M_Q
|
||||
#endif
|
||||
#endif
|
||||
|
||||
#ifndef GEMM3M_R
|
||||
#ifdef XDOUBLE
|
||||
#define GEMM3M_R QGEMM_R
|
||||
#define GEMM3M_R XGEMM3M_R
|
||||
#elif defined(DOUBLE)
|
||||
#define GEMM3M_R DGEMM_R
|
||||
#define GEMM3M_R ZGEMM3M_R
|
||||
#else
|
||||
#define GEMM3M_R SGEMM_R
|
||||
#define GEMM3M_R CGEMM3M_R
|
||||
#endif
|
||||
#endif
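For context on the `GEMM3M_*` names: the "3M" method forms a complex product from three real multiplications instead of four. A scalar sketch of the identity follows. The struct and function names are illustrative and not part of OpenBLAS, whose kernels apply the idea to whole matrix blocks rather than to single elements.

```c
typedef struct { float real; float imag; } cplx;

/* Conventional complex product: four real multiplications. */
cplx mul4(cplx x, cplx y)
{
    cplx r;
    r.real = x.real * y.real - x.imag * y.imag;
    r.imag = x.real * y.imag + x.imag * y.real;
    return r;
}

/* 3M complex product: three real multiplications, five additions. */
cplx mul3(cplx x, cplx y)
{
    float t1 = x.real * y.real;                       /* ac           */
    float t2 = x.imag * y.imag;                       /* bd           */
    float t3 = (x.real + x.imag) * (y.real + y.imag); /* (a+b)(c+d)   */
    cplx r;
    r.real = t1 - t2;        /* ac - bd                       */
    r.imag = t3 - t1 - t2;   /* (a+b)(c+d) - ac - bd = ad+bc  */
    return r;
}
```

The 3M form trades one multiplication for extra additions, so its rounding behavior can differ slightly from the conventional formula.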
|
||||
|
||||
|
|
|
|||
|
|
@ -46,6 +46,7 @@
|
|||
#define __volatile__
#endif

/*
#ifdef HAVE_SSE2
#define MB __asm__ __volatile__ ("mfence");
#define WMB __asm__ __volatile__ ("sfence");
|
|
@ -53,6 +54,10 @@
|
|||
#define MB
#define WMB
#endif
*/

#define MB
#define WMB

static void __inline blas_lock(volatile BLASULONG *address){
|
||||
|
|
@ -99,7 +104,9 @@ static __inline void cpuid(int op, int *eax, int *ebx, int *ecx, int *edx){
|
|||
: "0" (op));
|
||||
}
|
||||
|
||||
/*
|
||||
#define WHEREAMI
|
||||
*/
|
||||
|
||||
static inline int WhereAmI(void){
|
||||
int eax, ebx, ecx, edx;
|
||||
|
|
@ -111,6 +118,7 @@ static inline int WhereAmI(void){
|
|||
return apicid;
|
||||
}
|
||||
|
||||
|
||||
#ifdef CORE_BARCELONA
|
||||
#define IFLUSH gotoblas_iflush()
|
||||
#define IFLUSH_HALF gotoblas_iflush_half()
|
||||
|
|
|
|||
|
|
@ -59,9 +59,16 @@
|
|||
void cpuid(int op, int *eax, int *ebx, int *ecx, int *edx);
#else
static inline void cpuid(int op, int *eax, int *ebx, int *ecx, int *edx){
#if defined(__i386__) && defined(__PIC__)
__asm__ __volatile__
("mov %%ebx, %%edi;"
"cpuid;"
"xchgl %%ebx, %%edi;"
: "=a" (*eax), "=D" (*ebx), "=c" (*ecx), "=d" (*edx) : "a" (op) : "cc");
#else
__asm__ __volatile__
("cpuid": "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx) : "a" (op) : "cc");

#endif
}
#endif
|
||||
|
|
|
|||
|
|
@ -74,6 +74,18 @@ else
|
|||
OPENBLAS_NUM_THREADS=2 ./xzcblat3 < zin3
|
||||
endif
|
||||
|
||||
all3_3m: xzcblat3_3m xccblat3_3m
|
||||
ifeq ($(USE_OPENMP), 1)
|
||||
OMP_NUM_THREADS=2 ./xccblat3_3m < cin3_3m
|
||||
OMP_NUM_THREADS=2 ./xzcblat3_3m < zin3_3m
|
||||
else
|
||||
OPENBLAS_NUM_THREADS=2 ./xccblat3_3m < cin3_3m
|
||||
OPENBLAS_NUM_THREADS=2 ./xzcblat3_3m < zin3_3m
|
||||
endif
|
||||
|
||||
|
||||
|
||||
|
||||
clean ::
|
||||
rm -f x*
|
||||
|
||||
|
|
@ -103,6 +115,9 @@ xccblat2: $(ctestl2o) c_cblat2.o $(TOPDIR)/$(LIBNAME)
|
|||
xccblat3: $(ctestl3o) c_cblat3.o $(TOPDIR)/$(LIBNAME)
|
||||
$(FC) $(FLDFLAGS) -o xccblat3 c_cblat3.o $(ctestl3o) $(LIB) $(EXTRALIB) $(CEXTRALIB)
|
||||
|
||||
xccblat3_3m: $(ctestl3o) c_cblat3_3m.o $(TOPDIR)/$(LIBNAME)
|
||||
$(FC) $(FLDFLAGS) -o xccblat3_3m c_cblat3_3m.o $(ctestl3o) $(LIB) $(EXTRALIB) $(CEXTRALIB)
|
||||
|
||||
# Double complex
|
||||
xzcblat1: $(ztestl1o) c_zblat1.o $(TOPDIR)/$(LIBNAME)
|
||||
$(FC) $(FLDFLAGS) -o xzcblat1 c_zblat1.o $(ztestl1o) $(LIB) $(EXTRALIB) $(CEXTRALIB)
|
||||
|
|
@ -111,4 +126,9 @@ xzcblat2: $(ztestl2o) c_zblat2.o $(TOPDIR)/$(LIBNAME)
|
|||
xzcblat3: $(ztestl3o) c_zblat3.o $(TOPDIR)/$(LIBNAME)
|
||||
$(FC) $(FLDFLAGS) -o xzcblat3 c_zblat3.o $(ztestl3o) $(LIB) $(EXTRALIB) $(CEXTRALIB)
|
||||
|
||||
|
||||
xzcblat3_3m: $(ztestl3o) c_zblat3_3m.o $(TOPDIR)/$(LIBNAME)
|
||||
$(FC) $(FLDFLAGS) -o xzcblat3_3m c_zblat3_3m.o $(ztestl3o) $(LIB) $(EXTRALIB) $(CEXTRALIB)
|
||||
|
||||
|
||||
include $(TOPDIR)/Makefile.tail
|
||||
|
|
|
|||
234
ctest/c_c3chke.c
|
|
@ -45,8 +45,238 @@ void F77_c3chke(char * rout) {
|
|||
F77_xerbla(cblas_rout,&cblas_info);
|
||||
}
|
||||
|
||||
if (strncmp( sf,"cblas_cgemm" ,11)==0) {
|
||||
cblas_rout = "cblas_cgemm" ;
|
||||
|
||||
if (strncmp( sf,"cblas_cgemm3m" ,13)==0) {
|
||||
cblas_rout = "cblas_cgemm3" ;
|
||||
|
||||
cblas_info = 1;
|
||||
cblas_cgemm3m( INVALID, CblasNoTrans, CblasNoTrans, 0, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 1;
|
||||
cblas_cgemm3m( INVALID, CblasNoTrans, CblasTrans, 0, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 1;
|
||||
cblas_cgemm3m( INVALID, CblasTrans, CblasNoTrans, 0, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 1;
|
||||
cblas_cgemm3m( INVALID, CblasTrans, CblasTrans, 0, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 2; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, INVALID, CblasNoTrans, 0, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 2; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, INVALID, CblasTrans, 0, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 3; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasNoTrans, INVALID, 0, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 3; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasTrans, INVALID, 0, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 4; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasNoTrans, CblasNoTrans, INVALID, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 4; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasNoTrans, CblasTrans, INVALID, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 4; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasTrans, CblasNoTrans, INVALID, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 4; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasTrans, CblasTrans, INVALID, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 5; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasNoTrans, CblasNoTrans, 0, INVALID, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 5; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasNoTrans, CblasTrans, 0, INVALID, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 5; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasTrans, CblasNoTrans, 0, INVALID, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 5; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasTrans, CblasTrans, 0, INVALID, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 6; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasNoTrans, CblasNoTrans, 0, 0, INVALID,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 6; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasNoTrans, CblasTrans, 0, 0, INVALID,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 6; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasTrans, CblasNoTrans, 0, 0, INVALID,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 6; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasTrans, CblasTrans, 0, 0, INVALID,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 9; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasNoTrans, CblasNoTrans, 2, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 2 );
|
||||
chkxer();
|
||||
cblas_info = 9; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasNoTrans, CblasTrans, 2, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 2 );
|
||||
chkxer();
|
||||
cblas_info = 9; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasTrans, CblasNoTrans, 0, 0, 2,
|
||||
ALPHA, A, 1, B, 2, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 9; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasTrans, CblasTrans, 0, 0, 2,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 11; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasNoTrans, CblasNoTrans, 0, 0, 2,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 11; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasTrans, CblasNoTrans, 0, 0, 2,
|
||||
ALPHA, A, 2, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 11; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasNoTrans, CblasTrans, 0, 2, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 11; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasTrans, CblasTrans, 0, 2, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 14; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasNoTrans, CblasNoTrans, 2, 0, 0,
|
||||
ALPHA, A, 2, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 14; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasNoTrans, CblasTrans, 2, 0, 0,
|
||||
ALPHA, A, 2, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 14; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasTrans, CblasNoTrans, 2, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 14; RowMajorStrg = FALSE;
|
||||
cblas_cgemm3m( CblasColMajor, CblasTrans, CblasTrans, 2, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 4; RowMajorStrg = TRUE;
|
||||
cblas_cgemm3m( CblasRowMajor, CblasNoTrans, CblasNoTrans, INVALID, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 4; RowMajorStrg = TRUE;
|
||||
cblas_cgemm3m( CblasRowMajor, CblasNoTrans, CblasTrans, INVALID, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 4; RowMajorStrg = TRUE;
|
||||
cblas_cgemm3m( CblasRowMajor, CblasTrans, CblasNoTrans, INVALID, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 4; RowMajorStrg = TRUE;
|
||||
cblas_cgemm3m( CblasRowMajor, CblasTrans, CblasTrans, INVALID, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 5; RowMajorStrg = TRUE;
|
||||
cblas_cgemm3m( CblasRowMajor, CblasNoTrans, CblasNoTrans, 0, INVALID, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 5; RowMajorStrg = TRUE;
|
||||
cblas_cgemm3m( CblasRowMajor, CblasNoTrans, CblasTrans, 0, INVALID, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 5; RowMajorStrg = TRUE;
|
||||
cblas_cgemm3m( CblasRowMajor, CblasTrans, CblasNoTrans, 0, INVALID, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 5; RowMajorStrg = TRUE;
|
||||
cblas_cgemm3m( CblasRowMajor, CblasTrans, CblasTrans, 0, INVALID, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 6; RowMajorStrg = TRUE;
|
||||
cblas_cgemm3m( CblasRowMajor, CblasNoTrans, CblasNoTrans, 0, 0, INVALID,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 6; RowMajorStrg = TRUE;
|
||||
cblas_cgemm3m( CblasRowMajor, CblasNoTrans, CblasTrans, 0, 0, INVALID,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 6; RowMajorStrg = TRUE;
|
||||
cblas_cgemm3m( CblasRowMajor, CblasTrans, CblasNoTrans, 0, 0, INVALID,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 6; RowMajorStrg = TRUE;
|
||||
cblas_cgemm3m( CblasRowMajor, CblasTrans, CblasTrans, 0, 0, INVALID,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 9; RowMajorStrg = TRUE;
|
||||
cblas_cgemm3m( CblasRowMajor, CblasNoTrans, CblasNoTrans, 0, 0, 2,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 2 );
|
||||
chkxer();
|
||||
cblas_info = 9; RowMajorStrg = TRUE;
|
||||
cblas_cgemm3m( CblasRowMajor, CblasNoTrans, CblasTrans, 0, 0, 2,
|
||||
ALPHA, A, 1, B, 2, BETA, C, 2 );
|
||||
chkxer();
|
||||
cblas_info = 9; RowMajorStrg = TRUE;
|
||||
cblas_cgemm3m( CblasRowMajor, CblasTrans, CblasNoTrans, 2, 0, 0,
|
||||
ALPHA, A, 1, B, 2, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 9; RowMajorStrg = TRUE;
|
||||
cblas_cgemm3m( CblasRowMajor, CblasTrans, CblasTrans, 2, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 11; RowMajorStrg = TRUE;
|
||||
cblas_cgemm3m( CblasRowMajor, CblasNoTrans, CblasNoTrans, 0, 2, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 11; RowMajorStrg = TRUE;
|
||||
cblas_cgemm3m( CblasRowMajor, CblasTrans, CblasNoTrans, 0, 2, 0,
|
||||
ALPHA, A, 2, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 11; RowMajorStrg = TRUE;
|
||||
cblas_cgemm3m( CblasRowMajor, CblasNoTrans, CblasTrans, 0, 0, 2,
|
||||
ALPHA, A, 2, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 11; RowMajorStrg = TRUE;
|
||||
cblas_cgemm3m( CblasRowMajor, CblasTrans, CblasTrans, 0, 0, 2,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 14; RowMajorStrg = TRUE;
|
||||
cblas_cgemm3m( CblasRowMajor, CblasNoTrans, CblasNoTrans, 0, 2, 0,
|
||||
ALPHA, A, 1, B, 2, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 14; RowMajorStrg = TRUE;
|
||||
cblas_cgemm3m( CblasRowMajor, CblasNoTrans, CblasTrans, 0, 2, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 14; RowMajorStrg = TRUE;
|
||||
cblas_cgemm3m( CblasRowMajor, CblasTrans, CblasNoTrans, 0, 2, 0,
|
||||
ALPHA, A, 1, B, 2, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 14; RowMajorStrg = TRUE;
|
||||
cblas_cgemm3m( CblasRowMajor, CblasTrans, CblasTrans, 0, 2, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
|
||||
} else if (strncmp( sf,"cblas_cgemm" ,11)==0) {
|
||||
cblas_rout = "cblas_cgemm" ;
|
||||
|
||||
|
||||
cblas_info = 1;
|
||||
cblas_cgemm( INVALID, CblasNoTrans, CblasNoTrans, 0, 0, 0,
|
||||
|
|
|
|||
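The `cblas_info` values expected by the error-exit checks above are the 1-based positions of the offending argument in the `cblas_?gemm3m` argument list (1 = order, 2 and 3 = transposes, 4 to 6 = M, N, K, 9 = lda, 11 = ldb, 14 = ldc). The following is a sketch of a validator that reproduces those numbers. `gemm_first_bad_arg` and the enum names are hypothetical, this is not OpenBLAS's interface code, and it accepts only the three standard CBLAS transpose values.

```c
/* Numeric values match the standard CBLAS enums. */
enum { ROW_MAJOR = 101, COL_MAJOR = 102 };
enum { NO_TRANS = 111, TRANS = 112, CONJ_TRANS = 113 };

static int max_int(int a, int b) { return a > b ? a : b; }

static int valid_trans(int t)
{
    return t == NO_TRANS || t == TRANS || t == CONJ_TRANS;
}

/* Returns 0 if the arguments are acceptable, otherwise the 1-based
 * position of the first bad argument in the cblas_?gemm argument list. */
int gemm_first_bad_arg(int order, int ta, int tb,
                       int m, int n, int k,
                       int lda, int ldb, int ldc)
{
    int a_min, b_min, c_min;

    if (order != ROW_MAJOR && order != COL_MAJOR) return 1;
    if (!valid_trans(ta)) return 2;
    if (!valid_trans(tb)) return 3;
    if (m < 0) return 4;
    if (n < 0) return 5;
    if (k < 0) return 6;

    if (order == COL_MAJOR) {
        /* Leading dimension counts the rows of the stored matrix. */
        a_min = (ta == NO_TRANS) ? m : k;
        b_min = (tb == NO_TRANS) ? k : n;
        c_min = m;
    } else {
        /* Row-major: leading dimension counts the stored columns. */
        a_min = (ta == NO_TRANS) ? k : m;
        b_min = (tb == NO_TRANS) ? n : k;
        c_min = n;
    }

    if (lda < max_int(1, a_min)) return 9;
    if (ldb < max_int(1, b_min)) return 11;
    if (ldc < max_int(1, c_min)) return 14;
    return 0;
}
```

For example, the column-major case with `M = 2` and `lda = 1` yields 9, matching the `cblas_info = 9` check in the test.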
|
|
@ -88,6 +88,7 @@ void F77_cgemm(int *order, char *transpa, char *transpb, int *m, int *n,
|
|||
cblas_cgemm( UNDEFINED, transa, transb, *m, *n, *k, alpha, a, *lda,
|
||||
b, *ldb, beta, c, *ldc );
|
||||
}
|
||||
|
||||
void F77_chemm(int *order, char *rtlf, char *uplow, int *m, int *n,
|
||||
CBLAS_TEST_COMPLEX *alpha, CBLAS_TEST_COMPLEX *a, int *lda,
|
||||
CBLAS_TEST_COMPLEX *b, int *ldb, CBLAS_TEST_COMPLEX *beta,
|
||||
|
|
@ -563,3 +564,84 @@ void F77_ctrsm(int *order, char *rtlf, char *uplow, char *transp, char *diagn,
|
|||
cblas_ctrsm(UNDEFINED, side, uplo, trans, diag, *m, *n, alpha,
|
||||
a, *lda, b, *ldb);
|
||||
}
|
||||
|
||||
|
||||
|
||||
void F77_cgemm3m(int *order, char *transpa, char *transpb, int *m, int *n,
|
||||
int *k, CBLAS_TEST_COMPLEX *alpha, CBLAS_TEST_COMPLEX *a, int *lda,
|
||||
CBLAS_TEST_COMPLEX *b, int *ldb, CBLAS_TEST_COMPLEX *beta,
|
||||
CBLAS_TEST_COMPLEX *c, int *ldc ) {
|
||||
|
||||
CBLAS_TEST_COMPLEX *A, *B, *C;
|
||||
int i,j,LDA, LDB, LDC;
|
||||
enum CBLAS_TRANSPOSE transa, transb;
|
||||
|
||||
get_transpose_type(transpa, &transa);
|
||||
get_transpose_type(transpb, &transb);
|
||||
|
||||
if (*order == TEST_ROW_MJR) {
|
||||
if (transa == CblasNoTrans) {
|
||||
LDA = *k+1;
|
||||
A=(CBLAS_TEST_COMPLEX*)malloc((*m)*LDA*sizeof(CBLAS_TEST_COMPLEX));
|
||||
for( i=0; i<*m; i++ )
|
||||
for( j=0; j<*k; j++ ) {
|
||||
A[i*LDA+j].real=a[j*(*lda)+i].real;
|
||||
A[i*LDA+j].imag=a[j*(*lda)+i].imag;
|
||||
}
|
||||
}
|
||||
else {
|
||||
LDA = *m+1;
|
||||
A=(CBLAS_TEST_COMPLEX* )malloc(LDA*(*k)*sizeof(CBLAS_TEST_COMPLEX));
|
||||
for( i=0; i<*k; i++ )
|
||||
for( j=0; j<*m; j++ ) {
|
||||
A[i*LDA+j].real=a[j*(*lda)+i].real;
|
||||
A[i*LDA+j].imag=a[j*(*lda)+i].imag;
|
||||
}
|
||||
}
|
||||
|
||||
if (transb == CblasNoTrans) {
|
||||
LDB = *n+1;
|
||||
B=(CBLAS_TEST_COMPLEX* )malloc((*k)*LDB*sizeof(CBLAS_TEST_COMPLEX) );
|
||||
for( i=0; i<*k; i++ )
|
||||
for( j=0; j<*n; j++ ) {
|
||||
B[i*LDB+j].real=b[j*(*ldb)+i].real;
|
||||
B[i*LDB+j].imag=b[j*(*ldb)+i].imag;
|
||||
}
|
||||
}
|
||||
else {
|
||||
LDB = *k+1;
|
||||
B=(CBLAS_TEST_COMPLEX* )malloc(LDB*(*n)*sizeof(CBLAS_TEST_COMPLEX));
|
||||
for( i=0; i<*n; i++ )
|
||||
for( j=0; j<*k; j++ ) {
|
||||
B[i*LDB+j].real=b[j*(*ldb)+i].real;
|
||||
B[i*LDB+j].imag=b[j*(*ldb)+i].imag;
|
||||
}
|
||||
}
|
||||
|
||||
LDC = *n+1;
|
||||
C=(CBLAS_TEST_COMPLEX* )malloc((*m)*LDC*sizeof(CBLAS_TEST_COMPLEX));
|
||||
for( j=0; j<*n; j++ )
|
||||
for( i=0; i<*m; i++ ) {
|
||||
C[i*LDC+j].real=c[j*(*ldc)+i].real;
|
||||
C[i*LDC+j].imag=c[j*(*ldc)+i].imag;
|
||||
}
|
||||
cblas_cgemm3m( CblasRowMajor, transa, transb, *m, *n, *k, alpha, A, LDA,
|
||||
B, LDB, beta, C, LDC );
|
||||
for( j=0; j<*n; j++ )
|
||||
for( i=0; i<*m; i++ ) {
|
||||
c[j*(*ldc)+i].real=C[i*LDC+j].real;
|
||||
c[j*(*ldc)+i].imag=C[i*LDC+j].imag;
|
||||
}
|
||||
free(A);
|
||||
free(B);
|
||||
free(C);
|
||||
}
|
||||
else if (*order == TEST_COL_MJR)
|
||||
cblas_cgemm3m( CblasColMajor, transa, transb, *m, *n, *k, alpha, a, *lda,
|
||||
b, *ldb, beta, c, *ldc );
|
||||
else
|
||||
cblas_cgemm3m( UNDEFINED, transa, transb, *m, *n, *k, alpha, a, *lda,
|
||||
b, *ldb, beta, c, *ldc );
|
||||
}
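The row-major branch of the wrapper above repacks each column-major operand into a freshly allocated row-major buffer whose leading dimension is one larger than the column count. A minimal sketch of that single step follows. `complex_f` and `to_row_major` are hypothetical names, and the NULL check on `malloc` is an addition the wrapper itself does not make.

```c
#include <stdlib.h>

typedef struct { float real; float imag; } complex_f;

/* Copy a rows x cols matrix from column-major storage (leading
 * dimension ld_src) into new row-major storage with leading dimension
 * cols + 1. The padding element of each row is left uninitialized.
 * Returns NULL on allocation failure; the caller frees the result. */
complex_f *to_row_major(const complex_f *src, int rows, int cols,
                        int ld_src, int *ld_dst)
{
    int ld = cols + 1;
    size_t nrows = (size_t)(rows > 0 ? rows : 1); /* avoid malloc(0) */
    complex_f *dst = malloc(nrows * (size_t)ld * sizeof *dst);
    int i, j;

    if (dst == NULL)
        return NULL;

    for (i = 0; i < rows; i++)
        for (j = 0; j < cols; j++)
            dst[i * ld + j] = src[j * ld_src + i];

    *ld_dst = ld;
    return dst;
}
```

The padded leading dimension means the routine under test is exercised with `ld` strictly greater than the row length, which catches code that confuses the two.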
|
||||
|
||||
|
||||
|
|
|
|||
File diff suppressed because it is too large
238
ctest/c_z3chke.c
|
|
@ -45,8 +45,242 @@ void F77_z3chke(char * rout) {
|
|||
F77_xerbla(cblas_rout,&cblas_info);
|
||||
}
|
||||
|
||||
if (strncmp( sf,"cblas_zgemm" ,11)==0) {
|
||||
cblas_rout = "cblas_zgemm" ;
|
||||
|
||||
|
||||
|
||||
|
||||
if (strncmp( sf,"cblas_zgemm3m" ,13)==0) {
|
||||
cblas_rout = "cblas_zgemm3" ;
|
||||
|
||||
cblas_info = 1;
|
||||
cblas_zgemm3m( INVALID, CblasNoTrans, CblasNoTrans, 0, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 1;
|
||||
cblas_zgemm3m( INVALID, CblasNoTrans, CblasTrans, 0, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 1;
|
||||
cblas_zgemm3m( INVALID, CblasTrans, CblasNoTrans, 0, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 1;
|
||||
cblas_zgemm3m( INVALID, CblasTrans, CblasTrans, 0, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 2; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, INVALID, CblasNoTrans, 0, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 2; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, INVALID, CblasTrans, 0, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 3; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasNoTrans, INVALID, 0, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 3; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasTrans, INVALID, 0, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 4; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasNoTrans, CblasNoTrans, INVALID, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 4; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasNoTrans, CblasTrans, INVALID, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 4; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasTrans, CblasNoTrans, INVALID, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 4; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasTrans, CblasTrans, INVALID, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 5; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasNoTrans, CblasNoTrans, 0, INVALID, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 5; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasNoTrans, CblasTrans, 0, INVALID, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 5; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasTrans, CblasNoTrans, 0, INVALID, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 5; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasTrans, CblasTrans, 0, INVALID, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 6; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasNoTrans, CblasNoTrans, 0, 0, INVALID,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 6; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasNoTrans, CblasTrans, 0, 0, INVALID,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 6; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasTrans, CblasNoTrans, 0, 0, INVALID,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 6; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasTrans, CblasTrans, 0, 0, INVALID,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 9; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasNoTrans, CblasNoTrans, 2, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 2 );
|
||||
chkxer();
|
||||
cblas_info = 9; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasNoTrans, CblasTrans, 2, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 2 );
|
||||
chkxer();
|
||||
cblas_info = 9; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasTrans, CblasNoTrans, 0, 0, 2,
|
||||
ALPHA, A, 1, B, 2, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 9; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasTrans, CblasTrans, 0, 0, 2,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 11; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasNoTrans, CblasNoTrans, 0, 0, 2,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 11; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasTrans, CblasNoTrans, 0, 0, 2,
|
||||
ALPHA, A, 2, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 11; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasNoTrans, CblasTrans, 0, 2, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 11; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasTrans, CblasTrans, 0, 2, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 14; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasNoTrans, CblasNoTrans, 2, 0, 0,
|
||||
ALPHA, A, 2, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 14; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasNoTrans, CblasTrans, 2, 0, 0,
|
||||
ALPHA, A, 2, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 14; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasTrans, CblasNoTrans, 2, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 14; RowMajorStrg = FALSE;
|
||||
cblas_zgemm3m( CblasColMajor, CblasTrans, CblasTrans, 2, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 4; RowMajorStrg = TRUE;
|
||||
cblas_zgemm3m( CblasRowMajor, CblasNoTrans, CblasNoTrans, INVALID, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 4; RowMajorStrg = TRUE;
|
||||
cblas_zgemm3m( CblasRowMajor, CblasNoTrans, CblasTrans, INVALID, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 4; RowMajorStrg = TRUE;
|
||||
cblas_zgemm3m( CblasRowMajor, CblasTrans, CblasNoTrans, INVALID, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 4; RowMajorStrg = TRUE;
|
||||
cblas_zgemm3m( CblasRowMajor, CblasTrans, CblasTrans, INVALID, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 5; RowMajorStrg = TRUE;
|
||||
cblas_zgemm3m( CblasRowMajor, CblasNoTrans, CblasNoTrans, 0, INVALID, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 5; RowMajorStrg = TRUE;
|
||||
cblas_zgemm3m( CblasRowMajor, CblasNoTrans, CblasTrans, 0, INVALID, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 5; RowMajorStrg = TRUE;
|
||||
cblas_zgemm3m( CblasRowMajor, CblasTrans, CblasNoTrans, 0, INVALID, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 5; RowMajorStrg = TRUE;
|
||||
cblas_zgemm3m( CblasRowMajor, CblasTrans, CblasTrans, 0, INVALID, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 6; RowMajorStrg = TRUE;
|
||||
cblas_zgemm3m( CblasRowMajor, CblasNoTrans, CblasNoTrans, 0, 0, INVALID,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 6; RowMajorStrg = TRUE;
|
||||
cblas_zgemm3m( CblasRowMajor, CblasNoTrans, CblasTrans, 0, 0, INVALID,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 6; RowMajorStrg = TRUE;
|
||||
cblas_zgemm3m( CblasRowMajor, CblasTrans, CblasNoTrans, 0, 0, INVALID,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 6; RowMajorStrg = TRUE;
|
||||
cblas_zgemm3m( CblasRowMajor, CblasTrans, CblasTrans, 0, 0, INVALID,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 9; RowMajorStrg = TRUE;
|
||||
cblas_zgemm3m( CblasRowMajor, CblasNoTrans, CblasNoTrans, 0, 0, 2,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 2 );
|
||||
chkxer();
|
||||
cblas_info = 9; RowMajorStrg = TRUE;
|
||||
cblas_zgemm3m( CblasRowMajor, CblasNoTrans, CblasTrans, 0, 0, 2,
|
||||
ALPHA, A, 1, B, 2, BETA, C, 2 );
|
||||
chkxer();
|
||||
cblas_info = 9; RowMajorStrg = TRUE;
|
||||
cblas_zgemm3m( CblasRowMajor, CblasTrans, CblasNoTrans, 2, 0, 0,
|
||||
ALPHA, A, 1, B, 2, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 9; RowMajorStrg = TRUE;
|
||||
cblas_zgemm3m( CblasRowMajor, CblasTrans, CblasTrans, 2, 0, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 11; RowMajorStrg = TRUE;
|
||||
cblas_zgemm3m( CblasRowMajor, CblasNoTrans, CblasNoTrans, 0, 2, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 11; RowMajorStrg = TRUE;
|
||||
cblas_zgemm3m( CblasRowMajor, CblasTrans, CblasNoTrans, 0, 2, 0,
|
||||
ALPHA, A, 2, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 11; RowMajorStrg = TRUE;
|
||||
cblas_zgemm3m( CblasRowMajor, CblasNoTrans, CblasTrans, 0, 0, 2,
|
||||
ALPHA, A, 2, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 11; RowMajorStrg = TRUE;
|
||||
cblas_zgemm3m( CblasRowMajor, CblasTrans, CblasTrans, 0, 0, 2,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 14; RowMajorStrg = TRUE;
|
||||
cblas_zgemm3m( CblasRowMajor, CblasNoTrans, CblasNoTrans, 0, 2, 0,
|
||||
ALPHA, A, 1, B, 2, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 14; RowMajorStrg = TRUE;
|
||||
cblas_zgemm3m( CblasRowMajor, CblasNoTrans, CblasTrans, 0, 2, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 14; RowMajorStrg = TRUE;
|
||||
cblas_zgemm3m( CblasRowMajor, CblasTrans, CblasNoTrans, 0, 2, 0,
|
||||
ALPHA, A, 1, B, 2, BETA, C, 1 );
|
||||
chkxer();
|
||||
cblas_info = 14; RowMajorStrg = TRUE;
|
||||
cblas_zgemm3m( CblasRowMajor, CblasTrans, CblasTrans, 0, 2, 0,
|
||||
ALPHA, A, 1, B, 1, BETA, C, 1 );
|
||||
chkxer();
|
||||
|
||||
|
||||
|
||||
} else if (strncmp( sf,"cblas_zgemm" ,11)==0) {
|
||||
cblas_rout = "cblas_zgemm" ;
|
||||
|
||||
cblas_info = 1;
|
||||
cblas_zgemm( INVALID, CblasNoTrans, CblasNoTrans, 0, 0, 0,
|
||||
|
|
|
|||
|
|
@ -562,3 +562,82 @@ void F77_ztrsm(int *order, char *rtlf, char *uplow, char *transp, char *diagn,
|
|||
cblas_ztrsm(UNDEFINED, side, uplo, trans, diag, *m, *n, alpha,
|
||||
a, *lda, b, *ldb);
|
||||
}
|
||||
|
||||
|
||||
void F77_zgemm3m(int *order, char *transpa, char *transpb, int *m, int *n,
|
||||
int *k, CBLAS_TEST_ZOMPLEX *alpha, CBLAS_TEST_ZOMPLEX *a, int *lda,
|
||||
CBLAS_TEST_ZOMPLEX *b, int *ldb, CBLAS_TEST_ZOMPLEX *beta,
|
||||
CBLAS_TEST_ZOMPLEX *c, int *ldc ) {
|
||||
|
||||
CBLAS_TEST_ZOMPLEX *A, *B, *C;
|
||||
int i,j,LDA, LDB, LDC;
|
||||
enum CBLAS_TRANSPOSE transa, transb;
|
||||
|
||||
get_transpose_type(transpa, &transa);
|
||||
get_transpose_type(transpb, &transb);
|
||||
|
||||
if (*order == TEST_ROW_MJR) {
|
||||
if (transa == CblasNoTrans) {
|
||||
LDA = *k+1;
|
||||
A=(CBLAS_TEST_ZOMPLEX*)malloc((*m)*LDA*sizeof(CBLAS_TEST_ZOMPLEX));
|
||||
for( i=0; i<*m; i++ )
|
||||
for( j=0; j<*k; j++ ) {
|
||||
A[i*LDA+j].real=a[j*(*lda)+i].real;
|
||||
A[i*LDA+j].imag=a[j*(*lda)+i].imag;
|
||||
}
|
||||
}
|
||||
else {
|
||||
LDA = *m+1;
|
||||
A=(CBLAS_TEST_ZOMPLEX* )malloc(LDA*(*k)*sizeof(CBLAS_TEST_ZOMPLEX));
|
||||
for( i=0; i<*k; i++ )
|
||||
for( j=0; j<*m; j++ ) {
|
||||
A[i*LDA+j].real=a[j*(*lda)+i].real;
|
||||
A[i*LDA+j].imag=a[j*(*lda)+i].imag;
|
||||
}
|
||||
}
|
||||
|
||||
if (transb == CblasNoTrans) {
|
||||
LDB = *n+1;
|
||||
B=(CBLAS_TEST_ZOMPLEX* )malloc((*k)*LDB*sizeof(CBLAS_TEST_ZOMPLEX) );
|
||||
for( i=0; i<*k; i++ )
|
||||
for( j=0; j<*n; j++ ) {
|
||||
B[i*LDB+j].real=b[j*(*ldb)+i].real;
|
||||
B[i*LDB+j].imag=b[j*(*ldb)+i].imag;
|
||||
}
|
||||
}
|
||||
else {
|
||||
LDB = *k+1;
|
||||
B=(CBLAS_TEST_ZOMPLEX* )malloc(LDB*(*n)*sizeof(CBLAS_TEST_ZOMPLEX));
|
||||
for( i=0; i<*n; i++ )
|
||||
for( j=0; j<*k; j++ ) {
|
||||
B[i*LDB+j].real=b[j*(*ldb)+i].real;
|
||||
B[i*LDB+j].imag=b[j*(*ldb)+i].imag;
|
||||
}
|
||||
}
|
||||
|
||||
LDC = *n+1;
|
||||
C=(CBLAS_TEST_ZOMPLEX* )malloc((*m)*LDC*sizeof(CBLAS_TEST_ZOMPLEX));
|
||||
for( j=0; j<*n; j++ )
|
||||
for( i=0; i<*m; i++ ) {
|
||||
C[i*LDC+j].real=c[j*(*ldc)+i].real;
|
||||
C[i*LDC+j].imag=c[j*(*ldc)+i].imag;
|
||||
}
|
||||
cblas_zgemm3m( CblasRowMajor, transa, transb, *m, *n, *k, alpha, A, LDA,
|
||||
B, LDB, beta, C, LDC );
|
||||
for( j=0; j<*n; j++ )
|
||||
for( i=0; i<*m; i++ ) {
|
||||
c[j*(*ldc)+i].real=C[i*LDC+j].real;
|
||||
c[j*(*ldc)+i].imag=C[i*LDC+j].imag;
|
||||
}
|
||||
free(A);
|
||||
free(B);
|
||||
free(C);
|
||||
}
|
||||
else if (*order == TEST_COL_MJR)
|
||||
cblas_zgemm3m( CblasColMajor, transa, transb, *m, *n, *k, alpha, a, *lda,
|
||||
b, *ldb, beta, c, *ldc );
|
||||
else
|
||||
cblas_zgemm3m( UNDEFINED, transa, transb, *m, *n, *k, alpha, a, *lda,
|
||||
b, *ldb, beta, c, *ldc );
|
||||
}
|
||||
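The row-major branch of the wrapper above copies each column-major Fortran operand into a newly allocated row-major buffer whose leading dimension is one larger than the row length, so the test also exercises padded strides. A minimal sketch of that copy in isolation follows. The helper name `colmajor_to_rowmajor` and the `zcomplex` struct (standing in for `CBLAS_TEST_ZOMPLEX`) are assumptions made for this example and are not part of the test suite.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for CBLAS_TEST_ZOMPLEX (assumption: same two-double layout). */
typedef struct { double real; double imag; } zcomplex;

/* Hypothetical helper: copy a rows x cols column-major matrix `src`
 * (leading dimension ld_src) into a new row-major buffer whose leading
 * dimension is cols + 1. Returns NULL on allocation failure; the caller
 * frees the result. Padding elements are left uninitialised. */
static zcomplex *colmajor_to_rowmajor(const zcomplex *src, int rows, int cols,
                                      int ld_src, int *ld_dst)
{
    assert(rows >= 0 && cols >= 0 && ld_src >= rows);
    *ld_dst = cols + 1;                        /* one padding element per row */
    size_t count = (size_t)(rows > 0 ? rows : 1) * (size_t)*ld_dst;
    zcomplex *dst = malloc(count * sizeof *dst);
    if (dst == NULL) return NULL;
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            dst[i * *ld_dst + j] = src[j * ld_src + i];   /* element (i, j) */
    return dst;
}
```

Unlike the wrapper, this sketch checks the `malloc` result; the test code in the diff assumes allocation succeeds.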
|
||||
|
|
|
|||
File diff suppressed because it is too large
|
|
@ -173,12 +173,14 @@ typedef struct { double real; double imag; } CBLAS_TEST_ZOMPLEX;
|
|||
#define F77_dtrmm cdtrmm_
|
||||
#define F77_dtrsm cdtrsm_
|
||||
#define F77_cgemm ccgemm_
|
||||
#define F77_cgemm3m ccgemm3m_
|
||||
#define F77_csymm ccsymm_
|
||||
#define F77_csyrk ccsyrk_
|
||||
#define F77_csyr2k ccsyr2k_
|
||||
#define F77_ctrmm cctrmm_
|
||||
#define F77_ctrsm cctrsm_
|
||||
#define F77_zgemm czgemm_
|
||||
#define F77_zgemm3m czgemm3m_
|
||||
#define F77_zsymm czsymm_
|
||||
#define F77_zsyrk czsyrk_
|
||||
#define F77_zsyr2k czsyr2k_
|
||||
|
|
@ -333,12 +335,14 @@ typedef struct { double real; double imag; } CBLAS_TEST_ZOMPLEX;
|
|||
#define F77_dtrmm CDTRMM
|
||||
#define F77_dtrsm CDTRSM
|
||||
#define F77_cgemm CCGEMM
|
||||
#define F77_cgemm3m CCGEMM3M
|
||||
#define F77_csymm CCSYMM
|
||||
#define F77_csyrk CCSYRK
|
||||
#define F77_csyr2k CCSYR2K
|
||||
#define F77_ctrmm CCTRMM
|
||||
#define F77_ctrsm CCTRSM
|
||||
#define F77_zgemm CZGEMM
|
||||
#define F77_zgemm3m CZGEMM3M
|
||||
#define F77_zsymm CZSYMM
|
||||
#define F77_zsyrk CZSYRK
|
||||
#define F77_zsyr2k CZSYR2K
|
||||
|
|
@ -493,12 +497,14 @@ typedef struct { double real; double imag; } CBLAS_TEST_ZOMPLEX;
|
|||
#define F77_dtrmm cdtrmm
|
||||
#define F77_dtrsm cdtrsm
|
||||
#define F77_cgemm ccgemm
|
||||
#define F77_cgemm3m ccgemm3m
|
||||
#define F77_csymm ccsymm
|
||||
#define F77_csyrk ccsyrk
|
||||
#define F77_csyr2k ccsyr2k
|
||||
#define F77_ctrmm cctrmm
|
||||
#define F77_ctrsm cctrsm
|
||||
#define F77_zgemm czgemm
|
||||
#define F77_zgemm3m czgemm3m
|
||||
#define F77_zsymm czsymm
|
||||
#define F77_zsyrk czsyrk
|
||||
#define F77_zsyr2k czsyr2k
|
||||
|
|
|
|||
|
|
@ -0,0 +1,22 @@
|
|||
'CBLAT3.SNAP' NAME OF SNAPSHOT OUTPUT FILE
|
||||
-1 UNIT NUMBER OF SNAPSHOT FILE (NOT USED IF .LT. 0)
|
||||
F LOGICAL FLAG, T TO REWIND SNAPSHOT FILE AFTER EACH RECORD.
|
||||
F LOGICAL FLAG, T TO STOP ON FAILURES.
|
||||
T LOGICAL FLAG, T TO TEST ERROR EXITS.
|
||||
2 0 TO TEST COLUMN-MAJOR, 1 TO TEST ROW-MAJOR, 2 TO TEST BOTH
|
||||
16.0 THRESHOLD VALUE OF TEST RATIO
|
||||
6 NUMBER OF VALUES OF N
|
||||
0 1 2 3 5 9 35 VALUES OF N
|
||||
3 NUMBER OF VALUES OF ALPHA
|
||||
(0.0,0.0) (1.0,0.0) (0.7,-0.9) VALUES OF ALPHA
|
||||
3 NUMBER OF VALUES OF BETA
|
||||
(0.0,0.0) (1.0,0.0) (1.3,-1.1) VALUES OF BETA
|
||||
cblas_cgemm3m T PUT F FOR NO TEST. SAME COLUMNS.
|
||||
cblas_chemm F PUT F FOR NO TEST. SAME COLUMNS.
|
||||
cblas_csymm F PUT F FOR NO TEST. SAME COLUMNS.
|
||||
cblas_ctrmm F PUT F FOR NO TEST. SAME COLUMNS.
|
||||
cblas_ctrsm F PUT F FOR NO TEST. SAME COLUMNS.
|
||||
cblas_cherk F PUT F FOR NO TEST. SAME COLUMNS.
|
||||
cblas_csyrk F PUT F FOR NO TEST. SAME COLUMNS.
|
||||
cblas_cher2k F PUT F FOR NO TEST. SAME COLUMNS.
|
||||
cblas_csyr2k F PUT F FOR NO TEST. SAME COLUMNS.
|
||||
|
|
@ -0,0 +1,22 @@
|
|||
'ZBLAT3.SNAP' NAME OF SNAPSHOT OUTPUT FILE
|
||||
-1 UNIT NUMBER OF SNAPSHOT FILE (NOT USED IF .LT. 0)
|
||||
F LOGICAL FLAG, T TO REWIND SNAPSHOT FILE AFTER EACH RECORD.
|
||||
F LOGICAL FLAG, T TO STOP ON FAILURES.
|
||||
T LOGICAL FLAG, T TO TEST ERROR EXITS.
|
||||
2 0 TO TEST COLUMN-MAJOR, 1 TO TEST ROW-MAJOR, 2 TO TEST BOTH
|
||||
16.0 THRESHOLD VALUE OF TEST RATIO
|
||||
7 NUMBER OF VALUES OF N
|
||||
0 1 2 3 5 9 35 VALUES OF N
|
||||
3 NUMBER OF VALUES OF ALPHA
|
||||
(0.0,0.0) (1.0,0.0) (0.7,-0.9) VALUES OF ALPHA
|
||||
3 NUMBER OF VALUES OF BETA
|
||||
(0.0,0.0) (1.0,0.0) (1.3,-1.1) VALUES OF BETA
|
||||
cblas_zgemm3m T PUT F FOR NO TEST. SAME COLUMNS.
|
||||
cblas_zhemm F PUT F FOR NO TEST. SAME COLUMNS.
|
||||
cblas_zsymm F PUT F FOR NO TEST. SAME COLUMNS.
|
||||
cblas_ztrmm F PUT F FOR NO TEST. SAME COLUMNS.
|
||||
cblas_ztrsm F PUT F FOR NO TEST. SAME COLUMNS.
|
||||
cblas_zherk F PUT F FOR NO TEST. SAME COLUMNS.
|
||||
cblas_zsyrk F PUT F FOR NO TEST. SAME COLUMNS.
|
||||
cblas_zher2k F PUT F FOR NO TEST. SAME COLUMNS.
|
||||
cblas_zsyr2k F PUT F FOR NO TEST. SAME COLUMNS.
|
||||
|
|
@ -4,11 +4,11 @@ include ../../Makefile.system
|
|||
USE_GEMM3M = 0
|
||||
|
||||
ifeq ($(ARCH), x86)
|
||||
USE_GEMM3M = 0
|
||||
USE_GEMM3M = 1
|
||||
endif
|
||||
|
||||
ifeq ($(ARCH), x86_64)
|
||||
USE_GEMM3M = 0
|
||||
USE_GEMM3M = 1
|
||||
endif
|
||||
|
||||
ifeq ($(ARCH), ia64)
|
||||
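For background on what this switch enables: the "3M" in `gemm3m` is commonly understood to refer to forming a complex product with three real multiplications instead of four. The diff itself does not spell the scheme out, so the following is a scalar sketch of that identity under that assumption, not the kernel code. The function name `complex_mul_3m` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply (ar + i*ai) by (br + i*bi) using three real multiplications.
 * Illustrative only: a gemm3m kernel would apply the same identity to
 * whole matrix blocks rather than to scalars. */
static void complex_mul_3m(double ar, double ai, double br, double bi,
                           double *cr, double *ci)
{
    assert(cr != NULL && ci != NULL);
    double t1 = ar * br;
    double t2 = ai * bi;
    double t3 = (ar + ai) * (br + bi);
    *cr = t1 - t2;           /* real part: ar*br - ai*bi */
    *ci = t3 - t1 - t2;      /* imaginary part: ar*bi + ai*br */
}
```

The trade is one multiplication for extra additions, which pays off at the matrix level where multiplications dominate. The rounding behaviour differs slightly from the four-multiplication form, so results may not match a conventional `gemm` bit for bit.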
|
|
|
|||
|
|
@ -251,7 +251,11 @@ void blas_set_parameter(void){
|
|||
|
||||
env_var_t p;
|
||||
int factor;
|
||||
#if defined(BULLDOZER) || defined(PILEDRIVER) || defined(SANDYBRIDGE) || defined(NEHALEM) || defined(HASWELL)
|
||||
int size = 16;
|
||||
#else
|
||||
int size = get_L2_size();
|
||||
#endif
|
||||
|
||||
#if defined(CORE_KATMAI) || defined(CORE_COPPERMINE) || defined(CORE_BANIAS)
|
||||
size >>= 7;
|
||||
|
|
|
|||
|
|
@ -52,7 +52,9 @@
|
|||
cblas_zhpr, cblas_zscal, cblas_zswap, cblas_zsymm, cblas_zsyr2k, cblas_zsyrk,
|
||||
cblas_ztbmv, cblas_ztbsv, cblas_ztpmv, cblas_ztpsv, cblas_ztrmm, cblas_ztrmv, cblas_ztrsm,
|
||||
cblas_ztrsv, cblas_cdotc_sub, cblas_cdotu_sub, cblas_zdotc_sub, cblas_zdotu_sub,
|
||||
cblas_saxpby,cblas_daxpby,cblas_caxpby,cblas_zaxpby
|
||||
cblas_saxpby,cblas_daxpby,cblas_caxpby,cblas_zaxpby,
|
||||
cblas_somatcopy, cblas_domatcopy, cblas_comatcopy, cblas_zomatcopy,
|
||||
cblas_simatcopy, cblas_dimatcopy, cblas_cimatcopy, cblas_zimatcopy,
|
||||
);
|
||||
|
||||
@exblasobjs = (
|
||||
|
|
@ -73,7 +75,7 @@
|
|||
);
|
||||
|
||||
@gemm3mobjs = (
|
||||
|
||||
cgemm3m,zgemm3m
|
||||
);
|
||||
|
||||
|
||||
|
|
|
|||
|
|
@ -4,11 +4,11 @@ include $(TOPDIR)/Makefile.system
|
|||
SUPPORT_GEMM3M = 0
|
||||
|
||||
ifeq ($(ARCH), x86)
|
||||
SUPPORT_GEMM3M = 0
|
||||
SUPPORT_GEMM3M = 1
|
||||
endif
|
||||
|
||||
ifeq ($(ARCH), x86_64)
|
||||
SUPPORT_GEMM3M = 0
|
||||
SUPPORT_GEMM3M = 1
|
||||
endif
|
||||
|
||||
ifeq ($(ARCH), ia64)
|
||||
|
|
@ -128,9 +128,11 @@ ZBLAS3OBJS = \
|
|||
|
||||
ifeq ($(SUPPORT_GEMM3M), 1)
|
||||
|
||||
CBLAS3OBJS += cgemm3m.$(SUFFIX) csymm3m.$(SUFFIX) chemm3m.$(SUFFIX)
|
||||
# CBLAS3OBJS += cgemm3m.$(SUFFIX) csymm3m.$(SUFFIX) chemm3m.$(SUFFIX)
|
||||
CBLAS3OBJS += cgemm3m.$(SUFFIX)
|
||||
|
||||
ZBLAS3OBJS += zgemm3m.$(SUFFIX) zsymm3m.$(SUFFIX) zhemm3m.$(SUFFIX)
|
||||
# ZBLAS3OBJS += zgemm3m.$(SUFFIX) zsymm3m.$(SUFFIX) zhemm3m.$(SUFFIX)
|
||||
ZBLAS3OBJS += zgemm3m.$(SUFFIX)
|
||||
|
||||
endif
|
||||
|
||||
|
|
@ -267,7 +269,7 @@ CSBLAS2OBJS = \
|
|||
|
||||
CSBLAS3OBJS = \
|
||||
cblas_sgemm.$(SUFFIX) cblas_ssymm.$(SUFFIX) cblas_strmm.$(SUFFIX) cblas_strsm.$(SUFFIX) \
|
||||
cblas_ssyrk.$(SUFFIX) cblas_ssyr2k.$(SUFFIX)
|
||||
cblas_ssyrk.$(SUFFIX) cblas_ssyr2k.$(SUFFIX) cblas_somatcopy.$(SUFFIX) cblas_simatcopy.$(SUFFIX)
|
||||
|
||||
CDBLAS1OBJS = \
|
||||
cblas_idamax.$(SUFFIX) cblas_dasum.$(SUFFIX) cblas_daxpy.$(SUFFIX) \
|
||||
|
|
@ -283,7 +285,7 @@ CDBLAS2OBJS = \
|
|||
|
||||
CDBLAS3OBJS += \
|
||||
cblas_dgemm.$(SUFFIX) cblas_dsymm.$(SUFFIX) cblas_dtrmm.$(SUFFIX) cblas_dtrsm.$(SUFFIX) \
|
||||
cblas_dsyrk.$(SUFFIX) cblas_dsyr2k.$(SUFFIX)
|
||||
cblas_dsyrk.$(SUFFIX) cblas_dsyr2k.$(SUFFIX) cblas_domatcopy.$(SUFFIX) cblas_dimatcopy.$(SUFFIX)
|
||||
|
||||
CCBLAS1OBJS = \
|
||||
cblas_icamax.$(SUFFIX) cblas_scasum.$(SUFFIX) cblas_caxpy.$(SUFFIX) \
|
||||
|
|
@ -305,7 +307,9 @@ CCBLAS2OBJS = \
|
|||
CCBLAS3OBJS = \
|
||||
cblas_cgemm.$(SUFFIX) cblas_csymm.$(SUFFIX) cblas_ctrmm.$(SUFFIX) cblas_ctrsm.$(SUFFIX) \
|
||||
cblas_csyrk.$(SUFFIX) cblas_csyr2k.$(SUFFIX) \
|
||||
cblas_chemm.$(SUFFIX) cblas_cherk.$(SUFFIX) cblas_cher2k.$(SUFFIX)
|
||||
cblas_chemm.$(SUFFIX) cblas_cherk.$(SUFFIX) cblas_cher2k.$(SUFFIX) \
|
||||
cblas_comatcopy.$(SUFFIX) cblas_cimatcopy.$(SUFFIX)
|
||||
|
||||
|
||||
CZBLAS1OBJS = \
|
||||
cblas_izamax.$(SUFFIX) cblas_dzasum.$(SUFFIX) cblas_zaxpy.$(SUFFIX) \
|
||||
|
|
@ -327,7 +331,19 @@ CZBLAS2OBJS = \
|
|||
CZBLAS3OBJS = \
|
||||
cblas_zgemm.$(SUFFIX) cblas_zsymm.$(SUFFIX) cblas_ztrmm.$(SUFFIX) cblas_ztrsm.$(SUFFIX) \
|
||||
cblas_zsyrk.$(SUFFIX) cblas_zsyr2k.$(SUFFIX) \
|
||||
cblas_zhemm.$(SUFFIX) cblas_zherk.$(SUFFIX) cblas_zher2k.$(SUFFIX)
|
||||
cblas_zhemm.$(SUFFIX) cblas_zherk.$(SUFFIX) cblas_zher2k.$(SUFFIX)\
|
||||
cblas_zomatcopy.$(SUFFIX) cblas_zimatcopy.$(SUFFIX)
|
||||
|
||||
ifeq ($(SUPPORT_GEMM3M), 1)
|
||||
|
||||
# CBLAS3OBJS += cgemm3m.$(SUFFIX) csymm3m.$(SUFFIX) chemm3m.$(SUFFIX)
|
||||
CCBLAS3OBJS += cblas_cgemm3m.$(SUFFIX)
|
||||
|
||||
# ZBLAS3OBJS += zgemm3m.$(SUFFIX) zsymm3m.$(SUFFIX) zhemm3m.$(SUFFIX)
|
||||
CZBLAS3OBJS += cblas_zgemm3m.$(SUFFIX)
|
||||
|
||||
endif
|
||||
|
||||
|
||||
ifndef NO_CBLAS
|
||||
|
||||
|
|
@ -1771,6 +1787,13 @@ cblas_cher2k.$(SUFFIX) cblas_cher2k.$(PSUFFIX) : syr2k.c
|
|||
cblas_zher2k.$(SUFFIX) cblas_zher2k.$(PSUFFIX) : syr2k.c
|
||||
$(CC) -DCBLAS -c $(CFLAGS) -DHEMM $< -o $(@F)
|
||||
|
||||
cblas_cgemm3m.$(SUFFIX) cblas_cgemm3m.$(PSUFFIX) : gemm.c
|
||||
$(CC) -DCBLAS -c $(CFLAGS) -DGEMM3M $< -o $(@F)
|
||||
|
||||
cblas_zgemm3m.$(SUFFIX) cblas_zgemm3m.$(PSUFFIX) : gemm.c
|
||||
$(CC) -DCBLAS -c $(CFLAGS) -DGEMM3M $< -o $(@F)
|
||||
|
||||
|
||||
sgetf2.$(SUFFIX) sgetf2.$(PSUFFIX) : lapack/getf2.c
|
||||
$(CC) -c $(CFLAGS) $< -o $(@F)
|
||||
|
||||
|
|
@ -2035,25 +2058,49 @@ cblas_caxpby.$(SUFFIX) cblas_caxpby.$(PSUFFIX) : zaxpby.c
|
|||
domatcopy.$(SUFFIX) domatcopy.$(PSUFFIX) : omatcopy.c
|
||||
$(CC) -c $(CFLAGS) $< -o $(@F)
|
||||
|
||||
cblas_domatcopy.$(SUFFIX) cblas_domatcopy.$(PSUFFIX) : omatcopy.c
|
||||
$(CC) -c $(CFLAGS) -DCBLAS $< -o $(@F)
|
||||
|
||||
somatcopy.$(SUFFIX) somatcopy.$(PSUFFIX) : omatcopy.c
|
||||
$(CC) -c $(CFLAGS) $< -o $(@F)
|
||||
|
||||
cblas_somatcopy.$(SUFFIX) cblas_somatcopy.$(PSUFFIX) : omatcopy.c
|
||||
$(CC) -c $(CFLAGS) -DCBLAS $< -o $(@F)
|
||||
|
||||
comatcopy.$(SUFFIX) comatcopy.$(PSUFFIX) : zomatcopy.c
|
||||
$(CC) -c $(CFLAGS) $< -o $(@F)
|
||||
|
||||
cblas_comatcopy.$(SUFFIX) cblas_comatcopy.$(PSUFFIX) : zomatcopy.c
|
||||
$(CC) -c $(CFLAGS) -DCBLAS $< -o $(@F)
|
||||
|
||||
zomatcopy.$(SUFFIX) zomatcopy.$(PSUFFIX) : zomatcopy.c
|
||||
$(CC) -c $(CFLAGS) $< -o $(@F)
|
||||
|
||||
cblas_zomatcopy.$(SUFFIX) cblas_zomatcopy.$(PSUFFIX) : zomatcopy.c
|
||||
$(CC) -c $(CFLAGS) -DCBLAS $< -o $(@F)
|
||||
|
||||
dimatcopy.$(SUFFIX) dimatcopy.$(PSUFFIX) : imatcopy.c
|
||||
$(CC) -c $(CFLAGS) $< -o $(@F)
|
||||
|
||||
cblas_dimatcopy.$(SUFFIX) cblas_dimatcopy.$(PSUFFIX) : imatcopy.c
|
||||
$(CC) -c $(CFLAGS) -DCBLAS $< -o $(@F)
|
||||
|
||||
simatcopy.$(SUFFIX) simatcopy.$(PSUFFIX) : imatcopy.c
|
||||
$(CC) -c $(CFLAGS) $< -o $(@F)
|
||||
|
||||
cblas_simatcopy.$(SUFFIX) cblas_simatcopy.$(PSUFFIX) : imatcopy.c
|
||||
$(CC) -c $(CFLAGS) -DCBLAS $< -o $(@F)
|
||||
|
||||
cimatcopy.$(SUFFIX) cimatcopy.$(PSUFFIX) : zimatcopy.c
|
||||
$(CC) -c $(CFLAGS) $< -o $(@F)
|
||||
|
||||
cblas_cimatcopy.$(SUFFIX) cblas_cimatcopy.$(PSUFFIX) : zimatcopy.c
|
||||
$(CC) -c $(CFLAGS) -DCBLAS $< -o $(@F)
|
||||
|
||||
zimatcopy.$(SUFFIX) zimatcopy.$(PSUFFIX) : zimatcopy.c
|
||||
$(CC) -c $(CFLAGS) $< -o $(@F)
|
||||
|
||||
cblas_zimatcopy.$(SUFFIX) cblas_zimatcopy.$(PSUFFIX) : zimatcopy.c
|
||||
$(CC) -c $(CFLAGS) -DCBLAS $< -o $(@F)
|
||||
|
||||
|
||||
|
|
|
|||
|
|
@ -405,49 +405,12 @@ void CNAME(enum CBLAS_ORDER order, enum CBLAS_TRANSPOSE TransA, enum CBLAS_TRANS
|
|||
|
||||
#ifndef COMPLEX
|
||||
double MNK = (double) args.m * (double) args.n * (double) args.k;
|
||||
if ( MNK <= (16.0 * 1024.0 * (double) GEMM_MULTITHREAD_THRESHOLD) )
|
||||
if ( MNK <= (65536.0 * (double) GEMM_MULTITHREAD_THRESHOLD) )
|
||||
nthreads_max = 1;
|
||||
else
|
||||
{
|
||||
if ( MNK <= (2.0 * 65536.0 * (double) GEMM_MULTITHREAD_THRESHOLD) )
|
||||
{
|
||||
nthreads_max = 4;
|
||||
if ( args.m < 16 * GEMM_MULTITHREAD_THRESHOLD )
|
||||
{
|
||||
nthreads_max = 2;
|
||||
if ( args.m < 3 * GEMM_MULTITHREAD_THRESHOLD ) nthreads_max = 1;
|
||||
if ( args.n < 1 * GEMM_MULTITHREAD_THRESHOLD ) nthreads_max = 1;
|
||||
if ( args.k < 3 * GEMM_MULTITHREAD_THRESHOLD ) nthreads_max = 1;
|
||||
}
|
||||
else
|
||||
{
|
||||
if ( args.n <= 1 * GEMM_MULTITHREAD_THRESHOLD ) nthreads_max = 2;
|
||||
}
|
||||
}
|
||||
}
|
||||
#else
|
||||
double MNK = (double) args.m * (double) args.n * (double) args.k;
|
||||
if ( MNK <= (256.0 * (double) GEMM_MULTITHREAD_THRESHOLD) )
|
||||
if ( MNK <= (8192.0 * (double) GEMM_MULTITHREAD_THRESHOLD) )
|
||||
nthreads_max = 1;
|
||||
else
|
||||
{
|
||||
if ( MNK <= (16384.0 * (double) GEMM_MULTITHREAD_THRESHOLD) )
|
||||
{
|
||||
nthreads_max = 4;
|
||||
if ( args.m < 3 * GEMM_MULTITHREAD_THRESHOLD )
|
||||
{
|
||||
nthreads_max = 2;
|
||||
if ( args.m <= 1 * GEMM_MULTITHREAD_THRESHOLD ) nthreads_max = 1;
|
||||
if ( args.n < 1 * GEMM_MULTITHREAD_THRESHOLD ) nthreads_max = 1;
|
||||
if ( args.k < 1 * GEMM_MULTITHREAD_THRESHOLD ) nthreads_max = 1;
|
||||
}
|
||||
else
|
||||
{
|
||||
if ( args.n < 2 * GEMM_MULTITHREAD_THRESHOLD ) nthreads_max = 2;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#endif
|
||||
args.common = NULL;
|
||||
|
||||
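The hunk header (49 lines before, 12 after) suggests that the nested per-dimension checks are the removed side and that only a single size cutoff per branch remains. A minimal sketch of that simplified decision, under that reading of the hunk, is given below. The function name and parameters are hypothetical; `threshold` plays the role of `GEMM_MULTITHREAD_THRESHOLD`.

```c
#include <assert.h>

/* Sketch of the simplified cutoff, assuming the 12-line side of the hunk
 * is the post-change code. `is_complex` selects the COMPLEX branch.
 * Returns 1 when the problem is small enough to stay single-threaded,
 * otherwise returns nthreads_avail unchanged. */
static int gemm_max_threads(long m, long n, long k, int is_complex,
                            int threshold, int nthreads_avail)
{
    assert(threshold > 0 && nthreads_avail >= 1);
    double mnk = (double)m * (double)n * (double)k;
    double cutoff = (is_complex ? 8192.0 : 65536.0) * (double)threshold;
    return (mnk <= cutoff) ? 1 : nthreads_avail;
}
```

With a threshold of 4 (an example value), the real cutoff is 262144, which corresponds to a 64 x 64 x 64 product, and the complex cutoff is 32768, which corresponds to 32 x 32 x 32.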
|
|
|
|||
|
|
@ -216,7 +216,7 @@ void CNAME(enum CBLAS_ORDER order,
|
|||
int nthreads_avail = nthreads_max;
|
||||
|
||||
double MNK = (double) m * (double) n;
|
||||
if ( MNK <= (500.0 * 100.0 * (double) GEMM_MULTITHREAD_THRESHOLD) )
|
||||
if ( MNK <= (24.0 * 24.0 * (double) (GEMM_MULTITHREAD_THRESHOLD*GEMM_MULTITHREAD_THRESHOLD) ) )
|
||||
nthreads_max = 1;
|
||||
|
||||
if ( nthreads_max > nthreads_avail )
|
||||
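One of the two cutoff lines in this hunk scales with the square of `GEMM_MULTITHREAD_THRESHOLD` rather than linearly. Assuming the second line is the post-change one (the +/- markers are not visible in this rendering), the check can be sketched as follows. The function name is hypothetical.

```c
#include <assert.h>

/* Sketch of the gemv single-thread test, assuming the
 * 24 * 24 * threshold^2 form is the new cutoff. Returns nonzero when
 * the m x n problem should run on one thread. */
static int gemv_single_threaded(long m, long n, int threshold)
{
    assert(threshold > 0);
    double mn = (double)m * (double)n;
    double cutoff = 24.0 * 24.0 * (double)(threshold * threshold);
    return mn <= cutoff;
}
```

For an example threshold of 4 the cutoff is 9216 elements, so a 96 x 96 matrix is still handled by one thread while 97 x 96 is not.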
|
|
|
|||
|
|
@ -50,6 +50,7 @@ USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|||
#undef malloc
|
||||
#undef free
|
||||
|
||||
#ifndef CBLAS
|
||||
void NAME( char* ORDER, char* TRANS, blasint *rows, blasint *cols, FLOAT *alpha, FLOAT *a, blasint *lda, blasint *ldb)
|
||||
{
|
||||
|
||||
|
|
@ -71,6 +72,28 @@ void NAME( char* ORDER, char* TRANS, blasint *rows, blasint *cols, FLOAT *alpha,
|
|||
if ( Trans == 'R' ) trans = BlasNoTrans;
|
||||
if ( Trans == 'T' ) trans = BlasTrans;
|
||||
if ( Trans == 'C' ) trans = BlasTrans;
|
||||
#else
|
||||
void CNAME( enum CBLAS_ORDER CORDER, enum CBLAS_TRANSPOSE CTRANS, blasint crows, blasint ccols, FLOAT calpha, FLOAT *a, blasint clda, blasint cldb)
|
||||
{
|
||||
char Order, Trans;
|
||||
int order=-1,trans=-1;
|
||||
blasint info = -1;
|
||||
FLOAT *b;
|
||||
size_t msize;
|
||||
blasint *lda, *ldb, *rows, *cols;
|
||||
FLOAT *alpha;
|
||||
|
||||
if ( CORDER == CblasColMajor) order = BlasColMajor;
|
||||
if ( CORDER == CblasRowMajor) order = BlasRowMajor;
|
||||
if ( CTRANS == CblasNoTrans || CTRANS == CblasConjNoTrans) trans = BlasNoTrans;
|
||||
if ( CTRANS == CblasTrans || CTRANS == CblasConjTrans ) trans = BlasTrans;
|
||||
|
||||
rows = &crows;
|
||||
cols = &ccols;
|
||||
alpha = &calpha;
|
||||
lda = &clda;
|
||||
ldb = &cldb;
|
||||
#endif
|
||||
|
||||
if ( order == BlasColMajor)
|
||||
{
|
||||
|
|
|
|||
|
|
@ -47,6 +47,7 @@ USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|||
#define BlasNoTrans 0
|
||||
#define BlasTrans 1
|
||||
|
||||
#ifndef CBLAS
|
||||
void NAME( char* ORDER, char* TRANS, blasint *rows, blasint *cols, FLOAT *alpha, FLOAT *a, blasint *lda, FLOAT *b, blasint *ldb)
|
||||
{
|
||||
|
||||
|
|
@ -66,7 +67,27 @@ void NAME( char* ORDER, char* TRANS, blasint *rows, blasint *cols, FLOAT *alpha,
|
|||
if ( Trans == 'R' ) trans = BlasNoTrans;
|
||||
if ( Trans == 'T' ) trans = BlasTrans;
|
||||
if ( Trans == 'C' ) trans = BlasTrans;
|
||||
#else
|
||||
void CNAME(enum CBLAS_ORDER CORDER, enum CBLAS_TRANSPOSE CTRANS, blasint crows, blasint ccols, FLOAT calpha, FLOAT *a, blasint clda, FLOAT *b, blasint cldb)
|
||||
{
|
||||
blasint *rows, *cols, *lda, *ldb;
|
||||
FLOAT *alpha;
|
||||
int order=-1,trans=-1;
|
||||
blasint info = -1;
|
||||
|
||||
if ( CORDER == CblasColMajor ) order = BlasColMajor;
|
||||
if ( CORDER == CblasRowMajor ) order = BlasRowMajor;
|
||||
|
||||
if ( CTRANS == CblasNoTrans || CTRANS == CblasConjNoTrans ) trans = BlasNoTrans;
|
||||
if ( CTRANS == CblasTrans || CTRANS == CblasConjTrans ) trans = BlasTrans;
|
||||
|
||||
rows = &crows;
|
||||
cols = &ccols;
|
||||
lda = &clda;
|
||||
ldb = &cldb;
|
||||
alpha = &calpha;
|
||||
|
||||
#endif
|
||||
if ( order == BlasColMajor)
|
||||
{
|
||||
if ( trans == BlasNoTrans && *ldb < *rows ) info = 9;
|
||||
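The new CBLAS entry point takes its scalar arguments by value and then points `rows`, `cols`, `lda`, `ldb` and `alpha` at them, so the body shared with the Fortran-style entry can keep dereferencing the same names. As a reminder of what the routine is expected to compute, here is a reference sketch of the out-of-place copy, B = alpha * op(A), for the column-major real case. It is an assumption about the semantics based on the usual `omatcopy` definition, not the library's implementation, and the function name is hypothetical.

```c
#include <assert.h>

/* Reference semantics only (column-major, real double):
 * B = alpha * op(A), where A is rows x cols with leading dimension lda
 * and op is the identity (transpose == 0) or the transpose. */
static void ref_omatcopy_colmajor(int transpose, int rows, int cols,
                                  double alpha, const double *a, int lda,
                                  double *b, int ldb)
{
    assert(lda >= rows);
    assert(ldb >= (transpose ? cols : rows));   /* mirrors the info = 9 check */
    for (int j = 0; j < cols; j++)
        for (int i = 0; i < rows; i++) {
            double v = alpha * a[j * lda + i];
            if (transpose) b[i * ldb + j] = v;  /* B is cols x rows */
            else           b[j * ldb + i] = v;  /* B is rows x cols */
        }
}
```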
|
|
|
|||
|
|
@ -238,7 +238,7 @@ void CNAME(enum CBLAS_ORDER order,
|
|||
int nthreads_avail = nthreads_max;
|
||||
|
||||
double MNK = (double) m * (double) n;
|
||||
if ( MNK <= (80.0 * 20.0 * (double) GEMM_MULTITHREAD_THRESHOLD) )
|
||||
if ( MNK <= ( 256.0 * (double) (GEMM_MULTITHREAD_THRESHOLD * GEMM_MULTITHREAD_THRESHOLD) ))
|
||||
nthreads_max = 1;
|
||||
|
||||
if ( nthreads_max > nthreads_avail )
|
||||
|
|
|
|||
|
|
@ -49,6 +49,8 @@ USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|||
#define BlasTransConj 2
|
||||
#define BlasConj 3
|
||||
|
||||
|
||||
#ifndef CBLAS
|
||||
void NAME( char* ORDER, char* TRANS, blasint *rows, blasint *cols, FLOAT *alpha, FLOAT *a, blasint *lda, blasint *ldb)
|
||||
{
|
||||
|
||||
|
|
@ -71,6 +73,30 @@ void NAME( char* ORDER, char* TRANS, blasint *rows, blasint *cols, FLOAT *alpha,
|
|||
if ( Trans == 'C' ) trans = BlasTransConj;
|
||||
if ( Trans == 'R' ) trans = BlasConj;
|
||||
|
||||
#else
|
||||
void CNAME( enum CBLAS_ORDER CORDER, enum CBLAS_TRANSPOSE CTRANS, blasint crows, blasint ccols, FLOAT *alpha, FLOAT *a, blasint clda, blasint cldb)
|
||||
{
|
||||
|
||||
blasint *rows, *cols, *lda, *ldb;
|
||||
int order=-1,trans=-1;
|
||||
blasint info = -1;
|
||||
FLOAT *b;
|
||||
size_t msize;
|
||||
|
||||
if ( CORDER == CblasColMajor ) order = BlasColMajor;
|
||||
if ( CORDER == CblasRowMajor ) order = BlasRowMajor;
|
||||
|
||||
if ( CTRANS == CblasNoTrans) trans = BlasNoTrans;
|
||||
if ( CTRANS == CblasConjNoTrans ) trans = BlasConj;
|
||||
if ( CTRANS == CblasTrans) trans = BlasTrans;
|
||||
if ( CTRANS == CblasConjTrans) trans = BlasTransConj;
|
||||
|
||||
rows = &crows;
|
||||
cols = &ccols;
|
||||
lda = &clda;
|
||||
ldb = &cldb;
|
||||
#endif
|
||||
|
||||
if ( order == BlasColMajor)
|
||||
{
|
||||
if ( trans == BlasNoTrans && *ldb < *rows ) info = 9;
|
||||
|
|
|
|||
|
|
@ -49,6 +49,7 @@ USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|||
#define BlasTransConj 2
|
||||
#define BlasConj 3
|
||||
|
||||
#ifndef CBLAS
|
||||
void NAME( char* ORDER, char* TRANS, blasint *rows, blasint *cols, FLOAT *alpha, FLOAT *a, blasint *lda, FLOAT *b, blasint *ldb)
|
||||
{
|
||||
|
||||
|
|
@ -69,6 +70,26 @@ void NAME( char* ORDER, char* TRANS, blasint *rows, blasint *cols, FLOAT *alpha,
|
|||
if ( Trans == 'C' ) trans = BlasTransConj;
|
||||
if ( Trans == 'R' ) trans = BlasConj;
|
||||
|
||||
#else
|
||||
void CNAME(enum CBLAS_ORDER CORDER, enum CBLAS_TRANSPOSE CTRANS, blasint crows, blasint ccols, FLOAT *alpha, FLOAT *a, blasint clda, FLOAT*b, blasint cldb)
|
||||
{
|
||||
blasint *rows, *cols, *lda, *ldb;
|
||||
int order=-1,trans=-1;
|
||||
blasint info = -1;
|
||||
|
||||
if ( CORDER == CblasColMajor ) order = BlasColMajor;
|
||||
if ( CORDER == CblasRowMajor ) order = BlasRowMajor;
|
||||
|
||||
if ( CTRANS == CblasNoTrans) trans = BlasNoTrans;
|
||||
if ( CTRANS == CblasConjNoTrans ) trans = BlasConj;
|
||||
if ( CTRANS == CblasTrans) trans = BlasTrans;
|
||||
if ( CTRANS == CblasConjTrans) trans = BlasTransConj;
|
||||
|
||||
rows = &crows;
|
||||
cols = &ccols;
|
||||
lda = &clda;
|
||||
ldb = &cldb;
|
||||
#endif
|
||||
if ( order == BlasColMajor)
|
||||
{
|
||||
if ( trans == BlasNoTrans && *ldb < *rows ) info = 9;
|
||||
|
|
|
|||
|
|
@ -0,0 +1,70 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2013, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
#include "common.h"
|
||||
|
||||
int CNAME(BLASLONG m, BLASLONG offset, FLOAT alpha, FLOAT *a, BLASLONG lda, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y, FLOAT *buffer)
|
||||
{
|
||||
BLASLONG i;
|
||||
BLASLONG ix,iy;
|
||||
BLASLONG jx,jy;
|
||||
BLASLONG j;
|
||||
FLOAT temp1;
|
||||
FLOAT temp2;
|
||||
|
||||
#if 0
|
||||
if ( m != offset )
|
||||
printf("Symv_L: m=%d offset=%d\n",m,offset);
|
||||
#endif
|
||||
|
||||
jx = 0;
|
||||
jy = 0;
|
||||
|
||||
for (j=0; j<offset; j++)
|
||||
{
|
||||
temp1 = alpha * x[jx];
|
||||
temp2 = 0.0;
|
||||
y[jy] += temp1 * a[j*lda+j];
|
||||
iy = jy;
|
||||
ix = jx;
|
||||
for (i=j+1; i<m; i++)
|
||||
{
|
||||
ix += inc_x;
|
||||
iy += inc_y;
|
||||
y[iy] += temp1 * a[j*lda+i];
|
||||
temp2 += a[j*lda+i] * x[ix];
|
||||
|
||||
}
|
||||
y[jy] += alpha * temp2;
|
||||
jx += inc_x;
|
||||
jy += inc_y;
|
||||
}
|
||||
return(0);
|
||||
}
|
||||
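The kernel above touches only the lower triangle of `a`: each off-diagonal element is used twice, once to update `y[iy]` and once to accumulate `temp2` for the mirrored entry. A self-contained sketch with the same loop structure, specialised to `double` and to `offset == m` (both simplifications made for this example), makes that easy to check on a small matrix. The function name is hypothetical.

```c
#include <assert.h>

/* Same loop structure as the kernel above, for double and offset == m.
 * Computes y += alpha * A * x, reading only the lower triangle of the
 * column-major matrix `a`. */
static void symv_lower_ref(int m, double alpha, const double *a, int lda,
                           const double *x, int inc_x, double *y, int inc_y)
{
    assert(m >= 0 && lda >= m);
    int jx = 0, jy = 0;
    for (int j = 0; j < m; j++) {
        double temp1 = alpha * x[jx];
        double temp2 = 0.0;
        y[jy] += temp1 * a[j * lda + j];        /* diagonal term */
        int ix = jx, iy = jy;
        for (int i = j + 1; i < m; i++) {
            ix += inc_x;
            iy += inc_y;
            y[iy] += temp1 * a[j * lda + i];    /* column j, below diagonal */
            temp2 += a[j * lda + i] * x[ix];    /* mirrored row j */
        }
        y[jy] += alpha * temp2;
        jx += inc_x;
        jy += inc_y;
    }
}
```

Filling the unused upper triangle with an arbitrary value is a simple way to confirm the routine never reads it.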
|
||||
|
||||
|
|
@ -0,0 +1,71 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2013, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
#include "common.h"
|
||||
|
||||
int CNAME(BLASLONG m, BLASLONG offset, FLOAT alpha, FLOAT *a, BLASLONG lda, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y, FLOAT *buffer)
|
||||
{
|
||||
BLASLONG i;
|
||||
BLASLONG ix,iy;
|
||||
BLASLONG jx,jy;
|
||||
BLASLONG j;
|
||||
FLOAT temp1;
|
||||
FLOAT temp2;
|
||||
|
||||
#if 0
|
||||
if( m != offset )
|
||||
printf("Symv_U: m=%d offset=%d\n",m,offset);
|
||||
#endif
|
||||
|
||||
BLASLONG m1 = m - offset;
|
||||
|
||||
jx = m1 * inc_x;
|
||||
jy = m1 * inc_y;
|
||||
|
||||
for (j=m1; j<m; j++)
|
||||
{
|
||||
temp1 = alpha * x[jx];
|
||||
temp2 = 0.0;
|
||||
iy = 0;
|
||||
ix = 0;
|
||||
for (i=0; i<j; i++)
|
||||
{
|
||||
y[iy] += temp1 * a[j*lda+i];
|
||||
temp2 += a[j*lda+i] * x[ix];
|
||||
ix += inc_x;
|
||||
iy += inc_y;
|
||||
|
||||
}
|
||||
y[jy] += temp1 * a[j*lda+j] + alpha * temp2;
|
||||
jx += inc_x;
|
||||
jy += inc_y;
|
||||
}
|
||||
return(0);
|
||||
}
|
||||
|
||||
|
||||
|
|
@ -293,6 +293,14 @@ gotoblas_t TABLE_NAME = {
|
|||
#endif
|
||||
chemm_outcopyTS, chemm_oltcopyTS,
|
||||
|
||||
0, 0, 0,
|
||||
#ifdef CGEMM3M_DEFAULT_UNROLL_M
|
||||
CGEMM3M_DEFAULT_UNROLL_M, CGEMM3M_DEFAULT_UNROLL_N, MAX(CGEMM3M_DEFAULT_UNROLL_M, CGEMM3M_DEFAULT_UNROLL_N),
|
||||
#else
|
||||
SGEMM_DEFAULT_UNROLL_M, SGEMM_DEFAULT_UNROLL_N, MAX(SGEMM_DEFAULT_UNROLL_M, SGEMM_DEFAULT_UNROLL_N),
|
||||
#endif
|
||||
|
||||
|
||||
cgemm3m_kernelTS,
|
||||
|
||||
cgemm3m_incopybTS, cgemm3m_incopyrTS,
|
||||
|
|
@ -391,6 +399,14 @@ gotoblas_t TABLE_NAME = {
|
|||
#endif
|
||||
zhemm_outcopyTS, zhemm_oltcopyTS,
|
||||
|
||||
0, 0, 0,
|
||||
#ifdef ZGEMM3M_DEFAULT_UNROLL_M
|
||||
ZGEMM3M_DEFAULT_UNROLL_M, ZGEMM3M_DEFAULT_UNROLL_N, MAX(ZGEMM3M_DEFAULT_UNROLL_M, ZGEMM3M_DEFAULT_UNROLL_N),
|
||||
#else
|
||||
DGEMM_DEFAULT_UNROLL_M, DGEMM_DEFAULT_UNROLL_N, MAX(DGEMM_DEFAULT_UNROLL_M, DGEMM_DEFAULT_UNROLL_N),
|
||||
#endif
|
||||
|
||||
|
||||
zgemm3m_kernelTS,
|
||||
|
||||
zgemm3m_incopybTS, zgemm3m_incopyrTS,
|
||||
|
|
@ -486,6 +502,9 @@ gotoblas_t TABLE_NAME = {
|
|||
#endif
|
||||
xhemm_outcopyTS, xhemm_oltcopyTS,
|
||||
|
||||
0, 0, 0,
|
||||
QGEMM_DEFAULT_UNROLL_M, QGEMM_DEFAULT_UNROLL_N, MAX(QGEMM_DEFAULT_UNROLL_M, QGEMM_DEFAULT_UNROLL_N),
|
||||
|
||||
xgemm3m_kernelTS,
|
||||
|
||||
xgemm3m_incopybTS, xgemm3m_incopyrTS,
|
||||
|
|
@ -661,9 +680,23 @@ static void init_parameter(void) {
|
|||
TABLE_NAME.dgemm_q = DGEMM_DEFAULT_Q;
|
||||
TABLE_NAME.cgemm_q = CGEMM_DEFAULT_Q;
|
||||
TABLE_NAME.zgemm_q = ZGEMM_DEFAULT_Q;
|
||||
|
||||
#ifdef CGEMM3M_DEFAULT_Q
|
||||
TABLE_NAME.cgemm3m_q = CGEMM3M_DEFAULT_Q;
|
||||
#else
|
||||
TABLE_NAME.cgemm3m_q = SGEMM_DEFAULT_Q;
|
||||
#endif
|
||||
|
||||
#ifdef ZGEMM3M_DEFAULT_Q
|
||||
TABLE_NAME.zgemm3m_q = ZGEMM3M_DEFAULT_Q;
|
||||
#else
|
||||
TABLE_NAME.zgemm3m_q = DGEMM_DEFAULT_Q;
|
||||
#endif
|
||||
|
||||
#ifdef EXPRECISION
|
||||
TABLE_NAME.qgemm_q = QGEMM_DEFAULT_Q;
|
||||
TABLE_NAME.xgemm_q = XGEMM_DEFAULT_Q;
|
||||
TABLE_NAME.xgemm3m_q = QGEMM_DEFAULT_Q;
|
||||
#endif
|
||||
|
||||
#if defined(CORE_KATMAI) || defined(CORE_COPPERMINE) || defined(CORE_BANIAS) || defined(CORE_YONAH) || defined(CORE_ATHLON)
|
||||
|
|
@ -918,20 +951,56 @@ static void init_parameter(void) {
|
|||
TABLE_NAME.dgemm_p = DGEMM_DEFAULT_P;
|
||||
TABLE_NAME.cgemm_p = CGEMM_DEFAULT_P;
|
||||
TABLE_NAME.zgemm_p = ZGEMM_DEFAULT_P;
|
||||
|
||||
|
||||
|
||||
#ifdef EXPRECISION
|
||||
TABLE_NAME.qgemm_p = QGEMM_DEFAULT_P;
|
||||
TABLE_NAME.xgemm_p = XGEMM_DEFAULT_P;
|
||||
#endif
|
||||
|
||||
#endif
|
||||
|
||||
|
||||
#ifdef CGEMM3M_DEFAULT_P
|
||||
TABLE_NAME.cgemm3m_p = CGEMM3M_DEFAULT_P;
|
||||
#else
|
||||
TABLE_NAME.cgemm3m_p = TABLE_NAME.sgemm_p;
|
||||
#endif
|
||||
|
||||
#ifdef ZGEMM3M_DEFAULT_P
|
||||
TABLE_NAME.zgemm3m_p = ZGEMM3M_DEFAULT_P;
|
||||
#else
|
||||
TABLE_NAME.zgemm3m_p = TABLE_NAME.dgemm_p;
|
||||
#endif
|
||||
|
||||
#ifdef EXPRECISION
|
||||
TABLE_NAME.xgemm3m_p = TABLE_NAME.qgemm_p;
|
||||
#endif
|
||||
|
||||
|
||||
|
||||
TABLE_NAME.sgemm_p = (TABLE_NAME.sgemm_p + SGEMM_DEFAULT_UNROLL_M - 1) & ~(SGEMM_DEFAULT_UNROLL_M - 1);
|
||||
TABLE_NAME.dgemm_p = (TABLE_NAME.dgemm_p + DGEMM_DEFAULT_UNROLL_M - 1) & ~(DGEMM_DEFAULT_UNROLL_M - 1);
|
||||
TABLE_NAME.cgemm_p = (TABLE_NAME.cgemm_p + CGEMM_DEFAULT_UNROLL_M - 1) & ~(CGEMM_DEFAULT_UNROLL_M - 1);
|
||||
TABLE_NAME.zgemm_p = (TABLE_NAME.zgemm_p + ZGEMM_DEFAULT_UNROLL_M - 1) & ~(ZGEMM_DEFAULT_UNROLL_M - 1);
|
||||
|
||||
#ifdef CGEMM3M_DEFAULT_UNROLL_M
|
||||
TABLE_NAME.cgemm3m_p = (TABLE_NAME.cgemm3m_p + CGEMM3M_DEFAULT_UNROLL_M - 1) & ~(CGEMM3M_DEFAULT_UNROLL_M - 1);
|
||||
#else
|
||||
TABLE_NAME.cgemm3m_p = (TABLE_NAME.cgemm3m_p + SGEMM_DEFAULT_UNROLL_M - 1) & ~(SGEMM_DEFAULT_UNROLL_M - 1);
|
||||
#endif
|
||||
|
||||
#ifdef ZGEMM3M_DEFAULT_UNROLL_M
|
||||
TABLE_NAME.zgemm3m_p = (TABLE_NAME.zgemm3m_p + ZGEMM3M_DEFAULT_UNROLL_M - 1) & ~(ZGEMM3M_DEFAULT_UNROLL_M - 1);
|
||||
#else
|
||||
TABLE_NAME.zgemm3m_p = (TABLE_NAME.zgemm3m_p + DGEMM_DEFAULT_UNROLL_M - 1) & ~(DGEMM_DEFAULT_UNROLL_M - 1);
|
||||
#endif
|
||||
|
||||
#ifdef QUAD_PRECISION
|
||||
TABLE_NAME.qgemm_p = (TABLE_NAME.qgemm_p + QGEMM_DEFAULT_UNROLL_M - 1) & ~(QGEMM_DEFAULT_UNROLL_M - 1);
|
||||
TABLE_NAME.xgemm_p = (TABLE_NAME.xgemm_p + XGEMM_DEFAULT_UNROLL_M - 1) & ~(XGEMM_DEFAULT_UNROLL_M - 1);
|
||||
TABLE_NAME.xgemm3m_p = (TABLE_NAME.xgemm3m_p + QGEMM_DEFAULT_UNROLL_M - 1) & ~(QGEMM_DEFAULT_UNROLL_M - 1);
|
||||
#endif
|
||||
|
||||
#ifdef DEBUG
|
||||
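The `(p + UNROLL_M - 1) & ~(UNROLL_M - 1)` expressions in this hunk round each blocking parameter up to a multiple of the unroll factor. The mask form is only correct when the unroll factor is a power of two, which the following sketch makes explicit. The function name is hypothetical.

```c
#include <assert.h>

/* Round p up to the next multiple of unroll using the same mask trick as
 * init_parameter. Valid only when unroll is a power of two. */
static long round_up_pow2(long p, long unroll)
{
    assert(unroll > 0 && (unroll & (unroll - 1)) == 0);
    return (p + unroll - 1) & ~(unroll - 1);
}
```

For example, with an unroll factor of 16 a value of 770 becomes 784, while 768 is already aligned and stays unchanged.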
|
|
@ -965,11 +1034,32 @@ static void init_parameter(void) {
|
|||
+ TABLE_NAME.align) & ~TABLE_NAME.align)
|
||||
) / (TABLE_NAME.zgemm_q * 16) - 15) & ~15);
|
||||
|
||||
TABLE_NAME.cgemm3m_r = (((BUFFER_SIZE -
|
||||
((TABLE_NAME.cgemm3m_p * TABLE_NAME.cgemm3m_q * 8 + TABLE_NAME.offsetA
|
||||
+ TABLE_NAME.align) & ~TABLE_NAME.align)
|
||||
) / (TABLE_NAME.cgemm3m_q * 8) - 15) & ~15);
|
||||
|
||||
TABLE_NAME.zgemm3m_r = (((BUFFER_SIZE -
|
||||
((TABLE_NAME.zgemm3m_p * TABLE_NAME.zgemm3m_q * 16 + TABLE_NAME.offsetA
|
||||
+ TABLE_NAME.align) & ~TABLE_NAME.align)
|
||||
) / (TABLE_NAME.zgemm3m_q * 16) - 15) & ~15);
|
||||
|
||||
|
||||
|
||||
|
||||
#ifdef EXPRECISION
|
||||
TABLE_NAME.xgemm_r = (((BUFFER_SIZE -
|
||||
((TABLE_NAME.xgemm_p * TABLE_NAME.xgemm_q * 32 + TABLE_NAME.offsetA
|
||||
+ TABLE_NAME.align) & ~TABLE_NAME.align)
|
||||
) / (TABLE_NAME.xgemm_q * 32) - 15) & ~15);
|
||||
|
||||
TABLE_NAME.xgemm3m_r = (((BUFFER_SIZE -
|
||||
((TABLE_NAME.xgemm3m_p * TABLE_NAME.xgemm3m_q * 32 + TABLE_NAME.offsetA
|
||||
+ TABLE_NAME.align) & ~TABLE_NAME.align)
|
||||
) / (TABLE_NAME.xgemm3m_q * 32) - 15) & ~15);
|
||||
|
||||
#endif
|
||||
|
||||
|
||||
|
||||
}
|
||||
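The `cgemm3m_r` and `zgemm3m_r` expressions follow the same pattern as the existing `*gemm_r` ones: subtract the aligned size of the packed A panel from the buffer, divide the remainder by the byte size of one packed column, then round down to a multiple of 16 with a small margin. A sketch of that arithmetic with plain parameters follows. All names and the sample values used with it are assumptions for illustration, not values taken from the library.

```c
#include <assert.h>

/* Sketch of the gemm3m_r formula. elem_size is 8 in the cgemm3m_r
 * expression and 16 in the zgemm3m_r one. `align` is a mask of low bits,
 * as implied by the `& ~align` in the diff. */
static long gemm_r(long buffer_size, long p, long q, long elem_size,
                   long offset_a, long align)
{
    assert(q > 0 && elem_size > 0);
    /* Bytes used by the packed A panel, rounded up to the alignment. */
    long a_panel = (p * q * elem_size + offset_a + align) & ~align;
    /* Columns that fit in what is left, kept to a multiple of 16. */
    return ((buffer_size - a_panel) / (q * elem_size) - 15) & ~15L;
}
```

With a 32 MiB buffer, p = q = 256, 8-byte elements, no offset and an alignment mask of 0x3fff (all example values), the result is 16112.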
|
|
|
|||
|
|
@ -1,8 +1,20 @@
|
|||
SGEMVNKERNEL = sgemv_n.c
|
||||
SGEMVTKERNEL = sgemv_t.c
|
||||
DAXPYKERNEL = daxpy.c
|
||||
CAXPYKERNEL = caxpy.c
|
||||
ZAXPYKERNEL = zaxpy.c
|
||||
|
||||
SDOTKERNEL = sdot.c
|
||||
DDOTKERNEL = ddot.c
|
||||
|
||||
DSYMV_U_KERNEL = dsymv_U.c
|
||||
DSYMV_L_KERNEL = dsymv_L.c
|
||||
SSYMV_U_KERNEL = ssymv_U.c
|
||||
SSYMV_L_KERNEL = ssymv_L.c
|
||||
|
||||
SGEMVNKERNEL = sgemv_n_4.c
|
||||
SGEMVTKERNEL = sgemv_t_4.c
|
||||
|
||||
ZGEMVNKERNEL = zgemv_n_dup.S
|
||||
ZGEMVTKERNEL = zgemv_t.c
|
||||
ZGEMVTKERNEL = zgemv_t_4.c
|
||||
|
||||
DGEMVNKERNEL = dgemv_n_bulldozer.S
|
||||
DGEMVTKERNEL = dgemv_t_bulldozer.S
|
||||
|
|
|
|||
|
|
@ -1,14 +1,14 @@
|
|||
SGEMVNKERNEL = sgemv_n.c
|
||||
SGEMVTKERNEL = sgemv_t.c
|
||||
SGEMVNKERNEL = sgemv_n_4.c
|
||||
SGEMVTKERNEL = sgemv_t_4.c
|
||||
|
||||
DGEMVNKERNEL = dgemv_n.c
|
||||
DGEMVTKERNEL = dgemv_t.c
|
||||
DGEMVNKERNEL = dgemv_n_4.c
|
||||
DGEMVTKERNEL = dgemv_t_4.c
|
||||
|
||||
ZGEMVNKERNEL = zgemv_n.c
|
||||
ZGEMVTKERNEL = zgemv_t.c
|
||||
ZGEMVNKERNEL = zgemv_n_4.c
|
||||
ZGEMVTKERNEL = zgemv_t_4.c
|
||||
|
||||
CGEMVNKERNEL = cgemv_n.c
|
||||
CGEMVTKERNEL = cgemv_t.c
|
||||
CGEMVNKERNEL = cgemv_n_4.c
|
||||
CGEMVTKERNEL = cgemv_t_4.c
|
||||
|
||||
SGEMMKERNEL = sgemm_kernel_16x4_haswell.S
|
||||
SGEMMINCOPY = ../generic/gemm_ncopy_16.c
|
||||
|
|
|
|||
|
|
@@ -1,5 +1,17 @@
|
|||
SGEMVNKERNEL = sgemv_n.c
|
||||
SGEMVTKERNEL = sgemv_t.c
|
||||
SAXPYKERNEL = saxpy.c
|
||||
DAXPYKERNEL = daxpy.c
|
||||
|
||||
SDOTKERNEL = sdot.c
|
||||
DDOTKERNEL = ddot.c
|
||||
|
||||
DSYMV_U_KERNEL = dsymv_U.c
|
||||
DSYMV_L_KERNEL = dsymv_L.c
|
||||
SSYMV_U_KERNEL = ssymv_U.c
|
||||
SSYMV_L_KERNEL = ssymv_L.c
|
||||
|
||||
SGEMVNKERNEL = sgemv_n_4.c
|
||||
SGEMVTKERNEL = sgemv_t_4.c
|
||||
DGEMVNKERNEL = dgemv_n_4.c
|
||||
|
||||
SGEMMKERNEL = gemm_kernel_4x8_nehalem.S
|
||||
SGEMMINCOPY = gemm_ncopy_4.S
|
||||
|
|
|
|||
|
|
@@ -1,11 +1,12 @@
|
|||
SGEMVNKERNEL = sgemv_n.c
|
||||
SGEMVTKERNEL = sgemv_t.c
|
||||
SGEMVNKERNEL = sgemv_n_4.c
|
||||
SGEMVTKERNEL = sgemv_t_4.c
|
||||
|
||||
ZGEMVNKERNEL = zgemv_n_dup.S
|
||||
ZGEMVTKERNEL = zgemv_t.S
|
||||
ZGEMVTKERNEL = zgemv_t_4.c
|
||||
|
||||
DGEMVNKERNEL = dgemv_n_bulldozer.S
|
||||
DGEMVTKERNEL = dgemv_t_bulldozer.S
|
||||
|
||||
DDOTKERNEL = ddot_bulldozer.S
|
||||
DCOPYKERNEL = dcopy_bulldozer.S
|
||||
|
||||
|
|
|
|||
|
|
@@ -1,7 +1,7 @@
|
|||
SGEMVNKERNEL = sgemv_n.c
|
||||
SGEMVTKERNEL = sgemv_t.c
|
||||
SGEMVNKERNEL = sgemv_n_4.c
|
||||
SGEMVTKERNEL = sgemv_t_4.c
|
||||
|
||||
ZGEMVNKERNEL = zgemv_n.c
|
||||
ZGEMVNKERNEL = zgemv_n_4.c
|
||||
|
||||
|
||||
SGEMMKERNEL = sgemm_kernel_16x4_sandy.S
|
||||
|
|
|
|||
|
|
@@ -0,0 +1,131 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
#include "common.h"
|
||||
|
||||
|
||||
#if defined(BULLDOZER)
|
||||
#include "caxpy_microk_bulldozer-2.c"
|
||||
#endif
|
||||
|
||||
|
||||
#ifndef HAVE_KERNEL_8
|
||||
|
||||
static void caxpy_kernel_8(BLASLONG n, FLOAT *x, FLOAT *y, FLOAT *alpha)
|
||||
{
|
||||
BLASLONG register i = 0;
|
||||
BLASLONG register ix = 0;
|
||||
FLOAT da_r = alpha[0];
|
||||
FLOAT da_i = alpha[1];
|
||||
|
||||
|
||||
while(i < n)
|
||||
{
|
||||
#if !defined(CONJ)
|
||||
y[ix] += ( da_r * x[ix] - da_i * x[ix+1] ) ;
|
||||
y[ix+1] += ( da_r * x[ix+1] + da_i * x[ix] ) ;
|
||||
y[ix+2] += ( da_r * x[ix+2] - da_i * x[ix+3] ) ;
|
||||
y[ix+3] += ( da_r * x[ix+3] + da_i * x[ix+2] ) ;
|
||||
#else
|
||||
y[ix] += ( da_r * x[ix] + da_i * x[ix+1] ) ;
|
||||
y[ix+1] -= ( da_r * x[ix+1] - da_i * x[ix] ) ;
|
||||
y[ix+2] += ( da_r * x[ix+2] + da_i * x[ix+3] ) ;
|
||||
y[ix+3] -= ( da_r * x[ix+3] - da_i * x[ix+2] ) ;
|
||||
#endif
|
||||
|
||||
ix+=4 ;
|
||||
i+=2 ;
|
||||
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
int CNAME(BLASLONG n, BLASLONG dummy0, BLASLONG dummy1, FLOAT da_r, FLOAT da_i, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y, FLOAT *dummy, BLASLONG dummy2)
|
||||
{
|
||||
BLASLONG i=0;
|
||||
BLASLONG ix=0,iy=0;
|
||||
FLOAT da[2];
|
||||
|
||||
if ( n <= 0 ) return(0);
|
||||
|
||||
if ( (inc_x == 1) && (inc_y == 1) )
|
||||
{
|
||||
|
||||
BLASLONG n1 = n & -8;
|
||||
|
||||
if ( n1 )
|
||||
{
|
||||
da[0] = da_r;
|
||||
da[1] = da_i;
|
||||
caxpy_kernel_8(n1, x, y, da);
|
||||
ix = 2 * n1;
|
||||
}
|
||||
i = n1;
|
||||
while(i < n)
|
||||
{
|
||||
#if !defined(CONJ)
|
||||
y[ix] += ( da_r * x[ix] - da_i * x[ix+1] ) ;
|
||||
y[ix+1] += ( da_r * x[ix+1] + da_i * x[ix] ) ;
|
||||
#else
|
||||
y[ix] += ( da_r * x[ix] + da_i * x[ix+1] ) ;
|
||||
y[ix+1] -= ( da_r * x[ix+1] - da_i * x[ix] ) ;
|
||||
#endif
|
||||
i++ ;
|
||||
ix += 2;
|
||||
|
||||
}
|
||||
return(0);
|
||||
|
||||
|
||||
}
|
||||
|
||||
inc_x *=2;
|
||||
inc_y *=2;
|
||||
|
||||
while(i < n)
|
||||
{
|
||||
|
||||
#if !defined(CONJ)
|
||||
y[iy] += ( da_r * x[ix] - da_i * x[ix+1] ) ;
|
||||
y[iy+1] += ( da_r * x[ix+1] + da_i * x[ix] ) ;
|
||||
#else
|
||||
y[iy] += ( da_r * x[ix] + da_i * x[ix+1] ) ;
|
||||
y[iy+1] -= ( da_r * x[ix+1] - da_i * x[ix] ) ;
|
||||
#endif
|
||||
ix += inc_x ;
|
||||
iy += inc_y ;
|
||||
i++ ;
|
||||
|
||||
}
|
||||
return(0);
|
||||
|
||||
}
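As a cross-check for the kernel above, the following is a minimal scalar reference for the same update on interleaved (real, imaginary) storage with unit stride. It is only a sketch: the name `caxpy_ref` and the runtime `conj` flag are illustrative assumptions, whereas the file itself selects the conjugated variant at compile time through `CONJ`.

```c
#include <assert.h>
#include <stddef.h>

/* y += alpha * x        when conj == 0
   y += alpha * conj(x)  when conj != 0
   Complex values are stored as interleaved (re, im) float pairs. */
void caxpy_ref(size_t n, float alpha_r, float alpha_i,
               const float *x, float *y, int conj)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) {
        float xr = x[2 * i];
        float xi = x[2 * i + 1];
        if (!conj) {
            y[2 * i]     += alpha_r * xr - alpha_i * xi;
            y[2 * i + 1] += alpha_r * xi + alpha_i * xr;
        } else {
            /* Same sign pattern as the #else (CONJ) branch in the diff. */
            y[2 * i]     += alpha_r * xr + alpha_i * xi;
            y[2 * i + 1] -= alpha_r * xi - alpha_i * xr;
        }
    }
}
```

For example, with alpha = 2+3i and x = 1+i, the plain variant adds -1+5i to y and the conjugated variant adds 5+i.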
|
||||
|
||||
|
||||
|
|
@@ -0,0 +1,135 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#define HAVE_KERNEL_8 1
|
||||
static void caxpy_kernel_8( BLASLONG n, FLOAT *x, FLOAT *y , FLOAT *alpha) __attribute__ ((noinline));
|
||||
|
||||
static void caxpy_kernel_8( BLASLONG n, FLOAT *x, FLOAT *y, FLOAT *alpha)
|
||||
{
|
||||
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"vbroadcastss (%4), %%xmm0 \n\t" // real part of alpha
|
||||
"vbroadcastss 4(%4), %%xmm1 \n\t" // imag part of alpha
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
|
||||
"prefetcht0 768(%2,%0,4) \n\t"
|
||||
"vmovups (%2,%0,4), %%xmm5 \n\t" // 2 complex values from x
|
||||
"vmovups 16(%2,%0,4), %%xmm7 \n\t" // 2 complex values from x
|
||||
"vmovups 32(%2,%0,4), %%xmm9 \n\t" // 2 complex values from x
|
||||
"vmovups 48(%2,%0,4), %%xmm11 \n\t" // 2 complex values from x
|
||||
"prefetcht0 768(%3,%0,4) \n\t"
|
||||
|
||||
#if !defined(CONJ)
|
||||
"vfmaddps (%3,%0,4), %%xmm0 , %%xmm5, %%xmm12 \n\t"
|
||||
"vpermilps $0xb1 , %%xmm5 , %%xmm4 \n\t" // exchange real and imag part
|
||||
"vmulps %%xmm1, %%xmm4 , %%xmm4 \n\t"
|
||||
|
||||
"vfmaddps 16(%3,%0,4), %%xmm0 , %%xmm7, %%xmm13 \n\t"
|
||||
"vpermilps $0xb1 , %%xmm7 , %%xmm6 \n\t" // exchange real and imag part
|
||||
"vmulps %%xmm1, %%xmm6 , %%xmm6 \n\t"
|
||||
|
||||
"vfmaddps 32(%3,%0,4), %%xmm0 , %%xmm9, %%xmm14 \n\t"
|
||||
"vpermilps $0xb1 , %%xmm9 , %%xmm8 \n\t" // exchange real and imag part
|
||||
"vmulps %%xmm1, %%xmm8 , %%xmm8 \n\t"
|
||||
|
||||
"vfmaddps 48(%3,%0,4), %%xmm0 , %%xmm11,%%xmm15 \n\t"
|
||||
"vpermilps $0xb1 , %%xmm11, %%xmm10 \n\t" // exchange real and imag part
|
||||
"vmulps %%xmm1, %%xmm10, %%xmm10 \n\t"
|
||||
|
||||
"vaddsubps %%xmm4, %%xmm12, %%xmm12 \n\t"
|
||||
"vaddsubps %%xmm6, %%xmm13, %%xmm13 \n\t"
|
||||
"vaddsubps %%xmm8, %%xmm14, %%xmm14 \n\t"
|
||||
"vaddsubps %%xmm10,%%xmm15, %%xmm15 \n\t"
|
||||
|
||||
#else
|
||||
|
||||
"vmulps %%xmm0, %%xmm5, %%xmm4 \n\t" // a_r*x_r, a_r*x_i
|
||||
"vmulps %%xmm1, %%xmm5, %%xmm5 \n\t" // a_i*x_r, a_i*x_i
|
||||
"vmulps %%xmm0, %%xmm7, %%xmm6 \n\t" // a_r*x_r, a_r*x_i
|
||||
"vmulps %%xmm1, %%xmm7, %%xmm7 \n\t" // a_i*x_r, a_i*x_i
|
||||
"vmulps %%xmm0, %%xmm9, %%xmm8 \n\t" // a_r*x_r, a_r*x_i
|
||||
"vmulps %%xmm1, %%xmm9, %%xmm9 \n\t" // a_i*x_r, a_i*x_i
|
||||
"vmulps %%xmm0, %%xmm11, %%xmm10 \n\t" // a_r*x_r, a_r*x_i
|
||||
"vmulps %%xmm1, %%xmm11, %%xmm11 \n\t" // a_i*x_r, a_i*x_i
|
||||
|
||||
"vpermilps $0xb1 , %%xmm4 , %%xmm4 \n\t" // exchange real and imag part
|
||||
"vaddsubps %%xmm4 ,%%xmm5 , %%xmm4 \n\t"
|
||||
"vpermilps $0xb1 , %%xmm4 , %%xmm4 \n\t" // exchange real and imag part
|
||||
|
||||
"vpermilps $0xb1 , %%xmm6 , %%xmm6 \n\t" // exchange real and imag part
|
||||
"vaddsubps %%xmm6 ,%%xmm7 , %%xmm6 \n\t"
|
||||
"vpermilps $0xb1 , %%xmm6 , %%xmm6 \n\t" // exchange real and imag part
|
||||
|
||||
"vpermilps $0xb1 , %%xmm8 , %%xmm8 \n\t" // exchange real and imag part
|
||||
"vaddsubps %%xmm8 ,%%xmm9 , %%xmm8 \n\t"
|
||||
"vpermilps $0xb1 , %%xmm8 , %%xmm8 \n\t" // exchange real and imag part
|
||||
|
||||
"vpermilps $0xb1 , %%xmm10, %%xmm10 \n\t" // exchange real and imag part
|
||||
"vaddsubps %%xmm10,%%xmm11, %%xmm10 \n\t"
|
||||
"vpermilps $0xb1 , %%xmm10, %%xmm10 \n\t" // exchange real and imag part
|
||||
|
||||
"vaddps (%3,%0,4) ,%%xmm4 , %%xmm12 \n\t"
|
||||
"vaddps 16(%3,%0,4) ,%%xmm6 , %%xmm13 \n\t"
|
||||
"vaddps 32(%3,%0,4) ,%%xmm8 , %%xmm14 \n\t"
|
||||
"vaddps 48(%3,%0,4) ,%%xmm10, %%xmm15 \n\t"
|
||||
|
||||
|
||||
#endif
|
||||
|
||||
"vmovups %%xmm12, (%3,%0,4) \n\t"
|
||||
"vmovups %%xmm13, 16(%3,%0,4) \n\t"
|
||||
"vmovups %%xmm14, 32(%3,%0,4) \n\t"
|
||||
"vmovups %%xmm15, 48(%3,%0,4) \n\t"
|
||||
|
||||
"addq $16, %0 \n\t"
|
||||
"subq $8 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
:
"+r" (i),	// 0 (advanced by the loop, so read-write)
"+r" (n)	// 1 (decremented by the loop, so read-write)
:
"r" (x),	// 2
"r" (y),	// 3
"r" (alpha)	// 4
|
||||
: "cc",
|
||||
"%xmm0", "%xmm1",
|
||||
"%xmm4", "%xmm5", "%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
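The non-conjugated path of the microkernel combines a fused multiply-add, a real/imaginary swap (`vpermilps $0xb1`) and `vaddsubps`. The scalar sketch below emulates one 128-bit register, that is two interleaved complex floats, to show why that sequence yields `y + alpha * x`. The function name is hypothetical, and the emulation rounds each multiply and add separately, so results may differ from the fused hardware instruction in the last bit for general inputs.

```c
#include <assert.h>
#include <stddef.h>

/* Emulates one xmm register worth of work (2 complex floats) from the
   non-conjugated path: y = y + alpha * x on interleaved (re, im) data. */
void caxpy_lane_emulated(float ar, float ai, const float x[4], float y[4])
{
    float t[4], s[4];
    assert(x != NULL && y != NULL);

    /* vfmaddps: t = x * ar + y, element-wise (ar broadcast to all slots). */
    for (int k = 0; k < 4; k++)
        t[k] = x[k] * ar + y[k];

    /* vpermilps $0xb1 swaps re and im inside each pair; vmulps scales by ai. */
    s[0] = x[1] * ai;
    s[1] = x[0] * ai;
    s[2] = x[3] * ai;
    s[3] = x[2] * ai;

    /* vaddsubps: subtract in even slots (real), add in odd slots (imag). */
    y[0] = t[0] - s[0];
    y[1] = t[1] + s[1];
    y[2] = t[2] - s[2];
    y[3] = t[3] + s[3];
}
```

The even slots end up holding `y_r + ar*x_r - ai*x_i` and the odd slots `y_i + ar*x_i + ai*x_r`, which matches the scalar fallback in `caxpy.c`.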
|
||||
|
||||
|
||||
|
|
@@ -227,8 +227,8 @@ USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|||
VFMADDPS_I( %ymm7 ,%ymm3,%ymm1 )
|
||||
|
||||
|
||||
addq $6*SIZE, BO
|
||||
addq $16*SIZE, AO
|
||||
addq $ 6*SIZE, BO
|
||||
addq $ 16*SIZE, AO
|
||||
decq %rax
|
||||
.endm
|
||||
|
||||
|
|
@@ -356,8 +356,8 @@ USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|||
VFMADDPS_R( %ymm4 ,%ymm2,%ymm0 )
|
||||
VFMADDPS_I( %ymm5 ,%ymm3,%ymm0 )
|
||||
|
||||
addq $6*SIZE, BO
|
||||
addq $8*SIZE, AO
|
||||
addq $ 6*SIZE, BO
|
||||
addq $ 8*SIZE, AO
|
||||
decq %rax
|
||||
.endm
|
||||
|
||||
|
|
@@ -447,8 +447,8 @@ USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|||
VFMADDPS_R( %xmm4 ,%xmm2,%xmm0 )
|
||||
VFMADDPS_I( %xmm5 ,%xmm3,%xmm0 )
|
||||
|
||||
addq $6*SIZE, BO
|
||||
addq $4*SIZE, AO
|
||||
addq $ 6*SIZE, BO
|
||||
addq $ 4*SIZE, AO
|
||||
decq %rax
|
||||
|
||||
.endm
|
||||
|
|
@@ -540,8 +540,8 @@ USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|||
VFMADDPS_R( %xmm4 ,%xmm2,%xmm0 )
|
||||
VFMADDPS_I( %xmm5 ,%xmm3,%xmm0 )
|
||||
|
||||
addq $6*SIZE, BO
|
||||
addq $2*SIZE, AO
|
||||
addq $ 6*SIZE, BO
|
||||
addq $ 2*SIZE, AO
|
||||
decq %rax
|
||||
|
||||
.endm
|
||||
|
|
|
|||
|
|
@@ -1,255 +0,0 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#include <stdlib.h>
|
||||
#include <stdio.h>
|
||||
#include "common.h"
|
||||
|
||||
#if defined(HASWELL)
|
||||
#include "cgemv_n_microk_haswell-2.c"
|
||||
#endif
|
||||
|
||||
|
||||
#define NBMAX 2048
|
||||
|
||||
#ifndef HAVE_KERNEL_16x4
|
||||
|
||||
static void cgemv_kernel_16x4(BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
BLASLONG i;
|
||||
FLOAT *a0,*a1,*a2,*a3;
|
||||
a0 = ap[0];
|
||||
a1 = ap[1];
|
||||
a2 = ap[2];
|
||||
a3 = ap[3];
|
||||
|
||||
for ( i=0; i< 2*n; i+=2 )
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
y[i] += a0[i]*x[0] - a0[i+1] * x[1];
|
||||
y[i+1] += a0[i]*x[1] + a0[i+1] * x[0];
|
||||
y[i] += a1[i]*x[2] - a1[i+1] * x[3];
|
||||
y[i+1] += a1[i]*x[3] + a1[i+1] * x[2];
|
||||
y[i] += a2[i]*x[4] - a2[i+1] * x[5];
|
||||
y[i+1] += a2[i]*x[5] + a2[i+1] * x[4];
|
||||
y[i] += a3[i]*x[6] - a3[i+1] * x[7];
|
||||
y[i+1] += a3[i]*x[7] + a3[i+1] * x[6];
|
||||
#else
|
||||
y[i] += a0[i]*x[0] + a0[i+1] * x[1];
|
||||
y[i+1] += a0[i]*x[1] - a0[i+1] * x[0];
|
||||
y[i] += a1[i]*x[2] + a1[i+1] * x[3];
|
||||
y[i+1] += a1[i]*x[3] - a1[i+1] * x[2];
|
||||
y[i] += a2[i]*x[4] + a2[i+1] * x[5];
|
||||
y[i+1] += a2[i]*x[5] - a2[i+1] * x[4];
|
||||
y[i] += a3[i]*x[6] + a3[i+1] * x[7];
|
||||
y[i+1] += a3[i]*x[7] - a3[i+1] * x[6];
|
||||
#endif
|
||||
}
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
static void cgemv_kernel_16x1(BLASLONG n, FLOAT *ap, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
BLASLONG i;
|
||||
FLOAT *a0;
|
||||
a0 = ap;
|
||||
|
||||
for ( i=0; i< 2*n; i+=2 )
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
y[i] += a0[i]*x[0] - a0[i+1] * x[1];
|
||||
y[i+1] += a0[i]*x[1] + a0[i+1] * x[0];
|
||||
#else
|
||||
y[i] += a0[i]*x[0] + a0[i+1] * x[1];
|
||||
y[i+1] += a0[i]*x[1] - a0[i+1] * x[0];
|
||||
#endif
|
||||
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
static void zero_y(BLASLONG n, FLOAT *dest)
|
||||
{
|
||||
BLASLONG i;
|
||||
for ( i=0; i<2*n; i++ )
|
||||
{
|
||||
*dest = 0.0;
|
||||
dest++;
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
|
||||
static void add_y(BLASLONG n, FLOAT *src, FLOAT *dest, BLASLONG inc_dest,FLOAT alpha_r, FLOAT alpha_i)
|
||||
{
|
||||
BLASLONG i;
|
||||
FLOAT temp_r;
|
||||
FLOAT temp_i;
|
||||
for ( i=0; i<n; i++ )
|
||||
{
|
||||
#if !defined(XCONJ)
|
||||
temp_r = alpha_r * src[0] - alpha_i * src[1];
|
||||
temp_i = alpha_r * src[1] + alpha_i * src[0];
|
||||
#else
|
||||
temp_r = alpha_r * src[0] + alpha_i * src[1];
|
||||
temp_i = -alpha_r * src[1] + alpha_i * src[0];
|
||||
#endif
|
||||
|
||||
*dest += temp_r;
|
||||
*(dest+1) += temp_i;
|
||||
|
||||
src+=2;
|
||||
dest += inc_dest;
|
||||
}
|
||||
}
|
||||
|
||||
int CNAME(BLASLONG m, BLASLONG n, BLASLONG dummy1, FLOAT alpha_r,FLOAT alpha_i, FLOAT *a, BLASLONG lda, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y, FLOAT *buffer)
|
||||
{
|
||||
BLASLONG i;
|
||||
BLASLONG j;
|
||||
FLOAT *a_ptr;
|
||||
FLOAT *x_ptr;
|
||||
FLOAT *y_ptr;
|
||||
FLOAT *ap[4];
|
||||
BLASLONG n1;
|
||||
BLASLONG m1;
|
||||
BLASLONG m2;
|
||||
BLASLONG n2;
|
||||
FLOAT xbuffer[8],*ybuffer;
|
||||
|
||||
|
||||
#if 0
|
||||
printf("%s %d %d %.16f %.16f %d %d %d\n","zgemv_n",m,n,alpha_r,alpha_i,lda,inc_x,inc_y);
|
||||
#endif
|
||||
|
||||
if ( m < 1 ) return(0);
|
||||
if ( n < 1 ) return(0);
|
||||
|
||||
ybuffer = buffer;
|
||||
|
||||
inc_x *= 2;
|
||||
inc_y *= 2;
|
||||
lda *= 2;
|
||||
|
||||
n1 = n / 4 ;
|
||||
n2 = n % 4 ;
|
||||
|
||||
m1 = m - ( m % 16 );
|
||||
m2 = (m % NBMAX) - (m % 16) ;
|
||||
|
||||
y_ptr = y;
|
||||
|
||||
BLASLONG NB = NBMAX;
|
||||
|
||||
while ( NB == NBMAX )
|
||||
{
|
||||
|
||||
m1 -= NB;
|
||||
if ( m1 < 0)
|
||||
{
|
||||
if ( m2 == 0 ) break;
|
||||
NB = m2;
|
||||
}
|
||||
|
||||
a_ptr = a;
|
||||
x_ptr = x;
|
||||
zero_y(NB,ybuffer);
|
||||
for( i = 0; i < n1 ; i++)
|
||||
{
|
||||
|
||||
xbuffer[0] = x_ptr[0];
|
||||
xbuffer[1] = x_ptr[1];
|
||||
x_ptr += inc_x;
|
||||
xbuffer[2] = x_ptr[0];
|
||||
xbuffer[3] = x_ptr[1];
|
||||
x_ptr += inc_x;
|
||||
xbuffer[4] = x_ptr[0];
|
||||
xbuffer[5] = x_ptr[1];
|
||||
x_ptr += inc_x;
|
||||
xbuffer[6] = x_ptr[0];
|
||||
xbuffer[7] = x_ptr[1];
|
||||
x_ptr += inc_x;
|
||||
|
||||
ap[0] = a_ptr;
|
||||
ap[1] = a_ptr + lda;
|
||||
ap[2] = ap[1] + lda;
|
||||
ap[3] = ap[2] + lda;
|
||||
cgemv_kernel_16x4(NB,ap,xbuffer,ybuffer);
|
||||
a_ptr += 4 * lda;
|
||||
}
|
||||
|
||||
for( i = 0; i < n2 ; i++)
|
||||
{
|
||||
xbuffer[0] = x_ptr[0];
|
||||
xbuffer[1] = x_ptr[1];
|
||||
x_ptr += inc_x;
|
||||
cgemv_kernel_16x1(NB,a_ptr,xbuffer,ybuffer);
|
||||
a_ptr += 1 * lda;
|
||||
|
||||
}
|
||||
add_y(NB,ybuffer,y_ptr,inc_y,alpha_r,alpha_i);
|
||||
a += 2 * NB;
|
||||
y_ptr += NB * inc_y;
|
||||
}
|
||||
|
||||
j=0;
|
||||
while ( j < (m % 16))
|
||||
{
|
||||
a_ptr = a;
|
||||
x_ptr = x;
|
||||
FLOAT temp_r = 0.0;
|
||||
FLOAT temp_i = 0.0;
|
||||
for( i = 0; i < n; i++ )
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
temp_r += a_ptr[0] * x_ptr[0] - a_ptr[1] * x_ptr[1];
|
||||
temp_i += a_ptr[0] * x_ptr[1] + a_ptr[1] * x_ptr[0];
|
||||
#else
|
||||
temp_r += a_ptr[0] * x_ptr[0] + a_ptr[1] * x_ptr[1];
|
||||
temp_i += a_ptr[0] * x_ptr[1] - a_ptr[1] * x_ptr[0];
|
||||
#endif
|
||||
|
||||
a_ptr += lda;
|
||||
x_ptr += inc_x;
|
||||
}
|
||||
|
||||
#if !defined(XCONJ)
|
||||
y_ptr[0] += alpha_r * temp_r - alpha_i * temp_i;
|
||||
y_ptr[1] += alpha_r * temp_i + alpha_i * temp_r;
|
||||
#else
|
||||
y_ptr[0] += alpha_r * temp_r + alpha_i * temp_i;
|
||||
y_ptr[1] -= alpha_r * temp_i - alpha_i * temp_r;
|
||||
#endif
|
||||
y_ptr += inc_y;
|
||||
a+=2;
|
||||
j++;
|
||||
}
|
||||
return(0);
|
||||
}
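The outer `while ( NB == NBMAX )` loop in `CNAME` splits the rows into full blocks of `NBMAX` plus at most one shorter block, and leaves the last `m % 16` rows (or `m % 4` in the replacement file) to a scalar tail. The sketch below reproduces only that control flow and records the block sizes it would visit. The function name, the `out`/`cap` parameters and the `unroll` argument are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Records the row-block sizes chosen by the NB/m1/m2 loop and returns how
   many blocks there are. Rows beyond the blocks (m % unroll) are the tail. */
size_t gemv_row_blocks(long m, long nbmax, long unroll, long *out, size_t cap)
{
    assert(m >= 0 && nbmax > 0 && unroll > 0 && nbmax % unroll == 0);
    long m1 = m - (m % unroll);            /* rows handled by the kernels */
    long m2 = (m % nbmax) - (m % unroll);  /* size of the last short block */
    long nb = nbmax;
    size_t count = 0;

    while (nb == nbmax) {
        m1 -= nb;
        if (m1 < 0) {
            if (m2 == 0)
                break;                     /* no short block left to do */
            nb = m2;                       /* final, shorter block */
        }
        assert(count < cap);
        out[count++] = nb;
    }
    return count;
}
```

For `m = 5000`, `nbmax = 2048` and `unroll = 4` this visits blocks of 2048, 2048 and 904 rows; for `m = 3` it visits none and all rows go to the tail code.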
|
||||
|
||||
|
||||
|
|
@@ -0,0 +1,623 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#include <stdlib.h>
|
||||
#include <stdio.h>
#include <string.h>
|
||||
#include "common.h"
|
||||
|
||||
#if defined(HASWELL)
|
||||
#include "cgemv_n_microk_haswell-4.c"
|
||||
#endif
|
||||
|
||||
|
||||
#define NBMAX 2048
|
||||
|
||||
#ifndef HAVE_KERNEL_4x4
|
||||
|
||||
static void cgemv_kernel_4x4(BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
BLASLONG i;
|
||||
FLOAT *a0,*a1,*a2,*a3;
|
||||
a0 = ap[0];
|
||||
a1 = ap[1];
|
||||
a2 = ap[2];
|
||||
a3 = ap[3];
|
||||
|
||||
for ( i=0; i< 2*n; i+=2 )
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
y[i] += a0[i]*x[0] - a0[i+1] * x[1];
|
||||
y[i+1] += a0[i]*x[1] + a0[i+1] * x[0];
|
||||
y[i] += a1[i]*x[2] - a1[i+1] * x[3];
|
||||
y[i+1] += a1[i]*x[3] + a1[i+1] * x[2];
|
||||
y[i] += a2[i]*x[4] - a2[i+1] * x[5];
|
||||
y[i+1] += a2[i]*x[5] + a2[i+1] * x[4];
|
||||
y[i] += a3[i]*x[6] - a3[i+1] * x[7];
|
||||
y[i+1] += a3[i]*x[7] + a3[i+1] * x[6];
|
||||
#else
|
||||
y[i] += a0[i]*x[0] + a0[i+1] * x[1];
|
||||
y[i+1] += a0[i]*x[1] - a0[i+1] * x[0];
|
||||
y[i] += a1[i]*x[2] + a1[i+1] * x[3];
|
||||
y[i+1] += a1[i]*x[3] - a1[i+1] * x[2];
|
||||
y[i] += a2[i]*x[4] + a2[i+1] * x[5];
|
||||
y[i+1] += a2[i]*x[5] - a2[i+1] * x[4];
|
||||
y[i] += a3[i]*x[6] + a3[i+1] * x[7];
|
||||
y[i+1] += a3[i]*x[7] - a3[i+1] * x[6];
|
||||
#endif
|
||||
}
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
|
||||
|
||||
#ifndef HAVE_KERNEL_4x2
|
||||
|
||||
static void cgemv_kernel_4x2(BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
BLASLONG i;
|
||||
FLOAT *a0,*a1;
|
||||
a0 = ap[0];
|
||||
a1 = ap[1];
|
||||
|
||||
for ( i=0; i< 2*n; i+=2 )
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
y[i] += a0[i]*x[0] - a0[i+1] * x[1];
|
||||
y[i+1] += a0[i]*x[1] + a0[i+1] * x[0];
|
||||
y[i] += a1[i]*x[2] - a1[i+1] * x[3];
|
||||
y[i+1] += a1[i]*x[3] + a1[i+1] * x[2];
|
||||
#else
|
||||
y[i] += a0[i]*x[0] + a0[i+1] * x[1];
|
||||
y[i+1] += a0[i]*x[1] - a0[i+1] * x[0];
|
||||
y[i] += a1[i]*x[2] + a1[i+1] * x[3];
|
||||
y[i+1] += a1[i]*x[3] - a1[i+1] * x[2];
|
||||
#endif
|
||||
}
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
|
||||
|
||||
|
||||
#ifndef HAVE_KERNEL_4x1
|
||||
|
||||
|
||||
static void cgemv_kernel_4x1(BLASLONG n, FLOAT *ap, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
BLASLONG i;
|
||||
FLOAT *a0;
|
||||
a0 = ap;
|
||||
|
||||
for ( i=0; i< 2*n; i+=2 )
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
y[i] += a0[i]*x[0] - a0[i+1] * x[1];
|
||||
y[i+1] += a0[i]*x[1] + a0[i+1] * x[0];
|
||||
#else
|
||||
y[i] += a0[i]*x[0] + a0[i+1] * x[1];
|
||||
y[i+1] += a0[i]*x[1] - a0[i+1] * x[0];
|
||||
#endif
|
||||
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
#endif
|
||||
|
||||
|
||||
#ifndef HAVE_KERNEL_ADDY
|
||||
|
||||
static void add_y(BLASLONG n, FLOAT *src, FLOAT *dest, BLASLONG inc_dest,FLOAT alpha_r, FLOAT alpha_i) __attribute__ ((noinline));
|
||||
|
||||
static void add_y(BLASLONG n, FLOAT *src, FLOAT *dest, BLASLONG inc_dest,FLOAT alpha_r, FLOAT alpha_i)
|
||||
{
|
||||
BLASLONG i;
|
||||
|
||||
if ( inc_dest != 2 )
|
||||
{
|
||||
|
||||
FLOAT temp_r;
|
||||
FLOAT temp_i;
|
||||
for ( i=0; i<n; i++ )
|
||||
{
|
||||
#if !defined(XCONJ)
|
||||
temp_r = alpha_r * src[0] - alpha_i * src[1];
|
||||
temp_i = alpha_r * src[1] + alpha_i * src[0];
|
||||
#else
|
||||
temp_r = alpha_r * src[0] + alpha_i * src[1];
|
||||
temp_i = -alpha_r * src[1] + alpha_i * src[0];
|
||||
#endif
|
||||
|
||||
*dest += temp_r;
|
||||
*(dest+1) += temp_i;
|
||||
|
||||
src+=2;
|
||||
dest += inc_dest;
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
FLOAT temp_r0;
|
||||
FLOAT temp_i0;
|
||||
FLOAT temp_r1;
|
||||
FLOAT temp_i1;
|
||||
FLOAT temp_r2;
|
||||
FLOAT temp_i2;
|
||||
FLOAT temp_r3;
|
||||
FLOAT temp_i3;
|
||||
for ( i=0; i<n; i+=4 )
|
||||
{
|
||||
#if !defined(XCONJ)
|
||||
temp_r0 = alpha_r * src[0] - alpha_i * src[1];
|
||||
temp_i0 = alpha_r * src[1] + alpha_i * src[0];
|
||||
temp_r1 = alpha_r * src[2] - alpha_i * src[3];
|
||||
temp_i1 = alpha_r * src[3] + alpha_i * src[2];
|
||||
temp_r2 = alpha_r * src[4] - alpha_i * src[5];
|
||||
temp_i2 = alpha_r * src[5] + alpha_i * src[4];
|
||||
temp_r3 = alpha_r * src[6] - alpha_i * src[7];
|
||||
temp_i3 = alpha_r * src[7] + alpha_i * src[6];
|
||||
#else
|
||||
temp_r0 = alpha_r * src[0] + alpha_i * src[1];
|
||||
temp_i0 = -alpha_r * src[1] + alpha_i * src[0];
|
||||
temp_r1 = alpha_r * src[2] + alpha_i * src[3];
|
||||
temp_i1 = -alpha_r * src[3] + alpha_i * src[2];
|
||||
temp_r2 = alpha_r * src[4] + alpha_i * src[5];
|
||||
temp_i2 = -alpha_r * src[5] + alpha_i * src[4];
|
||||
temp_r3 = alpha_r * src[6] + alpha_i * src[7];
|
||||
temp_i3 = -alpha_r * src[7] + alpha_i * src[6];
|
||||
#endif
|
||||
|
||||
dest[0] += temp_r0;
|
||||
dest[1] += temp_i0;
|
||||
dest[2] += temp_r1;
|
||||
dest[3] += temp_i1;
|
||||
dest[4] += temp_r2;
|
||||
dest[5] += temp_i2;
|
||||
dest[6] += temp_r3;
|
||||
dest[7] += temp_i3;
|
||||
|
||||
src += 8;
|
||||
dest += 8;
|
||||
}
|
||||
return;
|
||||
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
int CNAME(BLASLONG m, BLASLONG n, BLASLONG dummy1, FLOAT alpha_r,FLOAT alpha_i, FLOAT *a, BLASLONG lda, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y, FLOAT *buffer)
|
||||
{
|
||||
BLASLONG i;
|
||||
BLASLONG j;
|
||||
FLOAT *a_ptr;
|
||||
FLOAT *x_ptr;
|
||||
FLOAT *y_ptr;
|
||||
FLOAT *ap[4];
|
||||
BLASLONG n1;
|
||||
BLASLONG m1;
|
||||
BLASLONG m2;
|
||||
BLASLONG m3;
|
||||
BLASLONG n2;
|
||||
BLASLONG lda4;
|
||||
FLOAT xbuffer[8],*ybuffer;
|
||||
|
||||
|
||||
#if 0
|
||||
printf("%s %d %d %.16f %.16f %d %d %d\n","cgemv_n",m,n,alpha_r,alpha_i,lda,inc_x,inc_y);
|
||||
#endif
|
||||
|
||||
if ( m < 1 ) return(0);
|
||||
if ( n < 1 ) return(0);
|
||||
|
||||
ybuffer = buffer;
|
||||
|
||||
inc_x *= 2;
|
||||
inc_y *= 2;
|
||||
lda *= 2;
|
||||
lda4 = 4 * lda;
|
||||
|
||||
n1 = n / 4 ;
|
||||
n2 = n % 4 ;
|
||||
|
||||
m3 = m % 4;
|
||||
m1 = m - ( m % 4 );
|
||||
m2 = (m % NBMAX) - (m % 4) ;
|
||||
|
||||
y_ptr = y;
|
||||
|
||||
BLASLONG NB = NBMAX;
|
||||
|
||||
while ( NB == NBMAX )
|
||||
{
|
||||
|
||||
m1 -= NB;
|
||||
if ( m1 < 0)
|
||||
{
|
||||
if ( m2 == 0 ) break;
|
||||
NB = m2;
|
||||
}
|
||||
|
||||
a_ptr = a;
|
||||
ap[0] = a_ptr;
|
||||
ap[1] = a_ptr + lda;
|
||||
ap[2] = ap[1] + lda;
|
||||
ap[3] = ap[2] + lda;
|
||||
x_ptr = x;
|
||||
//zero_y(NB,ybuffer);
|
||||
memset(ybuffer, 0, NB * 2 * sizeof(FLOAT));	/* NB complex values, 2 FLOATs each */
|
||||
|
||||
if ( inc_x == 2 )
|
||||
{
|
||||
|
||||
for( i = 0; i < n1 ; i++)
|
||||
{
|
||||
cgemv_kernel_4x4(NB,ap,x_ptr,ybuffer);
|
||||
ap[0] += lda4;
|
||||
ap[1] += lda4;
|
||||
ap[2] += lda4;
|
||||
ap[3] += lda4;
|
||||
a_ptr += lda4;
|
||||
x_ptr += 8;
|
||||
}
|
||||
|
||||
if ( n2 & 2 )
|
||||
{
|
||||
cgemv_kernel_4x2(NB,ap,x_ptr,ybuffer);
|
||||
x_ptr += 4;
|
||||
a_ptr += 2 * lda;
|
||||
|
||||
}
|
||||
|
||||
if ( n2 & 1 )
|
||||
{
|
||||
cgemv_kernel_4x1(NB,a_ptr,x_ptr,ybuffer);
|
||||
x_ptr += 2;
|
||||
a_ptr += lda;
|
||||
|
||||
}
|
||||
}
|
||||
else
|
||||
{
|
||||
|
||||
for( i = 0; i < n1 ; i++)
|
||||
{
|
||||
|
||||
xbuffer[0] = x_ptr[0];
|
||||
xbuffer[1] = x_ptr[1];
|
||||
x_ptr += inc_x;
|
||||
xbuffer[2] = x_ptr[0];
|
||||
xbuffer[3] = x_ptr[1];
|
||||
x_ptr += inc_x;
|
||||
xbuffer[4] = x_ptr[0];
|
||||
xbuffer[5] = x_ptr[1];
|
||||
x_ptr += inc_x;
|
||||
xbuffer[6] = x_ptr[0];
|
||||
xbuffer[7] = x_ptr[1];
|
||||
x_ptr += inc_x;
|
||||
|
||||
cgemv_kernel_4x4(NB,ap,xbuffer,ybuffer);
|
||||
ap[0] += lda4;
|
||||
ap[1] += lda4;
|
||||
ap[2] += lda4;
|
||||
ap[3] += lda4;
|
||||
a_ptr += lda4;
|
||||
}
|
||||
|
||||
for( i = 0; i < n2 ; i++)
|
||||
{
|
||||
xbuffer[0] = x_ptr[0];
|
||||
xbuffer[1] = x_ptr[1];
|
||||
x_ptr += inc_x;
|
||||
cgemv_kernel_4x1(NB,a_ptr,xbuffer,ybuffer);
|
||||
a_ptr += 1 * lda;
|
||||
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
add_y(NB,ybuffer,y_ptr,inc_y,alpha_r,alpha_i);
|
||||
a += 2 * NB;
|
||||
y_ptr += NB * inc_y;
|
||||
}
|
||||
|
||||
if ( m3 == 0 ) return(0);
|
||||
|
||||
if ( m3 == 1 )
|
||||
{
|
||||
a_ptr = a;
|
||||
x_ptr = x;
|
||||
FLOAT temp_r = 0.0;
|
||||
FLOAT temp_i = 0.0;
|
||||
|
||||
if ( lda == 2 && inc_x == 2 )
|
||||
{
|
||||
|
||||
|
||||
for( i=0 ; i < (n & -2); i+=2 )
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
temp_r += a_ptr[0] * x_ptr[0] - a_ptr[1] * x_ptr[1];
|
||||
temp_i += a_ptr[0] * x_ptr[1] + a_ptr[1] * x_ptr[0];
|
||||
temp_r += a_ptr[2] * x_ptr[2] - a_ptr[3] * x_ptr[3];
|
||||
temp_i += a_ptr[2] * x_ptr[3] + a_ptr[3] * x_ptr[2];
|
||||
#else
|
||||
temp_r += a_ptr[0] * x_ptr[0] + a_ptr[1] * x_ptr[1];
|
||||
temp_i += a_ptr[0] * x_ptr[1] - a_ptr[1] * x_ptr[0];
|
||||
temp_r += a_ptr[2] * x_ptr[2] + a_ptr[3] * x_ptr[3];
|
||||
temp_i += a_ptr[2] * x_ptr[3] - a_ptr[3] * x_ptr[2];
|
||||
#endif
|
||||
|
||||
a_ptr += 4;
|
||||
x_ptr += 4;
|
||||
}
|
||||
|
||||
|
||||
|
||||
for( ; i < n; i++ )
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
temp_r += a_ptr[0] * x_ptr[0] - a_ptr[1] * x_ptr[1];
|
||||
temp_i += a_ptr[0] * x_ptr[1] + a_ptr[1] * x_ptr[0];
|
||||
#else
|
||||
temp_r += a_ptr[0] * x_ptr[0] + a_ptr[1] * x_ptr[1];
|
||||
temp_i += a_ptr[0] * x_ptr[1] - a_ptr[1] * x_ptr[0];
|
||||
#endif
|
||||
|
||||
a_ptr += 2;
|
||||
x_ptr += 2;
|
||||
}
|
||||
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
|
||||
for( i = 0; i < n; i++ )
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
temp_r += a_ptr[0] * x_ptr[0] - a_ptr[1] * x_ptr[1];
|
||||
temp_i += a_ptr[0] * x_ptr[1] + a_ptr[1] * x_ptr[0];
|
||||
#else
|
||||
temp_r += a_ptr[0] * x_ptr[0] + a_ptr[1] * x_ptr[1];
|
||||
temp_i += a_ptr[0] * x_ptr[1] - a_ptr[1] * x_ptr[0];
|
||||
#endif
|
||||
|
||||
a_ptr += lda;
|
||||
x_ptr += inc_x;
|
||||
}
|
||||
|
||||
}
|
||||
#if !defined(XCONJ)
|
||||
y_ptr[0] += alpha_r * temp_r - alpha_i * temp_i;
|
||||
y_ptr[1] += alpha_r * temp_i + alpha_i * temp_r;
|
||||
#else
|
||||
y_ptr[0] += alpha_r * temp_r + alpha_i * temp_i;
|
||||
y_ptr[1] -= alpha_r * temp_i - alpha_i * temp_r;
|
||||
#endif
|
||||
return(0);
|
||||
}
|
||||
|
||||
if ( m3 == 2 )
|
||||
{
|
||||
a_ptr = a;
|
||||
x_ptr = x;
|
||||
FLOAT temp_r0 = 0.0;
|
||||
FLOAT temp_i0 = 0.0;
|
||||
FLOAT temp_r1 = 0.0;
|
||||
FLOAT temp_i1 = 0.0;
|
||||
|
||||
if ( lda == 4 && inc_x == 2 )
|
||||
{
|
||||
|
||||
for( i = 0; i < (n & -2); i+=2 )
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
|
||||
temp_r0 += a_ptr[0] * x_ptr[0] - a_ptr[1] * x_ptr[1];
|
||||
temp_i0 += a_ptr[0] * x_ptr[1] + a_ptr[1] * x_ptr[0];
|
||||
temp_r1 += a_ptr[2] * x_ptr[0] - a_ptr[3] * x_ptr[1];
|
||||
temp_i1 += a_ptr[2] * x_ptr[1] + a_ptr[3] * x_ptr[0];
|
||||
|
||||
temp_r0 += a_ptr[4] * x_ptr[2] - a_ptr[5] * x_ptr[3];
|
||||
temp_i0 += a_ptr[4] * x_ptr[3] + a_ptr[5] * x_ptr[2];
|
||||
temp_r1 += a_ptr[6] * x_ptr[2] - a_ptr[7] * x_ptr[3];
|
||||
temp_i1 += a_ptr[6] * x_ptr[3] + a_ptr[7] * x_ptr[2];
|
||||
|
||||
#else
|
||||
temp_r0 += a_ptr[0] * x_ptr[0] + a_ptr[1] * x_ptr[1];
|
||||
temp_i0 += a_ptr[0] * x_ptr[1] - a_ptr[1] * x_ptr[0];
|
||||
temp_r1 += a_ptr[2] * x_ptr[0] + a_ptr[3] * x_ptr[1];
|
||||
temp_i1 += a_ptr[2] * x_ptr[1] - a_ptr[3] * x_ptr[0];
|
||||
|
||||
temp_r0 += a_ptr[4] * x_ptr[2] + a_ptr[5] * x_ptr[3];
|
||||
temp_i0 += a_ptr[4] * x_ptr[3] - a_ptr[5] * x_ptr[2];
|
||||
temp_r1 += a_ptr[6] * x_ptr[2] + a_ptr[7] * x_ptr[3];
|
||||
temp_i1 += a_ptr[6] * x_ptr[3] - a_ptr[7] * x_ptr[2];
|
||||
|
||||
#endif
|
||||
|
||||
a_ptr += 8;
|
||||
x_ptr += 4;
|
||||
}
|
||||
|
||||
|
||||
for( ; i < n; i++ )
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
temp_r0 += a_ptr[0] * x_ptr[0] - a_ptr[1] * x_ptr[1];
|
||||
temp_i0 += a_ptr[0] * x_ptr[1] + a_ptr[1] * x_ptr[0];
|
||||
temp_r1 += a_ptr[2] * x_ptr[0] - a_ptr[3] * x_ptr[1];
|
||||
temp_i1 += a_ptr[2] * x_ptr[1] + a_ptr[3] * x_ptr[0];
|
||||
#else
|
||||
temp_r0 += a_ptr[0] * x_ptr[0] + a_ptr[1] * x_ptr[1];
|
||||
temp_i0 += a_ptr[0] * x_ptr[1] - a_ptr[1] * x_ptr[0];
|
||||
temp_r1 += a_ptr[2] * x_ptr[0] + a_ptr[3] * x_ptr[1];
|
||||
temp_i1 += a_ptr[2] * x_ptr[1] - a_ptr[3] * x_ptr[0];
|
||||
#endif
|
||||
|
||||
a_ptr += 4;
|
||||
x_ptr += 2;
|
||||
}
|
||||
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
|
||||
for( i=0 ; i < n; i++ )
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
temp_r0 += a_ptr[0] * x_ptr[0] - a_ptr[1] * x_ptr[1];
|
||||
temp_i0 += a_ptr[0] * x_ptr[1] + a_ptr[1] * x_ptr[0];
|
||||
temp_r1 += a_ptr[2] * x_ptr[0] - a_ptr[3] * x_ptr[1];
|
||||
temp_i1 += a_ptr[2] * x_ptr[1] + a_ptr[3] * x_ptr[0];
|
||||
#else
|
||||
temp_r0 += a_ptr[0] * x_ptr[0] + a_ptr[1] * x_ptr[1];
|
||||
temp_i0 += a_ptr[0] * x_ptr[1] - a_ptr[1] * x_ptr[0];
|
||||
temp_r1 += a_ptr[2] * x_ptr[0] + a_ptr[3] * x_ptr[1];
|
||||
temp_i1 += a_ptr[2] * x_ptr[1] - a_ptr[3] * x_ptr[0];
|
||||
#endif
|
||||
|
||||
a_ptr += lda;
|
||||
x_ptr += inc_x;
|
||||
}
|
||||
|
||||
|
||||
}
|
||||
#if !defined(XCONJ)
|
||||
y_ptr[0] += alpha_r * temp_r0 - alpha_i * temp_i0;
|
||||
y_ptr[1] += alpha_r * temp_i0 + alpha_i * temp_r0;
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += alpha_r * temp_r1 - alpha_i * temp_i1;
|
||||
y_ptr[1] += alpha_r * temp_i1 + alpha_i * temp_r1;
|
||||
#else
|
||||
y_ptr[0] += alpha_r * temp_r0 + alpha_i * temp_i0;
|
||||
y_ptr[1] -= alpha_r * temp_i0 - alpha_i * temp_r0;
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += alpha_r * temp_r1 + alpha_i * temp_i1;
|
||||
y_ptr[1] -= alpha_r * temp_i1 - alpha_i * temp_r1;
|
||||
#endif
|
||||
return(0);
|
||||
}
|
||||
|
||||
|
||||
if ( m3 == 3 )
|
||||
{
|
||||
a_ptr = a;
|
||||
x_ptr = x;
|
||||
FLOAT temp_r0 = 0.0;
|
||||
FLOAT temp_i0 = 0.0;
|
||||
FLOAT temp_r1 = 0.0;
|
||||
FLOAT temp_i1 = 0.0;
|
||||
FLOAT temp_r2 = 0.0;
|
||||
FLOAT temp_i2 = 0.0;
|
||||
|
||||
if ( lda == 6 && inc_x == 2 )
|
||||
{
|
||||
|
||||
for( i=0 ; i < n; i++ )
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
temp_r0 += a_ptr[0] * x_ptr[0] - a_ptr[1] * x_ptr[1];
|
||||
temp_i0 += a_ptr[0] * x_ptr[1] + a_ptr[1] * x_ptr[0];
|
||||
temp_r1 += a_ptr[2] * x_ptr[0] - a_ptr[3] * x_ptr[1];
|
||||
temp_i1 += a_ptr[2] * x_ptr[1] + a_ptr[3] * x_ptr[0];
|
||||
temp_r2 += a_ptr[4] * x_ptr[0] - a_ptr[5] * x_ptr[1];
|
||||
temp_i2 += a_ptr[4] * x_ptr[1] + a_ptr[5] * x_ptr[0];
|
||||
#else
|
||||
temp_r0 += a_ptr[0] * x_ptr[0] + a_ptr[1] * x_ptr[1];
|
||||
temp_i0 += a_ptr[0] * x_ptr[1] - a_ptr[1] * x_ptr[0];
|
||||
temp_r1 += a_ptr[2] * x_ptr[0] + a_ptr[3] * x_ptr[1];
|
||||
temp_i1 += a_ptr[2] * x_ptr[1] - a_ptr[3] * x_ptr[0];
|
||||
temp_r2 += a_ptr[4] * x_ptr[0] + a_ptr[5] * x_ptr[1];
|
||||
temp_i2 += a_ptr[4] * x_ptr[1] - a_ptr[5] * x_ptr[0];
|
||||
#endif
|
||||
|
||||
a_ptr += 6;
|
||||
x_ptr += 2;
|
||||
}
|
||||
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
|
||||
for( i = 0; i < n; i++ )
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
temp_r0 += a_ptr[0] * x_ptr[0] - a_ptr[1] * x_ptr[1];
|
||||
temp_i0 += a_ptr[0] * x_ptr[1] + a_ptr[1] * x_ptr[0];
|
||||
temp_r1 += a_ptr[2] * x_ptr[0] - a_ptr[3] * x_ptr[1];
|
||||
temp_i1 += a_ptr[2] * x_ptr[1] + a_ptr[3] * x_ptr[0];
|
||||
temp_r2 += a_ptr[4] * x_ptr[0] - a_ptr[5] * x_ptr[1];
|
||||
temp_i2 += a_ptr[4] * x_ptr[1] + a_ptr[5] * x_ptr[0];
|
||||
#else
|
||||
temp_r0 += a_ptr[0] * x_ptr[0] + a_ptr[1] * x_ptr[1];
|
||||
temp_i0 += a_ptr[0] * x_ptr[1] - a_ptr[1] * x_ptr[0];
|
||||
temp_r1 += a_ptr[2] * x_ptr[0] + a_ptr[3] * x_ptr[1];
|
||||
temp_i1 += a_ptr[2] * x_ptr[1] - a_ptr[3] * x_ptr[0];
|
||||
temp_r2 += a_ptr[4] * x_ptr[0] + a_ptr[5] * x_ptr[1];
|
||||
temp_i2 += a_ptr[4] * x_ptr[1] - a_ptr[5] * x_ptr[0];
|
||||
#endif
|
||||
|
||||
a_ptr += lda;
|
||||
x_ptr += inc_x;
|
||||
}
|
||||
|
||||
}
|
||||
#if !defined(XCONJ)
|
||||
y_ptr[0] += alpha_r * temp_r0 - alpha_i * temp_i0;
|
||||
y_ptr[1] += alpha_r * temp_i0 + alpha_i * temp_r0;
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += alpha_r * temp_r1 - alpha_i * temp_i1;
|
||||
y_ptr[1] += alpha_r * temp_i1 + alpha_i * temp_r1;
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += alpha_r * temp_r2 - alpha_i * temp_i2;
|
||||
y_ptr[1] += alpha_r * temp_i2 + alpha_i * temp_r2;
|
||||
#else
|
||||
y_ptr[0] += alpha_r * temp_r0 + alpha_i * temp_i0;
|
||||
y_ptr[1] -= alpha_r * temp_i0 - alpha_i * temp_r0;
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += alpha_r * temp_r1 + alpha_i * temp_i1;
|
||||
y_ptr[1] -= alpha_r * temp_i1 - alpha_i * temp_r1;
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += alpha_r * temp_r2 + alpha_i * temp_i2;
|
||||
y_ptr[1] -= alpha_r * temp_i2 - alpha_i * temp_r2;
|
||||
#endif
|
||||
return(0);
|
||||
}
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
return(0);
|
||||
}
|
||||
|
||||
|
||||
|
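The scalar tail above (the `m3 == 1`, `m3 == 2` and `m3 == 3` branches) repeats the same sign pattern for every leftover row. As a reading aid, here is a minimal sketch of one such row update. It assumes the compile-time `CONJ` and `XCONJ` macros are replaced by runtime flags; `tail_row_update` and its parameter names are hypothetical and are not part of OpenBLAS.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenBLAS function): accumulates one row of A
 * times x and adds alpha times the result to y[0..1].
 * lda and inc_x are strides counted in floats, as in the diff above.
 * conj / xconj stand in for the CONJ / XCONJ preprocessor macros. */
static void tail_row_update(size_t n, const float *a, size_t lda,
                            const float *x, size_t inc_x,
                            float alpha_r, float alpha_i,
                            int conj, int xconj, float *y)
{
    float temp_r = 0.0f;
    float temp_i = 0.0f;

    assert(a != NULL && x != NULL && y != NULL);

    for (size_t j = 0; j < n; j++) {
        if ((!conj) == (!xconj)) {
            /* neither or both flags set: plain complex product a * x */
            temp_r += a[0] * x[0] - a[1] * x[1];
            temp_i += a[0] * x[1] + a[1] * x[0];
        } else {
            /* exactly one flag set: conj(a) * x */
            temp_r += a[0] * x[0] + a[1] * x[1];
            temp_i += a[0] * x[1] - a[1] * x[0];
        }
        a += lda;
        x += inc_x;
    }

    if (!xconj) {
        /* y += alpha * temp */
        y[0] += alpha_r * temp_r - alpha_i * temp_i;
        y[1] += alpha_r * temp_i + alpha_i * temp_r;
    } else {
        /* same update as the XCONJ branch in the diff */
        y[0] += alpha_r * temp_r + alpha_i * temp_i;
        y[1] -= alpha_r * temp_i - alpha_i * temp_r;
    }
}
```

With `a = {1,2,3,4}`, `x = {5,6,7,8}`, unit alpha and both flags clear, the row update adds (1+2i)(5+6i) + (3+4i)(7+8i) = -18+68i to `y`.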
|
@ -1,137 +0,0 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#define HAVE_KERNEL_16x4 1
|
||||
static void cgemv_kernel_16x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y) __attribute__ ((noinline));
|
||||
|
||||
static void cgemv_kernel_16x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"vzeroupper \n\t"
|
||||
|
||||
"vbroadcastss (%2), %%ymm0 \n\t" // real part x0
|
||||
"vbroadcastss 4(%2), %%ymm1 \n\t" // imag part x0
|
||||
"vbroadcastss 8(%2), %%ymm2 \n\t" // real part x1
|
||||
"vbroadcastss 12(%2), %%ymm3 \n\t" // imag part x1
|
||||
"vbroadcastss 16(%2), %%ymm4 \n\t" // real part x2
|
||||
"vbroadcastss 20(%2), %%ymm5 \n\t" // imag part x2
|
||||
"vbroadcastss 24(%2), %%ymm6 \n\t" // real part x3
|
||||
"vbroadcastss 28(%2), %%ymm7 \n\t" // imag part x3
|
||||
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"prefetcht0 320(%4,%0,4) \n\t"
|
||||
"vmovups (%4,%0,4), %%ymm8 \n\t" // 4 complex values from a0
|
||||
"vmovups 32(%4,%0,4), %%ymm9 \n\t" // 4 complex values from a0
|
||||
|
||||
"prefetcht0 320(%5,%0,4) \n\t"
|
||||
"vmovups (%5,%0,4), %%ymm10 \n\t" // 4 complex values from a1
|
||||
"vmovups 32(%5,%0,4), %%ymm11 \n\t" // 4 complex values from a1
|
||||
|
||||
"vmulps %%ymm8 , %%ymm0, %%ymm12 \n\t" // a_r[0] * x_r , a_i[0] * x_r, a_r[1] * x_r, a_i[1] * x_r
|
||||
"vmulps %%ymm8 , %%ymm1, %%ymm13 \n\t" // a_r[0] * x_i , a_i[0] * x_i, a_r[1] * x_i, a_i[1] * x_i
|
||||
"vmulps %%ymm9 , %%ymm0, %%ymm14 \n\t" // a_r[2] * x_r , a_i[2] * x_r, a_r[3] * x_r, a_i[3] * x_r
|
||||
"vmulps %%ymm9 , %%ymm1, %%ymm15 \n\t" // a_r[2] * x_i , a_i[2] * x_i, a_r[3] * x_i, a_i[3] * x_i
|
||||
|
||||
"prefetcht0 320(%6,%0,4) \n\t"
|
||||
"vmovups (%6,%0,4), %%ymm8 \n\t" // 4 complex values from a2
|
||||
"vmovups 32(%6,%0,4), %%ymm9 \n\t" // 4 complex values from a2
|
||||
|
||||
"vfmadd231ps %%ymm10, %%ymm2, %%ymm12 \n\t" // a_r[0] * x_r , a_i[0] * x_r, a_r[1] * x_r, a_i[1] * x_r
|
||||
"vfmadd231ps %%ymm10, %%ymm3, %%ymm13 \n\t" // a_r[0] * x_i , a_i[0] * x_i, a_r[1] * x_i, a_i[1] * x_i
|
||||
"vfmadd231ps %%ymm11, %%ymm2, %%ymm14 \n\t" // a_r[2] * x_r , a_i[2] * x_r, a_r[3] * x_r, a_i[3] * x_r
|
||||
"vfmadd231ps %%ymm11, %%ymm3, %%ymm15 \n\t" // a_r[2] * x_i , a_i[2] * x_i, a_r[3] * x_i, a_i[3] * x_i
|
||||
|
||||
"prefetcht0 320(%7,%0,4) \n\t"
|
||||
"vmovups (%7,%0,4), %%ymm10 \n\t" // 4 complex values from a3
|
||||
"vmovups 32(%7,%0,4), %%ymm11 \n\t" // 4 complex values from a3
|
||||
|
||||
"vfmadd231ps %%ymm8 , %%ymm4, %%ymm12 \n\t" // a_r[0] * x_r , a_i[0] * x_r, a_r[1] * x_r, a_i[1] * x_r
|
||||
"vfmadd231ps %%ymm8 , %%ymm5, %%ymm13 \n\t" // a_r[0] * x_i , a_i[0] * x_i, a_r[1] * x_i, a_i[1] * x_i
|
||||
"vfmadd231ps %%ymm9 , %%ymm4, %%ymm14 \n\t" // a_r[2] * x_r , a_i[2] * x_r, a_r[3] * x_r, a_i[3] * x_r
|
||||
"vfmadd231ps %%ymm9 , %%ymm5, %%ymm15 \n\t" // a_r[2] * x_i , a_i[2] * x_i, a_r[3] * x_i, a_i[3] * x_i
|
||||
|
||||
"vfmadd231ps %%ymm10, %%ymm6, %%ymm12 \n\t" // a_r[0] * x_r , a_i[0] * x_r, a_r[1] * x_r, a_i[1] * x_r
|
||||
"vfmadd231ps %%ymm10, %%ymm7, %%ymm13 \n\t" // a_r[0] * x_i , a_i[0] * x_i, a_r[1] * x_i, a_i[1] * x_i
|
||||
"vfmadd231ps %%ymm11, %%ymm6, %%ymm14 \n\t" // a_r[2] * x_r , a_i[2] * x_r, a_r[3] * x_r, a_i[3] * x_r
|
||||
"vfmadd231ps %%ymm11, %%ymm7, %%ymm15 \n\t" // a_r[2] * x_i , a_i[2] * x_i, a_r[3] * x_i, a_i[3] * x_i
|
||||
|
||||
"prefetcht0 320(%3,%0,4) \n\t"
|
||||
"vmovups (%3,%0,4), %%ymm10 \n\t"
|
||||
"vmovups 32(%3,%0,4), %%ymm11 \n\t"
|
||||
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
"vpermilps $0xb1 , %%ymm13, %%ymm13 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm15, %%ymm15 \n\t"
|
||||
"vaddsubps %%ymm13, %%ymm12, %%ymm8 \n\t"
|
||||
"vaddsubps %%ymm15, %%ymm14, %%ymm9 \n\t"
|
||||
#else
|
||||
"vpermilps $0xb1 , %%ymm12, %%ymm12 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm14, %%ymm14 \n\t"
|
||||
"vaddsubps %%ymm12, %%ymm13, %%ymm8 \n\t"
|
||||
"vaddsubps %%ymm14, %%ymm15, %%ymm9 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm8 , %%ymm8 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm9 , %%ymm9 \n\t"
|
||||
#endif
|
||||
|
||||
"vaddps %%ymm8, %%ymm10, %%ymm12 \n\t"
|
||||
"vaddps %%ymm9, %%ymm11, %%ymm13 \n\t"
|
||||
|
||||
"vmovups %%ymm12, (%3,%0,4) \n\t" // 4 complex values to y
|
||||
"vmovups %%ymm13, 32(%3,%0,4) \n\t"
|
||||
|
||||
"addq $16, %0 \n\t"
|
||||
"subq $8 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
"vzeroupper \n\t"
|
||||
|
||||
:
|
||||
:
|
||||
"r" (i), // 0
|
||||
"r" (n), // 1
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap[0]), // 4
|
||||
"r" (ap[1]), // 5
|
||||
"r" (ap[2]), // 6
|
||||
"r" (ap[3]) // 7
|
||||
: "cc",
|
||||
"%xmm0", "%xmm1", "%xmm2", "%xmm3",
|
||||
"%xmm4", "%xmm5", "%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
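Both the removed 16x4 kernel and the new 4x4 kernel build each complex product from two broadcast multiplies followed by `vpermilps $0xb1` and `vaddsubps`. The sketch below emulates that sequence on a single complex lane in plain C, assuming the usual semantics of those instructions (swap adjacent floats; subtract in even lanes, add in odd lanes). The helper names `swap_pair`, `addsub_pair` and `cmul_lane` are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Emulates vpermilps $0xb1 on one (real, imag) pair: swap the two floats. */
static void swap_pair(float v[2])
{
    float t = v[0];
    v[0] = v[1];
    v[1] = t;
}

/* Emulates vaddsubps on one pair: even lane subtracts, odd lane adds. */
static void addsub_pair(const float a[2], const float b[2], float out[2])
{
    out[0] = a[0] - b[0];
    out[1] = a[1] + b[1];
}

/* One complex lane of the kernel. same_conj != 0 models the branch taken
 * when CONJ and XCONJ are both defined or both undefined. */
static void cmul_lane(float a_r, float a_i, float x_r, float x_i,
                      int same_conj, float out[2])
{
    float p[2] = { a_r * x_r, a_i * x_r };   /* a * broadcast(x_r) */
    float q[2] = { a_r * x_i, a_i * x_i };   /* a * broadcast(x_i) */

    assert(out != NULL);

    if (same_conj) {
        swap_pair(q);                /* q = { a_i*x_i, a_r*x_i } */
        addsub_pair(p, q, out);      /* { a_r*x_r - a_i*x_i, a_i*x_r + a_r*x_i } */
    } else {
        swap_pair(p);                /* p = { a_i*x_r, a_r*x_r } */
        addsub_pair(q, p, out);      /* { a_r*x_i - a_i*x_r, a_i*x_i + a_r*x_r } */
        swap_pair(out);              /* back to (real, imag) order */
    }
}
```

For a = 1+2i and x = 5+6i the first branch yields -7+16i (the product a·x) and the second yields 17-4i (conj(a)·x), which matches the scalar `#if`/`#else` formulas used in the C tail code.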
|
@ -0,0 +1,542 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#define HAVE_KERNEL_4x4 1
|
||||
static void cgemv_kernel_4x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y) __attribute__ ((noinline));
|
||||
|
||||
static void cgemv_kernel_4x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
BLASLONG register n1 = n & -8 ;
|
||||
BLASLONG register n2 = n & 4 ;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"vzeroupper \n\t"
|
||||
|
||||
"vbroadcastss (%2), %%ymm0 \n\t" // real part x0
|
||||
"vbroadcastss 4(%2), %%ymm1 \n\t" // imag part x0
|
||||
"vbroadcastss 8(%2), %%ymm2 \n\t" // real part x1
|
||||
"vbroadcastss 12(%2), %%ymm3 \n\t" // imag part x1
|
||||
"vbroadcastss 16(%2), %%ymm4 \n\t" // real part x2
|
||||
"vbroadcastss 20(%2), %%ymm5 \n\t" // imag part x2
|
||||
"vbroadcastss 24(%2), %%ymm6 \n\t" // real part x3
|
||||
"vbroadcastss 28(%2), %%ymm7 \n\t" // imag part x3
|
||||
|
||||
"cmpq $0 , %1 \n\t"
|
||||
"je .L01END%= \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"prefetcht0 320(%4,%0,4) \n\t"
|
||||
"vmovups (%4,%0,4), %%ymm8 \n\t" // 4 complex values from a0
|
||||
"vmovups 32(%4,%0,4), %%ymm9 \n\t" // 4 complex values from a0
|
||||
|
||||
"prefetcht0 320(%5,%0,4) \n\t"
|
||||
"vmovups (%5,%0,4), %%ymm10 \n\t" // 4 complex values from a1
|
||||
"vmovups 32(%5,%0,4), %%ymm11 \n\t" // 4 complex values from a1
|
||||
|
||||
"vmulps %%ymm8 , %%ymm0, %%ymm12 \n\t" // a_r[0] * x_r , a_i[0] * x_r, a_r[1] * x_r, a_i[1] * x_r
|
||||
"vmulps %%ymm8 , %%ymm1, %%ymm13 \n\t" // a_r[0] * x_i , a_i[0] * x_i, a_r[1] * x_i, a_i[1] * x_i
|
||||
"vmulps %%ymm9 , %%ymm0, %%ymm14 \n\t" // a_r[2] * x_r , a_i[2] * x_r, a_r[3] * x_r, a_i[3] * x_r
|
||||
"vmulps %%ymm9 , %%ymm1, %%ymm15 \n\t" // a_r[2] * x_i , a_i[2] * x_i, a_r[3] * x_i, a_i[3] * x_i
|
||||
|
||||
"prefetcht0 320(%6,%0,4) \n\t"
|
||||
"vmovups (%6,%0,4), %%ymm8 \n\t" // 4 complex values from a2
|
||||
"vmovups 32(%6,%0,4), %%ymm9 \n\t" // 4 complex values from a2
|
||||
|
||||
"vfmadd231ps %%ymm10, %%ymm2, %%ymm12 \n\t" // += a1_r[0] * x1_r , a1_i[0] * x1_r, a1_r[1] * x1_r, a1_i[1] * x1_r
|
||||
"vfmadd231ps %%ymm10, %%ymm3, %%ymm13 \n\t" // += a1_r[0] * x1_i , a1_i[0] * x1_i, a1_r[1] * x1_i, a1_i[1] * x1_i
|
||||
"vfmadd231ps %%ymm11, %%ymm2, %%ymm14 \n\t" // += a1_r[2] * x1_r , a1_i[2] * x1_r, a1_r[3] * x1_r, a1_i[3] * x1_r
|
||||
"vfmadd231ps %%ymm11, %%ymm3, %%ymm15 \n\t" // += a1_r[2] * x1_i , a1_i[2] * x1_i, a1_r[3] * x1_i, a1_i[3] * x1_i
|
||||
|
||||
"prefetcht0 320(%7,%0,4) \n\t"
|
||||
"vmovups (%7,%0,4), %%ymm10 \n\t" // 4 complex values from a3
|
||||
"vmovups 32(%7,%0,4), %%ymm11 \n\t" // 4 complex values from a3
|
||||
|
||||
"vfmadd231ps %%ymm8 , %%ymm4, %%ymm12 \n\t" // += a2_r[0] * x2_r , a2_i[0] * x2_r, a2_r[1] * x2_r, a2_i[1] * x2_r
|
||||
"vfmadd231ps %%ymm8 , %%ymm5, %%ymm13 \n\t" // += a2_r[0] * x2_i , a2_i[0] * x2_i, a2_r[1] * x2_i, a2_i[1] * x2_i
|
||||
"vfmadd231ps %%ymm9 , %%ymm4, %%ymm14 \n\t" // += a2_r[2] * x2_r , a2_i[2] * x2_r, a2_r[3] * x2_r, a2_i[3] * x2_r
|
||||
"vfmadd231ps %%ymm9 , %%ymm5, %%ymm15 \n\t" // += a2_r[2] * x2_i , a2_i[2] * x2_i, a2_r[3] * x2_i, a2_i[3] * x2_i
|
||||
|
||||
"vfmadd231ps %%ymm10, %%ymm6, %%ymm12 \n\t" // += a3_r[0] * x3_r , a3_i[0] * x3_r, a3_r[1] * x3_r, a3_i[1] * x3_r
|
||||
"vfmadd231ps %%ymm10, %%ymm7, %%ymm13 \n\t" // += a3_r[0] * x3_i , a3_i[0] * x3_i, a3_r[1] * x3_i, a3_i[1] * x3_i
|
||||
"vfmadd231ps %%ymm11, %%ymm6, %%ymm14 \n\t" // += a3_r[2] * x3_r , a3_i[2] * x3_r, a3_r[3] * x3_r, a3_i[3] * x3_r
|
||||
"vfmadd231ps %%ymm11, %%ymm7, %%ymm15 \n\t" // += a3_r[2] * x3_i , a3_i[2] * x3_i, a3_r[3] * x3_i, a3_i[3] * x3_i
|
||||
|
||||
"prefetcht0 320(%3,%0,4) \n\t"
|
||||
"vmovups (%3,%0,4), %%ymm10 \n\t"
|
||||
"vmovups 32(%3,%0,4), %%ymm11 \n\t"
|
||||
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
"vpermilps $0xb1 , %%ymm13, %%ymm13 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm15, %%ymm15 \n\t"
|
||||
"vaddsubps %%ymm13, %%ymm12, %%ymm8 \n\t"
|
||||
"vaddsubps %%ymm15, %%ymm14, %%ymm9 \n\t"
|
||||
#else
|
||||
"vpermilps $0xb1 , %%ymm12, %%ymm12 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm14, %%ymm14 \n\t"
|
||||
"vaddsubps %%ymm12, %%ymm13, %%ymm8 \n\t"
|
||||
"vaddsubps %%ymm14, %%ymm15, %%ymm9 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm8 , %%ymm8 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm9 , %%ymm9 \n\t"
|
||||
#endif
|
||||
|
||||
"vaddps %%ymm8, %%ymm10, %%ymm12 \n\t"
|
||||
"vaddps %%ymm9, %%ymm11, %%ymm13 \n\t"
|
||||
|
||||
"vmovups %%ymm12, (%3,%0,4) \n\t" // 4 complex values to y
|
||||
"vmovups %%ymm13, 32(%3,%0,4) \n\t"
|
||||
|
||||
"addq $16, %0 \n\t"
|
||||
"subq $8 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
".L01END%=: \n\t"
|
||||
|
||||
"cmpq $4, %8 \n\t"
|
||||
"jne .L02END%= \n\t"
|
||||
|
||||
"vmovups (%4,%0,4), %%ymm8 \n\t" // 4 complex values from a0
|
||||
"vmovups (%5,%0,4), %%ymm10 \n\t" // 4 complex values from a1
|
||||
|
||||
"vmulps %%ymm8 , %%ymm0, %%ymm12 \n\t" // a_r[0] * x_r , a_i[0] * x_r, a_r[1] * x_r, a_i[1] * x_r
|
||||
"vmulps %%ymm8 , %%ymm1, %%ymm13 \n\t" // a_r[0] * x_i , a_i[0] * x_i, a_r[1] * x_i, a_i[1] * x_i
|
||||
|
||||
"vfmadd231ps %%ymm10, %%ymm2, %%ymm12 \n\t" // += a1_r[0] * x1_r , a1_i[0] * x1_r, a1_r[1] * x1_r, a1_i[1] * x1_r
|
||||
"vfmadd231ps %%ymm10, %%ymm3, %%ymm13 \n\t" // += a1_r[0] * x1_i , a1_i[0] * x1_i, a1_r[1] * x1_i, a1_i[1] * x1_i
|
||||
|
||||
"vmovups (%6,%0,4), %%ymm8 \n\t" // 4 complex values from a2
|
||||
"vmovups (%7,%0,4), %%ymm10 \n\t" // 4 complex values from a3
|
||||
|
||||
"vfmadd231ps %%ymm8 , %%ymm4, %%ymm12 \n\t" // += a2_r[0] * x2_r , a2_i[0] * x2_r, a2_r[1] * x2_r, a2_i[1] * x2_r
|
||||
"vfmadd231ps %%ymm8 , %%ymm5, %%ymm13 \n\t" // += a2_r[0] * x2_i , a2_i[0] * x2_i, a2_r[1] * x2_i, a2_i[1] * x2_i
|
||||
|
||||
"vfmadd231ps %%ymm10, %%ymm6, %%ymm12 \n\t" // += a3_r[0] * x3_r , a3_i[0] * x3_r, a3_r[1] * x3_r, a3_i[1] * x3_r
|
||||
"vfmadd231ps %%ymm10, %%ymm7, %%ymm13 \n\t" // += a3_r[0] * x3_i , a3_i[0] * x3_i, a3_r[1] * x3_i, a3_i[1] * x3_i
|
||||
|
||||
"vmovups (%3,%0,4), %%ymm10 \n\t"
|
||||
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
"vpermilps $0xb1 , %%ymm13, %%ymm13 \n\t"
|
||||
"vaddsubps %%ymm13, %%ymm12, %%ymm8 \n\t"
|
||||
#else
|
||||
"vpermilps $0xb1 , %%ymm12, %%ymm12 \n\t"
|
||||
"vaddsubps %%ymm12, %%ymm13, %%ymm8 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm8 , %%ymm8 \n\t"
|
||||
#endif
|
||||
|
||||
"vaddps %%ymm8, %%ymm10, %%ymm12 \n\t"
|
||||
|
||||
"vmovups %%ymm12, (%3,%0,4) \n\t" // 4 complex values to y
|
||||
|
||||
".L02END%=: \n\t"
|
||||
"vzeroupper \n\t"
|
||||
|
||||
:
|
||||
:
|
||||
"r" (i), // 0
|
||||
"r" (n1), // 1
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap[0]), // 4
|
||||
"r" (ap[1]), // 5
|
||||
"r" (ap[2]), // 6
|
||||
"r" (ap[3]), // 7
|
||||
"r" (n2) // 8
|
||||
: "cc",
|
||||
"%xmm0", "%xmm1", "%xmm2", "%xmm3",
|
||||
"%xmm4", "%xmm5", "%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
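The kernel above splits its row count into `n1 = n & -8` (rows handled by the unrolled loop, eight complex values per iteration) and `n2 = n & 4` (one optional block of four). A small sketch of that arithmetic follows; `split_rows` and `rows_covered` are hypothetical names used only for illustration. It suggests that the two pieces cover all rows only when `n` is a multiple of 4, which appears to be what the calling code is expected to guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper mirroring the n1/n2 computation in the kernels. */
static void split_rows(long n, long *n1, long *n2)
{
    assert(n >= 0 && n1 != NULL && n2 != NULL);
    *n1 = n & -8L;   /* largest multiple of 8 not exceeding n */
    *n2 = n & 4L;    /* 4 if bit 2 of n is set, otherwise 0   */
}

/* Number of rows the main loop plus the 4-row tail would process. */
static long rows_covered(long n)
{
    long n1;
    long n2;
    split_rows(n, &n1, &n2);
    return n1 + n2;
}
```

For example, `n = 20` gives `n1 = 16` and `n2 = 4`, so all 20 rows are processed, while `n = 7` gives `n1 = 0` and `n2 = 4`, leaving three rows that this split alone does not reach.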
|
||||
|
||||
|
||||
#define HAVE_KERNEL_4x2 1
|
||||
static void cgemv_kernel_4x2( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y) __attribute__ ((noinline));
|
||||
|
||||
static void cgemv_kernel_4x2( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
BLASLONG register n1 = n & -8 ;
|
||||
BLASLONG register n2 = n & 4 ;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"vzeroupper \n\t"
|
||||
|
||||
"vbroadcastss (%2), %%ymm0 \n\t" // real part x0
|
||||
"vbroadcastss 4(%2), %%ymm1 \n\t" // imag part x0
|
||||
"vbroadcastss 8(%2), %%ymm2 \n\t" // real part x1
|
||||
"vbroadcastss 12(%2), %%ymm3 \n\t" // imag part x1
|
||||
|
||||
"cmpq $0 , %1 \n\t"
|
||||
"je .L01END%= \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"prefetcht0 320(%4,%0,4) \n\t"
|
||||
"vmovups (%4,%0,4), %%ymm8 \n\t" // 4 complex values from a0
|
||||
"vmovups 32(%4,%0,4), %%ymm9 \n\t" // 4 complex values from a0
|
||||
|
||||
"prefetcht0 320(%5,%0,4) \n\t"
|
||||
"vmovups (%5,%0,4), %%ymm10 \n\t" // 4 complex values from a1
|
||||
"vmovups 32(%5,%0,4), %%ymm11 \n\t" // 4 complex values from a1
|
||||
|
||||
"vmulps %%ymm8 , %%ymm0, %%ymm12 \n\t" // a_r[0] * x_r , a_i[0] * x_r, a_r[1] * x_r, a_i[1] * x_r
|
||||
"vmulps %%ymm8 , %%ymm1, %%ymm13 \n\t" // a_r[0] * x_i , a_i[0] * x_i, a_r[1] * x_i, a_i[1] * x_i
|
||||
"vmulps %%ymm9 , %%ymm0, %%ymm14 \n\t" // a_r[2] * x_r , a_i[2] * x_r, a_r[3] * x_r, a_i[3] * x_r
|
||||
"vmulps %%ymm9 , %%ymm1, %%ymm15 \n\t" // a_r[2] * x_i , a_i[2] * x_i, a_r[3] * x_i, a_i[3] * x_i
|
||||
|
||||
"vfmadd231ps %%ymm10, %%ymm2, %%ymm12 \n\t" // += a1_r[0] * x1_r , a1_i[0] * x1_r, a1_r[1] * x1_r, a1_i[1] * x1_r
|
||||
"vfmadd231ps %%ymm10, %%ymm3, %%ymm13 \n\t" // += a1_r[0] * x1_i , a1_i[0] * x1_i, a1_r[1] * x1_i, a1_i[1] * x1_i
|
||||
"vfmadd231ps %%ymm11, %%ymm2, %%ymm14 \n\t" // += a1_r[2] * x1_r , a1_i[2] * x1_r, a1_r[3] * x1_r, a1_i[3] * x1_r
|
||||
"vfmadd231ps %%ymm11, %%ymm3, %%ymm15 \n\t" // += a1_r[2] * x1_i , a1_i[2] * x1_i, a1_r[3] * x1_i, a1_i[3] * x1_i
|
||||
|
||||
"prefetcht0 320(%3,%0,4) \n\t"
|
||||
"vmovups (%3,%0,4), %%ymm10 \n\t"
|
||||
"vmovups 32(%3,%0,4), %%ymm11 \n\t"
|
||||
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
"vpermilps $0xb1 , %%ymm13, %%ymm13 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm15, %%ymm15 \n\t"
|
||||
"vaddsubps %%ymm13, %%ymm12, %%ymm8 \n\t"
|
||||
"vaddsubps %%ymm15, %%ymm14, %%ymm9 \n\t"
|
||||
#else
|
||||
"vpermilps $0xb1 , %%ymm12, %%ymm12 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm14, %%ymm14 \n\t"
|
||||
"vaddsubps %%ymm12, %%ymm13, %%ymm8 \n\t"
|
||||
"vaddsubps %%ymm14, %%ymm15, %%ymm9 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm8 , %%ymm8 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm9 , %%ymm9 \n\t"
|
||||
#endif
|
||||
|
||||
"vaddps %%ymm8, %%ymm10, %%ymm12 \n\t"
|
||||
"vaddps %%ymm9, %%ymm11, %%ymm13 \n\t"
|
||||
|
||||
"vmovups %%ymm12, (%3,%0,4) \n\t" // 4 complex values to y
|
||||
"vmovups %%ymm13, 32(%3,%0,4) \n\t"
|
||||
|
||||
"addq $16, %0 \n\t"
|
||||
"subq $8 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
".L01END%=: \n\t"
|
||||
|
||||
"cmpq $4, %6 \n\t"
|
||||
"jne .L02END%= \n\t"
|
||||
|
||||
"vmovups (%4,%0,4), %%ymm8 \n\t" // 4 complex values from a0
|
||||
"vmovups (%5,%0,4), %%ymm10 \n\t" // 4 complex values from a1
|
||||
|
||||
"vmulps %%ymm8 , %%ymm0, %%ymm12 \n\t" // a_r[0] * x_r , a_i[0] * x_r, a_r[1] * x_r, a_i[1] * x_r
|
||||
"vmulps %%ymm8 , %%ymm1, %%ymm13 \n\t" // a_r[0] * x_i , a_i[0] * x_i, a_r[1] * x_i, a_i[1] * x_i
|
||||
|
||||
"vfmadd231ps %%ymm10, %%ymm2, %%ymm12 \n\t" // += a1_r[0] * x1_r , a1_i[0] * x1_r, a1_r[1] * x1_r, a1_i[1] * x1_r
|
||||
"vfmadd231ps %%ymm10, %%ymm3, %%ymm13 \n\t" // += a1_r[0] * x1_i , a1_i[0] * x1_i, a1_r[1] * x1_i, a1_i[1] * x1_i
|
||||
|
||||
"vmovups (%3,%0,4), %%ymm10 \n\t"
|
||||
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
"vpermilps $0xb1 , %%ymm13, %%ymm13 \n\t"
|
||||
"vaddsubps %%ymm13, %%ymm12, %%ymm8 \n\t"
|
||||
#else
|
||||
"vpermilps $0xb1 , %%ymm12, %%ymm12 \n\t"
|
||||
"vaddsubps %%ymm12, %%ymm13, %%ymm8 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm8 , %%ymm8 \n\t"
|
||||
#endif
|
||||
|
||||
"vaddps %%ymm8, %%ymm10, %%ymm12 \n\t"
|
||||
|
||||
"vmovups %%ymm12, (%3,%0,4) \n\t" // 4 complex values to y
|
||||
|
||||
".L02END%=: \n\t"
|
||||
"vzeroupper \n\t"
|
||||
|
||||
:
|
||||
:
|
||||
"r" (i), // 0
|
||||
"r" (n1), // 1
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap[0]), // 4
|
||||
"r" (ap[1]), // 5
|
||||
"r" (n2) // 6
|
||||
: "cc",
|
||||
"%xmm0", "%xmm1", "%xmm2", "%xmm3",
|
||||
"%xmm4", "%xmm5", "%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
#define HAVE_KERNEL_4x1 1
|
||||
static void cgemv_kernel_4x1( BLASLONG n, FLOAT *ap, FLOAT *x, FLOAT *y) __attribute__ ((noinline));
|
||||
|
||||
static void cgemv_kernel_4x1( BLASLONG n, FLOAT *ap, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
BLASLONG register n1 = n & -8 ;
|
||||
BLASLONG register n2 = n & 4 ;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"vzeroupper \n\t"
|
||||
|
||||
"vbroadcastss (%2), %%ymm0 \n\t" // real part x0
|
||||
"vbroadcastss 4(%2), %%ymm1 \n\t" // imag part x0
|
||||
|
||||
"cmpq $0 , %1 \n\t"
|
||||
"je .L01END%= \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"prefetcht0 320(%4,%0,4) \n\t"
|
||||
"vmovups (%4,%0,4), %%ymm8 \n\t" // 4 complex values from a0
|
||||
"vmovups 32(%4,%0,4), %%ymm9 \n\t" // 4 complex values from a0
|
||||
|
||||
"vmulps %%ymm8 , %%ymm0, %%ymm12 \n\t" // a_r[0] * x_r , a_i[0] * x_r, a_r[1] * x_r, a_i[1] * x_r
|
||||
"vmulps %%ymm8 , %%ymm1, %%ymm13 \n\t" // a_r[0] * x_i , a_i[0] * x_i, a_r[1] * x_i, a_i[1] * x_i
|
||||
"vmulps %%ymm9 , %%ymm0, %%ymm14 \n\t" // a_r[2] * x_r , a_i[2] * x_r, a_r[3] * x_r, a_i[3] * x_r
|
||||
"vmulps %%ymm9 , %%ymm1, %%ymm15 \n\t" // a_r[2] * x_i , a_i[2] * x_i, a_r[3] * x_i, a_i[3] * x_i
|
||||
|
||||
"prefetcht0 320(%3,%0,4) \n\t"
|
||||
"vmovups (%3,%0,4), %%ymm10 \n\t"
|
||||
"vmovups 32(%3,%0,4), %%ymm11 \n\t"
|
||||
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
"vpermilps $0xb1 , %%ymm13, %%ymm13 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm15, %%ymm15 \n\t"
|
||||
"vaddsubps %%ymm13, %%ymm12, %%ymm8 \n\t"
|
||||
"vaddsubps %%ymm15, %%ymm14, %%ymm9 \n\t"
|
||||
#else
|
||||
"vpermilps $0xb1 , %%ymm12, %%ymm12 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm14, %%ymm14 \n\t"
|
||||
"vaddsubps %%ymm12, %%ymm13, %%ymm8 \n\t"
|
||||
"vaddsubps %%ymm14, %%ymm15, %%ymm9 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm8 , %%ymm8 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm9 , %%ymm9 \n\t"
|
||||
#endif
|
||||
|
||||
"addq $16, %0 \n\t"
|
||||
"vaddps %%ymm8, %%ymm10, %%ymm12 \n\t"
|
||||
"vaddps %%ymm9, %%ymm11, %%ymm13 \n\t"
|
||||
|
||||
"subq $8 , %1 \n\t"
|
||||
"vmovups %%ymm12,-64(%3,%0,4) \n\t" // 4 complex values to y
|
||||
"vmovups %%ymm13,-32(%3,%0,4) \n\t"
|
||||
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
".L01END%=: \n\t"
|
||||
|
||||
"cmpq $4, %5 \n\t"
|
||||
"jne .L02END%= \n\t"
|
||||
|
||||
"vmovups (%4,%0,4), %%ymm8 \n\t" // 4 complex values from a0
|
||||
|
||||
"vmulps %%ymm8 , %%ymm0, %%ymm12 \n\t" // a_r[0] * x_r , a_i[0] * x_r, a_r[1] * x_r, a_i[1] * x_r
|
||||
"vmulps %%ymm8 , %%ymm1, %%ymm13 \n\t" // a_r[0] * x_i , a_i[0] * x_i, a_r[1] * x_i, a_i[1] * x_i
|
||||
|
||||
"vmovups (%3,%0,4), %%ymm10 \n\t"
|
||||
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
"vpermilps $0xb1 , %%ymm13, %%ymm13 \n\t"
|
||||
"vaddsubps %%ymm13, %%ymm12, %%ymm8 \n\t"
|
||||
#else
|
||||
"vpermilps $0xb1 , %%ymm12, %%ymm12 \n\t"
|
||||
"vaddsubps %%ymm12, %%ymm13, %%ymm8 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm8 , %%ymm8 \n\t"
|
||||
#endif
|
||||
|
||||
"vaddps %%ymm8, %%ymm10, %%ymm12 \n\t"
|
||||
|
||||
"vmovups %%ymm12, (%3,%0,4) \n\t" // 4 complex values to y
|
||||
|
||||
".L02END%=: \n\t"
|
||||
"vzeroupper \n\t"
|
||||
|
||||
:
|
||||
:
|
||||
"r" (i), // 0
|
||||
"r" (n1), // 1
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap), // 4
|
||||
"r" (n2) // 5
|
||||
: "cc",
|
||||
"%xmm0", "%xmm1", "%xmm2", "%xmm3",
|
||||
"%xmm4", "%xmm5", "%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
#define HAVE_KERNEL_ADDY 1
|
||||
|
||||
static void add_y(BLASLONG n, FLOAT *src, FLOAT *dest, BLASLONG inc_dest,FLOAT alpha_r, FLOAT alpha_i) __attribute__ ((noinline));
|
||||
|
||||
static void add_y(BLASLONG n, FLOAT *src, FLOAT *dest, BLASLONG inc_dest,FLOAT alpha_r, FLOAT alpha_i)
|
||||
{
|
||||
BLASLONG i;
|
||||
|
||||
if ( inc_dest != 2 )
|
||||
{
|
||||
|
||||
FLOAT temp_r;
|
||||
FLOAT temp_i;
|
||||
for ( i=0; i<n; i++ )
|
||||
{
|
||||
#if !defined(XCONJ)
|
||||
temp_r = alpha_r * src[0] - alpha_i * src[1];
|
||||
temp_i = alpha_r * src[1] + alpha_i * src[0];
|
||||
#else
|
||||
temp_r = alpha_r * src[0] + alpha_i * src[1];
|
||||
temp_i = -alpha_r * src[1] + alpha_i * src[0];
|
||||
#endif
|
||||
|
||||
*dest += temp_r;
|
||||
*(dest+1) += temp_i;
|
||||
|
||||
src+=2;
|
||||
dest += inc_dest;
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
i=0;
|
||||
BLASLONG register n1 = n & -8 ;
|
||||
BLASLONG register n2 = n & 4 ;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"vzeroupper \n\t"
|
||||
|
||||
"vbroadcastss (%4), %%ymm0 \n\t" // alpha_r
|
||||
"vbroadcastss (%5), %%ymm1 \n\t" // alpha_i
|
||||
|
||||
"cmpq $0 , %1 \n\t"
|
||||
"je .L01END%= \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"vmovups (%2,%0,4), %%ymm8 \n\t" // 4 complex values from src
|
||||
"vmovups 32(%2,%0,4), %%ymm9 \n\t"
|
||||
|
||||
"vmulps %%ymm8 , %%ymm0, %%ymm12 \n\t" // src_r[0] * alpha_r , src_i[0] * alpha_r, src_r[1] * alpha_r, src_i[1] * alpha_r
|
||||
"vmulps %%ymm8 , %%ymm1, %%ymm13 \n\t" // src_r[0] * alpha_i , src_i[0] * alpha_i, src_r[1] * alpha_i, src_i[1] * alpha_i
|
||||
"vmulps %%ymm9 , %%ymm0, %%ymm14 \n\t" // src_r[4] * alpha_r , src_i[4] * alpha_r, src_r[5] * alpha_r, src_i[5] * alpha_r
|
||||
"vmulps %%ymm9 , %%ymm1, %%ymm15 \n\t" // src_r[4] * alpha_i , src_i[4] * alpha_i, src_r[5] * alpha_i, src_i[5] * alpha_i
|
||||
|
||||
"vmovups (%3,%0,4), %%ymm10 \n\t" // 4 complex values from dest
|
||||
"vmovups 32(%3,%0,4), %%ymm11 \n\t"
|
||||
|
||||
#if !defined(XCONJ)
|
||||
"vpermilps $0xb1 , %%ymm13, %%ymm13 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm15, %%ymm15 \n\t"
|
||||
"vaddsubps %%ymm13, %%ymm12, %%ymm8 \n\t"
|
||||
"vaddsubps %%ymm15, %%ymm14, %%ymm9 \n\t"
|
||||
#else
|
||||
"vpermilps $0xb1 , %%ymm12, %%ymm12 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm14, %%ymm14 \n\t"
|
||||
"vaddsubps %%ymm12, %%ymm13, %%ymm8 \n\t"
|
||||
"vaddsubps %%ymm14, %%ymm15, %%ymm9 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm8 , %%ymm8 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm9 , %%ymm9 \n\t"
|
||||
#endif
|
||||
|
||||
"addq $16, %0 \n\t"
|
||||
"vaddps %%ymm8, %%ymm10, %%ymm12 \n\t"
|
||||
"vaddps %%ymm9, %%ymm11, %%ymm13 \n\t"
|
||||
|
||||
"subq $8 , %1 \n\t"
|
||||
"vmovups %%ymm12,-64(%3,%0,4) \n\t" // 4 complex values to dest
|
||||
"vmovups %%ymm13,-32(%3,%0,4) \n\t"
|
||||
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
".L01END%=: \n\t"
|
||||
|
||||
"cmpq $4, %6 \n\t"
|
||||
"jne .L02END%= \n\t"
|
||||
|
||||
"vmovups (%2,%0,4), %%ymm8 \n\t" // 4 complex values src
|
||||
|
||||
"vmulps %%ymm8 , %%ymm0, %%ymm12 \n\t" // src_r[0] * alpha_r , src_i[0] * alpha_r, src_r[1] * alpha_r, src_i[1] * alpha_r
|
||||
"vmulps %%ymm8 , %%ymm1, %%ymm13 \n\t" // src_r[0] * alpha_i , src_i[0] * alpha_i, src_r[1] * alpha_i, src_i[1] * alpha_i
|
||||
|
||||
"vmovups (%3,%0,4), %%ymm10 \n\t"
|
||||
|
||||
#if !defined(XCONJ)
|
||||
"vpermilps $0xb1 , %%ymm13, %%ymm13 \n\t"
|
||||
"vaddsubps %%ymm13, %%ymm12, %%ymm8 \n\t"
|
||||
#else
|
||||
"vpermilps $0xb1 , %%ymm12, %%ymm12 \n\t"
|
||||
"vaddsubps %%ymm12, %%ymm13, %%ymm8 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm8 , %%ymm8 \n\t"
|
||||
#endif
|
||||
|
||||
"vaddps %%ymm8, %%ymm10, %%ymm12 \n\t"
|
||||
|
||||
"vmovups %%ymm12, (%3,%0,4) \n\t" // 4 complex values to dest
|
||||
|
||||
".L02END%=: \n\t"
|
||||
"vzeroupper \n\t"
|
||||
|
||||
:
|
||||
"+r" (i), // 0 (read-write: advanced by addq inside the loop)
|
||||
"+r" (n1) // 1 (read-write: decremented by subq inside the loop)
|
||||
:
|
||||
"r" (src), // 2
|
||||
"r" (dest), // 3
|
||||
"r" (&alpha_r), // 4
|
||||
"r" (&alpha_i), // 5
|
||||
"r" (n2) // 6
|
||||
: "cc",
|
||||
"%xmm0", "%xmm1", "%xmm2", "%xmm3",
|
||||
"%xmm4", "%xmm5", "%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
return;
|
||||
|
||||
}
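
For readers checking the contiguous path of add_y above, here is a scalar sketch of what the `vpermilps $0xb1` / `vaddsubps` sequence appears to compute for each complex element. The helper names (`swap_pair`, `addsub_pair`, `add_y_sketch`) and the runtime `xconj` flag are illustrative assumptions and not part of this commit. The real kernel selects the variant at compile time through XCONJ and handles 4 or 8 complex values per step.

```c
#include <assert.h>
#include <stddef.h>

/* Emulates vpermilps $0xb1 on one (re, im) pair: swaps the two lanes. */
static void swap_pair(const float in[2], float out[2])
{
    float re = in[0], im = in[1];   /* read first so in == out is safe */
    out[0] = im;
    out[1] = re;
}

/* Emulates vaddsubps on one pair: even lane a - b, odd lane a + b. */
static void addsub_pair(const float a[2], const float b[2], float out[2])
{
    float even = a[0] - b[0];
    float odd  = a[1] + b[1];
    out[0] = even;
    out[1] = odd;
}

/* dest[k] += alpha * src[k] when xconj == 0,
 * dest[k] += alpha * conj(src[k]) when xconj != 0,
 * for n interleaved (re, im) single-precision values. */
static void add_y_sketch(size_t n, const float *src, float *dest,
                         float alpha_r, float alpha_i, int xconj)
{
    assert(src != NULL && dest != NULL);
    for (size_t k = 0; k < n; k++) {
        const float *s = src + 2 * k;
        float by_r[2] = { s[0] * alpha_r, s[1] * alpha_r }; /* role of ymm12 */
        float by_i[2] = { s[0] * alpha_i, s[1] * alpha_i }; /* role of ymm13 */
        float t[2];

        if (!xconj) {
            swap_pair(by_i, by_i);
            addsub_pair(by_r, by_i, t);
        } else {
            swap_pair(by_r, by_r);
            addsub_pair(by_i, by_r, t);
            swap_pair(t, t);
        }
        dest[2 * k]     += t[0];
        dest[2 * k + 1] += t[1];
    }
}
```

A sketch like this can serve as a reference when comparing the assembly path against the strided C loop at the top of add_y, since both should agree element by element.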
|
||||
|
||||
|
|
@ -1,265 +0,0 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
#include "common.h"
|
||||
|
||||
#if defined(HASWELL)
|
||||
#include "cgemv_t_microk_haswell-2.c"
|
||||
#endif
|
||||
|
||||
#define NBMAX 2048
|
||||
|
||||
#ifndef HAVE_KERNEL_16x4
|
||||
|
||||
static void cgemv_kernel_16x4(BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
BLASLONG i;
|
||||
FLOAT *a0,*a1,*a2,*a3;
|
||||
a0 = ap[0];
|
||||
a1 = ap[1];
|
||||
a2 = ap[2];
|
||||
a3 = ap[3];
|
||||
FLOAT temp_r0 = 0.0;
|
||||
FLOAT temp_r1 = 0.0;
|
||||
FLOAT temp_r2 = 0.0;
|
||||
FLOAT temp_r3 = 0.0;
|
||||
FLOAT temp_i0 = 0.0;
|
||||
FLOAT temp_i1 = 0.0;
|
||||
FLOAT temp_i2 = 0.0;
|
||||
FLOAT temp_i3 = 0.0;
|
||||
|
||||
|
||||
for ( i=0; i< 2*n; i+=2 )
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
temp_r0 += a0[i]*x[i] - a0[i+1]*x[i+1];
|
||||
temp_i0 += a0[i]*x[i+1] + a0[i+1]*x[i];
|
||||
temp_r1 += a1[i]*x[i] - a1[i+1]*x[i+1];
|
||||
temp_i1 += a1[i]*x[i+1] + a1[i+1]*x[i];
|
||||
temp_r2 += a2[i]*x[i] - a2[i+1]*x[i+1];
|
||||
temp_i2 += a2[i]*x[i+1] + a2[i+1]*x[i];
|
||||
temp_r3 += a3[i]*x[i] - a3[i+1]*x[i+1];
|
||||
temp_i3 += a3[i]*x[i+1] + a3[i+1]*x[i];
|
||||
#else
|
||||
temp_r0 += a0[i]*x[i] + a0[i+1]*x[i+1];
|
||||
temp_i0 += a0[i]*x[i+1] - a0[i+1]*x[i];
|
||||
temp_r1 += a1[i]*x[i] + a1[i+1]*x[i+1];
|
||||
temp_i1 += a1[i]*x[i+1] - a1[i+1]*x[i];
|
||||
temp_r2 += a2[i]*x[i] + a2[i+1]*x[i+1];
|
||||
temp_i2 += a2[i]*x[i+1] - a2[i+1]*x[i];
|
||||
temp_r3 += a3[i]*x[i] + a3[i+1]*x[i+1];
|
||||
temp_i3 += a3[i]*x[i+1] - a3[i+1]*x[i];
|
||||
#endif
|
||||
}
|
||||
y[0] = temp_r0;
|
||||
y[1] = temp_i0;
|
||||
y[2] = temp_r1;
|
||||
y[3] = temp_i1;
|
||||
y[4] = temp_r2;
|
||||
y[5] = temp_i2;
|
||||
y[6] = temp_r3;
|
||||
y[7] = temp_i3;
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
static void cgemv_kernel_16x1(BLASLONG n, FLOAT *ap, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
BLASLONG i;
|
||||
FLOAT *a0;
|
||||
a0 = ap;
|
||||
FLOAT temp_r = 0.0;
|
||||
FLOAT temp_i = 0.0;
|
||||
|
||||
for ( i=0; i< 2*n; i+=2 )
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
temp_r += a0[i]*x[i] - a0[i+1]*x[i+1];
|
||||
temp_i += a0[i]*x[i+1] + a0[i+1]*x[i];
|
||||
#else
|
||||
temp_r += a0[i]*x[i] + a0[i+1]*x[i+1];
|
||||
temp_i += a0[i]*x[i+1] - a0[i+1]*x[i];
|
||||
#endif
|
||||
}
|
||||
*y = temp_r;
|
||||
*(y+1) = temp_i;
|
||||
}
|
||||
|
||||
static void copy_x(BLASLONG n, FLOAT *src, FLOAT *dest, BLASLONG inc_src)
|
||||
{
|
||||
BLASLONG i;
|
||||
for ( i=0; i<n; i++ )
|
||||
{
|
||||
*dest = *src;
|
||||
*(dest+1) = *(src+1);
|
||||
dest+=2;
|
||||
src += inc_src;
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
int CNAME(BLASLONG m, BLASLONG n, BLASLONG dummy1, FLOAT alpha_r, FLOAT alpha_i, FLOAT *a, BLASLONG lda, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y, FLOAT *buffer)
|
||||
{
|
||||
BLASLONG i;
|
||||
BLASLONG j;
|
||||
FLOAT *a_ptr;
|
||||
FLOAT *x_ptr;
|
||||
FLOAT *y_ptr;
|
||||
FLOAT *ap[8];
|
||||
BLASLONG n1;
|
||||
BLASLONG m1;
|
||||
BLASLONG m2;
|
||||
BLASLONG n2;
|
||||
FLOAT ybuffer[8],*xbuffer;
|
||||
|
||||
inc_x *= 2;
|
||||
inc_y *= 2;
|
||||
lda *= 2;
|
||||
|
||||
xbuffer = buffer;
|
||||
|
||||
n1 = n / 4 ;
|
||||
n2 = n % 4 ;
|
||||
|
||||
m1 = m - ( m % 16 );
|
||||
m2 = (m % NBMAX) - (m % 16) ;
|
||||
|
||||
|
||||
BLASLONG NB = NBMAX;
|
||||
|
||||
while ( NB == NBMAX )
|
||||
{
|
||||
|
||||
m1 -= NB;
|
||||
if ( m1 < 0)
|
||||
{
|
||||
if ( m2 == 0 ) break;
|
||||
NB = m2;
|
||||
}
|
||||
|
||||
y_ptr = y;
|
||||
a_ptr = a;
|
||||
x_ptr = x;
|
||||
copy_x(NB,x_ptr,xbuffer,inc_x);
|
||||
for( i = 0; i < n1 ; i++)
|
||||
{
|
||||
ap[0] = a_ptr;
|
||||
ap[1] = a_ptr + lda;
|
||||
ap[2] = ap[1] + lda;
|
||||
ap[3] = ap[2] + lda;
|
||||
cgemv_kernel_16x4(NB,ap,xbuffer,ybuffer);
|
||||
a_ptr += 4 * lda;
|
||||
|
||||
#if !defined(XCONJ)
|
||||
y_ptr[0] += alpha_r * ybuffer[0] - alpha_i * ybuffer[1];
|
||||
y_ptr[1] += alpha_r * ybuffer[1] + alpha_i * ybuffer[0];
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += alpha_r * ybuffer[2] - alpha_i * ybuffer[3];
|
||||
y_ptr[1] += alpha_r * ybuffer[3] + alpha_i * ybuffer[2];
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += alpha_r * ybuffer[4] - alpha_i * ybuffer[5];
|
||||
y_ptr[1] += alpha_r * ybuffer[5] + alpha_i * ybuffer[4];
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += alpha_r * ybuffer[6] - alpha_i * ybuffer[7];
|
||||
y_ptr[1] += alpha_r * ybuffer[7] + alpha_i * ybuffer[6];
|
||||
y_ptr += inc_y;
|
||||
#else
|
||||
y_ptr[0] += alpha_r * ybuffer[0] + alpha_i * ybuffer[1];
|
||||
y_ptr[1] -= alpha_r * ybuffer[1] - alpha_i * ybuffer[0];
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += alpha_r * ybuffer[2] + alpha_i * ybuffer[3];
|
||||
y_ptr[1] -= alpha_r * ybuffer[3] - alpha_i * ybuffer[2];
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += alpha_r * ybuffer[4] + alpha_i * ybuffer[5];
|
||||
y_ptr[1] -= alpha_r * ybuffer[5] - alpha_i * ybuffer[4];
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += alpha_r * ybuffer[6] + alpha_i * ybuffer[7];
|
||||
y_ptr[1] -= alpha_r * ybuffer[7] - alpha_i * ybuffer[6];
|
||||
y_ptr += inc_y;
|
||||
#endif
|
||||
}
|
||||
|
||||
for( i = 0; i < n2 ; i++)
|
||||
{
|
||||
cgemv_kernel_16x1(NB,a_ptr,xbuffer,ybuffer);
|
||||
a_ptr += 1 * lda;
|
||||
|
||||
#if !defined(XCONJ)
|
||||
y_ptr[0] += alpha_r * ybuffer[0] - alpha_i * ybuffer[1];
|
||||
y_ptr[1] += alpha_r * ybuffer[1] + alpha_i * ybuffer[0];
|
||||
y_ptr += inc_y;
|
||||
#else
|
||||
y_ptr[0] += alpha_r * ybuffer[0] + alpha_i * ybuffer[1];
|
||||
y_ptr[1] -= alpha_r * ybuffer[1] - alpha_i * ybuffer[0];
|
||||
y_ptr += inc_y;
|
||||
#endif
|
||||
|
||||
}
|
||||
a += 2* NB;
|
||||
x += NB * inc_x;
|
||||
}
|
||||
|
||||
BLASLONG m3 = m % 16;
|
||||
if ( m3 == 0 ) return(0);
|
||||
|
||||
x_ptr = x;
|
||||
copy_x(m3,x_ptr,xbuffer,inc_x);
|
||||
j=0;
|
||||
a_ptr = a;
|
||||
y_ptr = y;
|
||||
while ( j < n)
|
||||
{
|
||||
FLOAT temp_r = 0.0;
|
||||
FLOAT temp_i = 0.0;
|
||||
for( i = 0; i < m3*2; i+=2 )
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
temp_r += a_ptr[i] * xbuffer[i] - a_ptr[i+1] * xbuffer[i+1];
|
||||
temp_i += a_ptr[i] * xbuffer[i+1] + a_ptr[i+1] * xbuffer[i];
|
||||
#else
|
||||
temp_r += a_ptr[i] * xbuffer[i] + a_ptr[i+1] * xbuffer[i+1];
|
||||
temp_i += a_ptr[i] * xbuffer[i+1] - a_ptr[i+1] * xbuffer[i];
|
||||
#endif
|
||||
}
|
||||
a_ptr += lda;
|
||||
|
||||
#if !defined(XCONJ)
|
||||
y_ptr[0] += alpha_r * temp_r - alpha_i * temp_i;
|
||||
y_ptr[1] += alpha_r * temp_i + alpha_i * temp_r;
|
||||
#else
|
||||
y_ptr[0] += alpha_r * temp_r + alpha_i * temp_i;
|
||||
y_ptr[1] -= alpha_r * temp_i - alpha_i * temp_r;
|
||||
#endif
|
||||
|
||||
y_ptr += inc_y;
|
||||
j++;
|
||||
}
|
||||
return(0);
|
||||
}
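
The two preprocessor branches used throughout the kernels above differ only in signs. The first branch (neither or both of CONJ and XCONJ defined) accumulates a*x, and the other branch accumulates conj(a)*x. The following sketch restates that with a runtime flag and cross-checks it against C99 complex arithmetic. The function names and the `conj_a` flag are hypothetical and are not part of OpenBLAS.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Dot product over n interleaved complex floats in real arithmetic.
 * conj_a == 0 mirrors the first #if branch (a * x),
 * conj_a != 0 mirrors the #else branch (conj(a) * x). */
static void cdot_interleaved(size_t n, const float *a, const float *x,
                             int conj_a, float *out_r, float *out_i)
{
    float tr = 0.0f, ti = 0.0f;
    assert(a != NULL && x != NULL && out_r != NULL && out_i != NULL);
    for (size_t i = 0; i < 2 * n; i += 2) {
        if (!conj_a) {
            tr += a[i] * x[i]     - a[i + 1] * x[i + 1];
            ti += a[i] * x[i + 1] + a[i + 1] * x[i];
        } else {
            tr += a[i] * x[i]     + a[i + 1] * x[i + 1];
            ti += a[i] * x[i + 1] - a[i + 1] * x[i];
        }
    }
    *out_r = tr;
    *out_i = ti;
}

/* The same sum written with C99 complex types, for cross-checking. */
static float complex cdot_c99(size_t n, const float *a, const float *x,
                              int conj_a)
{
    float complex acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float complex av = a[2 * i] + a[2 * i + 1] * I;
        float complex xv = x[2 * i] + x[2 * i + 1] * I;
        acc += (conj_a ? conjf(av) : av) * xv;
    }
    return acc;
}
```

Comparing the two functions on small integer-valued inputs gives exact equality, which makes sign mistakes in a hand-written kernel easy to spot.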
|
||||
|
||||
|
||||
|
|
@ -0,0 +1,579 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
#include "common.h"
|
||||
|
||||
#if defined(HASWELL)
|
||||
#include "cgemv_t_microk_haswell-4.c"
|
||||
#endif
|
||||
|
||||
#define NBMAX 2048
|
||||
|
||||
#ifndef HAVE_KERNEL_4x4
|
||||
|
||||
static void cgemv_kernel_4x4(BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, FLOAT *alpha)
|
||||
{
|
||||
BLASLONG i;
|
||||
FLOAT *a0,*a1,*a2,*a3;
|
||||
a0 = ap[0];
|
||||
a1 = ap[1];
|
||||
a2 = ap[2];
|
||||
a3 = ap[3];
|
||||
FLOAT alpha_r = alpha[0];
|
||||
FLOAT alpha_i = alpha[1];
|
||||
FLOAT temp_r0 = 0.0;
|
||||
FLOAT temp_r1 = 0.0;
|
||||
FLOAT temp_r2 = 0.0;
|
||||
FLOAT temp_r3 = 0.0;
|
||||
FLOAT temp_i0 = 0.0;
|
||||
FLOAT temp_i1 = 0.0;
|
||||
FLOAT temp_i2 = 0.0;
|
||||
FLOAT temp_i3 = 0.0;
|
||||
|
||||
|
||||
for ( i=0; i< 2*n; i+=2 )
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
temp_r0 += a0[i]*x[i] - a0[i+1]*x[i+1];
|
||||
temp_i0 += a0[i]*x[i+1] + a0[i+1]*x[i];
|
||||
temp_r1 += a1[i]*x[i] - a1[i+1]*x[i+1];
|
||||
temp_i1 += a1[i]*x[i+1] + a1[i+1]*x[i];
|
||||
temp_r2 += a2[i]*x[i] - a2[i+1]*x[i+1];
|
||||
temp_i2 += a2[i]*x[i+1] + a2[i+1]*x[i];
|
||||
temp_r3 += a3[i]*x[i] - a3[i+1]*x[i+1];
|
||||
temp_i3 += a3[i]*x[i+1] + a3[i+1]*x[i];
|
||||
#else
|
||||
temp_r0 += a0[i]*x[i] + a0[i+1]*x[i+1];
|
||||
temp_i0 += a0[i]*x[i+1] - a0[i+1]*x[i];
|
||||
temp_r1 += a1[i]*x[i] + a1[i+1]*x[i+1];
|
||||
temp_i1 += a1[i]*x[i+1] - a1[i+1]*x[i];
|
||||
temp_r2 += a2[i]*x[i] + a2[i+1]*x[i+1];
|
||||
temp_i2 += a2[i]*x[i+1] - a2[i+1]*x[i];
|
||||
temp_r3 += a3[i]*x[i] + a3[i+1]*x[i+1];
|
||||
temp_i3 += a3[i]*x[i+1] - a3[i+1]*x[i];
|
||||
#endif
|
||||
}
|
||||
|
||||
#if !defined(XCONJ)
|
||||
|
||||
y[0] += alpha_r * temp_r0 - alpha_i * temp_i0;
|
||||
y[1] += alpha_r * temp_i0 + alpha_i * temp_r0;
|
||||
y[2] += alpha_r * temp_r1 - alpha_i * temp_i1;
|
||||
y[3] += alpha_r * temp_i1 + alpha_i * temp_r1;
|
||||
y[4] += alpha_r * temp_r2 - alpha_i * temp_i2;
|
||||
y[5] += alpha_r * temp_i2 + alpha_i * temp_r2;
|
||||
y[6] += alpha_r * temp_r3 - alpha_i * temp_i3;
|
||||
y[7] += alpha_r * temp_i3 + alpha_i * temp_r3;
|
||||
|
||||
#else
|
||||
|
||||
y[0] += alpha_r * temp_r0 + alpha_i * temp_i0;
|
||||
y[1] -= alpha_r * temp_i0 - alpha_i * temp_r0;
|
||||
y[2] += alpha_r * temp_r1 + alpha_i * temp_i1;
|
||||
y[3] -= alpha_r * temp_i1 - alpha_i * temp_r1;
|
||||
y[4] += alpha_r * temp_r2 + alpha_i * temp_i2;
|
||||
y[5] -= alpha_r * temp_i2 - alpha_i * temp_r2;
|
||||
y[6] += alpha_r * temp_r3 + alpha_i * temp_i3;
|
||||
y[7] -= alpha_r * temp_i3 - alpha_i * temp_r3;
|
||||
|
||||
#endif
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
#ifndef HAVE_KERNEL_4x2
|
||||
|
||||
static void cgemv_kernel_4x2(BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, FLOAT *alpha)
|
||||
{
|
||||
BLASLONG i;
|
||||
FLOAT *a0,*a1;
|
||||
a0 = ap[0];
|
||||
a1 = ap[1];
|
||||
FLOAT alpha_r = alpha[0];
|
||||
FLOAT alpha_i = alpha[1];
|
||||
FLOAT temp_r0 = 0.0;
|
||||
FLOAT temp_r1 = 0.0;
|
||||
FLOAT temp_i0 = 0.0;
|
||||
FLOAT temp_i1 = 0.0;
|
||||
|
||||
|
||||
for ( i=0; i< 2*n; i+=2 )
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
temp_r0 += a0[i]*x[i] - a0[i+1]*x[i+1];
|
||||
temp_i0 += a0[i]*x[i+1] + a0[i+1]*x[i];
|
||||
temp_r1 += a1[i]*x[i] - a1[i+1]*x[i+1];
|
||||
temp_i1 += a1[i]*x[i+1] + a1[i+1]*x[i];
|
||||
#else
|
||||
temp_r0 += a0[i]*x[i] + a0[i+1]*x[i+1];
|
||||
temp_i0 += a0[i]*x[i+1] - a0[i+1]*x[i];
|
||||
temp_r1 += a1[i]*x[i] + a1[i+1]*x[i+1];
|
||||
temp_i1 += a1[i]*x[i+1] - a1[i+1]*x[i];
|
||||
#endif
|
||||
}
|
||||
|
||||
#if !defined(XCONJ)
|
||||
|
||||
y[0] += alpha_r * temp_r0 - alpha_i * temp_i0;
|
||||
y[1] += alpha_r * temp_i0 + alpha_i * temp_r0;
|
||||
y[2] += alpha_r * temp_r1 - alpha_i * temp_i1;
|
||||
y[3] += alpha_r * temp_i1 + alpha_i * temp_r1;
|
||||
|
||||
#else
|
||||
|
||||
y[0] += alpha_r * temp_r0 + alpha_i * temp_i0;
|
||||
y[1] -= alpha_r * temp_i0 - alpha_i * temp_r0;
|
||||
y[2] += alpha_r * temp_r1 + alpha_i * temp_i1;
|
||||
y[3] -= alpha_r * temp_i1 - alpha_i * temp_r1;
|
||||
|
||||
#endif
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
|
||||
#ifndef HAVE_KERNEL_4x1
|
||||
|
||||
static void cgemv_kernel_4x1(BLASLONG n, FLOAT *ap, FLOAT *x, FLOAT *y, FLOAT *alpha)
|
||||
{
|
||||
BLASLONG i;
|
||||
FLOAT *a0;
|
||||
a0 = ap;
|
||||
FLOAT alpha_r = alpha[0];
|
||||
FLOAT alpha_i = alpha[1];
|
||||
FLOAT temp_r0 = 0.0;
|
||||
FLOAT temp_i0 = 0.0;
|
||||
|
||||
for ( i=0; i< 2*n; i+=2 )
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
temp_r0 += a0[i]*x[i] - a0[i+1]*x[i+1];
|
||||
temp_i0 += a0[i]*x[i+1] + a0[i+1]*x[i];
|
||||
#else
|
||||
temp_r0 += a0[i]*x[i] + a0[i+1]*x[i+1];
|
||||
temp_i0 += a0[i]*x[i+1] - a0[i+1]*x[i];
|
||||
#endif
|
||||
}
|
||||
|
||||
#if !defined(XCONJ)
|
||||
|
||||
y[0] += alpha_r * temp_r0 - alpha_i * temp_i0;
|
||||
y[1] += alpha_r * temp_i0 + alpha_i * temp_r0;
|
||||
|
||||
#else
|
||||
|
||||
y[0] += alpha_r * temp_r0 + alpha_i * temp_i0;
|
||||
y[1] -= alpha_r * temp_i0 - alpha_i * temp_r0;
|
||||
|
||||
#endif
|
||||
|
||||
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
|
||||
static void copy_x(BLASLONG n, FLOAT *src, FLOAT *dest, BLASLONG inc_src)
|
||||
{
|
||||
BLASLONG i;
|
||||
for ( i=0; i<n; i++ )
|
||||
{
|
||||
*dest = *src;
|
||||
*(dest+1) = *(src+1);
|
||||
dest+=2;
|
||||
src += inc_src;
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
int CNAME(BLASLONG m, BLASLONG n, BLASLONG dummy1, FLOAT alpha_r, FLOAT alpha_i, FLOAT *a, BLASLONG lda, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y, FLOAT *buffer)
|
||||
{
|
||||
BLASLONG i;
|
||||
BLASLONG j;
|
||||
FLOAT *a_ptr;
|
||||
FLOAT *x_ptr;
|
||||
FLOAT *y_ptr;
|
||||
FLOAT *ap[8];
|
||||
BLASLONG n1;
|
||||
BLASLONG m1;
|
||||
BLASLONG m2;
|
||||
BLASLONG m3;
|
||||
BLASLONG n2;
|
||||
BLASLONG lda4;
|
||||
FLOAT ybuffer[8],*xbuffer;
|
||||
FLOAT alpha[2];
|
||||
|
||||
if ( m < 1 ) return(0);
|
||||
if ( n < 1 ) return(0);
|
||||
|
||||
inc_x <<= 1;
|
||||
inc_y <<= 1;
|
||||
lda <<= 1;
|
||||
lda4 = lda << 2;
|
||||
|
||||
xbuffer = buffer;
|
||||
|
||||
n1 = n >> 2 ;
|
||||
n2 = n & 3 ;
|
||||
|
||||
m3 = m & 3 ;
|
||||
m1 = m - m3;
|
||||
m2 = (m & (NBMAX-1)) - m3 ;
|
||||
|
||||
alpha[0] = alpha_r;
|
||||
alpha[1] = alpha_i;
|
||||
|
||||
BLASLONG NB = NBMAX;
|
||||
|
||||
while ( NB == NBMAX )
|
||||
{
|
||||
|
||||
m1 -= NB;
|
||||
if ( m1 < 0)
|
||||
{
|
||||
if ( m2 == 0 ) break;
|
||||
NB = m2;
|
||||
}
|
||||
|
||||
y_ptr = y;
|
||||
a_ptr = a;
|
||||
x_ptr = x;
|
||||
ap[0] = a_ptr;
|
||||
ap[1] = a_ptr + lda;
|
||||
ap[2] = ap[1] + lda;
|
||||
ap[3] = ap[2] + lda;
|
||||
if ( inc_x != 2 )
|
||||
copy_x(NB,x_ptr,xbuffer,inc_x);
|
||||
else
|
||||
xbuffer = x_ptr;
|
||||
|
||||
if ( inc_y == 2 )
|
||||
{
|
||||
|
||||
for( i = 0; i < n1 ; i++)
|
||||
{
|
||||
cgemv_kernel_4x4(NB,ap,xbuffer,y_ptr,alpha);
|
||||
ap[0] += lda4;
|
||||
ap[1] += lda4;
|
||||
ap[2] += lda4;
|
||||
ap[3] += lda4;
|
||||
a_ptr += lda4;
|
||||
y_ptr += 8;
|
||||
|
||||
}
|
||||
|
||||
if ( n2 & 2 )
|
||||
{
|
||||
cgemv_kernel_4x2(NB,ap,xbuffer,y_ptr,alpha);
|
||||
a_ptr += lda * 2;
|
||||
y_ptr += 4;
|
||||
|
||||
}
|
||||
|
||||
if ( n2 & 1 )
|
||||
{
|
||||
cgemv_kernel_4x1(NB,a_ptr,xbuffer,y_ptr,alpha);
|
||||
a_ptr += lda;
|
||||
y_ptr += 2;
|
||||
|
||||
}
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
|
||||
for( i = 0; i < n1 ; i++)
|
||||
{
|
||||
memset(ybuffer,0,sizeof(ybuffer));
|
||||
cgemv_kernel_4x4(NB,ap,xbuffer,ybuffer,alpha);
|
||||
ap[0] += lda4;
|
||||
ap[1] += lda4;
|
||||
ap[2] += lda4;
|
||||
ap[3] += lda4;
|
||||
a_ptr += lda4;
|
||||
|
||||
y_ptr[0] += ybuffer[0];
|
||||
y_ptr[1] += ybuffer[1];
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += ybuffer[2];
|
||||
y_ptr[1] += ybuffer[3];
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += ybuffer[4];
|
||||
y_ptr[1] += ybuffer[5];
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += ybuffer[6];
|
||||
y_ptr[1] += ybuffer[7];
|
||||
y_ptr += inc_y;
|
||||
|
||||
}
|
||||
|
||||
for( i = 0; i < n2 ; i++)
|
||||
{
|
||||
memset(ybuffer,0,sizeof(ybuffer));
|
||||
cgemv_kernel_4x1(NB,a_ptr,xbuffer,ybuffer,alpha);
|
||||
a_ptr += lda;
|
||||
y_ptr[0] += ybuffer[0];
|
||||
y_ptr[1] += ybuffer[1];
|
||||
y_ptr += inc_y;
|
||||
|
||||
}
|
||||
|
||||
}
|
||||
a += 2 * NB;
|
||||
x += NB * inc_x;
|
||||
}
|
||||
|
||||
|
||||
|
||||
if ( m3 == 0 ) return(0);
|
||||
|
||||
x_ptr = x;
|
||||
j=0;
|
||||
a_ptr = a;
|
||||
y_ptr = y;
|
||||
|
||||
if ( m3 == 3 )
|
||||
{
|
||||
|
||||
FLOAT temp_r ;
|
||||
FLOAT temp_i ;
|
||||
FLOAT x0 = x_ptr[0];
|
||||
FLOAT x1 = x_ptr[1];
|
||||
x_ptr += inc_x;
|
||||
FLOAT x2 = x_ptr[0];
|
||||
FLOAT x3 = x_ptr[1];
|
||||
x_ptr += inc_x;
|
||||
FLOAT x4 = x_ptr[0];
|
||||
FLOAT x5 = x_ptr[1];
|
||||
while ( j < n)
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
temp_r = a_ptr[0] * x0 - a_ptr[1] * x1;
|
||||
temp_i = a_ptr[0] * x1 + a_ptr[1] * x0;
|
||||
temp_r += a_ptr[2] * x2 - a_ptr[3] * x3;
|
||||
temp_i += a_ptr[2] * x3 + a_ptr[3] * x2;
|
||||
temp_r += a_ptr[4] * x4 - a_ptr[5] * x5;
|
||||
temp_i += a_ptr[4] * x5 + a_ptr[5] * x4;
|
||||
#else
|
||||
|
||||
temp_r = a_ptr[0] * x0 + a_ptr[1] * x1;
|
||||
temp_i = a_ptr[0] * x1 - a_ptr[1] * x0;
|
||||
temp_r += a_ptr[2] * x2 + a_ptr[3] * x3;
|
||||
temp_i += a_ptr[2] * x3 - a_ptr[3] * x2;
|
||||
temp_r += a_ptr[4] * x4 + a_ptr[5] * x5;
|
||||
temp_i += a_ptr[4] * x5 - a_ptr[5] * x4;
|
||||
#endif
|
||||
|
||||
#if !defined(XCONJ)
|
||||
y_ptr[0] += alpha_r * temp_r - alpha_i * temp_i;
|
||||
y_ptr[1] += alpha_r * temp_i + alpha_i * temp_r;
|
||||
#else
|
||||
y_ptr[0] += alpha_r * temp_r + alpha_i * temp_i;
|
||||
y_ptr[1] -= alpha_r * temp_i - alpha_i * temp_r;
|
||||
#endif
|
||||
|
||||
a_ptr += lda;
|
||||
y_ptr += inc_y;
|
||||
j++;
|
||||
}
|
||||
return(0);
|
||||
}
|
||||
|
||||
|
||||
if ( m3 == 2 )
|
||||
{
|
||||
|
||||
FLOAT temp_r ;
|
||||
FLOAT temp_i ;
|
||||
FLOAT temp_r1 ;
|
||||
FLOAT temp_i1 ;
|
||||
FLOAT x0 = x_ptr[0];
|
||||
FLOAT x1 = x_ptr[1];
|
||||
x_ptr += inc_x;
|
||||
FLOAT x2 = x_ptr[0];
|
||||
FLOAT x3 = x_ptr[1];
|
||||
FLOAT ar = alpha[0];
|
||||
FLOAT ai = alpha[1];
|
||||
|
||||
while ( j < ( n & -2 ))
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
temp_r = a_ptr[0] * x0 - a_ptr[1] * x1;
|
||||
temp_i = a_ptr[0] * x1 + a_ptr[1] * x0;
|
||||
temp_r += a_ptr[2] * x2 - a_ptr[3] * x3;
|
||||
temp_i += a_ptr[2] * x3 + a_ptr[3] * x2;
|
||||
a_ptr += lda;
|
||||
temp_r1 = a_ptr[0] * x0 - a_ptr[1] * x1;
|
||||
temp_i1 = a_ptr[0] * x1 + a_ptr[1] * x0;
|
||||
temp_r1 += a_ptr[2] * x2 - a_ptr[3] * x3;
|
||||
temp_i1 += a_ptr[2] * x3 + a_ptr[3] * x2;
|
||||
#else
|
||||
|
||||
temp_r = a_ptr[0] * x0 + a_ptr[1] * x1;
|
||||
temp_i = a_ptr[0] * x1 - a_ptr[1] * x0;
|
||||
temp_r += a_ptr[2] * x2 + a_ptr[3] * x3;
|
||||
temp_i += a_ptr[2] * x3 - a_ptr[3] * x2;
|
||||
a_ptr += lda;
|
||||
temp_r1 = a_ptr[0] * x0 + a_ptr[1] * x1;
|
||||
temp_i1 = a_ptr[0] * x1 - a_ptr[1] * x0;
|
||||
temp_r1 += a_ptr[2] * x2 + a_ptr[3] * x3;
|
||||
temp_i1 += a_ptr[2] * x3 - a_ptr[3] * x2;
|
||||
#endif
|
||||
|
||||
#if !defined(XCONJ)
|
||||
y_ptr[0] += ar * temp_r - ai * temp_i;
|
||||
y_ptr[1] += ar * temp_i + ai * temp_r;
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += ar * temp_r1 - ai * temp_i1;
|
||||
y_ptr[1] += ar * temp_i1 + ai * temp_r1;
|
||||
#else
|
||||
y_ptr[0] += ar * temp_r + ai * temp_i;
|
||||
y_ptr[1] -= ar * temp_i - ai * temp_r;
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += ar * temp_r1 + ai * temp_i1;
|
||||
y_ptr[1] -= ar * temp_i1 - ai * temp_r1;
|
||||
#endif
|
||||
|
||||
a_ptr += lda;
|
||||
y_ptr += inc_y;
|
||||
j+=2;
|
||||
}
|
||||
|
||||
|
||||
while ( j < n)
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
temp_r = a_ptr[0] * x0 - a_ptr[1] * x1;
|
||||
temp_i = a_ptr[0] * x1 + a_ptr[1] * x0;
|
||||
temp_r += a_ptr[2] * x2 - a_ptr[3] * x3;
|
||||
temp_i += a_ptr[2] * x3 + a_ptr[3] * x2;
|
||||
#else
|
||||
|
||||
temp_r = a_ptr[0] * x0 + a_ptr[1] * x1;
|
||||
temp_i = a_ptr[0] * x1 - a_ptr[1] * x0;
|
||||
temp_r += a_ptr[2] * x2 + a_ptr[3] * x3;
|
||||
temp_i += a_ptr[2] * x3 - a_ptr[3] * x2;
|
||||
#endif
|
||||
|
||||
#if !defined(XCONJ)
|
||||
y_ptr[0] += ar * temp_r - ai * temp_i;
|
||||
y_ptr[1] += ar * temp_i + ai * temp_r;
|
||||
#else
|
||||
y_ptr[0] += ar * temp_r + ai * temp_i;
|
||||
y_ptr[1] -= ar * temp_i - ai * temp_r;
|
||||
#endif
|
||||
|
||||
a_ptr += lda;
|
||||
y_ptr += inc_y;
|
||||
j++;
|
||||
}
|
||||
|
||||
return(0);
|
||||
}
|
||||
|
||||
|
||||
if ( m3 == 1 )
|
||||
{
|
||||
|
||||
FLOAT temp_r ;
|
||||
FLOAT temp_i ;
|
||||
FLOAT temp_r1 ;
|
||||
FLOAT temp_i1 ;
|
||||
FLOAT x0 = x_ptr[0];
|
||||
FLOAT x1 = x_ptr[1];
|
||||
FLOAT ar = alpha[0];
|
||||
FLOAT ai = alpha[1];
|
||||
|
||||
while ( j < ( n & -2 ))
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
temp_r = a_ptr[0] * x0 - a_ptr[1] * x1;
|
||||
temp_i = a_ptr[0] * x1 + a_ptr[1] * x0;
|
||||
a_ptr += lda;
|
||||
temp_r1 = a_ptr[0] * x0 - a_ptr[1] * x1;
|
||||
temp_i1 = a_ptr[0] * x1 + a_ptr[1] * x0;
|
||||
#else
|
||||
|
||||
temp_r = a_ptr[0] * x0 + a_ptr[1] * x1;
|
||||
temp_i = a_ptr[0] * x1 - a_ptr[1] * x0;
|
||||
a_ptr += lda;
|
||||
temp_r1 = a_ptr[0] * x0 + a_ptr[1] * x1;
|
||||
temp_i1 = a_ptr[0] * x1 - a_ptr[1] * x0;
|
||||
#endif
|
||||
|
||||
#if !defined(XCONJ)
|
||||
y_ptr[0] += ar * temp_r - ai * temp_i;
|
||||
y_ptr[1] += ar * temp_i + ai * temp_r;
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += ar * temp_r1 - ai * temp_i1;
|
||||
y_ptr[1] += ar * temp_i1 + ai * temp_r1;
|
||||
#else
|
||||
y_ptr[0] += ar * temp_r + ai * temp_i;
|
||||
y_ptr[1] -= ar * temp_i - ai * temp_r;
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += ar * temp_r1 + ai * temp_i1;
|
||||
y_ptr[1] -= ar * temp_i1 - ai * temp_r1;
|
||||
#endif
|
||||
|
||||
a_ptr += lda;
|
||||
y_ptr += inc_y;
|
||||
j+=2;
|
||||
}
|
||||
|
||||
while ( j < n)
|
||||
{
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
temp_r = a_ptr[0] * x0 - a_ptr[1] * x1;
|
||||
temp_i = a_ptr[0] * x1 + a_ptr[1] * x0;
|
||||
#else
|
||||
|
||||
temp_r = a_ptr[0] * x0 + a_ptr[1] * x1;
|
||||
temp_i = a_ptr[0] * x1 - a_ptr[1] * x0;
|
||||
#endif
|
||||
|
||||
#if !defined(XCONJ)
|
||||
y_ptr[0] += ar * temp_r - ai * temp_i;
|
||||
y_ptr[1] += ar * temp_i + ai * temp_r;
|
||||
#else
|
||||
y_ptr[0] += ar * temp_r + ai * temp_i;
|
||||
y_ptr[1] -= ar * temp_i - ai * temp_r;
|
||||
#endif
|
||||
|
||||
a_ptr += lda;
|
||||
y_ptr += inc_y;
|
||||
j++;
|
||||
}
|
||||
return(0);
|
||||
}
|
||||
|
||||
return(0);
|
||||
|
||||
|
||||
}
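
The row blocking in the driver above is compact, so a small sketch may help when reasoning about which block sizes the `while ( NB == NBMAX )` loop visits. The function below replays only the index arithmetic (`m1`, `m2`, `m3`, `NB`) for a given m. The names `row_blocks` and `NBMAX_SKETCH` are assumptions made for illustration, and no kernel is called.

```c
#include <assert.h>
#include <stddef.h>

#define NBMAX_SKETCH 2048   /* mirrors NBMAX; the mask trick needs a power of two */

/* Stores the row-block sizes the driver loop would process for m rows in
 * blocks[] (at most cap entries are written) and returns the number of
 * blocks. *tail receives the 0..3 leftover rows handled after the loop. */
static size_t row_blocks(long m, long *blocks, size_t cap, long *tail)
{
    long m3 = m & 3;                              /* rows left for the scalar tail */
    long m1 = m - m3;                             /* rows covered by the kernels   */
    long m2 = (m & (NBMAX_SKETCH - 1)) - m3;      /* size of the last short block  */
    long nb = NBMAX_SKETCH;
    size_t count = 0;

    assert(m >= 0 && blocks != NULL && tail != NULL);
    while (nb == NBMAX_SKETCH) {
        m1 -= nb;
        if (m1 < 0) {
            if (m2 == 0) break;                   /* nothing left for the kernels  */
            nb = m2;                              /* final, shorter block          */
        }
        if (count < cap) blocks[count] = nb;
        count++;
    }
    *tail = m3;
    return count;
}
```

For example, under these assumptions m = 5000 yields blocks of 2048, 2048 and 904 rows with no tail, while m = 2051 yields one block of 2048 rows and a tail of 3.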
|
||||
|
||||
|
||||
|
|
@ -1,171 +0,0 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#define HAVE_KERNEL_16x4 1
|
||||
static void cgemv_kernel_16x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y) __attribute__ ((noinline));
|
||||
|
||||
static void cgemv_kernel_16x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"vzeroupper \n\t"
|
||||
|
||||
"vxorps %%ymm8 , %%ymm8 , %%ymm8 \n\t" // temp
|
||||
"vxorps %%ymm9 , %%ymm9 , %%ymm9 \n\t" // temp
|
||||
"vxorps %%ymm10, %%ymm10, %%ymm10 \n\t" // temp
|
||||
"vxorps %%ymm11, %%ymm11, %%ymm11 \n\t" // temp
|
||||
"vxorps %%ymm12, %%ymm12, %%ymm12 \n\t" // temp
|
||||
"vxorps %%ymm13, %%ymm13, %%ymm13 \n\t"
|
||||
"vxorps %%ymm14, %%ymm14, %%ymm14 \n\t"
|
||||
"vxorps %%ymm15, %%ymm15, %%ymm15 \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"prefetcht0 192(%4,%0,4) \n\t"
|
||||
"vmovups (%4,%0,4), %%ymm4 \n\t" // 4 complex values from a0
|
||||
"prefetcht0 192(%5,%0,4) \n\t"
|
||||
"vmovups (%5,%0,4), %%ymm5 \n\t" // 4 complex values from a1
|
||||
|
||||
"prefetcht0 192(%2,%0,4) \n\t"
|
||||
"vmovups (%2,%0,4) , %%ymm6 \n\t" // 4 complex values from x
|
||||
"vpermilps $0xb1, %%ymm6, %%ymm7 \n\t" // exchange real and imag parts
|
||||
"vblendps $0x55, %%ymm6, %%ymm7, %%ymm0 \n\t" // only the real parts
|
||||
"vblendps $0x55, %%ymm7, %%ymm6, %%ymm1 \n\t" // only the imag parts
|
||||
|
||||
"prefetcht0 192(%6,%0,4) \n\t"
|
||||
"vmovups (%6,%0,4), %%ymm6 \n\t" // 4 complex values from a2
|
||||
"prefetcht0 192(%7,%0,4) \n\t"
|
||||
"vmovups (%7,%0,4), %%ymm7 \n\t" // 4 complex values from a3
|
||||
|
||||
"vfmadd231ps %%ymm4 , %%ymm0, %%ymm8 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm4 , %%ymm1, %%ymm9 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
"vfmadd231ps %%ymm5 , %%ymm0, %%ymm10 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm5 , %%ymm1, %%ymm11 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
"vfmadd231ps %%ymm6 , %%ymm0, %%ymm12 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm6 , %%ymm1, %%ymm13 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
"vfmadd231ps %%ymm7 , %%ymm0, %%ymm14 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm7 , %%ymm1, %%ymm15 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
|
||||
"vmovups 32(%4,%0,4), %%ymm4 \n\t" // 4 complex values from a0
|
||||
"vmovups 32(%5,%0,4), %%ymm5 \n\t" // 4 complex values from a1
|
||||
|
||||
"vmovups 32(%2,%0,4) , %%ymm6 \n\t" // 4 complex values from x
|
||||
"vpermilps $0xb1, %%ymm6, %%ymm7 \n\t" // exchange real and imag parts
|
||||
"vblendps $0x55, %%ymm6, %%ymm7, %%ymm0 \n\t" // only the real parts
|
||||
"vblendps $0x55, %%ymm7, %%ymm6, %%ymm1 \n\t" // only the imag parts
|
||||
|
||||
"vmovups 32(%6,%0,4), %%ymm6 \n\t" // 4 complex values from a2
|
||||
"vmovups 32(%7,%0,4), %%ymm7 \n\t" // 4 complex values from a3
|
||||
|
||||
"vfmadd231ps %%ymm4 , %%ymm0, %%ymm8 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm4 , %%ymm1, %%ymm9 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
"vfmadd231ps %%ymm5 , %%ymm0, %%ymm10 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm5 , %%ymm1, %%ymm11 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
"vfmadd231ps %%ymm6 , %%ymm0, %%ymm12 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm6 , %%ymm1, %%ymm13 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
"vfmadd231ps %%ymm7 , %%ymm0, %%ymm14 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm7 , %%ymm1, %%ymm15 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
|
||||
"addq $16 , %0 \n\t"
|
||||
"subq $8 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
"vpermilps $0xb1 , %%ymm9 , %%ymm9 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm11, %%ymm11 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm13, %%ymm13 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm15, %%ymm15 \n\t"
|
||||
"vaddsubps %%ymm9 , %%ymm8, %%ymm8 \n\t"
|
||||
"vaddsubps %%ymm11, %%ymm10, %%ymm10 \n\t"
|
||||
"vaddsubps %%ymm13, %%ymm12, %%ymm12 \n\t"
|
||||
"vaddsubps %%ymm15, %%ymm14, %%ymm14 \n\t"
|
||||
#else
|
||||
"vpermilps $0xb1 , %%ymm8 , %%ymm8 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm10, %%ymm10 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm12, %%ymm12 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm14, %%ymm14 \n\t"
|
||||
"vaddsubps %%ymm8 , %%ymm9 , %%ymm8 \n\t"
|
||||
"vaddsubps %%ymm10, %%ymm11, %%ymm10 \n\t"
|
||||
"vaddsubps %%ymm12, %%ymm13, %%ymm12 \n\t"
|
||||
"vaddsubps %%ymm14, %%ymm15, %%ymm14 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm8 , %%ymm8 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm10, %%ymm10 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm12, %%ymm12 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm14, %%ymm14 \n\t"
|
||||
#endif
|
||||
|
||||
"vextractf128 $1, %%ymm8 , %%xmm9 \n\t"
|
||||
"vextractf128 $1, %%ymm10, %%xmm11 \n\t"
|
||||
"vextractf128 $1, %%ymm12, %%xmm13 \n\t"
|
||||
"vextractf128 $1, %%ymm14, %%xmm15 \n\t"
|
||||
|
||||
"vaddps %%xmm8 , %%xmm9 , %%xmm8 \n\t"
|
||||
"vaddps %%xmm10, %%xmm11, %%xmm10 \n\t"
|
||||
"vaddps %%xmm12, %%xmm13, %%xmm12 \n\t"
|
||||
"vaddps %%xmm14, %%xmm15, %%xmm14 \n\t"
|
||||
|
||||
"vshufpd $0x1, %%xmm8 , %%xmm8 , %%xmm9 \n\t"
|
||||
"vshufpd $0x1, %%xmm10, %%xmm10, %%xmm11 \n\t"
|
||||
"vshufpd $0x1, %%xmm12, %%xmm12, %%xmm13 \n\t"
|
||||
"vshufpd $0x1, %%xmm14, %%xmm14, %%xmm15 \n\t"
|
||||
|
||||
"vaddps %%xmm8 , %%xmm9 , %%xmm8 \n\t"
|
||||
"vaddps %%xmm10, %%xmm11, %%xmm10 \n\t"
|
||||
"vaddps %%xmm12, %%xmm13, %%xmm12 \n\t"
|
||||
"vaddps %%xmm14, %%xmm15, %%xmm14 \n\t"
|
||||
|
||||
"vmovsd %%xmm8 , (%3) \n\t"
|
||||
"vmovsd %%xmm10, 8(%3) \n\t"
|
||||
"vmovsd %%xmm12, 16(%3) \n\t"
|
||||
"vmovsd %%xmm14, 24(%3) \n\t"
|
||||
|
||||
"vzeroupper \n\t"
|
||||
|
||||
: // i and n are advanced inside the asm, so they are read-write operands
"+r" (i), // 0
"+r" (n) // 1
:
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap[0]), // 4
|
||||
"r" (ap[1]), // 5
|
||||
"r" (ap[2]), // 6
|
||||
"r" (ap[3]) // 7
|
||||
: "cc",
|
||||
"%xmm0", "%xmm1", "%xmm2", "%xmm3",
|
||||
"%xmm4", "%xmm5", "%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
|
@@ -0,0 +1,539 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#define HAVE_KERNEL_4x4 1
|
||||
static void cgemv_kernel_4x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, FLOAT *alpha) __attribute__ ((noinline));
|
||||
|
||||
static void cgemv_kernel_4x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, FLOAT *alpha)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"vzeroupper \n\t"
|
||||
|
||||
"vxorps %%ymm8 , %%ymm8 , %%ymm8 \n\t" // temp
|
||||
"vxorps %%ymm9 , %%ymm9 , %%ymm9 \n\t" // temp
|
||||
"vxorps %%ymm10, %%ymm10, %%ymm10 \n\t" // temp
|
||||
"vxorps %%ymm11, %%ymm11, %%ymm11 \n\t" // temp
|
||||
"vxorps %%ymm12, %%ymm12, %%ymm12 \n\t" // temp
|
||||
"vxorps %%ymm13, %%ymm13, %%ymm13 \n\t"
|
||||
"vxorps %%ymm14, %%ymm14, %%ymm14 \n\t"
|
||||
"vxorps %%ymm15, %%ymm15, %%ymm15 \n\t"
|
||||
|
||||
"testq $0x04, %1 \n\t"
|
||||
"jz .L08LABEL%= \n\t"
|
||||
|
||||
"vmovups (%4,%0,4), %%ymm4 \n\t" // 4 complex values from a0
|
||||
"vmovups (%5,%0,4), %%ymm5 \n\t" // 4 complex values from a1
|
||||
|
||||
"vmovups (%2,%0,4) , %%ymm6 \n\t" // 4 complex values from x
|
||||
"vpermilps $0xb1, %%ymm6, %%ymm7 \n\t" // exchange real and imag parts
|
||||
"vblendps $0x55, %%ymm6, %%ymm7, %%ymm0 \n\t" // only the real parts
|
||||
"vblendps $0x55, %%ymm7, %%ymm6, %%ymm1 \n\t" // only the imag parts
|
||||
|
||||
"vmovups (%6,%0,4), %%ymm6 \n\t" // 4 complex values from a2
|
||||
"vmovups (%7,%0,4), %%ymm7 \n\t" // 4 complex values from a3
|
||||
|
||||
"vfmadd231ps %%ymm4 , %%ymm0, %%ymm8 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm4 , %%ymm1, %%ymm9 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
"vfmadd231ps %%ymm5 , %%ymm0, %%ymm10 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm5 , %%ymm1, %%ymm11 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
"vfmadd231ps %%ymm6 , %%ymm0, %%ymm12 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm6 , %%ymm1, %%ymm13 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
"vfmadd231ps %%ymm7 , %%ymm0, %%ymm14 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm7 , %%ymm1, %%ymm15 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
|
||||
"addq $8 , %0 \n\t"
|
||||
"subq $4 , %1 \n\t"
|
||||
|
||||
".L08LABEL%=: \n\t"
|
||||
"cmpq $0, %1 \n\t"
|
||||
"je .L08END%= \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"prefetcht0 192(%4,%0,4) \n\t"
|
||||
"vmovups (%4,%0,4), %%ymm4 \n\t" // 4 complex values from a0
|
||||
"prefetcht0 192(%5,%0,4) \n\t"
|
||||
"vmovups (%5,%0,4), %%ymm5 \n\t" // 4 complex values from a1
|
||||
|
||||
"prefetcht0 192(%2,%0,4) \n\t"
|
||||
"vmovups (%2,%0,4) , %%ymm6 \n\t" // 4 complex values from x
|
||||
"vpermilps $0xb1, %%ymm6, %%ymm7 \n\t" // exchange real and imag parts
|
||||
"vblendps $0x55, %%ymm6, %%ymm7, %%ymm0 \n\t" // only the real parts
|
||||
"vblendps $0x55, %%ymm7, %%ymm6, %%ymm1 \n\t" // only the imag parts
|
||||
|
||||
"prefetcht0 192(%6,%0,4) \n\t"
|
||||
"vmovups (%6,%0,4), %%ymm6 \n\t" // 4 complex values from a2
|
||||
"prefetcht0 192(%7,%0,4) \n\t"
|
||||
"vmovups (%7,%0,4), %%ymm7 \n\t" // 4 complex values from a3
|
||||
|
||||
"vfmadd231ps %%ymm4 , %%ymm0, %%ymm8 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm4 , %%ymm1, %%ymm9 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
"vfmadd231ps %%ymm5 , %%ymm0, %%ymm10 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm5 , %%ymm1, %%ymm11 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
"vfmadd231ps %%ymm6 , %%ymm0, %%ymm12 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm6 , %%ymm1, %%ymm13 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
"vfmadd231ps %%ymm7 , %%ymm0, %%ymm14 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm7 , %%ymm1, %%ymm15 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
|
||||
"vmovups 32(%4,%0,4), %%ymm4 \n\t" // 4 complex values from a0
|
||||
"vmovups 32(%5,%0,4), %%ymm5 \n\t" // 4 complex values from a1
|
||||
|
||||
"vmovups 32(%2,%0,4) , %%ymm6 \n\t" // 4 complex values from x
|
||||
"vpermilps $0xb1, %%ymm6, %%ymm7 \n\t" // exchange real and imag parts
|
||||
"vblendps $0x55, %%ymm6, %%ymm7, %%ymm0 \n\t" // only the real parts
|
||||
"vblendps $0x55, %%ymm7, %%ymm6, %%ymm1 \n\t" // only the imag parts
|
||||
|
||||
"vmovups 32(%6,%0,4), %%ymm6 \n\t" // 4 complex values from a2
|
||||
"vmovups 32(%7,%0,4), %%ymm7 \n\t" // 4 complex values from a3
|
||||
|
||||
"vfmadd231ps %%ymm4 , %%ymm0, %%ymm8 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm4 , %%ymm1, %%ymm9 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
"vfmadd231ps %%ymm5 , %%ymm0, %%ymm10 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm5 , %%ymm1, %%ymm11 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
"vfmadd231ps %%ymm6 , %%ymm0, %%ymm12 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm6 , %%ymm1, %%ymm13 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
"vfmadd231ps %%ymm7 , %%ymm0, %%ymm14 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm7 , %%ymm1, %%ymm15 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
|
||||
"addq $16 , %0 \n\t"
|
||||
"subq $8 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
".L08END%=: \n\t"
|
||||
|
||||
"vbroadcastss (%8) , %%xmm0 \n\t" // value from alpha
|
||||
"vbroadcastss 4(%8) , %%xmm1 \n\t" // value from alpha
|
||||
|
||||
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
"vpermilps $0xb1 , %%ymm9 , %%ymm9 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm11, %%ymm11 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm13, %%ymm13 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm15, %%ymm15 \n\t"
|
||||
"vaddsubps %%ymm9 , %%ymm8, %%ymm8 \n\t"
|
||||
"vaddsubps %%ymm11, %%ymm10, %%ymm10 \n\t"
|
||||
"vaddsubps %%ymm13, %%ymm12, %%ymm12 \n\t"
|
||||
"vaddsubps %%ymm15, %%ymm14, %%ymm14 \n\t"
|
||||
#else
|
||||
"vpermilps $0xb1 , %%ymm8 , %%ymm8 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm10, %%ymm10 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm12, %%ymm12 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm14, %%ymm14 \n\t"
|
||||
"vaddsubps %%ymm8 , %%ymm9 , %%ymm8 \n\t"
|
||||
"vaddsubps %%ymm10, %%ymm11, %%ymm10 \n\t"
|
||||
"vaddsubps %%ymm12, %%ymm13, %%ymm12 \n\t"
|
||||
"vaddsubps %%ymm14, %%ymm15, %%ymm14 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm8 , %%ymm8 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm10, %%ymm10 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm12, %%ymm12 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm14, %%ymm14 \n\t"
|
||||
#endif
|
||||
|
||||
"vmovsd (%3), %%xmm4 \n\t" // read y
|
||||
"vmovsd 8(%3), %%xmm5 \n\t"
|
||||
"vmovsd 16(%3), %%xmm6 \n\t"
|
||||
"vmovsd 24(%3), %%xmm7 \n\t"
|
||||
|
||||
"vextractf128 $1, %%ymm8 , %%xmm9 \n\t"
|
||||
"vextractf128 $1, %%ymm10, %%xmm11 \n\t"
|
||||
"vextractf128 $1, %%ymm12, %%xmm13 \n\t"
|
||||
"vextractf128 $1, %%ymm14, %%xmm15 \n\t"
|
||||
|
||||
"vaddps %%xmm8 , %%xmm9 , %%xmm8 \n\t"
|
||||
"vaddps %%xmm10, %%xmm11, %%xmm10 \n\t"
|
||||
"vaddps %%xmm12, %%xmm13, %%xmm12 \n\t"
|
||||
"vaddps %%xmm14, %%xmm15, %%xmm14 \n\t"
|
||||
|
||||
"vshufpd $0x1, %%xmm8 , %%xmm8 , %%xmm9 \n\t"
|
||||
"vshufpd $0x1, %%xmm10, %%xmm10, %%xmm11 \n\t"
|
||||
"vshufpd $0x1, %%xmm12, %%xmm12, %%xmm13 \n\t"
|
||||
"vshufpd $0x1, %%xmm14, %%xmm14, %%xmm15 \n\t"
|
||||
|
||||
"vaddps %%xmm8 , %%xmm9 , %%xmm8 \n\t"
|
||||
"vaddps %%xmm10, %%xmm11, %%xmm10 \n\t"
|
||||
"vaddps %%xmm12, %%xmm13, %%xmm12 \n\t"
|
||||
"vaddps %%xmm14, %%xmm15, %%xmm14 \n\t"
|
||||
|
||||
|
||||
"vmulps %%xmm8 , %%xmm1 , %%xmm9 \n\t" // t_r * alpha_i , t_i * alpha_i
|
||||
"vmulps %%xmm8 , %%xmm0 , %%xmm8 \n\t" // t_r * alpha_r , t_i * alpha_r
|
||||
"vmulps %%xmm10, %%xmm1 , %%xmm11 \n\t" // t_r * alpha_i , t_i * alpha_i
|
||||
"vmulps %%xmm10, %%xmm0 , %%xmm10 \n\t" // t_r * alpha_r , t_i * alpha_r
|
||||
"vmulps %%xmm12, %%xmm1 , %%xmm13 \n\t" // t_r * alpha_i , t_i * alpha_i
|
||||
"vmulps %%xmm12, %%xmm0 , %%xmm12 \n\t" // t_r * alpha_r , t_i * alpha_r
|
||||
"vmulps %%xmm14, %%xmm1 , %%xmm15 \n\t" // t_r * alpha_i , t_i * alpha_i
|
||||
"vmulps %%xmm14, %%xmm0 , %%xmm14 \n\t" // t_r * alpha_r , t_i * alpha_r
|
||||
|
||||
#if !defined(XCONJ)
|
||||
"vpermilps $0xb1 , %%xmm9 , %%xmm9 \n\t"
|
||||
"vpermilps $0xb1 , %%xmm11, %%xmm11 \n\t"
|
||||
"vpermilps $0xb1 , %%xmm13, %%xmm13 \n\t"
|
||||
"vpermilps $0xb1 , %%xmm15, %%xmm15 \n\t"
|
||||
"vaddsubps %%xmm9 , %%xmm8, %%xmm8 \n\t"
|
||||
"vaddsubps %%xmm11, %%xmm10, %%xmm10 \n\t"
|
||||
"vaddsubps %%xmm13, %%xmm12, %%xmm12 \n\t"
|
||||
"vaddsubps %%xmm15, %%xmm14, %%xmm14 \n\t"
|
||||
#else
|
||||
"vpermilps $0xb1 , %%xmm8 , %%xmm8 \n\t"
|
||||
"vpermilps $0xb1 , %%xmm10, %%xmm10 \n\t"
|
||||
"vpermilps $0xb1 , %%xmm12, %%xmm12 \n\t"
|
||||
"vpermilps $0xb1 , %%xmm14, %%xmm14 \n\t"
|
||||
"vaddsubps %%xmm8 , %%xmm9 , %%xmm8 \n\t"
|
||||
"vaddsubps %%xmm10, %%xmm11, %%xmm10 \n\t"
|
||||
"vaddsubps %%xmm12, %%xmm13, %%xmm12 \n\t"
|
||||
"vaddsubps %%xmm14, %%xmm15, %%xmm14 \n\t"
|
||||
"vpermilps $0xb1 , %%xmm8 , %%xmm8 \n\t"
|
||||
"vpermilps $0xb1 , %%xmm10, %%xmm10 \n\t"
|
||||
"vpermilps $0xb1 , %%xmm12, %%xmm12 \n\t"
|
||||
"vpermilps $0xb1 , %%xmm14, %%xmm14 \n\t"
|
||||
#endif
|
||||
|
||||
|
||||
"vaddps %%xmm8 , %%xmm4 , %%xmm8 \n\t"
|
||||
"vaddps %%xmm10, %%xmm5 , %%xmm10 \n\t"
|
||||
"vaddps %%xmm12, %%xmm6 , %%xmm12 \n\t"
|
||||
"vaddps %%xmm14, %%xmm7 , %%xmm14 \n\t"
|
||||
|
||||
"vmovsd %%xmm8 , (%3) \n\t"
|
||||
"vmovsd %%xmm10, 8(%3) \n\t"
|
||||
"vmovsd %%xmm12, 16(%3) \n\t"
|
||||
"vmovsd %%xmm14, 24(%3) \n\t"
|
||||
|
||||
"vzeroupper \n\t"
|
||||
|
||||
: // i and n are advanced inside the asm, so they are read-write operands
"+r" (i), // 0
"+r" (n) // 1
:
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap[0]), // 4
|
||||
"r" (ap[1]), // 5
|
||||
"r" (ap[2]), // 6
|
||||
"r" (ap[3]), // 7
|
||||
"r" (alpha) // 8
|
||||
: "cc",
|
||||
"%xmm0", "%xmm1", "%xmm2", "%xmm3",
|
||||
"%xmm4", "%xmm5", "%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
#define HAVE_KERNEL_4x2 1
|
||||
static void cgemv_kernel_4x2( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, FLOAT *alpha) __attribute__ ((noinline));
|
||||
|
||||
static void cgemv_kernel_4x2( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, FLOAT *alpha)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"vzeroupper \n\t"
|
||||
|
||||
"vxorps %%ymm8 , %%ymm8 , %%ymm8 \n\t" // temp
|
||||
"vxorps %%ymm9 , %%ymm9 , %%ymm9 \n\t" // temp
|
||||
"vxorps %%ymm10, %%ymm10, %%ymm10 \n\t" // temp
|
||||
"vxorps %%ymm11, %%ymm11, %%ymm11 \n\t" // temp
|
||||
|
||||
"testq $0x04, %1 \n\t"
|
||||
"jz .L08LABEL%= \n\t"
|
||||
|
||||
"vmovups (%4,%0,4), %%ymm4 \n\t" // 4 complex values from a0
|
||||
"vmovups (%5,%0,4), %%ymm5 \n\t" // 4 complex values from a1
|
||||
|
||||
"vmovups (%2,%0,4) , %%ymm6 \n\t" // 4 complex values from x
|
||||
"vpermilps $0xb1, %%ymm6, %%ymm7 \n\t" // exchange real and imag parts
|
||||
"vblendps $0x55, %%ymm6, %%ymm7, %%ymm0 \n\t" // only the real parts
|
||||
"vblendps $0x55, %%ymm7, %%ymm6, %%ymm1 \n\t" // only the imag parts
|
||||
|
||||
|
||||
"vfmadd231ps %%ymm4 , %%ymm0, %%ymm8 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm4 , %%ymm1, %%ymm9 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
"vfmadd231ps %%ymm5 , %%ymm0, %%ymm10 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm5 , %%ymm1, %%ymm11 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
|
||||
"addq $8 , %0 \n\t"
|
||||
"subq $4 , %1 \n\t"
|
||||
|
||||
".L08LABEL%=: \n\t"
|
||||
"cmpq $0, %1 \n\t"
|
||||
"je .L08END%= \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"prefetcht0 192(%4,%0,4) \n\t"
|
||||
"vmovups (%4,%0,4), %%ymm4 \n\t" // 4 complex values from a0
|
||||
"prefetcht0 192(%5,%0,4) \n\t"
|
||||
"vmovups (%5,%0,4), %%ymm5 \n\t" // 4 complex values from a1
|
||||
|
||||
"prefetcht0 192(%2,%0,4) \n\t"
|
||||
"vmovups (%2,%0,4) , %%ymm6 \n\t" // 4 complex values from x
|
||||
"vpermilps $0xb1, %%ymm6, %%ymm7 \n\t" // exchange real and imag parts
|
||||
"vblendps $0x55, %%ymm6, %%ymm7, %%ymm0 \n\t" // only the real parts
|
||||
"vblendps $0x55, %%ymm7, %%ymm6, %%ymm1 \n\t" // only the imag parts
|
||||
|
||||
"vfmadd231ps %%ymm4 , %%ymm0, %%ymm8 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm4 , %%ymm1, %%ymm9 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
"vfmadd231ps %%ymm5 , %%ymm0, %%ymm10 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm5 , %%ymm1, %%ymm11 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
|
||||
"vmovups 32(%4,%0,4), %%ymm4 \n\t" // 4 complex values from a0
|
||||
"vmovups 32(%5,%0,4), %%ymm5 \n\t" // 4 complex values from a1
|
||||
|
||||
"vmovups 32(%2,%0,4) , %%ymm6 \n\t" // 4 complex values from x
|
||||
"vpermilps $0xb1, %%ymm6, %%ymm7 \n\t" // exchange real and imag parts
|
||||
"vblendps $0x55, %%ymm6, %%ymm7, %%ymm0 \n\t" // only the real parts
|
||||
"vblendps $0x55, %%ymm7, %%ymm6, %%ymm1 \n\t" // only the imag parts
|
||||
|
||||
"vfmadd231ps %%ymm4 , %%ymm0, %%ymm8 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm4 , %%ymm1, %%ymm9 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
"vfmadd231ps %%ymm5 , %%ymm0, %%ymm10 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm5 , %%ymm1, %%ymm11 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
|
||||
"addq $16 , %0 \n\t"
|
||||
"subq $8 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
".L08END%=: \n\t"
|
||||
|
||||
"vbroadcastss (%6) , %%xmm0 \n\t" // value from alpha
|
||||
"vbroadcastss 4(%6) , %%xmm1 \n\t" // value from alpha
|
||||
|
||||
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
"vpermilps $0xb1 , %%ymm9 , %%ymm9 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm11, %%ymm11 \n\t"
|
||||
"vaddsubps %%ymm9 , %%ymm8, %%ymm8 \n\t"
|
||||
"vaddsubps %%ymm11, %%ymm10, %%ymm10 \n\t"
|
||||
#else
|
||||
"vpermilps $0xb1 , %%ymm8 , %%ymm8 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm10, %%ymm10 \n\t"
|
||||
"vaddsubps %%ymm8 , %%ymm9 , %%ymm8 \n\t"
|
||||
"vaddsubps %%ymm10, %%ymm11, %%ymm10 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm8 , %%ymm8 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm10, %%ymm10 \n\t"
|
||||
#endif
|
||||
|
||||
"vmovsd (%3), %%xmm4 \n\t" // read y
|
||||
"vmovsd 8(%3), %%xmm5 \n\t"
|
||||
|
||||
"vextractf128 $1, %%ymm8 , %%xmm9 \n\t"
|
||||
"vextractf128 $1, %%ymm10, %%xmm11 \n\t"
|
||||
|
||||
"vaddps %%xmm8 , %%xmm9 , %%xmm8 \n\t"
|
||||
"vaddps %%xmm10, %%xmm11, %%xmm10 \n\t"
|
||||
|
||||
"vshufpd $0x1, %%xmm8 , %%xmm8 , %%xmm9 \n\t"
|
||||
"vshufpd $0x1, %%xmm10, %%xmm10, %%xmm11 \n\t"
|
||||
|
||||
"vaddps %%xmm8 , %%xmm9 , %%xmm8 \n\t"
|
||||
"vaddps %%xmm10, %%xmm11, %%xmm10 \n\t"
|
||||
|
||||
"vmulps %%xmm8 , %%xmm1 , %%xmm9 \n\t" // t_r * alpha_i , t_i * alpha_i
|
||||
"vmulps %%xmm8 , %%xmm0 , %%xmm8 \n\t" // t_r * alpha_r , t_i * alpha_r
|
||||
"vmulps %%xmm10, %%xmm1 , %%xmm11 \n\t" // t_r * alpha_i , t_i * alpha_i
|
||||
"vmulps %%xmm10, %%xmm0 , %%xmm10 \n\t" // t_r * alpha_r , t_i * alpha_r
|
||||
|
||||
#if !defined(XCONJ)
|
||||
"vpermilps $0xb1 , %%xmm9 , %%xmm9 \n\t"
|
||||
"vpermilps $0xb1 , %%xmm11, %%xmm11 \n\t"
|
||||
"vaddsubps %%xmm9 , %%xmm8, %%xmm8 \n\t"
|
||||
"vaddsubps %%xmm11, %%xmm10, %%xmm10 \n\t"
|
||||
#else
|
||||
"vpermilps $0xb1 , %%xmm8 , %%xmm8 \n\t"
|
||||
"vpermilps $0xb1 , %%xmm10, %%xmm10 \n\t"
|
||||
"vaddsubps %%xmm8 , %%xmm9 , %%xmm8 \n\t"
|
||||
"vaddsubps %%xmm10, %%xmm11, %%xmm10 \n\t"
|
||||
"vpermilps $0xb1 , %%xmm8 , %%xmm8 \n\t"
|
||||
"vpermilps $0xb1 , %%xmm10, %%xmm10 \n\t"
|
||||
#endif
|
||||
|
||||
|
||||
"vaddps %%xmm8 , %%xmm4 , %%xmm8 \n\t"
|
||||
"vaddps %%xmm10, %%xmm5 , %%xmm10 \n\t"
|
||||
|
||||
"vmovsd %%xmm8 , (%3) \n\t"
|
||||
"vmovsd %%xmm10, 8(%3) \n\t"
|
||||
|
||||
"vzeroupper \n\t"
|
||||
|
||||
: // i and n are advanced inside the asm, so they are read-write operands
"+r" (i), // 0
"+r" (n) // 1
:
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap[0]), // 4
|
||||
"r" (ap[1]), // 5
|
||||
"r" (alpha) // 6
|
||||
: "cc",
|
||||
"%xmm0", "%xmm1", "%xmm2", "%xmm3",
|
||||
"%xmm4", "%xmm5", "%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
#define HAVE_KERNEL_4x1 1
|
||||
static void cgemv_kernel_4x1( BLASLONG n, FLOAT *ap, FLOAT *x, FLOAT *y, FLOAT *alpha) __attribute__ ((noinline));
|
||||
|
||||
static void cgemv_kernel_4x1( BLASLONG n, FLOAT *ap, FLOAT *x, FLOAT *y, FLOAT *alpha)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"vzeroupper \n\t"
|
||||
|
||||
"vxorps %%ymm8 , %%ymm8 , %%ymm8 \n\t" // temp
|
||||
"vxorps %%ymm9 , %%ymm9 , %%ymm9 \n\t" // temp
|
||||
|
||||
"testq $0x04, %1 \n\t"
|
||||
"jz .L08LABEL%= \n\t"
|
||||
|
||||
"vmovups (%4,%0,4), %%ymm4 \n\t" // 4 complex values from a0
|
||||
|
||||
"vmovups (%2,%0,4) , %%ymm6 \n\t" // 4 complex values from x
|
||||
"vpermilps $0xb1, %%ymm6, %%ymm7 \n\t" // exchange real and imag parts
|
||||
"vblendps $0x55, %%ymm6, %%ymm7, %%ymm0 \n\t" // only the real parts
|
||||
"vblendps $0x55, %%ymm7, %%ymm6, %%ymm1 \n\t" // only the imag parts
|
||||
|
||||
|
||||
"vfmadd231ps %%ymm4 , %%ymm0, %%ymm8 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm4 , %%ymm1, %%ymm9 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
|
||||
"addq $8 , %0 \n\t"
|
||||
"subq $4 , %1 \n\t"
|
||||
|
||||
".L08LABEL%=: \n\t"
|
||||
"cmpq $0, %1 \n\t"
|
||||
"je .L08END%= \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"prefetcht0 192(%4,%0,4) \n\t"
|
||||
"vmovups (%4,%0,4), %%ymm4 \n\t" // 4 complex values from a0
|
||||
|
||||
"prefetcht0 192(%2,%0,4) \n\t"
|
||||
"vmovups (%2,%0,4) , %%ymm6 \n\t" // 4 complex values from x
|
||||
"vpermilps $0xb1, %%ymm6, %%ymm7 \n\t" // exchange real and imag parts
|
||||
"vblendps $0x55, %%ymm6, %%ymm7, %%ymm0 \n\t" // only the real parts
|
||||
"vblendps $0x55, %%ymm7, %%ymm6, %%ymm1 \n\t" // only the imag parts
|
||||
|
||||
"vfmadd231ps %%ymm4 , %%ymm0, %%ymm8 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm4 , %%ymm1, %%ymm9 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
|
||||
"vmovups 32(%4,%0,4), %%ymm4 \n\t" // 4 complex values from a0
|
||||
|
||||
"vmovups 32(%2,%0,4) , %%ymm6 \n\t" // 4 complex values from x
|
||||
"vpermilps $0xb1, %%ymm6, %%ymm7 \n\t" // exchange real and imag parts
|
||||
"vblendps $0x55, %%ymm6, %%ymm7, %%ymm0 \n\t" // only the real parts
|
||||
"vblendps $0x55, %%ymm7, %%ymm6, %%ymm1 \n\t" // only the imag parts
|
||||
|
||||
"vfmadd231ps %%ymm4 , %%ymm0, %%ymm8 \n\t" // ar0*xr0,al0*xr0,ar1*xr1,al1*xr1
|
||||
"vfmadd231ps %%ymm4 , %%ymm1, %%ymm9 \n\t" // ar0*xl0,al0*xl0,ar1*xl1,al1*xl1
|
||||
|
||||
"addq $16 , %0 \n\t"
|
||||
"subq $8 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
".L08END%=: \n\t"
|
||||
|
||||
"vbroadcastss (%5) , %%xmm0 \n\t" // value from alpha
|
||||
"vbroadcastss 4(%5) , %%xmm1 \n\t" // value from alpha
|
||||
|
||||
|
||||
#if ( !defined(CONJ) && !defined(XCONJ) ) || ( defined(CONJ) && defined(XCONJ) )
|
||||
"vpermilps $0xb1 , %%ymm9 , %%ymm9 \n\t"
|
||||
"vaddsubps %%ymm9 , %%ymm8, %%ymm8 \n\t"
|
||||
#else
|
||||
"vpermilps $0xb1 , %%ymm8 , %%ymm8 \n\t"
|
||||
"vaddsubps %%ymm8 , %%ymm9 , %%ymm8 \n\t"
|
||||
"vpermilps $0xb1 , %%ymm8 , %%ymm8 \n\t"
|
||||
#endif
|
||||
|
||||
"vmovsd (%3), %%xmm4 \n\t" // read y
|
||||
|
||||
"vextractf128 $1, %%ymm8 , %%xmm9 \n\t"
|
||||
|
||||
"vaddps %%xmm8 , %%xmm9 , %%xmm8 \n\t"
|
||||
|
||||
"vshufpd $0x1, %%xmm8 , %%xmm8 , %%xmm9 \n\t"
|
||||
|
||||
"vaddps %%xmm8 , %%xmm9 , %%xmm8 \n\t"
|
||||
|
||||
"vmulps %%xmm8 , %%xmm1 , %%xmm9 \n\t" // t_r * alpha_i , t_i * alpha_i
|
||||
"vmulps %%xmm8 , %%xmm0 , %%xmm8 \n\t" // t_r * alpha_r , t_i * alpha_r
|
||||
|
||||
#if !defined(XCONJ)
|
||||
"vpermilps $0xb1 , %%xmm9 , %%xmm9 \n\t"
|
||||
"vaddsubps %%xmm9 , %%xmm8, %%xmm8 \n\t"
|
||||
#else
|
||||
"vpermilps $0xb1 , %%xmm8 , %%xmm8 \n\t"
|
||||
"vaddsubps %%xmm8 , %%xmm9 , %%xmm8 \n\t"
|
||||
"vpermilps $0xb1 , %%xmm8 , %%xmm8 \n\t"
|
||||
#endif
|
||||
|
||||
|
||||
"vaddps %%xmm8 , %%xmm4 , %%xmm8 \n\t"
|
||||
|
||||
"vmovsd %%xmm8 , (%3) \n\t"
|
||||
|
||||
"vzeroupper \n\t"
|
||||
|
||||
: // i and n are advanced inside the asm, so they are read-write operands
"+r" (i), // 0
"+r" (n) // 1
:
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap), // 4
|
||||
"r" (alpha) // 5
|
||||
: "cc",
|
||||
"%xmm0", "%xmm1", "%xmm2", "%xmm3",
|
||||
"%xmm4", "%xmm5", "%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
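Review note: the single-column kernel above can be hard to check by reading the assembly alone. The following is a minimal scalar sketch of what it appears to compute when neither CONJ nor XCONJ is defined: a complex dot product of one column with x, scaled by alpha and added to y[0]. The name `cgemv_ref_4x1` and the use of plain `float`/`long` in place of FLOAT/BLASLONG are assumptions made for illustration; this helper is not part of the commit.

```c
#include <assert.h>

/* Hypothetical scalar reference, not part of the commit.
 * Data is interleaved (re, im) single precision; n counts complex elements.
 * Computes y[0] += alpha * sum_i a[i] * x[i] with no conjugation. */
static void cgemv_ref_4x1(long n, const float *a, const float *x,
                          float *y, const float *alpha)
{
    float tr = 0.0f, ti = 0.0f;   /* running complex sum t = tr + i*ti */

    assert(n >= 0);

    for (long i = 0; i < n; i++) {
        float ar = a[2 * i], ai = a[2 * i + 1];
        float xr = x[2 * i], xi = x[2 * i + 1];
        tr += ar * xr - ai * xi;  /* real part of a[i] * x[i] */
        ti += ar * xi + ai * xr;  /* imaginary part of a[i] * x[i] */
    }

    /* y += alpha * t, again as a complex product */
    y[0] += alpha[0] * tr - alpha[1] * ti;
    y[1] += alpha[0] * ti + alpha[1] * tr;
}
```

Comparing this against the assembly kernel on small inputs (n a multiple of 4, as the kernel expects) is one way to validate the permute and addsub sequence; results may differ in the last bits because the vector code sums in a different order.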
|
||||
|
||||
|
||||
|
|
@@ -0,0 +1,105 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
#include "common.h"
|
||||
|
||||
|
||||
#if defined(NEHALEM)
|
||||
#include "daxpy_microk_nehalem-2.c"
|
||||
#elif defined(BULLDOZER)
|
||||
#include "daxpy_microk_bulldozer-2.c"
|
||||
#endif
|
||||
|
||||
|
||||
#ifndef HAVE_KERNEL_8
|
||||
|
||||
static void daxpy_kernel_8(BLASLONG n, FLOAT *x, FLOAT *y, FLOAT *alpha)
|
||||
{
|
||||
BLASLONG register i = 0;
|
||||
FLOAT a = *alpha;
|
||||
|
||||
while(i < n)
|
||||
{
|
||||
y[i] += a * x[i];
|
||||
y[i+1] += a * x[i+1];
|
||||
y[i+2] += a * x[i+2];
|
||||
y[i+3] += a * x[i+3];
|
||||
y[i+4] += a * x[i+4];
|
||||
y[i+5] += a * x[i+5];
|
||||
y[i+6] += a * x[i+6];
|
||||
y[i+7] += a * x[i+7];
|
||||
i+=8 ;
|
||||
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
int CNAME(BLASLONG n, BLASLONG dummy0, BLASLONG dummy1, FLOAT da, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y, FLOAT *dummy, BLASLONG dummy2)
|
||||
{
|
||||
BLASLONG i=0;
|
||||
BLASLONG ix=0,iy=0;
|
||||
|
||||
if ( n <= 0 ) return(0);
|
||||
|
||||
if ( (inc_x == 1) && (inc_y == 1) )
|
||||
{
|
||||
|
||||
BLASLONG n1 = n & -8; // same width as n; a plain int could truncate very large n
|
||||
|
||||
if ( n1 )
|
||||
daxpy_kernel_8(n1, x, y , &da );
|
||||
|
||||
i = n1;
|
||||
while(i < n)
|
||||
{
|
||||
|
||||
y[i] += da * x[i] ;
|
||||
i++ ;
|
||||
|
||||
}
|
||||
return(0);
|
||||
|
||||
|
||||
}
|
||||
|
||||
while(i < n)
|
||||
{
|
||||
|
||||
y[iy] += da * x[ix] ;
|
||||
ix += inc_x ;
|
||||
iy += inc_y ;
|
||||
i++ ;
|
||||
|
||||
}
|
||||
return(0);
|
||||
|
||||
}
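Review note: the driver above splits the unit-stride case into a multiple-of-8 block handled by `daxpy_kernel_8` and a scalar tail. A minimal standalone sketch of that split, alongside a plain strided reference it can be checked against, might look like the following. The names `daxpy_ref_strided` and `daxpy_blocked8` are hypothetical, and `long`/`double` stand in for BLASLONG/FLOAT; none of this is part of the commit.

```c
#include <assert.h>

/* Hypothetical strided reference: y[i*inc_y] += da * x[i*inc_x]. */
static void daxpy_ref_strided(long n, double da, const double *x, long inc_x,
                              double *y, long inc_y)
{
    long ix = 0, iy = 0;

    assert(n >= 0);

    for (long i = 0; i < n; i++) {
        y[iy] += da * x[ix];
        ix += inc_x;
        iy += inc_y;
    }
}

/* Hypothetical unit-stride variant mirroring the driver's split:
 * a blocked part covering n & -8 elements, then a scalar tail. */
static void daxpy_blocked8(long n, double da, const double *x, double *y)
{
    long n1 = n & -8L;            /* largest multiple of 8 not exceeding n */
    long i = 0;

    assert(n >= 0);

    for (; i < n1; i += 8)        /* stands in for daxpy_kernel_8 */
        for (int k = 0; k < 8; k++)
            y[i + k] += da * x[i + k];

    for (; i < n; i++)            /* remaining 0..7 elements */
        y[i] += da * x[i];
}
```

Running both on the same input with a length that is not a multiple of 8 exercises the tail loop as well as the blocked part.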
|
||||
|
||||
|
||||
|
|
@@ -0,0 +1,82 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#define HAVE_KERNEL_8 1
|
||||
static void daxpy_kernel_8( BLASLONG n, FLOAT *x, FLOAT *y , FLOAT *alpha) __attribute__ ((noinline));
|
||||
|
||||
static void daxpy_kernel_8( BLASLONG n, FLOAT *x, FLOAT *y, FLOAT *alpha)
|
||||
{
|
||||
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"vmovddup (%4), %%xmm0 \n\t" // alpha
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
|
||||
"prefetcht0 768(%3,%0,8) \n\t"
|
||||
"vmovups (%2,%0,8), %%xmm12 \n\t" // 2 * x
|
||||
"vfmaddpd (%3,%0,8), %%xmm0 , %%xmm12, %%xmm8 \n\t" // y += alpha * x
|
||||
"vmovups 16(%2,%0,8), %%xmm13 \n\t" // 2 * x
|
||||
".align 2 \n\t"
|
||||
"vmovups %%xmm8 , (%3,%0,8) \n\t"
|
||||
"vfmaddpd 16(%3,%0,8), %%xmm0 , %%xmm13, %%xmm9 \n\t" // y += alpha * x
|
||||
".align 2 \n\t"
|
||||
"vmovups 32(%2,%0,8), %%xmm14 \n\t" // 2 * x
|
||||
"vmovups %%xmm9 , 16(%3,%0,8) \n\t"
|
||||
"prefetcht0 768(%2,%0,8) \n\t"
|
||||
".align 2 \n\t"
|
||||
"vfmaddpd 32(%3,%0,8), %%xmm0 , %%xmm14, %%xmm10 \n\t" // y += alpha * x
|
||||
"vmovups 48(%2,%0,8), %%xmm15 \n\t" // 2 * x
|
||||
"vmovups %%xmm10, 32(%3,%0,8) \n\t"
|
||||
"vfmaddpd 48(%3,%0,8), %%xmm0 , %%xmm15, %%xmm11 \n\t" // y += alpha * x
|
||||
"vmovups %%xmm11, 48(%3,%0,8) \n\t"
|
||||
|
||||
"addq $8 , %0 \n\t"
|
||||
"subq $8 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
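:
"+r" (i), // 0 (incremented by the loop, so it must be a read-write operand)
"+r" (n) // 1 (decremented by the loop, so it must be a read-write operand)
: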
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (alpha) // 4
|
||||
: "cc",
|
||||
"%xmm0",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
|
@ -0,0 +1,91 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#define HAVE_KERNEL_8 1
|
||||
static void daxpy_kernel_8( BLASLONG n, FLOAT *x, FLOAT *y , FLOAT *alpha) __attribute__ ((noinline));
|
||||
|
||||
static void daxpy_kernel_8( BLASLONG n, FLOAT *x, FLOAT *y, FLOAT *alpha)
|
||||
{
|
||||
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movsd (%4), %%xmm0 \n\t" // alpha
|
||||
"shufpd $0, %%xmm0, %%xmm0 \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
// "prefetcht0 192(%2,%0,8) \n\t"
|
||||
// "prefetcht0 192(%3,%0,8) \n\t"
|
||||
|
||||
"movups (%2,%0,8), %%xmm12 \n\t" // 2 * x
|
||||
"movups 16(%2,%0,8), %%xmm13 \n\t" // 2 * x
|
||||
"movups 32(%2,%0,8), %%xmm14 \n\t" // 2 * x
|
||||
"movups 48(%2,%0,8), %%xmm15 \n\t" // 2 * x
|
||||
"movups (%3,%0,8), %%xmm8 \n\t" // 2 * y
|
||||
"movups 16(%3,%0,8), %%xmm9 \n\t" // 2 * y
|
||||
"movups 32(%3,%0,8), %%xmm10 \n\t" // 2 * y
|
||||
"movups 48(%3,%0,8), %%xmm11 \n\t" // 2 * y
|
||||
|
||||
"mulpd %%xmm0 , %%xmm12 \n\t" // alpha * x
|
||||
"mulpd %%xmm0 , %%xmm13 \n\t"
|
||||
"mulpd %%xmm0 , %%xmm14 \n\t"
|
||||
"mulpd %%xmm0 , %%xmm15 \n\t"
|
||||
|
||||
"addpd %%xmm12, %%xmm8 \n\t" // y += alpha *x
|
||||
"addpd %%xmm13, %%xmm9 \n\t"
|
||||
"addpd %%xmm14, %%xmm10 \n\t"
|
||||
"addpd %%xmm15, %%xmm11 \n\t"
|
||||
|
||||
"movups %%xmm8 , (%3,%0,8) \n\t"
|
||||
"movups %%xmm9 , 16(%3,%0,8) \n\t"
|
||||
"movups %%xmm10, 32(%3,%0,8) \n\t"
|
||||
"movups %%xmm11, 48(%3,%0,8) \n\t"
|
||||
|
||||
"addq $8 , %0 \n\t"
|
||||
"subq $8 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
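:
"+r" (i), // 0 (incremented by the loop, so it must be a read-write operand)
"+r" (n) // 1 (decremented by the loop, so it must be a read-write operand)
: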
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (alpha) // 4
|
||||
: "cc",
|
||||
"%xmm0",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
|
@ -0,0 +1,110 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
#include "common.h"
|
||||
|
||||
|
||||
#if defined(BULLDOZER) || defined(PILEDRIVER)
|
||||
#include "ddot_microk_bulldozer-2.c"
|
||||
#elif defined(NEHALEM)
|
||||
#include "ddot_microk_nehalem-2.c"
|
||||
#endif
|
||||
|
||||
|
||||
#ifndef HAVE_KERNEL_8
|
||||
|
||||
static void ddot_kernel_8(BLASLONG n, FLOAT *x, FLOAT *y, FLOAT *d)
{
	BLASLONG register i = 0;
	FLOAT dot = 0.0;

	while(i < n)
	{
		dot += y[i]   * x[i]
		     + y[i+1] * x[i+1]
		     + y[i+2] * x[i+2]
		     + y[i+3] * x[i+3]
		     + y[i+4] * x[i+4]
		     + y[i+5] * x[i+5]
		     + y[i+6] * x[i+6]
		     + y[i+7] * x[i+7] ;

		i+=8 ;

	}
	*d += dot;

}
|
||||
|
||||
#endif
|
||||
|
||||
FLOAT CNAME(BLASLONG n, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y)
|
||||
{
|
||||
BLASLONG i=0;
|
||||
BLASLONG ix=0,iy=0;
|
||||
|
||||
FLOAT dot = 0.0 ;
|
||||
|
||||
if ( n <= 0 ) return(dot);
|
||||
|
||||
if ( (inc_x == 1) && (inc_y == 1) )
|
||||
{
|
||||
|
||||
BLASLONG n1 = n & -8;	/* BLASLONG rather than int, so n is not truncated when BLASLONG is 64-bit */
|
||||
|
||||
if ( n1 )
|
||||
ddot_kernel_8(n1, x, y , &dot );
|
||||
|
||||
|
||||
i = n1;
|
||||
while(i < n)
|
||||
{
|
||||
|
||||
dot += y[i] * x[i] ;
|
||||
i++ ;
|
||||
|
||||
}
|
||||
return(dot);
|
||||
|
||||
|
||||
}
|
||||
|
||||
while(i < n)
|
||||
{
|
||||
|
||||
dot += y[iy] * x[ix] ;
|
||||
ix += inc_x ;
|
||||
iy += inc_y ;
|
||||
i++ ;
|
||||
|
||||
}
|
||||
return(dot);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
|
@ -25,47 +25,45 @@ OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
|||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#define HAVE_KERNEL_16x4 1
|
||||
static void sgemv_kernel_16x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y) __attribute__ ((noinline));
|
||||
#define HAVE_KERNEL_8 1
|
||||
static void ddot_kernel_8( BLASLONG n, FLOAT *x, FLOAT *y , FLOAT *dot) __attribute__ ((noinline));
|
||||
|
||||
static void sgemv_kernel_16x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y)
|
||||
static void ddot_kernel_8( BLASLONG n, FLOAT *x, FLOAT *y, FLOAT *dot)
|
||||
{
|
||||
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"vzeroupper \n\t"
|
||||
"vbroadcastss (%2), %%ymm12 \n\t" // x0
|
||||
"vbroadcastss 4(%2), %%ymm13 \n\t" // x1
|
||||
"vbroadcastss 8(%2), %%ymm14 \n\t" // x2
|
||||
"vbroadcastss 12(%2), %%ymm15 \n\t" // x3
|
||||
"vxorpd %%xmm4, %%xmm4, %%xmm4 \n\t"
|
||||
"vxorpd %%xmm5, %%xmm5, %%xmm5 \n\t"
|
||||
"vxorpd %%xmm6, %%xmm6, %%xmm6 \n\t"
|
||||
"vxorpd %%xmm7, %%xmm7, %%xmm7 \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"vmovups (%3,%0,4), %%ymm4 \n\t" // 8 * y
|
||||
"vmovups 32(%3,%0,4), %%ymm5 \n\t" // 8 * y
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"vmovups (%2,%0,8), %%xmm12 \n\t" // 2 * x
|
||||
"vmovups 16(%2,%0,8), %%xmm13 \n\t" // 2 * x
|
||||
"vmovups 32(%2,%0,8), %%xmm14 \n\t" // 2 * x
|
||||
"vmovups 48(%2,%0,8), %%xmm15 \n\t" // 2 * x
|
||||
|
||||
"prefetcht0 192(%4,%0,4) \n\t"
|
||||
"vfmadd231ps (%4,%0,4), %%ymm12, %%ymm4 \n\t"
|
||||
"vfmadd231ps 32(%4,%0,4), %%ymm12, %%ymm5 \n\t"
|
||||
"prefetcht0 192(%5,%0,4) \n\t"
|
||||
"vfmadd231ps (%5,%0,4), %%ymm13, %%ymm4 \n\t"
|
||||
"vfmadd231ps 32(%5,%0,4), %%ymm13, %%ymm5 \n\t"
|
||||
"prefetcht0 192(%6,%0,4) \n\t"
|
||||
"vfmadd231ps (%6,%0,4), %%ymm14, %%ymm4 \n\t"
|
||||
"vfmadd231ps 32(%6,%0,4), %%ymm14, %%ymm5 \n\t"
|
||||
"prefetcht0 192(%7,%0,4) \n\t"
|
||||
"vfmadd231ps (%7,%0,4), %%ymm15, %%ymm4 \n\t"
|
||||
"vfmadd231ps 32(%7,%0,4), %%ymm15, %%ymm5 \n\t"
|
||||
"vfmaddpd %%xmm4, (%3,%0,8), %%xmm12, %%xmm4 \n\t" // 2 * y
|
||||
"vfmaddpd %%xmm5, 16(%3,%0,8), %%xmm13, %%xmm5 \n\t" // 2 * y
|
||||
"vfmaddpd %%xmm6, 32(%3,%0,8), %%xmm14, %%xmm6 \n\t" // 2 * y
|
||||
"vfmaddpd %%xmm7, 48(%3,%0,8), %%xmm15, %%xmm7 \n\t" // 2 * y
|
||||
|
||||
"vmovups %%ymm4, (%3,%0,4) \n\t" // 8 * y
|
||||
"vmovups %%ymm5, 32(%3,%0,4) \n\t" // 8 * y
|
||||
"addq $8 , %0 \n\t"
|
||||
"subq $8 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"addq $16, %0 \n\t"
|
||||
"subq $16, %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
"vzeroupper \n\t"
|
||||
"vaddpd %%xmm4, %%xmm5, %%xmm4 \n\t"
|
||||
"vaddpd %%xmm6, %%xmm7, %%xmm6 \n\t"
|
||||
"vaddpd %%xmm4, %%xmm6, %%xmm4 \n\t"
|
||||
|
||||
"vhaddpd %%xmm4, %%xmm4, %%xmm4 \n\t"
|
||||
|
||||
"vmovsd %%xmm4, (%4) \n\t"
|
||||
|
||||
:
|
||||
:
|
||||
|
|
@ -73,12 +71,10 @@ static void sgemv_kernel_16x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y)
|
|||
"r" (n), // 1
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap[0]), // 4
|
||||
"r" (ap[1]), // 5
|
||||
"r" (ap[2]), // 6
|
||||
"r" (ap[3]) // 7
|
||||
"r" (dot) // 4
|
||||
: "cc",
|
||||
"%xmm4", "%xmm5",
|
||||
"%xmm6", "%xmm7",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
|
@ -0,0 +1,94 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#define HAVE_KERNEL_8 1
|
||||
static void ddot_kernel_8( BLASLONG n, FLOAT *x, FLOAT *y , FLOAT *dot) __attribute__ ((noinline));
|
||||
|
||||
static void ddot_kernel_8( BLASLONG n, FLOAT *x, FLOAT *y, FLOAT *dot)
|
||||
{
|
||||
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"xorpd %%xmm4, %%xmm4 \n\t"
|
||||
"xorpd %%xmm5, %%xmm5 \n\t"
|
||||
"xorpd %%xmm6, %%xmm6 \n\t"
|
||||
"xorpd %%xmm7, %%xmm7 \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
|
||||
"movups (%2,%0,8), %%xmm12 \n\t" // 2 * x
|
||||
"movups (%3,%0,8), %%xmm8 \n\t" // 2 * y
|
||||
"movups 16(%2,%0,8), %%xmm13 \n\t" // 2 * x
|
||||
"movups 16(%3,%0,8), %%xmm9 \n\t" // 2 * y
|
||||
"movups 32(%2,%0,8), %%xmm14 \n\t" // 2 * x
|
||||
"movups 32(%3,%0,8), %%xmm10 \n\t" // 2 * y
|
||||
"movups 48(%2,%0,8), %%xmm15 \n\t" // 2 * x
|
||||
"movups 48(%3,%0,8), %%xmm11 \n\t" // 2 * y
|
||||
|
||||
"mulpd %%xmm8 , %%xmm12 \n\t"
|
||||
"mulpd %%xmm9 , %%xmm13 \n\t"
|
||||
"mulpd %%xmm10, %%xmm14 \n\t"
|
||||
"mulpd %%xmm11, %%xmm15 \n\t"
|
||||
|
||||
"addpd %%xmm12, %%xmm4 \n\t"
|
||||
"addpd %%xmm13, %%xmm5 \n\t"
|
||||
"addpd %%xmm14, %%xmm6 \n\t"
|
||||
"addpd %%xmm15, %%xmm7 \n\t"
|
||||
|
||||
"addq $8 , %0 \n\t"
|
||||
"subq $8 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"addpd %%xmm5, %%xmm4 \n\t"
|
||||
"addpd %%xmm7, %%xmm6 \n\t"
|
||||
"addpd %%xmm6, %%xmm4 \n\t"
|
||||
|
||||
"haddpd %%xmm4, %%xmm4 \n\t"
|
||||
|
||||
"movsd %%xmm4, (%4) \n\t"
|
||||
|
||||
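:
"+r" (i), // 0 (incremented by the loop, so it must be a read-write operand)
"+r" (n) // 1 (decremented by the loop, so it must be a read-write operand)
: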
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (dot) // 4
|
||||
: "cc",
|
||||
"%xmm4", "%xmm5", "%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
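The Nehalem dot-product microkernel above keeps four independent two-lane accumulators (xmm4 to xmm7) and only combines them after the loop. As a rough scalar model of that structure, assuming nothing beyond what the assembly shows, the sketch below uses a hypothetical function `dot_block8_model` with `long`/`double` in place of `BLASLONG`/`FLOAT`.

```c
#include <assert.h>

/* Hypothetical scalar model of the 8-wide dot-product microkernel.
 * acc[r][0] and acc[r][1] play the role of the low and high lanes of
 * xmm4..xmm7. Like the asm loop, it expects n > 0 and n % 8 == 0. */
static double dot_block8_model(long n, const double *x, const double *y)
{
    assert(n > 0 && n % 8 == 0);

    double acc[4][2] = {{0.0, 0.0}, {0.0, 0.0}, {0.0, 0.0}, {0.0, 0.0}};

    for (long i = 0; i < n; i += 8) {
        for (int r = 0; r < 4; r++) {
            acc[r][0] += x[i + 2 * r]     * y[i + 2 * r];
            acc[r][1] += x[i + 2 * r + 1] * y[i + 2 * r + 1];
        }
    }

    /* Same reduction order as the epilogue: 4 += 5, 6 += 7, 4 += 6,
     * then a horizontal add of the two lanes. */
    double lo = (acc[0][0] + acc[1][0]) + (acc[2][0] + acc[3][0]);
    double hi = (acc[0][1] + acc[1][1]) + (acc[2][1] + acc[3][1]);
    return lo + hi;
}
```

Because the partial sums are accumulated separately, the rounding can differ slightly from a strictly sequential loop; with the exactly representable inputs used for checking this model, both give the same result.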
|
@ -1,206 +0,0 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
#include "common.h"
|
||||
|
||||
|
||||
#if defined(HASWELL)
|
||||
#include "dgemv_n_microk_haswell-2.c"
|
||||
#endif
|
||||
|
||||
|
||||
#define NBMAX 2048
|
||||
|
||||
#ifndef HAVE_KERNEL_16x4
|
||||
|
||||
static void dgemv_kernel_16x4(BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
BLASLONG i;
|
||||
FLOAT *a0,*a1,*a2,*a3;
|
||||
a0 = ap[0];
|
||||
a1 = ap[1];
|
||||
a2 = ap[2];
|
||||
a3 = ap[3];
|
||||
|
||||
for ( i=0; i< n; i+=4 )
|
||||
{
|
||||
y[i] += a0[i]*x[0] + a1[i]*x[1] + a2[i]*x[2] + a3[i]*x[3];
|
||||
y[i+1] += a0[i+1]*x[0] + a1[i+1]*x[1] + a2[i+1]*x[2] + a3[i+1]*x[3];
|
||||
y[i+2] += a0[i+2]*x[0] + a1[i+2]*x[1] + a2[i+2]*x[2] + a3[i+2]*x[3];
|
||||
y[i+3] += a0[i+3]*x[0] + a1[i+3]*x[1] + a2[i+3]*x[2] + a3[i+3]*x[3];
|
||||
}
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
static void dgemv_kernel_16x1(BLASLONG n, FLOAT *ap, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
BLASLONG i;
|
||||
FLOAT *a0;
|
||||
a0 = ap;
|
||||
|
||||
for ( i=0; i< n; i+=4 )
|
||||
{
|
||||
y[i] += a0[i]*x[0];
|
||||
y[i+1] += a0[i+1]*x[0];
|
||||
y[i+2] += a0[i+2]*x[0];
|
||||
y[i+3] += a0[i+3]*x[0];
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
static void zero_y(BLASLONG n, FLOAT *dest)
|
||||
{
|
||||
BLASLONG i;
|
||||
for ( i=0; i<n; i++ )
|
||||
{
|
||||
*dest = 0.0;
|
||||
dest++;
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
|
||||
static void add_y(BLASLONG n, FLOAT *src, FLOAT *dest, BLASLONG inc_dest)
|
||||
{
|
||||
BLASLONG i;
|
||||
if ( inc_dest == 1 )
|
||||
{
|
||||
for ( i=0; i<n; i+=4 )
|
||||
{
|
||||
dest[i] += src[i];
|
||||
dest[i+1] += src[i+1];
|
||||
dest[i+2] += src[i+2];
|
||||
dest[i+3] += src[i+3];
|
||||
}
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
for ( i=0; i<n; i++ )
|
||||
{
|
||||
*dest += *src;
|
||||
src++;
|
||||
dest += inc_dest;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
int CNAME(BLASLONG m, BLASLONG n, BLASLONG dummy1, FLOAT alpha, FLOAT *a, BLASLONG lda, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y, FLOAT *buffer)
|
||||
{
|
||||
BLASLONG i;
|
||||
BLASLONG j;
|
||||
FLOAT *a_ptr;
|
||||
FLOAT *x_ptr;
|
||||
FLOAT *y_ptr;
|
||||
FLOAT *ap[4];
|
||||
BLASLONG n1;
|
||||
BLASLONG m1;
|
||||
BLASLONG m2;
|
||||
BLASLONG n2;
|
||||
FLOAT xbuffer[4],*ybuffer;
|
||||
|
||||
if ( m < 1 ) return(0);
|
||||
if ( n < 1 ) return(0);
|
||||
|
||||
ybuffer = buffer;
|
||||
|
||||
n1 = n / 4 ;
|
||||
n2 = n % 4 ;
|
||||
|
||||
m1 = m - ( m % 16 );
|
||||
m2 = (m % NBMAX) - (m % 16) ;
|
||||
|
||||
y_ptr = y;
|
||||
|
||||
BLASLONG NB = NBMAX;
|
||||
|
||||
while ( NB == NBMAX )
|
||||
{
|
||||
|
||||
m1 -= NB;
|
||||
if ( m1 < 0)
|
||||
{
|
||||
if ( m2 == 0 ) break;
|
||||
NB = m2;
|
||||
}
|
||||
|
||||
a_ptr = a;
|
||||
x_ptr = x;
|
||||
zero_y(NB,ybuffer);
|
||||
for( i = 0; i < n1 ; i++)
|
||||
{
|
||||
xbuffer[0] = alpha * x_ptr[0];
|
||||
x_ptr += inc_x;
|
||||
xbuffer[1] = alpha * x_ptr[0];
|
||||
x_ptr += inc_x;
|
||||
xbuffer[2] = alpha * x_ptr[0];
|
||||
x_ptr += inc_x;
|
||||
xbuffer[3] = alpha * x_ptr[0];
|
||||
x_ptr += inc_x;
|
||||
ap[0] = a_ptr;
|
||||
ap[1] = a_ptr + lda;
|
||||
ap[2] = ap[1] + lda;
|
||||
ap[3] = ap[2] + lda;
|
||||
dgemv_kernel_16x4(NB,ap,xbuffer,ybuffer);
|
||||
a_ptr += 4 * lda;
|
||||
}
|
||||
|
||||
for( i = 0; i < n2 ; i++)
|
||||
{
|
||||
xbuffer[0] = alpha * x_ptr[0];
|
||||
x_ptr += inc_x;
|
||||
dgemv_kernel_16x1(NB,a_ptr,xbuffer,ybuffer);
|
||||
a_ptr += 1 * lda;
|
||||
|
||||
}
|
||||
add_y(NB,ybuffer,y_ptr,inc_y);
|
||||
a += NB;
|
||||
y_ptr += NB * inc_y;
|
||||
}
|
||||
j=0;
|
||||
while ( j < (m % 16))
|
||||
{
|
||||
a_ptr = a;
|
||||
x_ptr = x;
|
||||
FLOAT temp = 0.0;
|
||||
for( i = 0; i < n; i++ )
|
||||
{
|
||||
temp += a_ptr[0] * x_ptr[0];
|
||||
a_ptr += lda;
|
||||
x_ptr += inc_x;
|
||||
}
|
||||
y_ptr[0] += alpha * temp;
|
||||
y_ptr += inc_y;
|
||||
a++;
|
||||
j++;
|
||||
}
|
||||
return(0);
|
||||
}
|
||||
|
||||
|
||||
|
|
@ -0,0 +1,548 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
#include "common.h"
|
||||
|
||||
|
||||
#if defined(NEHALEM)
|
||||
#include "dgemv_n_microk_nehalem-4.c"
|
||||
#elif defined(HASWELL)
|
||||
#include "dgemv_n_microk_haswell-4.c"
|
||||
#endif
|
||||
|
||||
|
||||
#define NBMAX 2048
|
||||
|
||||
#ifndef HAVE_KERNEL_4x8
|
||||
|
||||
static void dgemv_kernel_4x8(BLASLONG n, FLOAT **ap, FLOAT *xo, FLOAT *y, BLASLONG lda4, FLOAT *alpha)
|
||||
{
|
||||
BLASLONG i;
|
||||
FLOAT *a0,*a1,*a2,*a3;
|
||||
FLOAT *b0,*b1,*b2,*b3;
|
||||
FLOAT *x4;
|
||||
FLOAT x[8];
|
||||
a0 = ap[0];
|
||||
a1 = ap[1];
|
||||
a2 = ap[2];
|
||||
a3 = ap[3];
|
||||
b0 = a0 + lda4 ;
|
||||
b1 = a1 + lda4 ;
|
||||
b2 = a2 + lda4 ;
|
||||
b3 = a3 + lda4 ;
|
||||
x4 = x + 4;
|
||||
|
||||
for ( i=0; i<8; i++)
|
||||
x[i] = xo[i] * *alpha;
|
||||
|
||||
for ( i=0; i< n; i+=4 )
|
||||
{
|
||||
|
||||
y[i] += a0[i]*x[0] + a1[i]*x[1] + a2[i]*x[2] + a3[i]*x[3];
|
||||
y[i+1] += a0[i+1]*x[0] + a1[i+1]*x[1] + a2[i+1]*x[2] + a3[i+1]*x[3];
|
||||
y[i+2] += a0[i+2]*x[0] + a1[i+2]*x[1] + a2[i+2]*x[2] + a3[i+2]*x[3];
|
||||
y[i+3] += a0[i+3]*x[0] + a1[i+3]*x[1] + a2[i+3]*x[2] + a3[i+3]*x[3];
|
||||
|
||||
y[i] += b0[i]*x4[0] + b1[i]*x4[1] + b2[i]*x4[2] + b3[i]*x4[3];
|
||||
y[i+1] += b0[i+1]*x4[0] + b1[i+1]*x4[1] + b2[i+1]*x4[2] + b3[i+1]*x4[3];
|
||||
y[i+2] += b0[i+2]*x4[0] + b1[i+2]*x4[1] + b2[i+2]*x4[2] + b3[i+2]*x4[3];
|
||||
y[i+3] += b0[i+3]*x4[0] + b1[i+3]*x4[1] + b2[i+3]*x4[2] + b3[i+3]*x4[3];
|
||||
|
||||
}
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
|
||||
#ifndef HAVE_KERNEL_4x4
|
||||
|
||||
static void dgemv_kernel_4x4(BLASLONG n, FLOAT **ap, FLOAT *xo, FLOAT *y, FLOAT *alpha)
|
||||
{
|
||||
BLASLONG i;
|
||||
FLOAT *a0,*a1,*a2,*a3;
|
||||
FLOAT x[4];
|
||||
a0 = ap[0];
|
||||
a1 = ap[1];
|
||||
a2 = ap[2];
|
||||
a3 = ap[3];
|
||||
|
||||
for ( i=0; i<4; i++)
|
||||
x[i] = xo[i] * *alpha;
|
||||
|
||||
for ( i=0; i< n; i+=4 )
|
||||
{
|
||||
y[i] += a0[i]*x[0] + a1[i]*x[1] + a2[i]*x[2] + a3[i]*x[3];
|
||||
y[i+1] += a0[i+1]*x[0] + a1[i+1]*x[1] + a2[i+1]*x[2] + a3[i+1]*x[3];
|
||||
y[i+2] += a0[i+2]*x[0] + a1[i+2]*x[1] + a2[i+2]*x[2] + a3[i+2]*x[3];
|
||||
y[i+3] += a0[i+3]*x[0] + a1[i+3]*x[1] + a2[i+3]*x[2] + a3[i+3]*x[3];
|
||||
}
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
#ifndef HAVE_KERNEL_4x2
|
||||
|
||||
static void dgemv_kernel_4x2( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, FLOAT *alpha) __attribute__ ((noinline));
|
||||
|
||||
static void dgemv_kernel_4x2( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, FLOAT *alpha)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movsd (%2) , %%xmm12 \n\t" // x0
|
||||
"movsd (%6) , %%xmm4 \n\t" // alpha
|
||||
"movsd 8(%2) , %%xmm13 \n\t" // x1
|
||||
"mulsd %%xmm4 , %%xmm12 \n\t" // alpha
|
||||
"mulsd %%xmm4 , %%xmm13 \n\t" // alpha
|
||||
"shufpd $0, %%xmm12, %%xmm12 \n\t"
|
||||
"shufpd $0, %%xmm13, %%xmm13 \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"movups (%3,%0,8), %%xmm4 \n\t" // 2 * y
|
||||
"movups 16(%3,%0,8), %%xmm5 \n\t" // 2 * y
|
||||
|
||||
"movups (%4,%0,8), %%xmm8 \n\t"
|
||||
"movups (%5,%0,8), %%xmm9 \n\t"
|
||||
"mulpd %%xmm12, %%xmm8 \n\t"
|
||||
"mulpd %%xmm13, %%xmm9 \n\t"
|
||||
"addpd %%xmm8 , %%xmm4 \n\t"
|
||||
"addpd %%xmm9 , %%xmm4 \n\t"
|
||||
|
||||
"movups 16(%4,%0,8), %%xmm8 \n\t"
|
||||
"movups 16(%5,%0,8), %%xmm9 \n\t"
|
||||
"mulpd %%xmm12, %%xmm8 \n\t"
|
||||
"mulpd %%xmm13, %%xmm9 \n\t"
|
||||
"addpd %%xmm8 , %%xmm5 \n\t"
|
||||
"addpd %%xmm9 , %%xmm5 \n\t"
|
||||
|
||||
"movups %%xmm4 , (%3,%0,8) \n\t" // 2 * y
|
||||
"movups %%xmm5 , 16(%3,%0,8) \n\t" // 2 * y
|
||||
|
||||
"addq $4 , %0 \n\t"
|
||||
"subq $4 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
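:
"+r" (i), // 0 (incremented by the loop, so it must be a read-write operand)
"+r" (n) // 1 (decremented by the loop, so it must be a read-write operand)
: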
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap[0]), // 4
|
||||
"r" (ap[1]), // 5
|
||||
"r" (alpha) // 6
|
||||
: "cc",
|
||||
"%xmm4", "%xmm5",
|
||||
"%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
#ifndef HAVE_KERNEL_4x1
|
||||
|
||||
static void dgemv_kernel_4x1(BLASLONG n, FLOAT *ap, FLOAT *x, FLOAT *y, FLOAT *alpha) __attribute__ ((noinline));
|
||||
|
||||
static void dgemv_kernel_4x1(BLASLONG n, FLOAT *ap, FLOAT *x, FLOAT *y, FLOAT *alpha)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movsd (%2), %%xmm12 \n\t" // x0
|
||||
"mulsd (%5), %%xmm12 \n\t" // alpha
|
||||
"shufpd $0, %%xmm12, %%xmm12 \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"movups (%4,%0,8), %%xmm8 \n\t" // 2 * a
|
||||
"movups 16(%4,%0,8), %%xmm9 \n\t" // 2 * a
|
||||
"movups (%3,%0,8), %%xmm4 \n\t" // 2 * y
|
||||
"movups 16(%3,%0,8), %%xmm5 \n\t" // 2 * y
|
||||
"mulpd %%xmm12, %%xmm8 \n\t"
|
||||
"mulpd %%xmm12, %%xmm9 \n\t"
|
||||
"addpd %%xmm8 , %%xmm4 \n\t"
|
||||
"addpd %%xmm9 , %%xmm5 \n\t"
|
||||
|
||||
"movups %%xmm4 , (%3,%0,8) \n\t" // 2 * y
|
||||
"movups %%xmm5 , 16(%3,%0,8) \n\t" // 2 * y
|
||||
|
||||
"addq $4 , %0 \n\t"
|
||||
"subq $4 , %1 \n\t"
|
||||
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
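:
"+r" (i), // 0 (incremented by the loop, so it must be a read-write operand)
"+r" (n) // 1 (decremented by the loop, so it must be a read-write operand)
: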
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap), // 4
|
||||
"r" (alpha) // 5
|
||||
: "cc",
|
||||
"%xmm4", "%xmm5",
|
||||
"%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
static void add_y(BLASLONG n, FLOAT *src, FLOAT *dest, BLASLONG inc_dest) __attribute__ ((noinline));
|
||||
|
||||
static void add_y(BLASLONG n, FLOAT *src, FLOAT *dest, BLASLONG inc_dest)
|
||||
{
|
||||
BLASLONG i;
|
||||
if ( inc_dest != 1 )
|
||||
{
|
||||
for ( i=0; i<n; i++ )
|
||||
{
|
||||
*dest += *src;
|
||||
src++;
|
||||
dest += inc_dest;
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
int CNAME(BLASLONG m, BLASLONG n, BLASLONG dummy1, FLOAT alpha, FLOAT *a, BLASLONG lda, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y, FLOAT *buffer)
|
||||
{
|
||||
BLASLONG i;
|
||||
BLASLONG j;
|
||||
FLOAT *a_ptr;
|
||||
FLOAT *x_ptr;
|
||||
FLOAT *y_ptr;
|
||||
FLOAT *ap[4];
|
||||
BLASLONG n1;
|
||||
BLASLONG m1;
|
||||
BLASLONG m2;
|
||||
BLASLONG m3;
|
||||
BLASLONG n2;
|
||||
BLASLONG lda4 = lda << 2;
|
||||
BLASLONG lda8 = lda << 3;
|
||||
FLOAT xbuffer[8],*ybuffer;
|
||||
|
||||
if ( m < 1 ) return(0);
|
||||
if ( n < 1 ) return(0);
|
||||
|
||||
ybuffer = buffer;
|
||||
|
||||
if ( inc_x == 1 )
|
||||
{
|
||||
n1 = n >> 3 ;
|
||||
n2 = n & 7 ;
|
||||
}
|
||||
else
|
||||
{
|
||||
n1 = n >> 2 ;
|
||||
n2 = n & 3 ;
|
||||
|
||||
}
|
||||
|
||||
m3 = m & 3 ;
|
||||
m1 = m & -4 ;
|
||||
m2 = (m & (NBMAX-1)) - m3 ;
|
||||
|
||||
|
||||
y_ptr = y;
|
||||
|
||||
BLASLONG NB = NBMAX;
|
||||
|
||||
while ( NB == NBMAX )
|
||||
{
|
||||
|
||||
m1 -= NB;
|
||||
if ( m1 < 0)
|
||||
{
|
||||
if ( m2 == 0 ) break;
|
||||
NB = m2;
|
||||
}
|
||||
|
||||
a_ptr = a;
|
||||
x_ptr = x;
|
||||
|
||||
ap[0] = a_ptr;
|
||||
ap[1] = a_ptr + lda;
|
||||
ap[2] = ap[1] + lda;
|
||||
ap[3] = ap[2] + lda;
|
||||
|
||||
if ( inc_y != 1 )
|
||||
memset(ybuffer,0,NB*sizeof(FLOAT));
|
||||
else
|
||||
ybuffer = y_ptr;
|
||||
|
||||
if ( inc_x == 1 )
|
||||
{
|
||||
|
||||
|
||||
for( i = 0; i < n1 ; i++)
|
||||
{
|
||||
dgemv_kernel_4x8(NB,ap,x_ptr,ybuffer,lda4,&alpha);
|
||||
ap[0] += lda8;
|
||||
ap[1] += lda8;
|
||||
ap[2] += lda8;
|
||||
ap[3] += lda8;
|
||||
a_ptr += lda8;
|
||||
x_ptr += 8;
|
||||
}
|
||||
|
||||
|
||||
if ( n2 & 4 )
|
||||
{
|
||||
dgemv_kernel_4x4(NB,ap,x_ptr,ybuffer,&alpha);
|
||||
ap[0] += lda4;
|
||||
ap[1] += lda4;
|
||||
a_ptr += lda4;
|
||||
x_ptr += 4;
|
||||
}
|
||||
|
||||
if ( n2 & 2 )
|
||||
{
|
||||
dgemv_kernel_4x2(NB,ap,x_ptr,ybuffer,&alpha);
|
||||
a_ptr += lda*2;
|
||||
x_ptr += 2;
|
||||
}
|
||||
|
||||
|
||||
if ( n2 & 1 )
|
||||
{
|
||||
dgemv_kernel_4x1(NB,a_ptr,x_ptr,ybuffer,&alpha);
|
||||
a_ptr += lda;
|
||||
x_ptr += 1;
|
||||
|
||||
}
|
||||
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
|
||||
for( i = 0; i < n1 ; i++)
|
||||
{
|
||||
xbuffer[0] = x_ptr[0];
|
||||
x_ptr += inc_x;
|
||||
xbuffer[1] = x_ptr[0];
|
||||
x_ptr += inc_x;
|
||||
xbuffer[2] = x_ptr[0];
|
||||
x_ptr += inc_x;
|
||||
xbuffer[3] = x_ptr[0];
|
||||
x_ptr += inc_x;
|
||||
dgemv_kernel_4x4(NB,ap,xbuffer,ybuffer,&alpha);
|
||||
ap[0] += lda4;
|
||||
ap[1] += lda4;
|
||||
ap[2] += lda4;
|
||||
ap[3] += lda4;
|
||||
a_ptr += lda4;
|
||||
}
|
||||
|
||||
for( i = 0; i < n2 ; i++)
|
||||
{
|
||||
xbuffer[0] = x_ptr[0];
|
||||
x_ptr += inc_x;
|
||||
dgemv_kernel_4x1(NB,a_ptr,xbuffer,ybuffer,&alpha);
|
||||
a_ptr += lda;
|
||||
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
a += NB;
|
||||
if ( inc_y != 1 )
|
||||
{
|
||||
add_y(NB,ybuffer,y_ptr,inc_y);
|
||||
y_ptr += NB * inc_y;
|
||||
}
|
||||
else
|
||||
y_ptr += NB ;
|
||||
|
||||
}
|
||||
|
||||
if ( m3 == 0 ) return(0);
|
||||
|
||||
if ( m3 == 3 )
|
||||
{
|
||||
a_ptr = a;
|
||||
x_ptr = x;
|
||||
FLOAT temp0 = 0.0;
|
||||
FLOAT temp1 = 0.0;
|
||||
FLOAT temp2 = 0.0;
|
||||
if ( lda == 3 && inc_x ==1 )
|
||||
{
|
||||
|
||||
for( i = 0; i < ( n & -4 ); i+=4 )
|
||||
{
|
||||
|
||||
temp0 += a_ptr[0] * x_ptr[0] + a_ptr[3] * x_ptr[1];
|
||||
temp1 += a_ptr[1] * x_ptr[0] + a_ptr[4] * x_ptr[1];
|
||||
temp2 += a_ptr[2] * x_ptr[0] + a_ptr[5] * x_ptr[1];
|
||||
|
||||
temp0 += a_ptr[6] * x_ptr[2] + a_ptr[9] * x_ptr[3];
|
||||
temp1 += a_ptr[7] * x_ptr[2] + a_ptr[10] * x_ptr[3];
|
||||
temp2 += a_ptr[8] * x_ptr[2] + a_ptr[11] * x_ptr[3];
|
||||
|
||||
a_ptr += 12;
|
||||
x_ptr += 4;
|
||||
}
|
||||
|
||||
for( ; i < n; i++ )
|
||||
{
|
||||
temp0 += a_ptr[0] * x_ptr[0];
|
||||
temp1 += a_ptr[1] * x_ptr[0];
|
||||
temp2 += a_ptr[2] * x_ptr[0];
|
||||
a_ptr += 3;
|
||||
x_ptr ++;
|
||||
}
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
|
||||
for( i = 0; i < n; i++ )
|
||||
{
|
||||
temp0 += a_ptr[0] * x_ptr[0];
|
||||
temp1 += a_ptr[1] * x_ptr[0];
|
||||
temp2 += a_ptr[2] * x_ptr[0];
|
||||
a_ptr += lda;
|
||||
x_ptr += inc_x;
|
||||
|
||||
|
||||
}
|
||||
|
||||
}
|
||||
y_ptr[0] += alpha * temp0;
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += alpha * temp1;
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += alpha * temp2;
|
||||
return(0);
|
||||
}
|
||||
|
||||
|
||||
if ( m3 == 2 )
|
||||
{
|
||||
a_ptr = a;
|
||||
x_ptr = x;
|
||||
FLOAT temp0 = 0.0;
|
||||
FLOAT temp1 = 0.0;
|
||||
if ( lda == 2 && inc_x ==1 )
|
||||
{
|
||||
|
||||
for( i = 0; i < (n & -4) ; i+=4 )
|
||||
{
|
||||
temp0 += a_ptr[0] * x_ptr[0] + a_ptr[2] * x_ptr[1];
|
||||
temp1 += a_ptr[1] * x_ptr[0] + a_ptr[3] * x_ptr[1];
|
||||
temp0 += a_ptr[4] * x_ptr[2] + a_ptr[6] * x_ptr[3];
|
||||
temp1 += a_ptr[5] * x_ptr[2] + a_ptr[7] * x_ptr[3];
|
||||
a_ptr += 8;
|
||||
x_ptr += 4;
|
||||
|
||||
}
|
||||
|
||||
|
||||
for( ; i < n; i++ )
|
||||
{
|
||||
temp0 += a_ptr[0] * x_ptr[0];
|
||||
temp1 += a_ptr[1] * x_ptr[0];
|
||||
a_ptr += 2;
|
||||
x_ptr ++;
|
||||
}
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
|
||||
for( i = 0; i < n; i++ )
|
||||
{
|
||||
temp0 += a_ptr[0] * x_ptr[0];
|
||||
temp1 += a_ptr[1] * x_ptr[0];
|
||||
a_ptr += lda;
|
||||
x_ptr += inc_x;
|
||||
|
||||
|
||||
}
|
||||
|
||||
}
|
||||
y_ptr[0] += alpha * temp0;
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += alpha * temp1;
|
||||
return(0);
|
||||
}
|
||||
|
||||
if ( m3 == 1 )
|
||||
{
|
||||
a_ptr = a;
|
||||
x_ptr = x;
|
||||
FLOAT temp = 0.0;
|
||||
if ( lda == 1 && inc_x ==1 )
|
||||
{
|
||||
|
||||
for( i = 0; i < (n & -4); i+=4 )
|
||||
{
|
||||
temp += a_ptr[i] * x_ptr[i] + a_ptr[i+1] * x_ptr[i+1] + a_ptr[i+2] * x_ptr[i+2] + a_ptr[i+3] * x_ptr[i+3];
|
||||
|
||||
}
|
||||
|
||||
for( ; i < n; i++ )
|
||||
{
|
||||
temp += a_ptr[i] * x_ptr[i];
|
||||
}
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
|
||||
for( i = 0; i < n; i++ )
|
||||
{
|
||||
temp += a_ptr[0] * x_ptr[0];
|
||||
a_ptr += lda;
|
||||
x_ptr += inc_x;
|
||||
}
|
||||
|
||||
}
|
||||
y_ptr[0] += alpha * temp;
|
||||
return(0);
|
||||
}
|
||||
|
||||
|
||||
return(0);
|
||||
}
|
||||
|
||||
|
||||
|
|
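The driver above splits the m rows into full blocks of NBMAX rows, one partial block of m2 rows (always a multiple of 4), and a scalar tail of m3 = m & 3 rows. A minimal sketch of that arithmetic follows. `plan_row_blocks` is a hypothetical helper, not part of OpenBLAS, and the value 2048 used in the usage note is an assumption borrowed from the dgemv_t driver in this commit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not in OpenBLAS): replays the row-blocking loop of
 * the driver and records the size of each block that would be handed to
 * the kernels. Returns the number of blocks; *tail receives the 0..3
 * leftover rows that the driver handles with scalar code. */
static size_t plan_row_blocks(long m, long nbmax, long *sizes, size_t cap, long *tail)
{
    /* The bit masks below only work when nbmax is a power of two >= 4. */
    assert(nbmax >= 4 && (nbmax & (nbmax - 1)) == 0);

    long m3 = m & 3;                  /* scalar tail rows            */
    long m1 = m & -4;                 /* rows covered by the kernels */
    long m2 = (m & (nbmax - 1)) - m3; /* size of the partial block   */
    long nb = nbmax;
    size_t count = 0;

    while (nb == nbmax) {
        m1 -= nb;
        if (m1 < 0) {
            if (m2 == 0) break;       /* no partial block left */
            nb = m2;                  /* last, shorter block   */
        }
        if (count < cap) sizes[count] = nb;
        count++;
    }
    *tail = m3;
    return count;
}
```

For example, with nbmax = 2048, m = 5000 yields blocks of 2048, 2048 and 904 rows with no tail, while m = 7 yields a single block of 4 rows and a tail of 3.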
@ -0,0 +1,247 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
|
||||
#define HAVE_KERNEL_4x8 1
|
||||
static void dgemv_kernel_4x8( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, BLASLONG lda4, FLOAT *alpha) __attribute__ ((noinline));
|
||||
|
||||
static void dgemv_kernel_4x8( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, BLASLONG lda4, FLOAT *alpha)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"vzeroupper \n\t"
|
||||
"vbroadcastsd (%2), %%ymm12 \n\t" // x0
|
||||
"vbroadcastsd 8(%2), %%ymm13 \n\t" // x1
|
||||
"vbroadcastsd 16(%2), %%ymm14 \n\t" // x2
|
||||
"vbroadcastsd 24(%2), %%ymm15 \n\t" // x3
|
||||
"vbroadcastsd 32(%2), %%ymm0 \n\t" // x4
|
||||
"vbroadcastsd 40(%2), %%ymm1 \n\t" // x5
|
||||
"vbroadcastsd 48(%2), %%ymm2 \n\t" // x6
|
||||
"vbroadcastsd 56(%2), %%ymm3 \n\t" // x7
|
||||
|
||||
"vbroadcastsd (%9), %%ymm6 \n\t" // alpha
|
||||
|
||||
"testq $0x04, %1 \n\t"
|
||||
"jz .L8LABEL%= \n\t"
|
||||
|
||||
"vmovupd (%3,%0,8), %%ymm7 \n\t" // 4 * y
|
||||
"vxorpd %%ymm4 , %%ymm4, %%ymm4 \n\t"
|
||||
"vxorpd %%ymm5 , %%ymm5, %%ymm5 \n\t"
|
||||
|
||||
"vfmadd231pd (%4,%0,8), %%ymm12, %%ymm4 \n\t"
|
||||
"vfmadd231pd (%5,%0,8), %%ymm13, %%ymm5 \n\t"
|
||||
"vfmadd231pd (%6,%0,8), %%ymm14, %%ymm4 \n\t"
|
||||
"vfmadd231pd (%7,%0,8), %%ymm15, %%ymm5 \n\t"
|
||||
|
||||
"vfmadd231pd (%4,%8,8), %%ymm0 , %%ymm4 \n\t"
|
||||
"vfmadd231pd (%5,%8,8), %%ymm1 , %%ymm5 \n\t"
|
||||
"vfmadd231pd (%6,%8,8), %%ymm2 , %%ymm4 \n\t"
|
||||
"vfmadd231pd (%7,%8,8), %%ymm3 , %%ymm5 \n\t"
|
||||
|
||||
"vaddpd %%ymm4 , %%ymm5 , %%ymm5 \n\t"
|
||||
"vmulpd %%ymm6 , %%ymm5 , %%ymm5 \n\t"
|
||||
"vaddpd %%ymm7 , %%ymm5 , %%ymm5 \n\t"
|
||||
|
||||
|
||||
"vmovupd %%ymm5, (%3,%0,8) \n\t" // 4 * y
|
||||
|
||||
"addq $4 , %8 \n\t"
|
||||
"addq $4 , %0 \n\t"
|
||||
"subq $4 , %1 \n\t"
|
||||
|
||||
".L8LABEL%=: \n\t"
|
||||
|
||||
"cmpq $0, %1 \n\t"
|
||||
"je .L16END%= \n\t"
|
||||
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
|
||||
"vxorpd %%ymm4 , %%ymm4, %%ymm4 \n\t"
|
||||
"vxorpd %%ymm5 , %%ymm5, %%ymm5 \n\t"
|
||||
"vmovupd (%3,%0,8), %%ymm8 \n\t" // 4 * y
|
||||
"vmovupd 32(%3,%0,8), %%ymm9 \n\t" // 4 * y
|
||||
|
||||
"vfmadd231pd (%4,%0,8), %%ymm12, %%ymm4 \n\t"
|
||||
"vfmadd231pd 32(%4,%0,8), %%ymm12, %%ymm5 \n\t"
|
||||
"vfmadd231pd (%5,%0,8), %%ymm13, %%ymm4 \n\t"
|
||||
"vfmadd231pd 32(%5,%0,8), %%ymm13, %%ymm5 \n\t"
|
||||
"vfmadd231pd (%6,%0,8), %%ymm14, %%ymm4 \n\t"
|
||||
"vfmadd231pd 32(%6,%0,8), %%ymm14, %%ymm5 \n\t"
|
||||
"vfmadd231pd (%7,%0,8), %%ymm15, %%ymm4 \n\t"
|
||||
"vfmadd231pd 32(%7,%0,8), %%ymm15, %%ymm5 \n\t"
|
||||
|
||||
"vfmadd231pd (%4,%8,8), %%ymm0 , %%ymm4 \n\t"
|
||||
"addq $8 , %0 \n\t"
|
||||
"vfmadd231pd 32(%4,%8,8), %%ymm0 , %%ymm5 \n\t"
|
||||
"vfmadd231pd (%5,%8,8), %%ymm1 , %%ymm4 \n\t"
|
||||
"vfmadd231pd 32(%5,%8,8), %%ymm1 , %%ymm5 \n\t"
|
||||
"vfmadd231pd (%6,%8,8), %%ymm2 , %%ymm4 \n\t"
|
||||
"vfmadd231pd 32(%6,%8,8), %%ymm2 , %%ymm5 \n\t"
|
||||
"vfmadd231pd (%7,%8,8), %%ymm3 , %%ymm4 \n\t"
|
||||
"vfmadd231pd 32(%7,%8,8), %%ymm3 , %%ymm5 \n\t"
|
||||
|
||||
"vfmadd231pd %%ymm6 , %%ymm4 , %%ymm8 \n\t"
|
||||
"vfmadd231pd %%ymm6 , %%ymm5 , %%ymm9 \n\t"
|
||||
|
||||
"addq $8 , %8 \n\t"
|
||||
"vmovupd %%ymm8,-64(%3,%0,8) \n\t" // 4 * y
|
||||
"subq $8 , %1 \n\t"
|
||||
"vmovupd %%ymm9,-32(%3,%0,8) \n\t" // 4 * y
|
||||
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
".L16END%=: \n\t"
|
||||
"vzeroupper \n\t"
|
||||
|
||||
:
|
||||
:
|
||||
"r" (i), // 0
|
||||
"r" (n), // 1
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap[0]), // 4
|
||||
"r" (ap[1]), // 5
|
||||
"r" (ap[2]), // 6
|
||||
"r" (ap[3]), // 7
|
||||
"r" (lda4), // 8
|
||||
"r" (alpha) // 9
|
||||
: "cc",
|
||||
"%xmm0", "%xmm1",
|
||||
"%xmm2", "%xmm3",
|
||||
"%xmm4", "%xmm5",
|
||||
"%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
||||
#define HAVE_KERNEL_4x4 1
|
||||
static void dgemv_kernel_4x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, FLOAT *alpha) __attribute__ ((noinline));
|
||||
|
||||
static void dgemv_kernel_4x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, FLOAT *alpha)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"vzeroupper \n\t"
|
||||
"vbroadcastsd (%2), %%ymm12 \n\t" // x0
|
||||
"vbroadcastsd 8(%2), %%ymm13 \n\t" // x1
|
||||
"vbroadcastsd 16(%2), %%ymm14 \n\t" // x2
|
||||
"vbroadcastsd 24(%2), %%ymm15 \n\t" // x3
|
||||
|
||||
"vbroadcastsd (%8), %%ymm6 \n\t" // alpha
|
||||
|
||||
"testq $0x04, %1 \n\t"
|
||||
"jz .L8LABEL%= \n\t"
|
||||
|
||||
"vxorpd %%ymm4 , %%ymm4, %%ymm4 \n\t"
|
||||
"vxorpd %%ymm5 , %%ymm5, %%ymm5 \n\t"
|
||||
"vmovupd (%3,%0,8), %%ymm7 \n\t" // 4 * y
|
||||
|
||||
"vfmadd231pd (%4,%0,8), %%ymm12, %%ymm4 \n\t"
|
||||
"vfmadd231pd (%5,%0,8), %%ymm13, %%ymm5 \n\t"
|
||||
"vfmadd231pd (%6,%0,8), %%ymm14, %%ymm4 \n\t"
|
||||
"vfmadd231pd (%7,%0,8), %%ymm15, %%ymm5 \n\t"
|
||||
|
||||
"vaddpd %%ymm4 , %%ymm5 , %%ymm5 \n\t"
|
||||
"vmulpd %%ymm6 , %%ymm5 , %%ymm5 \n\t"
|
||||
"vaddpd %%ymm7 , %%ymm5 , %%ymm5 \n\t"
|
||||
|
||||
"vmovupd %%ymm5, (%3,%0,8) \n\t" // 4 * y
|
||||
|
||||
"addq $4 , %0 \n\t"
|
||||
"subq $4 , %1 \n\t"
|
||||
|
||||
".L8LABEL%=: \n\t"
|
||||
|
||||
"cmpq $0, %1 \n\t"
|
||||
"je .L8END%= \n\t"
|
||||
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"vxorpd %%ymm4 , %%ymm4, %%ymm4 \n\t"
|
||||
"vxorpd %%ymm5 , %%ymm5, %%ymm5 \n\t"
|
||||
"vmovupd (%3,%0,8), %%ymm8 \n\t" // 4 * y
|
||||
"vmovupd 32(%3,%0,8), %%ymm9 \n\t" // 4 * y
|
||||
|
||||
"vfmadd231pd (%4,%0,8), %%ymm12, %%ymm4 \n\t"
|
||||
"vfmadd231pd 32(%4,%0,8), %%ymm12, %%ymm5 \n\t"
|
||||
"vfmadd231pd (%5,%0,8), %%ymm13, %%ymm4 \n\t"
|
||||
"vfmadd231pd 32(%5,%0,8), %%ymm13, %%ymm5 \n\t"
|
||||
"vfmadd231pd (%6,%0,8), %%ymm14, %%ymm4 \n\t"
|
||||
"vfmadd231pd 32(%6,%0,8), %%ymm14, %%ymm5 \n\t"
|
||||
"vfmadd231pd (%7,%0,8), %%ymm15, %%ymm4 \n\t"
|
||||
"vfmadd231pd 32(%7,%0,8), %%ymm15, %%ymm5 \n\t"
|
||||
|
||||
"vfmadd231pd %%ymm6 , %%ymm4 , %%ymm8 \n\t"
|
||||
"vfmadd231pd %%ymm6 , %%ymm5 , %%ymm9 \n\t"
|
||||
|
||||
"vmovupd %%ymm8, (%3,%0,8) \n\t" // 4 * y
|
||||
"vmovupd %%ymm9, 32(%3,%0,8) \n\t" // 4 * y
|
||||
|
||||
"addq $8 , %0 \n\t"
|
||||
"subq $8 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
".L8END%=: \n\t"
|
||||
"vzeroupper \n\t"
|
||||
|
||||
:
|
||||
:
|
||||
"r" (i), // 0
|
||||
"r" (n), // 1
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap[0]), // 4
|
||||
"r" (ap[1]), // 5
|
||||
"r" (ap[2]), // 6
|
||||
"r" (ap[3]), // 7
|
||||
"r" (alpha) // 8
|
||||
: "cc",
|
||||
"%xmm4", "%xmm5",
|
||||
"%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
|
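As a reading aid for the Haswell assembly above, the following is a scalar sketch of what dgemv_kernel_4x8 appears to compute, derived from the operand usage in the asm: ap[0..3] address columns 0..3, and the same pointers offset by lda4 elements address columns 4..7. The name `ref_dgemv_kernel_4x8` is hypothetical, and this version passes alpha by value and uses `double` in place of FLOAT. It does not reproduce the FMA rounding or the summation order of the vector code.

```c
#include <stddef.h>

/* Scalar reference (illustrative only, not the OpenBLAS source):
 * y[i] += alpha * sum over 8 columns of A[i][col] * x[col]. */
static void ref_dgemv_kernel_4x8(long n, double **ap, const double *x,
                                 double *y, long lda4, double alpha)
{
    for (long i = 0; i < n; i++) {
        double acc = 0.0;
        for (int j = 0; j < 4; j++) {
            acc += ap[j][i] * x[j];            /* columns 0..3 */
            acc += ap[j][i + lda4] * x[j + 4]; /* columns 4..7 */
        }
        y[i] += alpha * acc;
    }
}
```

Unlike the assembly, which peels one group of 4 rows and then loops over 8 rows at a time, this sketch accepts any n. The driver only calls the kernel with row counts that are multiples of 4.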
@ -0,0 +1,265 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
|
||||
#define HAVE_KERNEL_4x8 1
|
||||
static void dgemv_kernel_4x8( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, BLASLONG lda4, FLOAT *alpha) __attribute__ ((noinline));
|
||||
|
||||
static void dgemv_kernel_4x8( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, BLASLONG lda4, FLOAT *alpha)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movsd (%2), %%xmm12 \n\t" // x0
|
||||
"movsd 8(%2), %%xmm13 \n\t" // x1
|
||||
"movsd 16(%2), %%xmm14 \n\t" // x2
|
||||
"movsd 24(%2), %%xmm15 \n\t" // x3
|
||||
"shufpd $0, %%xmm12, %%xmm12\n\t"
|
||||
"shufpd $0, %%xmm13, %%xmm13\n\t"
|
||||
"shufpd $0, %%xmm14, %%xmm14\n\t"
|
||||
"shufpd $0, %%xmm15, %%xmm15\n\t"
|
||||
|
||||
"movsd 32(%2), %%xmm0 \n\t" // x4
|
||||
"movsd 40(%2), %%xmm1 \n\t" // x5
|
||||
"movsd 48(%2), %%xmm2 \n\t" // x6
|
||||
"movsd 56(%2), %%xmm3 \n\t" // x7
|
||||
"shufpd $0, %%xmm0 , %%xmm0 \n\t"
|
||||
"shufpd $0, %%xmm1 , %%xmm1 \n\t"
|
||||
"shufpd $0, %%xmm2 , %%xmm2 \n\t"
|
||||
"shufpd $0, %%xmm3 , %%xmm3 \n\t"
|
||||
|
||||
"movsd (%9), %%xmm6 \n\t" // alpha
|
||||
"shufpd $0, %%xmm6 , %%xmm6 \n\t"
|
||||
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"xorpd %%xmm4 , %%xmm4 \n\t"
|
||||
"xorpd %%xmm5 , %%xmm5 \n\t"
|
||||
"movups (%3,%0,8), %%xmm7 \n\t" // 2 * y
|
||||
|
||||
".align 2 \n\t"
|
||||
"movups (%4,%0,8), %%xmm8 \n\t"
|
||||
"movups (%5,%0,8), %%xmm9 \n\t"
|
||||
"movups (%6,%0,8), %%xmm10 \n\t"
|
||||
"movups (%7,%0,8), %%xmm11 \n\t"
|
||||
".align 2 \n\t"
|
||||
"mulpd %%xmm12, %%xmm8 \n\t"
|
||||
"mulpd %%xmm13, %%xmm9 \n\t"
|
||||
"mulpd %%xmm14, %%xmm10 \n\t"
|
||||
"mulpd %%xmm15, %%xmm11 \n\t"
|
||||
"addpd %%xmm8 , %%xmm4 \n\t"
|
||||
"addpd %%xmm9 , %%xmm5 \n\t"
|
||||
"addpd %%xmm10, %%xmm4 \n\t"
|
||||
"addpd %%xmm11, %%xmm5 \n\t"
|
||||
|
||||
"movups (%4,%8,8), %%xmm8 \n\t"
|
||||
"movups (%5,%8,8), %%xmm9 \n\t"
|
||||
"movups (%6,%8,8), %%xmm10 \n\t"
|
||||
"movups (%7,%8,8), %%xmm11 \n\t"
|
||||
".align 2 \n\t"
|
||||
"mulpd %%xmm0 , %%xmm8 \n\t"
|
||||
"mulpd %%xmm1 , %%xmm9 \n\t"
|
||||
"mulpd %%xmm2 , %%xmm10 \n\t"
|
||||
"mulpd %%xmm3 , %%xmm11 \n\t"
|
||||
"addpd %%xmm8 , %%xmm4 \n\t"
|
||||
"addpd %%xmm9 , %%xmm5 \n\t"
|
||||
"addpd %%xmm10, %%xmm4 \n\t"
|
||||
"addpd %%xmm11, %%xmm5 \n\t"
|
||||
|
||||
"addpd %%xmm5 , %%xmm4 \n\t"
|
||||
"mulpd %%xmm6 , %%xmm4 \n\t"
|
||||
"addpd %%xmm4 , %%xmm7 \n\t"
|
||||
|
||||
"movups %%xmm7 , (%3,%0,8) \n\t" // 2 * y
|
||||
|
||||
"xorpd %%xmm4 , %%xmm4 \n\t"
|
||||
"xorpd %%xmm5 , %%xmm5 \n\t"
|
||||
"movups 16(%3,%0,8), %%xmm7 \n\t" // 2 * y
|
||||
|
||||
".align 2 \n\t"
|
||||
"movups 16(%4,%0,8), %%xmm8 \n\t"
|
||||
"movups 16(%5,%0,8), %%xmm9 \n\t"
|
||||
"movups 16(%6,%0,8), %%xmm10 \n\t"
|
||||
"movups 16(%7,%0,8), %%xmm11 \n\t"
|
||||
".align 2 \n\t"
|
||||
"mulpd %%xmm12, %%xmm8 \n\t"
|
||||
"mulpd %%xmm13, %%xmm9 \n\t"
|
||||
"mulpd %%xmm14, %%xmm10 \n\t"
|
||||
"mulpd %%xmm15, %%xmm11 \n\t"
|
||||
"addpd %%xmm8 , %%xmm4 \n\t"
|
||||
"addpd %%xmm9 , %%xmm5 \n\t"
|
||||
"addpd %%xmm10, %%xmm4 \n\t"
|
||||
"addpd %%xmm11, %%xmm5 \n\t"
|
||||
|
||||
"movups 16(%4,%8,8), %%xmm8 \n\t"
|
||||
"movups 16(%5,%8,8), %%xmm9 \n\t"
|
||||
"movups 16(%6,%8,8), %%xmm10 \n\t"
|
||||
"movups 16(%7,%8,8), %%xmm11 \n\t"
|
||||
".align 2 \n\t"
|
||||
"mulpd %%xmm0 , %%xmm8 \n\t"
|
||||
"mulpd %%xmm1 , %%xmm9 \n\t"
|
||||
"mulpd %%xmm2 , %%xmm10 \n\t"
|
||||
"mulpd %%xmm3 , %%xmm11 \n\t"
|
||||
"addpd %%xmm8 , %%xmm4 \n\t"
|
||||
"addpd %%xmm9 , %%xmm5 \n\t"
|
||||
"addpd %%xmm10, %%xmm4 \n\t"
|
||||
"addpd %%xmm11, %%xmm5 \n\t"
|
||||
|
||||
"addq $4 , %8 \n\t"
|
||||
"addpd %%xmm5 , %%xmm4 \n\t"
|
||||
"mulpd %%xmm6 , %%xmm4 \n\t"
|
||||
"addpd %%xmm4 , %%xmm7 \n\t"
|
||||
|
||||
"movups %%xmm7 , 16(%3,%0,8) \n\t" // 2 * y
|
||||
|
||||
"addq $4 , %0 \n\t"
|
||||
"subq $4 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
:
|
||||
:
|
||||
"r" (i), // 0
|
||||
"r" (n), // 1
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap[0]), // 4
|
||||
"r" (ap[1]), // 5
|
||||
"r" (ap[2]), // 6
|
||||
"r" (ap[3]), // 7
|
||||
"r" (lda4), // 8
|
||||
"r" (alpha) // 9
|
||||
: "cc",
|
||||
"%xmm0", "%xmm1",
|
||||
"%xmm2", "%xmm3",
|
||||
"%xmm4", "%xmm5",
|
||||
"%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
||||
|
||||
#define HAVE_KERNEL_4x4 1
|
||||
static void dgemv_kernel_4x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, FLOAT *alpha) __attribute__ ((noinline));
|
||||
|
||||
static void dgemv_kernel_4x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, FLOAT *alpha)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movsd (%2), %%xmm12 \n\t" // x0
|
||||
"movsd 8(%2), %%xmm13 \n\t" // x1
|
||||
"movsd 16(%2), %%xmm14 \n\t" // x2
|
||||
"movsd 24(%2), %%xmm15 \n\t" // x3
|
||||
"shufpd $0, %%xmm12, %%xmm12\n\t"
|
||||
"shufpd $0, %%xmm13, %%xmm13\n\t"
|
||||
"shufpd $0, %%xmm14, %%xmm14\n\t"
|
||||
"shufpd $0, %%xmm15, %%xmm15\n\t"
|
||||
|
||||
"movsd (%8), %%xmm6 \n\t" // alpha
|
||||
"shufpd $0, %%xmm6 , %%xmm6 \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"xorpd %%xmm4 , %%xmm4 \n\t"
|
||||
"xorpd %%xmm5 , %%xmm5 \n\t"
|
||||
"movups (%3,%0,8), %%xmm7 \n\t" // 2 * y
|
||||
|
||||
"movups (%4,%0,8), %%xmm8 \n\t"
|
||||
"movups (%5,%0,8), %%xmm9 \n\t"
|
||||
"movups (%6,%0,8), %%xmm10 \n\t"
|
||||
"movups (%7,%0,8), %%xmm11 \n\t"
|
||||
"mulpd %%xmm12, %%xmm8 \n\t"
|
||||
"mulpd %%xmm13, %%xmm9 \n\t"
|
||||
"mulpd %%xmm14, %%xmm10 \n\t"
|
||||
"mulpd %%xmm15, %%xmm11 \n\t"
|
||||
"addpd %%xmm8 , %%xmm4 \n\t"
|
||||
"addpd %%xmm9 , %%xmm4 \n\t"
|
||||
"addpd %%xmm10 , %%xmm4 \n\t"
|
||||
"addpd %%xmm4 , %%xmm11 \n\t"
|
||||
|
||||
"mulpd %%xmm6 , %%xmm11 \n\t"
|
||||
"addpd %%xmm7 , %%xmm11 \n\t"
|
||||
"movups %%xmm11, (%3,%0,8) \n\t" // 2 * y
|
||||
|
||||
"xorpd %%xmm4 , %%xmm4 \n\t"
|
||||
"xorpd %%xmm5 , %%xmm5 \n\t"
|
||||
"movups 16(%3,%0,8), %%xmm7 \n\t" // 2 * y
|
||||
|
||||
"movups 16(%4,%0,8), %%xmm8 \n\t"
|
||||
"movups 16(%5,%0,8), %%xmm9 \n\t"
|
||||
"movups 16(%6,%0,8), %%xmm10 \n\t"
|
||||
"movups 16(%7,%0,8), %%xmm11 \n\t"
|
||||
"mulpd %%xmm12, %%xmm8 \n\t"
|
||||
"mulpd %%xmm13, %%xmm9 \n\t"
|
||||
"mulpd %%xmm14, %%xmm10 \n\t"
|
||||
"mulpd %%xmm15, %%xmm11 \n\t"
|
||||
"addpd %%xmm8 , %%xmm4 \n\t"
|
||||
"addpd %%xmm9 , %%xmm4 \n\t"
|
||||
"addpd %%xmm10 , %%xmm4 \n\t"
|
||||
"addpd %%xmm4 , %%xmm11 \n\t"
|
||||
|
||||
"mulpd %%xmm6 , %%xmm11 \n\t"
|
||||
"addpd %%xmm7 , %%xmm11 \n\t"
|
||||
"movups %%xmm11, 16(%3,%0,8) \n\t" // 2 * y
|
||||
|
||||
"addq $4 , %0 \n\t"
|
||||
"subq $4 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
:
|
||||
:
|
||||
"r" (i), // 0
|
||||
"r" (n), // 1
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap[0]), // 4
|
||||
"r" (ap[1]), // 5
|
||||
"r" (ap[2]), // 6
|
||||
"r" (ap[3]), // 7
|
||||
"r" (alpha) // 8
|
||||
: "cc",
|
||||
"%xmm4", "%xmm5",
|
||||
"%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
|
@ -1,191 +0,0 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
#include "common.h"
|
||||
|
||||
#if defined(HASWELL)
|
||||
#include "dgemv_t_microk_haswell-2.c"
|
||||
#endif
|
||||
|
||||
#define NBMAX 2048
|
||||
|
||||
#ifndef HAVE_KERNEL_16x4
|
||||
|
||||
static void dgemv_kernel_16x4(BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
BLASLONG i;
|
||||
FLOAT *a0,*a1,*a2,*a3;
|
||||
a0 = ap[0];
|
||||
a1 = ap[1];
|
||||
a2 = ap[2];
|
||||
a3 = ap[3];
|
||||
FLOAT temp0 = 0.0;
|
||||
FLOAT temp1 = 0.0;
|
||||
FLOAT temp2 = 0.0;
|
||||
FLOAT temp3 = 0.0;
|
||||
|
||||
for ( i=0; i< n; i+=4 )
|
||||
{
|
||||
temp0 += a0[i]*x[i] + a0[i+1]*x[i+1] + a0[i+2]*x[i+2] + a0[i+3]*x[i+3];
|
||||
temp1 += a1[i]*x[i] + a1[i+1]*x[i+1] + a1[i+2]*x[i+2] + a1[i+3]*x[i+3];
|
||||
temp2 += a2[i]*x[i] + a2[i+1]*x[i+1] + a2[i+2]*x[i+2] + a2[i+3]*x[i+3];
|
||||
temp3 += a3[i]*x[i] + a3[i+1]*x[i+1] + a3[i+2]*x[i+2] + a3[i+3]*x[i+3];
|
||||
}
|
||||
y[0] = temp0;
|
||||
y[1] = temp1;
|
||||
y[2] = temp2;
|
||||
y[3] = temp3;
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
static void dgemv_kernel_16x1(BLASLONG n, FLOAT *ap, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
BLASLONG i;
|
||||
FLOAT *a0;
|
||||
a0 = ap;
|
||||
FLOAT temp = 0.0;
|
||||
|
||||
for ( i=0; i< n; i+=4 )
|
||||
{
|
||||
temp += a0[i]*x[i] + a0[i+1]*x[i+1] + a0[i+2]*x[i+2] + a0[i+3]*x[i+3];
|
||||
}
|
||||
*y = temp;
|
||||
}
|
||||
|
||||
static void copy_x(BLASLONG n, FLOAT *src, FLOAT *dest, BLASLONG inc_src)
|
||||
{
|
||||
BLASLONG i;
|
||||
for ( i=0; i<n; i++ )
|
||||
{
|
||||
*dest = *src;
|
||||
dest++;
|
||||
src += inc_src;
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
int CNAME(BLASLONG m, BLASLONG n, BLASLONG dummy1, FLOAT alpha, FLOAT *a, BLASLONG lda, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y, FLOAT *buffer)
|
||||
{
|
||||
BLASLONG i;
|
||||
BLASLONG j;
|
||||
FLOAT *a_ptr;
|
||||
FLOAT *x_ptr;
|
||||
FLOAT *y_ptr;
|
||||
FLOAT *ap[4];
|
||||
BLASLONG n1;
|
||||
BLASLONG m1;
|
||||
BLASLONG m2;
|
||||
BLASLONG n2;
|
||||
FLOAT ybuffer[4],*xbuffer;
|
||||
|
||||
if ( m < 1 ) return(0);
|
||||
if ( n < 1 ) return(0);
|
||||
|
||||
xbuffer = buffer;
|
||||
|
||||
n1 = n / 4 ;
|
||||
n2 = n % 4 ;
|
||||
|
||||
m1 = m - ( m % 16 );
|
||||
m2 = (m % NBMAX) - (m % 16) ;
|
||||
|
||||
|
||||
BLASLONG NB = NBMAX;
|
||||
|
||||
while ( NB == NBMAX )
|
||||
{
|
||||
|
||||
m1 -= NB;
|
||||
if ( m1 < 0)
|
||||
{
|
||||
if ( m2 == 0 ) break;
|
||||
NB = m2;
|
||||
}
|
||||
|
||||
y_ptr = y;
|
||||
a_ptr = a;
|
||||
x_ptr = x;
|
||||
copy_x(NB,x_ptr,xbuffer,inc_x);
|
||||
for( i = 0; i < n1 ; i++)
|
||||
{
|
||||
ap[0] = a_ptr;
|
||||
ap[1] = a_ptr + lda;
|
||||
ap[2] = ap[1] + lda;
|
||||
ap[3] = ap[2] + lda;
|
||||
dgemv_kernel_16x4(NB,ap,xbuffer,ybuffer);
|
||||
a_ptr += 4 * lda;
|
||||
*y_ptr += ybuffer[0]*alpha;
|
||||
y_ptr += inc_y;
|
||||
*y_ptr += ybuffer[1]*alpha;
|
||||
y_ptr += inc_y;
|
||||
*y_ptr += ybuffer[2]*alpha;
|
||||
y_ptr += inc_y;
|
||||
*y_ptr += ybuffer[3]*alpha;
|
||||
y_ptr += inc_y;
|
||||
}
|
||||
|
||||
for( i = 0; i < n2 ; i++)
|
||||
{
|
||||
dgemv_kernel_16x1(NB,a_ptr,xbuffer,ybuffer);
|
||||
a_ptr += 1 * lda;
|
||||
*y_ptr += ybuffer[0]*alpha;
|
||||
y_ptr += inc_y;
|
||||
|
||||
}
|
||||
a += NB;
|
||||
x += NB * inc_x;
|
||||
}
|
||||
|
||||
BLASLONG m3 = m % 16;
|
||||
if ( m3 == 0 ) return(0);
|
||||
x_ptr = x;
|
||||
for ( i=0; i< m3; i++ )
|
||||
{
|
||||
xbuffer[i] = *x_ptr;
|
||||
x_ptr += inc_x;
|
||||
}
|
||||
j=0;
|
||||
a_ptr = a;
|
||||
y_ptr = y;
|
||||
while ( j < n)
|
||||
{
|
||||
FLOAT temp = 0.0;
|
||||
for( i = 0; i < m3; i++ )
|
||||
{
|
||||
temp += a_ptr[i] * xbuffer[i];
|
||||
}
|
||||
a_ptr += lda;
|
||||
y_ptr[0] += alpha * temp;
|
||||
y_ptr += inc_y;
|
||||
j++;
|
||||
}
|
||||
return(0);
|
||||
}
|
||||
|
||||
|
||||
|
|
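The removed driver above computes y += alpha * A^T * x for a column-major A: every element of y accumulates the dot product of one column of A with x. A naive reference such as the sketch below can serve as an oracle when comparing the old and new kernels. `ref_dgemv_t` is a hypothetical name, `double` stands in for FLOAT, and positive strides are assumed.

```c
#include <stddef.h>

/* Naive reference (illustrative only): y[j] += alpha * dot(A[:, j], x)
 * for a column-major m x n matrix with leading dimension lda.
 * Assumes inc_x > 0 and inc_y > 0. */
static void ref_dgemv_t(long m, long n, double alpha,
                        const double *a, long lda,
                        const double *x, long inc_x,
                        double *y, long inc_y)
{
    for (long j = 0; j < n; j++) {
        double temp = 0.0;
        for (long i = 0; i < m; i++)
            temp += a[i + j * lda] * x[i * inc_x];
        y[j * inc_y] += alpha * temp;
    }
}
```

Results from an optimized kernel may differ from this reference in the last bits because the summation order is different, so a comparison should use a tolerance rather than exact equality.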
@ -0,0 +1,615 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
#include "common.h"
|
||||
|
||||
#if defined(HASWELL)
|
||||
#include "dgemv_t_microk_haswell-4.c"
|
||||
#endif
|
||||
|
||||
#define NBMAX 2048
|
||||
|
||||
#ifndef HAVE_KERNEL_4x4
|
||||
|
||||
static void dgemv_kernel_4x4(BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
BLASLONG i;
|
||||
FLOAT *a0,*a1,*a2,*a3;
|
||||
a0 = ap[0];
|
||||
a1 = ap[1];
|
||||
a2 = ap[2];
|
||||
a3 = ap[3];
|
||||
FLOAT temp0 = 0.0;
|
||||
FLOAT temp1 = 0.0;
|
||||
FLOAT temp2 = 0.0;
|
||||
FLOAT temp3 = 0.0;
|
||||
|
||||
for ( i=0; i< n; i+=4 )
|
||||
{
|
||||
temp0 += a0[i]*x[i] + a0[i+1]*x[i+1] + a0[i+2]*x[i+2] + a0[i+3]*x[i+3];
|
||||
temp1 += a1[i]*x[i] + a1[i+1]*x[i+1] + a1[i+2]*x[i+2] + a1[i+3]*x[i+3];
|
||||
temp2 += a2[i]*x[i] + a2[i+1]*x[i+1] + a2[i+2]*x[i+2] + a2[i+3]*x[i+3];
|
||||
temp3 += a3[i]*x[i] + a3[i+1]*x[i+1] + a3[i+2]*x[i+2] + a3[i+3]*x[i+3];
|
||||
}
|
||||
y[0] = temp0;
|
||||
y[1] = temp1;
|
||||
y[2] = temp2;
|
||||
y[3] = temp3;
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
static void dgemv_kernel_4x2(BLASLONG n, FLOAT *ap0, FLOAT *ap1, FLOAT *x, FLOAT *y) __attribute__ ((noinline));
|
||||
|
||||
static void dgemv_kernel_4x2(BLASLONG n, FLOAT *ap0, FLOAT *ap1, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
BLASLONG i;
|
||||
|
||||
i=0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"xorpd %%xmm10 , %%xmm10 \n\t"
|
||||
"xorpd %%xmm11 , %%xmm11 \n\t"
|
||||
|
||||
"testq $2 , %1 \n\t"
|
||||
"jz .L01LABEL%= \n\t"
|
||||
|
||||
"movups (%5,%0,8) , %%xmm14 \n\t" // x
|
||||
"movups (%3,%0,8) , %%xmm12 \n\t" // ap0
|
||||
"movups (%4,%0,8) , %%xmm13 \n\t" // ap1
|
||||
"mulpd %%xmm14 , %%xmm12 \n\t"
|
||||
"mulpd %%xmm14 , %%xmm13 \n\t"
|
||||
"addq $2 , %0 \n\t"
|
||||
"addpd %%xmm12 , %%xmm10 \n\t"
|
||||
"subq $2 , %1 \n\t"
|
||||
"addpd %%xmm13 , %%xmm11 \n\t"
|
||||
|
||||
".L01LABEL%=: \n\t"
|
||||
|
||||
"cmpq $0, %1 \n\t"
|
||||
"je .L01END%= \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
|
||||
"movups (%5,%0,8) , %%xmm14 \n\t" // x
|
||||
"movups (%3,%0,8) , %%xmm12 \n\t" // ap0
|
||||
"movups (%4,%0,8) , %%xmm13 \n\t" // ap1
|
||||
"mulpd %%xmm14 , %%xmm12 \n\t"
|
||||
"mulpd %%xmm14 , %%xmm13 \n\t"
|
||||
"addpd %%xmm12 , %%xmm10 \n\t"
|
||||
"addpd %%xmm13 , %%xmm11 \n\t"
|
||||
|
||||
"movups 16(%5,%0,8) , %%xmm14 \n\t" // x
|
||||
"movups 16(%3,%0,8) , %%xmm12 \n\t" // ap0
|
||||
"movups 16(%4,%0,8) , %%xmm13 \n\t" // ap1
|
||||
"mulpd %%xmm14 , %%xmm12 \n\t"
|
||||
"mulpd %%xmm14 , %%xmm13 \n\t"
|
||||
"addpd %%xmm12 , %%xmm10 \n\t"
|
||||
"addpd %%xmm13 , %%xmm11 \n\t"
|
||||
|
||||
"addq $4 , %0 \n\t"
|
||||
"subq $4 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
".L01END%=: \n\t"
|
||||
|
||||
"haddpd %%xmm10, %%xmm10 \n\t"
|
||||
"haddpd %%xmm11, %%xmm11 \n\t"
|
||||
|
||||
"movsd %%xmm10, (%2) \n\t"
|
||||
"movsd %%xmm11,8(%2) \n\t"
|
||||
|
||||
: // outputs: the asm modifies i and n, so they are read-write operands
|
||||
"+r" (i), // 0
|
||||
"+r" (n) // 1
|
||||
: // inputs
|
||||
"r" (y), // 2
|
||||
"r" (ap0), // 3
|
||||
"r" (ap1), // 4
|
||||
"r" (x) // 5
|
||||
: "cc",
|
||||
"%xmm4", "%xmm5", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
|
||||
}
|
||||
|
||||
static void dgemv_kernel_4x1(BLASLONG n, FLOAT *ap, FLOAT *x, FLOAT *y) __attribute__ ((noinline));
|
||||
|
||||
static void dgemv_kernel_4x1(BLASLONG n, FLOAT *ap, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
BLASLONG i;
|
||||
|
||||
i=0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"xorpd %%xmm9 , %%xmm9 \n\t"
|
||||
"xorpd %%xmm10 , %%xmm10 \n\t"
|
||||
|
||||
"testq $2 , %1 \n\t"
|
||||
"jz .L01LABEL%= \n\t"
|
||||
|
||||
"movups (%3,%0,8) , %%xmm12 \n\t"
|
||||
"movups (%4,%0,8) , %%xmm11 \n\t"
|
||||
"mulpd %%xmm11 , %%xmm12 \n\t"
|
||||
"addq $2 , %0 \n\t"
|
||||
"addpd %%xmm12 , %%xmm10 \n\t"
|
||||
"subq $2 , %1 \n\t"
|
||||
|
||||
".L01LABEL%=: \n\t"
|
||||
|
||||
"cmpq $0, %1 \n\t"
|
||||
"je .L01END%= \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
|
||||
"movups (%3,%0,8) , %%xmm12 \n\t"
|
||||
"movups 16(%3,%0,8) , %%xmm14 \n\t"
|
||||
"movups (%4,%0,8) , %%xmm11 \n\t"
|
||||
"movups 16(%4,%0,8) , %%xmm13 \n\t"
|
||||
"mulpd %%xmm11 , %%xmm12 \n\t"
|
||||
"mulpd %%xmm13 , %%xmm14 \n\t"
|
||||
"addq $4 , %0 \n\t"
|
||||
"addpd %%xmm12 , %%xmm10 \n\t"
|
||||
"subq $4 , %1 \n\t"
|
||||
"addpd %%xmm14 , %%xmm9 \n\t"
|
||||
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
".L01END%=: \n\t"
|
||||
|
||||
"addpd %%xmm9 , %%xmm10 \n\t"
|
||||
"haddpd %%xmm10, %%xmm10 \n\t"
|
||||
|
||||
"movsd %%xmm10, (%2) \n\t"
|
||||
|
||||
: // outputs: the asm modifies i and n, so they are read-write operands
|
||||
"+r" (i), // 0
|
||||
"+r" (n) // 1
|
||||
: // inputs
|
||||
"r" (y), // 2
|
||||
"r" (ap), // 3
|
||||
"r" (x) // 4
|
||||
: "cc",
|
||||
"%xmm9", "%xmm10" ,
|
||||
"%xmm11", "%xmm12", "%xmm13", "%xmm14",
|
||||
"memory"
|
||||
);
|
||||
|
||||
|
||||
}
|
||||
|
||||
/* Gather n elements of a strided vector (stride inc_src) into the contiguous buffer dest. */
static void copy_x(BLASLONG n, FLOAT *src, FLOAT *dest, BLASLONG inc_src)
|
||||
{
|
||||
BLASLONG i;
|
||||
for ( i=0; i<n; i++ )
|
||||
{
|
||||
*dest = *src;
|
||||
dest++;
|
||||
src += inc_src;
|
||||
}
|
||||
}
|
||||
|
||||
static void add_y(BLASLONG n, FLOAT da , FLOAT *src, FLOAT *dest, BLASLONG inc_dest) __attribute__ ((noinline));
|
||||
|
||||
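/* dest[i*inc_dest] += da * src[i]. The SSE2 path (inc_dest == 1) handles two elements per iteration, so n must be positive and even. */
static void add_y(BLASLONG n, FLOAT da , FLOAT *src, FLOAT *dest, BLASLONG inc_dest)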
|
||||
{
|
||||
|
||||
BLASLONG i;
|
||||
|
||||
if ( inc_dest != 1 )
|
||||
{
|
||||
for ( i=0; i<n; i++ )
|
||||
{
|
||||
*dest += src[i] * da;
|
||||
dest += inc_dest;
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
i=0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movsd (%2) , %%xmm10 \n\t"
|
||||
"shufpd $0 , %%xmm10 , %%xmm10 \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
|
||||
"movups (%3,%0,8) , %%xmm12 \n\t"
|
||||
"movups (%4,%0,8) , %%xmm11 \n\t"
|
||||
"mulpd %%xmm10 , %%xmm12 \n\t"
|
||||
"addq $2 , %0 \n\t"
|
||||
"addpd %%xmm12 , %%xmm11 \n\t"
|
||||
"subq $2 , %1 \n\t"
|
||||
"movups %%xmm11, -16(%4,%0,8) \n\t"
|
||||
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
: // outputs: the asm modifies i and n, so they are read-write operands
|
||||
"+r" (i), // 0
|
||||
"+r" (n) // 1
|
||||
: // inputs
|
||||
"r" (&da), // 2
|
||||
"r" (src), // 3
|
||||
"r" (dest) // 4
|
||||
: "cc",
|
||||
"%xmm10", "%xmm11", "%xmm12",
|
||||
"memory"
|
||||
);
|
||||
|
||||
|
||||
}
|
||||
|
||||
int CNAME(BLASLONG m, BLASLONG n, BLASLONG dummy1, FLOAT alpha, FLOAT *a, BLASLONG lda, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y, FLOAT *buffer)
|
||||
{
|
||||
BLASLONG register i;
|
||||
BLASLONG register j;
|
||||
FLOAT *a_ptr;
|
||||
FLOAT *x_ptr;
|
||||
FLOAT *y_ptr;
|
||||
BLASLONG n0;
|
||||
BLASLONG n1;
|
||||
BLASLONG m1;
|
||||
BLASLONG m2;
|
||||
BLASLONG m3;
|
||||
BLASLONG n2;
|
||||
FLOAT ybuffer[4],*xbuffer;
|
||||
FLOAT *ytemp;
|
||||
|
||||
if ( m < 1 ) return(0);
|
||||
if ( n < 1 ) return(0);
|
||||
|
||||
xbuffer = buffer;
|
||||
ytemp = buffer + NBMAX;
|
||||
|
||||
n0 = n / NBMAX;
|
||||
n1 = (n % NBMAX) >> 2 ;
|
||||
n2 = n & 3 ;
|
||||
|
||||
m3 = m & 3 ;
|
||||
m1 = m & -4 ;
|
||||
m2 = (m & (NBMAX-1)) - m3 ;
|
||||
|
||||
|
||||
BLASLONG NB = NBMAX;
|
||||
|
||||
while ( NB == NBMAX )
|
||||
{
|
||||
|
||||
m1 -= NB;
|
||||
if ( m1 < 0)
|
||||
{
|
||||
if ( m2 == 0 ) break;
|
||||
NB = m2;
|
||||
}
|
||||
|
||||
y_ptr = y;
|
||||
a_ptr = a;
|
||||
x_ptr = x;
|
||||
|
||||
if ( inc_x == 1 )
|
||||
xbuffer = x_ptr;
|
||||
else
|
||||
copy_x(NB,x_ptr,xbuffer,inc_x);
|
||||
|
||||
|
||||
FLOAT *ap[4];
|
||||
FLOAT *yp;
|
||||
BLASLONG register lda4 = 4 * lda;
|
||||
ap[0] = a_ptr;
|
||||
ap[1] = a_ptr + lda;
|
||||
ap[2] = ap[1] + lda;
|
||||
ap[3] = ap[2] + lda;
|
||||
|
||||
if ( n0 > 0 )
|
||||
{
|
||||
BLASLONG nb1 = NBMAX / 4;
|
||||
for( j=0; j<n0; j++)
|
||||
{
|
||||
|
||||
yp = ytemp;
|
||||
for( i = 0; i < nb1 ; i++)
|
||||
{
|
||||
dgemv_kernel_4x4(NB,ap,xbuffer,yp);
|
||||
ap[0] += lda4 ;
|
||||
ap[1] += lda4 ;
|
||||
ap[2] += lda4 ;
|
||||
ap[3] += lda4 ;
|
||||
yp += 4;
|
||||
}
|
||||
add_y(nb1*4, alpha, ytemp, y_ptr, inc_y );
|
||||
y_ptr += nb1 * inc_y * 4;
|
||||
a_ptr += nb1 * lda4 ;
|
||||
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
|
||||
yp = ytemp;
|
||||
|
||||
for( i = 0; i < n1 ; i++)
|
||||
{
|
||||
dgemv_kernel_4x4(NB,ap,xbuffer,yp);
|
||||
ap[0] += lda4 ;
|
||||
ap[1] += lda4 ;
|
||||
ap[2] += lda4 ;
|
||||
ap[3] += lda4 ;
|
||||
yp += 4;
|
||||
}
|
||||
if ( n1 > 0 )
|
||||
{
|
||||
add_y(n1*4, alpha, ytemp, y_ptr, inc_y );
|
||||
y_ptr += n1 * inc_y * 4;
|
||||
a_ptr += n1 * lda4 ;
|
||||
}
|
||||
|
||||
if ( n2 & 2 )
|
||||
{
|
||||
|
||||
dgemv_kernel_4x2(NB,ap[0],ap[1],xbuffer,ybuffer);
|
||||
a_ptr += lda * 2;
|
||||
*y_ptr += ybuffer[0] * alpha;
|
||||
y_ptr += inc_y;
|
||||
*y_ptr += ybuffer[1] * alpha;
|
||||
y_ptr += inc_y;
|
||||
|
||||
}
|
||||
|
||||
if ( n2 & 1 )
|
||||
{
|
||||
|
||||
dgemv_kernel_4x1(NB,a_ptr,xbuffer,ybuffer);
|
||||
a_ptr += lda;
|
||||
*y_ptr += ybuffer[0] * alpha;
|
||||
y_ptr += inc_y;
|
||||
|
||||
}
|
||||
a += NB;
|
||||
x += NB * inc_x;
|
||||
}
|
||||
|
||||
if ( m3 == 0 ) return(0);
|
||||
|
||||
x_ptr = x;
|
||||
a_ptr = a;
|
||||
if ( m3 == 3 )
|
||||
{
|
||||
FLOAT xtemp0 = *x_ptr * alpha;
|
||||
x_ptr += inc_x;
|
||||
FLOAT xtemp1 = *x_ptr * alpha;
|
||||
x_ptr += inc_x;
|
||||
FLOAT xtemp2 = *x_ptr * alpha;
|
||||
|
||||
FLOAT *aj = a_ptr;
|
||||
y_ptr = y;
|
||||
|
||||
if ( lda == 3 && inc_y == 1 )
|
||||
{
|
||||
|
||||
for ( j=0; j< ( n & -4) ; j+=4 )
|
||||
{
|
||||
|
||||
y_ptr[j] += aj[0] * xtemp0 + aj[1] * xtemp1 + aj[2] * xtemp2;
|
||||
y_ptr[j+1] += aj[3] * xtemp0 + aj[4] * xtemp1 + aj[5] * xtemp2;
|
||||
y_ptr[j+2] += aj[6] * xtemp0 + aj[7] * xtemp1 + aj[8] * xtemp2;
|
||||
y_ptr[j+3] += aj[9] * xtemp0 + aj[10] * xtemp1 + aj[11] * xtemp2;
|
||||
aj += 12;
|
||||
}
|
||||
|
||||
for ( ; j<n; j++ )
|
||||
{
|
||||
y_ptr[j] += aj[0] * xtemp0 + aj[1] * xtemp1 + aj[2] * xtemp2;
|
||||
aj += 3;
|
||||
}
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
|
||||
if ( inc_y == 1 )
|
||||
{
|
||||
|
||||
BLASLONG register lda2 = lda << 1;
|
||||
BLASLONG register lda4 = lda << 2;
|
||||
BLASLONG register lda3 = lda2 + lda;
|
||||
|
||||
for ( j=0; j< ( n & -4 ); j+=4 )
|
||||
{
|
||||
|
||||
y_ptr[j] += *aj * xtemp0 + *(aj+1) * xtemp1 + *(aj+2) * xtemp2;
|
||||
y_ptr[j+1] += *(aj+lda) * xtemp0 + *(aj+lda+1) * xtemp1 + *(aj+lda+2) * xtemp2;
|
||||
y_ptr[j+2] += *(aj+lda2) * xtemp0 + *(aj+lda2+1) * xtemp1 + *(aj+lda2+2) * xtemp2;
|
||||
y_ptr[j+3] += *(aj+lda3) * xtemp0 + *(aj+lda3+1) * xtemp1 + *(aj+lda3+2) * xtemp2;
|
||||
aj += lda4;
|
||||
}
|
||||
|
||||
for ( ; j< n ; j++ )
|
||||
{
|
||||
|
||||
y_ptr[j] += *aj * xtemp0 + *(aj+1) * xtemp1 + *(aj+2) * xtemp2 ;
|
||||
aj += lda;
|
||||
}
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
|
||||
for ( j=0; j<n; j++ )
|
||||
{
|
||||
*y_ptr += *aj * xtemp0 + *(aj+1) * xtemp1 + *(aj+2) * xtemp2;
|
||||
y_ptr += inc_y;
|
||||
aj += lda;
|
||||
}
|
||||
|
||||
|
||||
}
|
||||
|
||||
}
|
||||
return(0);
|
||||
}
|
||||
|
||||
if ( m3 == 2 )
|
||||
{
|
||||
FLOAT xtemp0 = *x_ptr * alpha;
|
||||
x_ptr += inc_x;
|
||||
FLOAT xtemp1 = *x_ptr * alpha;
|
||||
|
||||
FLOAT *aj = a_ptr;
|
||||
y_ptr = y;
|
||||
|
||||
if ( lda == 2 && inc_y == 1 )
|
||||
{
|
||||
|
||||
for ( j=0; j< ( n & -4) ; j+=4 )
|
||||
{
|
||||
y_ptr[j] += aj[0] * xtemp0 + aj[1] * xtemp1 ;
|
||||
y_ptr[j+1] += aj[2] * xtemp0 + aj[3] * xtemp1 ;
|
||||
y_ptr[j+2] += aj[4] * xtemp0 + aj[5] * xtemp1 ;
|
||||
y_ptr[j+3] += aj[6] * xtemp0 + aj[7] * xtemp1 ;
|
||||
aj += 8;
|
||||
|
||||
}
|
||||
|
||||
for ( ; j<n; j++ )
|
||||
{
|
||||
y_ptr[j] += aj[0] * xtemp0 + aj[1] * xtemp1 ;
|
||||
aj += 2;
|
||||
}
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
if ( inc_y == 1 )
|
||||
{
|
||||
|
||||
BLASLONG register lda2 = lda << 1;
|
||||
BLASLONG register lda4 = lda << 2;
|
||||
BLASLONG register lda3 = lda2 + lda;
|
||||
|
||||
for ( j=0; j< ( n & -4 ); j+=4 )
|
||||
{
|
||||
|
||||
y_ptr[j] += *aj * xtemp0 + *(aj+1) * xtemp1 ;
|
||||
y_ptr[j+1] += *(aj+lda) * xtemp0 + *(aj+lda+1) * xtemp1 ;
|
||||
y_ptr[j+2] += *(aj+lda2) * xtemp0 + *(aj+lda2+1) * xtemp1 ;
|
||||
y_ptr[j+3] += *(aj+lda3) * xtemp0 + *(aj+lda3+1) * xtemp1 ;
|
||||
aj += lda4;
|
||||
}
|
||||
|
||||
for ( ; j< n ; j++ )
|
||||
{
|
||||
|
||||
y_ptr[j] += *aj * xtemp0 + *(aj+1) * xtemp1 ;
|
||||
aj += lda;
|
||||
}
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
for ( j=0; j<n; j++ )
|
||||
{
|
||||
*y_ptr += *aj * xtemp0 + *(aj+1) * xtemp1 ;
|
||||
y_ptr += inc_y;
|
||||
aj += lda;
|
||||
}
|
||||
}
|
||||
|
||||
}
|
||||
return(0);
|
||||
|
||||
}
|
||||
|
||||
FLOAT xtemp = *x_ptr * alpha;
|
||||
FLOAT *aj = a_ptr;
|
||||
y_ptr = y;
|
||||
if ( lda == 1 && inc_y == 1 )
|
||||
{
|
||||
for ( j=0; j< ( n & -4) ; j+=4 )
|
||||
{
|
||||
y_ptr[j] += aj[j] * xtemp;
|
||||
y_ptr[j+1] += aj[j+1] * xtemp;
|
||||
y_ptr[j+2] += aj[j+2] * xtemp;
|
||||
y_ptr[j+3] += aj[j+3] * xtemp;
|
||||
}
|
||||
for ( ; j<n ; j++ )
|
||||
{
|
||||
y_ptr[j] += aj[j] * xtemp;
|
||||
}
|
||||
|
||||
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
if ( inc_y == 1 )
|
||||
{
|
||||
|
||||
BLASLONG register lda2 = lda << 1;
|
||||
BLASLONG register lda4 = lda << 2;
|
||||
BLASLONG register lda3 = lda2 + lda;
|
||||
for ( j=0; j< ( n & -4 ); j+=4 )
|
||||
{
|
||||
y_ptr[j] += *aj * xtemp;
|
||||
y_ptr[j+1] += *(aj+lda) * xtemp;
|
||||
y_ptr[j+2] += *(aj+lda2) * xtemp;
|
||||
y_ptr[j+3] += *(aj+lda3) * xtemp;
|
||||
aj += lda4 ;
|
||||
}
|
||||
|
||||
for ( ; j<n; j++ )
|
||||
{
|
||||
y_ptr[j] += *aj * xtemp;
|
||||
aj += lda;
|
||||
}
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
for ( j=0; j<n; j++ )
|
||||
{
|
||||
*y_ptr += *aj * xtemp;
|
||||
y_ptr += inc_y;
|
||||
aj += lda;
|
||||
}
|
||||
|
||||
}
|
||||
}
|
||||
|
||||
return(0);
|
||||
}
|
||||
|
||||
|
||||
|
|
@ -25,10 +25,10 @@ OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
|||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#define HAVE_KERNEL_16x4 1
|
||||
static void dgemv_kernel_16x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y) __attribute__ ((noinline));
|
||||
#define HAVE_KERNEL_4x4 1
|
||||
static void dgemv_kernel_4x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y) __attribute__ ((noinline));
|
||||
|
||||
static void dgemv_kernel_16x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y)
|
||||
static void dgemv_kernel_4x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
|
@ -41,29 +41,49 @@ static void dgemv_kernel_16x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y)
|
|||
"vxorpd %%ymm6 , %%ymm6, %%ymm6 \n\t"
|
||||
"vxorpd %%ymm7 , %%ymm7, %%ymm7 \n\t"
|
||||
|
||||
"testq $0x04, %1 \n\t"
|
||||
"jz .L08LABEL%= \n\t"
|
||||
|
||||
"vmovups (%2,%0,8), %%ymm12 \n\t" // 4 * x
|
||||
|
||||
"vfmadd231pd (%4,%0,8), %%ymm12, %%ymm4 \n\t"
|
||||
"vfmadd231pd (%5,%0,8), %%ymm12, %%ymm5 \n\t"
|
||||
"vfmadd231pd (%6,%0,8), %%ymm12, %%ymm6 \n\t"
|
||||
"vfmadd231pd (%7,%0,8), %%ymm12, %%ymm7 \n\t"
|
||||
|
||||
"addq $4 , %0 \n\t"
|
||||
"subq $4 , %1 \n\t"
|
||||
|
||||
".L08LABEL%=: \n\t"
|
||||
|
||||
"cmpq $0, %1 \n\t"
|
||||
"je .L16END%= \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"prefetcht0 384(%2,%0,8) \n\t"
|
||||
// "prefetcht0 384(%2,%0,8) \n\t"
|
||||
"vmovups (%2,%0,8), %%ymm12 \n\t" // 4 * x
|
||||
"vmovups 32(%2,%0,8), %%ymm13 \n\t" // 4 * x
|
||||
|
||||
"prefetcht0 384(%4,%0,8) \n\t"
|
||||
// "prefetcht0 384(%4,%0,8) \n\t"
|
||||
"vfmadd231pd (%4,%0,8), %%ymm12, %%ymm4 \n\t"
|
||||
"vfmadd231pd (%5,%0,8), %%ymm12, %%ymm5 \n\t"
|
||||
"prefetcht0 384(%5,%0,8) \n\t"
|
||||
"vfmadd231pd 32(%4,%0,8), %%ymm13, %%ymm4 \n\t"
|
||||
"vfmadd231pd 32(%5,%0,8), %%ymm13, %%ymm5 \n\t"
|
||||
"prefetcht0 384(%6,%0,8) \n\t"
|
||||
// "prefetcht0 384(%5,%0,8) \n\t"
|
||||
"vfmadd231pd (%6,%0,8), %%ymm12, %%ymm6 \n\t"
|
||||
"vfmadd231pd (%7,%0,8), %%ymm12, %%ymm7 \n\t"
|
||||
"prefetcht0 384(%7,%0,8) \n\t"
|
||||
"vfmadd231pd 32(%6,%0,8), %%ymm13, %%ymm6 \n\t"
|
||||
"vfmadd231pd 32(%7,%0,8), %%ymm13, %%ymm7 \n\t"
|
||||
// "prefetcht0 384(%6,%0,8) \n\t"
|
||||
"vfmadd231pd 32(%4,%0,8), %%ymm13, %%ymm4 \n\t"
|
||||
"vfmadd231pd 32(%5,%0,8), %%ymm13, %%ymm5 \n\t"
|
||||
"addq $8 , %0 \n\t"
|
||||
// "prefetcht0 384(%7,%0,8) \n\t"
|
||||
"vfmadd231pd -32(%6,%0,8), %%ymm13, %%ymm6 \n\t"
|
||||
"subq $8 , %1 \n\t"
|
||||
"vfmadd231pd -32(%7,%0,8), %%ymm13, %%ymm7 \n\t"
|
||||
|
||||
"addq $8 , %0 \n\t"
|
||||
"subq $8 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
".L16END%=: \n\t"
|
||||
|
||||
"vextractf128 $1 , %%ymm4, %%xmm12 \n\t"
|
||||
"vextractf128 $1 , %%ymm5, %%xmm13 \n\t"
|
||||
"vextractf128 $1 , %%ymm6, %%xmm14 \n\t"
|
||||
|
|
@ -0,0 +1,299 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2013, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
#include "common.h"
|
||||
|
||||
#if defined(BULLDOZER)
|
||||
#include "dsymv_L_microk_bulldozer-2.c"
|
||||
#elif defined(NEHALEM)
|
||||
#include "dsymv_L_microk_nehalem-2.c"
|
||||
#endif
|
||||
|
||||
|
||||
#ifndef HAVE_KERNEL_4x4
|
||||
|
||||
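/* Reference C kernel for rows [from, to) of four columns: y[i] += tmp1[k] * ap[k][i] and temp2[k] += ap[k][i] * x[i]; (to - from) must be a multiple of 4. */
static void dsymv_kernel_4x4(BLASLONG from, BLASLONG to, FLOAT **ap, FLOAT *x, FLOAT *y, FLOAT *tmp1, FLOAT *temp2)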
|
||||
{
|
||||
FLOAT tmp2[4] = { 0.0, 0.0, 0.0, 0.0 };
|
||||
BLASLONG i;
|
||||
|
||||
for (i=from; i<to; i+=4)
|
||||
{
|
||||
|
||||
y[i] += tmp1[0] * ap[0][i];
|
||||
tmp2[0] += ap[0][i] * x[i];
|
||||
y[i] += tmp1[1] * ap[1][i];
|
||||
tmp2[1] += ap[1][i] * x[i];
|
||||
y[i] += tmp1[2] * ap[2][i];
|
||||
tmp2[2] += ap[2][i] * x[i];
|
||||
y[i] += tmp1[3] * ap[3][i];
|
||||
tmp2[3] += ap[3][i] * x[i];
|
||||
|
||||
y[i+1] += tmp1[0] * ap[0][i+1];
|
||||
tmp2[0] += ap[0][i+1] * x[i+1];
|
||||
y[i+1] += tmp1[1] * ap[1][i+1];
|
||||
tmp2[1] += ap[1][i+1] * x[i+1];
|
||||
y[i+1] += tmp1[2] * ap[2][i+1];
|
||||
tmp2[2] += ap[2][i+1] * x[i+1];
|
||||
y[i+1] += tmp1[3] * ap[3][i+1];
|
||||
tmp2[3] += ap[3][i+1] * x[i+1];
|
||||
|
||||
y[i+2] += tmp1[0] * ap[0][i+2];
|
||||
tmp2[0] += ap[0][i+2] * x[i+2];
|
||||
y[i+2] += tmp1[1] * ap[1][i+2];
|
||||
tmp2[1] += ap[1][i+2] * x[i+2];
|
||||
y[i+2] += tmp1[2] * ap[2][i+2];
|
||||
tmp2[2] += ap[2][i+2] * x[i+2];
|
||||
y[i+2] += tmp1[3] * ap[3][i+2];
|
||||
tmp2[3] += ap[3][i+2] * x[i+2];
|
||||
|
||||
y[i+3] += tmp1[0] * ap[0][i+3];
|
||||
tmp2[0] += ap[0][i+3] * x[i+3];
|
||||
y[i+3] += tmp1[1] * ap[1][i+3];
|
||||
tmp2[1] += ap[1][i+3] * x[i+3];
|
||||
y[i+3] += tmp1[2] * ap[2][i+3];
|
||||
tmp2[2] += ap[2][i+3] * x[i+3];
|
||||
y[i+3] += tmp1[3] * ap[3][i+3];
|
||||
tmp2[3] += ap[3][i+3] * x[i+3];
|
||||
|
||||
}
|
||||
|
||||
temp2[0] += tmp2[0];
|
||||
temp2[1] += tmp2[1];
|
||||
temp2[2] += tmp2[2];
|
||||
temp2[3] += tmp2[3];
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
|
||||
|
||||
|
||||
int CNAME(BLASLONG m, BLASLONG offset, FLOAT alpha, FLOAT *a, BLASLONG lda, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y, FLOAT *buffer)
|
||||
{
|
||||
BLASLONG i;
|
||||
BLASLONG ix,iy;
|
||||
BLASLONG jx,jy;
|
||||
BLASLONG j;
|
||||
FLOAT temp1;
|
||||
FLOAT temp2;
|
||||
FLOAT tmp1[4];
|
||||
FLOAT tmp2[4];
|
||||
FLOAT *ap[4];
|
||||
|
||||
#if 0
|
||||
if ( m != offset )
|
||||
printf("Symv_L: m=%ld offset=%ld\n",(long)m,(long)offset);
|
||||
#endif
|
||||
|
||||
|
||||
if ( (inc_x != 1) || (inc_y != 1) )
|
||||
{
|
||||
|
||||
jx = 0;
|
||||
jy = 0;
|
||||
|
||||
for (j=0; j<offset; j++)
|
||||
{
|
||||
temp1 = alpha * x[jx];
|
||||
temp2 = 0.0;
|
||||
y[jy] += temp1 * a[j*lda+j];
|
||||
iy = jy;
|
||||
ix = jx;
|
||||
for (i=j+1; i<m; i++)
|
||||
{
|
||||
ix += inc_x;
|
||||
iy += inc_y;
|
||||
y[iy] += temp1 * a[j*lda+i];
|
||||
temp2 += a[j*lda+i] * x[ix];
|
||||
|
||||
}
|
||||
y[jy] += alpha * temp2;
|
||||
jx += inc_x;
|
||||
jy += inc_y;
|
||||
}
|
||||
return(0);
|
||||
}
|
||||
|
||||
BLASLONG offset1 = (offset/4)*4;
|
||||
|
||||
for (j=0; j<offset1; j+=4)
|
||||
{
|
||||
tmp1[0] = alpha * x[j];
|
||||
tmp1[1] = alpha * x[j+1];
|
||||
tmp1[2] = alpha * x[j+2];
|
||||
tmp1[3] = alpha * x[j+3];
|
||||
tmp2[0] = 0.0;
|
||||
tmp2[1] = 0.0;
|
||||
tmp2[2] = 0.0;
|
||||
tmp2[3] = 0.0;
|
||||
ap[0] = &a[j*lda];
|
||||
ap[1] = ap[0] + lda;
|
||||
ap[2] = ap[1] + lda;
|
||||
ap[3] = ap[2] + lda;
|
||||
y[j] += tmp1[0] * ap[0][j];
|
||||
y[j+1] += tmp1[1] * ap[1][j+1];
|
||||
y[j+2] += tmp1[2] * ap[2][j+2];
|
||||
y[j+3] += tmp1[3] * ap[3][j+3];
|
||||
BLASLONG from = j+1;
|
||||
if ( m - from >=12 )
|
||||
{
|
||||
BLASLONG m2 = (m/4)*4;
|
||||
for (i=j+1; i<j+4; i++)
|
||||
{
|
||||
y[i] += tmp1[0] * ap[0][i];
|
||||
tmp2[0] += ap[0][i] * x[i];
|
||||
}
|
||||
|
||||
for (i=j+2; i<j+4; i++)
|
||||
{
|
||||
y[i] += tmp1[1] * ap[1][i];
|
||||
tmp2[1] += ap[1][i] * x[i];
|
||||
}
|
||||
|
||||
for (i=j+3; i<j+4; i++)
|
||||
{
|
||||
y[i] += tmp1[2] * ap[2][i];
|
||||
tmp2[2] += ap[2][i] * x[i];
|
||||
}
|
||||
|
||||
if ( m2 > j+4 )
|
||||
dsymv_kernel_4x4(j+4,m2,ap,x,y,tmp1,tmp2);
|
||||
|
||||
|
||||
for (i=m2; i<m; i++)
|
||||
{
|
||||
y[i] += tmp1[0] * ap[0][i];
|
||||
tmp2[0] += ap[0][i] * x[i];
|
||||
|
||||
y[i] += tmp1[1] * ap[1][i];
|
||||
tmp2[1] += ap[1][i] * x[i];
|
||||
|
||||
y[i] += tmp1[2] * ap[2][i];
|
||||
tmp2[2] += ap[2][i] * x[i];
|
||||
|
||||
y[i] += tmp1[3] * ap[3][i];
|
||||
tmp2[3] += ap[3][i] * x[i];
|
||||
|
||||
}
|
||||
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
|
||||
for (i=j+1; i<j+4; i++)
|
||||
{
|
||||
y[i] += tmp1[0] * ap[0][i];
|
||||
tmp2[0] += ap[0][i] * x[i];
|
||||
}
|
||||
|
||||
for (i=j+2; i<j+4; i++)
|
||||
{
|
||||
y[i] += tmp1[1] * ap[1][i];
|
||||
tmp2[1] += ap[1][i] * x[i];
|
||||
}
|
||||
|
||||
for (i=j+3; i<j+4; i++)
|
||||
{
|
||||
y[i] += tmp1[2] * ap[2][i];
|
||||
tmp2[2] += ap[2][i] * x[i];
|
||||
}
|
||||
|
||||
for (i=j+4; i<m; i++)
|
||||
{
|
||||
y[i] += tmp1[0] * ap[0][i];
|
||||
tmp2[0] += ap[0][i] * x[i];
|
||||
|
||||
y[i] += tmp1[1] * ap[1][i];
|
||||
tmp2[1] += ap[1][i] * x[i];
|
||||
|
||||
y[i] += tmp1[2] * ap[2][i];
|
||||
tmp2[2] += ap[2][i] * x[i];
|
||||
|
||||
y[i] += tmp1[3] * ap[3][i];
|
||||
tmp2[3] += ap[3][i] * x[i];
|
||||
|
||||
}
|
||||
|
||||
}
|
||||
y[j] += alpha * tmp2[0];
|
||||
y[j+1] += alpha * tmp2[1];
|
||||
y[j+2] += alpha * tmp2[2];
|
||||
y[j+3] += alpha * tmp2[3];
|
||||
}
|
||||
|
||||
|
||||
for (j=offset1; j<offset; j++)
|
||||
{
|
||||
temp1 = alpha * x[j];
|
||||
temp2 = 0.0;
|
||||
y[j] += temp1 * a[j*lda+j];
|
||||
BLASLONG from = j+1;
|
||||
if ( m - from >=8 )
|
||||
{
|
||||
BLASLONG j1 = ((from + 4)/4)*4;
|
||||
BLASLONG j2 = (m/4)*4;
|
||||
for (i=from; i<j1; i++)
|
||||
{
|
||||
y[i] += temp1 * a[j*lda+i];
|
||||
temp2 += a[j*lda+i] * x[i];
|
||||
|
||||
}
|
||||
|
||||
for (i=j1; i<j2; i++)
|
||||
{
|
||||
y[i] += temp1 * a[j*lda+i];
|
||||
temp2 += a[j*lda+i] * x[i];
|
||||
|
||||
}
|
||||
|
||||
for (i=j2; i<m; i++)
|
||||
{
|
||||
y[i] += temp1 * a[j*lda+i];
|
||||
temp2 += a[j*lda+i] * x[i];
|
||||
|
||||
}
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
for (i=from; i<m; i++)
|
||||
{
|
||||
y[i] += temp1 * a[j*lda+i];
|
||||
temp2 += a[j*lda+i] * x[i];
|
||||
|
||||
}
|
||||
|
||||
}
|
||||
y[j] += alpha * temp2;
|
||||
}
|
||||
return(0);
|
||||
}
|
||||
|
||||
|
||||
|
|
@ -0,0 +1,137 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#define HAVE_KERNEL_4x4 1
|
||||
static void dsymv_kernel_4x4( BLASLONG from, BLASLONG to, FLOAT **a, FLOAT *x, FLOAT *y, FLOAT *temp1, FLOAT *temp2) __attribute__ ((noinline));
|
||||
|
||||
static void dsymv_kernel_4x4(BLASLONG from, BLASLONG to, FLOAT **a, FLOAT *x, FLOAT *y, FLOAT *temp1, FLOAT *temp2)
|
||||
{
|
||||
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"vxorpd %%xmm0 , %%xmm0 , %%xmm0 \n\t" // temp2[0]
|
||||
"vxorpd %%xmm1 , %%xmm1 , %%xmm1 \n\t" // temp2[1]
|
||||
"vxorpd %%xmm2 , %%xmm2 , %%xmm2 \n\t" // temp2[2]
|
||||
"vxorpd %%xmm3 , %%xmm3 , %%xmm3 \n\t" // temp2[3]
|
||||
"vmovddup (%8), %%xmm4 \n\t" // temp1[0]
|
||||
"vmovddup 8(%8), %%xmm5 \n\t" // temp1[1]
|
||||
"vmovddup 16(%8), %%xmm6 \n\t" // temp1[2]
|
||||
"vmovddup 24(%8), %%xmm7 \n\t" // temp1[3]
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
|
||||
"vmovups (%4,%0,8), %%xmm12 \n\t" // 2 * a
|
||||
"vmovups (%2,%0,8), %%xmm8 \n\t" // 2 * x
|
||||
"vmovups (%3,%0,8), %%xmm9 \n\t" // 2 * y
|
||||
|
||||
"vmovups (%5,%0,8), %%xmm13 \n\t" // 2 * a
|
||||
|
||||
"vfmaddpd %%xmm0 , %%xmm8, %%xmm12 , %%xmm0 \n\t" // temp2 += x * a
|
||||
"vfmaddpd %%xmm9 , %%xmm4, %%xmm12 , %%xmm9 \n\t" // y += temp1 * a
|
||||
"vmovups (%6,%0,8), %%xmm14 \n\t" // 2 * a
|
||||
|
||||
"vfmaddpd %%xmm1 , %%xmm8, %%xmm13 , %%xmm1 \n\t" // temp2 += x * a
|
||||
"vfmaddpd %%xmm9 , %%xmm5, %%xmm13 , %%xmm9 \n\t" // y += temp1 * a
|
||||
"vmovups (%7,%0,8), %%xmm15 \n\t" // 2 * a
|
||||
|
||||
"vmovups 16(%3,%0,8), %%xmm11 \n\t" // 2 * y
|
||||
"vfmaddpd %%xmm2 , %%xmm8, %%xmm14 , %%xmm2 \n\t" // temp2 += x * a
|
||||
"vmovups 16(%4,%0,8), %%xmm12 \n\t" // 2 * a
|
||||
"vfmaddpd %%xmm9 , %%xmm6, %%xmm14 , %%xmm9 \n\t" // y += temp1 * a
|
||||
"vmovups 16(%2,%0,8), %%xmm10 \n\t" // 2 * x
|
||||
|
||||
"vfmaddpd %%xmm3 , %%xmm8, %%xmm15 , %%xmm3 \n\t" // temp2 += x * a
|
||||
"vfmaddpd %%xmm9 , %%xmm7, %%xmm15 , %%xmm9 \n\t" // y += temp1 * a
|
||||
|
||||
"vmovups 16(%5,%0,8), %%xmm13 \n\t" // 2 * a
|
||||
"vmovups 16(%6,%0,8), %%xmm14 \n\t" // 2 * a
|
||||
|
||||
"vfmaddpd %%xmm0 , %%xmm10, %%xmm12 , %%xmm0 \n\t" // temp2 += x * a
|
||||
"vfmaddpd %%xmm11 , %%xmm4, %%xmm12 , %%xmm11 \n\t" // y += temp1 * a
|
||||
|
||||
"vmovups 16(%7,%0,8), %%xmm15 \n\t" // 2 * a
|
||||
"vfmaddpd %%xmm1 , %%xmm10, %%xmm13 , %%xmm1 \n\t" // temp2 += x * a
|
||||
"vfmaddpd %%xmm11 , %%xmm5, %%xmm13 , %%xmm11 \n\t" // y += temp1 * a
|
||||
|
||||
"vfmaddpd %%xmm2 , %%xmm10, %%xmm14 , %%xmm2 \n\t" // temp2 += x * a
|
||||
"vfmaddpd %%xmm11 , %%xmm6, %%xmm14 , %%xmm11 \n\t" // y += temp1 * a
|
||||
|
||||
"vfmaddpd %%xmm3 , %%xmm10, %%xmm15 , %%xmm3 \n\t" // temp2 += x * a
|
||||
"vfmaddpd %%xmm11 , %%xmm7, %%xmm15 , %%xmm11 \n\t" // y += temp1 * a
|
||||
"addq $4 , %0 \n\t"
|
||||
|
||||
"vmovups %%xmm9 , -32(%3,%0,8) \n\t"
|
||||
"vmovups %%xmm11 , -16(%3,%0,8) \n\t"
|
||||
|
||||
"cmpq %0 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"vmovsd (%9), %%xmm4 \n\t"
|
||||
"vmovsd 8(%9), %%xmm5 \n\t"
|
||||
"vmovsd 16(%9), %%xmm6 \n\t"
|
||||
"vmovsd 24(%9), %%xmm7 \n\t"
|
||||
|
||||
"vhaddpd %%xmm0, %%xmm0, %%xmm0 \n\t"
|
||||
"vhaddpd %%xmm1, %%xmm1, %%xmm1 \n\t"
|
||||
"vhaddpd %%xmm2, %%xmm2, %%xmm2 \n\t"
|
||||
"vhaddpd %%xmm3, %%xmm3, %%xmm3 \n\t"
|
||||
|
||||
"vaddsd %%xmm4, %%xmm0, %%xmm0 \n\t"
|
||||
"vaddsd %%xmm5, %%xmm1, %%xmm1 \n\t"
|
||||
"vaddsd %%xmm6, %%xmm2, %%xmm2 \n\t"
|
||||
"vaddsd %%xmm7, %%xmm3, %%xmm3 \n\t"
|
||||
|
||||
"vmovsd %%xmm0 , (%9) \n\t" // save temp2
|
||||
"vmovsd %%xmm1 , 8(%9) \n\t" // save temp2
|
||||
"vmovsd %%xmm2 ,16(%9) \n\t" // save temp2
|
||||
"vmovsd %%xmm3 ,24(%9) \n\t" // save temp2
|
||||
|
||||
: // output: the asm advances from, so it is a read-write operand
|
||||
"+r" (from) // 0
|
||||
: // inputs
|
||||
"r" (to), // 1
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (a[0]), // 4
|
||||
"r" (a[1]), // 5
|
||||
"r" (a[2]), // 6
|
||||
"r" (a[3]), // 7
|
||||
"r" (temp1), // 8
|
||||
"r" (temp2) // 9
|
||||
: "cc",
|
||||
"%xmm0", "%xmm1", "%xmm2", "%xmm3",
|
||||
"%xmm4", "%xmm5", "%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
|
@ -0,0 +1,132 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#define HAVE_KERNEL_4x4 1
|
||||
static void dsymv_kernel_4x4( BLASLONG from, BLASLONG to, FLOAT **a, FLOAT *x, FLOAT *y, FLOAT *temp1, FLOAT *temp2) __attribute__ ((noinline));
|
||||
|
||||
static void dsymv_kernel_4x4(BLASLONG from, BLASLONG to, FLOAT **a, FLOAT *x, FLOAT *y, FLOAT *temp1, FLOAT *temp2)
|
||||
{
|
||||
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"xorpd %%xmm0 , %%xmm0 \n\t" // temp2[0]
|
||||
"xorpd %%xmm1 , %%xmm1 \n\t" // temp2[1]
|
||||
"xorpd %%xmm2 , %%xmm2 \n\t" // temp2[2]
|
||||
"xorpd %%xmm3 , %%xmm3 \n\t" // temp2[3]
|
||||
"movsd (%8), %%xmm4 \n\t" // temp1[0]
|
||||
"movsd 8(%8), %%xmm5 \n\t" // temp1[1]
|
||||
"movsd 16(%8), %%xmm6 \n\t" // temp1[2]
|
||||
"movsd 24(%8), %%xmm7 \n\t" // temp1[3]
|
||||
"shufpd $0, %%xmm4, %%xmm4 \n\t"
|
||||
"shufpd $0, %%xmm5, %%xmm5 \n\t"
|
||||
"shufpd $0, %%xmm6, %%xmm6 \n\t"
|
||||
"shufpd $0, %%xmm7, %%xmm7 \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"movups (%4,%0,8), %%xmm12 \n\t" // 2 * a
|
||||
"movups (%2,%0,8), %%xmm8 \n\t" // 2 * x
|
||||
"movups %%xmm12 , %%xmm11 \n\t"
|
||||
"movups (%3,%0,8), %%xmm9 \n\t" // 2 * y
|
||||
"movups (%5,%0,8), %%xmm13 \n\t" // 2 * a
|
||||
|
||||
"mulpd %%xmm4 , %%xmm11 \n\t" // temp1 * a
|
||||
"addpd %%xmm11 , %%xmm9 \n\t" // y += temp1 * a
|
||||
"mulpd %%xmm8 , %%xmm12 \n\t" // a * x
|
||||
"addpd %%xmm12 , %%xmm0 \n\t" // temp2 += x * a
|
||||
|
||||
"movups (%6,%0,8), %%xmm14 \n\t" // 2 * a
|
||||
"movups (%7,%0,8), %%xmm15 \n\t" // 2 * a
|
||||
|
||||
"movups %%xmm13 , %%xmm11 \n\t"
|
||||
"mulpd %%xmm5 , %%xmm11 \n\t" // temp1 * a
|
||||
"addpd %%xmm11 , %%xmm9 \n\t" // y += temp1 * a
|
||||
"mulpd %%xmm8 , %%xmm13 \n\t" // a * x
|
||||
"addpd %%xmm13 , %%xmm1 \n\t" // temp2 += x * a
|
||||
|
||||
"movups %%xmm14 , %%xmm11 \n\t"
|
||||
"mulpd %%xmm6 , %%xmm11 \n\t" // temp1 * a
|
||||
"addpd %%xmm11 , %%xmm9 \n\t" // y += temp1 * a
|
||||
"mulpd %%xmm8 , %%xmm14 \n\t" // a * x
|
||||
"addpd %%xmm14 , %%xmm2 \n\t" // temp2 += x * a
|
||||
|
||||
"addq $2 , %0 \n\t"
|
||||
"movups %%xmm15 , %%xmm11 \n\t"
|
||||
"mulpd %%xmm7 , %%xmm11 \n\t" // temp1 * a
|
||||
"addpd %%xmm11 , %%xmm9 \n\t" // y += temp1 * a
|
||||
"mulpd %%xmm8 , %%xmm15 \n\t" // a * x
|
||||
"addpd %%xmm15 , %%xmm3 \n\t" // temp2 += x * a
|
||||
|
||||
"movups %%xmm9,-16(%3,%0,8) \n\t" // 2 * y
|
||||
|
||||
"cmpq %0 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
"movsd (%9), %%xmm4 \n\t" // temp2[0]
"movsd 8(%9), %%xmm5 \n\t" // temp2[1]
"movsd 16(%9), %%xmm6 \n\t" // temp2[2]
"movsd 24(%9), %%xmm7 \n\t" // temp2[3]
|
||||
|
||||
"haddpd %%xmm0, %%xmm0 \n\t"
|
||||
"haddpd %%xmm1, %%xmm1 \n\t"
|
||||
"haddpd %%xmm2, %%xmm2 \n\t"
|
||||
"haddpd %%xmm3, %%xmm3 \n\t"
|
||||
|
||||
"addsd %%xmm4, %%xmm0 \n\t"
|
||||
"addsd %%xmm5, %%xmm1 \n\t"
|
||||
"addsd %%xmm6, %%xmm2 \n\t"
|
||||
"addsd %%xmm7, %%xmm3 \n\t"
|
||||
|
||||
"movsd %%xmm0 , (%9) \n\t" // save temp2
|
||||
"movsd %%xmm1 , 8(%9) \n\t" // save temp2
|
||||
"movsd %%xmm2 , 16(%9) \n\t" // save temp2
|
||||
"movsd %%xmm3 , 24(%9) \n\t" // save temp2
|
||||
|
||||
:
|
||||
"+r" (from) // 0, advanced by the loop, so it is a read-write operand
:
|
||||
"r" (to), // 1
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (a[0]), // 4
|
||||
"r" (a[1]), // 5
|
||||
"r" (a[2]), // 6
|
||||
"r" (a[3]), // 7
|
||||
"r" (temp1), // 8
|
||||
"r" (temp2) // 9
|
||||
: "cc",
|
||||
"%xmm0", "%xmm1", "%xmm2", "%xmm3",
|
||||
"%xmm4", "%xmm5", "%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
|
@ -0,0 +1,273 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2013, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
#include "common.h"
|
||||
|
||||
|
||||
#if defined(BULLDOZER)
|
||||
#include "dsymv_U_microk_bulldozer-2.c"
|
||||
#elif defined(NEHALEM)
|
||||
#include "dsymv_U_microk_nehalem-2.c"
|
||||
#endif
|
||||
|
||||
#ifndef HAVE_KERNEL_4x4
|
||||
|
||||
static void dsymv_kernel_4x4(BLASLONG n, FLOAT *a0, FLOAT *a1, FLOAT *a2, FLOAT *a3, FLOAT *xp, FLOAT *yp, FLOAT *temp1, FLOAT *temp2)
|
||||
{
|
||||
FLOAT at0,at1,at2,at3;
|
||||
FLOAT x;
|
||||
FLOAT tmp2[4] = { 0.0, 0.0, 0.0, 0.0 };
|
||||
FLOAT tp0;
|
||||
FLOAT tp1;
|
||||
FLOAT tp2;
|
||||
FLOAT tp3;
|
||||
BLASLONG i;
|
||||
|
||||
tp0 = temp1[0];
|
||||
tp1 = temp1[1];
|
||||
tp2 = temp1[2];
|
||||
tp3 = temp1[3];
|
||||
|
||||
for (i=0; i<n; i++)
|
||||
{
|
||||
at0 = a0[i];
|
||||
at1 = a1[i];
|
||||
at2 = a2[i];
|
||||
at3 = a3[i];
|
||||
x = xp[i];
|
||||
yp[i] += tp0 * at0 + tp1 *at1 + tp2 * at2 + tp3 * at3;
|
||||
tmp2[0] += at0 * x;
|
||||
tmp2[1] += at1 * x;
|
||||
tmp2[2] += at2 * x;
|
||||
tmp2[3] += at3 * x;
|
||||
|
||||
}
|
||||
|
||||
temp2[0] += tmp2[0];
|
||||
temp2[1] += tmp2[1];
|
||||
temp2[2] += tmp2[2];
|
||||
temp2[3] += tmp2[3];
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
|
||||
#ifndef HAVE_KERNEL_1x4
|
||||
|
||||
static void dsymv_kernel_1x4(BLASLONG from, BLASLONG to, FLOAT *a0, FLOAT *a1, FLOAT *a2, FLOAT *a3, FLOAT *xp, FLOAT *yp, FLOAT *temp1, FLOAT *temp2)
|
||||
{
|
||||
FLOAT at0,at1,at2,at3;
|
||||
FLOAT x;
|
||||
FLOAT tmp2[4] = { 0.0, 0.0, 0.0, 0.0 };
|
||||
FLOAT tp0;
|
||||
FLOAT tp1;
|
||||
FLOAT tp2;
|
||||
FLOAT tp3;
|
||||
BLASLONG i;
|
||||
|
||||
tp0 = temp1[0];
|
||||
tp1 = temp1[1];
|
||||
tp2 = temp1[2];
|
||||
tp3 = temp1[3];
|
||||
|
||||
for (i=from; i<to; i++)
|
||||
{
|
||||
at0 = a0[i];
|
||||
at1 = a1[i];
|
||||
at2 = a2[i];
|
||||
at3 = a3[i];
|
||||
x = xp[i];
|
||||
yp[i] += tp0 * at0 + tp1 *at1 + tp2 * at2 + tp3 * at3;
|
||||
tmp2[0] += at0 * x;
|
||||
tmp2[1] += at1 * x;
|
||||
tmp2[2] += at2 * x;
|
||||
tmp2[3] += at3 * x;
|
||||
|
||||
}
|
||||
|
||||
temp2[0] += tmp2[0];
|
||||
temp2[1] += tmp2[1];
|
||||
temp2[2] += tmp2[2];
|
||||
temp2[3] += tmp2[3];
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
|
||||
static void dsymv_kernel_8x1(BLASLONG n, FLOAT *a0, FLOAT *xp, FLOAT *yp, FLOAT *temp1, FLOAT *temp2)
|
||||
{
|
||||
FLOAT at0,at1,at2,at3;
|
||||
FLOAT temp = 0.0;
|
||||
FLOAT t1 = *temp1;
|
||||
BLASLONG i;
|
||||
|
||||
for (i=0; i<(n/4)*4; i+=4)
|
||||
{
|
||||
at0 = a0[i];
|
||||
at1 = a0[i+1];
|
||||
at2 = a0[i+2];
|
||||
at3 = a0[i+3];
|
||||
|
||||
yp[i] += t1 * at0;
|
||||
temp += at0 * xp[i];
|
||||
yp[i+1] += t1 * at1;
|
||||
temp += at1 * xp[i+1];
|
||||
|
||||
yp[i+2] += t1 * at2;
|
||||
temp += at2 * xp[i+2];
|
||||
yp[i+3] += t1 * at3;
|
||||
temp += at3 * xp[i+3];
|
||||
|
||||
}
|
||||
*temp2 = temp;
|
||||
}
|
||||
|
||||
int CNAME(BLASLONG m, BLASLONG offset, FLOAT alpha, FLOAT *a, BLASLONG lda, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y, FLOAT *buffer)
|
||||
{
|
||||
BLASLONG i;
|
||||
BLASLONG ix,iy;
|
||||
BLASLONG jx,jy;
|
||||
BLASLONG j;
|
||||
BLASLONG j1;
|
||||
BLASLONG j2;
|
||||
BLASLONG m2;
|
||||
FLOAT temp1;
|
||||
FLOAT temp2;
|
||||
FLOAT *xp, *yp;
|
||||
FLOAT *a0,*a1,*a2,*a3;
|
||||
FLOAT at0,at1,at2,at3;
|
||||
FLOAT tmp1[4];
|
||||
FLOAT tmp2[4];
|
||||
|
||||
#if 0
|
||||
if( m != offset )
|
||||
printf("Symv_U: m=%ld offset=%ld\n",(long)m,(long)offset);
|
||||
#endif
|
||||
|
||||
BLASLONG m1 = m - offset;
|
||||
BLASLONG mrange = m -m1;
|
||||
|
||||
if ( (inc_x!=1) || (inc_y!=1) || (mrange<16) )
|
||||
{
|
||||
|
||||
jx = m1 * inc_x;
|
||||
jy = m1 * inc_y;
|
||||
|
||||
for (j=m1; j<m; j++)
|
||||
{
|
||||
temp1 = alpha * x[jx];
|
||||
temp2 = 0.0;
|
||||
iy = 0;
|
||||
ix = 0;
|
||||
for (i=0; i<j; i++)
|
||||
{
|
||||
y[iy] += temp1 * a[j*lda+i];
|
||||
temp2 += a[j*lda+i] * x[ix];
|
||||
ix += inc_x;
|
||||
iy += inc_y;
|
||||
|
||||
}
|
||||
y[jy] += temp1 * a[j*lda+j] + alpha * temp2;
|
||||
jx += inc_x;
|
||||
jy += inc_y;
|
||||
}
|
||||
return(0);
|
||||
}
|
||||
|
||||
xp = x;
|
||||
yp = y;
|
||||
|
||||
m2 = m - ( mrange % 4 );
|
||||
|
||||
for (j=m1; j<m2; j+=4)
|
||||
{
|
||||
tmp1[0] = alpha * xp[j];
|
||||
tmp1[1] = alpha * xp[j+1];
|
||||
tmp1[2] = alpha * xp[j+2];
|
||||
tmp1[3] = alpha * xp[j+3];
|
||||
tmp2[0] = 0.0;
|
||||
tmp2[1] = 0.0;
|
||||
tmp2[2] = 0.0;
|
||||
tmp2[3] = 0.0;
|
||||
a0 = &a[j*lda];
|
||||
a1 = a0+lda;
|
||||
a2 = a1+lda;
|
||||
a3 = a2+lda;
|
||||
j1 = (j/8)*8;
|
||||
if ( j1 )
|
||||
dsymv_kernel_4x4(j1, a0, a1, a2, a3, xp, yp, tmp1, tmp2);
|
||||
if ( j1 < j )
|
||||
dsymv_kernel_1x4(j1, j, a0, a1, a2, a3, xp, yp, tmp1, tmp2);
|
||||
|
||||
j2 = 0;
|
||||
for ( j1 = j ; j1 < j+4 ; j1++ )
|
||||
{
|
||||
temp1 = tmp1[j2];
|
||||
temp2 = tmp2[j2];
|
||||
a0 = &a[j1*lda];
|
||||
for ( i=j ; i<j1; i++ )
|
||||
{
|
||||
yp[i] += temp1 * a0[i];
|
||||
temp2 += a0[i] * xp[i];
|
||||
|
||||
}
|
||||
y[j1] += temp1 * a0[j1] + alpha * temp2;
|
||||
j2++;
|
||||
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
for ( ; j<m; j++)
|
||||
{
|
||||
temp1 = alpha * xp[j];
|
||||
temp2 = 0.0;
|
||||
a0 = &a[j*lda];
|
||||
FLOAT at0;
|
||||
j1 = (j/8)*8;
|
||||
|
||||
if ( j1 )
|
||||
dsymv_kernel_8x1(j1, a0, xp, yp, &temp1, &temp2);
|
||||
|
||||
for (i=j1 ; i<j; i++)
|
||||
{
|
||||
at0 = a0[i];
|
||||
yp[i] += temp1 * at0;
|
||||
temp2 += at0 * xp[i];
|
||||
|
||||
}
|
||||
|
||||
yp[j] += temp1 * a0[j] + alpha * temp2;
|
||||
}
|
||||
|
||||
return(0);
|
||||
|
||||
|
||||
}
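For orientation, the scalar branch of the routine above (the one taken for non-unit strides or a short range) can be sketched as a standalone function. This is a minimal sketch, not OpenBLAS code: the name `symv_upper_ref`, the use of `double` and `size_t`, unit strides, and the assumption that the whole matrix is processed (offset equal to m) are illustrative choices.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference: y += alpha * A * x for a symmetric n-by-n matrix A
   stored column-major with leading dimension lda. Only the upper triangle
   (a[j*lda + i] with i <= j) is read. */
static void symv_upper_ref(size_t n, double alpha, const double *a, size_t lda,
                           const double *x, double *y)
{
    assert(lda >= n);
    for (size_t j = 0; j < n; j++) {
        double temp1 = alpha * x[j]; /* scales column j into y[0..j-1] */
        double temp2 = 0.0;          /* dot product of column j with x[0..j-1] */
        for (size_t i = 0; i < j; i++) {
            y[i]  += temp1 * a[j * lda + i];
            temp2 += a[j * lda + i] * x[i];
        }
        /* diagonal term plus the mirrored (lower-triangle) contribution */
        y[j] += temp1 * a[j * lda + j] + alpha * temp2;
    }
}
```

Each column is read once and used twice: once to update y above the diagonal and once to accumulate the mirrored contribution. The 4x4 kernels apply the same idea to four columns at a time.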
|
||||
|
||||
|
||||
|
|
@ -0,0 +1,130 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#define HAVE_KERNEL_4x4 1
|
||||
static void dsymv_kernel_4x4( BLASLONG n, FLOAT *a0, FLOAT *a1, FLOAT *a2, FLOAT *a3, FLOAT *x, FLOAT *y, FLOAT *temp1, FLOAT *temp2) __attribute__ ((noinline));
|
||||
|
||||
static void dsymv_kernel_4x4(BLASLONG n, FLOAT *a0, FLOAT *a1, FLOAT *a2, FLOAT *a3, FLOAT *x, FLOAT *y, FLOAT *temp1, FLOAT *temp2)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"vxorpd %%xmm0 , %%xmm0 , %%xmm0 \n\t" // temp2[0]
|
||||
"vxorpd %%xmm1 , %%xmm1 , %%xmm1 \n\t" // temp2[1]
|
||||
"vxorpd %%xmm2 , %%xmm2 , %%xmm2 \n\t" // temp2[2]
|
||||
"vxorpd %%xmm3 , %%xmm3 , %%xmm3 \n\t" // temp2[3]
|
||||
"vmovddup (%8), %%xmm4 \n\t" // temp1[0]
|
||||
"vmovddup 8(%8), %%xmm5 \n\t" // temp1[1]
|
"vmovddup 16(%8), %%xmm6 \n\t" // temp1[2]
"vmovddup 24(%8), %%xmm7 \n\t" // temp1[3]
|
||||
|
||||
"xorq %0,%0 \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
|
||||
"vmovups (%4,%0,8), %%xmm12 \n\t" // 2 * a
|
||||
"vmovups (%2,%0,8), %%xmm8 \n\t" // 2 * x
|
||||
"vmovups (%3,%0,8), %%xmm9 \n\t" // 2 * y
|
||||
|
||||
"vmovups (%5,%0,8), %%xmm13 \n\t" // 2 * a
|
||||
|
||||
"vfmaddpd %%xmm0 , %%xmm8, %%xmm12 , %%xmm0 \n\t" // temp2 += x * a
|
||||
"vfmaddpd %%xmm9 , %%xmm4, %%xmm12 , %%xmm9 \n\t" // y += temp1 * a
|
||||
"vmovups (%6,%0,8), %%xmm14 \n\t" // 2 * a
|
||||
|
||||
"vfmaddpd %%xmm1 , %%xmm8, %%xmm13 , %%xmm1 \n\t" // temp2 += x * a
|
||||
"vfmaddpd %%xmm9 , %%xmm5, %%xmm13 , %%xmm9 \n\t" // y += temp1 * a
|
||||
"vmovups (%7,%0,8), %%xmm15 \n\t" // 2 * a
|
||||
|
||||
"vmovups 16(%3,%0,8), %%xmm11 \n\t" // 2 * y
|
||||
"vfmaddpd %%xmm2 , %%xmm8, %%xmm14 , %%xmm2 \n\t" // temp2 += x * a
|
||||
"vmovups 16(%4,%0,8), %%xmm12 \n\t" // 2 * a
|
||||
"vfmaddpd %%xmm9 , %%xmm6, %%xmm14 , %%xmm9 \n\t" // y += temp1 * a
|
||||
"vmovups 16(%2,%0,8), %%xmm10 \n\t" // 2 * x
|
||||
|
||||
"vfmaddpd %%xmm3 , %%xmm8, %%xmm15 , %%xmm3 \n\t" // temp2 += x * a
|
||||
"vfmaddpd %%xmm9 , %%xmm7, %%xmm15 , %%xmm9 \n\t" // y += temp1 * a
|
||||
|
||||
"vmovups 16(%5,%0,8), %%xmm13 \n\t" // 2 * a
|
||||
"vmovups 16(%6,%0,8), %%xmm14 \n\t" // 2 * a
|
||||
|
||||
"vfmaddpd %%xmm0 , %%xmm10, %%xmm12 , %%xmm0 \n\t" // temp2 += x * a
|
||||
"vfmaddpd %%xmm11 , %%xmm4, %%xmm12 , %%xmm11 \n\t" // y += temp1 * a
|
||||
|
||||
"vmovups 16(%7,%0,8), %%xmm15 \n\t" // 2 * a
|
||||
"vfmaddpd %%xmm1 , %%xmm10, %%xmm13 , %%xmm1 \n\t" // temp2 += x * a
|
||||
"vfmaddpd %%xmm11 , %%xmm5, %%xmm13 , %%xmm11 \n\t" // y += temp1 * a
|
||||
|
||||
"vfmaddpd %%xmm2 , %%xmm10, %%xmm14 , %%xmm2 \n\t" // temp2 += x * a
|
||||
"addq $4 , %0 \n\t"
|
||||
"vfmaddpd %%xmm11 , %%xmm6, %%xmm14 , %%xmm11 \n\t" // y += temp1 * a
|
||||
|
||||
"vfmaddpd %%xmm3 , %%xmm10, %%xmm15 , %%xmm3 \n\t" // temp2 += x * a
|
||||
"vfmaddpd %%xmm11 , %%xmm7, %%xmm15 , %%xmm11 \n\t" // y += temp1 * a
|
||||
"subq $4 , %1 \n\t"
|
||||
|
||||
"vmovups %%xmm9 , -32(%3,%0,8) \n\t"
|
||||
"vmovups %%xmm11 , -16(%3,%0,8) \n\t"
|
||||
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"vhaddpd %%xmm0, %%xmm0, %%xmm0 \n\t"
|
||||
"vhaddpd %%xmm1, %%xmm1, %%xmm1 \n\t"
|
||||
"vhaddpd %%xmm2, %%xmm2, %%xmm2 \n\t"
|
||||
"vhaddpd %%xmm3, %%xmm3, %%xmm3 \n\t"
|
||||
|
||||
"vmovsd %%xmm0 , (%9) \n\t" // save temp2
|
||||
"vmovsd %%xmm1 , 8(%9) \n\t" // save temp2
|
||||
"vmovsd %%xmm2 ,16(%9) \n\t" // save temp2
|
||||
"vmovsd %%xmm3 ,24(%9) \n\t" // save temp2
|
||||
|
||||
:
|
||||
"+r" (i), // 0, modified by the loop
"+r" (n) // 1, modified by the loop
:
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (a0), // 4
|
||||
"r" (a1), // 5
|
||||
"r" (a2), // 6
|
||||
"r" (a3), // 7
|
||||
"r" (temp1), // 8
|
||||
"r" (temp2) // 9
|
||||
: "cc",
|
||||
"%xmm0", "%xmm1", "%xmm2", "%xmm3",
|
||||
"%xmm4", "%xmm5", "%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
|
@ -0,0 +1,125 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#define HAVE_KERNEL_4x4 1
|
||||
static void dsymv_kernel_4x4( BLASLONG n, FLOAT *a0, FLOAT *a1, FLOAT *a2, FLOAT *a3, FLOAT *x, FLOAT *y, FLOAT *temp1, FLOAT *temp2) __attribute__ ((noinline));
|
||||
|
||||
static void dsymv_kernel_4x4(BLASLONG n, FLOAT *a0, FLOAT *a1, FLOAT *a2, FLOAT *a3, FLOAT *x, FLOAT *y, FLOAT *temp1, FLOAT *temp2)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"xorpd %%xmm0 , %%xmm0 \n\t" // temp2[0]
|
||||
"xorpd %%xmm1 , %%xmm1 \n\t" // temp2[1]
|
||||
"xorpd %%xmm2 , %%xmm2 \n\t" // temp2[2]
|
||||
"xorpd %%xmm3 , %%xmm3 \n\t" // temp2[3]
|
||||
"movsd (%8), %%xmm4 \n\t" // temp1[0]
|
||||
"movsd 8(%8), %%xmm5 \n\t" // temp1[1]
|
||||
"movsd 16(%8), %%xmm6 \n\t" // temp1[2]
|
||||
"movsd 24(%8), %%xmm7 \n\t" // temp1[3]
|
||||
"shufpd $0, %%xmm4, %%xmm4 \n\t"
|
||||
"shufpd $0, %%xmm5, %%xmm5 \n\t"
|
||||
"shufpd $0, %%xmm6, %%xmm6 \n\t"
|
||||
"shufpd $0, %%xmm7, %%xmm7 \n\t"
|
||||
|
||||
"xorq %0,%0 \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"movups (%4,%0,8), %%xmm12 \n\t" // 2 * a
|
||||
"movups (%2,%0,8), %%xmm8 \n\t" // 2 * x
|
||||
"movups %%xmm12 , %%xmm11 \n\t"
|
||||
"movups (%3,%0,8), %%xmm9 \n\t" // 2 * y
|
||||
"movups (%5,%0,8), %%xmm13 \n\t" // 2 * a
|
||||
|
||||
"mulpd %%xmm4 , %%xmm11 \n\t" // temp1 * a
|
||||
"addpd %%xmm11 , %%xmm9 \n\t" // y += temp1 * a
|
||||
"mulpd %%xmm8 , %%xmm12 \n\t" // a * x
|
||||
"addpd %%xmm12 , %%xmm0 \n\t" // temp2 += x * a
|
||||
|
||||
"movups (%6,%0,8), %%xmm14 \n\t" // 2 * a
|
||||
"movups (%7,%0,8), %%xmm15 \n\t" // 2 * a
|
||||
|
||||
"movups %%xmm13 , %%xmm11 \n\t"
|
||||
"mulpd %%xmm5 , %%xmm11 \n\t" // temp1 * a
|
||||
"addpd %%xmm11 , %%xmm9 \n\t" // y += temp1 * a
|
||||
"mulpd %%xmm8 , %%xmm13 \n\t" // a * x
|
||||
"addpd %%xmm13 , %%xmm1 \n\t" // temp2 += x * a
|
||||
|
||||
"movups %%xmm14 , %%xmm11 \n\t"
|
||||
"mulpd %%xmm6 , %%xmm11 \n\t" // temp1 * a
|
||||
"addpd %%xmm11 , %%xmm9 \n\t" // y += temp1 * a
|
||||
"mulpd %%xmm8 , %%xmm14 \n\t" // a * x
|
||||
"addpd %%xmm14 , %%xmm2 \n\t" // temp2 += x * a
|
||||
|
||||
"addq $2 , %0 \n\t"
|
||||
"movups %%xmm15 , %%xmm11 \n\t"
|
||||
"mulpd %%xmm7 , %%xmm11 \n\t" // temp1 * a
|
||||
"addpd %%xmm11 , %%xmm9 \n\t" // y += temp1 * a
|
||||
"mulpd %%xmm8 , %%xmm15 \n\t" // a * x
|
||||
"addpd %%xmm15 , %%xmm3 \n\t" // temp2 += x * a
|
||||
|
||||
"movups %%xmm9,-16(%3,%0,8) \n\t" // 2 * y
|
||||
|
||||
"subq $2 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"haddpd %%xmm0, %%xmm0 \n\t"
|
||||
"haddpd %%xmm1, %%xmm1 \n\t"
|
||||
"haddpd %%xmm2, %%xmm2 \n\t"
|
||||
"haddpd %%xmm3, %%xmm3 \n\t"
|
||||
|
||||
"movsd %%xmm0 , (%9) \n\t" // save temp2
|
||||
"movsd %%xmm1 , 8(%9) \n\t" // save temp2
|
||||
"movsd %%xmm2 , 16(%9) \n\t" // save temp2
|
||||
"movsd %%xmm3 , 24(%9) \n\t" // save temp2
|
||||
|
||||
:
|
||||
"+r" (i), // 0, modified by the loop
"+r" (n) // 1, modified by the loop
:
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (a0), // 4
|
||||
"r" (a1), // 5
|
||||
"r" (a2), // 6
|
||||
"r" (a3), // 7
|
||||
"r" (temp1), // 8
|
||||
"r" (temp2) // 9
|
||||
: "cc",
|
||||
"%xmm0", "%xmm1", "%xmm2", "%xmm3",
|
||||
"%xmm4", "%xmm5", "%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
|
@ -0,0 +1,103 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
#include "common.h"
|
||||
|
||||
|
||||
#if defined(NEHALEM)
|
||||
#include "saxpy_microk_nehalem-2.c"
|
||||
#endif
|
||||
|
||||
|
||||
#ifndef HAVE_KERNEL_16
|
||||
|
||||
static void saxpy_kernel_16(BLASLONG n, FLOAT *x, FLOAT *y, FLOAT *alpha)
|
||||
{
|
||||
BLASLONG register i = 0;
|
||||
FLOAT a = *alpha;
|
||||
|
||||
while(i < n)
|
||||
{
|
||||
y[i] += a * x[i];
|
||||
y[i+1] += a * x[i+1];
|
||||
y[i+2] += a * x[i+2];
|
||||
y[i+3] += a * x[i+3];
|
||||
y[i+4] += a * x[i+4];
|
||||
y[i+5] += a * x[i+5];
|
||||
y[i+6] += a * x[i+6];
|
||||
y[i+7] += a * x[i+7];
|
||||
i+=8 ;
|
||||
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
int CNAME(BLASLONG n, BLASLONG dummy0, BLASLONG dummy1, FLOAT da, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y, FLOAT *dummy, BLASLONG dummy2)
|
||||
{
|
||||
BLASLONG i=0;
|
||||
BLASLONG ix=0,iy=0;
|
||||
|
||||
if ( n <= 0 ) return(0);
|
||||
|
||||
if ( (inc_x == 1) && (inc_y == 1) )
|
||||
{
|
||||
|
||||
BLASLONG n1 = n & -16;
|
||||
|
||||
if ( n1 )
|
||||
saxpy_kernel_16(n1, x, y , &da );
|
||||
|
||||
i = n1;
|
||||
while(i < n)
|
||||
{
|
||||
|
||||
y[i] += da * x[i] ;
|
||||
i++ ;
|
||||
|
||||
}
|
||||
return(0);
|
||||
|
||||
|
||||
}
|
||||
|
||||
while(i < n)
|
||||
{
|
||||
|
||||
y[iy] += da * x[ix] ;
|
||||
ix += inc_x ;
|
||||
iy += inc_y ;
|
||||
i++ ;
|
||||
|
||||
}
|
||||
return(0);
|
||||
|
||||
}
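The unit-stride path above splits the vector into a part whose length is a multiple of 16, handed to the kernel, and a scalar tail. A minimal sketch of that split follows. It assumes plain C in place of the assembly kernel, and the names `axpy_block16` and `axpy_unit_stride` are hypothetical, not OpenBLAS functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an unrolled kernel: n must be a multiple of 16. */
static void axpy_block16(size_t n, const float *x, float *y, float alpha)
{
    assert(n % 16 == 0);
    for (size_t i = 0; i < n; i += 16)
        for (size_t k = 0; k < 16; k++)
            y[i + k] += alpha * x[i + k];
}

/* y += alpha * x for contiguous vectors of any length. */
static void axpy_unit_stride(size_t n, float alpha, const float *x, float *y)
{
    size_t n1 = n & ~(size_t)15;      /* largest multiple of 16 not above n */
    if (n1)
        axpy_block16(n1, x, y, alpha);
    for (size_t i = n1; i < n; i++)   /* scalar tail, at most 15 elements */
        y[i] += alpha * x[i];
}
```

Keeping the block length in a type as wide as `n` avoids truncation for very long vectors, which is why the sketch uses `size_t` throughout.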
|
||||
|
||||
|
||||
|
|
@ -0,0 +1,91 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#define HAVE_KERNEL_16 1
|
||||
static void saxpy_kernel_16( BLASLONG n, FLOAT *x, FLOAT *y , FLOAT *alpha) __attribute__ ((noinline));
|
||||
|
||||
static void saxpy_kernel_16( BLASLONG n, FLOAT *x, FLOAT *y, FLOAT *alpha)
|
||||
{
|
||||
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movss (%4), %%xmm0 \n\t" // alpha
|
||||
"shufps $0, %%xmm0, %%xmm0 \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
// "prefetcht0 192(%2,%0,4) \n\t"
|
||||
// "prefetcht0 192(%3,%0,4) \n\t"
|
||||
|
||||
"movups (%2,%0,4), %%xmm12 \n\t" // 4 * x
|
||||
"movups 16(%2,%0,4), %%xmm13 \n\t" // 4 * x
|
||||
"movups 32(%2,%0,4), %%xmm14 \n\t" // 4 * x
|
||||
"movups 48(%2,%0,4), %%xmm15 \n\t" // 4 * x
|
||||
"movups (%3,%0,4), %%xmm8 \n\t" // 4 * y
|
||||
"movups 16(%3,%0,4), %%xmm9 \n\t" // 4 * y
|
||||
"movups 32(%3,%0,4), %%xmm10 \n\t" // 4 * y
|
||||
"movups 48(%3,%0,4), %%xmm11 \n\t" // 4 * y
|
||||
|
||||
"mulps %%xmm0 , %%xmm12 \n\t" // alpha * x
|
||||
"mulps %%xmm0 , %%xmm13 \n\t"
|
||||
"mulps %%xmm0 , %%xmm14 \n\t"
|
||||
"mulps %%xmm0 , %%xmm15 \n\t"
|
||||
|
||||
"addps %%xmm12, %%xmm8 \n\t" // y += alpha *x
|
||||
"addps %%xmm13, %%xmm9 \n\t"
|
||||
"addps %%xmm14, %%xmm10 \n\t"
|
||||
"addps %%xmm15, %%xmm11 \n\t"
|
||||
|
||||
"movups %%xmm8 , (%3,%0,4) \n\t"
|
||||
"movups %%xmm9 , 16(%3,%0,4) \n\t"
|
||||
"movups %%xmm10, 32(%3,%0,4) \n\t"
|
||||
"movups %%xmm11, 48(%3,%0,4) \n\t"
|
||||
|
||||
"addq $16, %0 \n\t"
|
||||
"subq $16, %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
:
|
||||
"+r" (i), // 0, modified by the loop
"+r" (n) // 1, modified by the loop
:
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (alpha) // 4
|
||||
: "cc",
|
||||
"%xmm0",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
|
@ -0,0 +1,109 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
#include "common.h"
|
||||
|
||||
#if defined(BULLDOZER) || defined(PILEDRIVER)
|
||||
#include "sdot_microk_bulldozer-2.c"
|
||||
#elif defined(NEHALEM)
|
||||
#include "sdot_microk_nehalem-2.c"
|
||||
#endif
|
||||
|
||||
|
||||
#ifndef HAVE_KERNEL_16
|
||||
|
||||
static void sdot_kernel_16(BLASLONG n, FLOAT *x, FLOAT *y, FLOAT *d)
|
||||
{
|
||||
BLASLONG register i = 0;
|
||||
FLOAT dot = 0.0;
|
||||
|
||||
while(i < n)
|
||||
{
|
||||
dot += y[i] * x[i]
|
||||
+ y[i+1] * x[i+1]
|
||||
+ y[i+2] * x[i+2]
|
||||
+ y[i+3] * x[i+3]
|
||||
+ y[i+4] * x[i+4]
|
||||
+ y[i+5] * x[i+5]
|
||||
+ y[i+6] * x[i+6]
|
||||
+ y[i+7] * x[i+7] ;
|
||||
|
||||
i+=8 ;
|
||||
|
||||
}
|
||||
*d += dot;
|
||||
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
FLOAT CNAME(BLASLONG n, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y)
|
||||
{
|
||||
BLASLONG i=0;
|
||||
BLASLONG ix=0,iy=0;
|
||||
|
||||
FLOAT dot = 0.0 ;
|
||||
|
||||
if ( n <= 0 ) return(dot);
|
||||
|
||||
if ( (inc_x == 1) && (inc_y == 1) )
|
||||
{
|
||||
|
||||
BLASLONG n1 = n & -16;
|
||||
|
||||
if ( n1 )
|
||||
sdot_kernel_16(n1, x, y , &dot );
|
||||
|
||||
|
||||
i = n1;
|
||||
while(i < n)
|
||||
{
|
||||
|
||||
dot += y[i] * x[i] ;
|
||||
i++ ;
|
||||
|
||||
}
|
||||
return(dot);
|
||||
|
||||
|
||||
}
|
||||
|
||||
while(i < n)
|
||||
{
|
||||
|
||||
dot += y[iy] * x[ix] ;
|
||||
ix += inc_x ;
|
||||
iy += inc_y ;
|
||||
i++ ;
|
||||
|
||||
}
|
||||
return(dot);
|
||||
|
||||
}
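The SIMD dot kernels keep several independent partial sums and combine them only after the loop. A minimal scalar sketch of that pattern follows, assuming four accumulators of four lanes each to mirror a 16-element step. The name `dot_block16` is hypothetical. Because the summation order differs from a simple left-to-right loop, results may differ in the last bits for general inputs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: dot product over n elements, n a multiple of 16,
   using four 4-lane accumulators that are reduced after the loop. */
static float dot_block16(size_t n, const float *x, const float *y)
{
    float acc[4][4] = {{0.0f}};
    assert(n % 16 == 0);

    for (size_t i = 0; i < n; i += 16)
        for (size_t r = 0; r < 4; r++)        /* one accumulator per 4 elements */
            for (size_t l = 0; l < 4; l++)
                acc[r][l] += x[i + 4 * r + l] * y[i + 4 * r + l];

    /* Vertical reduction: add the four accumulators lane by lane. */
    float lane[4];
    for (size_t l = 0; l < 4; l++)
        lane[l] = (acc[0][l] + acc[1][l]) + (acc[2][l] + acc[3][l]);

    /* Horizontal reduction: add the four lanes pairwise. */
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
```

Independent accumulators break the dependency chain between successive additions, which is the usual reason for this layout in vectorized reductions.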
|
||||
|
||||
|
||||
|
|
@ -25,48 +25,46 @@ OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
|||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#define HAVE_KERNEL_16x4 1
|
||||
static void dgemv_kernel_16x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y) __attribute__ ((noinline));
|
||||
#define HAVE_KERNEL_16 1
|
||||
static void sdot_kernel_16( BLASLONG n, FLOAT *x, FLOAT *y , FLOAT *dot) __attribute__ ((noinline));
|
||||
|
||||
static void dgemv_kernel_16x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y)
|
||||
static void sdot_kernel_16( BLASLONG n, FLOAT *x, FLOAT *y, FLOAT *dot)
|
||||
{
|
||||
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"vzeroupper \n\t"
|
||||
"vbroadcastsd (%2), %%ymm12 \n\t" // x0
|
||||
"vbroadcastsd 8(%2), %%ymm13 \n\t" // x1
|
||||
"vbroadcastsd 16(%2), %%ymm14 \n\t" // x2
|
||||
"vbroadcastsd 24(%2), %%ymm15 \n\t" // x3
|
||||
"vxorps %%xmm4, %%xmm4, %%xmm4 \n\t"
|
||||
"vxorps %%xmm5, %%xmm5, %%xmm5 \n\t"
|
||||
"vxorps %%xmm6, %%xmm6, %%xmm6 \n\t"
|
||||
"vxorps %%xmm7, %%xmm7, %%xmm7 \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"prefetcht0 192(%3,%0,8) \n\t"
|
||||
"vmovups (%3,%0,8), %%ymm4 \n\t" // 4 * y
|
||||
"vmovups 32(%3,%0,8), %%ymm5 \n\t" // 4 * y
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"vmovups (%2,%0,4), %%xmm12 \n\t" // 4 * x
|
||||
"vmovups 16(%2,%0,4), %%xmm13 \n\t" // 4 * x
|
||||
"vmovups 32(%2,%0,4), %%xmm14 \n\t" // 4 * x
|
||||
"vmovups 48(%2,%0,4), %%xmm15 \n\t" // 4 * x
|
||||
|
||||
"prefetcht0 192(%4,%0,8) \n\t"
|
||||
"vfmadd231pd (%4,%0,8), %%ymm12, %%ymm4 \n\t"
|
||||
"vfmadd231pd 32(%4,%0,8), %%ymm12, %%ymm5 \n\t"
|
||||
"prefetcht0 192(%5,%0,8) \n\t"
|
||||
"vfmadd231pd (%5,%0,8), %%ymm13, %%ymm4 \n\t"
|
||||
"vfmadd231pd 32(%5,%0,8), %%ymm13, %%ymm5 \n\t"
|
||||
"prefetcht0 192(%6,%0,8) \n\t"
|
||||
"vfmadd231pd (%6,%0,8), %%ymm14, %%ymm4 \n\t"
|
||||
"vfmadd231pd 32(%6,%0,8), %%ymm14, %%ymm5 \n\t"
|
||||
"prefetcht0 192(%7,%0,8) \n\t"
|
||||
"vfmadd231pd (%7,%0,8), %%ymm15, %%ymm4 \n\t"
|
||||
"vfmadd231pd 32(%7,%0,8), %%ymm15, %%ymm5 \n\t"
|
||||
"vfmaddps %%xmm4, (%3,%0,4), %%xmm12, %%xmm4 \n\t" // 4 * y
|
||||
"vfmaddps %%xmm5, 16(%3,%0,4), %%xmm13, %%xmm5 \n\t" // 4 * y
|
||||
"vfmaddps %%xmm6, 32(%3,%0,4), %%xmm14, %%xmm6 \n\t" // 4 * y
|
||||
"vfmaddps %%xmm7, 48(%3,%0,4), %%xmm15, %%xmm7 \n\t" // 4 * y
|
||||
|
||||
"vmovups %%ymm4, (%3,%0,8) \n\t" // 4 * y
|
||||
"vmovups %%ymm5, 32(%3,%0,8) \n\t" // 4 * y
|
||||
"addq $16, %0 \n\t"
|
||||
"subq $16, %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"addq $8 , %0 \n\t"
|
||||
"subq $8 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
"vzeroupper \n\t"
|
||||
"vaddps %%xmm4, %%xmm5, %%xmm4 \n\t"
|
||||
"vaddps %%xmm6, %%xmm7, %%xmm6 \n\t"
|
||||
"vaddps %%xmm4, %%xmm6, %%xmm4 \n\t"
|
||||
|
||||
"vhaddps %%xmm4, %%xmm4, %%xmm4 \n\t"
|
||||
"vhaddps %%xmm4, %%xmm4, %%xmm4 \n\t"
|
||||
|
||||
"vmovss %%xmm4, (%4) \n\t"
|
||||
|
||||
:
|
||||
:
|
||||
|
|
@ -74,12 +72,10 @@ static void dgemv_kernel_16x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y)
|
|||
"r" (n), // 1
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap[0]), // 4
|
||||
"r" (ap[1]), // 5
|
||||
"r" (ap[2]), // 6
|
||||
"r" (ap[3]) // 7
|
||||
"r" (dot) // 4
|
||||
: "cc",
|
||||
"%xmm4", "%xmm5",
|
||||
"%xmm6", "%xmm7",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
|
@ -0,0 +1,94 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#define HAVE_KERNEL_16 1
static void sdot_kernel_16( BLASLONG n, FLOAT *x, FLOAT *y , FLOAT *dot) __attribute__ ((noinline));

static void sdot_kernel_16( BLASLONG n, FLOAT *x, FLOAT *y, FLOAT *dot)
{


	BLASLONG register i = 0;

	__asm__ __volatile__
	(
	"xorps %%xmm4, %%xmm4 \n\t"
	"xorps %%xmm5, %%xmm5 \n\t"
	"xorps %%xmm6, %%xmm6 \n\t"
	"xorps %%xmm7, %%xmm7 \n\t"

	".align 16 \n\t"
	".L01LOOP%=: \n\t"
	"movups (%2,%0,4), %%xmm12 \n\t" // 4 * x
	"movups (%3,%0,4), %%xmm8 \n\t" // 4 * y
	"movups 16(%2,%0,4), %%xmm13 \n\t" // 4 * x
	"movups 16(%3,%0,4), %%xmm9 \n\t" // 4 * y
	"movups 32(%2,%0,4), %%xmm14 \n\t" // 4 * x
	"movups 32(%3,%0,4), %%xmm10 \n\t" // 4 * y
	"movups 48(%2,%0,4), %%xmm15 \n\t" // 4 * x
	"movups 48(%3,%0,4), %%xmm11 \n\t" // 4 * y

	"mulps %%xmm8 , %%xmm12 \n\t"
	"mulps %%xmm9 , %%xmm13 \n\t"
	"mulps %%xmm10, %%xmm14 \n\t"
	"mulps %%xmm11, %%xmm15 \n\t"

	"addps %%xmm12, %%xmm4 \n\t"
	"addps %%xmm13, %%xmm5 \n\t"
	"addps %%xmm14, %%xmm6 \n\t"
	"addps %%xmm15, %%xmm7 \n\t"

	"addq $16, %0 \n\t"
	"subq $16, %1 \n\t"
	"jnz .L01LOOP%= \n\t"

	"addps %%xmm5, %%xmm4 \n\t"
	"addps %%xmm7, %%xmm6 \n\t"
	"addps %%xmm6, %%xmm4 \n\t"

	"haddps %%xmm4, %%xmm4 \n\t"
	"haddps %%xmm4, %%xmm4 \n\t"

	"movss %%xmm4, (%4) \n\t"

	:
	:
	  "r" (i), // 0
	  "r" (n), // 1
	  "r" (x), // 2
	  "r" (y), // 3
	  "r" (dot) // 4
	: "cc",
	  "%xmm4", "%xmm5", "%xmm6", "%xmm7",
	  "%xmm8", "%xmm9", "%xmm10", "%xmm11",
	  "%xmm12", "%xmm13", "%xmm14", "%xmm15",
	  "memory"
	);

}
|
||||
|
||||
|
||||
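Review note: the SSE kernel above keeps four independent 4-lane accumulators (`xmm4` to `xmm7`), adds them pairwise after the loop, and then reduces across lanes with two `haddps` steps. The scalar sketch below emulates only that data flow, so the summation order can be compared against a naive loop. The function name `sdot16_lanes` is hypothetical, and `long`/`float` stand in for `BLASLONG`/`FLOAT`.

```c
/* Scalar emulation of the kernel's accumulation order.
   n is assumed to be a positive multiple of 16, as the driver guarantees. */
static float sdot16_lanes(long n, const float *x, const float *y)
{
    float acc[4][4] = {{0.0f}};   /* acc[r] plays the role of xmm(4+r) */

    for (long i = 0; i < n; i += 16)
        for (int r = 0; r < 4; r++)
            for (int l = 0; l < 4; l++)
                acc[r][l] += x[i + 4 * r + l] * y[i + 4 * r + l];

    /* addps xmm5->xmm4, addps xmm7->xmm6, addps xmm6->xmm4 */
    float v[4];
    for (int l = 0; l < 4; l++)
        v[l] = (acc[0][l] + acc[1][l]) + (acc[2][l] + acc[3][l]);

    /* two haddps steps leave (v0 + v1) + (v2 + v3) in the low lane */
    return (v[0] + v[1]) + (v[2] + v[3]);
}
```

Because the partial sums are combined in a tree rather than left to right, results can differ from a sequential loop in the last bits for general floating-point input.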
|
|
@ -181,8 +181,8 @@ USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|||
VFMADD231PS_( %ymm14,%ymm3,%ymm0 )
|
||||
VFMADD231PS_( %ymm15,%ymm3,%ymm1 )
|
||||
|
||||
addq $6*SIZE, BO
|
||||
addq $16*SIZE, AO
|
||||
addq $ 6*SIZE, BO
|
||||
addq $ 16*SIZE, AO
|
||||
decq %rax
|
||||
.endm
|
||||
|
||||
|
|
@ -268,8 +268,8 @@ USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|||
VFMADD231PS_( %ymm12,%ymm2,%ymm0 )
|
||||
VFMADD231PS_( %ymm14,%ymm3,%ymm0 )
|
||||
|
||||
addq $6*SIZE, BO
|
||||
addq $8*SIZE, AO
|
||||
addq $ 6*SIZE, BO
|
||||
addq $ 8*SIZE, AO
|
||||
decq %rax
|
||||
.endm
|
||||
|
||||
|
|
@ -327,8 +327,8 @@ USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|||
VFMADD231PS_( %xmm12,%xmm2,%xmm0 )
|
||||
VFMADD231PS_( %xmm14,%xmm3,%xmm0 )
|
||||
|
||||
addq $6*SIZE, BO
|
||||
addq $4*SIZE, AO
|
||||
addq $ 6*SIZE, BO
|
||||
addq $ 4*SIZE, AO
|
||||
decq %rax
|
||||
.endm
|
||||
|
||||
|
|
@ -392,8 +392,8 @@ USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|||
VFMADD231SS_( %xmm14,%xmm3,%xmm0 )
|
||||
VFMADD231SS_( %xmm15,%xmm3,%xmm1 )
|
||||
|
||||
addq $6*SIZE, BO
|
||||
addq $2*SIZE, AO
|
||||
addq $ 6*SIZE, BO
|
||||
addq $ 2*SIZE, AO
|
||||
decq %rax
|
||||
.endm
|
||||
|
||||
|
|
@ -478,8 +478,8 @@ USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|||
VFMADD231SS_( %xmm12,%xmm2,%xmm0 )
|
||||
VFMADD231SS_( %xmm14,%xmm3,%xmm0 )
|
||||
|
||||
addq $6*SIZE, BO
|
||||
addq $1*SIZE, AO
|
||||
addq $ 6*SIZE, BO
|
||||
addq $ 1*SIZE, AO
|
||||
decq %rax
|
||||
.endm
|
||||
|
||||
|
|
|
|||
|
|
@ -29,17 +29,6 @@ USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|||
#include "common.h"
|
||||
|
||||
|
||||
#if defined(BULLDOZER) || defined(PILEDRIVER)
|
||||
#include "sgemv_n_microk_bulldozer-2.c"
|
||||
#elif defined(HASWELL)
|
||||
#include "sgemv_n_microk_haswell-2.c"
|
||||
#elif defined(SANDYBRIDGE)
|
||||
#include "sgemv_n_microk_sandy-2.c"
|
||||
#elif defined(NEHALEM)
|
||||
#include "sgemv_n_microk_nehalem-2.c"
|
||||
#endif
|
||||
|
||||
|
||||
#define NBMAX 4096
|
||||
|
||||
#ifndef HAVE_KERNEL_16x4
|
||||
|
|
|
|||
|
|
@ -0,0 +1,591 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
#include "common.h"


#if defined(BULLDOZER) || defined(PILEDRIVER)
#include "sgemv_n_microk_bulldozer-4.c"
#elif defined(NEHALEM)
#include "sgemv_n_microk_nehalem-4.c"
#elif defined(SANDYBRIDGE)
#include "sgemv_n_microk_sandy-4.c"
#elif defined(HASWELL)
#include "sgemv_n_microk_haswell-4.c"
#endif


#define NBMAX 4096

#ifndef HAVE_KERNEL_4x8

static void sgemv_kernel_4x8(BLASLONG n, FLOAT **ap, FLOAT *xo, FLOAT *y, BLASLONG lda4, FLOAT *alpha)
{
	BLASLONG i;
	FLOAT *a0,*a1,*a2,*a3;
	FLOAT *b0,*b1,*b2,*b3;
	FLOAT *x4;
	FLOAT x[8];
	a0 = ap[0];
	a1 = ap[1];
	a2 = ap[2];
	a3 = ap[3];
	b0 = a0 + lda4 ;
	b1 = a1 + lda4 ;
	b2 = a2 + lda4 ;
	b3 = a3 + lda4 ;
	x4 = x + 4;

	for ( i=0; i<8; i++)
		x[i] = xo[i] * *alpha;

	for ( i=0; i< n; i+=4 )
	{

		y[i] += a0[i]*x[0] + a1[i]*x[1] + a2[i]*x[2] + a3[i]*x[3];
		y[i+1] += a0[i+1]*x[0] + a1[i+1]*x[1] + a2[i+1]*x[2] + a3[i+1]*x[3];
		y[i+2] += a0[i+2]*x[0] + a1[i+2]*x[1] + a2[i+2]*x[2] + a3[i+2]*x[3];
		y[i+3] += a0[i+3]*x[0] + a1[i+3]*x[1] + a2[i+3]*x[2] + a3[i+3]*x[3];

		y[i] += b0[i]*x4[0] + b1[i]*x4[1] + b2[i]*x4[2] + b3[i]*x4[3];
		y[i+1] += b0[i+1]*x4[0] + b1[i+1]*x4[1] + b2[i+1]*x4[2] + b3[i+1]*x4[3];
		y[i+2] += b0[i+2]*x4[0] + b1[i+2]*x4[1] + b2[i+2]*x4[2] + b3[i+2]*x4[3];
		y[i+3] += b0[i+3]*x4[0] + b1[i+3]*x4[1] + b2[i+3]*x4[2] + b3[i+3]*x4[3];

	}
}

#endif


#ifndef HAVE_KERNEL_4x4

static void sgemv_kernel_4x4(BLASLONG n, FLOAT **ap, FLOAT *xo, FLOAT *y, FLOAT *alpha)
{
	BLASLONG i;
	FLOAT *a0,*a1,*a2,*a3;
	FLOAT x[4];
	a0 = ap[0];
	a1 = ap[1];
	a2 = ap[2];
	a3 = ap[3];

	for ( i=0; i<4; i++)
		x[i] = xo[i] * *alpha;

	for ( i=0; i< n; i+=4 )
	{
		y[i] += a0[i]*x[0] + a1[i]*x[1] + a2[i]*x[2] + a3[i]*x[3];
		y[i+1] += a0[i+1]*x[0] + a1[i+1]*x[1] + a2[i+1]*x[2] + a3[i+1]*x[3];
		y[i+2] += a0[i+2]*x[0] + a1[i+2]*x[1] + a2[i+2]*x[2] + a3[i+2]*x[3];
		y[i+3] += a0[i+3]*x[0] + a1[i+3]*x[1] + a2[i+3]*x[2] + a3[i+3]*x[3];
	}
}

#endif
|
||||
|
||||
#ifndef HAVE_KERNEL_4x2
|
||||
|
||||
static void sgemv_kernel_4x2( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, FLOAT *alpha) __attribute__ ((noinline));
|
||||
|
||||
static void sgemv_kernel_4x2( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, FLOAT *alpha)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movss (%2) , %%xmm12 \n\t" // x0
|
||||
"movss (%6) , %%xmm4 \n\t" // alpha
|
||||
"movss 4(%2) , %%xmm13 \n\t" // x1
|
||||
"mulss %%xmm4 , %%xmm12 \n\t" // alpha
|
||||
"mulss %%xmm4 , %%xmm13 \n\t" // alpha
|
||||
"shufps $0, %%xmm12, %%xmm12 \n\t"
|
||||
"shufps $0, %%xmm13, %%xmm13 \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"movups (%3,%0,4), %%xmm4 \n\t" // 4 * y
|
||||
|
||||
"movups (%4,%0,4), %%xmm8 \n\t"
|
||||
"movups (%5,%0,4), %%xmm9 \n\t"
|
||||
"mulps %%xmm12, %%xmm8 \n\t"
|
||||
"mulps %%xmm13, %%xmm9 \n\t"
|
||||
"addps %%xmm8 , %%xmm4 \n\t"
|
||||
"addq $4 , %0 \n\t"
|
||||
"addps %%xmm9 , %%xmm4 \n\t"
|
||||
|
||||
"movups %%xmm4 , -16(%3,%0,4) \n\t" // 4 * y
|
||||
|
||||
"subq $4 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
:
|
||||
:
|
||||
"r" (i), // 0
|
||||
"r" (n), // 1
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap[0]), // 4
|
||||
"r" (ap[1]), // 5
|
||||
"r" (alpha) // 6
|
||||
: "cc",
|
||||
"%xmm4", "%xmm5",
|
||||
"%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
#ifndef HAVE_KERNEL_4x1
|
||||
|
||||
static void sgemv_kernel_4x1(BLASLONG n, FLOAT *ap, FLOAT *x, FLOAT *y, FLOAT *alpha) __attribute__ ((noinline));
|
||||
|
||||
static void sgemv_kernel_4x1(BLASLONG n, FLOAT *ap, FLOAT *x, FLOAT *y, FLOAT *alpha)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
BLASLONG register n1 = n & -8 ;
|
||||
BLASLONG register n2 = n & 4 ;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movss (%2), %%xmm12 \n\t" // x0
|
||||
"mulss (%6), %%xmm12 \n\t" // alpha
|
||||
"shufps $0, %%xmm12, %%xmm12 \n\t"
|
||||
|
||||
"cmpq $0, %1 \n\t"
|
||||
"je .L16END%= \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"movups (%3,%0,4), %%xmm4 \n\t" // 4 * y
|
||||
"movups 16(%3,%0,4), %%xmm5 \n\t" // 4 * y
|
||||
"movups (%4,%0,4), %%xmm8 \n\t" // 4 * a
|
||||
"movups 16(%4,%0,4), %%xmm9 \n\t" // 4 * a
|
||||
"mulps %%xmm12, %%xmm8 \n\t"
|
||||
"mulps %%xmm12, %%xmm9 \n\t"
|
||||
"addps %%xmm4 , %%xmm8 \n\t"
|
||||
"addps %%xmm5 , %%xmm9 \n\t"
|
||||
|
||||
"addq $8 , %0 \n\t"
|
||||
"movups %%xmm8 , -32(%3,%0,4) \n\t" // 4 * y
|
||||
"movups %%xmm9 , -16(%3,%0,4) \n\t" // 4 * y
|
||||
|
||||
"subq $8 , %1 \n\t"
|
||||
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
".L16END%=: \n\t"
|
||||
|
||||
"testq $0x04, %5 \n\t"
|
||||
"jz .L08LABEL%= \n\t"
|
||||
|
||||
"movups (%3,%0,4), %%xmm4 \n\t" // 4 * y
|
||||
"movups (%4,%0,4), %%xmm8 \n\t" // 4 * a
|
||||
"mulps %%xmm12, %%xmm8 \n\t"
|
||||
"addps %%xmm8 , %%xmm4 \n\t"
|
||||
"movups %%xmm4 , (%3,%0,4) \n\t" // 4 * y
|
||||
"addq $4 , %0 \n\t"
|
||||
"subq $4 , %1 \n\t"
|
||||
|
||||
".L08LABEL%=: \n\t"
|
||||
:
|
||||
:
|
||||
"r" (i), // 0
|
||||
"r" (n1), // 1
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap), // 4
|
||||
"r" (n2), // 5
|
||||
"r" (alpha) // 6
|
||||
: "cc",
|
||||
"%xmm4", "%xmm5",
|
||||
"%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
static void add_y(BLASLONG n, FLOAT *src, FLOAT *dest, BLASLONG inc_dest) __attribute__ ((noinline));
|
||||
|
||||
static void add_y(BLASLONG n, FLOAT *src, FLOAT *dest, BLASLONG inc_dest)
|
||||
{
|
||||
BLASLONG i;
|
||||
if ( inc_dest != 1 )
|
||||
{
|
||||
for ( i=0; i<n; i++ )
|
||||
{
|
||||
*dest += *src;
|
||||
src++;
|
||||
dest += inc_dest;
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
i=0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
|
||||
"movups (%2,%0,4) , %%xmm12 \n\t"
|
||||
"movups (%3,%0,4) , %%xmm11 \n\t"
|
||||
"addps %%xmm12 , %%xmm11 \n\t"
|
||||
"addq $4 , %0 \n\t"
|
||||
"movups %%xmm11, -16(%3,%0,4) \n\t"
|
||||
|
||||
"subq $4 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
:
|
||||
:
|
||||
"r" (i), // 0
|
||||
"r" (n), // 1
|
||||
"r" (src), // 2
|
||||
"r" (dest) // 3
|
||||
: "cc",
|
||||
"%xmm10", "%xmm11", "%xmm12",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
int CNAME(BLASLONG m, BLASLONG n, BLASLONG dummy1, FLOAT alpha, FLOAT *a, BLASLONG lda, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y, FLOAT *buffer)
|
||||
{
|
||||
BLASLONG i;
|
||||
BLASLONG j;
|
||||
FLOAT *a_ptr;
|
||||
FLOAT *x_ptr;
|
||||
FLOAT *y_ptr;
|
||||
FLOAT *ap[4];
|
||||
BLASLONG n1;
|
||||
BLASLONG m1;
|
||||
BLASLONG m2;
|
||||
BLASLONG m3;
|
||||
BLASLONG n2;
|
||||
BLASLONG lda4 = lda << 2;
|
||||
BLASLONG lda8 = lda << 3;
|
||||
FLOAT xbuffer[8],*ybuffer;
|
||||
|
||||
if ( m < 1 ) return(0);
|
||||
if ( n < 1 ) return(0);
|
||||
|
||||
ybuffer = buffer;
|
||||
|
||||
if ( inc_x == 1 )
|
||||
{
|
||||
n1 = n >> 3 ;
|
||||
n2 = n & 7 ;
|
||||
}
|
||||
else
|
||||
{
|
||||
n1 = n >> 2 ;
|
||||
n2 = n & 3 ;
|
||||
|
||||
}
|
||||
|
||||
m3 = m & 3 ;
|
||||
m1 = m & -4 ;
|
||||
m2 = (m & (NBMAX-1)) - m3 ;
|
||||
|
||||
|
||||
y_ptr = y;
|
||||
|
||||
BLASLONG NB = NBMAX;
|
||||
|
||||
while ( NB == NBMAX )
|
||||
{
|
||||
|
||||
m1 -= NB;
|
||||
if ( m1 < 0)
|
||||
{
|
||||
if ( m2 == 0 ) break;
|
||||
NB = m2;
|
||||
}
|
||||
|
||||
a_ptr = a;
|
||||
x_ptr = x;
|
||||
|
||||
ap[0] = a_ptr;
|
||||
ap[1] = a_ptr + lda;
|
||||
ap[2] = ap[1] + lda;
|
||||
ap[3] = ap[2] + lda;
|
||||
|
||||
if ( inc_y != 1 )
|
||||
memset(ybuffer,0,NB*4);
|
||||
else
|
||||
ybuffer = y_ptr;
|
||||
|
||||
if ( inc_x == 1 )
|
||||
{
|
||||
|
||||
|
||||
for( i = 0; i < n1 ; i++)
|
||||
{
|
||||
sgemv_kernel_4x8(NB,ap,x_ptr,ybuffer,lda4,&alpha);
|
||||
ap[0] += lda8;
|
||||
ap[1] += lda8;
|
||||
ap[2] += lda8;
|
||||
ap[3] += lda8;
|
||||
a_ptr += lda8;
|
||||
x_ptr += 8;
|
||||
}
|
||||
|
||||
|
||||
if ( n2 & 4 )
|
||||
{
|
||||
sgemv_kernel_4x4(NB,ap,x_ptr,ybuffer,&alpha);
|
||||
ap[0] += lda4;
|
||||
ap[1] += lda4;
|
||||
a_ptr += lda4;
|
||||
x_ptr += 4;
|
||||
}
|
||||
|
||||
if ( n2 & 2 )
|
||||
{
|
||||
sgemv_kernel_4x2(NB,ap,x_ptr,ybuffer,&alpha);
|
||||
a_ptr += lda*2;
|
||||
x_ptr += 2;
|
||||
}
|
||||
|
||||
|
||||
if ( n2 & 1 )
|
||||
{
|
||||
sgemv_kernel_4x1(NB,a_ptr,x_ptr,ybuffer,&alpha);
|
||||
a_ptr += lda;
|
||||
x_ptr += 1;
|
||||
|
||||
}
|
||||
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
|
||||
for( i = 0; i < n1 ; i++)
|
||||
{
|
||||
xbuffer[0] = x_ptr[0];
|
||||
x_ptr += inc_x;
|
||||
xbuffer[1] = x_ptr[0];
|
||||
x_ptr += inc_x;
|
||||
xbuffer[2] = x_ptr[0];
|
||||
x_ptr += inc_x;
|
||||
xbuffer[3] = x_ptr[0];
|
||||
x_ptr += inc_x;
|
||||
sgemv_kernel_4x4(NB,ap,xbuffer,ybuffer,&alpha);
|
||||
ap[0] += lda4;
|
||||
ap[1] += lda4;
|
||||
ap[2] += lda4;
|
||||
ap[3] += lda4;
|
||||
a_ptr += lda4;
|
||||
}
|
||||
|
||||
for( i = 0; i < n2 ; i++)
|
||||
{
|
||||
xbuffer[0] = x_ptr[0];
|
||||
x_ptr += inc_x;
|
||||
sgemv_kernel_4x1(NB,a_ptr,xbuffer,ybuffer,&alpha);
|
||||
a_ptr += lda;
|
||||
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
a += NB;
|
||||
if ( inc_y != 1 )
|
||||
{
|
||||
add_y(NB,ybuffer,y_ptr,inc_y);
|
||||
y_ptr += NB * inc_y;
|
||||
}
|
||||
else
|
||||
y_ptr += NB ;
|
||||
|
||||
}
|
||||
|
||||
if ( m3 == 0 ) return(0);
|
||||
|
||||
if ( m3 == 3 )
|
||||
{
|
||||
a_ptr = a;
|
||||
x_ptr = x;
|
||||
FLOAT temp0 = 0.0;
|
||||
FLOAT temp1 = 0.0;
|
||||
FLOAT temp2 = 0.0;
|
||||
if ( lda == 3 && inc_x ==1 )
|
||||
{
|
||||
|
||||
for( i = 0; i < ( n & -4 ); i+=4 )
|
||||
{
|
||||
|
||||
temp0 += a_ptr[0] * x_ptr[0] + a_ptr[3] * x_ptr[1];
|
||||
temp1 += a_ptr[1] * x_ptr[0] + a_ptr[4] * x_ptr[1];
|
||||
temp2 += a_ptr[2] * x_ptr[0] + a_ptr[5] * x_ptr[1];
|
||||
|
||||
temp0 += a_ptr[6] * x_ptr[2] + a_ptr[9] * x_ptr[3];
|
||||
temp1 += a_ptr[7] * x_ptr[2] + a_ptr[10] * x_ptr[3];
|
||||
temp2 += a_ptr[8] * x_ptr[2] + a_ptr[11] * x_ptr[3];
|
||||
|
||||
a_ptr += 12;
|
||||
x_ptr += 4;
|
||||
}
|
||||
|
||||
for( ; i < n; i++ )
|
||||
{
|
||||
temp0 += a_ptr[0] * x_ptr[0];
|
||||
temp1 += a_ptr[1] * x_ptr[0];
|
||||
temp2 += a_ptr[2] * x_ptr[0];
|
||||
a_ptr += 3;
|
||||
x_ptr ++;
|
||||
}
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
|
||||
for( i = 0; i < n; i++ )
|
||||
{
|
||||
temp0 += a_ptr[0] * x_ptr[0];
|
||||
temp1 += a_ptr[1] * x_ptr[0];
|
||||
temp2 += a_ptr[2] * x_ptr[0];
|
||||
a_ptr += lda;
|
||||
x_ptr += inc_x;
|
||||
|
||||
|
||||
}
|
||||
|
||||
}
|
||||
y_ptr[0] += alpha * temp0;
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += alpha * temp1;
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += alpha * temp2;
|
||||
return(0);
|
||||
}
|
||||
|
||||
|
||||
if ( m3 == 2 )
|
||||
{
|
||||
a_ptr = a;
|
||||
x_ptr = x;
|
||||
FLOAT temp0 = 0.0;
|
||||
FLOAT temp1 = 0.0;
|
||||
if ( lda == 2 && inc_x ==1 )
|
||||
{
|
||||
|
||||
for( i = 0; i < (n & -4) ; i+=4 )
|
||||
{
|
||||
temp0 += a_ptr[0] * x_ptr[0] + a_ptr[2] * x_ptr[1];
|
||||
temp1 += a_ptr[1] * x_ptr[0] + a_ptr[3] * x_ptr[1];
|
||||
temp0 += a_ptr[4] * x_ptr[2] + a_ptr[6] * x_ptr[3];
|
||||
temp1 += a_ptr[5] * x_ptr[2] + a_ptr[7] * x_ptr[3];
|
||||
a_ptr += 8;
|
||||
x_ptr += 4;
|
||||
|
||||
}
|
||||
|
||||
|
||||
for( ; i < n; i++ )
|
||||
{
|
||||
temp0 += a_ptr[0] * x_ptr[0];
|
||||
temp1 += a_ptr[1] * x_ptr[0];
|
||||
a_ptr += 2;
|
||||
x_ptr ++;
|
||||
}
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
|
||||
for( i = 0; i < n; i++ )
|
||||
{
|
||||
temp0 += a_ptr[0] * x_ptr[0];
|
||||
temp1 += a_ptr[1] * x_ptr[0];
|
||||
a_ptr += lda;
|
||||
x_ptr += inc_x;
|
||||
|
||||
|
||||
}
|
||||
|
||||
}
|
||||
y_ptr[0] += alpha * temp0;
|
||||
y_ptr += inc_y;
|
||||
y_ptr[0] += alpha * temp1;
|
||||
return(0);
|
||||
}
|
||||
|
||||
if ( m3 == 1 )
|
||||
{
|
||||
a_ptr = a;
|
||||
x_ptr = x;
|
||||
FLOAT temp = 0.0;
|
||||
if ( lda == 1 && inc_x ==1 )
|
||||
{
|
||||
|
||||
for( i = 0; i < (n & -4); i+=4 )
|
||||
{
|
||||
temp += a_ptr[i] * x_ptr[i] + a_ptr[i+1] * x_ptr[i+1] + a_ptr[i+2] * x_ptr[i+2] + a_ptr[i+3] * x_ptr[i+3];
|
||||
|
||||
}
|
||||
|
||||
for( ; i < n; i++ )
|
||||
{
|
||||
temp += a_ptr[i] * x_ptr[i];
|
||||
}
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
|
||||
for( i = 0; i < n; i++ )
|
||||
{
|
||||
temp += a_ptr[0] * x_ptr[0];
|
||||
a_ptr += lda;
|
||||
x_ptr += inc_x;
|
||||
}
|
||||
|
||||
}
|
||||
y_ptr[0] += alpha * temp;
|
||||
return(0);
|
||||
}
|
||||
|
||||
|
||||
return(0);
|
||||
}
|
||||
|
||||
|
||||
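Review note: the new driver computes `y += alpha * A * x` for a column-major matrix by walking row panels of at most `NBMAX` rows and, inside each panel, consuming columns in groups (8 or 4 at a time, then 2, then 1). The sketch below shows the same idea in portable C under simplifying assumptions: unit strides only, groups of four columns with a one-column tail, no `ybuffer`, and no separate path for the last `m & 3` rows. The names `NB_SKETCH` and `sgemv_n_sketch` are hypothetical.

```c
#define NB_SKETCH 4096   /* mirrors NBMAX; the name is an assumption */

/* y += alpha * A * x, A is m-by-n, column-major, leading dimension lda. */
static void sgemv_n_sketch(long m, long n, float alpha,
                           const float *a, long lda,
                           const float *x, float *y)
{
    for (long i0 = 0; i0 < m; i0 += NB_SKETCH) {
        long nb = (m - i0 < NB_SKETCH) ? (m - i0) : NB_SKETCH;
        const float *ap = a + i0;   /* first row of this panel */
        float *yp = y + i0;
        long j = 0;

        /* four columns at a time: one pass over yp updates it with 4 axpys */
        for (; j + 4 <= n; j += 4) {
            const float *a0 = ap + j * lda;
            const float *a1 = a0 + lda;
            const float *a2 = a1 + lda;
            const float *a3 = a2 + lda;
            float x0 = alpha * x[j],     x1 = alpha * x[j + 1];
            float x2 = alpha * x[j + 2], x3 = alpha * x[j + 3];
            for (long i = 0; i < nb; i++)
                yp[i] += a0[i] * x0 + a1[i] * x1 + a2[i] * x2 + a3[i] * x3;
        }

        /* remaining columns one at a time */
        for (; j < n; j++) {
            const float *a0 = ap + j * lda;
            float x0 = alpha * x[j];
            for (long i = 0; i < nb; i++)
                yp[i] += a0[i] * x0;
        }
    }
}
```

Grouping columns means each element of `y` is loaded and stored once per group instead of once per column, which is the main reason the kernels in this file take several column pointers at once.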
|
|
@ -1,218 +0,0 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
#include "common.h"
|
||||
|
||||
#if defined(BULLDOZER) || defined(PILEDRIVER)
|
||||
#include "sgemv_n_microk_bulldozer.c"
|
||||
#elif defined(HASWELL)
|
||||
#include "sgemv_n_microk_haswell.c"
|
||||
#else
|
||||
#include "sgemv_n_microk_sandy.c"
|
||||
#endif
|
||||
|
||||
static void copy_x(BLASLONG n, FLOAT *src, FLOAT *dest, BLASLONG inc_src)
|
||||
{
|
||||
BLASLONG i;
|
||||
for ( i=0; i<n; i++ )
|
||||
{
|
||||
*dest = *src;
|
||||
dest++;
|
||||
src += inc_src;
|
||||
}
|
||||
}
|
||||
|
||||
static void add_y(BLASLONG n, FLOAT *src, FLOAT *dest, BLASLONG inc_dest)
|
||||
{
|
||||
BLASLONG i;
|
||||
for ( i=0; i<n; i++ )
|
||||
{
|
||||
*dest += *src;
|
||||
src++;
|
||||
dest += inc_dest;
|
||||
}
|
||||
}
|
||||
|
||||
int CNAME(BLASLONG m, BLASLONG n, BLASLONG dummy1, FLOAT alpha, FLOAT *a, BLASLONG lda, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y, FLOAT *buffer)
|
||||
{
|
||||
BLASLONG i;
|
||||
BLASLONG j;
|
||||
FLOAT *a_ptr;
|
||||
FLOAT *x_ptr;
|
||||
FLOAT *y_ptr;
|
||||
BLASLONG n1;
|
||||
BLASLONG m1;
|
||||
BLASLONG register m2;
|
||||
BLASLONG register n2;
|
||||
FLOAT *xbuffer,*ybuffer;
|
||||
xbuffer = buffer;
|
||||
ybuffer = xbuffer + 2048 + 256;
|
||||
|
||||
n1 = n / 512 ;
|
||||
n2 = n % 512 ;
|
||||
|
||||
m1 = m / 64;
|
||||
m2 = m % 64;
|
||||
|
||||
y_ptr = y;
|
||||
x_ptr = x;
|
||||
|
||||
for (j=0; j<n1; j++)
|
||||
{
|
||||
|
||||
if ( inc_x == 1 )
|
||||
xbuffer = x_ptr;
|
||||
else
|
||||
copy_x(512,x_ptr,xbuffer,inc_x);
|
||||
|
||||
a_ptr = a + j * 512 * lda;
|
||||
y_ptr = y;
|
||||
|
||||
for(i = 0; i<m1; i++ )
|
||||
{
|
||||
sgemv_kernel_64(512,alpha,a_ptr,lda,xbuffer,ybuffer);
|
||||
add_y(64,ybuffer,y_ptr,inc_y);
|
||||
y_ptr += 64 * inc_y;
|
||||
a_ptr += 64;
|
||||
|
||||
}
|
||||
|
||||
if ( m2 & 32 )
|
||||
{
|
||||
sgemv_kernel_32(512,alpha,a_ptr,lda,xbuffer,ybuffer);
|
||||
add_y(32,ybuffer,y_ptr,inc_y);
|
||||
y_ptr += 32 * inc_y;
|
||||
a_ptr += 32;
|
||||
|
||||
}
|
||||
|
||||
if ( m2 & 16 )
|
||||
{
|
||||
sgemv_kernel_16(512,alpha,a_ptr,lda,xbuffer,ybuffer);
|
||||
add_y(16,ybuffer,y_ptr,inc_y);
|
||||
y_ptr += 16 * inc_y;
|
||||
a_ptr += 16;
|
||||
}
|
||||
if ( m2 & 8 )
|
||||
{
|
||||
sgemv_kernel_8(512,alpha,a_ptr,lda,xbuffer,ybuffer);
|
||||
add_y(8,ybuffer,y_ptr,inc_y);
|
||||
y_ptr += 8 * inc_y;
|
||||
a_ptr += 8;
|
||||
}
|
||||
if ( m2 & 4 )
|
||||
{
|
||||
sgemv_kernel_4(512,alpha,a_ptr,lda,xbuffer,ybuffer);
|
||||
add_y(4,ybuffer,y_ptr,inc_y);
|
||||
y_ptr += 4 * inc_y;
|
||||
a_ptr += 4;
|
||||
}
|
||||
if ( m2 & 2 )
|
||||
{
|
||||
sgemv_kernel_2(512,alpha,a_ptr,lda,xbuffer,ybuffer);
|
||||
add_y(2,ybuffer,y_ptr,inc_y);
|
||||
y_ptr += 2 * inc_y;
|
||||
a_ptr += 2;
|
||||
}
|
||||
if ( m2 & 1 )
|
||||
{
|
||||
sgemv_kernel_1(512,alpha,a_ptr,lda,xbuffer,ybuffer);
|
||||
add_y(1,ybuffer,y_ptr,inc_y);
|
||||
}
|
||||
x_ptr += 512 * inc_x;
|
||||
|
||||
}
|
||||
|
||||
if ( n2 > 0 )
|
||||
{
|
||||
|
||||
if ( inc_x == 1 )
|
||||
xbuffer = x_ptr;
|
||||
else
|
||||
copy_x(n2,x_ptr,xbuffer,inc_x);
|
||||
|
||||
a_ptr = a + n1 * 512 * lda;
|
||||
y_ptr = y;
|
||||
|
||||
for(i = 0; i<m1; i++ )
|
||||
{
|
||||
sgemv_kernel_64(n2,alpha,a_ptr,lda,xbuffer,ybuffer);
|
||||
add_y(64,ybuffer,y_ptr,inc_y);
|
||||
y_ptr += 64 * inc_y;
|
||||
a_ptr += 64;
|
||||
|
||||
}
|
||||
|
||||
if ( m2 & 32 )
|
||||
{
|
||||
sgemv_kernel_32(n2,alpha,a_ptr,lda,xbuffer,ybuffer);
|
||||
add_y(32,ybuffer,y_ptr,inc_y);
|
||||
y_ptr += 32 * inc_y;
|
||||
a_ptr += 32;
|
||||
|
||||
}
|
||||
if ( m2 & 16 )
|
||||
{
|
||||
sgemv_kernel_16(n2,alpha,a_ptr,lda,xbuffer,ybuffer);
|
||||
add_y(16,ybuffer,y_ptr,inc_y);
|
||||
y_ptr += 16 * inc_y;
|
||||
a_ptr += 16;
|
||||
}
|
||||
if ( m2 & 8 )
|
||||
{
|
||||
sgemv_kernel_8(n2,alpha,a_ptr,lda,xbuffer,ybuffer);
|
||||
add_y(8,ybuffer,y_ptr,inc_y);
|
||||
y_ptr += 8 * inc_y;
|
||||
a_ptr += 8;
|
||||
}
|
||||
if ( m2 & 4 )
|
||||
{
|
||||
sgemv_kernel_4(n2,alpha,a_ptr,lda,xbuffer,ybuffer);
|
||||
add_y(4,ybuffer,y_ptr,inc_y);
|
||||
y_ptr += 4 * inc_y;
|
||||
a_ptr += 4;
|
||||
}
|
||||
if ( m2 & 2 )
|
||||
{
|
||||
sgemv_kernel_2(n2,alpha,a_ptr,lda,xbuffer,ybuffer);
|
||||
add_y(2,ybuffer,y_ptr,inc_y);
|
||||
y_ptr += 2 * inc_y;
|
||||
a_ptr += 2;
|
||||
}
|
||||
if ( m2 & 1 )
|
||||
{
|
||||
sgemv_kernel_1(n2,alpha,a_ptr,lda,xbuffer,ybuffer);
|
||||
add_y(1,ybuffer,y_ptr,inc_y);
|
||||
}
|
||||
|
||||
|
||||
}
|
||||
return(0);
|
||||
}
|
||||
|
||||
|
||||
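Review note: the removed driver handled the row count by running `m / 64` full 64-row kernels and then testing the bits of `m % 64` (32, 16, 8, 4, 2, 1), calling one fixed-size kernel per set bit. A small sketch of that decomposition, with the hypothetical helper name `split_rows` and `long` standing in for `BLASLONG`:

```c
/* Writes the sequence of chunk sizes into out and returns how many there are.
   out needs room for at least m / 64 + 6 entries. */
static long split_rows(long m, long *out)
{
    long k = 0;

    for (long i = 0; i < m / 64; i++)   /* full 64-row blocks */
        out[k++] = 64;

    long m2 = m % 64;                   /* remainder, 0..63 */
    for (long bit = 32; bit >= 1; bit >>= 1)
        if (m2 & bit)                   /* one fixed-size kernel per set bit */
            out[k++] = bit;

    return k;
}
```

For `m = 150` this yields two blocks of 64 followed by chunks of 16, 4, and 2 rows.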
|
|
@ -1,99 +0,0 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#define HAVE_KERNEL_16x4 1
|
||||
static void sgemv_kernel_16x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y) __attribute__ ((noinline));
|
||||
|
||||
static void sgemv_kernel_16x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"vbroadcastss (%2), %%xmm12 \n\t" // x0
|
||||
"vbroadcastss 4(%2), %%xmm13 \n\t" // x1
|
||||
"vbroadcastss 8(%2), %%xmm14 \n\t" // x2
|
||||
"vbroadcastss 12(%2), %%xmm15 \n\t" // x3
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"vmovups (%3,%0,4), %%xmm4 \n\t" // 4 * y
|
||||
"vmovups 16(%3,%0,4), %%xmm5 \n\t" // 4 * y
|
||||
"vmovups 32(%3,%0,4), %%xmm6 \n\t" // 4 * y
|
||||
"vmovups 48(%3,%0,4), %%xmm7 \n\t" // 4 * y
|
||||
|
||||
"prefetcht0 192(%4,%0,4) \n\t"
|
||||
"vfmaddps %%xmm4, (%4,%0,4), %%xmm12, %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, 16(%4,%0,4), %%xmm12, %%xmm5 \n\t"
|
||||
"vfmaddps %%xmm6, 32(%4,%0,4), %%xmm12, %%xmm6 \n\t"
|
||||
"vfmaddps %%xmm7, 48(%4,%0,4), %%xmm12, %%xmm7 \n\t"
|
||||
"prefetcht0 192(%5,%0,4) \n\t"
|
||||
"vfmaddps %%xmm4, (%5,%0,4), %%xmm13, %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, 16(%5,%0,4), %%xmm13, %%xmm5 \n\t"
|
||||
"vfmaddps %%xmm6, 32(%5,%0,4), %%xmm13, %%xmm6 \n\t"
|
||||
"vfmaddps %%xmm7, 48(%5,%0,4), %%xmm13, %%xmm7 \n\t"
|
||||
"prefetcht0 192(%6,%0,4) \n\t"
|
||||
"vfmaddps %%xmm4, (%6,%0,4), %%xmm14, %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, 16(%6,%0,4), %%xmm14, %%xmm5 \n\t"
|
||||
"vfmaddps %%xmm6, 32(%6,%0,4), %%xmm14, %%xmm6 \n\t"
|
||||
"vfmaddps %%xmm7, 48(%6,%0,4), %%xmm14, %%xmm7 \n\t"
|
||||
"prefetcht0 192(%7,%0,4) \n\t"
|
||||
"vfmaddps %%xmm4, (%7,%0,4), %%xmm15, %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, 16(%7,%0,4), %%xmm15, %%xmm5 \n\t"
|
||||
"vfmaddps %%xmm6, 32(%7,%0,4), %%xmm15, %%xmm6 \n\t"
|
||||
"vfmaddps %%xmm7, 48(%7,%0,4), %%xmm15, %%xmm7 \n\t"
|
||||
|
||||
"vmovups %%xmm4, (%3,%0,4) \n\t" // 4 * y
|
||||
"vmovups %%xmm5, 16(%3,%0,4) \n\t" // 4 * y
|
||||
"vmovups %%xmm6, 32(%3,%0,4) \n\t" // 4 * y
|
||||
"vmovups %%xmm7, 48(%3,%0,4) \n\t" // 4 * y
|
||||
|
||||
"addq $16, %0 \n\t"
|
||||
"subq $16, %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
:
"+r" (i), // 0 (read-write: advanced by the loop)
"+r" (n) // 1 (read-write: counted down to zero)
:
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap[0]), // 4
|
||||
"r" (ap[1]), // 5
|
||||
"r" (ap[2]), // 6
|
||||
"r" (ap[3]) // 7
|
||||
: "cc",
|
||||
"%xmm4", "%xmm5",
|
||||
"%xmm6", "%xmm7",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
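For readers who do not read FMA4 assembly, the arithmetic of the kernel above can be sketched in portable C. This is an illustrative reference only and not part of the commit: the name `sgemv_kernel_16x4_ref` is hypothetical, and `float` and `long` stand in for OpenBLAS's `FLOAT` and `BLASLONG`. The assembly assumes `n` is a positive multiple of 16; the sketch has no such restriction.

```c
/* Hypothetical scalar reference for sgemv_kernel_16x4 (not part of the commit).
 * Adds four columns of A, scaled by x[0..3], into y:
 *   y[i] += ap[0][i]*x[0] + ap[1][i]*x[1] + ap[2][i]*x[2] + ap[3][i]*x[3]
 * float/long are assumed stand-ins for FLOAT/BLASLONG. */
static void sgemv_kernel_16x4_ref(long n, float **ap, const float *x, float *y)
{
    for (long i = 0; i < n; i++) {
        float acc = y[i];                 /* the asm loads y first */
        for (int j = 0; j < 4; j++)
            acc += ap[j][i] * x[j];       /* same column order as the asm: x0..x3 */
        y[i] = acc;                       /* and stores the updated value back */
    }
}
```

Because the assembly uses fused multiply-add and the sketch does not, results may differ in the last bits for general inputs.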
|
||||
|
||||
|
||||
|
|
@@ -0,0 +1,269 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
|
||||
#define HAVE_KERNEL_4x8 1
|
||||
static void sgemv_kernel_4x8( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, BLASLONG lda4, FLOAT *alpha) __attribute__ ((noinline));
|
||||
|
||||
static void sgemv_kernel_4x8( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, BLASLONG lda4, FLOAT *alpha)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"vbroadcastss (%2), %%xmm12 \n\t" // x0
|
||||
"vbroadcastss 4(%2), %%xmm13 \n\t" // x1
|
||||
"vbroadcastss 8(%2), %%xmm14 \n\t" // x2
|
||||
"vbroadcastss 12(%2), %%xmm15 \n\t" // x3
|
||||
"vbroadcastss 16(%2), %%xmm0 \n\t" // x4
|
||||
"vbroadcastss 20(%2), %%xmm1 \n\t" // x5
|
||||
"vbroadcastss 24(%2), %%xmm2 \n\t" // x6
|
||||
"vbroadcastss 28(%2), %%xmm3 \n\t" // x7
|
||||
|
||||
"vbroadcastss (%9), %%xmm8 \n\t" // alpha
|
||||
|
||||
"testq $0x04, %1 \n\t"
|
||||
"jz .L08LABEL%= \n\t"
|
||||
|
||||
"vxorps %%xmm4, %%xmm4 , %%xmm4 \n\t"
|
||||
"vxorps %%xmm5, %%xmm5 , %%xmm5 \n\t"
|
||||
|
||||
"vfmaddps %%xmm4, (%4,%0,4), %%xmm12, %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, (%5,%0,4), %%xmm13, %%xmm5 \n\t"
|
||||
"vfmaddps %%xmm4, (%6,%0,4), %%xmm14, %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, (%7,%0,4), %%xmm15, %%xmm5 \n\t"
|
||||
"addq $4 , %0 \n\t"
|
||||
|
||||
"vfmaddps %%xmm4, (%4,%8,4), %%xmm0 , %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, (%5,%8,4), %%xmm1 , %%xmm5 \n\t"
|
||||
"vfmaddps %%xmm4, (%6,%8,4), %%xmm2 , %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, (%7,%8,4), %%xmm3 , %%xmm5 \n\t"
|
||||
"addq $4 , %8 \n\t"
|
||||
|
||||
"vaddps %%xmm5 , %%xmm4, %%xmm4 \n\t"
|
||||
"vfmaddps -16(%3,%0,4) , %%xmm4, %%xmm8,%%xmm6 \n\t"
|
||||
"subq $4 , %1 \n\t"
|
||||
"vmovups %%xmm6, -16(%3,%0,4) \n\t" // 4 * y
|
||||
|
||||
".L08LABEL%=: \n\t"
|
||||
|
||||
"testq $0x08, %1 \n\t"
|
||||
"jz .L16LABEL%= \n\t"
|
||||
|
||||
"vxorps %%xmm4, %%xmm4 , %%xmm4 \n\t"
|
||||
"vxorps %%xmm5, %%xmm5 , %%xmm5 \n\t"
|
||||
|
||||
"vfmaddps %%xmm4, (%4,%0,4), %%xmm12, %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, 16(%4,%0,4), %%xmm12, %%xmm5 \n\t"
|
||||
"vfmaddps %%xmm4, (%5,%0,4), %%xmm13, %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, 16(%5,%0,4), %%xmm13, %%xmm5 \n\t"
|
||||
"vfmaddps %%xmm4, (%6,%0,4), %%xmm14, %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, 16(%6,%0,4), %%xmm14, %%xmm5 \n\t"
|
||||
"vfmaddps %%xmm4, (%7,%0,4), %%xmm15, %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, 16(%7,%0,4), %%xmm15, %%xmm5 \n\t"
|
||||
|
||||
"vfmaddps %%xmm4, (%4,%8,4), %%xmm0 , %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, 16(%4,%8,4), %%xmm0 , %%xmm5 \n\t"
|
||||
"vfmaddps %%xmm4, (%5,%8,4), %%xmm1 , %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, 16(%5,%8,4), %%xmm1 , %%xmm5 \n\t"
|
||||
"vfmaddps %%xmm4, (%6,%8,4), %%xmm2 , %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, 16(%6,%8,4), %%xmm2 , %%xmm5 \n\t"
|
||||
"vfmaddps %%xmm4, (%7,%8,4), %%xmm3 , %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, 16(%7,%8,4), %%xmm3 , %%xmm5 \n\t"
|
||||
|
||||
"vfmaddps (%3,%0,4) , %%xmm4,%%xmm8,%%xmm4 \n\t"
|
||||
"vfmaddps 16(%3,%0,4) , %%xmm5,%%xmm8,%%xmm5 \n\t"
|
||||
"vmovups %%xmm4, (%3,%0,4) \n\t" // 4 * y
|
||||
"vmovups %%xmm5, 16(%3,%0,4) \n\t" // 4 * y
|
||||
|
||||
"addq $8 , %0 \n\t"
|
||||
"addq $8 , %8 \n\t"
|
||||
"subq $8 , %1 \n\t"
|
||||
|
||||
|
||||
".L16LABEL%=: \n\t"
|
||||
|
||||
"cmpq $0, %1 \n\t"
|
||||
"je .L16END%= \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
|
||||
"vxorps %%xmm4, %%xmm4 , %%xmm4 \n\t"
|
||||
"vxorps %%xmm5, %%xmm5 , %%xmm5 \n\t"
|
||||
"vxorps %%xmm6, %%xmm6 , %%xmm6 \n\t"
|
||||
"vxorps %%xmm7, %%xmm7 , %%xmm7 \n\t"
|
||||
|
||||
"prefetcht0 192(%4,%0,4) \n\t"
|
||||
"vfmaddps %%xmm4, (%4,%0,4), %%xmm12, %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, 16(%4,%0,4), %%xmm12, %%xmm5 \n\t"
|
||||
"prefetcht0 192(%5,%0,4) \n\t"
|
||||
"vfmaddps %%xmm4, (%5,%0,4), %%xmm13, %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, 16(%5,%0,4), %%xmm13, %%xmm5 \n\t"
|
||||
"prefetcht0 192(%6,%0,4) \n\t"
|
||||
"vfmaddps %%xmm4, (%6,%0,4), %%xmm14, %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, 16(%6,%0,4), %%xmm14, %%xmm5 \n\t"
|
||||
"prefetcht0 192(%7,%0,4) \n\t"
|
||||
"vfmaddps %%xmm4, (%7,%0,4), %%xmm15, %%xmm4 \n\t"
|
||||
".align 2 \n\t"
|
||||
"vfmaddps %%xmm5, 16(%7,%0,4), %%xmm15, %%xmm5 \n\t"
|
||||
|
||||
"vfmaddps %%xmm6, 32(%4,%0,4), %%xmm12, %%xmm6 \n\t"
|
||||
"vfmaddps %%xmm7, 48(%4,%0,4), %%xmm12, %%xmm7 \n\t"
|
||||
"vfmaddps %%xmm6, 32(%5,%0,4), %%xmm13, %%xmm6 \n\t"
|
||||
"vfmaddps %%xmm7, 48(%5,%0,4), %%xmm13, %%xmm7 \n\t"
|
||||
"vfmaddps %%xmm6, 32(%6,%0,4), %%xmm14, %%xmm6 \n\t"
|
||||
"vfmaddps %%xmm7, 48(%6,%0,4), %%xmm14, %%xmm7 \n\t"
|
||||
"vfmaddps %%xmm6, 32(%7,%0,4), %%xmm15, %%xmm6 \n\t"
|
||||
"vfmaddps %%xmm7, 48(%7,%0,4), %%xmm15, %%xmm7 \n\t"
|
||||
|
||||
"prefetcht0 192(%4,%8,4) \n\t"
|
||||
"vfmaddps %%xmm4, (%4,%8,4), %%xmm0 , %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, 16(%4,%8,4), %%xmm0 , %%xmm5 \n\t"
|
||||
"prefetcht0 192(%5,%8,4) \n\t"
|
||||
"vfmaddps %%xmm4, (%5,%8,4), %%xmm1 , %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, 16(%5,%8,4), %%xmm1 , %%xmm5 \n\t"
|
||||
"prefetcht0 192(%6,%8,4) \n\t"
|
||||
"vfmaddps %%xmm4, (%6,%8,4), %%xmm2 , %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, 16(%6,%8,4), %%xmm2 , %%xmm5 \n\t"
|
||||
"prefetcht0 192(%7,%8,4) \n\t"
|
||||
"vfmaddps %%xmm4, (%7,%8,4), %%xmm3 , %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, 16(%7,%8,4), %%xmm3 , %%xmm5 \n\t"
|
||||
|
||||
"vfmaddps %%xmm6, 32(%4,%8,4), %%xmm0 , %%xmm6 \n\t"
|
||||
"vfmaddps %%xmm7, 48(%4,%8,4), %%xmm0 , %%xmm7 \n\t"
|
||||
"vfmaddps %%xmm6, 32(%5,%8,4), %%xmm1 , %%xmm6 \n\t"
|
||||
"vfmaddps %%xmm7, 48(%5,%8,4), %%xmm1 , %%xmm7 \n\t"
|
||||
"vfmaddps %%xmm6, 32(%6,%8,4), %%xmm2 , %%xmm6 \n\t"
|
||||
"vfmaddps %%xmm7, 48(%6,%8,4), %%xmm2 , %%xmm7 \n\t"
|
||||
"vfmaddps %%xmm6, 32(%7,%8,4), %%xmm3 , %%xmm6 \n\t"
|
||||
"vfmaddps %%xmm7, 48(%7,%8,4), %%xmm3 , %%xmm7 \n\t"
|
||||
|
||||
"vfmaddps (%3,%0,4) , %%xmm4,%%xmm8,%%xmm4 \n\t"
|
||||
"vfmaddps 16(%3,%0,4) , %%xmm5,%%xmm8,%%xmm5 \n\t"
|
||||
"vfmaddps 32(%3,%0,4) , %%xmm6,%%xmm8,%%xmm6 \n\t"
|
||||
"vfmaddps 48(%3,%0,4) , %%xmm7,%%xmm8,%%xmm7 \n\t"
|
||||
|
||||
"addq $16, %0 \n\t"
|
||||
"vmovups %%xmm4,-64(%3,%0,4) \n\t" // 4 * y
|
||||
"vmovups %%xmm5,-48(%3,%0,4) \n\t" // 4 * y
|
||||
"addq $16, %8 \n\t"
|
||||
"vmovups %%xmm6,-32(%3,%0,4) \n\t" // 4 * y
|
||||
"vmovups %%xmm7,-16(%3,%0,4) \n\t" // 4 * y
|
||||
|
||||
"subq $16, %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
".L16END%=: \n\t"
|
||||
|
||||
:
"+r" (i), // 0 (read-write: advanced by the asm)
"+r" (n) // 1 (read-write: counted down to zero)
:
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap[0]), // 4
|
||||
"r" (ap[1]), // 5
|
||||
"r" (ap[2]), // 6
|
||||
"r" (ap[3]), // 7
|
||||
"r" (lda4), // 8
|
||||
"r" (alpha) // 9
|
||||
: "cc",
|
||||
"%xmm0", "%xmm1",
|
||||
"%xmm2", "%xmm3",
|
||||
"%xmm4", "%xmm5",
|
||||
"%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
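The 4x8 kernel above is easier to follow next to a scalar equivalent. The sketch below is illustrative only and not part of the commit: `sgemv_kernel_4x8_ref` is a hypothetical name, and `float` and `long` are assumed stand-ins for `FLOAT` and `BLASLONG`. It mirrors the addressing in the assembly, where columns 4..7 are reached from the same four pointers plus `lda4`. The assembly handles rows in blocks of 4, 8 and 16, so it assumes `n` is a positive multiple of 4.

```c
/* Hypothetical scalar reference for sgemv_kernel_4x8 (not part of the commit).
 * ap[0..3] point at columns 0..3; columns 4..7 are read at ap[j] + lda4,
 * matching the (%4,%8,4) operands where %8 starts at lda4 and advances
 * together with the row index.
 *   y[i] += alpha * sum_{j=0..3} (ap[j][i]*x[j] + ap[j][lda4 + i]*x[j+4]) */
static void sgemv_kernel_4x8_ref(long n, float **ap, const float *x, float *y,
                                 long lda4, const float *alpha)
{
    for (long i = 0; i < n; i++) {
        float temp = 0.0f;                       /* accumulators start at zero */
        for (int j = 0; j < 4; j++) {
            temp += ap[j][i] * x[j];             /* columns 0..3 */
            temp += ap[j][lda4 + i] * x[j + 4];  /* columns 4..7 */
        }
        y[i] += *alpha * temp;                   /* scale, then add to y */
    }
}
```

The summation order differs from the assembly, which accumulates in separate vector registers with fused multiply-add, so small rounding differences are expected.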
|
||||
|
||||
|
||||
|
||||
|
||||
#define HAVE_KERNEL_4x4 1
|
||||
static void sgemv_kernel_4x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, FLOAT *alpha) __attribute__ ((noinline));
|
||||
|
||||
static void sgemv_kernel_4x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, FLOAT *alpha)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"vbroadcastss (%2), %%xmm12 \n\t" // x0
|
||||
"vbroadcastss 4(%2), %%xmm13 \n\t" // x1
|
||||
"vbroadcastss 8(%2), %%xmm14 \n\t" // x2
|
||||
"vbroadcastss 12(%2), %%xmm15 \n\t" // x3
|
||||
|
||||
"vbroadcastss (%8), %%xmm8 \n\t" // alpha
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"vxorps %%xmm4, %%xmm4 , %%xmm4 \n\t"
|
||||
"vxorps %%xmm5, %%xmm5 , %%xmm5 \n\t"
|
||||
|
||||
"vfmaddps %%xmm4, (%4,%0,4), %%xmm12, %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, (%5,%0,4), %%xmm13, %%xmm5 \n\t"
|
||||
"vfmaddps %%xmm4, (%6,%0,4), %%xmm14, %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, (%7,%0,4), %%xmm15, %%xmm5 \n\t"
|
||||
|
||||
"vaddps %%xmm4, %%xmm5, %%xmm4 \n\t"
|
||||
|
||||
"vfmaddps (%3,%0,4) , %%xmm4,%%xmm8,%%xmm6 \n\t"
|
||||
"vmovups %%xmm6, (%3,%0,4) \n\t" // 4 * y
|
||||
|
||||
"addq $4 , %0 \n\t"
|
||||
"subq $4 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
:
"+r" (i), // 0 (read-write: advanced by the loop)
"+r" (n) // 1 (read-write: counted down to zero)
:
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap[0]), // 4
|
||||
"r" (ap[1]), // 5
|
||||
"r" (ap[2]), // 6
|
||||
"r" (ap[3]), // 7
|
||||
"r" (alpha) // 8
|
||||
: "cc",
|
||||
"%xmm4", "%xmm5",
|
||||
"%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
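As a reading aid, the 4x4 kernel above corresponds roughly to the following scalar C. This is a sketch under assumptions and not code from the commit: `sgemv_kernel_4x4_ref` is a hypothetical name, and `float` and `long` replace `FLOAT` and `BLASLONG`. The two partial sums mirror the two accumulator registers (`xmm4` and `xmm5`) used in the assembly, which processes 4 rows per iteration and therefore assumes `n` is a positive multiple of 4.

```c
/* Hypothetical scalar reference for sgemv_kernel_4x4 (not part of the commit).
 *   y[i] += alpha * (ap[0][i]*x[0] + ap[1][i]*x[1] + ap[2][i]*x[2] + ap[3][i]*x[3]) */
static void sgemv_kernel_4x4_ref(long n, float **ap, const float *x, float *y,
                                 const float *alpha)
{
    for (long i = 0; i < n; i++) {
        float even = ap[0][i] * x[0] + ap[2][i] * x[2];  /* role of xmm4 */
        float odd  = ap[1][i] * x[1] + ap[3][i] * x[3];  /* role of xmm5 */
        y[i] += *alpha * (even + odd);                   /* final FMA into y */
    }
}
```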
|
||||
|
||||
|
||||
|
|
@@ -1,451 +0,0 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
static void sgemv_kernel_64( long n, float alpha, float *a, long lda, float *x, float *y)
|
||||
{
|
||||
|
||||
|
||||
float *pre = a + lda*3;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movq %0, %%rax\n\t" // n -> rax
|
||||
"vbroadcastss %1, %%ymm1\n\t" // alpha -> ymm1
|
"movq %2, %%rsi\n\t" // address of a -> rsi
"movq %3, %%rcx\n\t" // value of lda -> rcx
"movq %4, %%rdi\n\t" // address of x -> rdi
"movq %5, %%rdx\n\t" // address of y -> rdx
|
||||
"movq %6, %%r8\n\t" // address for prefetch
|
||||
"prefetcht0 (%%r8)\n\t" // Prefetch
|
||||
"prefetcht0 64(%%r8)\n\t" // Prefetch
|
||||
|
||||
"vxorps %%ymm8 , %%ymm8 , %%ymm8 \n\t" // set to zero
|
||||
"vxorps %%ymm9 , %%ymm9 , %%ymm9 \n\t" // set to zero
|
||||
"vxorps %%ymm10, %%ymm10, %%ymm10\n\t" // set to zero
|
||||
"vxorps %%ymm11, %%ymm11, %%ymm11\n\t" // set to zero
|
||||
"vxorps %%ymm12, %%ymm12, %%ymm12\n\t" // set to zero
|
||||
"vxorps %%ymm13, %%ymm13, %%ymm13\n\t" // set to zero
|
||||
"vxorps %%ymm14, %%ymm14, %%ymm14\n\t" // set to zero
|
||||
"vxorps %%ymm15, %%ymm15, %%ymm15\n\t" // set to zero
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"vbroadcastss (%%rdi), %%ymm0 \n\t" // load values of c
|
||||
"nop \n\t"
|
||||
"leaq (%%r8 , %%rcx, 4), %%r8 \n\t" // add lda to pointer for prefetch
|
||||
|
||||
"prefetcht0 (%%r8)\n\t" // Prefetch
|
||||
"vfmaddps %%ymm8 , 0*4(%%rsi), %%ymm0, %%ymm8 \n\t" // multiply a and c and add to temp
|
||||
"prefetcht0 64(%%r8)\n\t" // Prefetch
|
||||
"vfmaddps %%ymm9 , 8*4(%%rsi), %%ymm0, %%ymm9 \n\t" // multiply a and c and add to temp
|
||||
"prefetcht0 128(%%r8)\n\t" // Prefetch
|
||||
"vfmaddps %%ymm10, 16*4(%%rsi), %%ymm0, %%ymm10\n\t" // multiply a and c and add to temp
|
||||
"vfmaddps %%ymm11, 24*4(%%rsi), %%ymm0, %%ymm11\n\t" // multiply a and c and add to temp
|
||||
"prefetcht0 192(%%r8)\n\t" // Prefetch
|
||||
"vfmaddps %%ymm12, 32*4(%%rsi), %%ymm0, %%ymm12\n\t" // multiply a and c and add to temp
|
||||
"vfmaddps %%ymm13, 40*4(%%rsi), %%ymm0, %%ymm13\n\t" // multiply a and c and add to temp
|
||||
"vfmaddps %%ymm14, 48*4(%%rsi), %%ymm0, %%ymm14\n\t" // multiply a and c and add to temp
|
||||
"vfmaddps %%ymm15, 56*4(%%rsi), %%ymm0, %%ymm15\n\t" // multiply a and c and add to temp
|
||||
|
||||
"addq $4 , %%rdi \n\t" // increment pointer of c
|
||||
"leaq (%%rsi, %%rcx, 4), %%rsi \n\t" // add lda to pointer of a
|
||||
|
||||
"dec %%rax \n\t" // n = n -1
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"vmulps %%ymm8 , %%ymm1, %%ymm8 \n\t" // scale by alpha
|
||||
"vmulps %%ymm9 , %%ymm1, %%ymm9 \n\t" // scale by alpha
|
||||
"vmulps %%ymm10, %%ymm1, %%ymm10\n\t" // scale by alpha
|
||||
"vmulps %%ymm11, %%ymm1, %%ymm11\n\t" // scale by alpha
|
||||
"vmulps %%ymm12, %%ymm1, %%ymm12\n\t" // scale by alpha
|
||||
"vmulps %%ymm13, %%ymm1, %%ymm13\n\t" // scale by alpha
|
||||
"vmulps %%ymm14, %%ymm1, %%ymm14\n\t" // scale by alpha
|
||||
"vmulps %%ymm15, %%ymm1, %%ymm15\n\t" // scale by alpha
|
||||
|
||||
"vmovups %%ymm8 , (%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm9 , 8*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm10, 16*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm11, 24*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm12, 32*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm13, 40*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm14, 48*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm15, 56*4(%%rdx) \n\t" // store temp -> y
|
||||
|
||||
:
|
||||
:
|
||||
"m" (n), // 0
|
||||
"m" (alpha), // 1
|
||||
"m" (a), // 2
|
||||
"m" (lda), // 3
|
||||
"m" (x), // 4
|
||||
"m" (y), // 5
|
||||
"m" (pre) // 6
|
||||
: "%rax", "%rcx", "%rdx", "%rsi", "%rdi", "%r8",
|
||||
"%xmm0", "%xmm1",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
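The removed fixed-width kernels (`sgemv_kernel_64` down to `sgemv_kernel_1`) all follow one pattern, which can be sketched in scalar C as below. The function name `sgemv_kernel_rows_ref` and its `rows` parameter are hypothetical and only generalize the hard-coded row counts for illustration. Unlike the newer kernels, this family overwrites `y` rather than accumulating into it.

```c
/* Hypothetical scalar reference for the removed sgemv_kernel_<rows> family
 * (not part of the commit). A is column-major with leading dimension lda;
 * each loop iteration in the asm consumes one column and one element of x.
 *   y[i] = alpha * sum_{k=0..n-1} a[k*lda + i] * x[k],   0 <= i < rows */
static void sgemv_kernel_rows_ref(long rows, long n, float alpha,
                                  const float *a, long lda,
                                  const float *x, float *y)
{
    for (long i = 0; i < rows; i++) {
        float temp = 0.0f;                  /* vxorps: accumulator set to zero */
        for (long k = 0; k < n; k++)
            temp += a[k * lda + i] * x[k];  /* multiply-add into the accumulator */
        y[i] = alpha * temp;                /* scaled by alpha and stored, not added */
    }
}
```

Calling it with `rows = 64` would correspond to `sgemv_kernel_64`; the prefetch pointer `pre` in the assembly has no effect on the result and is omitted.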
|
||||
|
||||
|
||||
|
||||
static void sgemv_kernel_32( long n, float alpha, float *a, long lda, float *x, float *y)
|
||||
{
|
||||
|
||||
|
||||
float *pre = a + lda*3;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movq %0, %%rax\n\t" // n -> rax
|
||||
"vbroadcastss %1, %%xmm1\n\t" // alpha -> xmm1
|
"movq %2, %%rsi\n\t" // address of a -> rsi
"movq %3, %%rcx\n\t" // value of lda -> rcx
"movq %4, %%rdi\n\t" // address of x -> rdi
"movq %5, %%rdx\n\t" // address of y -> rdx
|
||||
"movq %6, %%r8\n\t" // address for prefetch
|
||||
"prefetcht0 (%%r8)\n\t" // Prefetch
|
||||
"prefetcht0 64(%%r8)\n\t" // Prefetch
|
||||
|
||||
"vxorps %%xmm8 , %%xmm8 , %%xmm8 \n\t" // set to zero
|
||||
"vxorps %%xmm9 , %%xmm9 , %%xmm9 \n\t" // set to zero
|
||||
"vxorps %%xmm10, %%xmm10, %%xmm10\n\t" // set to zero
|
||||
"vxorps %%xmm11, %%xmm11, %%xmm11\n\t" // set to zero
|
||||
"vxorps %%xmm12, %%xmm12, %%xmm12\n\t" // set to zero
|
||||
"vxorps %%xmm13, %%xmm13, %%xmm13\n\t" // set to zero
|
||||
"vxorps %%xmm14, %%xmm14, %%xmm14\n\t" // set to zero
|
||||
"vxorps %%xmm15, %%xmm15, %%xmm15\n\t" // set to zero
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"vbroadcastss (%%rdi), %%xmm0 \n\t" // load values of c
|
||||
"nop \n\t"
|
||||
"leaq (%%r8 , %%rcx, 4), %%r8 \n\t" // add lda to pointer for prefetch
|
||||
|
||||
"prefetcht0 (%%r8)\n\t" // Prefetch
|
||||
"vfmaddps %%xmm8 , 0*4(%%rsi), %%xmm0, %%xmm8 \n\t" // multiply a and c and add to temp
|
||||
"prefetcht0 64(%%r8)\n\t" // Prefetch
|
||||
"vfmaddps %%xmm9 , 4*4(%%rsi), %%xmm0, %%xmm9 \n\t" // multiply a and c and add to temp
|
||||
"vfmaddps %%xmm10, 8*4(%%rsi), %%xmm0, %%xmm10\n\t" // multiply a and c and add to temp
|
||||
"vfmaddps %%xmm11, 12*4(%%rsi), %%xmm0, %%xmm11\n\t" // multiply a and c and add to temp
|
||||
"vfmaddps %%xmm12, 16*4(%%rsi), %%xmm0, %%xmm12\n\t" // multiply a and c and add to temp
|
||||
"vfmaddps %%xmm13, 20*4(%%rsi), %%xmm0, %%xmm13\n\t" // multiply a and c and add to temp
|
||||
"vfmaddps %%xmm14, 24*4(%%rsi), %%xmm0, %%xmm14\n\t" // multiply a and c and add to temp
|
||||
"vfmaddps %%xmm15, 28*4(%%rsi), %%xmm0, %%xmm15\n\t" // multiply a and c and add to temp
|
||||
|
||||
"addq $4 , %%rdi \n\t" // increment pointer of c
|
||||
"leaq (%%rsi, %%rcx, 4), %%rsi \n\t" // add lda to pointer of a
|
||||
|
||||
"dec %%rax \n\t" // n = n -1
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"vmulps %%xmm8 , %%xmm1, %%xmm8 \n\t" // scale by alpha
|
||||
"vmulps %%xmm9 , %%xmm1, %%xmm9 \n\t" // scale by alpha
|
||||
"vmulps %%xmm10, %%xmm1, %%xmm10\n\t" // scale by alpha
|
||||
"vmulps %%xmm11, %%xmm1, %%xmm11\n\t" // scale by alpha
|
||||
"vmulps %%xmm12, %%xmm1, %%xmm12\n\t" // scale by alpha
|
||||
"vmulps %%xmm13, %%xmm1, %%xmm13\n\t" // scale by alpha
|
||||
"vmulps %%xmm14, %%xmm1, %%xmm14\n\t" // scale by alpha
|
||||
"vmulps %%xmm15, %%xmm1, %%xmm15\n\t" // scale by alpha
|
||||
|
||||
"vmovups %%xmm8 , (%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%xmm9 , 4*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%xmm10, 8*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%xmm11, 12*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%xmm12, 16*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%xmm13, 20*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%xmm14, 24*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%xmm15, 28*4(%%rdx) \n\t" // store temp -> y
|
||||
|
||||
:
|
||||
:
|
||||
"m" (n), // 0
|
||||
"m" (alpha), // 1
|
||||
"m" (a), // 2
|
||||
"m" (lda), // 3
|
||||
"m" (x), // 4
|
||||
"m" (y), // 5
|
||||
"m" (pre) // 6
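: "%rax", "%rcx", "%rdx", "%rsi", "%rdi", "%r8",
"%xmm0", "%xmm1",
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
"memory"
);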
|
||||
|
||||
}
|
||||
|
||||
static void sgemv_kernel_16( long n, float alpha, float *a, long lda, float *x, float *y)
|
||||
{
|
||||
|
||||
float *pre = a + lda*3;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movq %0, %%rax\n\t" // n -> rax
|
||||
"vbroadcastss %1, %%ymm1\n\t" // alpha -> ymm1
|
"movq %2, %%rsi\n\t" // address of a -> rsi
"movq %3, %%rcx\n\t" // value of lda -> rcx
"movq %4, %%rdi\n\t" // address of x -> rdi
"movq %5, %%rdx\n\t" // address of y -> rdx
|
||||
"movq %6, %%r8\n\t" // address for prefetch
|
||||
"prefetcht0 (%%r8)\n\t" // Prefetch
|
||||
|
||||
"vxorps %%ymm12, %%ymm12, %%ymm12\n\t" // set to zero
|
||||
"vxorps %%ymm13, %%ymm13, %%ymm13\n\t" // set to zero
|
||||
|
||||
".L01LOOP%=: \n\t"
|
||||
"vbroadcastss (%%rdi), %%ymm0 \n\t" // load values of c
|
||||
"addq $4 , %%rdi \n\t" // increment pointer of c
|
||||
|
||||
"leaq (%%r8 , %%rcx, 4), %%r8 \n\t" // add lda to pointer for prefetch
|
||||
"prefetcht0 (%%r8)\n\t" // Prefetch
|
||||
|
||||
"vfmaddps %%ymm12, 0*4(%%rsi), %%ymm0, %%ymm12\n\t" // multiply a and c and add to temp
|
||||
"vfmaddps %%ymm13, 8*4(%%rsi), %%ymm0, %%ymm13\n\t" // multiply a and c and add to temp
|
||||
|
||||
"leaq (%%rsi, %%rcx, 4), %%rsi \n\t" // add lda to pointer of a
|
||||
|
||||
"dec %%rax \n\t" // n = n -1
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"vmulps %%ymm12, %%ymm1, %%ymm12\n\t" // scale by alpha
|
||||
"vmulps %%ymm13, %%ymm1, %%ymm13\n\t" // scale by alpha
|
||||
|
||||
"vmovups %%ymm12, (%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm13, 8*4(%%rdx) \n\t" // store temp -> y
|
||||
|
||||
:
|
||||
:
|
||||
"m" (n), // 0
|
||||
"m" (alpha), // 1
|
||||
"m" (a), // 2
|
||||
"m" (lda), // 3
|
||||
"m" (x), // 4
|
||||
"m" (y), // 5
|
||||
"m" (pre) // 6
|
||||
: "%rax", "%rcx", "%rdx", "%rsi", "%rdi", "%r8",
|
||||
"%xmm0", "%xmm1",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
static void sgemv_kernel_8( long n, float alpha, float *a, long lda, float *x, float *y)
|
||||
{
|
||||
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movq %0, %%rax\n\t" // n -> rax
|
||||
"vbroadcastss %1, %%ymm1\n\t" // alpha -> ymm1
|
"movq %2, %%rsi\n\t" // address of a -> rsi
"movq %3, %%rcx\n\t" // value of lda -> rcx
"movq %4, %%rdi\n\t" // address of x -> rdi
"movq %5, %%rdx\n\t" // address of y -> rdx
|
||||
|
||||
"vxorps %%ymm12, %%ymm12, %%ymm12\n\t" // set to zero
|
||||
|
||||
".L01LOOP%=: \n\t"
|
||||
"vbroadcastss (%%rdi), %%ymm0 \n\t" // load values of c
|
||||
"addq $4 , %%rdi \n\t" // increment pointer of c
|
||||
|
||||
"vfmaddps %%ymm12, 0*4(%%rsi), %%ymm0, %%ymm12\n\t" // multiply a and c and add to temp
|
||||
|
||||
"leaq (%%rsi, %%rcx, 4), %%rsi \n\t" // add lda to pointer of a
|
||||
|
||||
"dec %%rax \n\t" // n = n -1
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"vmulps %%ymm12, %%ymm1, %%ymm12\n\t" // scale by alpha
|
||||
|
||||
"vmovups %%ymm12, (%%rdx) \n\t" // store temp -> y
|
||||
|
||||
:
|
||||
:
|
||||
"m" (n), // 0
|
||||
"m" (alpha), // 1
|
||||
"m" (a), // 2
|
||||
"m" (lda), // 3
|
||||
"m" (x), // 4
|
||||
"m" (y) // 5
|
||||
: "%rax", "%rcx", "%rdx", "%rsi", "%rdi", "%r8",
|
||||
"%xmm0", "%xmm1",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
static void sgemv_kernel_4( long n, float alpha, float *a, long lda, float *x, float *y)
|
||||
{
|
||||
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movq %0, %%rax\n\t" // n -> rax
|
||||
"vbroadcastss %1, %%xmm1\n\t" // alpha -> xmm1
|
"movq %2, %%rsi\n\t" // address of a -> rsi
"movq %3, %%rcx\n\t" // value of lda -> rcx
"movq %4, %%rdi\n\t" // address of x -> rdi
"movq %5, %%rdx\n\t" // address of y -> rdx
|
||||
|
||||
"vxorps %%xmm12, %%xmm12, %%xmm12\n\t" // set to zero
|
||||
|
||||
".L01LOOP%=: \n\t"
|
||||
"vbroadcastss (%%rdi), %%xmm0 \n\t" // load values of c
|
||||
"addq $4 , %%rdi \n\t" // increment pointer of c
|
||||
|
||||
"vfmaddps %%xmm12, 0*4(%%rsi), %%xmm0, %%xmm12\n\t" // multiply a and c and add to temp
|
||||
|
||||
"leaq (%%rsi, %%rcx, 4), %%rsi \n\t" // add lda to pointer of a
|
||||
|
||||
"dec %%rax \n\t" // n = n -1
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"vmulps %%xmm12, %%xmm1, %%xmm12\n\t" // scale by alpha
|
||||
|
||||
"vmovups %%xmm12, (%%rdx) \n\t" // store temp -> y
|
||||
|
||||
:
|
||||
:
|
||||
"m" (n), // 0
|
||||
"m" (alpha), // 1
|
||||
"m" (a), // 2
|
||||
"m" (lda), // 3
|
||||
"m" (x), // 4
|
||||
"m" (y) // 5
|
||||
: "%rax", "%rcx", "%rdx", "%rsi", "%rdi", "%r8",
|
||||
"%xmm0", "%xmm1",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
static void sgemv_kernel_2( long n, float alpha, float *a, long lda, float *x, float *y)
|
||||
{
|
||||
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movq %0, %%rax\n\t" // n -> rax
|
||||
"vmovss %1, %%xmm1\n\t" // alpha -> xmm1
|
"movq %2, %%rsi\n\t" // address of a -> rsi
"movq %3, %%rcx\n\t" // value of lda -> rcx
"movq %4, %%rdi\n\t" // address of x -> rdi
"movq %5, %%rdx\n\t" // address of y -> rdx
|
||||
|
||||
"vxorps %%xmm12, %%xmm12, %%xmm12\n\t" // set to zero
|
||||
"vxorps %%xmm13, %%xmm13, %%xmm13\n\t" // set to zero
|
||||
|
||||
".L01LOOP%=: \n\t"
|
||||
"vmovss (%%rdi), %%xmm0 \n\t" // load values of c
|
||||
"addq $4 , %%rdi \n\t" // increment pointer of c
|
||||
|
||||
"vfmaddss %%xmm12, 0*4(%%rsi), %%xmm0, %%xmm12\n\t" // multiply a and c and add to temp
|
||||
"vfmaddss %%xmm13, 1*4(%%rsi), %%xmm0, %%xmm13\n\t" // multiply a and c and add to temp
|
||||
|
||||
"leaq (%%rsi, %%rcx, 4), %%rsi \n\t" // add lda to pointer of a
|
||||
|
||||
"dec %%rax \n\t" // n = n -1
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"vmulss %%xmm12, %%xmm1, %%xmm12\n\t" // scale by alpha
|
||||
"vmulss %%xmm13, %%xmm1, %%xmm13\n\t" // scale by alpha
|
||||
|
||||
"vmovss %%xmm12, (%%rdx) \n\t" // store temp -> y
|
||||
"vmovss %%xmm13, 4(%%rdx) \n\t" // store temp -> y
|
||||
|
||||
:
|
||||
:
|
||||
"m" (n), // 0
|
||||
"m" (alpha), // 1
|
||||
"m" (a), // 2
|
||||
"m" (lda), // 3
|
||||
"m" (x), // 4
|
||||
"m" (y) // 5
|
||||
: "%rax", "%rcx", "%rdx", "%rsi", "%rdi", "%r8",
|
||||
"%xmm0", "%xmm1",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
||||
static void sgemv_kernel_1( long n, float alpha, float *a, long lda, float *x, float *y)
|
||||
{
|
||||
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movq %0, %%rax\n\t" // n -> rax
|
||||
"vmovss %1, %%xmm1\n\t" // alpha -> xmm1
|
"movq %2, %%rsi\n\t" // address of a -> rsi
"movq %3, %%rcx\n\t" // value of lda -> rcx
"movq %4, %%rdi\n\t" // address of x -> rdi
"movq %5, %%rdx\n\t" // address of y -> rdx
|
||||
|
||||
"vxorps %%xmm12, %%xmm12, %%xmm12\n\t" // set to zero
|
||||
|
||||
".L01LOOP%=: \n\t"
|
||||
"vmovss (%%rdi), %%xmm0 \n\t" // load values of c
|
||||
"addq $4 , %%rdi \n\t" // increment pointer of c
|
||||
|
||||
"vfmaddss %%xmm12, 0*4(%%rsi), %%xmm0, %%xmm12\n\t" // multiply a and c and add to temp
|
||||
|
||||
"leaq (%%rsi, %%rcx, 4), %%rsi \n\t" // add lda to pointer of a
|
||||
|
||||
"dec %%rax \n\t" // n = n -1
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"vmulss %%xmm12, %%xmm1, %%xmm12\n\t" // scale by alpha
|
||||
|
||||
"vmovss %%xmm12, (%%rdx) \n\t" // store temp -> y
|
||||
|
||||
:
|
||||
:
|
||||
"m" (n), // 0
|
||||
"m" (alpha), // 1
|
||||
"m" (a), // 2
|
||||
"m" (lda), // 3
|
||||
"m" (x), // 4
|
||||
"m" (y) // 5
|
||||
: "%rax", "%rcx", "%rdx", "%rsi", "%rdi", "%r8",
|
||||
"%xmm0", "%xmm1",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
|
@@ -0,0 +1,299 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
|
||||
#define HAVE_KERNEL_4x8 1
|
||||
static void sgemv_kernel_4x8( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, BLASLONG lda4, FLOAT *alpha) __attribute__ ((noinline));
|
||||
|
||||
static void sgemv_kernel_4x8( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, BLASLONG lda4, FLOAT *alpha)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"vzeroupper \n\t"
|
||||
"vbroadcastss (%2), %%ymm12 \n\t" // x0
|
||||
"vbroadcastss 4(%2), %%ymm13 \n\t" // x1
|
||||
"vbroadcastss 8(%2), %%ymm14 \n\t" // x2
|
||||
"vbroadcastss 12(%2), %%ymm15 \n\t" // x3
|
||||
"vbroadcastss 16(%2), %%ymm0 \n\t" // x4
|
||||
"vbroadcastss 20(%2), %%ymm1 \n\t" // x5
|
||||
"vbroadcastss 24(%2), %%ymm2 \n\t" // x6
|
||||
"vbroadcastss 28(%2), %%ymm3 \n\t" // x7
|
||||
|
||||
"vbroadcastss (%9), %%ymm6 \n\t" // alpha
|
||||
|
||||
"testq $0x04, %1 \n\t"
|
||||
"jz .L08LABEL%= \n\t"
|
||||
|
||||
"vmovups (%3,%0,4), %%xmm7 \n\t" // 4 * y
|
||||
"vxorps %%xmm4 , %%xmm4, %%xmm4 \n\t"
|
||||
"vxorps %%xmm5 , %%xmm5, %%xmm5 \n\t"
|
||||
|
||||
"vfmadd231ps (%4,%0,4), %%xmm12, %%xmm4 \n\t"
|
||||
"vfmadd231ps (%5,%0,4), %%xmm13, %%xmm5 \n\t"
|
||||
"vfmadd231ps (%6,%0,4), %%xmm14, %%xmm4 \n\t"
|
||||
"vfmadd231ps (%7,%0,4), %%xmm15, %%xmm5 \n\t"
|
||||
|
||||
"vfmadd231ps (%4,%8,4), %%xmm0 , %%xmm4 \n\t"
|
||||
"vfmadd231ps (%5,%8,4), %%xmm1 , %%xmm5 \n\t"
|
||||
"vfmadd231ps (%6,%8,4), %%xmm2 , %%xmm4 \n\t"
|
||||
"vfmadd231ps (%7,%8,4), %%xmm3 , %%xmm5 \n\t"
|
||||
|
||||
"vaddps %%xmm4 , %%xmm5 , %%xmm5 \n\t"
|
||||
"vmulps %%xmm6 , %%xmm5 , %%xmm5 \n\t"
|
||||
"vaddps %%xmm7 , %%xmm5 , %%xmm5 \n\t"
|
||||
|
||||
"vmovups %%xmm5, (%3,%0,4) \n\t" // 4 * y
|
||||
|
||||
"addq $4 , %8 \n\t"
|
||||
"addq $4 , %0 \n\t"
|
||||
"subq $4 , %1 \n\t"
|
||||
|
||||
".L08LABEL%=: \n\t"
|
||||
|
||||
"testq $0x08, %1 \n\t"
|
||||
"jz .L16LABEL%= \n\t"
|
||||
|
||||
"vmovups (%3,%0,4), %%ymm7 \n\t" // 8 * y
|
||||
"vxorps %%ymm4 , %%ymm4, %%ymm4 \n\t"
|
||||
"vxorps %%ymm5 , %%ymm5, %%ymm5 \n\t"
|
||||
|
||||
"vfmadd231ps (%4,%0,4), %%ymm12, %%ymm4 \n\t"
|
||||
"vfmadd231ps (%5,%0,4), %%ymm13, %%ymm5 \n\t"
|
||||
"vfmadd231ps (%6,%0,4), %%ymm14, %%ymm4 \n\t"
|
||||
"vfmadd231ps (%7,%0,4), %%ymm15, %%ymm5 \n\t"
|
||||
|
||||
"vfmadd231ps (%4,%8,4), %%ymm0 , %%ymm4 \n\t"
|
||||
"vfmadd231ps (%5,%8,4), %%ymm1 , %%ymm5 \n\t"
|
||||
"vfmadd231ps (%6,%8,4), %%ymm2 , %%ymm4 \n\t"
|
||||
"vfmadd231ps (%7,%8,4), %%ymm3 , %%ymm5 \n\t"
|
||||
|
||||
"vaddps %%ymm4 , %%ymm5 , %%ymm5 \n\t"
|
||||
"vmulps %%ymm6 , %%ymm5 , %%ymm5 \n\t"
|
||||
"vaddps %%ymm7 , %%ymm5 , %%ymm5 \n\t"
|
||||
|
||||
|
||||
"vmovups %%ymm5, (%3,%0,4) \n\t" // 8 * y
|
||||
|
||||
"addq $8 , %8 \n\t"
|
||||
"addq $8 , %0 \n\t"
|
||||
"subq $8 , %1 \n\t"
|
||||
|
||||
".L16LABEL%=: \n\t"
|
||||
|
||||
"cmpq $0, %1 \n\t"
|
||||
"je .L16END%= \n\t"
|
||||
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
|
||||
"vxorps %%ymm4 , %%ymm4, %%ymm4 \n\t"
|
||||
"vxorps %%ymm5 , %%ymm5, %%ymm5 \n\t"
|
||||
"vmovups (%3,%0,4), %%ymm8 \n\t" // 8 * y
|
||||
"vmovups 32(%3,%0,4), %%ymm9 \n\t" // 8 * y
|
||||
|
||||
"vfmadd231ps (%4,%0,4), %%ymm12, %%ymm4 \n\t"
|
||||
"vfmadd231ps 32(%4,%0,4), %%ymm12, %%ymm5 \n\t"
|
||||
"vfmadd231ps (%5,%0,4), %%ymm13, %%ymm4 \n\t"
|
||||
"vfmadd231ps 32(%5,%0,4), %%ymm13, %%ymm5 \n\t"
|
||||
"vfmadd231ps (%6,%0,4), %%ymm14, %%ymm4 \n\t"
|
||||
"vfmadd231ps 32(%6,%0,4), %%ymm14, %%ymm5 \n\t"
|
||||
"vfmadd231ps (%7,%0,4), %%ymm15, %%ymm4 \n\t"
|
||||
"vfmadd231ps 32(%7,%0,4), %%ymm15, %%ymm5 \n\t"
|
||||
|
||||
"vfmadd231ps (%4,%8,4), %%ymm0 , %%ymm4 \n\t"
|
||||
"addq $16, %0 \n\t"
|
||||
"vfmadd231ps 32(%4,%8,4), %%ymm0 , %%ymm5 \n\t"
|
||||
"vfmadd231ps (%5,%8,4), %%ymm1 , %%ymm4 \n\t"
|
||||
"vfmadd231ps 32(%5,%8,4), %%ymm1 , %%ymm5 \n\t"
|
||||
"vfmadd231ps (%6,%8,4), %%ymm2 , %%ymm4 \n\t"
|
||||
"vfmadd231ps 32(%6,%8,4), %%ymm2 , %%ymm5 \n\t"
|
||||
"vfmadd231ps (%7,%8,4), %%ymm3 , %%ymm4 \n\t"
|
||||
"vfmadd231ps 32(%7,%8,4), %%ymm3 , %%ymm5 \n\t"
|
||||
|
||||
"vfmadd231ps %%ymm6 , %%ymm4 , %%ymm8 \n\t"
|
||||
"vfmadd231ps %%ymm6 , %%ymm5 , %%ymm9 \n\t"
|
||||
|
||||
"addq $16, %8 \n\t"
|
||||
"vmovups %%ymm8,-64(%3,%0,4) \n\t" // 8 * y
|
||||
"subq $16, %1 \n\t"
|
||||
"vmovups %%ymm9,-32(%3,%0,4) \n\t" // 8 * y
|
||||
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
".L16END%=: \n\t"
|
||||
"vzeroupper \n\t"
|
||||
|
||||
:
|
||||
:
|
||||
"r" (i), // 0
|
||||
"r" (n), // 1
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap[0]), // 4
|
||||
"r" (ap[1]), // 5
|
||||
"r" (ap[2]), // 6
|
||||
"r" (ap[3]), // 7
|
||||
"r" (lda4), // 8
|
||||
"r" (alpha) // 9
|
||||
: "cc",
|
||||
"%xmm0", "%xmm1",
|
||||
"%xmm2", "%xmm3",
|
||||
"%xmm4", "%xmm5",
|
||||
"%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
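
For readers less familiar with AVX2/FMA assembly, the arithmetic of `sgemv_kernel_4x8` above can be read as the scalar loop below. This is a sketch inferred from the assembly, not code from the commit. The name `sgemv_kernel_4x8_ref` is hypothetical, and plain `long`/`float` stand in for `BLASLONG`/`FLOAT`. It assumes `n` is a positive multiple of 4 and that columns 4..7 start `lda4` floats after the columns addressed by `ap[0..3]`. Rounding may differ slightly from the vector code, which uses fused multiply-adds and two partial accumulators.

```c
/* Hypothetical scalar reference for sgemv_kernel_4x8 (illustration only).
 * Computes y[i] += alpha * sum over 8 columns of A(i, j) * x[j]. */
static void sgemv_kernel_4x8_ref(long n, float **ap, const float *x,
                                 float *y, long lda4, const float *alpha)
{
    for (long i = 0; i < n; i++) {
        float temp = 0.0f;
        for (int j = 0; j < 4; j++) {
            temp += ap[j][i] * x[j];            /* columns 0..3 */
            temp += ap[j][i + lda4] * x[j + 4]; /* columns 4..7 */
        }
        y[i] += *alpha * temp;                  /* accumulate into y */
    }
}
```

A loop of this shape can serve as a correctness oracle when comparing the assembly kernel against small test matrices.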
|
||||
|
||||
|
||||
|
||||
#define HAVE_KERNEL_4x4 1
|
||||
static void sgemv_kernel_4x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, FLOAT *alpha) __attribute__ ((noinline));
|
||||
|
||||
static void sgemv_kernel_4x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, FLOAT *alpha)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"vzeroupper \n\t"
|
||||
"vbroadcastss (%2), %%ymm12 \n\t" // x0
|
||||
"vbroadcastss 4(%2), %%ymm13 \n\t" // x1
|
||||
"vbroadcastss 8(%2), %%ymm14 \n\t" // x2
|
||||
"vbroadcastss 12(%2), %%ymm15 \n\t" // x3
|
||||
|
||||
"vbroadcastss (%8), %%ymm6 \n\t" // alpha
|
||||
|
||||
"testq $0x04, %1 \n\t"
|
||||
"jz .L08LABEL%= \n\t"
|
||||
|
||||
"vxorps %%ymm4 , %%ymm4, %%ymm4 \n\t"
|
||||
"vxorps %%ymm5 , %%ymm5, %%ymm5 \n\t"
|
||||
"vmovups (%3,%0,4), %%xmm7 \n\t" // 4 * y
|
||||
|
||||
"vfmadd231ps (%4,%0,4), %%xmm12, %%xmm4 \n\t"
|
||||
"vfmadd231ps (%5,%0,4), %%xmm13, %%xmm5 \n\t"
|
||||
"vfmadd231ps (%6,%0,4), %%xmm14, %%xmm4 \n\t"
|
||||
"vfmadd231ps (%7,%0,4), %%xmm15, %%xmm5 \n\t"
|
||||
|
||||
"vaddps %%xmm4 , %%xmm5 , %%xmm5 \n\t"
|
||||
"vmulps %%xmm6 , %%xmm5 , %%xmm5 \n\t"
|
||||
"vaddps %%xmm7 , %%xmm5 , %%xmm5 \n\t"
|
||||
|
||||
"vmovups %%xmm5, (%3,%0,4) \n\t" // 4 * y
|
||||
|
||||
"addq $4 , %0 \n\t"
|
||||
"subq $4 , %1 \n\t"
|
||||
|
||||
".L08LABEL%=: \n\t"
|
||||
|
||||
"testq $0x08, %1 \n\t"
|
||||
"jz .L16LABEL%= \n\t"
|
||||
|
||||
"vxorps %%ymm4 , %%ymm4, %%ymm4 \n\t"
|
||||
"vxorps %%ymm5 , %%ymm5, %%ymm5 \n\t"
|
||||
"vmovups (%3,%0,4), %%ymm7 \n\t" // 8 * y
|
||||
|
||||
"vfmadd231ps (%4,%0,4), %%ymm12, %%ymm4 \n\t"
|
||||
"vfmadd231ps (%5,%0,4), %%ymm13, %%ymm5 \n\t"
|
||||
"vfmadd231ps (%6,%0,4), %%ymm14, %%ymm4 \n\t"
|
||||
"vfmadd231ps (%7,%0,4), %%ymm15, %%ymm5 \n\t"
|
||||
|
||||
"vaddps %%ymm4 , %%ymm5 , %%ymm5 \n\t"
|
||||
"vmulps %%ymm6 , %%ymm5 , %%ymm5 \n\t"
|
||||
"vaddps %%ymm7 , %%ymm5 , %%ymm5 \n\t"
|
||||
|
||||
"vmovups %%ymm5, (%3,%0,4) \n\t" // 8 * y
|
||||
|
||||
"addq $8 , %0 \n\t"
|
||||
"subq $8 , %1 \n\t"
|
||||
|
||||
".L16LABEL%=: \n\t"
|
||||
|
||||
"cmpq $0, %1 \n\t"
|
||||
"je .L16END%= \n\t"
|
||||
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"vxorps %%ymm4 , %%ymm4, %%ymm4 \n\t"
|
||||
"vxorps %%ymm5 , %%ymm5, %%ymm5 \n\t"
|
||||
"vmovups (%3,%0,4), %%ymm8 \n\t" // 8 * y
|
||||
"vmovups 32(%3,%0,4), %%ymm9 \n\t" // 8 * y
|
||||
|
||||
"vfmadd231ps (%4,%0,4), %%ymm12, %%ymm4 \n\t"
|
||||
"vfmadd231ps 32(%4,%0,4), %%ymm12, %%ymm5 \n\t"
|
||||
"vfmadd231ps (%5,%0,4), %%ymm13, %%ymm4 \n\t"
|
||||
"vfmadd231ps 32(%5,%0,4), %%ymm13, %%ymm5 \n\t"
|
||||
"vfmadd231ps (%6,%0,4), %%ymm14, %%ymm4 \n\t"
|
||||
"vfmadd231ps 32(%6,%0,4), %%ymm14, %%ymm5 \n\t"
|
||||
"vfmadd231ps (%7,%0,4), %%ymm15, %%ymm4 \n\t"
|
||||
"vfmadd231ps 32(%7,%0,4), %%ymm15, %%ymm5 \n\t"
|
||||
|
||||
"vfmadd231ps %%ymm6 , %%ymm4 , %%ymm8 \n\t"
|
||||
"vfmadd231ps %%ymm6 , %%ymm5 , %%ymm9 \n\t"
|
||||
|
||||
"vmovups %%ymm8, (%3,%0,4) \n\t" // 8 * y
|
||||
"vmovups %%ymm9, 32(%3,%0,4) \n\t" // 8 * y
|
||||
|
||||
"addq $16, %0 \n\t"
|
||||
"subq $16, %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
".L16END%=: \n\t"
|
||||
"vzeroupper \n\t"
|
||||
|
||||
:
|
||||
:
|
||||
"r" (i), // 0
|
||||
"r" (n), // 1
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap[0]), // 4
|
||||
"r" (ap[1]), // 5
|
||||
"r" (ap[2]), // 6
|
||||
"r" (ap[3]), // 7
|
||||
"r" (alpha) // 8
|
||||
: "cc",
|
||||
"%xmm4", "%xmm5",
|
||||
"%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
|
@@ -1,461 +0,0 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
static void sgemv_kernel_64( long n, float alpha, float *a, long lda, float *x, float *y)
|
||||
{
|
||||
|
||||
|
||||
float *pre = a + lda*2;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movq %0, %%rax\n\t" // n -> rax
|
||||
"vbroadcastss %1, %%ymm1\n\t" // alpha -> ymm1
|
||||
"movq %2, %%rsi\n\t" // address of a -> rsi
|
||||
"movq %3, %%rcx\n\t" // value of lda -> rcx
|
||||
"movq %4, %%rdi\n\t" // address of x -> rdi
|
||||
"movq %5, %%rdx\n\t" // address of y -> rdx
|
||||
"movq %6, %%r8\n\t" // address for prefetch
|
||||
"prefetcht0 (%%r8)\n\t" // Prefetch
|
||||
"prefetcht0 64(%%r8)\n\t" // Prefetch
|
||||
|
||||
"vxorps %%ymm8 , %%ymm8 , %%ymm8 \n\t" // set to zero
|
||||
"vxorps %%ymm9 , %%ymm9 , %%ymm9 \n\t" // set to zero
|
||||
"vxorps %%ymm10, %%ymm10, %%ymm10\n\t" // set to zero
|
||||
"vxorps %%ymm11, %%ymm11, %%ymm11\n\t" // set to zero
|
||||
"vxorps %%ymm12, %%ymm12, %%ymm12\n\t" // set to zero
|
||||
"vxorps %%ymm13, %%ymm13, %%ymm13\n\t" // set to zero
|
||||
"vxorps %%ymm14, %%ymm14, %%ymm14\n\t" // set to zero
|
||||
"vxorps %%ymm15, %%ymm15, %%ymm15\n\t" // set to zero
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"vbroadcastss (%%rdi), %%ymm0 \n\t" // load values of c
|
||||
"leaq (%%r8 , %%rcx, 4), %%r8 \n\t" // add lda to pointer for prefetch
|
||||
|
||||
"prefetcht0 (%%r8)\n\t" // Prefetch
|
||||
"vfmadd231ps 0*4(%%rsi), %%ymm0, %%ymm8 \n\t" // multiply a and c and add to temp
|
||||
"vfmadd231ps 8*4(%%rsi), %%ymm0, %%ymm9 \n\t" // multiply a and c and add to temp
|
||||
"prefetcht0 64(%%r8)\n\t" // Prefetch
|
||||
"vfmadd231ps 16*4(%%rsi), %%ymm0, %%ymm10\n\t" // multiply a and c and add to temp
|
||||
"vfmadd231ps 24*4(%%rsi), %%ymm0, %%ymm11\n\t" // multiply a and c and add to temp
|
||||
"prefetcht0 128(%%r8)\n\t" // Prefetch
|
||||
"vfmadd231ps 32*4(%%rsi), %%ymm0, %%ymm12\n\t" // multiply a and c and add to temp
|
||||
"vfmadd231ps 40*4(%%rsi), %%ymm0, %%ymm13\n\t" // multiply a and c and add to temp
|
||||
"prefetcht0 192(%%r8)\n\t" // Prefetch
|
||||
"vfmadd231ps 48*4(%%rsi), %%ymm0, %%ymm14\n\t" // multiply a and c and add to temp
|
||||
"vfmadd231ps 56*4(%%rsi), %%ymm0, %%ymm15\n\t" // multiply a and c and add to temp
|
||||
|
||||
"addq $4 , %%rdi \n\t" // increment pointer of c
|
||||
"leaq (%%rsi, %%rcx, 4), %%rsi \n\t" // add lda to pointer of a
|
||||
|
||||
"dec %%rax \n\t" // n = n -1
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"vmulps %%ymm8 , %%ymm1, %%ymm8 \n\t" // scale by alpha
|
||||
"vmulps %%ymm9 , %%ymm1, %%ymm9 \n\t" // scale by alpha
|
||||
"vmulps %%ymm10, %%ymm1, %%ymm10\n\t" // scale by alpha
|
||||
"vmulps %%ymm11, %%ymm1, %%ymm11\n\t" // scale by alpha
|
||||
"vmulps %%ymm12, %%ymm1, %%ymm12\n\t" // scale by alpha
|
||||
"vmulps %%ymm13, %%ymm1, %%ymm13\n\t" // scale by alpha
|
||||
"vmulps %%ymm14, %%ymm1, %%ymm14\n\t" // scale by alpha
|
||||
"vmulps %%ymm15, %%ymm1, %%ymm15\n\t" // scale by alpha
|
||||
|
||||
"vmovups %%ymm8 , (%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm9 , 8*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm10, 16*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm11, 24*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm12, 32*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm13, 40*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm14, 48*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm15, 56*4(%%rdx) \n\t" // store temp -> y
|
||||
|
||||
:
|
||||
:
|
||||
"m" (n), // 0
|
||||
"m" (alpha), // 1
|
||||
"m" (a), // 2
|
||||
"m" (lda), // 3
|
||||
"m" (x), // 4
|
||||
"m" (y), // 5
|
||||
"m" (pre) // 6
|
||||
: "%rax", "%rcx", "%rdx", "%rsi", "%rdi", "%r8", "cc",
|
||||
"%xmm0", "%xmm1",
|
||||
"%xmm4", "%xmm5", "%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
||||
static void sgemv_kernel_32( long n, float alpha, float *a, long lda, float *x, float *y)
|
||||
{
|
||||
|
||||
|
||||
float *pre = a + lda*3;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movq %0, %%rax\n\t" // n -> rax
|
||||
"vbroadcastss %1, %%ymm1\n\t" // alpha -> ymm1
|
||||
"movq %2, %%rsi\n\t" // address of a -> rsi
|
||||
"movq %3, %%rcx\n\t" // value of lda -> rcx
|
||||
"movq %4, %%rdi\n\t" // address of x -> rdi
|
||||
"movq %5, %%rdx\n\t" // address of y -> rdx
|
||||
"movq %6, %%r8\n\t" // address for prefetch
|
||||
"prefetcht0 (%%r8)\n\t" // Prefetch
|
||||
"prefetcht0 64(%%r8)\n\t" // Prefetch
|
||||
|
||||
"vxorps %%ymm8 , %%ymm8 , %%ymm8 \n\t" // set to zero
|
||||
"vxorps %%ymm9 , %%ymm9 , %%ymm9 \n\t" // set to zero
|
||||
"vxorps %%ymm10, %%ymm10, %%ymm10\n\t" // set to zero
|
||||
"vxorps %%ymm11, %%ymm11, %%ymm11\n\t" // set to zero
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"vbroadcastss (%%rdi), %%ymm0 \n\t" // load values of c
|
||||
"nop \n\t"
|
||||
"leaq (%%r8 , %%rcx, 4), %%r8 \n\t" // add lda to pointer for prefetch
|
||||
|
||||
"prefetcht0 (%%r8)\n\t" // Prefetch
|
||||
"prefetcht0 64(%%r8)\n\t" // Prefetch
|
||||
|
||||
"vmulps 0*4(%%rsi), %%ymm0, %%ymm4 \n\t" // multiply a and c
|
||||
"vmulps 8*4(%%rsi), %%ymm0, %%ymm5 \n\t" // multiply a and c
|
||||
"vmulps 16*4(%%rsi), %%ymm0, %%ymm6 \n\t" // multiply a and c
|
||||
"vmulps 24*4(%%rsi), %%ymm0, %%ymm7 \n\t" // multiply a and c
|
||||
|
||||
"vaddps %%ymm8 , %%ymm4, %%ymm8 \n\t" // add product to temp
|
||||
"vaddps %%ymm9 , %%ymm5, %%ymm9 \n\t" // add product to temp
|
||||
"vaddps %%ymm10, %%ymm6, %%ymm10\n\t" // add product to temp
|
||||
"vaddps %%ymm11, %%ymm7, %%ymm11\n\t" // add product to temp
|
||||
|
||||
|
||||
|
||||
"addq $4 , %%rdi \n\t" // increment pointer of c
|
||||
"leaq (%%rsi, %%rcx, 4), %%rsi \n\t" // add lda to pointer of a
|
||||
|
||||
"dec %%rax \n\t" // n = n -1
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"vmulps %%ymm8 , %%ymm1, %%ymm8 \n\t" // scale by alpha
|
||||
"vmulps %%ymm9 , %%ymm1, %%ymm9 \n\t" // scale by alpha
|
||||
"vmulps %%ymm10, %%ymm1, %%ymm10\n\t" // scale by alpha
|
||||
"vmulps %%ymm11, %%ymm1, %%ymm11\n\t" // scale by alpha
|
||||
|
||||
"vmovups %%ymm8 , (%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm9 , 8*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm10, 16*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm11, 24*4(%%rdx) \n\t" // store temp -> y
|
||||
|
||||
:
|
||||
:
|
||||
"m" (n), // 0
|
||||
"m" (alpha), // 1
|
||||
"m" (a), // 2
|
||||
"m" (lda), // 3
|
||||
"m" (x), // 4
|
||||
"m" (y), // 5
|
||||
"m" (pre) // 6
|
||||
: "%rax", "%rcx", "%rdx", "%rsi", "%rdi", "%r8", "cc",
|
||||
"%xmm0", "%xmm1",
|
||||
"%xmm4", "%xmm5", "%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"memory"
|
||||
);
|
||||
|
||||
|
||||
|
||||
}
|
||||
|
||||
static void sgemv_kernel_16( long n, float alpha, float *a, long lda, float *x, float *y)
|
||||
{
|
||||
|
||||
float *pre = a + lda*3;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movq %0, %%rax\n\t" // n -> rax
|
||||
"vbroadcastss %1, %%ymm1\n\t" // alpha -> ymm1
|
||||
"movq %2, %%rsi\n\t" // address of a -> rsi
|
||||
"movq %3, %%rcx\n\t" // value of lda -> rcx
|
||||
"movq %4, %%rdi\n\t" // address of x -> rdi
|
||||
"movq %5, %%rdx\n\t" // address of y -> rdx
|
||||
"movq %6, %%r8\n\t" // address for prefetch
|
||||
"prefetcht0 (%%r8)\n\t" // Prefetch
|
||||
"prefetcht0 64(%%r8)\n\t" // Prefetch
|
||||
|
||||
"vxorps %%ymm8 , %%ymm8 , %%ymm8 \n\t" // set to zero
|
||||
"vxorps %%ymm9 , %%ymm9 , %%ymm9 \n\t" // set to zero
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"vbroadcastss (%%rdi), %%ymm0 \n\t" // load values of c
|
||||
"nop \n\t"
|
||||
"leaq (%%r8 , %%rcx, 4), %%r8 \n\t" // add lda to pointer for prefetch
|
||||
|
||||
"prefetcht0 (%%r8)\n\t" // Prefetch
|
||||
|
||||
"vmulps 0*4(%%rsi), %%ymm0, %%ymm4 \n\t" // multiply a and c
|
||||
"vmulps 8*4(%%rsi), %%ymm0, %%ymm5 \n\t" // multiply a and c
|
||||
|
||||
"vaddps %%ymm8 , %%ymm4, %%ymm8 \n\t" // add product to temp
|
||||
"vaddps %%ymm9 , %%ymm5, %%ymm9 \n\t" // add product to temp
|
||||
|
||||
"addq $4 , %%rdi \n\t" // increment pointer of c
|
||||
"leaq (%%rsi, %%rcx, 4), %%rsi \n\t" // add lda to pointer of a
|
||||
|
||||
"dec %%rax \n\t" // n = n -1
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"vmulps %%ymm8 , %%ymm1, %%ymm8 \n\t" // scale by alpha
|
||||
"vmulps %%ymm9 , %%ymm1, %%ymm9 \n\t" // scale by alpha
|
||||
|
||||
"vmovups %%ymm8 , (%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm9 , 8*4(%%rdx) \n\t" // store temp -> y
|
||||
|
||||
:
|
||||
:
|
||||
"m" (n), // 0
|
||||
"m" (alpha), // 1
|
||||
"m" (a), // 2
|
||||
"m" (lda), // 3
|
||||
"m" (x), // 4
|
||||
"m" (y), // 5
|
||||
"m" (pre) // 6
|
||||
: "%rax", "%rcx", "%rdx", "%rsi", "%rdi", "%r8", "cc",
|
||||
"%xmm0", "%xmm1",
|
||||
"%xmm4", "%xmm5", "%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"memory"
|
||||
);
|
||||
|
||||
|
||||
}
|
||||
|
||||
|
||||
static void sgemv_kernel_8( long n, float alpha, float *a, long lda, float *x, float *y)
|
||||
{
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movq %0, %%rax\n\t" // n -> rax
|
||||
"vbroadcastss %1, %%ymm1\n\t" // alpha -> ymm1
|
||||
"movq %2, %%rsi\n\t" // address of a -> rsi
|
||||
"movq %3, %%rcx\n\t" // value of lda -> rcx
|
||||
"movq %4, %%rdi\n\t" // address of x -> rdi
|
||||
"movq %5, %%rdx\n\t" // address of y -> rdx
|
||||
|
||||
"vxorps %%ymm8 , %%ymm8 , %%ymm8 \n\t" // set to zero
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"vbroadcastss (%%rdi), %%ymm0 \n\t" // load values of c
|
||||
|
||||
"vmulps 0*4(%%rsi), %%ymm0, %%ymm4 \n\t" // multiply a and c
|
||||
"vaddps %%ymm8 , %%ymm4, %%ymm8 \n\t" // add product to temp
|
||||
|
||||
"addq $4 , %%rdi \n\t" // increment pointer of c
|
||||
"leaq (%%rsi, %%rcx, 4), %%rsi \n\t" // add lda to pointer of a
|
||||
|
||||
"dec %%rax \n\t" // n = n -1
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"vmulps %%ymm8 , %%ymm1, %%ymm8 \n\t" // scale by alpha
|
||||
"vmovups %%ymm8 , (%%rdx) \n\t" // store temp -> y
|
||||
|
||||
:
|
||||
:
|
||||
"m" (n), // 0
|
||||
"m" (alpha), // 1
|
||||
"m" (a), // 2
|
||||
"m" (lda), // 3
|
||||
"m" (x), // 4
|
||||
"m" (y) // 5
|
||||
: "%rax", "%rcx", "%rdx", "%rsi", "%rdi", "%r8", "cc",
|
||||
"%xmm0", "%xmm1",
|
||||
"%xmm4", "%xmm5", "%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"memory"
|
||||
);
|
||||
|
||||
|
||||
}
|
||||
|
||||
|
||||
static void sgemv_kernel_4( long n, float alpha, float *a, long lda, float *x, float *y)
|
||||
{
|
||||
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movq %0, %%rax\n\t" // n -> rax
|
||||
"vbroadcastss %1, %%xmm1\n\t" // alpha -> xmm1
|
||||
"movq %2, %%rsi\n\t" // address of a -> rsi
|
||||
"movq %3, %%rcx\n\t" // value of lda -> rcx
|
||||
"movq %4, %%rdi\n\t" // address of x -> rdi
|
||||
"movq %5, %%rdx\n\t" // address of y -> rdx
|
||||
|
||||
"vxorps %%xmm12, %%xmm12, %%xmm12\n\t" // set to zero
|
||||
|
||||
".L01LOOP%=: \n\t"
|
||||
"vbroadcastss (%%rdi), %%xmm0 \n\t" // load values of c
|
||||
|
||||
"vmulps 0*4(%%rsi), %%xmm0, %%xmm4 \n\t" // multiply a and c
|
||||
"vaddps %%xmm12, %%xmm4, %%xmm12 \n\t" // add product to temp
|
||||
|
||||
"addq $4 , %%rdi \n\t" // increment pointer of c
|
||||
"leaq (%%rsi, %%rcx, 4), %%rsi \n\t" // add lda to pointer of a
|
||||
|
||||
"dec %%rax \n\t" // n = n -1
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"vmulps %%xmm12, %%xmm1, %%xmm12\n\t" // scale by alpha
|
||||
|
||||
"vmovups %%xmm12, (%%rdx) \n\t" // store temp -> y
|
||||
|
||||
:
|
||||
:
|
||||
"m" (n), // 0
|
||||
"m" (alpha), // 1
|
||||
"m" (a), // 2
|
||||
"m" (lda), // 3
|
||||
"m" (x), // 4
|
||||
"m" (y) // 5
|
||||
: "%rax", "%rcx", "%rdx", "%rsi", "%rdi", "%r8", "cc",
|
||||
"%xmm0", "%xmm1", "%xmm4",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
static void sgemv_kernel_2( long n, float alpha, float *a, long lda, float *x, float *y)
|
||||
{
|
||||
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movq %0, %%rax\n\t" // n -> rax
|
||||
"vmovss %1, %%xmm1\n\t" // alpha -> xmm1
|
||||
"movq %2, %%rsi\n\t" // address of a -> rsi
|
||||
"movq %3, %%rcx\n\t" // value of lda -> rcx
|
||||
"movq %4, %%rdi\n\t" // address of x -> rdi
|
||||
"movq %5, %%rdx\n\t" // address of y -> rdx
|
||||
|
||||
"vxorps %%xmm12, %%xmm12, %%xmm12\n\t" // set to zero
|
||||
"vxorps %%xmm13, %%xmm13, %%xmm13\n\t" // set to zero
|
||||
|
||||
".L01LOOP%=: \n\t"
|
||||
"vmovss (%%rdi), %%xmm0 \n\t" // load values of c
|
||||
|
||||
"vmulps 0*4(%%rsi), %%xmm0, %%xmm4 \n\t" // multiply a and c
|
||||
"vmulps 1*4(%%rsi), %%xmm0, %%xmm5 \n\t" // multiply a and c
|
||||
|
||||
"vaddps %%xmm12, %%xmm4, %%xmm12 \n\t" // add product to temp
|
||||
"vaddps %%xmm13, %%xmm5, %%xmm13 \n\t" // add product to temp
|
||||
|
||||
"addq $4 , %%rdi \n\t" // increment pointer of c
|
||||
"leaq (%%rsi, %%rcx, 4), %%rsi \n\t" // add lda to pointer of a
|
||||
|
||||
"dec %%rax \n\t" // n = n -1
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"vmulss %%xmm12, %%xmm1, %%xmm12\n\t" // scale by alpha
|
||||
"vmulss %%xmm13, %%xmm1, %%xmm13\n\t" // scale by alpha
|
||||
|
||||
"vmovss %%xmm12, (%%rdx) \n\t" // store temp -> y
|
||||
"vmovss %%xmm13, 4(%%rdx) \n\t" // store temp -> y
|
||||
|
||||
:
|
||||
:
|
||||
"m" (n), // 0
|
||||
"m" (alpha), // 1
|
||||
"m" (a), // 2
|
||||
"m" (lda), // 3
|
||||
"m" (x), // 4
|
||||
"m" (y) // 5
|
||||
: "%rax", "%rcx", "%rdx", "%rsi", "%rdi", "%r8", "cc",
|
||||
"%xmm0", "%xmm1", "%xmm4", "%xmm5",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
||||
static void sgemv_kernel_1( long n, float alpha, float *a, long lda, float *x, float *y)
|
||||
{
|
||||
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movq %0, %%rax\n\t" // n -> rax
|
||||
"vmovss %1, %%xmm1\n\t" // alpha -> xmm1
|
||||
"movq %2, %%rsi\n\t" // address of a -> rsi
|
||||
"movq %3, %%rcx\n\t" // value of lda -> rcx
|
||||
"movq %4, %%rdi\n\t" // address of x -> rdi
|
||||
"movq %5, %%rdx\n\t" // address of y -> rdx
|
||||
|
||||
"vxorps %%xmm12, %%xmm12, %%xmm12\n\t" // set to zero
|
||||
|
||||
".L01LOOP%=: \n\t"
|
||||
"vmovss (%%rdi), %%xmm0 \n\t" // load values of c
|
||||
"addq $4 , %%rdi \n\t" // increment pointer of c
|
||||
|
||||
"vmulss 0*4(%%rsi), %%xmm0, %%xmm4 \n\t" // multiply a and c
|
||||
"vaddss %%xmm12, %%xmm4, %%xmm12 \n\t" // add product to temp
|
||||
|
||||
"leaq (%%rsi, %%rcx, 4), %%rsi \n\t" // add lda to pointer of a
|
||||
|
||||
"dec %%rax \n\t" // n = n -1
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"vmulss %%xmm12, %%xmm1, %%xmm12\n\t" // scale by alpha
|
||||
|
||||
"vmovss %%xmm12, (%%rdx) \n\t" // store temp -> y
|
||||
|
||||
:
|
||||
:
|
||||
"m" (n), // 0
|
||||
"m" (alpha), // 1
|
||||
"m" (a), // 2
|
||||
"m" (lda), // 3
|
||||
"m" (x), // 4
|
||||
"m" (y) // 5
|
||||
: "%rax", "%rcx", "%rdx", "%rsi", "%rdi", "%r8", "cc",
|
||||
"%xmm0", "%xmm1", "%xmm4",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
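
The removed `sgemv_kernel_64` through `sgemv_kernel_1` routines all follow one pattern, which the scalar sketch below tries to capture. It is inferred from the assembly and is not code from the commit. The function name and the extra `m` parameter (standing for the row count in each kernel's suffix: 64, 32, 16, 8, 4, 2, or 1) are hypothetical. Unlike the newer kernels, these overwrite `y` instead of accumulating into it. The assembly also requires `n >= 1`, because its `dec`/`jnz` loop always executes at least once, whereas this sketch simply yields zeros for `n == 0`.

```c
/* Hypothetical scalar reference for the removed sgemv_kernel_<m> family
 * (illustration only). For m consecutive rows of a column-major matrix:
 *   y[i] = alpha * sum_{k < n} a[k * lda + i] * x[k]                     */
static void sgemv_kernel_m_ref(long m, long n, float alpha, const float *a,
                               long lda, const float *x, float *y)
{
    for (long i = 0; i < m; i++) {
        float temp = 0.0f;
        for (long k = 0; k < n; k++)
            temp += a[k * lda + i] * x[k]; /* walk along row i, one column per step */
        y[i] = alpha * temp;               /* overwrite, do not accumulate */
    }
}
```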
|
||||
|
||||
|
||||
|
|
@@ -1,144 +0,0 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#define HAVE_KERNEL_16x4 1
|
||||
static void sgemv_kernel_16x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y) __attribute__ ((noinline));
|
||||
|
||||
static void sgemv_kernel_16x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movss (%2), %%xmm12 \n\t" // x0
|
||||
"movss 4(%2), %%xmm13 \n\t" // x1
|
||||
"movss 8(%2), %%xmm14 \n\t" // x2
|
||||
"movss 12(%2), %%xmm15 \n\t" // x3
|
||||
"shufps $0, %%xmm12, %%xmm12\n\t"
|
||||
"shufps $0, %%xmm13, %%xmm13\n\t"
|
||||
"shufps $0, %%xmm14, %%xmm14\n\t"
|
||||
"shufps $0, %%xmm15, %%xmm15\n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"movups (%3,%0,4), %%xmm4 \n\t" // 4 * y
|
||||
"movups 16(%3,%0,4), %%xmm5 \n\t" // 4 * y
|
||||
"movups 32(%3,%0,4), %%xmm6 \n\t" // 4 * y
|
||||
"movups 48(%3,%0,4), %%xmm7 \n\t" // 4 * y
|
||||
|
||||
"prefetcht0 192(%4,%0,4) \n\t"
|
||||
|
||||
"movups (%4,%0,4), %%xmm8 \n\t"
|
||||
"movups 16(%4,%0,4), %%xmm9 \n\t"
|
||||
"movups 32(%4,%0,4), %%xmm10 \n\t"
|
||||
"movups 48(%4,%0,4), %%xmm11 \n\t"
|
||||
"mulps %%xmm12, %%xmm8 \n\t"
|
||||
"addps %%xmm8 , %%xmm4 \n\t"
|
||||
"mulps %%xmm12, %%xmm9 \n\t"
|
||||
"addps %%xmm9 , %%xmm5 \n\t"
|
||||
"mulps %%xmm12, %%xmm10 \n\t"
|
||||
"addps %%xmm10, %%xmm6 \n\t"
|
||||
"mulps %%xmm12, %%xmm11 \n\t"
|
||||
"addps %%xmm11, %%xmm7 \n\t"
|
||||
|
||||
"prefetcht0 192(%5,%0,4) \n\t"
|
||||
|
||||
"movups (%5,%0,4), %%xmm8 \n\t"
|
||||
"movups 16(%5,%0,4), %%xmm9 \n\t"
|
||||
"movups 32(%5,%0,4), %%xmm10 \n\t"
|
||||
"movups 48(%5,%0,4), %%xmm11 \n\t"
|
||||
"mulps %%xmm13, %%xmm8 \n\t"
|
||||
"addps %%xmm8 , %%xmm4 \n\t"
|
||||
"mulps %%xmm13, %%xmm9 \n\t"
|
||||
"addps %%xmm9 , %%xmm5 \n\t"
|
||||
"mulps %%xmm13, %%xmm10 \n\t"
|
||||
"addps %%xmm10, %%xmm6 \n\t"
|
||||
"mulps %%xmm13, %%xmm11 \n\t"
|
||||
"addps %%xmm11, %%xmm7 \n\t"
|
||||
|
||||
"prefetcht0 192(%6,%0,4) \n\t"
|
||||
|
||||
"movups (%6,%0,4), %%xmm8 \n\t"
|
||||
"movups 16(%6,%0,4), %%xmm9 \n\t"
|
||||
"movups 32(%6,%0,4), %%xmm10 \n\t"
|
||||
"movups 48(%6,%0,4), %%xmm11 \n\t"
|
||||
"mulps %%xmm14, %%xmm8 \n\t"
|
||||
"addps %%xmm8 , %%xmm4 \n\t"
|
||||
"mulps %%xmm14, %%xmm9 \n\t"
|
||||
"addps %%xmm9 , %%xmm5 \n\t"
|
||||
"mulps %%xmm14, %%xmm10 \n\t"
|
||||
"addps %%xmm10, %%xmm6 \n\t"
|
||||
"mulps %%xmm14, %%xmm11 \n\t"
|
||||
"addps %%xmm11, %%xmm7 \n\t"
|
||||
|
||||
"prefetcht0 192(%7,%0,4) \n\t"
|
||||
|
||||
"movups (%7,%0,4), %%xmm8 \n\t"
|
||||
"movups 16(%7,%0,4), %%xmm9 \n\t"
|
||||
"movups 32(%7,%0,4), %%xmm10 \n\t"
|
||||
"movups 48(%7,%0,4), %%xmm11 \n\t"
|
||||
"mulps %%xmm15, %%xmm8 \n\t"
|
||||
"addps %%xmm8 , %%xmm4 \n\t"
|
||||
"mulps %%xmm15, %%xmm9 \n\t"
|
||||
"addps %%xmm9 , %%xmm5 \n\t"
|
||||
"mulps %%xmm15, %%xmm10 \n\t"
|
||||
"addps %%xmm10, %%xmm6 \n\t"
|
||||
"mulps %%xmm15, %%xmm11 \n\t"
|
||||
"addps %%xmm11, %%xmm7 \n\t"
|
||||
|
||||
|
||||
"movups %%xmm4, (%3,%0,4) \n\t" // 4 * y
|
||||
"movups %%xmm5, 16(%3,%0,4) \n\t" // 4 * y
|
||||
"movups %%xmm6, 32(%3,%0,4) \n\t" // 4 * y
|
||||
"movups %%xmm7, 48(%3,%0,4) \n\t" // 4 * y
|
||||
|
||||
"addq $16, %0 \n\t"
|
||||
"subq $16, %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
:
|
||||
:
|
||||
"r" (i), // 0
|
||||
"r" (n), // 1
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap[0]), // 4
|
||||
"r" (ap[1]), // 5
|
||||
"r" (ap[2]), // 6
|
||||
"r" (ap[3]) // 7
|
||||
: "cc",
|
||||
"%xmm4", "%xmm5",
|
||||
"%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
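
The removed SSE `sgemv_kernel_16x4` takes no `alpha` argument and adds four scaled columns directly into `y`. A scalar reading of it is sketched below, assuming `n` is a positive multiple of 16 as the 16-element loop step in the assembly implies. The name `sgemv_kernel_16x4_ref` is hypothetical, and plain `long`/`float` stand in for `BLASLONG`/`FLOAT`.

```c
/* Hypothetical scalar reference for sgemv_kernel_16x4 (illustration only).
 * Computes y[i] += ap[0][i]*x[0] + ap[1][i]*x[1] + ap[2][i]*x[2] + ap[3][i]*x[3]. */
static void sgemv_kernel_16x4_ref(long n, float **ap, const float *x, float *y)
{
    for (long i = 0; i < n; i++) {
        float acc = y[i];            /* start from the current y value */
        for (int j = 0; j < 4; j++)
            acc += ap[j][i] * x[j];  /* same column order as the assembly */
        y[i] = acc;
    }
}
```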
|
||||
|
||||
|
||||
|
|
@@ -0,0 +1,204 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
|
||||
#define HAVE_KERNEL_4x8 1
|
||||
static void sgemv_kernel_4x8( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, BLASLONG lda4, FLOAT *alpha) __attribute__ ((noinline));
|
||||
|
||||
static void sgemv_kernel_4x8( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, BLASLONG lda4, FLOAT *alpha)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movss (%2), %%xmm12 \n\t" // x0
|
||||
"movss 4(%2), %%xmm13 \n\t" // x1
|
||||
"movss 8(%2), %%xmm14 \n\t" // x2
|
||||
"movss 12(%2), %%xmm15 \n\t" // x3
|
||||
"shufps $0, %%xmm12, %%xmm12\n\t"
|
||||
"shufps $0, %%xmm13, %%xmm13\n\t"
|
||||
"shufps $0, %%xmm14, %%xmm14\n\t"
|
||||
"shufps $0, %%xmm15, %%xmm15\n\t"
|
||||
|
||||
"movss 16(%2), %%xmm0 \n\t" // x4
|
||||
"movss 20(%2), %%xmm1 \n\t" // x5
|
||||
"movss 24(%2), %%xmm2 \n\t" // x6
|
||||
"movss 28(%2), %%xmm3 \n\t" // x7
|
||||
"shufps $0, %%xmm0 , %%xmm0 \n\t"
|
||||
"shufps $0, %%xmm1 , %%xmm1 \n\t"
|
||||
"shufps $0, %%xmm2 , %%xmm2 \n\t"
|
||||
"shufps $0, %%xmm3 , %%xmm3 \n\t"
|
||||
|
||||
"movss (%9), %%xmm6 \n\t" // alpha
|
||||
"shufps $0, %%xmm6 , %%xmm6 \n\t"
|
||||
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"xorps %%xmm4 , %%xmm4 \n\t"
|
||||
"xorps %%xmm5 , %%xmm5 \n\t"
|
||||
"movups (%3,%0,4), %%xmm7 \n\t" // 4 * y
|
||||
|
||||
".align 2 \n\t"
|
||||
"movups (%4,%0,4), %%xmm8 \n\t"
|
||||
"movups (%5,%0,4), %%xmm9 \n\t"
|
||||
"movups (%6,%0,4), %%xmm10 \n\t"
|
||||
"movups (%7,%0,4), %%xmm11 \n\t"
|
||||
".align 2 \n\t"
|
||||
"mulps %%xmm12, %%xmm8 \n\t"
|
||||
"mulps %%xmm13, %%xmm9 \n\t"
|
||||
"mulps %%xmm14, %%xmm10 \n\t"
|
||||
"mulps %%xmm15, %%xmm11 \n\t"
|
||||
"addps %%xmm8 , %%xmm4 \n\t"
|
||||
"addps %%xmm9 , %%xmm5 \n\t"
|
||||
"addps %%xmm10, %%xmm4 \n\t"
|
||||
"addps %%xmm11, %%xmm5 \n\t"
|
||||
|
||||
"movups (%4,%8,4), %%xmm8 \n\t"
|
||||
"movups (%5,%8,4), %%xmm9 \n\t"
|
||||
"movups (%6,%8,4), %%xmm10 \n\t"
|
||||
"movups (%7,%8,4), %%xmm11 \n\t"
|
||||
".align 2 \n\t"
|
||||
"mulps %%xmm0 , %%xmm8 \n\t"
|
||||
"mulps %%xmm1 , %%xmm9 \n\t"
|
||||
"mulps %%xmm2 , %%xmm10 \n\t"
|
||||
"mulps %%xmm3 , %%xmm11 \n\t"
|
||||
"addps %%xmm8 , %%xmm4 \n\t"
|
||||
"addps %%xmm9 , %%xmm5 \n\t"
|
||||
"addps %%xmm10, %%xmm4 \n\t"
|
||||
"addps %%xmm11, %%xmm5 \n\t"
|
||||
|
||||
"addq $4 , %8 \n\t"
|
||||
"addps %%xmm5 , %%xmm4 \n\t"
|
||||
"addq $4 , %0 \n\t"
|
||||
"mulps %%xmm6 , %%xmm4 \n\t"
|
||||
"subq $4 , %1 \n\t"
|
||||
"addps %%xmm4 , %%xmm7 \n\t"
|
||||
|
||||
"movups %%xmm7 , -16(%3,%0,4) \n\t" // 4 * y
|
||||
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
:
|
||||
:
|
||||
"r" (i), // 0
|
||||
"r" (n), // 1
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap[0]), // 4
|
||||
"r" (ap[1]), // 5
|
||||
"r" (ap[2]), // 6
|
||||
"r" (ap[3]), // 7
|
||||
"r" (lda4), // 8
|
||||
"r" (alpha) // 9
|
||||
: "cc",
|
||||
"%xmm0", "%xmm1",
|
||||
"%xmm2", "%xmm3",
|
||||
"%xmm4", "%xmm5",
|
||||
"%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
||||
|
||||
#define HAVE_KERNEL_4x4 1
|
||||
static void sgemv_kernel_4x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, FLOAT *alpha) __attribute__ ((noinline));
|
||||
|
||||
static void sgemv_kernel_4x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, FLOAT *alpha)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movss (%2), %%xmm12 \n\t" // x0
|
||||
"movss 4(%2), %%xmm13 \n\t" // x1
|
||||
"movss 8(%2), %%xmm14 \n\t" // x2
|
||||
"movss 12(%2), %%xmm15 \n\t" // x3
|
||||
"shufps $0, %%xmm12, %%xmm12\n\t"
|
||||
"shufps $0, %%xmm13, %%xmm13\n\t"
|
||||
"shufps $0, %%xmm14, %%xmm14\n\t"
|
||||
"shufps $0, %%xmm15, %%xmm15\n\t"
|
||||
|
||||
"movss (%8), %%xmm6 \n\t" // alpha
|
||||
"shufps $0, %%xmm6 , %%xmm6 \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"xorps %%xmm4 , %%xmm4 \n\t"
|
||||
"movups (%3,%0,4), %%xmm7 \n\t" // 4 * y
|
||||
|
||||
"movups (%4,%0,4), %%xmm8 \n\t"
|
||||
"movups (%5,%0,4), %%xmm9 \n\t"
|
||||
"movups (%6,%0,4), %%xmm10 \n\t"
|
||||
"movups (%7,%0,4), %%xmm11 \n\t"
|
||||
"mulps %%xmm12, %%xmm8 \n\t"
|
||||
"mulps %%xmm13, %%xmm9 \n\t"
|
||||
"mulps %%xmm14, %%xmm10 \n\t"
|
||||
"mulps %%xmm15, %%xmm11 \n\t"
|
||||
"addps %%xmm8 , %%xmm4 \n\t"
|
||||
"addq $4 , %0 \n\t"
|
||||
"addps %%xmm9 , %%xmm4 \n\t"
|
||||
"subq $4 , %1 \n\t"
|
||||
"addps %%xmm10 , %%xmm4 \n\t"
|
||||
"addps %%xmm4 , %%xmm11 \n\t"
|
||||
|
||||
"mulps %%xmm6 , %%xmm11 \n\t"
|
||||
"addps %%xmm7 , %%xmm11 \n\t"
|
||||
"movups %%xmm11, -16(%3,%0,4) \n\t" // 4 * y
|
||||
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
	:
|
||||
	  "+r" (i),	// 0
|
||||
	  "+r" (n)	// 1
|
||||
	:
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap[0]), // 4
|
||||
"r" (ap[1]), // 5
|
||||
"r" (ap[2]), // 6
|
||||
"r" (ap[3]), // 7
|
||||
"r" (alpha) // 8
|
||||
: "cc",
|
||||
"%xmm4", "%xmm5",
|
||||
"%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
|
@ -0,0 +1,370 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
|
||||
|
||||
#define HAVE_KERNEL_4x8 1
|
||||
static void sgemv_kernel_4x8( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, BLASLONG lda4, FLOAT *alpha) __attribute__ ((noinline));
|
||||
|
||||
static void sgemv_kernel_4x8( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, BLASLONG lda4, FLOAT *alpha)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"vzeroupper \n\t"
|
||||
"vbroadcastss (%2), %%ymm12 \n\t" // x0
|
||||
"vbroadcastss 4(%2), %%ymm13 \n\t" // x1
|
||||
"vbroadcastss 8(%2), %%ymm14 \n\t" // x2
|
||||
"vbroadcastss 12(%2), %%ymm15 \n\t" // x3
|
||||
"vbroadcastss 16(%2), %%ymm0 \n\t" // x4
|
||||
"vbroadcastss 20(%2), %%ymm1 \n\t" // x5
|
||||
"vbroadcastss 24(%2), %%ymm2 \n\t" // x6
|
||||
"vbroadcastss 28(%2), %%ymm3 \n\t" // x7
|
||||
|
||||
"vbroadcastss (%9), %%ymm6 \n\t" // alpha
|
||||
|
||||
"testq $0x04, %1 \n\t"
|
||||
"jz .L08LABEL%= \n\t"
|
||||
|
||||
"vxorps %%xmm4 , %%xmm4 , %%xmm4 \n\t"
|
||||
"vxorps %%xmm5 , %%xmm5 , %%xmm5 \n\t"
|
||||
"vmovups (%3,%0,4), %%xmm7 \n\t" // 4 * y
|
||||
|
||||
"vmulps (%4,%0,4), %%xmm12, %%xmm8 \n\t"
|
||||
"vmulps (%5,%0,4), %%xmm13, %%xmm10 \n\t"
|
||||
"vmulps (%6,%0,4), %%xmm14, %%xmm9 \n\t"
|
||||
"vmulps (%7,%0,4), %%xmm15, %%xmm11 \n\t"
|
||||
"vaddps %%xmm4, %%xmm8 , %%xmm4 \n\t"
|
||||
"vaddps %%xmm5, %%xmm10, %%xmm5 \n\t"
|
||||
"vaddps %%xmm4, %%xmm9 , %%xmm4 \n\t"
|
||||
"vaddps %%xmm5, %%xmm11, %%xmm5 \n\t"
|
||||
|
||||
"vmulps (%4,%8,4), %%xmm0 , %%xmm8 \n\t"
|
||||
"vmulps (%5,%8,4), %%xmm1 , %%xmm10 \n\t"
|
||||
"vmulps (%6,%8,4), %%xmm2 , %%xmm9 \n\t"
|
||||
"vmulps (%7,%8,4), %%xmm3 , %%xmm11 \n\t"
|
||||
"vaddps %%xmm4, %%xmm8 , %%xmm4 \n\t"
|
||||
"vaddps %%xmm5, %%xmm10, %%xmm5 \n\t"
|
||||
"vaddps %%xmm4, %%xmm9 , %%xmm4 \n\t"
|
||||
"vaddps %%xmm5, %%xmm11, %%xmm5 \n\t"
|
||||
|
||||
"vaddps %%xmm5, %%xmm4 , %%xmm4 \n\t"
|
||||
"vmulps %%xmm6, %%xmm4 , %%xmm5 \n\t"
|
||||
"vaddps %%xmm5, %%xmm7 , %%xmm5 \n\t"
|
||||
|
||||
"vmovups %%xmm5, (%3,%0,4) \n\t" // 4 * y
|
||||
|
||||
"addq $4, %8 \n\t"
|
||||
"addq $4, %0 \n\t"
|
||||
"subq $4, %1 \n\t"
|
||||
|
||||
".L08LABEL%=: \n\t"
|
||||
|
||||
"testq $0x08, %1 \n\t"
|
||||
"jz .L16LABEL%= \n\t"
|
||||
|
||||
"vxorps %%ymm4 , %%ymm4 , %%ymm4 \n\t"
|
||||
"vxorps %%ymm5 , %%ymm5 , %%ymm5 \n\t"
|
||||
"vmovups (%3,%0,4), %%ymm7 \n\t" // 8 * y
|
||||
|
||||
"vmulps (%4,%0,4), %%ymm12, %%ymm8 \n\t"
|
||||
"vmulps (%5,%0,4), %%ymm13, %%ymm10 \n\t"
|
||||
"vmulps (%6,%0,4), %%ymm14, %%ymm9 \n\t"
|
||||
"vmulps (%7,%0,4), %%ymm15, %%ymm11 \n\t"
|
||||
"vaddps %%ymm4, %%ymm8 , %%ymm4 \n\t"
|
||||
"vaddps %%ymm5, %%ymm10, %%ymm5 \n\t"
|
||||
"vaddps %%ymm4, %%ymm9 , %%ymm4 \n\t"
|
||||
"vaddps %%ymm5, %%ymm11, %%ymm5 \n\t"
|
||||
|
||||
"vmulps (%4,%8,4), %%ymm0 , %%ymm8 \n\t"
|
||||
"vmulps (%5,%8,4), %%ymm1 , %%ymm10 \n\t"
|
||||
"vmulps (%6,%8,4), %%ymm2 , %%ymm9 \n\t"
|
||||
"vmulps (%7,%8,4), %%ymm3 , %%ymm11 \n\t"
|
||||
"vaddps %%ymm4, %%ymm8 , %%ymm4 \n\t"
|
||||
"vaddps %%ymm5, %%ymm10, %%ymm5 \n\t"
|
||||
"vaddps %%ymm4, %%ymm9 , %%ymm4 \n\t"
|
||||
"vaddps %%ymm5, %%ymm11, %%ymm5 \n\t"
|
||||
|
||||
"vaddps %%ymm5, %%ymm4 , %%ymm4 \n\t"
|
||||
"vmulps %%ymm6, %%ymm4 , %%ymm5 \n\t"
|
||||
"vaddps %%ymm5, %%ymm7 , %%ymm5 \n\t"
|
||||
|
||||
"vmovups %%ymm5, (%3,%0,4) \n\t" // 8 * y
|
||||
|
||||
"addq $8, %8 \n\t"
|
||||
"addq $8, %0 \n\t"
|
||||
"subq $8, %1 \n\t"
|
||||
|
||||
|
||||
".L16LABEL%=: \n\t"
|
||||
|
||||
"cmpq $0, %1 \n\t"
|
||||
"je .L16END%= \n\t"
|
||||
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"vxorps %%ymm4 , %%ymm4 , %%ymm4 \n\t"
|
||||
"vxorps %%ymm5 , %%ymm5 , %%ymm5 \n\t"
|
||||
|
||||
"prefetcht0 192(%4,%0,4) \n\t"
|
||||
"vmulps (%4,%0,4), %%ymm12, %%ymm8 \n\t"
|
||||
"vmulps 32(%4,%0,4), %%ymm12, %%ymm9 \n\t"
|
||||
"prefetcht0 192(%5,%0,4) \n\t"
|
||||
"vmulps (%5,%0,4), %%ymm13, %%ymm10 \n\t"
|
||||
"vmulps 32(%5,%0,4), %%ymm13, %%ymm11 \n\t"
|
||||
"vaddps %%ymm4, %%ymm8 , %%ymm4 \n\t"
|
||||
"vaddps %%ymm5, %%ymm9 , %%ymm5 \n\t"
|
||||
"vaddps %%ymm4, %%ymm10, %%ymm4 \n\t"
|
||||
"vaddps %%ymm5, %%ymm11, %%ymm5 \n\t"
|
||||
|
||||
"prefetcht0 192(%6,%0,4) \n\t"
|
||||
"vmulps (%6,%0,4), %%ymm14, %%ymm8 \n\t"
|
||||
"vmulps 32(%6,%0,4), %%ymm14, %%ymm9 \n\t"
|
||||
"prefetcht0 192(%7,%0,4) \n\t"
|
||||
"vmulps (%7,%0,4), %%ymm15, %%ymm10 \n\t"
|
||||
"vmulps 32(%7,%0,4), %%ymm15, %%ymm11 \n\t"
|
||||
"vaddps %%ymm4, %%ymm8 , %%ymm4 \n\t"
|
||||
"vaddps %%ymm5, %%ymm9 , %%ymm5 \n\t"
|
||||
"vaddps %%ymm4, %%ymm10, %%ymm4 \n\t"
|
||||
"vaddps %%ymm5, %%ymm11, %%ymm5 \n\t"
|
||||
|
||||
"prefetcht0 192(%4,%8,4) \n\t"
|
||||
"vmulps (%4,%8,4), %%ymm0 , %%ymm8 \n\t"
|
||||
"vmulps 32(%4,%8,4), %%ymm0 , %%ymm9 \n\t"
|
||||
"prefetcht0 192(%5,%8,4) \n\t"
|
||||
"vmulps (%5,%8,4), %%ymm1 , %%ymm10 \n\t"
|
||||
"vmulps 32(%5,%8,4), %%ymm1 , %%ymm11 \n\t"
|
||||
"vaddps %%ymm4, %%ymm8 , %%ymm4 \n\t"
|
||||
"vaddps %%ymm5, %%ymm9 , %%ymm5 \n\t"
|
||||
"vaddps %%ymm4, %%ymm10, %%ymm4 \n\t"
|
||||
"vaddps %%ymm5, %%ymm11, %%ymm5 \n\t"
|
||||
|
||||
"prefetcht0 192(%6,%8,4) \n\t"
|
||||
"vmulps (%6,%8,4), %%ymm2 , %%ymm8 \n\t"
|
||||
"vmulps 32(%6,%8,4), %%ymm2 , %%ymm9 \n\t"
|
||||
"prefetcht0 192(%7,%8,4) \n\t"
|
||||
"vmulps (%7,%8,4), %%ymm3 , %%ymm10 \n\t"
|
||||
"vmulps 32(%7,%8,4), %%ymm3 , %%ymm11 \n\t"
|
||||
"vaddps %%ymm4, %%ymm8 , %%ymm4 \n\t"
|
||||
"vaddps %%ymm5, %%ymm9 , %%ymm5 \n\t"
|
||||
"vaddps %%ymm4, %%ymm10, %%ymm4 \n\t"
|
||||
"vaddps %%ymm5, %%ymm11, %%ymm5 \n\t"
|
||||
|
||||
"vmulps %%ymm6, %%ymm4 , %%ymm4 \n\t"
|
||||
"vmulps %%ymm6, %%ymm5 , %%ymm5 \n\t"
|
||||
|
||||
"vaddps (%3,%0,4), %%ymm4 , %%ymm4 \n\t" // 8 * y
|
||||
"vaddps 32(%3,%0,4), %%ymm5 , %%ymm5 \n\t" // 8 * y
|
||||
|
||||
"vmovups %%ymm4, (%3,%0,4) \n\t" // 8 * y
|
||||
"vmovups %%ymm5, 32(%3,%0,4) \n\t" // 8 * y
|
||||
|
||||
"addq $16, %8 \n\t"
|
||||
"addq $16, %0 \n\t"
|
||||
"subq $16, %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
".L16END%=: \n\t"
|
||||
"vzeroupper \n\t"
|
||||
|
||||
:
|
||||
:
|
||||
"r" (i), // 0
|
||||
"r" (n), // 1
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap[0]), // 4
|
||||
"r" (ap[1]), // 5
|
||||
"r" (ap[2]), // 6
|
||||
"r" (ap[3]), // 7
|
||||
"r" (lda4), // 8
|
||||
"r" (alpha) // 9
|
||||
: "cc",
|
||||
"%xmm0", "%xmm1",
|
||||
"%xmm2", "%xmm3",
|
||||
"%xmm4", "%xmm5",
|
||||
"%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
||||
|
||||
#define HAVE_KERNEL_4x4 1
|
||||
static void sgemv_kernel_4x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, FLOAT *alpha) __attribute__ ((noinline));
|
||||
|
||||
static void sgemv_kernel_4x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y, FLOAT *alpha)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"vzeroupper \n\t"
|
||||
"vbroadcastss (%2), %%ymm12 \n\t" // x0
|
||||
"vbroadcastss 4(%2), %%ymm13 \n\t" // x1
|
||||
"vbroadcastss 8(%2), %%ymm14 \n\t" // x2
|
||||
"vbroadcastss 12(%2), %%ymm15 \n\t" // x3
|
||||
|
||||
"vbroadcastss (%8), %%ymm6 \n\t" // alpha
|
||||
|
||||
"testq $0x04, %1 \n\t"
|
||||
"jz .L08LABEL%= \n\t"
|
||||
|
||||
"vxorps %%ymm4 , %%ymm4 , %%ymm4 \n\t"
|
||||
"vxorps %%ymm5 , %%ymm5 , %%ymm5 \n\t"
|
||||
"vmovups (%3,%0,4), %%xmm7 \n\t" // 4 * y
|
||||
|
||||
"vmulps (%4,%0,4), %%xmm12, %%xmm8 \n\t"
|
||||
"vmulps (%5,%0,4), %%xmm13, %%xmm10 \n\t"
|
||||
"vmulps (%6,%0,4), %%xmm14, %%xmm9 \n\t"
|
||||
"vmulps (%7,%0,4), %%xmm15, %%xmm11 \n\t"
|
||||
"vaddps %%xmm4, %%xmm8 , %%xmm4 \n\t"
|
||||
"vaddps %%xmm5, %%xmm10, %%xmm5 \n\t"
|
||||
"vaddps %%xmm4, %%xmm9 , %%xmm4 \n\t"
|
||||
"vaddps %%xmm5, %%xmm11, %%xmm5 \n\t"
|
||||
|
||||
"vaddps %%xmm5, %%xmm4 , %%xmm4 \n\t"
|
||||
"vmulps %%xmm6, %%xmm4 , %%xmm5 \n\t"
|
||||
"vaddps %%xmm5, %%xmm7 , %%xmm5 \n\t"
|
||||
|
||||
"vmovups %%xmm5, (%3,%0,4) \n\t" // 4 * y
|
||||
|
||||
"addq $4, %0 \n\t"
|
||||
"subq $4, %1 \n\t"
|
||||
|
||||
".L08LABEL%=: \n\t"
|
||||
|
||||
"testq $0x08, %1 \n\t"
|
||||
"jz .L16LABEL%= \n\t"
|
||||
|
||||
"vxorps %%ymm4 , %%ymm4 , %%ymm4 \n\t"
|
||||
"vxorps %%ymm5 , %%ymm5 , %%ymm5 \n\t"
|
||||
"vmovups (%3,%0,4), %%ymm7 \n\t" // 8 * y
|
||||
|
||||
"vmulps (%4,%0,4), %%ymm12, %%ymm8 \n\t"
|
||||
"vmulps (%5,%0,4), %%ymm13, %%ymm10 \n\t"
|
||||
"vmulps (%6,%0,4), %%ymm14, %%ymm9 \n\t"
|
||||
"vmulps (%7,%0,4), %%ymm15, %%ymm11 \n\t"
|
||||
"vaddps %%ymm4, %%ymm8 , %%ymm4 \n\t"
|
||||
"vaddps %%ymm5, %%ymm10, %%ymm5 \n\t"
|
||||
"vaddps %%ymm4, %%ymm9 , %%ymm4 \n\t"
|
||||
"vaddps %%ymm5, %%ymm11, %%ymm5 \n\t"
|
||||
|
||||
"vaddps %%ymm5, %%ymm4 , %%ymm4 \n\t"
|
||||
"vmulps %%ymm6, %%ymm4 , %%ymm5 \n\t"
|
||||
"vaddps %%ymm5, %%ymm7 , %%ymm5 \n\t"
|
||||
|
||||
"vmovups %%ymm5, (%3,%0,4) \n\t" // 8 * y
|
||||
|
||||
"addq $8, %0 \n\t"
|
||||
"subq $8, %1 \n\t"
|
||||
|
||||
|
||||
".L16LABEL%=: \n\t"
|
||||
|
||||
"cmpq $0, %1 \n\t"
|
||||
"je .L16END%= \n\t"
|
||||
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"vxorps %%ymm4 , %%ymm4 , %%ymm4 \n\t"
|
||||
"vxorps %%ymm5 , %%ymm5 , %%ymm5 \n\t"
|
||||
"vmovups (%3,%0,4), %%ymm0 \n\t" // 8 * y
|
||||
"vmovups 32(%3,%0,4), %%ymm1 \n\t" // 8 * y
|
||||
|
||||
"prefetcht0 192(%4,%0,4) \n\t"
|
||||
"vmulps (%4,%0,4), %%ymm12, %%ymm8 \n\t"
|
||||
"vmulps 32(%4,%0,4), %%ymm12, %%ymm9 \n\t"
|
||||
"prefetcht0 192(%5,%0,4) \n\t"
|
||||
"vmulps (%5,%0,4), %%ymm13, %%ymm10 \n\t"
|
||||
"vmulps 32(%5,%0,4), %%ymm13, %%ymm11 \n\t"
|
||||
"vaddps %%ymm4, %%ymm8 , %%ymm4 \n\t"
|
||||
"vaddps %%ymm5, %%ymm9 , %%ymm5 \n\t"
|
||||
"vaddps %%ymm4, %%ymm10, %%ymm4 \n\t"
|
||||
"vaddps %%ymm5, %%ymm11, %%ymm5 \n\t"
|
||||
|
||||
"prefetcht0 192(%6,%0,4) \n\t"
|
||||
"vmulps (%6,%0,4), %%ymm14, %%ymm8 \n\t"
|
||||
"vmulps 32(%6,%0,4), %%ymm14, %%ymm9 \n\t"
|
||||
"prefetcht0 192(%7,%0,4) \n\t"
|
||||
"vmulps (%7,%0,4), %%ymm15, %%ymm10 \n\t"
|
||||
"vmulps 32(%7,%0,4), %%ymm15, %%ymm11 \n\t"
|
||||
"vaddps %%ymm4, %%ymm8 , %%ymm4 \n\t"
|
||||
"vaddps %%ymm5, %%ymm9 , %%ymm5 \n\t"
|
||||
"vaddps %%ymm4, %%ymm10, %%ymm4 \n\t"
|
||||
"vaddps %%ymm5, %%ymm11, %%ymm5 \n\t"
|
||||
|
||||
"vmulps %%ymm6, %%ymm4 , %%ymm4 \n\t"
|
||||
"vmulps %%ymm6, %%ymm5 , %%ymm5 \n\t"
|
||||
|
||||
"vaddps %%ymm4, %%ymm0 , %%ymm0 \n\t"
|
||||
"vaddps %%ymm5, %%ymm1 , %%ymm1 \n\t"
|
||||
|
||||
"vmovups %%ymm0, (%3,%0,4) \n\t" // 8 * y
|
||||
"vmovups %%ymm1, 32(%3,%0,4) \n\t" // 8 * y
|
||||
|
||||
"addq $16, %0 \n\t"
|
||||
"subq $16, %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
".L16END%=: \n\t"
|
||||
"vzeroupper \n\t"
|
||||
|
||||
	:
|
||||
	  "+r" (i),	// 0
|
||||
	  "+r" (n)	// 1
|
||||
	:
|
||||
"r" (x), // 2
|
||||
"r" (y), // 3
|
||||
"r" (ap[0]), // 4
|
||||
"r" (ap[1]), // 5
|
||||
"r" (ap[2]), // 6
|
||||
"r" (ap[3]), // 7
|
||||
"r" (alpha) // 8
|
||||
: "cc",
|
||||
"%xmm0", "%xmm1",
|
||||
"%xmm2", "%xmm3",
|
||||
"%xmm4", "%xmm5",
|
||||
"%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
|
@ -1,473 +0,0 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
static void sgemv_kernel_64( long n, float alpha, float *a, long lda, float *x, float *y)
|
||||
{
|
||||
|
||||
|
||||
float *pre = a + lda*2;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movq %0, %%rax\n\t" // n -> rax
|
||||
"vbroadcastss %1, %%ymm1\n\t" // alpha -> ymm1
|
||||
	"movq	%2, %%rsi\n\t"	// address of a -> rsi
|
||||
	"movq	%3, %%rcx\n\t"	// value of lda -> rcx
|
||||
	"movq	%4, %%rdi\n\t"	// address of x -> rdi
|
||||
	"movq	%5, %%rdx\n\t"	// address of y -> rdx
|
||||
"movq %6, %%r8\n\t" // address for prefetch
|
||||
"prefetcht0 (%%r8)\n\t" // Prefetch
|
||||
"prefetcht0 64(%%r8)\n\t" // Prefetch
|
||||
|
||||
"vxorps %%ymm8 , %%ymm8 , %%ymm8 \n\t" // set to zero
|
||||
"vxorps %%ymm9 , %%ymm9 , %%ymm9 \n\t" // set to zero
|
||||
"vxorps %%ymm10, %%ymm10, %%ymm10\n\t" // set to zero
|
||||
"vxorps %%ymm11, %%ymm11, %%ymm11\n\t" // set to zero
|
||||
"vxorps %%ymm12, %%ymm12, %%ymm12\n\t" // set to zero
|
||||
"vxorps %%ymm13, %%ymm13, %%ymm13\n\t" // set to zero
|
||||
"vxorps %%ymm14, %%ymm14, %%ymm14\n\t" // set to zero
|
||||
"vxorps %%ymm15, %%ymm15, %%ymm15\n\t" // set to zero
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"vbroadcastss (%%rdi), %%ymm0 \n\t" // load values of c
|
||||
"nop \n\t"
|
||||
"leaq (%%r8 , %%rcx, 4), %%r8 \n\t" // add lda to pointer for prefetch
|
||||
|
||||
"prefetcht0 (%%r8)\n\t" // Prefetch
|
||||
"vmulps 0*4(%%rsi), %%ymm0, %%ymm4 \n\t" // multiply a and c and add to temp
|
||||
"vmulps 8*4(%%rsi), %%ymm0, %%ymm5 \n\t" // multiply a and c and add to temp
|
||||
"prefetcht0 64(%%r8)\n\t" // Prefetch
|
||||
"vmulps 16*4(%%rsi), %%ymm0, %%ymm6 \n\t" // multiply a and c and add to temp
|
||||
"vmulps 24*4(%%rsi), %%ymm0, %%ymm7 \n\t" // multiply a and c and add to temp
|
||||
|
||||
"vaddps %%ymm8 , %%ymm4, %%ymm8 \n\t" // multiply a and c and add to temp
|
||||
"vaddps %%ymm9 , %%ymm5, %%ymm9 \n\t" // multiply a and c and add to temp
|
||||
"prefetcht0 128(%%r8)\n\t" // Prefetch
|
||||
"vaddps %%ymm10, %%ymm6, %%ymm10\n\t" // multiply a and c and add to temp
|
||||
"vaddps %%ymm11, %%ymm7, %%ymm11\n\t" // multiply a and c and add to temp
|
||||
|
||||
"prefetcht0 192(%%r8)\n\t" // Prefetch
|
||||
"vmulps 32*4(%%rsi), %%ymm0, %%ymm4 \n\t" // multiply a and c and add to temp
|
||||
"vmulps 40*4(%%rsi), %%ymm0, %%ymm5 \n\t" // multiply a and c and add to temp
|
||||
"vmulps 48*4(%%rsi), %%ymm0, %%ymm6 \n\t" // multiply a and c and add to temp
|
||||
"vmulps 56*4(%%rsi), %%ymm0, %%ymm7 \n\t" // multiply a and c and add to temp
|
||||
|
||||
"vaddps %%ymm12, %%ymm4, %%ymm12\n\t" // multiply a and c and add to temp
|
||||
"vaddps %%ymm13, %%ymm5, %%ymm13\n\t" // multiply a and c and add to temp
|
||||
"vaddps %%ymm14, %%ymm6, %%ymm14\n\t" // multiply a and c and add to temp
|
||||
"vaddps %%ymm15, %%ymm7, %%ymm15\n\t" // multiply a and c and add to temp
|
||||
|
||||
"addq $4 , %%rdi \n\t" // increment pointer of c
|
||||
"leaq (%%rsi, %%rcx, 4), %%rsi \n\t" // add lda to pointer of a
|
||||
|
||||
"dec %%rax \n\t" // n = n -1
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"vmulps %%ymm8 , %%ymm1, %%ymm8 \n\t" // scale by alpha
|
||||
"vmulps %%ymm9 , %%ymm1, %%ymm9 \n\t" // scale by alpha
|
||||
"vmulps %%ymm10, %%ymm1, %%ymm10\n\t" // scale by alpha
|
||||
"vmulps %%ymm11, %%ymm1, %%ymm11\n\t" // scale by alpha
|
||||
"vmulps %%ymm12, %%ymm1, %%ymm12\n\t" // scale by alpha
|
||||
"vmulps %%ymm13, %%ymm1, %%ymm13\n\t" // scale by alpha
|
||||
"vmulps %%ymm14, %%ymm1, %%ymm14\n\t" // scale by alpha
|
||||
"vmulps %%ymm15, %%ymm1, %%ymm15\n\t" // scale by alpha
|
||||
|
||||
"vmovups %%ymm8 , (%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm9 , 8*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm10, 16*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm11, 24*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm12, 32*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm13, 40*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm14, 48*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm15, 56*4(%%rdx) \n\t" // store temp -> y
|
||||
|
||||
:
|
||||
:
|
||||
"m" (n), // 0
|
||||
"m" (alpha), // 1
|
||||
"m" (a), // 2
|
||||
"m" (lda), // 3
|
||||
"m" (x), // 4
|
||||
"m" (y), // 5
|
||||
"m" (pre) // 6
|
||||
: "%rax", "%rcx", "%rdx", "%rsi", "%rdi", "%r8", "cc",
|
||||
"%xmm0", "%xmm1",
|
||||
"%xmm4", "%xmm5", "%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
||||
static void sgemv_kernel_32( long n, float alpha, float *a, long lda, float *x, float *y)
|
||||
{
|
||||
|
||||
|
||||
float *pre = a + lda*3;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movq %0, %%rax\n\t" // n -> rax
|
||||
"vbroadcastss %1, %%ymm1\n\t" // alpha -> ymm1
|
||||
	"movq	%2, %%rsi\n\t"	// address of a -> rsi
|
||||
	"movq	%3, %%rcx\n\t"	// value of lda -> rcx
|
||||
	"movq	%4, %%rdi\n\t"	// address of x -> rdi
|
||||
	"movq	%5, %%rdx\n\t"	// address of y -> rdx
|
||||
"movq %6, %%r8\n\t" // address for prefetch
|
||||
"prefetcht0 (%%r8)\n\t" // Prefetch
|
||||
"prefetcht0 64(%%r8)\n\t" // Prefetch
|
||||
|
||||
"vxorps %%ymm8 , %%ymm8 , %%ymm8 \n\t" // set to zero
|
||||
"vxorps %%ymm9 , %%ymm9 , %%ymm9 \n\t" // set to zero
|
||||
"vxorps %%ymm10, %%ymm10, %%ymm10\n\t" // set to zero
|
||||
"vxorps %%ymm11, %%ymm11, %%ymm11\n\t" // set to zero
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"vbroadcastss (%%rdi), %%ymm0 \n\t" // load values of c
|
||||
"nop \n\t"
|
||||
"leaq (%%r8 , %%rcx, 4), %%r8 \n\t" // add lda to pointer for prefetch
|
||||
|
||||
"prefetcht0 (%%r8)\n\t" // Prefetch
|
||||
"prefetcht0 64(%%r8)\n\t" // Prefetch
|
||||
|
||||
"vmulps 0*4(%%rsi), %%ymm0, %%ymm4 \n\t" // multiply a and c and add to temp
|
||||
"vmulps 8*4(%%rsi), %%ymm0, %%ymm5 \n\t" // multiply a and c and add to temp
|
||||
"vmulps 16*4(%%rsi), %%ymm0, %%ymm6 \n\t" // multiply a and c and add to temp
|
||||
"vmulps 24*4(%%rsi), %%ymm0, %%ymm7 \n\t" // multiply a and c and add to temp
|
||||
|
||||
"vaddps %%ymm8 , %%ymm4, %%ymm8 \n\t" // multiply a and c and add to temp
|
||||
"vaddps %%ymm9 , %%ymm5, %%ymm9 \n\t" // multiply a and c and add to temp
|
||||
"vaddps %%ymm10, %%ymm6, %%ymm10\n\t" // multiply a and c and add to temp
|
||||
"vaddps %%ymm11, %%ymm7, %%ymm11\n\t" // multiply a and c and add to temp
|
||||
|
||||
|
||||
|
||||
"addq $4 , %%rdi \n\t" // increment pointer of c
|
||||
"leaq (%%rsi, %%rcx, 4), %%rsi \n\t" // add lda to pointer of a
|
||||
|
||||
"dec %%rax \n\t" // n = n -1
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"vmulps %%ymm8 , %%ymm1, %%ymm8 \n\t" // scale by alpha
|
||||
"vmulps %%ymm9 , %%ymm1, %%ymm9 \n\t" // scale by alpha
|
||||
"vmulps %%ymm10, %%ymm1, %%ymm10\n\t" // scale by alpha
|
||||
"vmulps %%ymm11, %%ymm1, %%ymm11\n\t" // scale by alpha
|
||||
|
||||
"vmovups %%ymm8 , (%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm9 , 8*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm10, 16*4(%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm11, 24*4(%%rdx) \n\t" // store temp -> y
|
||||
|
||||
:
|
||||
:
|
||||
"m" (n), // 0
|
||||
"m" (alpha), // 1
|
||||
"m" (a), // 2
|
||||
"m" (lda), // 3
|
||||
"m" (x), // 4
|
||||
"m" (y), // 5
|
||||
"m" (pre) // 6
|
||||
: "%rax", "%rcx", "%rdx", "%rsi", "%rdi", "%r8", "cc",
|
||||
"%xmm0", "%xmm1",
|
||||
"%xmm4", "%xmm5", "%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"memory"
|
||||
);
|
||||
|
||||
|
||||
|
||||
}
|
||||
|
||||
static void sgemv_kernel_16( long n, float alpha, float *a, long lda, float *x, float *y)
|
||||
{
|
||||
|
||||
float *pre = a + lda*3;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movq %0, %%rax\n\t" // n -> rax
|
||||
"vbroadcastss %1, %%ymm1\n\t" // alpha -> ymm1
|
||||
	"movq	%2, %%rsi\n\t"	// address of a -> rsi
|
||||
	"movq	%3, %%rcx\n\t"	// value of lda -> rcx
|
||||
	"movq	%4, %%rdi\n\t"	// address of x -> rdi
|
||||
	"movq	%5, %%rdx\n\t"	// address of y -> rdx
|
||||
"movq %6, %%r8\n\t" // address for prefetch
|
||||
"prefetcht0 (%%r8)\n\t" // Prefetch
|
||||
"prefetcht0 64(%%r8)\n\t" // Prefetch
|
||||
|
||||
"vxorps %%ymm8 , %%ymm8 , %%ymm8 \n\t" // set to zero
|
||||
"vxorps %%ymm9 , %%ymm9 , %%ymm9 \n\t" // set to zero
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"vbroadcastss (%%rdi), %%ymm0 \n\t" // load values of c
|
||||
"nop \n\t"
|
||||
"leaq (%%r8 , %%rcx, 4), %%r8 \n\t" // add lda to pointer for prefetch
|
||||
|
||||
"prefetcht0 (%%r8)\n\t" // Prefetch
|
||||
|
||||
"vmulps 0*4(%%rsi), %%ymm0, %%ymm4 \n\t" // multiply a and c and add to temp
|
||||
"vmulps 8*4(%%rsi), %%ymm0, %%ymm5 \n\t" // multiply a and c and add to temp
|
||||
|
||||
"vaddps %%ymm8 , %%ymm4, %%ymm8 \n\t" // multiply a and c and add to temp
|
||||
"vaddps %%ymm9 , %%ymm5, %%ymm9 \n\t" // multiply a and c and add to temp
|
||||
|
||||
"addq $4 , %%rdi \n\t" // increment pointer of c
|
||||
"leaq (%%rsi, %%rcx, 4), %%rsi \n\t" // add lda to pointer of a
|
||||
|
||||
"dec %%rax \n\t" // n = n -1
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"vmulps %%ymm8 , %%ymm1, %%ymm8 \n\t" // scale by alpha
|
||||
"vmulps %%ymm9 , %%ymm1, %%ymm9 \n\t" // scale by alpha
|
||||
|
||||
"vmovups %%ymm8 , (%%rdx) \n\t" // store temp -> y
|
||||
"vmovups %%ymm9 , 8*4(%%rdx) \n\t" // store temp -> y
|
||||
|
||||
:
|
||||
:
|
||||
"m" (n), // 0
|
||||
"m" (alpha), // 1
|
||||
"m" (a), // 2
|
||||
"m" (lda), // 3
|
||||
"m" (x), // 4
|
||||
"m" (y), // 5
|
||||
"m" (pre) // 6
|
||||
: "%rax", "%rcx", "%rdx", "%rsi", "%rdi", "%r8", "cc",
|
||||
"%xmm0", "%xmm1",
|
||||
"%xmm4", "%xmm5", "%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"memory"
|
||||
);
|
||||
|
||||
|
||||
}
|
||||
|
||||
|
||||
static void sgemv_kernel_8( long n, float alpha, float *a, long lda, float *x, float *y)
|
||||
{
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movq %0, %%rax\n\t" // n -> rax
|
||||
"vbroadcastss %1, %%ymm1\n\t" // alpha -> ymm1
|
||||
	"movq	%2, %%rsi\n\t"	// address of a -> rsi
|
||||
	"movq	%3, %%rcx\n\t"	// value of lda -> rcx
|
||||
	"movq	%4, %%rdi\n\t"	// address of x -> rdi
|
||||
	"movq	%5, %%rdx\n\t"	// address of y -> rdx
|
||||
|
||||
"vxorps %%ymm8 , %%ymm8 , %%ymm8 \n\t" // set to zero
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"vbroadcastss (%%rdi), %%ymm0 \n\t" // load values of c
|
||||
|
||||
"vmulps 0*4(%%rsi), %%ymm0, %%ymm4 \n\t" // multiply a and c and add to temp
|
||||
"vaddps %%ymm8 , %%ymm4, %%ymm8 \n\t" // multiply a and c and add to temp
|
||||
|
||||
"addq $4 , %%rdi \n\t" // increment pointer of c
|
||||
"leaq (%%rsi, %%rcx, 4), %%rsi \n\t" // add lda to pointer of a
|
||||
|
||||
"dec %%rax \n\t" // n = n -1
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"vmulps %%ymm8 , %%ymm1, %%ymm8 \n\t" // scale by alpha
|
||||
"vmovups %%ymm8 , (%%rdx) \n\t" // store temp -> y
|
||||
|
||||
:
|
||||
:
|
||||
"m" (n), // 0
|
||||
"m" (alpha), // 1
|
||||
"m" (a), // 2
|
||||
"m" (lda), // 3
|
||||
"m" (x), // 4
|
||||
"m" (y) // 5
|
||||
: "%rax", "%rcx", "%rdx", "%rsi", "%rdi", "%r8", "cc",
|
||||
"%xmm0", "%xmm1",
|
||||
"%xmm4", "%xmm5", "%xmm6", "%xmm7",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"memory"
|
||||
);
|
||||
|
||||
|
||||
}
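For readers who do not read AT&T-syntax AVX, the loop in sgemv_kernel_8 above can be read as the scalar C below. This is only a sketch and not part of the commit: the name sgemv_kernel_8_ref is hypothetical, and it assumes column-major storage with lda counted in floats, which is what the `leaq (%rsi,%rcx,4)` step suggests. Like the asm, it overwrites y instead of accumulating into it.

```c
#include <assert.h>

/* Hypothetical scalar reference for the 8-row AVX kernel:
 *   y[r] = alpha * sum_{j < n} a[j * lda + r] * x[j],  r = 0..7
 * The result overwrites y, as the final vmovups does in the asm. */
static void sgemv_kernel_8_ref(long n, float alpha, const float *a, long lda,
                               const float *x, float *y)
{
    float temp[8] = {0};
    long j, r;

    /* The asm loop is a do-while (dec / jnz), so it expects n >= 1. */
    assert(n > 0);

    for (j = 0; j < n; j++) {
        /* One column of a (8 rows) times the scalar x[j]. */
        for (r = 0; r < 8; r++)
            temp[r] += a[j * lda + r] * x[j];
    }

    /* Scale by alpha once, after the loop, then store. */
    for (r = 0; r < 8; r++)
        y[r] = temp[r] * alpha;
}
```

A reference like this is mainly useful for comparing against the asm kernel on small inputs, where the float results should agree exactly as long as the summation order is the same.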
|
||||
|
||||
|
||||
static void sgemv_kernel_4( long n, float alpha, float *a, long lda, float *x, float *y)
|
||||
{
|
||||
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movq %0, %%rax\n\t" // n -> rax
|
||||
"vbroadcastss %1, %%xmm1\n\t" // alpha -> xmm1
|
||||
"movq %2, %%rsi\n\t" // address of a -> rsi
|
||||
"movq %3, %%rcx\n\t" // value of lda -> rcx
|
||||
"movq %4, %%rdi\n\t" // address of x -> rdi
|
||||
"movq %5, %%rdx\n\t" // address of y -> rdx
|
||||
|
||||
"vxorps %%xmm12, %%xmm12, %%xmm12\n\t" // set to zero
|
||||
|
||||
".L01LOOP%=: \n\t"
|
||||
"vbroadcastss (%%rdi), %%xmm0 \n\t" // broadcast x[i] to all lanes
|
||||
|
||||
"vmulps 0*4(%%rsi), %%xmm0, %%xmm4 \n\t" // xmm4 = a[0..3] * x[i]
|
||||
"vaddps %%xmm12, %%xmm4, %%xmm12 \n\t" // accumulate into temp (xmm12)
|
||||
|
||||
"addq $4 , %%rdi \n\t" // advance pointer into x
|
||||
"leaq (%%rsi, %%rcx, 4), %%rsi \n\t" // add lda to pointer of a
|
||||
|
||||
"dec %%rax \n\t" // n = n -1
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"vmulps %%xmm12, %%xmm1, %%xmm12\n\t" // scale by alpha
|
||||
|
||||
"vmovups %%xmm12, (%%rdx) \n\t" // store temp -> y
|
||||
|
||||
:
|
||||
:
|
||||
"m" (n), // 0
|
||||
"m" (alpha), // 1
|
||||
"m" (a), // 2
|
||||
"m" (lda), // 3
|
||||
"m" (x), // 4
|
||||
"m" (y) // 5
|
||||
: "%rax", "%rcx", "%rdx", "%rsi", "%rdi", "%r8", "cc",
|
||||
"%xmm0", "%xmm1", "%xmm4",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
static void sgemv_kernel_2( long n, float alpha, float *a, long lda, float *x, float *y)
|
||||
{
|
||||
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movq %0, %%rax\n\t" // n -> rax
|
||||
"vmovss %1, %%xmm1\n\t" // alpha -> xmm1
|
||||
"movq %2, %%rsi\n\t" // address of a -> rsi
|
||||
"movq %3, %%rcx\n\t" // value of lda -> rcx
|
||||
"movq %4, %%rdi\n\t" // address of x -> rdi
|
||||
"movq %5, %%rdx\n\t" // address of y -> rdx
|
||||
|
||||
"vxorps %%xmm12, %%xmm12, %%xmm12\n\t" // set to zero
|
||||
"vxorps %%xmm13, %%xmm13, %%xmm13\n\t" // set to zero
|
||||
|
||||
".L01LOOP%=: \n\t"
|
||||
"vmovss (%%rdi), %%xmm0 \n\t" // load x[i] into lane 0
|
||||
|
||||
"vmulps 0*4(%%rsi), %%xmm0, %%xmm4 \n\t" // lane 0 of xmm4 = a[0] * x[i]
|
||||
"vmulps 1*4(%%rsi), %%xmm0, %%xmm5 \n\t" // lane 0 of xmm5 = a[1] * x[i]
|
||||
|
||||
"vaddps %%xmm12, %%xmm4, %%xmm12 \n\t" // accumulate into temp0 (xmm12)
|
||||
"vaddps %%xmm13, %%xmm5, %%xmm13 \n\t" // accumulate into temp1 (xmm13)
|
||||
|
||||
"addq $4 , %%rdi \n\t" // advance pointer into x
|
||||
"leaq (%%rsi, %%rcx, 4), %%rsi \n\t" // add lda to pointer of a
|
||||
|
||||
"dec %%rax \n\t" // n = n -1
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"vmulss %%xmm12, %%xmm1, %%xmm12\n\t" // scale by alpha
|
||||
"vmulss %%xmm13, %%xmm1, %%xmm13\n\t" // scale by alpha
|
||||
|
||||
"vmovss %%xmm12, (%%rdx) \n\t" // store temp -> y
|
||||
"vmovss %%xmm13, 4(%%rdx) \n\t" // store temp -> y
|
||||
|
||||
:
|
||||
:
|
||||
"m" (n), // 0
|
||||
"m" (alpha), // 1
|
||||
"m" (a), // 2
|
||||
"m" (lda), // 3
|
||||
"m" (x), // 4
|
||||
"m" (y) // 5
|
||||
: "%rax", "%rcx", "%rdx", "%rsi", "%rdi", "%r8", "cc",
|
||||
"%xmm0", "%xmm1", "%xmm4", "%xmm5",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
||||
static void sgemv_kernel_1( long n, float alpha, float *a, long lda, float *x, float *y)
|
||||
{
|
||||
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movq %0, %%rax\n\t" // n -> rax
|
||||
"vmovss %1, %%xmm1\n\t" // alpha -> xmm1
|
||||
"movq %2, %%rsi\n\t" // address of a -> rsi
|
||||
"movq %3, %%rcx\n\t" // value of lda -> rcx
|
||||
"movq %4, %%rdi\n\t" // address of x -> rdi
|
||||
"movq %5, %%rdx\n\t" // address of y -> rdx
|
||||
|
||||
"vxorps %%xmm12, %%xmm12, %%xmm12\n\t" // set to zero
|
||||
|
||||
".L01LOOP%=: \n\t"
|
||||
"vmovss (%%rdi), %%xmm0 \n\t" // load x[i]
|
||||
"addq $4 , %%rdi \n\t" // advance pointer into x
|
||||
|
||||
"vmulss 0*4(%%rsi), %%xmm0, %%xmm4 \n\t" // xmm4 = a[0] * x[i]
|
||||
"vaddss %%xmm12, %%xmm4, %%xmm12 \n\t" // accumulate into temp (xmm12)
|
||||
|
||||
"leaq (%%rsi, %%rcx, 4), %%rsi \n\t" // add lda to pointer of a
|
||||
|
||||
"dec %%rax \n\t" // n = n -1
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
"vmulss %%xmm12, %%xmm1, %%xmm12\n\t" // scale by alpha
|
||||
|
||||
"vmovss %%xmm12, (%%rdx) \n\t" // store temp -> y
|
||||
|
||||
:
|
||||
:
|
||||
"m" (n), // 0
|
||||
"m" (alpha), // 1
|
||||
"m" (a), // 2
|
||||
"m" (lda), // 3
|
||||
"m" (x), // 4
|
||||
"m" (y) // 5
|
||||
: "%rax", "%rcx", "%rdx", "%rsi", "%rdi", "%r8", "cc",
|
||||
"%xmm0", "%xmm1", "%xmm4",
|
||||
"%xmm8", "%xmm9", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
|
||||
|
|
@ -28,16 +28,6 @@ USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|||
|
||||
#include "common.h"
|
||||
|
||||
#if defined(BULLDOZER) || defined(PILEDRIVER)
|
||||
#include "sgemv_t_microk_bulldozer-2.c"
|
||||
#elif defined(HASWELL)
|
||||
#include "sgemv_t_microk_haswell-2.c"
|
||||
#elif defined(SANDYBRIDGE)
|
||||
#include "sgemv_t_microk_sandy-2.c"
|
||||
#elif defined(NEHALEM)
|
||||
#include "sgemv_t_microk_nehalem-2.c"
|
||||
#endif
|
||||
|
||||
#define NBMAX 4096
|
||||
|
||||
#ifndef HAVE_KERNEL_16x4
|
||||
|
|
|
|||
|
|
@ -0,0 +1,624 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
#include "common.h"
|
||||
|
||||
#if defined(NEHALEM)
|
||||
#include "sgemv_t_microk_nehalem-4.c"
|
||||
#elif defined(BULLDOZER) || defined(PILEDRIVER)
|
||||
#include "sgemv_t_microk_bulldozer-4.c"
|
||||
#elif defined(SANDYBRIDGE)
|
||||
#include "sgemv_t_microk_sandy-4.c"
|
||||
#elif defined(HASWELL)
|
||||
#include "sgemv_t_microk_haswell-4.c"
|
||||
#endif
|
||||
|
||||
#define NBMAX 4096
|
||||
|
||||
#ifndef HAVE_KERNEL_4x4
|
||||
|
||||
static void sgemv_kernel_4x4(BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
BLASLONG i;
|
||||
FLOAT *a0,*a1,*a2,*a3;
|
||||
a0 = ap[0];
|
||||
a1 = ap[1];
|
||||
a2 = ap[2];
|
||||
a3 = ap[3];
|
||||
FLOAT temp0 = 0.0;
|
||||
FLOAT temp1 = 0.0;
|
||||
FLOAT temp2 = 0.0;
|
||||
FLOAT temp3 = 0.0;
|
||||
|
||||
for ( i=0; i< n; i+=4 )
|
||||
{
|
||||
temp0 += a0[i]*x[i] + a0[i+1]*x[i+1] + a0[i+2]*x[i+2] + a0[i+3]*x[i+3];
|
||||
temp1 += a1[i]*x[i] + a1[i+1]*x[i+1] + a1[i+2]*x[i+2] + a1[i+3]*x[i+3];
|
||||
temp2 += a2[i]*x[i] + a2[i+1]*x[i+1] + a2[i+2]*x[i+2] + a2[i+3]*x[i+3];
|
||||
temp3 += a3[i]*x[i] + a3[i+1]*x[i+1] + a3[i+2]*x[i+2] + a3[i+3]*x[i+3];
|
||||
}
|
||||
y[0] = temp0;
|
||||
y[1] = temp1;
|
||||
y[2] = temp2;
|
||||
y[3] = temp3;
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
static void sgemv_kernel_4x2(BLASLONG n, FLOAT *ap0, FLOAT *ap1, FLOAT *x, FLOAT *y) __attribute__ ((noinline));
|
||||
|
||||
static void sgemv_kernel_4x2(BLASLONG n, FLOAT *ap0, FLOAT *ap1, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
BLASLONG i;
|
||||
|
||||
i=0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"xorps %%xmm10 , %%xmm10 \n\t"
|
||||
"xorps %%xmm11 , %%xmm11 \n\t"
|
||||
|
||||
"testq $4 , %1 \n\t"
|
||||
"jz .L01LABEL%= \n\t"
|
||||
|
||||
"movups (%5,%0,4) , %%xmm14 \n\t" // x
|
||||
"movups (%3,%0,4) , %%xmm12 \n\t" // ap0
|
||||
"movups (%4,%0,4) , %%xmm13 \n\t" // ap1
|
||||
"mulps %%xmm14 , %%xmm12 \n\t"
|
||||
"mulps %%xmm14 , %%xmm13 \n\t"
|
||||
"addq $4 , %0 \n\t"
|
||||
"addps %%xmm12 , %%xmm10 \n\t"
|
||||
"subq $4 , %1 \n\t"
|
||||
"addps %%xmm13 , %%xmm11 \n\t"
|
||||
|
||||
".L01LABEL%=: \n\t"
|
||||
|
||||
"cmpq $0, %1 \n\t"
|
||||
"je .L01END%= \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
|
||||
"movups (%5,%0,4) , %%xmm14 \n\t" // x
|
||||
"movups (%3,%0,4) , %%xmm12 \n\t" // ap0
|
||||
"movups (%4,%0,4) , %%xmm13 \n\t" // ap1
|
||||
"mulps %%xmm14 , %%xmm12 \n\t"
|
||||
"mulps %%xmm14 , %%xmm13 \n\t"
|
||||
"addps %%xmm12 , %%xmm10 \n\t"
|
||||
"addps %%xmm13 , %%xmm11 \n\t"
|
||||
|
||||
"movups 16(%5,%0,4) , %%xmm14 \n\t" // x
|
||||
"movups 16(%3,%0,4) , %%xmm12 \n\t" // ap0
|
||||
"movups 16(%4,%0,4) , %%xmm13 \n\t" // ap1
|
||||
"mulps %%xmm14 , %%xmm12 \n\t"
|
||||
"mulps %%xmm14 , %%xmm13 \n\t"
|
||||
"addps %%xmm12 , %%xmm10 \n\t"
|
||||
"addps %%xmm13 , %%xmm11 \n\t"
|
||||
|
||||
"addq $8 , %0 \n\t"
|
||||
"subq $8 , %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
".L01END%=: \n\t"
|
||||
|
||||
"haddps %%xmm10, %%xmm10 \n\t"
|
||||
"haddps %%xmm11, %%xmm11 \n\t"
|
||||
"haddps %%xmm10, %%xmm10 \n\t"
|
||||
"haddps %%xmm11, %%xmm11 \n\t"
|
||||
|
||||
"movss %%xmm10, (%2) \n\t"
|
||||
"movss %%xmm11,4(%2) \n\t"
|
||||
|
||||
:
|
||||
"+r" (i), // 0 (read-write: advanced by the asm)
|
||||
"+r" (n) // 1 (read-write: counted down by the asm)
|
||||
:
|
||||
"r" (y), // 2
|
||||
"r" (ap0), // 3
|
||||
"r" (ap1), // 4
|
||||
"r" (x) // 5
|
||||
: "cc",
|
||||
"%xmm4", "%xmm5", "%xmm10", "%xmm11",
|
||||
"%xmm12", "%xmm13", "%xmm14", "%xmm15",
|
||||
"memory"
|
||||
);
|
||||
|
||||
|
||||
}
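The two back-to-back `haddps` instructions at the end of sgemv_kernel_4x2 reduce each four-lane accumulator to a single sum before `movss` stores lane 0. The scalar model below is a sketch of that reduction only; haddps_self and horizontal_sum are hypothetical names, not functions from the commit.

```c
#include <assert.h>

/* Scalar model of SSE3 `haddps v, v` (destination and source identical):
 * the result is { v0+v1, v2+v3, v0+v1, v2+v3 }. */
static void haddps_self(float v[4])
{
    float lo = v[0] + v[1];
    float hi = v[2] + v[3];

    v[0] = lo;
    v[1] = hi;
    v[2] = lo;
    v[3] = hi;
}

/* Two passes leave the full four-lane sum in every lane; the kernel
 * then stores lane 0 only. */
static float horizontal_sum(const float acc[4])
{
    float v[4] = { acc[0], acc[1], acc[2], acc[3] };

    haddps_self(v);   /* { a0+a1, a2+a3, a0+a1, a2+a3 } */
    haddps_self(v);   /* every lane = a0+a1+a2+a3       */
    return v[0];
}
```

Note that the pairwise order (a0+a1) + (a2+a3) differs from a left-to-right scalar sum, so results can differ in the last bits from the generic C fallback.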
|
||||
|
||||
static void sgemv_kernel_4x1(BLASLONG n, FLOAT *ap, FLOAT *x, FLOAT *y) __attribute__ ((noinline));
|
||||
|
||||
static void sgemv_kernel_4x1(BLASLONG n, FLOAT *ap, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
BLASLONG i;
|
||||
|
||||
i=0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"xorps %%xmm9 , %%xmm9 \n\t"
|
||||
"xorps %%xmm10 , %%xmm10 \n\t"
|
||||
|
||||
"testq $4 , %1 \n\t"
|
||||
"jz .L01LABEL%= \n\t"
|
||||
|
||||
"movups (%3,%0,4) , %%xmm12 \n\t"
|
||||
"movups (%4,%0,4) , %%xmm11 \n\t"
|
||||
"mulps %%xmm11 , %%xmm12 \n\t"
|
||||
"addq $4 , %0 \n\t"
|
||||
"addps %%xmm12 , %%xmm10 \n\t"
|
||||
"subq $4 , %1 \n\t"
|
||||
|
||||
".L01LABEL%=: \n\t"
|
||||
|
||||
"cmpq $0, %1 \n\t"
|
||||
"je .L01END%= \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
|
||||
"movups (%3,%0,4) , %%xmm12 \n\t"
|
||||
"movups 16(%3,%0,4) , %%xmm14 \n\t"
|
||||
"movups (%4,%0,4) , %%xmm11 \n\t"
|
||||
"movups 16(%4,%0,4) , %%xmm13 \n\t"
|
||||
"mulps %%xmm11 , %%xmm12 \n\t"
|
||||
"mulps %%xmm13 , %%xmm14 \n\t"
|
||||
"addq $8 , %0 \n\t"
|
||||
"addps %%xmm12 , %%xmm10 \n\t"
|
||||
"subq $8 , %1 \n\t"
|
||||
"addps %%xmm14 , %%xmm9 \n\t"
|
||||
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
".L01END%=: \n\t"
|
||||
|
||||
"addps %%xmm9 , %%xmm10 \n\t"
|
||||
"haddps %%xmm10, %%xmm10 \n\t"
|
||||
"haddps %%xmm10, %%xmm10 \n\t"
|
||||
|
||||
"movss %%xmm10, (%2) \n\t"
|
||||
|
||||
:
|
||||
"+r" (i), // 0 (read-write: advanced by the asm)
|
||||
"+r" (n) // 1 (read-write: counted down by the asm)
|
||||
:
|
||||
"r" (y), // 2
|
||||
"r" (ap), // 3
|
||||
"r" (x) // 4
|
||||
: "cc",
|
||||
"%xmm9", "%xmm10" ,
|
||||
"%xmm11", "%xmm12", "%xmm13", "%xmm14",
|
||||
"memory"
|
||||
);
|
||||
|
||||
|
||||
}
|
||||
|
||||
static void copy_x(BLASLONG n, FLOAT *src, FLOAT *dest, BLASLONG inc_src)
|
||||
{
|
||||
BLASLONG i;
|
||||
for ( i=0; i<n; i++ )
|
||||
{
|
||||
*dest = *src;
|
||||
dest++;
|
||||
src += inc_src;
|
||||
}
|
||||
}
|
||||
|
||||
static void add_y(BLASLONG n, FLOAT da , FLOAT *src, FLOAT *dest, BLASLONG inc_dest) __attribute__ ((noinline));
|
||||
|
||||
static void add_y(BLASLONG n, FLOAT da , FLOAT *src, FLOAT *dest, BLASLONG inc_dest)
|
||||
{
|
||||
|
||||
BLASLONG i;
|
||||
|
||||
if ( inc_dest != 1 )
|
||||
{
|
||||
for ( i=0; i<n; i++ )
|
||||
{
|
||||
*dest += src[i] * da;
|
||||
dest += inc_dest;
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
i=0;
|
||||
|
||||
__asm__ __volatile__
|
||||
(
|
||||
"movss (%2) , %%xmm10 \n\t"
|
||||
"shufps $0 , %%xmm10 , %%xmm10 \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
|
||||
"movups (%3,%0,4) , %%xmm12 \n\t"
|
||||
"movups (%4,%0,4) , %%xmm11 \n\t"
|
||||
"mulps %%xmm10 , %%xmm12 \n\t"
|
||||
"addq $4 , %0 \n\t"
|
||||
"addps %%xmm12 , %%xmm11 \n\t"
|
||||
"subq $4 , %1 \n\t"
|
||||
"movups %%xmm11, -16(%4,%0,4) \n\t"
|
||||
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
:
|
||||
"+r" (i), // 0 (read-write: advanced by the asm)
|
||||
"+r" (n) // 1 (read-write: counted down by the asm)
|
||||
:
|
||||
"r" (&da), // 2
|
||||
"r" (src), // 3
|
||||
"r" (dest) // 4
|
||||
: "cc",
|
||||
"%xmm10", "%xmm11", "%xmm12",
|
||||
"memory"
|
||||
);
|
||||
|
||||
|
||||
}
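As a reading aid, add_y above can be summarised by the scalar C below. It is a sketch under assumptions, not the commit's code: add_y_ref is a hypothetical name, and the assertion reflects how the driver in this file appears to call add_y (with nb1*4 or n1*4 elements), not a documented contract.

```c
#include <assert.h>

/* Hypothetical scalar reference for add_y:
 *   dest[i * inc_dest] += da * src[i]   for i in [0, n). */
static void add_y_ref(long n, float da, const float *src, float *dest,
                      long inc_dest)
{
    long i;

    if (inc_dest != 1) {
        /* Strided destination: same loop as the C branch of add_y. */
        for (i = 0; i < n; i++) {
            *dest += src[i] * da;
            dest += inc_dest;
        }
        return;
    }

    /* Unit stride: the SSE loop consumes four floats per pass and only
     * exits when its counter reaches exactly zero, so it relies on n
     * being a positive multiple of 4. */
    assert(n > 0 && n % 4 == 0);
    for (i = 0; i < n; i += 4) {
        dest[i]     += src[i]     * da;
        dest[i + 1] += src[i + 1] * da;
        dest[i + 2] += src[i + 2] * da;
        dest[i + 3] += src[i + 3] * da;
    }
}
```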
|
||||
|
||||
int CNAME(BLASLONG m, BLASLONG n, BLASLONG dummy1, FLOAT alpha, FLOAT *a, BLASLONG lda, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y, FLOAT *buffer)
|
||||
{
|
||||
BLASLONG register i;
|
||||
BLASLONG register j;
|
||||
FLOAT *a_ptr;
|
||||
FLOAT *x_ptr;
|
||||
FLOAT *y_ptr;
|
||||
BLASLONG n0;
|
||||
BLASLONG n1;
|
||||
BLASLONG m1;
|
||||
BLASLONG m2;
|
||||
BLASLONG m3;
|
||||
BLASLONG n2;
|
||||
FLOAT ybuffer[4],*xbuffer;
|
||||
FLOAT *ytemp;
|
||||
|
||||
if ( m < 1 ) return(0);
|
||||
if ( n < 1 ) return(0);
|
||||
|
||||
xbuffer = buffer;
|
||||
ytemp = buffer + NBMAX;
|
||||
|
||||
n0 = n / NBMAX;
|
||||
n1 = (n % NBMAX) >> 2 ;
|
||||
n2 = n & 3 ;
|
||||
|
||||
m3 = m & 3 ;
|
||||
m1 = m & -4 ;
|
||||
m2 = (m & (NBMAX-1)) - m3 ;
|
||||
|
||||
|
||||
BLASLONG NB = NBMAX;
|
||||
|
||||
while ( NB == NBMAX )
|
||||
{
|
||||
|
||||
m1 -= NB;
|
||||
if ( m1 < 0)
|
||||
{
|
||||
if ( m2 == 0 ) break;
|
||||
NB = m2;
|
||||
}
|
||||
|
||||
y_ptr = y;
|
||||
a_ptr = a;
|
||||
x_ptr = x;
|
||||
|
||||
if ( inc_x == 1 )
|
||||
xbuffer = x_ptr;
|
||||
else
|
||||
copy_x(NB,x_ptr,xbuffer,inc_x);
|
||||
|
||||
|
||||
FLOAT *ap[4];
|
||||
FLOAT *yp;
|
||||
BLASLONG register lda4 = 4 * lda;
|
||||
ap[0] = a_ptr;
|
||||
ap[1] = a_ptr + lda;
|
||||
ap[2] = ap[1] + lda;
|
||||
ap[3] = ap[2] + lda;
|
||||
|
||||
if ( n0 > 0 )
|
||||
{
|
||||
BLASLONG nb1 = NBMAX / 4;
|
||||
for( j=0; j<n0; j++)
|
||||
{
|
||||
|
||||
yp = ytemp;
|
||||
for( i = 0; i < nb1 ; i++)
|
||||
{
|
||||
sgemv_kernel_4x4(NB,ap,xbuffer,yp);
|
||||
ap[0] += lda4 ;
|
||||
ap[1] += lda4 ;
|
||||
ap[2] += lda4 ;
|
||||
ap[3] += lda4 ;
|
||||
yp += 4;
|
||||
}
|
||||
add_y(nb1*4, alpha, ytemp, y_ptr, inc_y );
|
||||
y_ptr += nb1 * inc_y * 4;
|
||||
a_ptr += nb1 * lda4 ;
|
||||
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
|
||||
yp = ytemp;
|
||||
|
||||
for( i = 0; i < n1 ; i++)
|
||||
{
|
||||
sgemv_kernel_4x4(NB,ap,xbuffer,yp);
|
||||
ap[0] += lda4 ;
|
||||
ap[1] += lda4 ;
|
||||
ap[2] += lda4 ;
|
||||
ap[3] += lda4 ;
|
||||
yp += 4;
|
||||
}
|
||||
if ( n1 > 0 )
|
||||
{
|
||||
add_y(n1*4, alpha, ytemp, y_ptr, inc_y );
|
||||
y_ptr += n1 * inc_y * 4;
|
||||
a_ptr += n1 * lda4 ;
|
||||
}
|
||||
|
||||
if ( n2 & 2 )
|
||||
{
|
||||
|
||||
sgemv_kernel_4x2(NB,ap[0],ap[1],xbuffer,ybuffer);
|
||||
a_ptr += lda * 2;
|
||||
*y_ptr += ybuffer[0] * alpha;
|
||||
y_ptr += inc_y;
|
||||
*y_ptr += ybuffer[1] * alpha;
|
||||
y_ptr += inc_y;
|
||||
|
||||
}
|
||||
|
||||
if ( n2 & 1 )
|
||||
{
|
||||
|
||||
sgemv_kernel_4x1(NB,a_ptr,xbuffer,ybuffer);
|
||||
a_ptr += lda;
|
||||
*y_ptr += ybuffer[0] * alpha;
|
||||
y_ptr += inc_y;
|
||||
|
||||
}
|
||||
a += NB;
|
||||
x += NB * inc_x;
|
||||
}
|
||||
|
||||
if ( m3 == 0 ) return(0);
|
||||
|
||||
x_ptr = x;
|
||||
a_ptr = a;
|
||||
if ( m3 == 3 )
|
||||
{
|
||||
FLOAT xtemp0 = *x_ptr * alpha;
|
||||
x_ptr += inc_x;
|
||||
FLOAT xtemp1 = *x_ptr * alpha;
|
||||
x_ptr += inc_x;
|
||||
FLOAT xtemp2 = *x_ptr * alpha;
|
||||
|
||||
FLOAT *aj = a_ptr;
|
||||
y_ptr = y;
|
||||
|
||||
if ( lda == 3 && inc_y == 1 )
|
||||
{
|
||||
|
||||
for ( j=0; j< ( n & -4) ; j+=4 )
|
||||
{
|
||||
|
||||
y_ptr[j] += aj[0] * xtemp0 + aj[1] * xtemp1 + aj[2] * xtemp2;
|
||||
y_ptr[j+1] += aj[3] * xtemp0 + aj[4] * xtemp1 + aj[5] * xtemp2;
|
||||
y_ptr[j+2] += aj[6] * xtemp0 + aj[7] * xtemp1 + aj[8] * xtemp2;
|
||||
y_ptr[j+3] += aj[9] * xtemp0 + aj[10] * xtemp1 + aj[11] * xtemp2;
|
||||
aj += 12;
|
||||
}
|
||||
|
||||
for ( ; j<n; j++ )
|
||||
{
|
||||
y_ptr[j] += aj[0] * xtemp0 + aj[1] * xtemp1 + aj[2] * xtemp2;
|
||||
aj += 3;
|
||||
}
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
|
||||
if ( inc_y == 1 )
|
||||
{
|
||||
|
||||
BLASLONG register lda2 = lda << 1;
|
||||
BLASLONG register lda4 = lda << 2;
|
||||
BLASLONG register lda3 = lda2 + lda;
|
||||
|
||||
for ( j=0; j< ( n & -4 ); j+=4 )
|
||||
{
|
||||
|
||||
y_ptr[j] += *aj * xtemp0 + *(aj+1) * xtemp1 + *(aj+2) * xtemp2;
|
||||
y_ptr[j+1] += *(aj+lda) * xtemp0 + *(aj+lda+1) * xtemp1 + *(aj+lda+2) * xtemp2;
|
||||
y_ptr[j+2] += *(aj+lda2) * xtemp0 + *(aj+lda2+1) * xtemp1 + *(aj+lda2+2) * xtemp2;
|
||||
y_ptr[j+3] += *(aj+lda3) * xtemp0 + *(aj+lda3+1) * xtemp1 + *(aj+lda3+2) * xtemp2;
|
||||
aj += lda4;
|
||||
}
|
||||
|
||||
for ( ; j< n ; j++ )
|
||||
{
|
||||
|
||||
y_ptr[j] += *aj * xtemp0 + *(aj+1) * xtemp1 + *(aj+2) * xtemp2 ;
|
||||
aj += lda;
|
||||
}
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
|
||||
for ( j=0; j<n; j++ )
|
||||
{
|
||||
*y_ptr += *aj * xtemp0 + *(aj+1) * xtemp1 + *(aj+2) * xtemp2;
|
||||
y_ptr += inc_y;
|
||||
aj += lda;
|
||||
}
|
||||
|
||||
|
||||
}
|
||||
|
||||
}
|
||||
return(0);
|
||||
}
|
||||
|
||||
if ( m3 == 2 )
|
||||
{
|
||||
FLOAT xtemp0 = *x_ptr * alpha;
|
||||
x_ptr += inc_x;
|
||||
FLOAT xtemp1 = *x_ptr * alpha;
|
||||
|
||||
FLOAT *aj = a_ptr;
|
||||
y_ptr = y;
|
||||
|
||||
if ( lda == 2 && inc_y == 1 )
|
||||
{
|
||||
|
||||
for ( j=0; j< ( n & -4) ; j+=4 )
|
||||
{
|
||||
y_ptr[j] += aj[0] * xtemp0 + aj[1] * xtemp1 ;
|
||||
y_ptr[j+1] += aj[2] * xtemp0 + aj[3] * xtemp1 ;
|
||||
y_ptr[j+2] += aj[4] * xtemp0 + aj[5] * xtemp1 ;
|
||||
y_ptr[j+3] += aj[6] * xtemp0 + aj[7] * xtemp1 ;
|
||||
aj += 8;
|
||||
|
||||
}
|
||||
|
||||
for ( ; j<n; j++ )
|
||||
{
|
||||
y_ptr[j] += aj[0] * xtemp0 + aj[1] * xtemp1 ;
|
||||
aj += 2;
|
||||
}
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
if ( inc_y == 1 )
|
||||
{
|
||||
|
||||
BLASLONG register lda2 = lda << 1;
|
||||
BLASLONG register lda4 = lda << 2;
|
||||
BLASLONG register lda3 = lda2 + lda;
|
||||
|
||||
for ( j=0; j< ( n & -4 ); j+=4 )
|
||||
{
|
||||
|
||||
y_ptr[j] += *aj * xtemp0 + *(aj+1) * xtemp1 ;
|
||||
y_ptr[j+1] += *(aj+lda) * xtemp0 + *(aj+lda+1) * xtemp1 ;
|
||||
y_ptr[j+2] += *(aj+lda2) * xtemp0 + *(aj+lda2+1) * xtemp1 ;
|
||||
y_ptr[j+3] += *(aj+lda3) * xtemp0 + *(aj+lda3+1) * xtemp1 ;
|
||||
aj += lda4;
|
||||
}
|
||||
|
||||
for ( ; j< n ; j++ )
|
||||
{
|
||||
|
||||
y_ptr[j] += *aj * xtemp0 + *(aj+1) * xtemp1 ;
|
||||
aj += lda;
|
||||
}
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
for ( j=0; j<n; j++ )
|
||||
{
|
||||
*y_ptr += *aj * xtemp0 + *(aj+1) * xtemp1 ;
|
||||
y_ptr += inc_y;
|
||||
aj += lda;
|
||||
}
|
||||
}
|
||||
|
||||
}
|
||||
return(0);
|
||||
|
||||
}
|
||||
|
||||
FLOAT xtemp = *x_ptr * alpha;
|
||||
FLOAT *aj = a_ptr;
|
||||
y_ptr = y;
|
||||
if ( lda == 1 && inc_y == 1 )
|
||||
{
|
||||
for ( j=0; j< ( n & -4) ; j+=4 )
|
||||
{
|
||||
y_ptr[j] += aj[j] * xtemp;
|
||||
y_ptr[j+1] += aj[j+1] * xtemp;
|
||||
y_ptr[j+2] += aj[j+2] * xtemp;
|
||||
y_ptr[j+3] += aj[j+3] * xtemp;
|
||||
}
|
||||
for ( ; j<n ; j++ )
|
||||
{
|
||||
y_ptr[j] += aj[j] * xtemp;
|
||||
}
|
||||
|
||||
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
if ( inc_y == 1 )
|
||||
{
|
||||
|
||||
BLASLONG register lda2 = lda << 1;
|
||||
BLASLONG register lda4 = lda << 2;
|
||||
BLASLONG register lda3 = lda2 + lda;
|
||||
for ( j=0; j< ( n & -4 ); j+=4 )
|
||||
{
|
||||
y_ptr[j] += *aj * xtemp;
|
||||
y_ptr[j+1] += *(aj+lda) * xtemp;
|
||||
y_ptr[j+2] += *(aj+lda2) * xtemp;
|
||||
y_ptr[j+3] += *(aj+lda3) * xtemp;
|
||||
aj += lda4 ;
|
||||
}
|
||||
|
||||
for ( ; j<n; j++ )
|
||||
{
|
||||
y_ptr[j] += *aj * xtemp;
|
||||
aj += lda;
|
||||
}
|
||||
|
||||
}
|
||||
else
|
||||
{
|
||||
for ( j=0; j<n; j++ )
|
||||
{
|
||||
*y_ptr += *aj * xtemp;
|
||||
y_ptr += inc_y;
|
||||
aj += lda;
|
||||
}
|
||||
|
||||
}
|
||||
}
|
||||
|
||||
return(0);
|
||||
}
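The row blocking at the top of CNAME (m1, m2, m3 and the `while ( NB == NBMAX )` loop) is compact enough to be easy to misread. The sketch below replays only that arithmetic and records the block sizes the loop would process. row_blocks is a hypothetical helper introduced for illustration; NBMAX is 4096 as in the file.

```c
#include <assert.h>
#include <stddef.h>

#define NBMAX 4096

/* Replays the driver's row-blocking arithmetic for m rows.
 * Writes up to `cap` block sizes into `blocks`, stores the 1..3 leftover
 * rows (handled by the scalar tail code) in *tail, and returns the number
 * of blocks the loop would process. */
static size_t row_blocks(long m, long *blocks, size_t cap, long *tail)
{
    long m3 = m & 3;                   /* rows left for the scalar tail   */
    long m1 = m & -4;                  /* rows handled by the kernels     */
    long m2 = (m & (NBMAX - 1)) - m3;  /* size of the last, partial block */
    long nb = NBMAX;
    size_t count = 0;

    assert(m >= 0);

    while (nb == NBMAX) {
        m1 -= nb;
        if (m1 < 0) {
            if (m2 == 0)
                break;                 /* no partial block remains */
            nb = m2;                   /* last pass, then the loop exits */
        }
        if (count < cap)
            blocks[count] = nb;
        count++;
    }

    *tail = m3;
    return count;
}
```

For example, m = 10003 yields blocks of 4096, 4096 and 1808 rows plus a tail of 3, which adds up to 10003.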
|
||||
|
||||
|
||||
|
|
@ -1,232 +0,0 @@
|
|||
/***************************************************************************
|
||||
Copyright (c) 2014, The OpenBLAS Project
|
||||
All rights reserved.
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in
|
||||
the documentation and/or other materials provided with the
|
||||
distribution.
|
||||
3. Neither the name of the OpenBLAS project nor the names of
|
||||
its contributors may be used to endorse or promote products
|
||||
derived from this software without specific prior written permission.
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
|
||||
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
||||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
|
||||
#include "common.h"
|
||||
|
||||
#if defined(BULLDOZER) || defined(PILEDRIVER)
|
||||
#include "sgemv_t_microk_bulldozer.c"
|
||||
#elif defined(HASWELL)
|
||||
#include "sgemv_t_microk_haswell.c"
|
||||
#else
|
||||
#include "sgemv_t_microk_sandy.c"
|
||||
#endif
|
||||
|
||||
static void copy_x(BLASLONG n, FLOAT *src, FLOAT *dest, BLASLONG inc_src)
|
||||
{
|
||||
BLASLONG i;
|
||||
for ( i=0; i<n; i++ )
|
||||
{
|
||||
*dest = *src;
|
||||
dest++;
|
||||
src += inc_src;
|
||||
}
|
||||
}
|
||||
|
||||
static void sgemv_kernel_1( BLASLONG n, FLOAT alpha, FLOAT *a, BLASLONG lda, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
|
||||
FLOAT register temp0 = 0.0;
|
||||
BLASLONG i;
|
||||
for ( i=0; i<n ; i++)
|
||||
{
|
||||
temp0 += a[i] * x[i];
|
||||
}
|
||||
temp0 *= alpha ;
|
||||
*y += temp0;
|
||||
}
|
||||
|
||||
|
||||
|
||||
|
||||
int CNAME(BLASLONG m, BLASLONG n, BLASLONG dummy1, FLOAT alpha, FLOAT *a, BLASLONG lda, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y, FLOAT *buffer)
|
||||
{
|
||||
BLASLONG i;
|
||||
BLASLONG j;
|
||||
FLOAT *a_ptr;
|
||||
FLOAT *x_ptr;
|
||||
FLOAT *y_ptr;
|
||||
FLOAT *a_ptrl;
|
||||
BLASLONG m1;
|
||||
BLASLONG register m2;
|
||||
FLOAT *xbuffer;
|
||||
xbuffer = buffer;
|
||||
BLASLONG register Mblock;
|
||||
|
||||
m1 = m / 1024 ;
|
||||
m2 = m % 1024 ;
|
||||
|
||||
x_ptr = x;
|
||||
a_ptr = a;
|
||||
|
||||
for (j=0; j<m1; j++)
|
||||
{
|
||||
|
||||
if ( inc_x == 1 )
|
||||
xbuffer = x_ptr;
|
||||
else
|
||||
copy_x(1024,x_ptr,xbuffer,inc_x);
|
||||
|
||||
y_ptr = y;
|
||||
a_ptrl = a_ptr;
|
||||
|
||||
for(i = 0; i<n; i++ )
|
||||
{
|
||||
sgemv_kernel_16(1024,alpha,a_ptrl,lda,xbuffer,y_ptr);
|
||||
y_ptr += inc_y;
|
||||
a_ptrl += lda;
|
||||
}
|
||||
a_ptr += 1024;
|
||||
x_ptr += 1024 * inc_x;
|
||||
}
|
||||
|
||||
if ( m2 == 0 ) return(0);
|
||||
|
||||
Mblock = 512;
|
||||
while ( Mblock >= 16 )
|
||||
{
|
||||
if ( m2 & Mblock)
|
||||
{
|
||||
|
||||
if ( inc_x == 1 )
|
||||
xbuffer = x_ptr;
|
||||
else
|
||||
copy_x(Mblock,x_ptr,xbuffer,inc_x);
|
||||
|
||||
y_ptr = y;
|
||||
a_ptrl = a_ptr;
|
||||
|
||||
for(i = 0; i<n; i++ )
|
||||
{
|
||||
sgemv_kernel_16(Mblock,alpha,a_ptrl,lda,xbuffer,y_ptr);
|
||||
y_ptr += inc_y;
|
||||
a_ptrl += lda;
|
||||
}
|
||||
a_ptr += Mblock;
|
||||
x_ptr += Mblock * inc_x;
|
||||
|
||||
|
||||
}
|
||||
Mblock /= 2;
|
||||
|
||||
}
|
||||
|
||||
if ( m2 & Mblock)
|
||||
{
|
||||
|
||||
if ( inc_x == 1 )
|
||||
xbuffer = x_ptr;
|
||||
else
|
||||
copy_x(Mblock,x_ptr,xbuffer,inc_x);
|
||||
|
||||
y_ptr = y;
|
||||
a_ptrl = a_ptr;
|
||||
|
||||
for(i = 0; i<n; i++ )
|
||||
{
|
||||
sgemv_kernel_1(Mblock,alpha,a_ptrl,lda,xbuffer,y_ptr);
|
||||
y_ptr += inc_y;
|
||||
a_ptrl += lda;
|
||||
}
|
||||
a_ptr += Mblock;
|
||||
x_ptr += Mblock * inc_x;
|
||||
|
||||
|
||||
}
|
||||
Mblock /= 2;
|
||||
|
||||
|
||||
if ( m2 & Mblock)
|
||||
{
|
||||
|
||||
if ( inc_x == 1 )
|
||||
xbuffer = x_ptr;
|
||||
else
|
||||
copy_x(Mblock,x_ptr,xbuffer,inc_x);
|
||||
|
||||
y_ptr = y;
|
||||
a_ptrl = a_ptr;
|
||||
|
||||
for(i = 0; i<n; i++ )
|
||||
{
|
||||
sgemv_kernel_1(Mblock,alpha,a_ptrl,lda,xbuffer,y_ptr);
|
||||
y_ptr += inc_y;
|
||||
a_ptrl += lda;
|
||||
}
|
||||
a_ptr += Mblock;
|
||||
x_ptr += Mblock * inc_x;
|
||||
|
||||
|
||||
}
|
||||
Mblock /= 2;
|
||||
|
||||
if ( m2 & Mblock)
|
||||
{
|
||||
|
||||
if ( inc_x == 1 )
|
||||
xbuffer = x_ptr;
|
||||
else
|
||||
copy_x(Mblock,x_ptr,xbuffer,inc_x);
|
||||
|
||||
y_ptr = y;
|
||||
a_ptrl = a_ptr;
|
||||
|
||||
for(i = 0; i<n; i++ )
|
||||
{
|
||||
sgemv_kernel_1(Mblock,alpha,a_ptrl,lda,xbuffer,y_ptr);
|
||||
y_ptr += inc_y;
|
||||
a_ptrl += lda;
|
||||
}
|
||||
a_ptr += Mblock;
|
||||
x_ptr += Mblock * inc_x;
|
||||
|
||||
|
||||
}
|
||||
Mblock /= 2;
|
||||
|
||||
if ( m2 & Mblock)
|
||||
{
|
||||
|
||||
xbuffer = x_ptr;
|
||||
|
||||
y_ptr = y;
|
||||
a_ptrl = a_ptr;
|
||||
|
||||
for(i = 0; i<n; i++ )
|
||||
{
|
||||
sgemv_kernel_1(Mblock,alpha,a_ptrl,lda,xbuffer,y_ptr);
|
||||
y_ptr += inc_y;
|
||||
a_ptrl += lda;
|
||||
}
|
||||
|
||||
|
||||
}
|
||||
|
||||
return(0);
|
||||
}
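For comparison with the new driver, the removed driver above splits the rows differently: full blocks of 1024, then the remainder decomposed bit by bit from 512 down to 1. The sketch below lists the chunk sizes that scheme produces. old_row_chunks is a hypothetical helper written for illustration and does not appear in the removed file.

```c
#include <assert.h>
#include <stddef.h>

/* Lists the row-chunk sizes used by the removed driver for m rows:
 * m / 1024 chunks of 1024, then one chunk per set bit of m % 1024,
 * from 512 down to 1. Returns the number of chunks. */
static size_t old_row_chunks(long m, long *chunks, size_t cap)
{
    long m1 = m / 1024;
    long m2 = m % 1024;
    long mblock, j;
    size_t count = 0;

    assert(m >= 0);

    for (j = 0; j < m1; j++) {
        if (count < cap)
            chunks[count] = 1024;
        count++;
    }

    for (mblock = 512; mblock >= 1; mblock /= 2) {
        if (m2 & mblock) {
            if (count < cap)
                chunks[count] = mblock;
            count++;
        }
    }

    return count;
}
```

With m = 2600 this gives 1024, 1024, 512, 32 and 8. Each chunk walks all n columns again, so the old scheme could make up to ten extra passes over y for the remainder, where the new driver makes one partial block pass plus a short scalar tail.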
|
||||
|
||||
|
||||
|
|
@ -25,10 +25,10 @@ OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
|||
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
*****************************************************************************/
|
||||
|
||||
#define HAVE_KERNEL_16x4 1
|
||||
static void sgemv_kernel_16x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y) __attribute__ ((noinline));
|
||||
#define HAVE_KERNEL_4x4 1
|
||||
static void sgemv_kernel_4x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y) __attribute__ ((noinline));
|
||||
|
||||
static void sgemv_kernel_16x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y)
|
||||
static void sgemv_kernel_4x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y)
|
||||
{
|
||||
|
||||
BLASLONG register i = 0;
|
||||
|
|
@ -40,38 +40,76 @@ static void sgemv_kernel_16x4( BLASLONG n, FLOAT **ap, FLOAT *x, FLOAT *y)
|
|||
"vxorps %%xmm6, %%xmm6, %%xmm6 \n\t"
|
||||
"vxorps %%xmm7, %%xmm7, %%xmm7 \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"vmovups (%2,%0,4), %%xmm12 \n\t" // 4 * x
|
||||
"testq $0x04, %1 \n\t"
|
||||
"jz .L08LABEL%= \n\t"
|
||||
|
||||
"vmovups (%2,%0,4), %%xmm12 \n\t" // 4 * x
|
||||
"vfmaddps %%xmm4, (%4,%0,4), %%xmm12, %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, (%5,%0,4), %%xmm12, %%xmm5 \n\t"
|
||||
"vfmaddps %%xmm6, (%6,%0,4), %%xmm12, %%xmm6 \n\t"
|
||||
"vfmaddps %%xmm7, (%7,%0,4), %%xmm12, %%xmm7 \n\t"
|
||||
"addq $4 , %0 \n\t"
|
||||
"subq $4 , %1 \n\t"
|
||||
|
||||
".L08LABEL%=: \n\t"
|
||||
|
||||
"testq $0x08, %1 \n\t"
|
||||
"jz .L16LABEL%= \n\t"
|
||||
|
||||
"vmovups (%2,%0,4), %%xmm12 \n\t" // 4 * x
|
||||
"vmovups 16(%2,%0,4), %%xmm13 \n\t" // 4 * x
|
||||
"vfmaddps %%xmm4, (%4,%0,4), %%xmm12, %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, (%5,%0,4), %%xmm12, %%xmm5 \n\t"
|
||||
"vfmaddps %%xmm6, (%6,%0,4), %%xmm12, %%xmm6 \n\t"
|
||||
"vfmaddps %%xmm7, (%7,%0,4), %%xmm12, %%xmm7 \n\t"
|
||||
"vfmaddps %%xmm4, 16(%4,%0,4), %%xmm13, %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, 16(%5,%0,4), %%xmm13, %%xmm5 \n\t"
|
||||
"vfmaddps %%xmm6, 16(%6,%0,4), %%xmm13, %%xmm6 \n\t"
|
||||
"vfmaddps %%xmm7, 16(%7,%0,4), %%xmm13, %%xmm7 \n\t"
|
||||
|
||||
"addq $8 , %0 \n\t"
|
||||
"subq $8 , %1 \n\t"
|
||||
|
||||
".L16LABEL%=: \n\t"
|
||||
|
||||
"cmpq $0, %1 \n\t"
|
||||
"je .L16END%= \n\t"
|
||||
|
||||
".align 16 \n\t"
|
||||
".L01LOOP%=: \n\t"
|
||||
"vmovups (%2,%0,4), %%xmm12 \n\t" // 4 * x
|
||||
|
||||
"prefetcht0 384(%4,%0,4) \n\t"
|
||||
"vfmaddps %%xmm4, (%4,%0,4), %%xmm12, %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, (%5,%0,4), %%xmm12, %%xmm5 \n\t"
|
||||
"vmovups 16(%2,%0,4), %%xmm13 \n\t" // 4 * x
|
||||
"vmovups 16(%2,%0,4), %%xmm13 \n\t" // 4 * x
|
||||
"vfmaddps %%xmm6, (%6,%0,4), %%xmm12, %%xmm6 \n\t"
|
||||
"vfmaddps %%xmm7, (%7,%0,4), %%xmm12, %%xmm7 \n\t"
|
||||
"prefetcht0 384(%5,%0,4) \n\t"
|
||||
".align 2 \n\t"
|
||||
"vfmaddps %%xmm4, 16(%4,%0,4), %%xmm13, %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, 16(%5,%0,4), %%xmm13, %%xmm5 \n\t"
|
||||
"vmovups 32(%2,%0,4), %%xmm14 \n\t" // 4 * x
|
||||
"vmovups 32(%2,%0,4), %%xmm14 \n\t" // 4 * x
|
||||
"vfmaddps %%xmm6, 16(%6,%0,4), %%xmm13, %%xmm6 \n\t"
|
||||
"vfmaddps %%xmm7, 16(%7,%0,4), %%xmm13, %%xmm7 \n\t"
|
||||
"prefetcht0 384(%6,%0,4) \n\t"
|
||||
".align 2 \n\t"
|
||||
"vfmaddps %%xmm4, 32(%4,%0,4), %%xmm14, %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, 32(%5,%0,4), %%xmm14, %%xmm5 \n\t"
|
||||
"vmovups 48(%2,%0,4), %%xmm15 \n\t" // 4 * x
|
||||
"vmovups 48(%2,%0,4), %%xmm15 \n\t" // 4 * x
|
||||
"vfmaddps %%xmm6, 32(%6,%0,4), %%xmm14, %%xmm6 \n\t"
|
||||
"vfmaddps %%xmm7, 32(%7,%0,4), %%xmm14, %%xmm7 \n\t"
|
||||
"prefetcht0 384(%7,%0,4) \n\t"
|
||||
"vfmaddps %%xmm4, 48(%4,%0,4), %%xmm15, %%xmm4 \n\t"
|
||||
"vfmaddps %%xmm5, 48(%5,%0,4), %%xmm15, %%xmm5 \n\t"
|
||||
"vfmaddps %%xmm6, 48(%6,%0,4), %%xmm15, %%xmm6 \n\t"
|
||||
"vfmaddps %%xmm7, 48(%7,%0,4), %%xmm15, %%xmm7 \n\t"
|
||||
"addq $16, %0 \n\t"
|
||||
"vfmaddps %%xmm5,-16(%5,%0,4), %%xmm15, %%xmm5 \n\t"
|
||||
"vfmaddps %%xmm6,-16(%6,%0,4), %%xmm15, %%xmm6 \n\t"
|
||||
"subq $16, %1 \n\t"
|
||||
"vfmaddps %%xmm7,-16(%7,%0,4), %%xmm15, %%xmm7 \n\t"
|
||||
|
||||
"addq $16, %0 \n\t"
|
||||
"subq $16, %1 \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
"jnz .L01LOOP%= \n\t"
|
||||
|
||||
".L16END%=: \n\t"
|
||||
"vhaddps %%xmm4, %%xmm4, %%xmm4 \n\t"
|
||||
"vhaddps %%xmm5, %%xmm5, %%xmm5 \n\t"
|
||||
"vhaddps %%xmm6, %%xmm6, %%xmm6 \n\t"
|
||||
|
|
@ -1,99 +0,0 @@
/***************************************************************************
Copyright (c) 2014, The OpenBLAS Project
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in
the documentation and/or other materials provided with the
distribution.
3. Neither the name of the OpenBLAS project nor the names of
its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*****************************************************************************/

static void sgemv_kernel_16( long n, float alpha, float *a, long lda, float *x, float *y)
{

	//n = n / 16;

	__asm__ __volatile__
	(
	"movq %0, %%rax\n\t" // n -> rax
	"vmovss %1, %%xmm1\n\t" // alpha -> xmm1
	"movq %2, %%rsi\n\t" // address of a -> rsi
	"movq %3, %%rcx\n\t" // value of lda -> rcx
	"movq %4, %%rdi\n\t" // address of x -> rdi
	"movq %5, %%rdx\n\t" // address of y -> rdx

	"leaq (, %%rcx,4), %%rcx \n\t" // scale lda by size of float
	"leaq (%%rsi,%%rcx,1), %%r8 \n\t" // pointer to next line

	"vxorps %%xmm12, %%xmm12, %%xmm12\n\t" // set to zero
	"vxorps %%xmm13, %%xmm13, %%xmm13\n\t" // set to zero
	"vxorps %%xmm14, %%xmm14, %%xmm14\n\t" // set to zero
	"vxorps %%xmm15, %%xmm15, %%xmm15\n\t" // set to zero

	"sarq $4, %%rax \n\t" // n = n / 16

	".align 16 \n\t"
	".L01LOOP%=: \n\t"
	// "prefetcht0 512(%%rsi) \n\t"
	"prefetcht0 (%%r8) \n\t" // prefetch next line of a
	"vmovups (%%rsi), %%xmm4 \n\t"
	"vmovups 4*4(%%rsi), %%xmm5 \n\t"
	"vmovups 8*4(%%rsi), %%xmm6 \n\t"
	"vmovups 12*4(%%rsi), %%xmm7 \n\t"

	"vfmaddps %%xmm12, 0*4(%%rdi), %%xmm4, %%xmm12\n\t" // multiply a and x and add to temp
	"vfmaddps %%xmm13, 4*4(%%rdi), %%xmm5, %%xmm13\n\t" // multiply a and x and add to temp
	"vfmaddps %%xmm14, 8*4(%%rdi), %%xmm6, %%xmm14\n\t" // multiply a and x and add to temp
	"vfmaddps %%xmm15, 12*4(%%rdi), %%xmm7, %%xmm15\n\t" // multiply a and x and add to temp

	"addq $16*4 , %%r8 \n\t" // increment prefetch pointer
	"addq $16*4 , %%rsi \n\t" // increment pointer of a
	"addq $16*4 , %%rdi \n\t" // increment pointer of x
	"dec %%rax \n\t" // n = n - 1
	"jnz .L01LOOP%= \n\t"

	"vaddps %%xmm12, %%xmm14, %%xmm12\n\t" // combine the four accumulators
	"vaddps %%xmm13, %%xmm15, %%xmm13\n\t"
	"vaddps %%xmm12, %%xmm13, %%xmm12\n\t"
	"vhaddps %%xmm12, %%xmm12, %%xmm12\n\t" // horizontal add of the 4 lanes
	"vhaddps %%xmm12, %%xmm12, %%xmm12\n\t"

	"vfmaddss (%%rdx), %%xmm12, %%xmm1, %%xmm12\n\t" // temp * alpha + y[0]
	"vmovss %%xmm12, (%%rdx) \n\t" // store temp -> y

	:
	:
	"m" (n), // 0
	"m" (alpha), // 1
	"m" (a), // 2
	"m" (lda), // 3
	"m" (x), // 4
	"m" (y) // 5
	: "%rax", "%rcx", "%rdx", "%rsi", "%rdi", "%r8",
	"%xmm0", "%xmm1",
	"%xmm4", "%xmm5", "%xmm6", "%xmm7",
	"%xmm12", "%xmm13", "%xmm14", "%xmm15",
	"memory"
	);

}
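
For readers who do not follow FMA4 assembly, here is a scalar C sketch of what the removed `sgemv_kernel_16` appears to compute: one dot product of a row of `a` with `x`, scaled by `alpha` and accumulated into `y[0]`. It is a reading of the assembly above, not code from the OpenBLAS tree. The name `sgemv_kernel_16_ref` is hypothetical, and the sketch assumes `n` is a positive multiple of 16, as the `sarq $4` / `dec` / `jnz` loop requires. In the assembly, `lda` only feeds the prefetch address, so it is unused here.

```c
/* Hypothetical scalar reference for the removed sgemv_kernel_16.
 * Assumption: n is a positive multiple of 16. */
static void sgemv_kernel_16_ref(long n, float alpha, const float *a,
                                long lda, const float *x, float *y)
{
    float temp = 0.0f;
    long i;

    (void)lda; /* the assembly uses lda only to prefetch the next row */

    /* The assembly keeps four 4-lane accumulators and reduces them
     * after the loop; a single scalar accumulator is used here. */
    for (i = 0; i < n; i++)
        temp += a[i] * x[i];

    /* Matches the final vfmaddss: y[0] = temp * alpha + y[0]. */
    y[0] += alpha * temp;
}
```

Because the vector kernel sums in a different order than this loop, results can differ in the last bits for general inputs.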