From ee25991c40de7e9c3dcff5bb247a4c444c673951 Mon Sep 17 00:00:00 2001
From: floraachy <1622042529@qq.com>
Date: Sat, 19 Oct 2024 13:44:31 +0800
Subject: [PATCH] =?UTF-8?q?38=E5=B2=81=E8=80=81Mac222222=20closing=20#3,4?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 AVX512.md | 287 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 287 insertions(+)
 create mode 100644 AVX512.md

diff --git a/AVX512.md b/AVX512.md
new file mode 100644
index 0000000..dd1130a
--- /dev/null
+++ b/AVX512.md
@@ -0,0 +1,287 @@
+The Go 1.11 release introduces [AVX-512](https://en.wikipedia.org/wiki/AVX-512) support.
+This page describes how to use the new features as well as some important encoder details.
+
+### Terminology
+
+Most terminology comes from the [Intel Software Developer's manual](https://software.intel.com/en-us/articles/intel-sdm).
+Suffixes originate from Go assembler syntax, which is close to AT&T syntax, which also uses size suffixes.
+
+Some terms are listed to avoid ambiguity (for example, opcode can have different meanings).
+
+| Term | Description |
+|------|-------------|
+| Operand | Same as "instruction argument". |
+| Opcode | Name that refers to an instruction group. For example, `VADDPD` is an opcode. It refers to both VEX and EVEX encoded forms and all operand combinations. Most Go assembler opcodes for AVX-512 match Intel manual entries, with exceptions for cases where an additional size suffix is used (e.g. `VCVTTPD2DQY` is `VCVTTPD2DQ`). |
+| Opcode suffix | Suffix that overrides some opcode properties. Listed after "." (dot). For example, `VADDPD.Z` has the "Z" opcode suffix. There can be multiple dot-separated opcode suffixes. |
+| Size suffix | Suffix that specifies the instruction operand size if it can't be inferred from the operands alone. For example, `VCVTSS2USIL` has the "L" size suffix. |
+| Opmask | Used both for the `{k1}` notation and to describe instructions that have `K` register operands. Related to masking support in the EVEX prefix. |
+| Register block | Multi-source operand that encodes a register range. The Intel manual uses `+n` notation for register blocks. For example, `+3` is a register block of 4 registers. |
+| FP | Floating-point |
+
+### New registers
+
+EVEX-enabled instructions can access 16 additional `X` (128-bit xmm) and `Y` (256-bit ymm) registers, plus 32 new `Z` (512-bit zmm) registers in 64-bit mode. 32-bit mode only gets `Z0-Z7`.
+
+New opmask registers are named `K0-K7`.
+They can be used both for masking and for special opmask instructions (like `KADDB`).
+
+### Masking support
+
+Instructions that support masking can omit the `K` register operand.
+In this case, the `K0` register is implied ("all ones") and merging-masking is performed.
+This is effectively "no masking".
+
+`K1-K7` registers can be used to override the default opmask.
+The `K` register should be placed right before the destination operand.
+
+Zeroing-masking can be activated with the `Z` opcode suffix. Zeroing-masking requires that a mask register other than `K0` be specified.
+
+For example, `VADDPD.Z (AX), Z30, K3, Z10` uses zeroing-masking and an explicit `K` register.
+- If the `Z` opcode suffix is removed, it's merging-masking with the `K3` mask.
+- If the `K3` operand is removed (and `Z` is kept), the assembler reports an error.
+- If both the `Z` opcode suffix and the `K3` operand are removed, it is merging-masking with the `K0` mask.
+
+It's a compile-time error to use the `K0` register for `{k1}` operands (consult the [manuals](https://software.intel.com/en-us/articles/intel-sdm) for details).
+
+### EVEX broadcast/rounding/SAE support
+
+Embedded broadcast, rounding and SAE are activated through opcode suffixes.
+
+For reg-reg FP instructions with `{er}` enabled, a rounding opcode suffix can be specified:
+
+* `RU_SAE` to round towards +Inf
+* `RD_SAE` to round towards -Inf
+* `RZ_SAE` to round towards zero
+* `RN_SAE` to round towards nearest
+
+> To read more about rounding modes, see [MXCSR.RC info](http://qcd.phys.cmu.edu/QCDcluster/intel/vtune/reference/vc148.htm).
+
+For reg-reg FP instructions with `{sae}` enabled, exception suppression can be specified with the `SAE` opcode suffix.
+
+For reg-mem instructions with an `m32bcst/m64bcst` operand, broadcasting can be turned on with the `BCST` opcode suffix.
+
+The zeroing opcode suffix can be combined with any of these.
+For example, `VMAXPD.SAE.Z Z3, Z2, K1, Z1` uses both `Z` and `SAE` opcode suffixes.
+The zeroing opcode suffix must come last; otherwise it is a compilation error.
+
+### Register block (multi-source) operands
+
+Register blocks are specified using register range syntax.
+
+It would be enough to specify just the first (low) register, but the Go assembler requires
+an explicit range with both ends for readability reasons.
+
+For example, instructions with a `+3` range can be used like `VP4DPWSSD Z25, [Z0-Z3], (AX)`.
+Range `[Z0-Z3]` reads like "register block of Z0, Z1, Z2, Z3".
+Invalid ranges result in a compilation error.
+
+### AVX1 and AVX2 instructions with EVEX prefix
+
+Previously existing opcodes that can be encoded using the EVEX prefix can now access AVX-512 features like the wider register file, zeroing/merging masking, etc. For example, `VADDPD` can now use 512-bit vector registers.
+
+See [encoder details](#encoder-details) for more info.
+
+### Supported extensions
+
+The best way to get an up-to-date list of supported extensions is to run `ls -1` inside the [test suite](https://github.com/golang/go/tree/master/src/cmd/asm/internal/asm/testdata/avx512enc) directory.
+
+The latest list includes:
+```
+aes_avx512f
+avx512_4fmaps
+avx512_4vnniw
+avx512_bitalg
+avx512_ifma
+avx512_vbmi
+avx512_vbmi2
+avx512_vnni
+avx512_vpopcntdq
+avx512bw
+avx512cd
+avx512dq
+avx512er
+avx512f
+avx512pf
+gfni_avx512f
+vpclmulqdq_avx512f
+```
+
+128-bit and 256-bit instructions additionally require `avx512vl`.
+That is, if `VADDPD` is available in `avx512f`, you can't use `X` and `Y` arguments
+without `avx512vl`.
+
+Filenames follow `GNU as` (gas) conventions.
+[avx512extmap.csv](https://gist.github.com/Quasilyte/92321dadcc3f86b05c1aeda2c13c851f) can make the naming scheme more apparent.
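
The `avx512vl` requirement stated above can be written down as a small lookup. The Go sketch below is only an illustration: `requiredExtensions` and its vector-width parameter are hypothetical names introduced for this example and are not part of the Go assembler or toolchain. It covers packed vector forms only.

```go
package main

import "fmt"

// requiredExtensions is a hypothetical helper, not part of the Go toolchain.
// Given the extension that defines an instruction (for example "avx512f")
// and the vector width in bits of the form being used, it returns the
// extensions that must all be present.
func requiredExtensions(base string, widthBits int) []string {
	exts := []string{base}
	// X (128-bit) and Y (256-bit) vector forms additionally need avx512vl.
	if widthBits == 128 || widthBits == 256 {
		exts = append(exts, "avx512vl")
	}
	return exts
}

func main() {
	fmt.Println(requiredExtensions("avx512f", 512)) // Z operands
	fmt.Println(requiredExtensions("avx512f", 256)) // Y operands
}
```

In this model a `VADDPD` on `Z` registers needs only `avx512f`, while the same opcode on `Y` or `X` registers needs `avx512f` together with `avx512vl`.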
+
+### Instructions with size suffix
+
+Some opcodes do not match Intel manual entries.
+This section is provided for search convenience.
+
+| Intel opcode | Go assembler opcodes |
+|--------------|----------------------|
+| `VCVTPD2DQ` | `VCVTPD2DQX`, `VCVTPD2DQY` |
+| `VCVTPD2PS` | `VCVTPD2PSX`, `VCVTPD2PSY` |
+| `VCVTTPD2DQ` | `VCVTTPD2DQX`, `VCVTTPD2DQY` |
+| `VCVTQQ2PS` | `VCVTQQ2PSX`, `VCVTQQ2PSY` |
+| `VCVTUQQ2PS` | `VCVTUQQ2PSX`, `VCVTUQQ2PSY` |
+| `VCVTPD2UDQ` | `VCVTPD2UDQX`, `VCVTPD2UDQY` |
+| `VCVTTPD2UDQ` | `VCVTTPD2UDQX`, `VCVTTPD2UDQY` |
+| `VFPCLASSPD` | `VFPCLASSPDX`, `VFPCLASSPDY`, `VFPCLASSPDZ` |
+| `VFPCLASSPS` | `VFPCLASSPSX`, `VFPCLASSPSY`, `VFPCLASSPSZ` |
+| `VCVTSD2SI` | `VCVTSD2SI`, `VCVTSD2SIQ` |
+| `VCVTTSD2SI` | `VCVTTSD2SI`, `VCVTTSD2SIQ` |
+| `VCVTTSS2SI` | `VCVTTSS2SI`, `VCVTTSS2SIQ` |
+| `VCVTSS2SI` | `VCVTSS2SI`, `VCVTSS2SIQ` |
+| `VCVTSD2USI` | `VCVTSD2USIL`, `VCVTSD2USIQ` |
+| `VCVTSS2USI` | `VCVTSS2USIL`, `VCVTSS2USIQ` |
+| `VCVTTSD2USI` | `VCVTTSD2USIL`, `VCVTTSD2USIQ` |
+| `VCVTTSS2USI` | `VCVTTSS2USIL`, `VCVTTSS2USIQ` |
+| `VCVTUSI2SD` | `VCVTUSI2SDL`, `VCVTUSI2SDQ` |
+| `VCVTUSI2SS` | `VCVTUSI2SSL`, `VCVTUSI2SSQ` |
+| `VCVTSI2SD` | `VCVTSI2SDL`, `VCVTSI2SDQ` |
+| `VCVTSI2SS` | `VCVTSI2SSL`, `VCVTSI2SSQ` |
+| `ANDN` | `ANDNL`, `ANDNQ` |
+| `BEXTR` | `BEXTRL`, `BEXTRQ` |
+| `BLSI` | `BLSIL`, `BLSIQ` |
+| `BLSMSK` | `BLSMSKL`, `BLSMSKQ` |
+| `BLSR` | `BLSRL`, `BLSRQ` |
+| `BZHI` | `BZHIL`, `BZHIQ` |
+| `MULX` | `MULXL`, `MULXQ` |
+| `PDEP` | `PDEPL`, `PDEPQ` |
+| `PEXT` | `PEXTL`, `PEXTQ` |
+| `RORX` | `RORXL`, `RORXQ` |
+| `SARX` | `SARXL`, `SARXQ` |
+| `SHLX` | `SHLXL`, `SHLXQ` |
+| `SHRX` | `SHRXL`, `SHRXQ` |
+
+### Encoder details
+
+Bitwise comparison with the older encoder may fail for VEX-encoded instructions due to a slightly different order of encoder tables.
+
+This difference may arise in the reg-reg case for instructions that have both `{reg, reg/mem}` and `{reg/mem, reg}` forms. One such instruction is `VMOVUPS`.
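
To make the two reg-reg forms concrete, the Go sketch below builds both VEX encodings of a 128-bit `VMOVUPS` between two of the low eight `X` registers. It is a minimal illustration under stated assumptions: `modRM` and `vmovupsRegReg` are hypothetical helpers written for this page, only registers `X0-X7` are handled (so the two-byte `C5` VEX prefix is enough), and nothing here claims which of the two forms a particular Go release emits.

```go
package main

import "fmt"

// modRM builds a register-direct ModRM byte (mod=11) from the low three
// bits of the reg and rm register numbers.
func modRM(reg, rm byte) byte {
	return 0xC0 | (reg&7)<<3 | (rm & 7)
}

// vmovupsRegReg is a hypothetical helper that returns both valid VEX
// encodings of a 128-bit reg-reg VMOVUPS for registers X0-X7.
// Arguments follow Go operand order: VMOVUPS src, dst.
func vmovupsRegReg(src, dst byte) (loadForm, storeForm []byte) {
	// Two-byte VEX prefix: 0xC5, then inverted R=1, unused vvvv=1111, L=0, pp=00.
	const vex0, vex1 = 0xC5, 0xF8
	// Opcode 0x10 is the {reg, reg/mem} form: ModRM.reg holds the destination.
	loadForm = []byte{vex0, vex1, 0x10, modRM(dst, src)}
	// Opcode 0x11 is the {reg/mem, reg} form: ModRM.reg holds the source.
	storeForm = []byte{vex0, vex1, 0x11, modRM(src, dst)}
	return loadForm, storeForm
}

func main() {
	load, store := vmovupsRegReg(1, 2) // VMOVUPS X1, X2
	fmt.Printf("% x\n", load)          // c5 f8 10 d1
	fmt.Printf("% x\n", store)         // c5 f8 11 ca
}
```

Both byte sequences are four bytes long and copy `X1` into `X2`; an encoder that consults its tables in a different order can legitimately pick either one.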
+
+This encoding difference does not affect code behavior, nor does it make the code bigger or less efficient.
+The new encoding selection scheme is borrowed from [Intel XED](https://github.com/intelxed/xed).
+
+EVEX encoding is used when any of the following is true:
+
+* Instruction uses new registers (high 16 `X`/`Y`, `Z` or `K` registers)
+* Instruction uses EVEX-related opcode suffixes like `BCST`
+* Instruction uses an operand combination that is only available for AVX-512
+
+In all other cases VEX encoding is used.
+This means that VEX is used whenever possible, and EVEX whenever required.
+
+Compressed disp8 is applied whenever possible for EVEX-encoded instructions.
+This also covers broadcasting disp8, which sometimes has a different N multiplier.
+
+Experienced readers can inspect [avx_optabs.go](https://github.com/golang/go/blob/master/src/cmd/internal/obj/x86/avx_optabs.go) to learn about N multipliers for any instruction.
+
+For example, `VADDPD` has these:
+* `N=64` for 512-bit form; `N=8` when broadcasting
+* `N=32` for 256-bit form; `N=8` when broadcasting
+* `N=16` for 128-bit form; `N=8` when broadcasting
+
+### Examples
+
+An exhaustive set of examples can be found in the Go assembler [test suite](https://github.com/golang/go/tree/master/src/cmd/asm/internal/asm/testdata/avx512enc).
+
+Each file provides several examples for every supported instruction form in a particular AVX-512 extension.
+Every example also includes the generated machine code.
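
As a first small example, the compressed disp8 rule from the encoder details can be sketched in Go. This is a simplified model, not the assembler's code: `compressDisp8` is a hypothetical function written for this page, and it assumes the usual EVEX rule that a displacement fits into one byte when it is a multiple of `N` and the quotient fits in a signed 8-bit value.

```go
package main

import "fmt"

// compressDisp8 is a hypothetical helper. It reports whether displacement
// disp can be stored as an EVEX compressed disp8 with multiplier n, and
// returns the stored byte value if so.
func compressDisp8(disp, n int32) (int8, bool) {
	// The displacement must be an exact multiple of N.
	if n <= 0 || disp%n != 0 {
		return 0, false
	}
	// The scaled value must fit in a signed byte.
	q := disp / n
	if q < -128 || q > 127 {
		return 0, false
	}
	return int8(q), true
}

func main() {
	fmt.Println(compressDisp8(64, 64))   // 512-bit VADDPD: stored as 1
	fmt.Println(compressDisp8(64, 8))    // broadcasting form: stored as 8
	fmt.Println(compressDisp8(65, 64))   // not a multiple of N: falls back to disp32
	fmt.Println(compressDisp8(8192, 64)) // quotient 128 is out of range: disp32
}
```

With the `VADDPD` multipliers listed above, the same memory offset of 64 bytes is stored as `1` in the plain 512-bit form and as `8` when `BCST` is used.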
+
+Here is the "Vectorized Histogram Update Using AVX-512CD" example, adapted from the
+[Intel® Optimization Manual](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf):
+
+```go
+for i := 0; i < 512; i++ {
+	histo[key[i]] += 1
+}
+```
+
+```asm
+top:
+	VMOVUPS 0x40(SP)(DX*4), Z4 //; vmovups zmm4, [rsp+rdx*4+0x40]
+	VPXORD Z1, Z1, Z1 //; vpxord zmm1, zmm1, zmm1
+	KMOVW K1, K2 //; kmovw k2, k1
+	VPCONFLICTD Z4, Z2 //; vpconflictd zmm2, zmm4
+	VPGATHERDD (AX)(Z4*4), K2, Z1 //; vpgatherdd zmm1{k2}, [rax+zmm4*4]
+	VPTESTMD histo<>(SB), Z2, K0 //; vptestmd k0, zmm2, [rip+0x185c]
+	KMOVW K0, CX //; kmovw ecx, k0
+	VPADDD Z0, Z1, Z3 //; vpaddd zmm3, zmm1, zmm0
+	TESTL CX, CX //; test ecx, ecx
+	JZ noConflicts //; jz noConflicts
+	VMOVUPS histo<>(SB), Z1 //; vmovups zmm1, [rip+0x1884]
+	VPTESTMD histo<>(SB), Z2, K0 //; vptestmd k0, zmm2, [rip+0x18ba]
+	VPLZCNTD Z2, Z5 //; vplzcntd zmm5, zmm2
+	XORB BX, BX //; xor bl, bl
+	KMOVW K0, CX //; kmovw ecx, k0
+	VPSUBD Z5, Z1, Z1 //; vpsubd zmm1, zmm1, zmm5
+	VPSUBD Z5, Z1, Z1 //; vpsubd zmm1, zmm1, zmm5
+
+resolveConflicts:
+	VPBROADCASTD CX, Z5 //; vpbroadcastd zmm5, ecx
+	KMOVW CX, K2 //; kmovw k2, ecx
+	VPERMD Z3, Z1, K2, Z3 //; vpermd zmm3{k2}, zmm1, zmm3
+	VPADDD Z0, Z3, K2, Z3 //; vpaddd zmm3{k2}, zmm3, zmm0
+	VPTESTMD Z2, Z5, K2, K0 //; vptestmd k0{k2}, zmm5, zmm2
+	KMOVW K0, SI //; kmovw esi, k0
+	ANDL SI, CX //; and ecx, esi
+	JZ noConflicts //; jz noConflicts
+	ADDB $1, BX //; add bl, 0x1
+	CMPB BX, $16 //; cmp bl, 0x10
+	JB resolveConflicts //; jb resolveConflicts
+
+noConflicts:
+	KMOVW K1, K2 //; kmovw k2, k1
+	VPSCATTERDD Z3, K2, (AX)(Z4*4) //; vpscatterdd [rax+zmm4*4]{k2}, zmm3
+	ADDL $16, DX //; add edx, 0x10
+	CMPL DX, $1024 //; cmp edx, 0x400
+	JB top //; jb top
+```
\ No newline at end of file