This commit is contained in:
erenup 2021-09-02 09:42:23 +08:00
parent 783874049b
commit 3bd0dd768b
7 changed files with 4600 additions and 5 deletions

View File

@ -22,8 +22,7 @@ Natural Language Processing with transformers.
- 蔡杰 (Peking University): Chapter 4
- hlzhang (McGill University): Chapter 4
- 台运鹏: Chapter 2
Others:
- 张红旭: Chapter 2
This project summarizes and builds on many excellent documents and tutorials; the sources are credited in each chapter. If anything here infringes your rights, please contact the project members promptly. Thank you. [Give the project a Star on GitHub](https://github.com/datawhalechina/learn-nlp-with-transformers) before you start studying; it will make the learning twice as effective 😄. Thanks.
@ -35,7 +34,8 @@ Natural Language Processing with transformers.
## 篇章2-Transformer相关原理
* [2.1-图解attention](./篇章2-Transformer相关原理/2.1-图解attention.md)
* [2.2-图解transformer](./篇章2-Transformer相关原理/2.2-图解transformer.md)
* [2.2.1-Pytorch编写完整的Transformer](./篇章2-Transformer相关原理/2.2.1-Pytorch编写完整的Transformer.md)
* [2.2.1-Pytorch编写Transformer.md](./篇章2-Transformer相关原理/2.2.1-Pytorch编写Transformer.md)
* [2.2.2-Pytorch编写Transformer-选读.md](./篇章2-Transformer相关原理/2.2.2-Pytorch编写Transformer-选读.md)
* [2.3-图解BERT](./篇章2-Transformer相关原理/2.3-图解BERT.md)
* [2.4-图解GPT](./篇章2-Transformer相关原理/2.4-图解GPT.md)
* [2.5-篇章小测](./篇章2-Transformer相关原理/2.5-篇章小测.md)

View File

@ -5,7 +5,8 @@
[篇章2-Transformer相关原理](./篇章2-Transformer相关原理/2.0-前言.md)
* [2.1-图解attention](./篇章2-Transformer相关原理/2.1-图解attention.md)
* [2.2-图解transformer](./篇章2-Transformer相关原理/2.2-图解transformer.md)
* [2.2.1-Pytorch编写完整的Transformer](./篇章2-Transformer相关原理/2.2.1-Pytorch编写完整的Transformer.md)
* [2.2.1-Pytorch编写Transformer.md](./篇章2-Transformer相关原理/2.2.1-Pytorch编写Transformer.md)
* [2.2.2-Pytorch编写Transformer-选读.md](./篇章2-Transformer相关原理/2.2.2-Pytorch编写Transformer-选读.md)
* [2.3-图解BERT](./篇章2-Transformer相关原理/2.3-图解BERT.md)
* [2.4-图解GPT](./篇章2-Transformer相关原理/2.4-图解GPT.md)
* [2.5-篇章小测](./篇章2-Transformer相关原理/2.5-篇章小测.md)

View File

@ -2,7 +2,8 @@
This chapter takes a deep dive into the principles behind the Transformer, covering attention, the Transformer architecture, and two classic models built on it: BERT and GPT.
* [2.1-图解attention](./篇章2-Transformer相关原理/2.1-图解attention.md)
* [2.2-图解transformer](./篇章2-Transformer相关原理/2.2-图解transformer.md)
* [2.2.1-Pytorch编写完整的Transformer](./篇章2-Transformer相关原理/2.2.1-Pytorch编写完整的Transformer.md)
* [2.2.1-Pytorch编写Transformer.md](./篇章2-Transformer相关原理/2.2.1-Pytorch编写Transformer.md)
* [2.2.2-Pytorch编写Transformer-选读.md](./篇章2-Transformer相关原理/2.2.2-Pytorch编写Transformer-选读.md)
* [2.3-图解BERT](./篇章2-Transformer相关原理/2.3-图解BERT.md)
* [2.4-图解GPT](./篇章2-Transformer相关原理/2.4-图解GPT.md)
* [2.5-篇章小测](./篇章2-Transformer相关原理/2.5-篇章小测.md)

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,929 @@
```python
from IPython.display import Image
Image(filename='pictures/transformer.png')
```
![png](2.2.1-Pytorch%E7%BC%96%E5%86%99Transformer_files/2.2.1-Pytorch%E7%BC%96%E5%86%99Transformer_0_0.png)
This article is translated from Harvard NLP's [The Annotated Transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html).
The original was written by researchers at Harvard NLP in early 2018. It presents an "annotated" version of the paper in the form of a line-by-line implementation, reordering the original paper and adding commentary and notes throughout. The notebook for this article can be downloaded from [Chapter 2](https://github.com/datawhalechina/learn-nlp-with-transformers/tree/main/docs/%E7%AF%87%E7%AB%A02-Transformer%E7%9B%B8%E5%85%B3%E5%8E%9F%E7%90%86).
Outline:
- Implementing the Complete Transformer in PyTorch
- Background
- Model Architecture
- Encoder and Decoder
- Encoder
- Decoder
- Attention
- Applications of Attention in the Model
- Position-wise Feed-Forward Networks
- Embeddings and Softmax
- Positional Encoding
- Full Model
- Training
- Batching and Masking
- Training Loop
- Training Data and Batching
- Hardware and Training Time
- Optimizer
- Regularization
- Label Smoothing
- A First Example
- Synthetic Data
- Loss Computation
- Greedy Decoding
- A Real-World Example
- Conclusion
- Acknowledgements
# Preliminaries
```python
# !pip install http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp36-cp36m-linux_x86_64.whl numpy matplotlib spacy torchtext seaborn
```
```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import math, copy, time
from torch.autograd import Variable
import matplotlib.pyplot as plt
import seaborn
seaborn.set_context(context="talk")
%matplotlib inline
```
# Background
For more background on the Transformer, readers can study [Chapter 2.2: The Illustrated Transformer](https://github.com/datawhalechina/learn-nlp-with-transformers/blob/main/docs/%E7%AF%87%E7%AB%A02-Transformer%E7%9B%B8%E5%85%B3%E5%8E%9F%E7%90%86/2.2-%E5%9B%BE%E8%A7%A3transformer.md) in this project.
# Model Architecture
Most sequence-to-sequence (seq2seq) models use an encoder-decoder structure [(cite)](https://arxiv.org/abs/1409.0473). The encoder maps an input sequence $(x_{1},...,x_{n})$ to a continuous representation $z=(z_{1},...,z_{n})$. Given $z$, the decoder generates an output sequence $(y_{1},...,y_{m})$, one element per time step. At each step the model is auto-regressive [(cite)](https://arxiv.org/abs/1308.0850): the previously generated symbols are added to the input when predicting the next one. We start by building an EncoderDecoder class as the skeleton of the seq2seq architecture:
```python
class EncoderDecoder(nn.Module):
"""
基础的Encoder-Decoder结构。
A standard Encoder-Decoder architecture. Base for this and many
other models.
"""
def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
super(EncoderDecoder, self).__init__()
self.encoder = encoder
self.decoder = decoder
self.src_embed = src_embed
self.tgt_embed = tgt_embed
self.generator = generator
def forward(self, src, tgt, src_mask, tgt_mask):
"Take in and process masked src and target sequences."
return self.decode(self.encode(src, src_mask), src_mask,
tgt, tgt_mask)
def encode(self, src, src_mask):
return self.encoder(self.src_embed(src), src_mask)
def decode(self, memory, src_mask, tgt, tgt_mask):
return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)
```
```python
class Generator(nn.Module):
"定义生成器由linear和softmax组成"
"Define standard linear + softmax generation step."
def __init__(self, d_model, vocab):
super(Generator, self).__init__()
self.proj = nn.Linear(d_model, vocab)
def forward(self, x):
return F.log_softmax(self.proj(x), dim=-1)
```
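As a quick sanity check (a small illustrative example added here, not part of the original notebook), the Generator projects $d_{\text{model}}$-dimensional vectors to log-probabilities over the vocabulary, so the probabilities along the last dimension sum to 1:
```python
# Illustrative example: map a batch of 2 sequences x 5 positions x 512 features
# to log-probabilities over a vocabulary of 1000 tokens.
gen = Generator(d_model=512, vocab=1000)
x = torch.randn(2, 5, 512)
log_probs = gen(x)
print(log_probs.shape)           # torch.Size([2, 5, 1000])
print(log_probs.exp().sum(-1))   # each position sums to ~1
```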
The Transformer's encoder and decoder are both built by stacking self-attention layers and fully connected layers, as shown on the left and right halves of the figure below.
```python
Image(filename='./pictures/2-transformer.png')
```
![png](2.2.1-Pytorch%E7%BC%96%E5%86%99Transformer_files/2.2.1-Pytorch%E7%BC%96%E5%86%99Transformer_13_0.png)
## Encoder and Decoder
### Encoder
The encoder is composed of a stack of N = 6 identical layers.
```python
def clones(module, N):
"产生N个完全相同的网络层"
"Produce N identical layers."
return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])
```
```python
class Encoder(nn.Module):
"完整的Encoder包含N层"
def __init__(self, layer, N):
super(Encoder, self).__init__()
self.layers = clones(layer, N)
self.norm = LayerNorm(layer.size)
def forward(self, x, mask):
"每一层的输入是x和mask"
for layer in self.layers:
x = layer(x, mask)
return self.norm(x)
```
Each encoder layer contains a self-attention sub-layer and an FFN sub-layer; each sub-layer uses a residual connection [(cite)](https://arxiv.org/abs/1512.03385) followed by layer normalization [(cite)](https://arxiv.org/abs/1607.06450). Let's implement layer normalization first:
```python
class LayerNorm(nn.Module):
"Construct a layernorm module (See citation for details)."
def __init__(self, features, eps=1e-6):
super(LayerNorm, self).__init__()
self.a_2 = nn.Parameter(torch.ones(features))
self.b_2 = nn.Parameter(torch.zeros(features))
self.eps = eps
def forward(self, x):
mean = x.mean(-1, keepdim=True)
std = x.std(-1, keepdim=True)
return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
```
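A quick illustrative check (not in the original notebook): after this LayerNorm, each feature vector has roughly zero mean and unit standard deviation along the last dimension.
```python
# Illustrative check of LayerNorm: statistics are computed over the last dimension.
ln = LayerNorm(features=8)
x = torch.randn(2, 4, 8) * 5 + 3   # arbitrary scale and shift
y = ln(x)
print(y.mean(-1))                  # close to 0 everywhere
print(y.std(-1))                   # close to 1 everywhere
```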
We denote a sub-layer as $\mathrm{Sublayer}(x)$; the final output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, with dropout [(cite)](http://jmlr.org/papers/v15/srivastava14a.html) applied to the sub-layer output.
To make these residual connections possible, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{\text{model}}=512$.
The SublayerConnection class below wraps a single sub-layer and produces the output that is fed into the next sub-layer:
```python
class SublayerConnection(nn.Module):
"""
A residual connection followed by a layer norm.
Note for code simplicity the norm is first as opposed to last.
"""
def __init__(self, size, dropout):
super(SublayerConnection, self).__init__()
self.norm = LayerNorm(size)
self.dropout = nn.Dropout(dropout)
def forward(self, x, sublayer):
"Apply residual connection to any sublayer with the same size."
return x + self.dropout(sublayer(self.norm(x)))
```
Each encoder layer has two sub-layers: the first is a multi-head self-attention layer, and the second is a simple position-wise fully connected feed-forward network; both are wrapped with the SublayerConnection class.
```python
class EncoderLayer(nn.Module):
"Encoder is made up of self-attn and feed forward (defined below)"
def __init__(self, size, self_attn, feed_forward, dropout):
super(EncoderLayer, self).__init__()
self.self_attn = self_attn
self.feed_forward = feed_forward
self.sublayer = clones(SublayerConnection(size, dropout), 2)
self.size = size
def forward(self, x, mask):
"Follow Figure 1 (left) for connections."
x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
return self.sublayer[1](x, self.feed_forward)
```
### Decoder
The decoder is also composed of a stack of N = 6 identical decoder layers.
```python
class Decoder(nn.Module):
"Generic N layer decoder with masking."
def __init__(self, layer, N):
super(Decoder, self).__init__()
self.layers = clones(layer, N)
self.norm = LayerNorm(layer.size)
def forward(self, x, memory, src_mask, tgt_mask):
for layer in self.layers:
x = layer(x, memory, src_mask, tgt_mask)
return self.norm(x)
```
Compared with a single encoder layer, a decoder layer has a third sub-layer that performs attention over the encoder output (the encoder-decoder attention layer): its query vectors come from the previous decoder layer, while the key and value vectors come from the output of the last encoder layer. As in the encoder, each sub-layer uses a residual connection followed by layer normalization.
```python
class DecoderLayer(nn.Module):
"Decoder is made of self-attn, src-attn, and feed forward (defined below)"
def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
super(DecoderLayer, self).__init__()
self.size = size
self.self_attn = self_attn
self.src_attn = src_attn
self.feed_forward = feed_forward
self.sublayer = clones(SublayerConnection(size, dropout), 3)
def forward(self, x, memory, src_mask, tgt_mask):
"Follow Figure 1 (right) for connections."
m = memory
x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
return self.sublayer[2](x, self.feed_forward)
```
In the self-attention sub-layer of each decoder layer, we need a mask to prevent the current position from attending to subsequent positions.
```python
def subsequent_mask(size):
"Mask out subsequent positions."
attn_shape = (1, size, size)
subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
return torch.from_numpy(subsequent_mask) == 0
```
> The attention mask below shows the positions each target word is allowed to look at. During training, the future information of the current word is masked out, so a word cannot attend to the words that follow it.
```python
plt.figure(figsize=(5,5))
plt.imshow(subsequent_mask(20)[0])
None
```
![svg](2.2.1-Pytorch%E7%BC%96%E5%86%99Transformer_files/2.2.1-Pytorch%E7%BC%96%E5%86%99Transformer_30_0.svg)
### Attention
An attention function maps a query and a set of key-value pairs to an output, where the query, keys, values and output are all vectors. The output is a weighted sum of the values, where the weight assigned to each value is computed from the query and the corresponding key.
We call this particular attention "Scaled Dot-Product Attention". The input consists of queries and keys of dimension $d_k$ and values of dimension $d_v$. We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax to obtain the weights on the values.
```python
Image(filename='./pictures/transformer-self-attention.png')
```
![png](2.2.1-Pytorch%E7%BC%96%E5%86%99Transformer_files/2.2.1-Pytorch%E7%BC%96%E5%86%99Transformer_32_0.png)
In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$; the keys and values are also packed together into matrices $K$ and $V$. The output matrix is:
$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V
$$
```python
def attention(query, key, value, mask=None, dropout=None):
"Compute 'Scaled Dot Product Attention'"
d_k = query.size(-1)
scores = torch.matmul(query, key.transpose(-2, -1)) \
/ math.sqrt(d_k)
if mask is not None:
scores = scores.masked_fill(mask == 0, -1e9)
p_attn = F.softmax(scores, dim = -1)
if dropout is not None:
p_attn = dropout(p_attn)
return torch.matmul(p_attn, value), p_attn
```
The two most commonly used attention functions are:
- additive attention [(cite)](https://arxiv.org/abs/1409.0473)
- dot-product (multiplicative) attention

Except for the scaling factor $\frac{1}{\sqrt{d_k}}$, dot-product attention is identical to an ordinary dot product. Additive attention computes the compatibility with a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented with highly optimized matrix multiplication.
For small values of $d_k$ the two mechanisms perform similarly, while for larger $d_k$ additive attention outperforms dot-product attention without scaling [(cite)](https://arxiv.org/abs/1703.03906). We suspect that for large $d_k$ the dot products grow large in magnitude, pushing the softmax into regions where it has extremely small gradients. (To see why the dot products get large, assume the components of $q$ and $k$ are independent random variables with mean 0 and variance 1; their dot product $q \cdot k = \sum_{i=1}^{d_k} q_ik_i$ then has mean 0 and variance $d_k$.) To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$.
See also Su Jianlin's article [《浅谈Transformer的初始化、参数化与标准化》](https://zhuanlan.zhihu.com/p/400925524?utm_source=wechat_session&utm_medium=social&utm_oi=1400823417357139968&utm_campaign=shareopn) (on the initialization, parameterization and normalization of Transformers) for a discussion of why dividing by $\sqrt{d}$ in attention matters so much.
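The following small numerical experiment (an illustration added here, using synthetic random vectors) shows both effects: the variance of the raw dot product grows roughly like $d_k$, and without the $\frac{1}{\sqrt{d_k}}$ scaling the softmax collapses to a nearly one-hot distribution.
```python
# Illustrative experiment: variance of q·k grows with d_k, and unscaled
# scores saturate the softmax.
d_k = 512
q = torch.randn(1000, d_k)
k = torch.randn(1000, d_k)
dots = (q * k).sum(-1)
print(dots.var())                                   # roughly d_k = 512
scores = torch.randn(1, 10) * math.sqrt(d_k)        # typical unscaled magnitude
print(F.softmax(scores, dim=-1))                    # nearly one-hot: tiny gradients
print(F.softmax(scores / math.sqrt(d_k), dim=-1))   # scaled: much smoother
```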
```python
Image(filename='pictures/transformer-linear.png')
```
![png](2.2.1-Pytorch%E7%BC%96%E5%86%99Transformer_files/2.2.1-Pytorch%E7%BC%96%E5%86%99Transformer_37_0.png)
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions; with only a single attention head, the representational power of the resulting vectors is reduced.
$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head_1}, ..., \mathrm{head_h})W^O \\
\text{where}~\mathrm{head_i} = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i)
$$
where the projections are parameter matrices $W^Q_i \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W^K_i \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W^V_i \in \mathbb{R}^{d_{\text{model}} \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$.
In this work we employ $h=8$ parallel attention layers, or heads. For each head we use $d_k=d_v=d_{\text{model}}/h=64$. Because the dimension of each head is reduced, the total computational cost is similar to that of single-head attention with full dimensionality.
```python
class MultiHeadedAttention(nn.Module):
def __init__(self, h, d_model, dropout=0.1):
"Take in model size and number of heads."
super(MultiHeadedAttention, self).__init__()
assert d_model % h == 0
# We assume d_v always equals d_k
self.d_k = d_model // h
self.h = h
self.linears = clones(nn.Linear(d_model, d_model), 4)
self.attn = None
self.dropout = nn.Dropout(p=dropout)
def forward(self, query, key, value, mask=None):
"Implements Figure 2"
if mask is not None:
# Same mask applied to all h heads.
mask = mask.unsqueeze(1)
nbatches = query.size(0)
# 1) Do all the linear projections in batch from d_model => h x d_k
query, key, value = \
[l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
for l, x in zip(self.linears, (query, key, value))]
# 2) Apply attention on all the projected vectors in batch.
x, self.attn = attention(query, key, value, mask=mask,
dropout=self.dropout)
# 3) "Concat" using a view and apply a final linear.
x = x.transpose(1, 2).contiguous() \
.view(nbatches, -1, self.h * self.d_k)
return self.linears[-1](x)
```
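A small illustrative shape check (not part of the original notebook): with $h=8$ heads and $d_{\text{model}}=512$, multi-head attention preserves the (batch, seq_len, d_model) shape of its input, and the causal mask from subsequent_mask can be passed in directly.
```python
# Illustrative shape check for MultiHeadedAttention.
mha = MultiHeadedAttention(h=8, d_model=512)
x = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
mask = subsequent_mask(10)    # (1, 10, 10) causal mask
out = mha(x, x, x, mask=mask)
print(out.shape)              # torch.Size([2, 10, 512])
```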
### Applications of Attention in the Model
Multi-head attention is used in three different ways in the Transformer:
- In the encoder-decoder attention layers, the queries come from the previous decoder layer, while the keys and values come from the encoder output. This allows every position in the decoder to attend over all positions in the input sequence, mimicking the typical encoder-decoder attention mechanism in seq2seq models such as [(cite)](https://arxiv.org/abs/1609.08144).
- The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, namely the output of the previous encoder layer. Each position in the encoder can therefore attend to all positions in the previous encoder layer.
- Similarly, self-attention layers in the decoder allow each position to attend to all positions in the decoder up to and including that position. To preserve the auto-regressive property, we must prevent leftward information flow in the decoder. We implement this inside scaled dot-product attention by masking out (setting to $-\infty$) all illegal connections in the input of the softmax.
### Position-wise Feed-Forward Networks
In addition to the attention sub-layers, each layer of our encoder and decoder contains a fully connected feed-forward network, applied at the same position in every layer (at the end of each encoder or decoder layer). It consists of two linear transformations with a ReLU activation in between.
$$\mathrm{FFN}(x)=\max(0, xW_1 + b_1) W_2 + b_2$$
While the linear transformations are the same across positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is $d_{\text{model}}=512$, and the inner layer has dimensionality $d_{ff}=2048$: the first layer maps 512 dimensions to 2048, and the second maps 2048 back to 512.
```python
class PositionwiseFeedForward(nn.Module):
"Implements FFN equation."
def __init__(self, d_model, d_ff, dropout=0.1):
super(PositionwiseFeedForward, self).__init__()
self.w_1 = nn.Linear(d_model, d_ff)
self.w_2 = nn.Linear(d_ff, d_model)
self.dropout = nn.Dropout(dropout)
def forward(self, x):
return self.w_2(self.dropout(F.relu(self.w_1(x))))
```
## Embeddings and Softmax
Similarly to other seq2seq models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{\text{model}}$. We also use the usual linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, the two embedding layers and the pre-softmax linear transformation share the same weight matrix, similar to [(cite)](https://arxiv.org/abs/1608.05859). In the embedding layers, we multiply those weights by $\sqrt{d_{\text{model}}}$.
```python
class Embeddings(nn.Module):
def __init__(self, d_model, vocab):
super(Embeddings, self).__init__()
self.lut = nn.Embedding(vocab, d_model)
self.d_model = d_model
def forward(self, x):
return self.lut(x) * math.sqrt(self.d_model)
```
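Note that the classes in this notebook do not actually tie these weight matrices. Below is a minimal sketch (an illustration added here, not part of the original code) of how the embedding matrix could be shared with the pre-softmax projection, relying on the fact that nn.Embedding(vocab, d_model).weight and nn.Linear(d_model, vocab).weight both have shape (vocab, d_model):
```python
# Minimal weight-tying sketch (illustrative; make_model, defined later,
# keeps separate matrices).
d_model, vocab = 512, 1000
embed = Embeddings(d_model, vocab)
generator = Generator(d_model, vocab)
generator.proj.weight = embed.lut.weight          # share the same Parameter
print(generator.proj.weight is embed.lut.weight)  # True
```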
## Positional Encoding
Since our model contains no recurrence and no convolution, in order for it to make use of the order of the sequence we must inject some information about the relative or absolute position of the tokens. To this end we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{\text{model}}$ as the embeddings, so the two can be summed. There are many possible positional encodings, both learned and fixed [(cite)](https://arxiv.org/pdf/1705.03122.pdf).
In this work we use sine and cosine functions of different frequencies: $$PE_{(pos,2i)} = \sin(pos / 10000^{2i/d_{\text{model}}})$$
$$PE_{(pos,2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}})$$
where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.
In addition, we apply dropout to the sum of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a dropout rate of $P_{drop}=0.1$.
```python
class PositionalEncoding(nn.Module):
"Implement the PE function."
def __init__(self, d_model, dropout, max_len=5000):
super(PositionalEncoding, self).__init__()
self.dropout = nn.Dropout(p=dropout)
# Compute the positional encodings once in log space.
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2) *
-(math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
pe = pe.unsqueeze(0)
self.register_buffer('pe', pe)
def forward(self, x):
x = x + Variable(self.pe[:, :x.size(1)],
requires_grad=False)
return self.dropout(x)
```
> As shown below, the positional encoding adds a sinusoid that depends on the position. The frequency and offset of the wave are different for each dimension.
```python
plt.figure(figsize=(15, 5))
pe = PositionalEncoding(20, 0)
y = pe.forward(Variable(torch.zeros(1, 100, 20)))
plt.plot(np.arange(100), y[0, :, 4:8].data.numpy())
plt.legend(["dim %d"%p for p in [4,5,6,7]])
None
```
![svg](2.2.1-Pytorch%E7%BC%96%E5%86%99Transformer_files/2.2.1-Pytorch%E7%BC%96%E5%86%99Transformer_48_0.svg)
We also experimented with learned positional embeddings [(cite)](https://arxiv.org/pdf/1705.03122.pdf) instead, and found that the two versions produced nearly identical results. We chose the sinusoidal version because it may allow the model to extrapolate to sequences longer than those encountered during training.
## Full Model
> Here we define a function that constructs a full model from hyperparameters.
```python
def make_model(src_vocab, tgt_vocab, N=6,
d_model=512, d_ff=2048, h=8, dropout=0.1):
"Helper: Construct a model from hyperparameters."
c = copy.deepcopy
attn = MultiHeadedAttention(h, d_model)
ff = PositionwiseFeedForward(d_model, d_ff, dropout)
position = PositionalEncoding(d_model, dropout)
model = EncoderDecoder(
Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
Decoder(DecoderLayer(d_model, c(attn), c(attn),
c(ff), dropout), N),
nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
Generator(d_model, tgt_vocab))
# This was important from their code.
# Initialize parameters with Glorot / fan_avg.
for p in model.parameters():
if p.dim() > 1:
            nn.init.xavier_uniform_(p)
return model
```
```python
# Small example model.
tmp_model = make_model(10, 10, 2)
None
```
# Training
This section describes the training regime for our models.
> We quickly introduce some of the tools needed to train a standard encoder-decoder model. First we define a batch object that holds the src and target sentences for training, and that constructs the masks.
## Batching and Masking
```python
class Batch:
"Object for holding a batch of data with mask during training."
def __init__(self, src, trg=None, pad=0):
self.src = src
self.src_mask = (src != pad).unsqueeze(-2)
if trg is not None:
self.trg = trg[:, :-1]
self.trg_y = trg[:, 1:]
self.trg_mask = \
self.make_std_mask(self.trg, pad)
self.ntokens = (self.trg_y != pad).data.sum()
@staticmethod
def make_std_mask(tgt, pad):
"Create a mask to hide padding and future words."
tgt_mask = (tgt != pad).unsqueeze(-2)
tgt_mask = tgt_mask & Variable(
subsequent_mask(tgt.size(-1)).type_as(tgt_mask.data))
return tgt_mask
```
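A small illustrative example (added here) of what Batch builds for two padded sequences with pad id 0: trg is shifted to form the decoder input trg and the prediction target trg_y, src_mask hides padding, and trg_mask additionally hides future positions.
```python
# Illustrative Batch example with pad id 0.
src = torch.LongTensor([[1, 2, 3, 4, 0], [1, 5, 6, 0, 0]])
trg = torch.LongTensor([[1, 7, 8, 2, 0], [1, 9, 2, 0, 0]])
b = Batch(src, trg, pad=0)
print(b.src_mask.shape)   # torch.Size([2, 1, 5])
print(b.trg_mask.shape)   # torch.Size([2, 4, 4])
print(b.ntokens)          # tensor(5): number of non-pad target tokens
```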
> Next we create a generic training and evaluation function that keeps track of the loss. We pass in a generic loss-compute function that also handles the parameter updates.
## Training Loop
```python
def run_epoch(data_iter, model, loss_compute):
"Standard Training and Logging Function"
start = time.time()
total_tokens = 0
total_loss = 0
tokens = 0
for i, batch in enumerate(data_iter):
out = model.forward(batch.src, batch.trg,
batch.src_mask, batch.trg_mask)
loss = loss_compute(out, batch.trg_y, batch.ntokens)
total_loss += loss
total_tokens += batch.ntokens
tokens += batch.ntokens
if i % 50 == 1:
elapsed = time.time() - start
print("Epoch Step: %d Loss: %f Tokens per Sec: %f" %
(i, loss / batch.ntokens, tokens / elapsed))
start = time.time()
tokens = 0
return total_loss / total_tokens
```
## Training Data and Batching
We trained on the standard WMT 2014 English-German dataset, which contains about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding, with a shared source-target vocabulary of about 37,000 tokens. For English-French we used the significantly larger WMT 2014 English-French dataset, consisting of 36 million sentences, with tokens split into a 32,000 word-piece vocabulary.<br>
Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs with approximately 25,000 source tokens and 25,000 target tokens.
> We will use torchtext for batching (discussed in more detail later). Here we create batches in a torchtext function so that the batch size, padded to the maximum length, does not exceed a threshold (25,000 if we have 8 GPUs).
```python
global max_src_in_batch, max_tgt_in_batch
def batch_size_fn(new, count, sofar):
"Keep augmenting batch and calculate total number of tokens + padding."
global max_src_in_batch, max_tgt_in_batch
if count == 1:
max_src_in_batch = 0
max_tgt_in_batch = 0
max_src_in_batch = max(max_src_in_batch, len(new.src))
max_tgt_in_batch = max(max_tgt_in_batch, len(new.trg) + 2)
src_elements = count * max_src_in_batch
tgt_elements = count * max_tgt_in_batch
return max(src_elements, tgt_elements)
```
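To see what batch_size_fn computes, here is a tiny illustrative run with dummy examples (a namedtuple stands in for a torchtext example with .src and .trg fields; this stand-in is an assumption for demonstration only):
```python
# Illustrative run of batch_size_fn with dummy examples.
from collections import namedtuple
Example = namedtuple("Example", ["src", "trg"])
examples = [Example(src=[0] * 12, trg=[0] * 9),
            Example(src=[0] * 7,  trg=[0] * 15)]
sofar = 0
for count, ex in enumerate(examples, start=1):
    sofar = batch_size_fn(ex, count, sofar)
    print(count, sofar)
# 1 12   -> max source length 12 dominates
# 2 34   -> max target length 15 + 2 = 17, times 2 examples
```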
## Hardware and Training Time
We trained our models on one machine with 8 NVIDIA P100 GPUs. With the hyperparameters described in the paper, each training step for the base models took about 0.4 seconds; the base models were trained for a total of 100,000 steps, or about 12 hours. For the big models, each step took 1.0 seconds, and they were trained for 300,000 steps (3.5 days).
## Optimizer
We used the Adam optimizer [(cite)](https://arxiv.org/abs/1412.6980) with $\beta_1=0.9$, $\beta_2=0.98$ and $\epsilon=10^{-9}$. We varied the learning rate over the course of training according to the formula:
$$
lrate = d_{\text{model}}^{-0.5} \cdot
\min({step\_num}^{-0.5},
{step\_num} \cdot {warmup\_steps}^{-1.5})
$$
This corresponds to increasing the learning rate linearly for the first $warmup\_steps$ training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used $warmup\_steps=4000$.
> Note: this part is very important. The model needs to be trained with this setup.
```python
class NoamOpt:
"Optim wrapper that implements rate."
def __init__(self, model_size, factor, warmup, optimizer):
self.optimizer = optimizer
self._step = 0
self.warmup = warmup
self.factor = factor
self.model_size = model_size
self._rate = 0
def step(self):
"Update parameters and rate"
self._step += 1
rate = self.rate()
for p in self.optimizer.param_groups:
p['lr'] = rate
self._rate = rate
self.optimizer.step()
def rate(self, step = None):
"Implement `lrate` above"
if step is None:
step = self._step
return self.factor * \
(self.model_size ** (-0.5) *
min(step ** (-0.5), step * self.warmup ** (-1.5)))
def get_std_opt(model):
return NoamOpt(model.src_embed[0].d_model, 2, 4000,
torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))
```
> Below are example learning-rate curves of this scheduler for different model sizes and optimization hyperparameters.
```python
# Three settings of the lrate hyperparameters.
opts = [NoamOpt(512, 1, 4000, None),
NoamOpt(512, 1, 8000, None),
NoamOpt(256, 1, 4000, None)]
plt.plot(np.arange(1, 20000), [[opt.rate(i) for opt in opts] for i in range(1, 20000)])
plt.legend(["512:4000", "512:8000", "256:4000"])
None
```
![svg](2.2.1-Pytorch%E7%BC%96%E5%86%99Transformer_files/2.2.1-Pytorch%E7%BC%96%E5%86%99Transformer_68_0.svg)
## Regularization
### Label Smoothing
During training we employed label smoothing with value $\epsilon_{ls}=0.1$ [(cite)](https://arxiv.org/abs/1512.00567). Although smoothing the labels makes the model more uncertain, it improves accuracy and BLEU score.
> We implement label smoothing with the KL-divergence loss. Instead of a one-hot target distribution, we create a distribution that puts confidence 1-smoothing on the target word and distributes the remaining probability mass over the rest of the vocabulary.
```python
class LabelSmoothing(nn.Module):
"Implement label smoothing."
def __init__(self, size, padding_idx, smoothing=0.0):
super(LabelSmoothing, self).__init__()
        self.criterion = nn.KLDivLoss(reduction='sum')
self.padding_idx = padding_idx
self.confidence = 1.0 - smoothing
self.smoothing = smoothing
self.size = size
self.true_dist = None
def forward(self, x, target):
assert x.size(1) == self.size
true_dist = x.data.clone()
true_dist.fill_(self.smoothing / (self.size - 2))
true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence)
true_dist[:, self.padding_idx] = 0
mask = torch.nonzero(target.data == self.padding_idx)
if mask.dim() > 0:
true_dist.index_fill_(0, mask.squeeze(), 0.0)
self.true_dist = true_dist
return self.criterion(x, Variable(true_dist, requires_grad=False))
```
Below is an example of what the smoothed target distribution looks like.
```python
#Example of label smoothing.
crit = LabelSmoothing(5, 0, 0.4)
predict = torch.FloatTensor([[0, 0.2, 0.7, 0.1, 0],
[0, 0.2, 0.7, 0.1, 0],
[0, 0.2, 0.7, 0.1, 0]])
v = crit(Variable(predict.log()),
Variable(torch.LongTensor([2, 1, 0])))
# Show the target distributions expected by the system.
plt.imshow(crit.true_dist)
None
```
![svg](2.2.1-Pytorch%E7%BC%96%E5%86%99Transformer_files/2.2.1-Pytorch%E7%BC%96%E5%86%99Transformer_73_1.svg)
```python
print(crit.true_dist)
```
tensor([[0.0000, 0.1333, 0.6000, 0.1333, 0.1333],
[0.0000, 0.6000, 0.1333, 0.1333, 0.1333],
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000]])
Because of label smoothing, the model is penalized if it becomes too confident about a particular choice, i.e. if it outputs an extremely large probability for a single word. As the code below shows, as the input x grows, x/d keeps increasing and 1/d keeps decreasing, yet the loss does not decrease monotonically.
```python
crit = LabelSmoothing(5, 0, 0.1)
def loss(x):
d = x + 3 * 1
predict = torch.FloatTensor([[0, x / d, 1 / d, 1 / d, 1 / d],
])
#print(predict)
return crit(Variable(predict.log()),
Variable(torch.LongTensor([1]))).item()
y = [loss(x) for x in range(1, 100)]
x = np.arange(1, 100)
plt.plot(x, y)
```
[<matplotlib.lines.Line2D at 0x7f7fad46c970>]
![svg](2.2.1-Pytorch%E7%BC%96%E5%86%99Transformer_files/2.2.1-Pytorch%E7%BC%96%E5%86%99Transformer_76_1.svg)
# A First Example
> We can begin by trying out a simple copy task: given a random set of input symbols from a small vocabulary, the goal is to generate back those same symbols.
## Synthetic Data
```python
def data_gen(V, batch, nbatches):
"Generate random data for a src-tgt copy task."
for i in range(nbatches):
data = torch.from_numpy(np.random.randint(1, V, size=(batch, 10)))
data[:, 0] = 1
src = Variable(data, requires_grad=False)
tgt = Variable(data, requires_grad=False)
yield Batch(src, tgt, 0)
```
## Loss Computation
```python
class SimpleLossCompute:
"A simple loss compute and train function."
def __init__(self, generator, criterion, opt=None):
self.generator = generator
self.criterion = criterion
self.opt = opt
def __call__(self, x, y, norm):
x = self.generator(x)
loss = self.criterion(x.contiguous().view(-1, x.size(-1)),
y.contiguous().view(-1)) / norm
loss.backward()
if self.opt is not None:
self.opt.step()
self.opt.optimizer.zero_grad()
return loss.item() * norm
```
## Greedy Decoding
```python
# Train the simple copy task.
V = 11
criterion = LabelSmoothing(size=V, padding_idx=0, smoothing=0.0)
model = make_model(V, V, N=2)
model_opt = NoamOpt(model.src_embed[0].d_model, 1, 400,
torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))
for epoch in range(10):
model.train()
run_epoch(data_gen(V, 30, 20), model,
SimpleLossCompute(model.generator, criterion, model_opt))
model.eval()
print(run_epoch(data_gen(V, 30, 5), model,
SimpleLossCompute(model.generator, criterion, None)))
```
> For simplicity, this code predicts the translation using greedy decoding.
```python
def greedy_decode(model, src, src_mask, max_len, start_symbol):
memory = model.encode(src, src_mask)
ys = torch.ones(1, 1).fill_(start_symbol).type_as(src.data)
for i in range(max_len-1):
out = model.decode(memory, src_mask,
Variable(ys),
Variable(subsequent_mask(ys.size(1))
.type_as(src.data)))
prob = model.generator(out[:, -1])
_, next_word = torch.max(prob, dim = 1)
next_word = next_word.data[0]
ys = torch.cat([ys,
torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=1)
return ys
model.eval()
src = Variable(torch.LongTensor([[1,2,3,4,5,6,7,8,9,10]]) )
src_mask = Variable(torch.ones(1, 1, 10) )
print(greedy_decode(model, src, src_mask, max_len=10, start_symbol=1))
```
tensor([[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
# A Real-World Example
The real-data example in the original notebook requires multi-GPU training, so it is not included in this tutorial for now; interested readers can continue with the [original tutorial](https://nlp.seas.harvard.edu/2018/04/03/attention.html). Note also that the original data URL is no longer valid, so the real-data part of the original tutorial probably cannot be run anymore either.
# Conclusion
At this point we have implemented a complete Transformer line by line and used synthetic data to train it and run predictions. We hope this tutorial is helpful to you.
# Acknowledgements
This article was translated by 张红旭 and organized by 多多; the original notebook comes from Harvard NLP's [The Annotated Transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html).

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,951 @@
# Explaining the Transformer Source Code: PyTorch Edition
After reading [2.2-图解transformer](./篇章2-Transformer相关原理/2.2-图解transformer.md), you should have an intuitive picture of how the various Transformer modules are designed and computed. In this section we implement a Transformer with PyTorch to help you study this complex model further. Unlike 2.2.1, this article builds the Transformer following the order input, then model, then output, for your reference.
**Sections**
- [Word Embeddings](#embed)
- [Positional Encoding](#pos)
- [Multi-Head Attention](#multihead)
- [Building the Transformer](#build)
![](./pictures/0-1-transformer-arc.png)
The Transformer architecture
## **<div id='embed'>Word Embeddings</div>**
As the figure shows, the left half of the Transformer diagram is the Encoder and the right half is the Decoder. The Encoder takes the source-language sequence as input, and the Decoder takes the text to be translated (at training time). A text usually consists of many sequences; a common preprocessing step is to tokenize each sequence into a list whose elements are the smallest indivisible units in the vocabulary, so the whole text becomes one big list whose elements are the token lists of the individual sequences. For example, a sequence might be tokenized into ["am", "##ro", "##zi", "meets", "his", "father"] and then converted to the corresponding vocabulary indices, say [23, 94, 13, 41, 27, 96]. If the whole text contains 100 sentences, the big list has 100 elements; since the sequences have different lengths, we set a maximum length, here 128, so converting the whole text to an array gives shape 100 x 128, corresponding to batch_size and seq_length.
After that comes word embedding, which maps each word to a pre-trained (or learned) vector.
In PyTorch, word embedding is implemented with `torch.nn.Embedding`. Instantiation takes the vocabulary size and the dimension of the mapped vectors, e.g. `embed = nn.Embedding(10,8)`. The dimension is simply how many numbers the vector contains. Note that the first argument is the vocabulary size: if you currently have at most 8 words, you usually pass 10, leaving extra slots for unk and pad; any word outside those 8 is mapped to unk, and the padded part of a sequence is mapped to pad.
If we embed into 8 dimensions (num_features, also called embed_dim), the whole text becomes 100 x 128 x 8. As a small example: suppose the vocabulary has 10 words (including unk and pad), the text has 2 sentences of 4 words each, and we embed each word into an 8-dimensional vector. Then 2, 4, 8 correspond to batch_size, seq_length, embed_dim (if the batch dimension comes first).
Also, deep learning models generally only change num_features, so when we talk about "the dimension" we usually mean the last dimension, where the features live.
Let's start coding.
Import all the required packages:
```python
import torch
import torch.nn as nn
from torch.nn.parameter import Parameter
from torch.nn.init import xavier_uniform_
from torch.nn.init import constant_
from torch.nn.init import xavier_normal_
import torch.nn.functional as F
from typing import Optional, Tuple, Any
from typing import List, Optional, Tuple
import math
import warnings
```
```python
X = torch.zeros((2,4),dtype=torch.long)
embed = nn.Embedding(10,8)
print(embed(X).shape)
```
torch.Size([2, 4, 8])
## **<div id='pos'>Positional Encoding</div>**
Right after word embedding comes positional encoding. Positional encoding is used to distinguish the relationships between different words, and between different features of the same word. In the code, note that X_ is only an intermediate matrix built for the encoding, not the actual input; after the positional encoding is added, a dropout is applied. Since the positional encoding is simply added at the end, the input and output shapes are unchanged.
```python
Tensor = torch.Tensor
def positional_encoding(X, num_features, dropout_p=0.1, max_len=512) -> Tensor:
r'''
给输入加入位置编码
参数:
- num_features: 输入进来的维度
- dropout_p: dropout的概率当其为非零时执行dropout
- max_len: 句子的最大长度默认512
形状:
- 输入: [batch_size, seq_length, num_features]
- 输出: [batch_size, seq_length, num_features]
例子:
>>> X = torch.randn((2,4,10))
>>> X = positional_encoding(X, 10)
>>> print(X.shape)
>>> torch.Size([2, 4, 10])
'''
dropout = nn.Dropout(dropout_p)
P = torch.zeros((1,max_len,num_features))
X_ = torch.arange(max_len,dtype=torch.float32).reshape(-1,1) / torch.pow(
10000,
torch.arange(0,num_features,2,dtype=torch.float32) /num_features)
P[:,:,0::2] = torch.sin(X_)
P[:,:,1::2] = torch.cos(X_)
X = X + P[:,:X.shape[1],:].to(X.device)
return dropout(X)
```
```python
# 位置编码例子
X = torch.randn((2,4,10))
X = positional_encoding(X, 10)
print(X.shape)
```
torch.Size([2, 4, 10])
## **<div id='multihead'>Multi-Head Attention</div>**
### Taking multi-head attention apart
**The complete, runnable multi-head attention class is given later; skim the section "The complete multi-head attention: MultiheadAttention" first, then come back and read the explanations below in order.**
The multi-head attention class consists mainly of parameter initialization and multi_head_attention_forward.
#### Initializing the parameters
```python
if self._qkv_same_embed_dim is False:
# 初始化前后形状维持不变
# (seq_length x embed_dim) x (embed_dim x embed_dim) ==> (seq_length x embed_dim)
self.q_proj_weight = Parameter(torch.empty((embed_dim, embed_dim)))
self.k_proj_weight = Parameter(torch.empty((embed_dim, self.kdim)))
self.v_proj_weight = Parameter(torch.empty((embed_dim, self.vdim)))
self.register_parameter('in_proj_weight', None)
else:
self.in_proj_weight = Parameter(torch.empty((3 * embed_dim, embed_dim)))
self.register_parameter('q_proj_weight', None)
self.register_parameter('k_proj_weight', None)
self.register_parameter('v_proj_weight', None)
if bias:
self.in_proj_bias = Parameter(torch.empty(3 * embed_dim))
else:
self.register_parameter('in_proj_bias', None)
# 后期会将所有头的注意力拼接在一起然后乘上权重矩阵输出
# out_proj是为了后期准备的
self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
self._reset_parameters()
```
torch.empty creates a tensor of the given shape whose values are not yet initialized (compare torch.randn, which samples from the standard normal distribution); it is simply one way of initializing. In PyTorch, a plain tensor cannot be updated as a model weight, while the Parameter() function can be seen as a type conversion that turns a tensor into a trainable, modifiable model parameter, i.e. one that is bound to model.parameters(). register_parameter registers a parameter under model.parameters(); passing None means the module does not have that parameter.
The if statement checks whether the last dimensions of q, k and v are the same: if they are, one big weight matrix is used for all of them and split apart afterwards; if not, each gets its own weight matrix. Initialization itself does not change the shapes (e.g. ![](http://latex.codecogs.com/svg.latex?q=qW_q+b_q), see the comments).
Note the _reset_parameters() function at the end, which initializes the parameter values. xavier_uniform_ draws the initial values from a [continuous uniform distribution](https://zh.wikipedia.org/wiki/%E9%80%A3%E7%BA%8C%E5%9E%8B%E5%9D%87%E5%8B%BB%E5%88%86%E5%B8%83), while xavier_normal_ samples from a normal distribution. Initial values matter a great deal when training neural networks, which is exactly why these two functions exist.
constant_ fills the given tensor with the given value.
Also, in the PyTorch source code "projection" seems to mean a linear transformation, so in_proj_bias is the bias of the initial linear transformation.
```python
def _reset_parameters(self):
if self._qkv_same_embed_dim:
xavier_uniform_(self.in_proj_weight)
else:
xavier_uniform_(self.q_proj_weight)
xavier_uniform_(self.k_proj_weight)
xavier_uniform_(self.v_proj_weight)
if self.in_proj_bias is not None:
constant_(self.in_proj_bias, 0.)
constant_(self.out_proj.bias, 0.)
```
#### multi_head_attention_forward
As shown in the code below, this function consists of three main parts:
- query, key and value are transformed into q, k, v by _in_projection_packed
- masking
- dot-product attention
```python
import torch
Tensor = torch.Tensor
def multi_head_attention_forward(
query: Tensor,
key: Tensor,
value: Tensor,
num_heads: int,
in_proj_weight: Tensor,
in_proj_bias: Optional[Tensor],
dropout_p: float,
out_proj_weight: Tensor,
out_proj_bias: Optional[Tensor],
training: bool = True,
key_padding_mask: Optional[Tensor] = None,
need_weights: bool = True,
attn_mask: Optional[Tensor] = None,
    use_separate_proj_weight = None,
q_proj_weight: Optional[Tensor] = None,
k_proj_weight: Optional[Tensor] = None,
v_proj_weight: Optional[Tensor] = None,
) -> Tuple[Tensor, Optional[Tensor]]:
r'''
形状:
输入:
- query`(L, N, E)`
- key: `(S, N, E)`
- value: `(S, N, E)`
- key_padding_mask: `(N, S)`
- attn_mask: `(L, S)` or `(N * num_heads, L, S)`
输出:
- attn_output:`(L, N, E)`
- attn_output_weights:`(N, L, S)`
'''
tgt_len, bsz, embed_dim = query.shape
src_len, _, _ = key.shape
head_dim = embed_dim // num_heads
q, k, v = _in_projection_packed(query, key, value, in_proj_weight, in_proj_bias)
if attn_mask is not None:
if attn_mask.dtype == torch.uint8:
warnings.warn("Byte tensor for attn_mask in nn.MultiheadAttention is deprecated. Use bool tensor instead.")
attn_mask = attn_mask.to(torch.bool)
else:
assert attn_mask.is_floating_point() or attn_mask.dtype == torch.bool, \
f"Only float, byte, and bool types are supported for attn_mask, not {attn_mask.dtype}"
if attn_mask.dim() == 2:
correct_2d_size = (tgt_len, src_len)
if attn_mask.shape != correct_2d_size:
raise RuntimeError(f"The shape of the 2D attn_mask is {attn_mask.shape}, but should be {correct_2d_size}.")
attn_mask = attn_mask.unsqueeze(0)
elif attn_mask.dim() == 3:
correct_3d_size = (bsz * num_heads, tgt_len, src_len)
if attn_mask.shape != correct_3d_size:
raise RuntimeError(f"The shape of the 3D attn_mask is {attn_mask.shape}, but should be {correct_3d_size}.")
else:
raise RuntimeError(f"attn_mask's dimension {attn_mask.dim()} is not supported")
if key_padding_mask is not None and key_padding_mask.dtype == torch.uint8:
warnings.warn("Byte tensor for key_padding_mask in nn.MultiheadAttention is deprecated. Use bool tensor instead.")
key_padding_mask = key_padding_mask.to(torch.bool)
# reshape q,k,v将Batch放在第一维以适合点积注意力
# 同时为多头机制,将不同的头拼在一起组成一层
q = q.contiguous().view(tgt_len, bsz * num_heads, head_dim).transpose(0, 1)
k = k.contiguous().view(-1, bsz * num_heads, head_dim).transpose(0, 1)
v = v.contiguous().view(-1, bsz * num_heads, head_dim).transpose(0, 1)
if key_padding_mask is not None:
assert key_padding_mask.shape == (bsz, src_len), \
f"expecting key_padding_mask shape of {(bsz, src_len)}, but got {key_padding_mask.shape}"
key_padding_mask = key_padding_mask.view(bsz, 1, 1, src_len). \
expand(-1, num_heads, -1, -1).reshape(bsz * num_heads, 1, src_len)
if attn_mask is None:
attn_mask = key_padding_mask
elif attn_mask.dtype == torch.bool:
attn_mask = attn_mask.logical_or(key_padding_mask)
else:
attn_mask = attn_mask.masked_fill(key_padding_mask, float("-inf"))
# 若attn_mask值是布尔值则将mask转换为float
if attn_mask is not None and attn_mask.dtype == torch.bool:
new_attn_mask = torch.zeros_like(attn_mask, dtype=torch.float)
new_attn_mask.masked_fill_(attn_mask, float("-inf"))
attn_mask = new_attn_mask
# 若training为True时才应用dropout
if not training:
dropout_p = 0.0
attn_output, attn_output_weights = _scaled_dot_product_attention(q, k, v, attn_mask, dropout_p)
attn_output = attn_output.transpose(0, 1).contiguous().view(tgt_len, bsz, embed_dim)
attn_output = nn.functional.linear(attn_output, out_proj_weight, out_proj_bias)
if need_weights:
# average attention weights over heads
attn_output_weights = attn_output_weights.view(bsz, num_heads, tgt_len, src_len)
return attn_output, attn_output_weights.sum(dim=1) / num_heads
else:
return attn_output, None
```
##### query, key and value are transformed into q, k, v by _in_projection_packed
```
q, k, v = _in_projection_packed(query, key, value, in_proj_weight, in_proj_bias)
```
The `nn.functional.linear` function is just a linear transformation. Unlike `nn.Linear`, it takes the weight matrix and bias explicitly and computes ![](http://latex.codecogs.com/svg.latex?y=xW^T+b), whereas `nn.Linear` lets you specify the output dimension directly and manages its own parameters.
```python
def _in_projection_packed(
q: Tensor,
k: Tensor,
v: Tensor,
w: Tensor,
b: Optional[Tensor] = None,
) -> List[Tensor]:
r"""
用一个大的权重参数矩阵进行线性变换
参数:
q, k, v: 对自注意来说三者都是src对于seq2seq模型k和v是一致的tensor。
但它们的最后一维(num_features或者叫做embed_dim)都必须保持一致。
w: 用以线性变换的大矩阵按照q,k,v的顺序压在一个tensor里面。
b: 用以线性变换的偏置按照q,k,v的顺序压在一个tensor里面。
形状:
输入:
- q: shape:`(..., E)`E是词嵌入的维度下面出现的E均为此意
- k: shape:`(..., E)`
- v: shape:`(..., E)`
- w: shape:`(E * 3, E)`
- b: shape:`E * 3`
输出:
- 输出列表 :`[q', k', v']`q,k,v经过线性变换前后的形状都一致。
"""
E = q.size(-1)
# 若为自注意则q = k = v = src因此它们的引用变量都是src
# 即k is v和q is k结果均为True
# 若为seq2seqk = v因而k is v的结果是True
if k is v:
if q is k:
return F.linear(q, w, b).chunk(3, dim=-1)
else:
# seq2seq模型
w_q, w_kv = w.split([E, E * 2])
if b is None:
b_q = b_kv = None
else:
b_q, b_kv = b.split([E, E * 2])
return (F.linear(q, w_q, b_q),) + F.linear(k, w_kv, b_kv).chunk(2, dim=-1)
else:
w_q, w_k, w_v = w.chunk(3)
if b is None:
b_q = b_k = b_v = None
else:
b_q, b_k, b_v = b.chunk(3)
return F.linear(q, w_q, b_q), F.linear(k, w_k, b_k), F.linear(v, w_v, b_v)
# q, k, v = _in_projection_packed(query, key, value, in_proj_weight, in_proj_bias)
```
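A quick illustrative shape check (added here) for the self-attention case, where query, key and value are the same tensor and the packed weight is split into three equal chunks:
```python
# Illustrative shape check for _in_projection_packed (self-attention case).
E = 8
src = torch.randn(4, 2, E)        # (seq_len, batch, embed_dim)
w = torch.randn(3 * E, E)         # packed W_q, W_k, W_v
b = torch.randn(3 * E)            # packed biases
q, k, v = _in_projection_packed(src, src, src, w, b)
print(q.shape, k.shape, v.shape)  # three tensors of shape torch.Size([4, 2, 8])
```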
***
##### Masking
For attn_mask: if it is 2D, its shape is `(L, S)`, where L and S are the target and source sequence lengths; if it is 3D, its shape is `(N * num_heads, L, S)`, where N is the batch_size and num_heads is the number of attention heads. If attn_mask is a ByteTensor, the non-zero positions are ignored (no attention is paid to them); if it is a BoolTensor, the positions where the value is True are ignored; if it is a numeric tensor, it is added directly to attn_weights.
During decoding, a position may only look at itself and the positions before it; looking at later positions would be cheating, so attn_mask is needed to block them out.
The function below is copied directly from PyTorch; it makes sure that masks of different dimensionality have the correct shape and handles the conversions between different types:
```python
if attn_mask is not None:
if attn_mask.dtype == torch.uint8:
warnings.warn("Byte tensor for attn_mask in nn.MultiheadAttention is deprecated. Use bool tensor instead.")
attn_mask = attn_mask.to(torch.bool)
else:
assert attn_mask.is_floating_point() or attn_mask.dtype == torch.bool, \
f"Only float, byte, and bool types are supported for attn_mask, not {attn_mask.dtype}"
# 对不同维度的形状判定
if attn_mask.dim() == 2:
correct_2d_size = (tgt_len, src_len)
if attn_mask.shape != correct_2d_size:
raise RuntimeError(f"The shape of the 2D attn_mask is {attn_mask.shape}, but should be {correct_2d_size}.")
attn_mask = attn_mask.unsqueeze(0)
elif attn_mask.dim() == 3:
correct_3d_size = (bsz * num_heads, tgt_len, src_len)
if attn_mask.shape != correct_3d_size:
raise RuntimeError(f"The shape of the 3D attn_mask is {attn_mask.shape}, but should be {correct_3d_size}.")
else:
raise RuntimeError(f"attn_mask's dimension {attn_mask.dim()} is not supported")
```
Unlike `attn_mask`, `key_padding_mask` is used to mask out values inside the key, more precisely the `<PAD>` positions; the rules for which values are ignored are the same as for attn_mask.
```python
# 将key_padding_mask值改为布尔值
if key_padding_mask is not None and key_padding_mask.dtype == torch.uint8:
warnings.warn("Byte tensor for key_padding_mask in nn.MultiheadAttention is deprecated. Use bool tensor instead.")
key_padding_mask = key_padding_mask.to(torch.bool)
```
First, two small helper functions. `logical_or` takes two tensors and computes their element-wise logical OR: the result is `False` only when both values are 0, and `True` otherwise. `masked_fill` takes a mask and a fill value; the mask consists of 0s and 1s, positions where the mask is 0 keep their original value, and positions where it is 1 are filled with the new value.
```python
a = torch.tensor([0,1,10,0],dtype=torch.int8)
b = torch.tensor([4,0,1,0],dtype=torch.int8)
print(torch.logical_or(a,b))
# tensor([ True, True, True, False])
```
```python
r = torch.tensor([[0,0,0,0],[0,0,0,0]])
mask = torch.tensor([[1,1,1,1],[0,0,0,0]])
print(r.masked_fill(mask,1))
# tensor([[1, 1, 1, 1],
# [0, 0, 0, 0]])
```
In fact attn_mask and key_padding_mask sometimes target the same positions, so at times they can be merged and handled together. After a softmax, `-inf` becomes 0, i.e. the position is ignored.
```python
if key_padding_mask is not None:
assert key_padding_mask.shape == (bsz, src_len), \
f"expecting key_padding_mask shape of {(bsz, src_len)}, but got {key_padding_mask.shape}"
key_padding_mask = key_padding_mask.view(bsz, 1, 1, src_len). \
expand(-1, num_heads, -1, -1).reshape(bsz * num_heads, 1, src_len)
# 若attn_mask为空直接用key_padding_mask
if attn_mask is None:
attn_mask = key_padding_mask
elif attn_mask.dtype == torch.bool:
attn_mask = attn_mask.logical_or(key_padding_mask)
else:
attn_mask = attn_mask.masked_fill(key_padding_mask, float("-inf"))
# 若attn_mask值是布尔值则将mask转换为float
if attn_mask is not None and attn_mask.dtype == torch.bool:
new_attn_mask = torch.zeros_like(attn_mask, dtype=torch.float)
new_attn_mask.masked_fill_(attn_mask, float("-inf"))
attn_mask = new_attn_mask
```
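A one-line illustrative check (added here) that `-inf` scores really are ignored after the softmax:
```python
# Positions set to -inf receive exactly 0 probability after softmax.
scores = torch.tensor([[1.0, 2.0, float("-inf"), 0.5]])
print(F.softmax(scores, dim=-1))   # the third position gets 0.0
```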
***
##### Dot-product attention
```python
from typing import Optional, Tuple, Any
def _scaled_dot_product_attention(
q: Tensor,
k: Tensor,
v: Tensor,
attn_mask: Optional[Tensor] = None,
dropout_p: float = 0.0,
) -> Tuple[Tensor, Tensor]:
r'''
在query, key, value上计算点积注意力若有注意力遮盖则使用并且应用一个概率为dropout_p的dropout
参数:
- q: shape:`(B, Nt, E)` B代表batch size Nt是目标语言序列长度E是嵌入后的特征维度
- key: shape:`(B, Ns, E)` Ns是源语言序列长度
- value: shape:`(B, Ns, E)`与key形状一样
- attn_mask: 要么是3D的tensor形状为:`(B, Nt, Ns)`或者2D的tensor形状如:`(Nt, Ns)`
- Output: attention values: shape:`(B, Nt, E)`与q的形状一致;attention weights: shape:`(B, Nt, Ns)`
例子:
>>> q = torch.randn((2,3,6))
>>> k = torch.randn((2,4,6))
>>> v = torch.randn((2,4,6))
>>> out = scaled_dot_product_attention(q, k, v)
>>> out[0].shape, out[1].shape
>>> torch.Size([2, 3, 6]) torch.Size([2, 3, 4])
'''
B, Nt, E = q.shape
q = q / math.sqrt(E)
# (B, Nt, E) x (B, E, Ns) -> (B, Nt, Ns)
attn = torch.bmm(q, k.transpose(-2,-1))
if attn_mask is not None:
attn += attn_mask
# attn意味着目标序列的每个词对源语言序列做注意力
attn = F.softmax(attn, dim=-1)
if dropout_p:
attn = F.dropout(attn, p=dropout_p)
# (B, Nt, Ns) x (B, Ns, E) -> (B, Nt, E)
output = torch.bmm(attn, v)
return output, attn
```
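An illustrative example (added here) of calling _scaled_dot_product_attention with an additive causal mask, where allowed positions are 0 and future positions are `-inf`:
```python
# Illustrative causal-mask example for _scaled_dot_product_attention.
B, Nt, Ns, E = 2, 4, 4, 6
q = torch.randn(B, Nt, E)
k = torch.randn(B, Ns, E)
v = torch.randn(B, Ns, E)
causal = torch.triu(torch.full((Nt, Ns), float("-inf")), diagonal=1)  # -inf above the diagonal
out, attn = _scaled_dot_product_attention(q, k, v, attn_mask=causal)
print(out.shape, attn.shape)   # torch.Size([2, 4, 6]) torch.Size([2, 4, 4])
print(attn[0])                 # weights above the diagonal are 0
```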
### The complete multi-head attention: MultiheadAttention
```python
class MultiheadAttention(nn.Module):
r'''
参数:
embed_dim: 词嵌入的维度
num_heads: 平行头的数量
batch_first: 若`True`,则为(batch, seq, feture),若为`False`,则为(seq, batch, feature)
例子:
>>> multihead_attn = MultiheadAttention(embed_dim, num_heads)
>>> attn_output, attn_output_weights = multihead_attn(query, key, value)
'''
def __init__(self, embed_dim, num_heads, dropout=0., bias=True,
kdim=None, vdim=None, batch_first=False) -> None:
# factory_kwargs = {'device': device, 'dtype': dtype}
super(MultiheadAttention, self).__init__()
self.embed_dim = embed_dim
self.kdim = kdim if kdim is not None else embed_dim
self.vdim = vdim if vdim is not None else embed_dim
self._qkv_same_embed_dim = self.kdim == embed_dim and self.vdim == embed_dim
self.num_heads = num_heads
self.dropout = dropout
self.batch_first = batch_first
self.head_dim = embed_dim // num_heads
assert self.head_dim * num_heads == self.embed_dim, "embed_dim must be divisible by num_heads"
if self._qkv_same_embed_dim is False:
self.q_proj_weight = Parameter(torch.empty((embed_dim, embed_dim)))
self.k_proj_weight = Parameter(torch.empty((embed_dim, self.kdim)))
self.v_proj_weight = Parameter(torch.empty((embed_dim, self.vdim)))
self.register_parameter('in_proj_weight', None)
else:
self.in_proj_weight = Parameter(torch.empty((3 * embed_dim, embed_dim)))
self.register_parameter('q_proj_weight', None)
self.register_parameter('k_proj_weight', None)
self.register_parameter('v_proj_weight', None)
if bias:
self.in_proj_bias = Parameter(torch.empty(3 * embed_dim))
else:
self.register_parameter('in_proj_bias', None)
self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
self._reset_parameters()
def _reset_parameters(self):
if self._qkv_same_embed_dim:
xavier_uniform_(self.in_proj_weight)
else:
xavier_uniform_(self.q_proj_weight)
xavier_uniform_(self.k_proj_weight)
xavier_uniform_(self.v_proj_weight)
if self.in_proj_bias is not None:
constant_(self.in_proj_bias, 0.)
constant_(self.out_proj.bias, 0.)
def forward(self, query: Tensor, key: Tensor, value: Tensor, key_padding_mask: Optional[Tensor] = None,
need_weights: bool = True, attn_mask: Optional[Tensor] = None) -> Tuple[Tensor, Optional[Tensor]]:
if self.batch_first:
query, key, value = [x.transpose(1, 0) for x in (query, key, value)]
if not self._qkv_same_embed_dim:
attn_output, attn_output_weights = multi_head_attention_forward(
query, key, value, self.num_heads,
self.in_proj_weight, self.in_proj_bias,
self.dropout, self.out_proj.weight, self.out_proj.bias,
training=self.training,
key_padding_mask=key_padding_mask, need_weights=need_weights,
attn_mask=attn_mask, use_separate_proj_weight=True,
q_proj_weight=self.q_proj_weight, k_proj_weight=self.k_proj_weight,
v_proj_weight=self.v_proj_weight)
else:
attn_output, attn_output_weights = multi_head_attention_forward(
query, key, value, self.num_heads,
self.in_proj_weight, self.in_proj_bias,
self.dropout, self.out_proj.weight, self.out_proj.bias,
training=self.training,
key_padding_mask=key_padding_mask, need_weights=need_weights,
attn_mask=attn_mask)
if self.batch_first:
return attn_output.transpose(1, 0), attn_output_weights
else:
return attn_output, attn_output_weights
```
Now we can try it out, together with the positional encoding. Notice that neither adding the positional encoding nor applying multi-head attention changes the shape:
```python
# 因为batch_first为False,所以src的shape`(seq, batch, embed_dim)`
src = torch.randn((2,4,100))
src = positional_encoding(src,100,0.1)
print(src.shape)
multihead_attn = MultiheadAttention(100, 4, 0.1)
attn_output, attn_output_weights = multihead_attn(src,src,src)
print(attn_output.shape, attn_output_weights.shape)
# torch.Size([2, 4, 100])
# torch.Size([2, 4, 100]) torch.Size([4, 2, 2])
```
torch.Size([2, 4, 100])
torch.Size([2, 4, 100]) torch.Size([4, 2, 2])
***
## **<div id='build'>Building the Transformer</div>**
## Encoder Layer
![](./pictures/2-2-1-encoder.png)
```python
class TransformerEncoderLayer(nn.Module):
r'''
参数:
d_model: 词嵌入的维度(必备)
nhead: 多头注意力中平行头的数目(必备)
dim_feedforward: 全连接层的神经元的数目又称经过此层输入的维度Default = 2048
dropout: dropout的概率Default = 0.1
activation: 两个线性层中间的激活函数默认relu或gelu
lay_norm_eps: layer normalization中的微小量防止分母为0Default = 1e-5
batch_first: 若`True`,则为(batch, seq, feture),若为`False`,则为(seq, batch, feature)DefaultFalse
例子:
>>> encoder_layer = TransformerEncoderLayer(d_model=512, nhead=8)
>>> src = torch.randn((32, 10, 512))
>>> out = encoder_layer(src)
'''
def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1, activation=F.relu,
layer_norm_eps=1e-5, batch_first=False) -> None:
super(TransformerEncoderLayer, self).__init__()
self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=batch_first)
self.linear1 = nn.Linear(d_model, dim_feedforward)
self.dropout = nn.Dropout(dropout)
self.linear2 = nn.Linear(dim_feedforward, d_model)
self.norm1 = nn.LayerNorm(d_model, eps=layer_norm_eps)
self.norm2 = nn.LayerNorm(d_model, eps=layer_norm_eps)
self.dropout1 = nn.Dropout(dropout)
self.dropout2 = nn.Dropout(dropout)
self.activation = activation
def forward(self, src: Tensor, src_mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None) -> Tensor:
src = positional_encoding(src, src.shape[-1])
src2 = self.self_attn(src, src, src, attn_mask=src_mask,
key_padding_mask=src_key_padding_mask)[0]
src = src + self.dropout1(src2)
src = self.norm1(src)
src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))
        src = src + self.dropout2(src2)
src = self.norm2(src)
return src
```
```python
# 用小例子看一下
encoder_layer = TransformerEncoderLayer(d_model=512, nhead=8)
src = torch.randn((32, 10, 512))
out = encoder_layer(src)
print(out.shape)
# torch.Size([32, 10, 512])
```
torch.Size([32, 10, 512])
### Building the Encoder from Transformer layers
```python
class TransformerEncoder(nn.Module):
r'''
参数:
encoder_layer必备
num_layers encoder_layer的层数必备
norm: 归一化的选择(可选)
例子:
>>> encoder_layer = TransformerEncoderLayer(d_model=512, nhead=8)
>>> transformer_encoder = TransformerEncoder(encoder_layer, num_layers=6)
>>> src = torch.randn((10, 32, 512))
>>> out = transformer_encoder(src)
'''
def __init__(self, encoder_layer, num_layers, norm=None):
super(TransformerEncoder, self).__init__()
self.layer = encoder_layer
self.num_layers = num_layers
self.norm = norm
def forward(self, src: Tensor, mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None) -> Tensor:
output = positional_encoding(src, src.shape[-1])
for _ in range(self.num_layers):
output = self.layer(output, src_mask=mask, src_key_padding_mask=src_key_padding_mask)
if self.norm is not None:
output = self.norm(output)
return output
```
```python
# 例子
encoder_layer = TransformerEncoderLayer(d_model=512, nhead=8)
transformer_encoder = TransformerEncoder(encoder_layer, num_layers=6)
src = torch.randn((10, 32, 512))
out = transformer_encoder(src)
print(out.shape)
# torch.Size([10, 32, 512])
```
torch.Size([10, 32, 512])
***
## Decoder Layer:
```python
class TransformerDecoderLayer(nn.Module):
r'''
参数:
d_model: 词嵌入的维度(必备)
nhead: 多头注意力中平行头的数目(必备)
dim_feedforward: 全连接层的神经元的数目又称经过此层输入的维度Default = 2048
dropout: dropout的概率Default = 0.1
activation: 两个线性层中间的激活函数默认relu或gelu
lay_norm_eps: layer normalization中的微小量防止分母为0Default = 1e-5
batch_first: 若`True`,则为(batch, seq, feture),若为`False`,则为(seq, batch, feature)DefaultFalse
例子:
>>> decoder_layer = TransformerDecoderLayer(d_model=512, nhead=8)
>>> memory = torch.randn((10, 32, 512))
>>> tgt = torch.randn((20, 32, 512))
>>> out = decoder_layer(tgt, memory)
'''
def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1, activation=F.relu,
layer_norm_eps=1e-5, batch_first=False) -> None:
super(TransformerDecoderLayer, self).__init__()
self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=batch_first)
self.multihead_attn = MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=batch_first)
self.linear1 = nn.Linear(d_model, dim_feedforward)
self.dropout = nn.Dropout(dropout)
self.linear2 = nn.Linear(dim_feedforward, d_model)
self.norm1 = nn.LayerNorm(d_model, eps=layer_norm_eps)
self.norm2 = nn.LayerNorm(d_model, eps=layer_norm_eps)
self.norm3 = nn.LayerNorm(d_model, eps=layer_norm_eps)
self.dropout1 = nn.Dropout(dropout)
self.dropout2 = nn.Dropout(dropout)
self.dropout3 = nn.Dropout(dropout)
self.activation = activation
def forward(self, tgt: Tensor, memory: Tensor, tgt_mask: Optional[Tensor] = None,
memory_mask: Optional[Tensor] = None,tgt_key_padding_mask: Optional[Tensor] = None, memory_key_padding_mask: Optional[Tensor] = None) -> Tensor:
r'''
参数:
tgt: 目标语言序列(必备)
memory: 从最后一个encoder_layer跑出的句子必备
tgt_mask: 目标语言序列的mask可选
memory_mask可选
tgt_key_padding_mask可选
memory_key_padding_mask可选
'''
tgt2 = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask,
key_padding_mask=tgt_key_padding_mask)[0]
tgt = tgt + self.dropout1(tgt2)
tgt = self.norm1(tgt)
tgt2 = self.multihead_attn(tgt, memory, memory, attn_mask=memory_mask,
key_padding_mask=memory_key_padding_mask)[0]
tgt = tgt + self.dropout2(tgt2)
tgt = self.norm2(tgt)
tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
tgt = tgt + self.dropout3(tgt2)
tgt = self.norm3(tgt)
return tgt
```
```python
# 可爱的小例子
decoder_layer = TransformerDecoderLayer(d_model=512, nhead=8)
memory = torch.randn((10, 32, 512))
tgt = torch.randn((20, 32, 512))
out = decoder_layer(tgt, memory)
print(out.shape)
# torch.Size([20, 32, 512])
```
torch.Size([20, 32, 512])
## Decoder
```python
class TransformerDecoder(nn.Module):
r'''
参数:
decoder_layer必备
num_layers: decoder_layer的层数必备
norm: 归一化选择
例子:
>>> decoder_layer =TransformerDecoderLayer(d_model=512, nhead=8)
>>> transformer_decoder = TransformerDecoder(decoder_layer, num_layers=6)
>>> memory = torch.rand(10, 32, 512)
>>> tgt = torch.rand(20, 32, 512)
>>> out = transformer_decoder(tgt, memory)
'''
def __init__(self, decoder_layer, num_layers, norm=None):
super(TransformerDecoder, self).__init__()
self.layer = decoder_layer
self.num_layers = num_layers
self.norm = norm
def forward(self, tgt: Tensor, memory: Tensor, tgt_mask: Optional[Tensor] = None,
memory_mask: Optional[Tensor] = None, tgt_key_padding_mask: Optional[Tensor] = None,
memory_key_padding_mask: Optional[Tensor] = None) -> Tensor:
output = tgt
for _ in range(self.num_layers):
output = self.layer(output, memory, tgt_mask=tgt_mask,
memory_mask=memory_mask,
tgt_key_padding_mask=tgt_key_padding_mask,
memory_key_padding_mask=memory_key_padding_mask)
if self.norm is not None:
output = self.norm(output)
return output
```
```python
# 可爱的小例子
decoder_layer =TransformerDecoderLayer(d_model=512, nhead=8)
transformer_decoder = TransformerDecoder(decoder_layer, num_layers=6)
memory = torch.rand(10, 32, 512)
tgt = torch.rand(20, 32, 512)
out = transformer_decoder(tgt, memory)
print(out.shape)
# torch.Size([20, 32, 512])
```
torch.Size([20, 32, 512])
To summarize: positional encoding, multi-head attention, the Encoder Layer and the Decoder Layer all leave the shape unchanged, while the Encoder and Decoder outputs have the same shapes as src and tgt respectively.
## Transformer
```python
class Transformer(nn.Module):
r'''
参数:
d_model: 词嵌入的维度必备Default=512
nhead: 多头注意力中平行头的数目必备Default=8
        num_encoder_layers:编码层层数Default=6
        num_decoder_layers:解码层层数Default=6
dim_feedforward: 全连接层的神经元的数目又称经过此层输入的维度Default = 2048
dropout: dropout的概率Default = 0.1
activation: 两个线性层中间的激活函数默认relu或gelu
custom_encoder: 自定义encoderDefault=None
custom_decoder: 自定义decoderDefault=None
lay_norm_eps: layer normalization中的微小量防止分母为0Default = 1e-5
batch_first: 若`True`,则为(batch, seq, feture),若为`False`,则为(seq, batch, feature)DefaultFalse
例子:
>>> transformer_model = Transformer(nhead=16, num_encoder_layers=12)
>>> src = torch.rand((10, 32, 512))
>>> tgt = torch.rand((20, 32, 512))
>>> out = transformer_model(src, tgt)
'''
def __init__(self, d_model: int = 512, nhead: int = 8, num_encoder_layers: int = 6,
num_decoder_layers: int = 6, dim_feedforward: int = 2048, dropout: float = 0.1,
activation = F.relu, custom_encoder: Optional[Any] = None, custom_decoder: Optional[Any] = None,
layer_norm_eps: float = 1e-5, batch_first: bool = False) -> None:
super(Transformer, self).__init__()
if custom_encoder is not None:
self.encoder = custom_encoder
else:
encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward, dropout,
activation, layer_norm_eps, batch_first)
encoder_norm = nn.LayerNorm(d_model, eps=layer_norm_eps)
            self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)
if custom_decoder is not None:
self.decoder = custom_decoder
else:
decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward, dropout,
activation, layer_norm_eps, batch_first)
decoder_norm = nn.LayerNorm(d_model, eps=layer_norm_eps)
self.decoder = TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm)
self._reset_parameters()
self.d_model = d_model
self.nhead = nhead
self.batch_first = batch_first
def forward(self, src: Tensor, tgt: Tensor, src_mask: Optional[Tensor] = None, tgt_mask: Optional[Tensor] = None,
memory_mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None,
tgt_key_padding_mask: Optional[Tensor] = None, memory_key_padding_mask: Optional[Tensor] = None) -> Tensor:
r'''
参数:
src: 源语言序列送入Encoder必备
tgt: 目标语言序列送入Decoder必备
src_mask: (可选)
tgt_mask: (可选)
memory_mask: (可选)
src_key_padding_mask: (可选)
tgt_key_padding_mask: (可选)
memory_key_padding_mask: (可选)
形状:
- src: shape:`(S, N, E)`, `(N, S, E)` if batch_first.
- tgt: shape:`(T, N, E)`, `(N, T, E)` if batch_first.
- src_mask: shape:`(S, S)`.
- tgt_mask: shape:`(T, T)`.
- memory_mask: shape:`(T, S)`.
- src_key_padding_mask: shape:`(N, S)`.
- tgt_key_padding_mask: shape:`(N, T)`.
- memory_key_padding_mask: shape:`(N, S)`.
[src/tgt/memory]_mask确保有些位置不被看到如做decode的时候只能看该位置及其以前的而不能看后面的。
若为ByteTensor非0的位置会被忽略不做注意力若为BoolTensorTrue对应的位置会被忽略
若为数值则会直接加到attn_weights
[src/tgt/memory]_key_padding_mask 使得key里面的某些元素不参与attention计算三种情况同上
- output: shape:`(T, N, E)`, `(N, T, E)` if batch_first.
注意:
src和tgt的最后一维需要等于d_modelbatch的那一维需要相等
例子:
>>> output = transformer_model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
'''
memory = self.encoder(src, mask=src_mask, src_key_padding_mask=src_key_padding_mask)
output = self.decoder(tgt, memory, tgt_mask=tgt_mask, memory_mask=memory_mask,
tgt_key_padding_mask=tgt_key_padding_mask,
memory_key_padding_mask=memory_key_padding_mask)
return output
def generate_square_subsequent_mask(self, sz: int) -> Tensor:
r'''产生关于序列的mask被遮住的区域赋值`-inf`,未被遮住的区域赋值为`0`'''
mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
return mask
def _reset_parameters(self):
        r'''用xavier均匀分布初始化参数'''
for p in self.parameters():
if p.dim() > 1:
xavier_uniform_(p)
```
```python
# 小例子
transformer_model = Transformer(nhead=16, num_encoder_layers=12)
src = torch.rand((10, 32, 512))
tgt = torch.rand((20, 32, 512))
out = transformer_model(src, tgt)
print(out.shape)
# torch.Size([20, 32, 512])
```
torch.Size([20, 32, 512])
At this point we have re-implemented the whole of PyTorch's Transformer library by hand; compared with the official version, this hand-written one omits quite a few validation checks.
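As a final illustrative sketch (added here, not in the original), here is how the hand-written Transformer can be used together with generate_square_subsequent_mask to apply a causal mask on the target side:
```python
# Illustrative end-to-end run with a causal target mask.
transformer = Transformer(d_model=512, nhead=8, num_encoder_layers=2, num_decoder_layers=2)
src = torch.rand((10, 32, 512))                             # (S, N, E)
tgt = torch.rand((20, 32, 512))                             # (T, N, E)
tgt_mask = transformer.generate_square_subsequent_mask(20)  # (T, T) float mask with -inf
out = transformer(src, tgt, tgt_mask=tgt_mask)
print(out.shape)                                            # torch.Size([20, 32, 512])
```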
## Acknowledgements
This article was written by 台运鹏 and reorganized by the project members. We look forward to your feedback, and a Star is always appreciated. Thank you.