fix typo

2021-08-17 22:20:45 +08:00 · 2021-08-17 22:20:45 +08:00 · f366ffbbf5
parent 74a8f3af63
commit f366ffbbf5
1 changed file with 2 additions and 2 deletions
--- a/docs/篇章2-Transformer相关原理/2.2-图解transformer.md
+++ b/docs/篇章2-Transformer相关原理/2.2-图解transformer.md
@@ -210,7 +210,7 @@ The Transformer paper, by adding a multi-head attention mechanism (one set of attention is called
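 ![attention for `it`](./pictures/2-it-attention.webp)
 Figure: attention for `it`
-When we encode the word "it", one attention head focuses most on "the animal", while another attention head focuses on "tired". So in a sense, the model's representation of "it" blends part of the representations of "animal" and "word".
+When we encode the word "it", one attention head focuses most on "the animal", while another attention head focuses on "tired". So in a sense, the model's representation of "it" blends part of the representations of "animal" and "tire".
 However, when we draw all of the attention heads on the figure, multi-head attention becomes hard to interpret again.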
@@ -477,7 +477,7 @@ x = x.view(bsz, -1, self.n_heads * (self.hid_dim // self.n_heads))
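 The Self Attention layer in the decoder differs from the one in the encoder: in the decoder, the Self Attention layer may only attend to words in the output sequence that come before the current position. Concretely, the positions after the current position are masked before the Self Attention scores pass through the Softmax layer.
-The Encoder-Decoder Attention layer works much like multi-headed Self Attention, with one difference: it builds its Query matrix from the output of the previous layer, while its Key and Value matrices come from the final output of the decoder.
+The Encoder-Decoder Attention layer works much like multi-headed Self Attention, with one difference: it builds its Query matrix from the output of the previous layer, while its Key and Value matrices come from the final output of the encoder.
 ## The final linear layer and Softmax layer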