From 7e7c2c30ae3e37bd29fe4b1945e2aef274f7e514 Mon Sep 17 00:00:00 2001 From: erenup Date: Sat, 28 Aug 2021 20:12:50 +0800 Subject: [PATCH] fix --- .../3.2-如何应用一个BERT.ipynb | 284 +++++++++--------- .../3.2-如何应用一个BERT.md | 2 +- 2 files changed, 143 insertions(+), 143 deletions(-) diff --git a/docs/篇章3-编写一个Transformer模型:BERT/3.2-如何应用一个BERT.ipynb b/docs/篇章3-编写一个Transformer模型:BERT/3.2-如何应用一个BERT.ipynb index 07a4899..ed5ef04 100644 --- a/docs/篇章3-编写一个Transformer模型:BERT/3.2-如何应用一个BERT.ipynb +++ b/docs/篇章3-编写一个Transformer模型:BERT/3.2-如何应用一个BERT.ipynb @@ -2,7 +2,6 @@ "cells": [ { "cell_type": "markdown", - "metadata": {}, "source": [ "## 前言\n", "接着上一小节,我们对Huggingface开源代码库中的Bert模型进行了深入学习,这一节我们对如何应用BERT进行详细的讲解。\n", @@ -31,11 +30,11 @@ "\n", "用于初始化模型权重,同时维护继承自`PreTrainedModel`的一些标记身份或者加载模型时的类变量。\n", "下面,首先从预训练模型开始分析。" - ] + ], + "metadata": {} }, { "cell_type": "markdown", - "metadata": {}, "source": [ "*** \n", "### 3.1 BertForPreTraining\n", @@ -169,22 +168,12 @@ " - 同样基于BertOnlyMLMHead;\n", "- BertForNextSentencePrediction:只进行 NSP 任务的预训练。\n", " - 基于BertOnlyNSPHead,内容就是一个线性层。" - ] + ], + "metadata": {} }, { "cell_type": "code", "execution_count": 32, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']\n", - "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n" - ] - } - ], "source": [ "_CHECKPOINT_FOR_DOC = \"bert-base-uncased\"\n", "_CONFIG_FOR_DOC = \"BertConfig\"\n", @@ -289,23 +278,22 @@ "outputs = model(**inputs)\n", "prediction_logits = outputs.prediction_logits\n", "seq_relationship_logits = outputs.seq_relationship_logits" - ] + ], + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']\n", + "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n" + ] + } + ], + "metadata": {} }, { "cell_type": "code", "execution_count": 18, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertLMHeadModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']\n", - "- This IS expected if you are initializing BertLMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n", - "- This IS NOT expected if you are initializing BertLMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n" - ] - } - ], "source": [ "@add_start_docstrings(\n", " \"\"\"Bert Model with a `language modeling` head on top for CLM fine-tuning. 
\"\"\", BERT_START_DOCSTRING\n", @@ -455,24 +443,23 @@ "inputs = tokenizer(\"Hello, my dog is cute\", return_tensors=\"pt\")\n", "outputs = model(**inputs)\n", "prediction_logits = outputs.logits" - ] + ], + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertLMHeadModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']\n", + "- This IS expected if you are initializing BertLMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n", + "- This IS NOT expected if you are initializing BertLMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n" + ] + } + ], + "metadata": {} }, { "cell_type": "code", "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Downloading: 100%|██████████| 440M/440M [00:30<00:00, 14.5MB/s]\n", - "Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForNextSentencePrediction: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']\n", - "- This IS expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n", - "- This IS NOT expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n" - ] - } - ], "source": [ "class BertForNextSentencePrediction(BertPreTrainedModel):\n", " def __init__(self, config):\n", @@ -569,20 +556,32 @@ "outputs = model(**encoding, labels=torch.LongTensor([1]))\n", "logits = outputs.logits\n", "assert logits[0, 0] < logits[0, 1] # next sentence was random" - ] + ], + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "Downloading: 100%|██████████| 440M/440M [00:30<00:00, 14.5MB/s]\n", + "Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForNextSentencePrediction: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']\n", + "- This IS expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model trained on another task or with another architecture (e.g. 
initializing a BertForSequenceClassification model from a BertForPreTraining model).\n", + "- This IS NOT expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n" + ] + } + ], + "metadata": {} }, { "cell_type": "markdown", - "metadata": {}, "source": [ "接下来介绍的是各种 Fine-tune 模型,基本都是分类任务:\n", "\n", "![Bert:finetune](./pictures/3-4-bert-ft.png) 图:Bert:finetune" - ] + ], + "metadata": {} }, { "cell_type": "markdown", - "metadata": {}, "source": [ "*** \n", "### 3.2 BertForSequenceClassification\n", @@ -609,13 +608,12 @@ "- 如果初始化的num_labels=1,那么就默认为回归任务,使用 MSELoss;\n", "\n", "- 否则认为是分类任务。" - ] + ], + "metadata": {} }, { "cell_type": "code", "execution_count": 19, - "metadata": {}, - "outputs": [], "source": [ "@add_start_docstrings(\n", " \"\"\"\n", @@ -716,35 +714,13 @@ " hidden_states=outputs.hidden_states,\n", " attentions=outputs.attentions,\n", " )" - ] + ], + "outputs": [], + "metadata": {} }, { "cell_type": "code", "execution_count": 24, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Downloading: 100%|██████████| 213k/213k [00:00<00:00, 596kB/s]\n", - "Downloading: 100%|██████████| 29.0/29.0 [00:00<00:00, 12.4kB/s]\n", - "Downloading: 100%|██████████| 436k/436k [00:00<00:00, 808kB/s]\n", - "Downloading: 100%|██████████| 433/433 [00:00<00:00, 166kB/s]\n", - "Downloading: 100%|██████████| 433M/433M [00:29<00:00, 14.5MB/s]\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "not paraphrase: 10%\n", - "is paraphrase: 90%\n", - "not paraphrase: 94%\n", - "is paraphrase: 6%\n" - ] - } - ], "source": [ "from transformers.models.bert.tokenization_bert import BertTokenizer\n", "from transformers.models.bert.modeling_bert import BertForSequenceClassification\n", @@ -774,11 +750,34 @@ "# Should not be paraphrase\n", "for i in range(len(classes)):\n", " print(f\"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%\")" - ] + ], + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "Downloading: 100%|██████████| 213k/213k [00:00<00:00, 596kB/s]\n", + "Downloading: 100%|██████████| 29.0/29.0 [00:00<00:00, 12.4kB/s]\n", + "Downloading: 100%|██████████| 436k/436k [00:00<00:00, 808kB/s]\n", + "Downloading: 100%|██████████| 433/433 [00:00<00:00, 166kB/s]\n", + "Downloading: 100%|██████████| 433M/433M [00:29<00:00, 14.5MB/s]\n" + ] + }, + { + "output_type": "stream", + "name": "stdout", + "text": [ + "not paraphrase: 10%\n", + "is paraphrase: 90%\n", + "not paraphrase: 94%\n", + "is paraphrase: 6%\n" + ] + } + ], + "metadata": {} }, { "cell_type": "markdown", - "metadata": {}, "source": [ "*** \n", "### 3.3 BertForMultipleChoice\n", @@ -787,11 +786,11 @@ "- 多项选择任务的输入为一组分次输入的句子,输出为选择某一句子的单个标签。\n", "结构上与句子分类相似,只不过线性层输出维度为 1,即每次需要将每个样本的多个句子的输出拼接起来作为每个样本的预测分数。\n", "- 实际上,具体操作时是把每个 batch 的多个句子一同放入的,所以一次处理的输入为[batch_size, num_choices]数量的句子,因此相同 batch 大小时,比句子分类等任务需要更多的显存,在训练时需要小心。" - ] + ], + "metadata": {} }, { "cell_type": "markdown", - "metadata": {}, "source": [ "*** \n", "### 3.4 BertForTokenClassification\n", @@ -799,13 +798,12 @@ "- 序列标注任务的输入为单个句子文本,输出为每个 token 对应的类别标签。\n", "由于需要用到每个 token对应的输出而不只是某几个,所以这里的BertModel不用加入 pooling 层;\n", "- 同时,这里将`_keys_to_ignore_on_load_unexpected`这一个类参数设置为`[r\"pooler\"]`,也就是在加载模型时对于出现不需要的权重不发生报错。" - ] + ], + "metadata": {} }, { "cell_type": "code", "execution_count": 34, - "metadata": {}, - 
"outputs": [], "source": [ "class BertForMultipleChoice(BertPreTrainedModel):\n", " def __init__(self, config):\n", @@ -889,13 +887,13 @@ " hidden_states=outputs.hidden_states,\n", " attentions=outputs.attentions,\n", " )\n" - ] + ], + "outputs": [], + "metadata": {} }, { "cell_type": "code", "execution_count": 21, - "metadata": {}, - "outputs": [], "source": [ "@add_start_docstrings(\n", " \"\"\"\n", @@ -989,22 +987,13 @@ " hidden_states=outputs.hidden_states,\n", " attentions=outputs.attentions,\n", " )\n" - ] + ], + "outputs": [], + "metadata": {} }, { "cell_type": "code", "execution_count": 26, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Downloading: 100%|██████████| 998/998 [00:00<00:00, 382kB/s]\n", - "Downloading: 100%|██████████| 1.33G/1.33G [01:30<00:00, 14.7MB/s]\n" - ] - } - ], "source": [ "from transformers import BertForTokenClassification, BertTokenizer\n", "import torch\n", @@ -1032,16 +1021,30 @@ "\n", "outputs = model(inputs).logits\n", "predictions = torch.argmax(outputs, dim=2)" - ] + ], + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "Downloading: 100%|██████████| 998/998 [00:00<00:00, 382kB/s]\n", + "Downloading: 100%|██████████| 1.33G/1.33G [01:30<00:00, 14.7MB/s]\n" + ] + } + ], + "metadata": {} }, { "cell_type": "code", "execution_count": 27, - "metadata": {}, + "source": [ + "for token, prediction in zip(tokens, predictions[0].numpy()):\n", + " print((token, model.config.id2label[prediction]))" + ], "outputs": [ { - "name": "stdout", "output_type": "stream", + "name": "stdout", "text": [ "('[CLS]', 'O')\n", "('Hu', 'I-ORG')\n", @@ -1078,14 +1081,10 @@ ] } ], - "source": [ - "for token, prediction in zip(tokens, predictions[0].numpy()):\n", - " print((token, model.config.id2label[prediction]))" - ] + "metadata": {} }, { "cell_type": "markdown", - "metadata": {}, "source": [ "*** \n", "### 3.5 BertForQuestionAnswering\n", @@ -1095,13 +1094,12 @@ "- 对超出句子长度的非法 label,会将其压缩(torch.clamp_)到合理范围。\n", "\n", "以上就是关于 BERT 源码的介绍,下面介绍一些关于 BERT 模型实用的训练细节。" - ] + ], + "metadata": {} }, { "cell_type": "code", "execution_count": 23, - "metadata": {}, - "outputs": [], "source": [ "@add_start_docstrings(\n", " \"\"\"\n", @@ -1203,37 +1201,13 @@ " hidden_states=outputs.hidden_states,\n", " attentions=outputs.attentions,\n", " )\n" - ] + ], + "outputs": [], + "metadata": {} }, { "cell_type": "code", "execution_count": 29, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Downloading: 100%|██████████| 443/443 [00:00<00:00, 186kB/s]\n", - "Downloading: 100%|██████████| 232k/232k [00:00<00:00, 438kB/s]\n", - "Downloading: 100%|██████████| 466k/466k [00:00<00:00, 845kB/s]\n", - "Downloading: 100%|██████████| 28.0/28.0 [00:00<00:00, 10.5kB/s]\n", - "Downloading: 100%|██████████| 1.34G/1.34G [01:28<00:00, 15.1MB/s]\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Question: How many pretrained models are available in 🤗 Transformers?\n", - "Answer: over 32 +\n", - "Question: What does 🤗 Transformers provide?\n", - "Answer: general - purpose architectures\n", - "Question: 🤗 Transformers provides interoperability between which frameworks?\n", - "Answer: tensorflow 2. 
0 and pytorch\n" - ] - } - ], "source": [ "from transformers import AutoTokenizer, AutoModelForQuestionAnswering\n", "import torch\n", @@ -1262,11 +1236,36 @@ " answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))\n", " print(f\"Question: {question}\")\n", " print(f\"Answer: {answer}\")" - ] + ], + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "Downloading: 100%|██████████| 443/443 [00:00<00:00, 186kB/s]\n", + "Downloading: 100%|██████████| 232k/232k [00:00<00:00, 438kB/s]\n", + "Downloading: 100%|██████████| 466k/466k [00:00<00:00, 845kB/s]\n", + "Downloading: 100%|██████████| 28.0/28.0 [00:00<00:00, 10.5kB/s]\n", + "Downloading: 100%|██████████| 1.34G/1.34G [01:28<00:00, 15.1MB/s]\n" + ] + }, + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Question: How many pretrained models are available in 🤗 Transformers?\n", + "Answer: over 32 +\n", + "Question: What does 🤗 Transformers provide?\n", + "Answer: general - purpose architectures\n", + "Question: 🤗 Transformers provides interoperability between which frameworks?\n", + "Answer: tensorflow 2. 0 and pytorch\n" + ] + } + ], + "metadata": {} }, { "cell_type": "markdown", - "metadata": {}, "source": [ "*** \n", "## BERT训练和优化\n", @@ -1295,11 +1294,11 @@ "```\n", "\n", "至于为什么,应该是因为 word_embedding 和 prediction 权重太大了,以 bert-base 为例,其尺寸为(30522, 768),降低训练难度。\n" - ] + ], + "metadata": {} }, { "cell_type": "markdown", - "metadata": {}, "source": [ "***\n", "### 4.2 Fine-Tuning\n", @@ -1319,7 +1318,7 @@ "- paperplanet:都 9102 年了,别再用 Adam + L2 regularization了 [2]\n", "\n", "通常,我们会选择模型的 weight 部分参与 decay 过程,而另一部分(包括 LayerNorm 的 weight)不参与(代码最初来源应该是 Huggingface 的示例)\n", - "补充:关于这么做的理由,我暂时没有找到合理的解答,但是找到了一些相关的[讨论](https://forums.fast.ai/t/is-weight-decay-applied-to-the-bias-term/73212/4forums.fast.ai)\n", + "补充:关于这么做的理由,我暂时没有找到合理的解答。\n", "\n", "```\n", "# model: a Bert-based-model object\n", @@ -1378,14 +1377,15 @@ "\n", "## 致谢\n", "本文主要由浙江大学李泺秋撰写,本项目同学负责整理和汇总。" - ] + ], + "metadata": {} }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "source": [], "outputs": [], - "source": [] + "metadata": {} } ], "metadata": { diff --git a/docs/篇章3-编写一个Transformer模型:BERT/3.2-如何应用一个BERT.md b/docs/篇章3-编写一个Transformer模型:BERT/3.2-如何应用一个BERT.md index 1b5bf6d..2c7c145 100644 --- a/docs/篇章3-编写一个Transformer模型:BERT/3.2-如何应用一个BERT.md +++ b/docs/篇章3-编写一个Transformer模型:BERT/3.2-如何应用一个BERT.md @@ -1190,7 +1190,7 @@ AdamW 是在 Adam+L2 正则化的基础上进行改进的算法,与一般的 A - paperplanet:都 9102 年了,别再用 Adam + L2 regularization了 [2] 通常,我们会选择模型的 weight 部分参与 decay 过程,而另一部分(包括 LayerNorm 的 weight)不参与(代码最初来源应该是 Huggingface 的示例) -补充:关于这么做的理由,我暂时没有找到合理的解答,但是找到了一些相关的[讨论](https://forums.fast.ai/t/is-weight-decay-applied-to-the-bias-term/73212/4forums.fast.ai) +补充:关于这么做的理由,我暂时没有找到合理的解答。 ``` # model: a Bert-based-model object
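
A minimal, self-contained sketch of the no-decay parameter grouping described in section 4.2 above: weight matrices participate in weight decay while biases and LayerNorm parameters are excluded, and the grouped parameters are handed to AdamW. The checkpoint name, `num_labels`, learning rate, and weight-decay value below are illustrative assumptions, not values taken from the patched files.

```python
import torch
from transformers import BertForSequenceClassification

# Illustrative model; any BERT-based model with named_parameters() works the same way.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Parameters whose names contain these substrings are excluded from weight decay.
no_decay = ["bias", "LayerNorm.weight"]

optimizer_grouped_parameters = [
    {
        # All weights except biases and LayerNorm weights: apply weight decay.
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": 0.01,  # illustrative value
    },
    {
        # Biases and LayerNorm weights: no weight decay.
        "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]

# AdamW applies decoupled weight decay per parameter group.
optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=2e-5)  # lr is illustrative
```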