diff --git a/docs/篇章3-编写一个Transformer模型:BERT/3.1-如何实现一个BERT.ipynb b/docs/篇章3-编写一个Transformer模型:BERT/3.1-如何实现一个BERT.ipynb index cf8ec01..a4695ae 100644 --- a/docs/篇章3-编写一个Transformer模型:BERT/3.1-如何实现一个BERT.ipynb +++ b/docs/篇章3-编写一个Transformer模型:BERT/3.1-如何实现一个BERT.ipynb @@ -5,6 +5,10 @@ "metadata": {}, "source": [ "## 前言\n", + "本文包含大量源码和讲解,通过段落和横线分割了各个模块,同时网站配备了侧边栏,帮助大家在各个小节中快速跳转,希望大家阅读完能对BERT有深刻的了解。同时建议通过pycharm、vscode等工具对bert源码进行单步调试,调试到对应的模块再对比看本章节的讲解。\n", + "\n", + "涉及到的jupyter可以在[代码库:篇章3-编写一个Transformer模型:BERT,下载](https://github.com/datawhalechina/learn-nlp-with-transformers/tree/main/docs/%E7%AF%87%E7%AB%A03-%E7%BC%96%E5%86%99%E4%B8%80%E4%B8%AATransformer%E6%A8%A1%E5%9E%8B%EF%BC%9ABERT)\n", + "\n", "本篇章将基于H[HuggingFace/Transformers, 48.9k Star](https://github.com/huggingface/transformers)进行学习。本章节的全部代码在[huggingface bert,注意由于版本更新较快,可能存在差别,请以4.4.2版本为准](https://github.com/huggingface/transformers/tree/master/src/transformers/models/bert)HuggingFace 是一家总部位于纽约的聊天机器人初创服务商,很早就捕捉到 BERT 大潮流的信号并着手实现基于 pytorch 的 BERT 模型。这一项目最初名为 pytorch-pretrained-bert,在复现了原始效果的同时,提供了易用的方法以方便在这一强大模型的基础上进行各种玩耍和研究。\n", "\n", "随着使用人数的增加,这一项目也发展成为一个较大的开源社区,合并了各种预训练语言模型以及增加了 Tensorflow 的实现,并且在 2019 年下半年改名为 Transformers。截止写文章时(2021 年 3 月 30 日)这一项目已经拥有 43k+ 的star,可以说 Transformers 已经成为事实上的 NLP 基本工具。\n", @@ -1596,7 +1600,7 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 30, "metadata": {}, "outputs": [], "source": [ @@ -1611,7 +1615,7 @@ " hidden_states = self.dense(hidden_states)\n", " hidden_states = self.dropout(hidden_states)\n", " hidden_states = self.LayerNorm(hidden_states + input_tensor)\n", - " return hidden_states\n" + " return hidden_states" ] }, { @@ -1641,9 +1645,18 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 28, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "input to bert pooler size: 768\n", + "torch.Size([1, 768])\n" + ] + } + ], "source": [ "class BertPooler(nn.Module):\n", " def __init__(self, config):\n", @@ -1657,9 +1670,27 @@ " first_token_tensor = hidden_states[:, 0]\n", " pooled_output = self.dense(first_token_tensor)\n", " pooled_output = self.activation(pooled_output)\n", - " return pooled_output" + " return pooled_output\n", + "from transformers.models.bert.configuration_bert import *\n", + "import torch\n", + "config = BertConfig.from_pretrained(\"bert-base-uncased\")\n", + "bert_pooler = BertPooler(config=config)\n", + "print(\"input to bert pooler size: {}\".format(config.hidden_size))\n", + "batch_size = 1\n", + "seq_len = 2\n", + "hidden_size = 768\n", + "x = torch.rand(batch_size, seq_len, hidden_size)\n", + "y = bert_pooler(x)\n", + "print(y.size())" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, { "cell_type": "markdown", "metadata": {}, diff --git a/docs/篇章3-编写一个Transformer模型:BERT/3.1-如何实现一个BERT.md b/docs/篇章3-编写一个Transformer模型:BERT/3.1-如何实现一个BERT.md index 54e61a6..4a14444 100644 --- a/docs/篇章3-编写一个Transformer模型:BERT/3.1-如何实现一个BERT.md +++ b/docs/篇章3-编写一个Transformer模型:BERT/3.1-如何实现一个BERT.md @@ -1,9 +1,13 @@ ## 前言 +本文包含大量源码和讲解,通过段落和横线分割了各个模块,同时网站配备了侧边栏,帮助大家在各个小节中快速跳转,希望大家阅读完能对BERT有深刻的了解。同时建议通过pycharm、vscode等工具对bert源码进行单步调试,调试到对应的模块再对比看本章节的讲解。 + +涉及到的jupyter可以在[代码库:篇章3-编写一个Transformer模型:BERT,下载](https://github.com/datawhalechina/learn-nlp-with-transformers/tree/main/docs/%E7%AF%87%E7%AB%A03-%E7%BC%96%E5%86%99%E4%B8%80%E4%B8%AATransformer%E6%A8%A1%E5%9E%8B%EF%BC%9ABERT) + 本篇章将基于H[HuggingFace/Transformers, 
48.9k Star](https://github.com/huggingface/transformers)进行学习。本章节的全部代码在[huggingface bert,注意由于版本更新较快,可能存在差别,请以4.4.2版本为准](https://github.com/huggingface/transformers/tree/master/src/transformers/models/bert)HuggingFace 是一家总部位于纽约的聊天机器人初创服务商,很早就捕捉到 BERT 大潮流的信号并着手实现基于 pytorch 的 BERT 模型。这一项目最初名为 pytorch-pretrained-bert,在复现了原始效果的同时,提供了易用的方法以方便在这一强大模型的基础上进行各种玩耍和研究。 随着使用人数的增加,这一项目也发展成为一个较大的开源社区,合并了各种预训练语言模型以及增加了 Tensorflow 的实现,并且在 2019 年下半年改名为 Transformers。截止写文章时(2021 年 3 月 30 日)这一项目已经拥有 43k+ 的star,可以说 Transformers 已经成为事实上的 NLP 基本工具。 -## Pytorch版本BERT学习 +## 本小节主要内容 ![图:BERT结构](./pictures/3-6-bert.png) 图:BERT结构,来源IrEne: Interpretable Energy Prediction for Transformers 本文基于 Transformers 版本 4.4.2(2021 年 3 月 19 日发布)项目中,pytorch 版的 BERT 相关代码,从代码结构、具体实现与原理,以及使用的角度进行分析。 @@ -16,10 +20,431 @@ - BertAttention - BertIntermediate - BertOutput + - BertPooler + +*** +## 1-Tokenization分词-BertTokenizer +和BERT 有关的 Tokenizer 主要写在[`models/bert/tokenization_bert.py`](https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/tokenization_bert.py)中。 + + +```python +import collections +import os +import unicodedata +from typing import List, Optional, Tuple + +from transformers.tokenization_utils import PreTrainedTokenizer, _is_control, _is_punctuation, _is_whitespace +from transformers.utils import logging + + +logger = logging.get_logger(__name__) + +VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"} + +PRETRAINED_VOCAB_FILES_MAP = { + "vocab_file": { + "bert-base-uncased": "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt", + } +} + +PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { + "bert-base-uncased": 512, +} + +PRETRAINED_INIT_CONFIGURATION = { + "bert-base-uncased": {"do_lower_case": True}, +} + + +def load_vocab(vocab_file): + """Loads a vocabulary file into a dictionary.""" + vocab = collections.OrderedDict() + with open(vocab_file, "r", encoding="utf-8") as reader: + tokens = reader.readlines() + for index, token in enumerate(tokens): + token = token.rstrip("\n") + vocab[token] = index + return vocab + + +def whitespace_tokenize(text): + """Runs basic whitespace cleaning and splitting on a piece of text.""" + text = text.strip() + if not text: + return [] + tokens = text.split() + return tokens + + +class BertTokenizer(PreTrainedTokenizer): + + vocab_files_names = VOCAB_FILES_NAMES + pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP + pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION + max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES + + def __init__( + self, + vocab_file, + do_lower_case=True, + do_basic_tokenize=True, + never_split=None, + unk_token="[UNK]", + sep_token="[SEP]", + pad_token="[PAD]", + cls_token="[CLS]", + mask_token="[MASK]", + tokenize_chinese_chars=True, + strip_accents=None, + **kwargs + ): + super().__init__( + do_lower_case=do_lower_case, + do_basic_tokenize=do_basic_tokenize, + never_split=never_split, + unk_token=unk_token, + sep_token=sep_token, + pad_token=pad_token, + cls_token=cls_token, + mask_token=mask_token, + tokenize_chinese_chars=tokenize_chinese_chars, + strip_accents=strip_accents, + **kwargs, + ) + + if not os.path.isfile(vocab_file): + raise ValueError( + f"Can't find a vocabulary file at path '{vocab_file}'. 
To load the vocabulary from a Google pretrained " + "model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`" + ) + self.vocab = load_vocab(vocab_file) + self.ids_to_tokens = collections.OrderedDict([(ids, tok) for tok, ids in self.vocab.items()]) + self.do_basic_tokenize = do_basic_tokenize + if do_basic_tokenize: + self.basic_tokenizer = BasicTokenizer( + do_lower_case=do_lower_case, + never_split=never_split, + tokenize_chinese_chars=tokenize_chinese_chars, + strip_accents=strip_accents, + ) + self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token) + + @property + def do_lower_case(self): + return self.basic_tokenizer.do_lower_case + + @property + def vocab_size(self): + return len(self.vocab) + + def get_vocab(self): + return dict(self.vocab, **self.added_tokens_encoder) + + def _tokenize(self, text): + split_tokens = [] + if self.do_basic_tokenize: + for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens): + + # If the token is part of the never_split set + if token in self.basic_tokenizer.never_split: + split_tokens.append(token) + else: + split_tokens += self.wordpiece_tokenizer.tokenize(token) + else: + split_tokens = self.wordpiece_tokenizer.tokenize(text) + return split_tokens + + def _convert_token_to_id(self, token): + """Converts a token (str) in an id using the vocab.""" + return self.vocab.get(token, self.vocab.get(self.unk_token)) + + def _convert_id_to_token(self, index): + """Converts an index (integer) in a token (str) using the vocab.""" + return self.ids_to_tokens.get(index, self.unk_token) + + def convert_tokens_to_string(self, tokens): + """Converts a sequence of tokens (string) in a single string.""" + out_string = " ".join(tokens).replace(" ##", "").strip() + return out_string + + def build_inputs_with_special_tokens( + self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None + ) -> List[int]: + """ + Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and + adding special tokens. A BERT sequence has the following format: + - single sequence: ``[CLS] X [SEP]`` + - pair of sequences: ``[CLS] A [SEP] B [SEP]`` + Args: + token_ids_0 (:obj:`List[int]`): + List of IDs to which the special tokens will be added. + token_ids_1 (:obj:`List[int]`, `optional`): + Optional second list of IDs for sequence pairs. + Returns: + :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens. + """ + if token_ids_1 is None: + return [self.cls_token_id] + token_ids_0 + [self.sep_token_id] + cls = [self.cls_token_id] + sep = [self.sep_token_id] + return cls + token_ids_0 + sep + token_ids_1 + sep + + def get_special_tokens_mask( + self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False + ) -> List[int]: + """ + Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding + special tokens using the tokenizer ``prepare_for_model`` method. + Args: + token_ids_0 (:obj:`List[int]`): + List of IDs. + token_ids_1 (:obj:`List[int]`, `optional`): + Optional second list of IDs for sequence pairs. + already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether or not the token list is already formatted with special tokens for the model. + Returns: + :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token. 
+ """ + + if already_has_special_tokens: + return super().get_special_tokens_mask( + token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True + ) + + if token_ids_1 is not None: + return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1] + return [1] + ([0] * len(token_ids_0)) + [1] + + def create_token_type_ids_from_sequences( + self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None + ) -> List[int]: + """ + Create a mask from the two sequences passed to be used in a sequence-pair classification task. A BERT sequence + pair mask has the following format: + :: + 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 + | first sequence | second sequence | + If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s). + Args: + token_ids_0 (:obj:`List[int]`): + List of IDs. + token_ids_1 (:obj:`List[int]`, `optional`): + Optional second list of IDs for sequence pairs. + Returns: + :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given + sequence(s). + """ + sep = [self.sep_token_id] + cls = [self.cls_token_id] + if token_ids_1 is None: + return len(cls + token_ids_0 + sep) * [0] + return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1] + + def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]: + index = 0 + if os.path.isdir(save_directory): + vocab_file = os.path.join( + save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"] + ) + else: + vocab_file = (filename_prefix + "-" if filename_prefix else "") + save_directory + with open(vocab_file, "w", encoding="utf-8") as writer: + for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]): + if index != token_index: + logger.warning( + f"Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive." + " Please check that the vocabulary is not corrupted!" + ) + index = token_index + writer.write(token + "\n") + index += 1 + return (vocab_file,) + + +class BasicTokenizer(object): + + def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=True, strip_accents=None): + if never_split is None: + never_split = [] + self.do_lower_case = do_lower_case + self.never_split = set(never_split) + self.tokenize_chinese_chars = tokenize_chinese_chars + self.strip_accents = strip_accents + + def tokenize(self, text, never_split=None): + """ + Basic Tokenization of a piece of text. Split on "white spaces" only, for sub-word tokenization, see + WordPieceTokenizer. + Args: + **never_split**: (`optional`) list of str + Kept for backward compatibility purposes. Now implemented directly at the base class level (see + :func:`PreTrainedTokenizer.tokenize`) List of token not to split. + """ + # union() returns a new set by concatenating the two sets. + never_split = self.never_split.union(set(never_split)) if never_split else self.never_split + text = self._clean_text(text) + + # This was added on November 1st, 2018 for the multilingual and Chinese + # models. This is also applied to the English models now, but it doesn't + # matter since the English models were not trained on any Chinese data + # and generally don't have any Chinese data in them (there are Chinese + # characters in the vocabulary because Wikipedia does have some Chinese + # words in the English Wikipedia.). 
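+        # Note: when tokenize_chinese_chars is True, _tokenize_chinese_chars below pads every CJK
+        # codepoint with spaces, so each Chinese character is later split off as its own token.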
+ if self.tokenize_chinese_chars: + text = self._tokenize_chinese_chars(text) + orig_tokens = whitespace_tokenize(text) + split_tokens = [] + for token in orig_tokens: + if token not in never_split: + if self.do_lower_case: + token = token.lower() + if self.strip_accents is not False: + token = self._run_strip_accents(token) + elif self.strip_accents: + token = self._run_strip_accents(token) + split_tokens.extend(self._run_split_on_punc(token, never_split)) + + output_tokens = whitespace_tokenize(" ".join(split_tokens)) + return output_tokens + + def _run_strip_accents(self, text): + """Strips accents from a piece of text.""" + text = unicodedata.normalize("NFD", text) + output = [] + for char in text: + cat = unicodedata.category(char) + if cat == "Mn": + continue + output.append(char) + return "".join(output) + + def _run_split_on_punc(self, text, never_split=None): + """Splits punctuation on a piece of text.""" + if never_split is not None and text in never_split: + return [text] + chars = list(text) + i = 0 + start_new_word = True + output = [] + while i < len(chars): + char = chars[i] + if _is_punctuation(char): + output.append([char]) + start_new_word = True + else: + if start_new_word: + output.append([]) + start_new_word = False + output[-1].append(char) + i += 1 + + return ["".join(x) for x in output] + + def _tokenize_chinese_chars(self, text): + """Adds whitespace around any CJK character.""" + output = [] + for char in text: + cp = ord(char) + if self._is_chinese_char(cp): + output.append(" ") + output.append(char) + output.append(" ") + else: + output.append(char) + return "".join(output) + + def _is_chinese_char(self, cp): + """Checks whether CP is the codepoint of a CJK character.""" + # This defines a "chinese character" as anything in the CJK Unicode block: + # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) + # + # Note that the CJK Unicode block is NOT all Japanese and Korean characters, + # despite its name. The modern Korean Hangul alphabet is a different block, + # as is Japanese Hiragana and Katakana. Those alphabets are used to write + # space-separated words, so they are not treated specially and handled + # like the all of the other languages. + if ( + (cp >= 0x4E00 and cp <= 0x9FFF) + or (cp >= 0x3400 and cp <= 0x4DBF) # + or (cp >= 0x20000 and cp <= 0x2A6DF) # + or (cp >= 0x2A700 and cp <= 0x2B73F) # + or (cp >= 0x2B740 and cp <= 0x2B81F) # + or (cp >= 0x2B820 and cp <= 0x2CEAF) # + or (cp >= 0xF900 and cp <= 0xFAFF) + or (cp >= 0x2F800 and cp <= 0x2FA1F) # + ): # + return True + + return False + + def _clean_text(self, text): + """Performs invalid character removal and whitespace cleanup on text.""" + output = [] + for char in text: + cp = ord(char) + if cp == 0 or cp == 0xFFFD or _is_control(char): + continue + if _is_whitespace(char): + output.append(" ") + else: + output.append(char) + return "".join(output) + + +class WordpieceTokenizer(object): + """Runs WordPiece tokenization.""" + + def __init__(self, vocab, unk_token, max_input_chars_per_word=100): + self.vocab = vocab + self.unk_token = unk_token + self.max_input_chars_per_word = max_input_chars_per_word + + def tokenize(self, text): + """ + Tokenizes a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform + tokenization using the given vocabulary. + For example, :obj:`input = "unaffable"` wil return as output :obj:`["un", "##aff", "##able"]`. + Args: + text: A single token or whitespace separated tokens. 
This should have + already been passed through `BasicTokenizer`. + Returns: + A list of wordpiece tokens. + """ + + output_tokens = [] + for token in whitespace_tokenize(text): + chars = list(token) + if len(chars) > self.max_input_chars_per_word: + output_tokens.append(self.unk_token) + continue + + is_bad = False + start = 0 + sub_tokens = [] + while start < len(chars): + end = len(chars) + cur_substr = None + while start < end: + substr = "".join(chars[start:end]) + if start > 0: + substr = "##" + substr + if substr in self.vocab: + cur_substr = substr + break + end -= 1 + if cur_substr is None: + is_bad = True + break + sub_tokens.append(cur_substr) + start = end + + if is_bad: + output_tokens.append(self.unk_token) + else: + output_tokens.extend(sub_tokens) + return output_tokens +``` -### 1-Tokenization分词-BertTokenizer -和BERT 有关的 Tokenizer 主要写`models/bert/tokenization_bert.py`和`models/bert/tokenization_bert_fast.py`中。 -这两份代码分别对应基本的`BertTokenizer`,以及不进行 token 到 index 映射的`BertTokenizerFast`,这里主要讲解第一个。 ``` class BertTokenizer(PreTrainedTokenizer): """ @@ -31,7 +456,7 @@ class BertTokenizer(PreTrainedTokenizer): """ ``` -`BertTokenizer` 是基于`BasicTokenizer`和W`ordPieceTokenizer`的分词器: +`BertTokenizer` 是基于`BasicTokenizer`和`WordPieceTokenizer`的分词器: - BasicTokenizer负责处理的第一步——按标点、空格等分割句子,并处理是否统一小写,以及清理非法字符。 - 对于中文字符,通过预处理(加空格)来按字分割; - 同时可以通过never_split指定对某些词不进行分割; @@ -48,15 +473,30 @@ BertTokenizer 有以下常用方法: - encode:对于单个句子输入,分解词并加入特殊词形成“[CLS], x, [SEP]”的结构并转换为词表对应下标的列表;对于两个句子输入(多个句子只取前两个),分解词并加入特殊词形成“[CLS], x1, [SEP], x2, [SEP]”的结构并转换为下标列表; - decode:可以将 encode 方法的输出变为完整句子。 以及,类自身的方法: -``` -from transformers import BertTokenizer -bt = BertTokenizer.from_pretrained('./bert-base-uncased/') + + +```python +bt = BertTokenizer.from_pretrained('bert-base-uncased') bt('I like natural language progressing!') -{'input_ids': [101, 1045, 2066, 3019, 2653, 27673, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]} +# {'input_ids': [101, 1045, 2066, 3019, 2653, 27673, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]} ``` -### 2-Model-BertModel -和 BERT 模型有关的代码主要写在`/models/bert/modeling_bert.py`中,这一份代码有一千多行,包含 BERT 模型的基本结构和基于它的微调模型等。 + Downloading: 100%|██████████| 232k/232k [00:00<00:00, 698kB/s] + Downloading: 100%|██████████| 28.0/28.0 [00:00<00:00, 11.1kB/s] + Downloading: 100%|██████████| 466k/466k [00:00<00:00, 863kB/s] + + + + + + {'input_ids': [101, 1045, 2066, 3019, 2653, 27673, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]} + + + +*** +## 2-Model-BertModel +和 BERT 模型有关的代码主要写在[`/models/bert/modeling_bert.py`](https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/modeling_bert.py)中,这一份代码有一千多行,包含 BERT 模型的基本结构和基于它的微调模型等。 + 下面从 BERT 模型本体入手分析: ``` class BertModel(BertPreTrainedModel): @@ -163,7 +603,184 @@ def forward( ** 剪枝是一个复杂的操作,需要将保留的注意力头部分的 Wq、Kq、Vq 和拼接后全连接部分的权重拷贝到一个新的较小的权重矩阵(注意先禁止 grad 再拷贝),并实时记录被剪掉的头以防下标出错。具体参考BertAttention部分的prune_heads方法.** -#### 2.1-BertEmbeddings + +```python +from transformers.models.bert.modeling_bert import * +class BertModel(BertPreTrainedModel): + """ + The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of + cross-attention is added between the self-attention layers, following the architecture described in `Attention is + all you need `__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, + Llion Jones, Aidan N. 
Gomez, Lukasz Kaiser and Illia Polosukhin. + To behave as an decoder the model needs to be initialized with the :obj:`is_decoder` argument of the configuration + set to :obj:`True`. To be used in a Seq2Seq model, the model needs to initialized with both :obj:`is_decoder` + argument and :obj:`add_cross_attention` set to :obj:`True`; an :obj:`encoder_hidden_states` is then expected as an + input to the forward pass. + """ + + def __init__(self, config, add_pooling_layer=True): + super().__init__(config) + self.config = config + + self.embeddings = BertEmbeddings(config) + self.encoder = BertEncoder(config) + + self.pooler = BertPooler(config) if add_pooling_layer else None + + self.init_weights() + + def get_input_embeddings(self): + return self.embeddings.word_embeddings + + def set_input_embeddings(self, value): + self.embeddings.word_embeddings = value + + def _prune_heads(self, heads_to_prune): + """ + Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base + class PreTrainedModel + """ + for layer, heads in heads_to_prune.items(): + self.encoder.layer[layer].attention.prune_heads(heads) + + @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @add_code_sample_docstrings( + tokenizer_class=_TOKENIZER_FOR_DOC, + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=BaseModelOutputWithPoolingAndCrossAttentions, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + position_ids=None, + head_mask=None, + inputs_embeds=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + past_key_values=None, + use_cache=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`): + Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if + the model is configured as a decoder. + encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): + Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in + the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``: + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`): + Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. + If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids` + (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)` + instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`. + use_cache (:obj:`bool`, `optional`): + If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up + decoding (see :obj:`past_key_values`). 
+ """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if self.config.is_decoder: + use_cache = use_cache if use_cache is not None else self.config.use_cache + else: + use_cache = False + + if input_ids is not None and inputs_embeds is not None: + raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") + elif input_ids is not None: + input_shape = input_ids.size() + batch_size, seq_length = input_shape + elif inputs_embeds is not None: + input_shape = inputs_embeds.size()[:-1] + batch_size, seq_length = input_shape + else: + raise ValueError("You have to specify either input_ids or inputs_embeds") + + device = input_ids.device if input_ids is not None else inputs_embeds.device + + # past_key_values_length + past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0 + + if attention_mask is None: + attention_mask = torch.ones(((batch_size, seq_length + past_key_values_length)), device=device) + + if token_type_ids is None: + if hasattr(self.embeddings, "token_type_ids"): + buffered_token_type_ids = self.embeddings.token_type_ids[:, :seq_length] + buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length) + token_type_ids = buffered_token_type_ids_expanded + else: + token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device) + + # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length] + # ourselves in which case we just need to make it broadcastable to all heads. 
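+        # get_extended_attention_mask (inherited from ModuleUtilsMixin) reshapes the 2D mask to
+        # (batch_size, 1, 1, seq_length) and maps 1/0 to 0.0/-10000.0, so it can simply be added
+        # to the raw attention scores before the softmax.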
+ extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device) + + # If a 2D or 3D attention mask is provided for the cross-attention + # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length] + if self.config.is_decoder and encoder_hidden_states is not None: + encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size() + encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length) + if encoder_attention_mask is None: + encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device) + encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask) + else: + encoder_extended_attention_mask = None + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape bsz x n_heads x N x N + # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] + # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length] + head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers) + + embedding_output = self.embeddings( + input_ids=input_ids, + position_ids=position_ids, + token_type_ids=token_type_ids, + inputs_embeds=inputs_embeds, + past_key_values_length=past_key_values_length, + ) + encoder_outputs = self.encoder( + embedding_output, + attention_mask=extended_attention_mask, + head_mask=head_mask, + encoder_hidden_states=encoder_hidden_states, + encoder_attention_mask=encoder_extended_attention_mask, + past_key_values=past_key_values, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + sequence_output = encoder_outputs[0] + pooled_output = self.pooler(sequence_output) if self.pooler is not None else None + + if not return_dict: + return (sequence_output, pooled_output) + encoder_outputs[1:] + + return BaseModelOutputWithPoolingAndCrossAttentions( + last_hidden_state=sequence_output, + pooler_output=pooled_output, + past_key_values=encoder_outputs.past_key_values, + hidden_states=encoder_outputs.hidden_states, + attentions=encoder_outputs.attentions, + cross_attentions=encoder_outputs.cross_attentions, + ) +``` + +*** +### 2.1-BertEmbeddings 包含三个部分求和得到: ![Bert-embedding](./pictures/3-0-embedding.png) 图:Bert-embedding @@ -175,7 +792,70 @@ def forward( ** [这里为什么要用 LayerNorm+Dropout 呢?为什么要用 LayerNorm 而不是 BatchNorm?可以参考一个不错的回答:transformer 为什么使用 layer normalization,而不是其他的归一化方法?](https://www.zhihu.com/question/395811291/answer/1260290120)** -#### 2.2-BertEncoder + +```python +class BertEmbeddings(nn.Module): + """Construct the embeddings from word, position and token_type embeddings.""" + + def __init__(self, config): + super().__init__() + self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id) + self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size) + self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size) + + # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load + # any TensorFlow checkpoint file + self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + # position_ids (1, len position emb) is contiguous in memory and exported when serialized + self.position_embedding_type = getattr(config, "position_embedding_type", "absolute") + 
self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1))) + if version.parse(torch.__version__) > version.parse("1.6.0"): + self.register_buffer( + "token_type_ids", + torch.zeros(self.position_ids.size(), dtype=torch.long, device=self.position_ids.device), + persistent=False, + ) + + def forward( + self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None, past_key_values_length=0 + ): + if input_ids is not None: + input_shape = input_ids.size() + else: + input_shape = inputs_embeds.size()[:-1] + + seq_length = input_shape[1] + + if position_ids is None: + position_ids = self.position_ids[:, past_key_values_length : seq_length + past_key_values_length] + + # Setting the token_type_ids to the registered buffer in constructor where it is all zeros, which usually occurs + # when its auto-generated, registered buffer helps users when tracing the model without passing token_type_ids, solves + # issue #5664 + if token_type_ids is None: + if hasattr(self, "token_type_ids"): + buffered_token_type_ids = self.token_type_ids[:, :seq_length] + buffered_token_type_ids_expanded = buffered_token_type_ids.expand(input_shape[0], seq_length) + token_type_ids = buffered_token_type_ids_expanded + else: + token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device) + + if inputs_embeds is None: + inputs_embeds = self.word_embeddings(input_ids) + token_type_embeddings = self.token_type_embeddings(token_type_ids) + + embeddings = inputs_embeds + token_type_embeddings + if self.position_embedding_type == "absolute": + position_embeddings = self.position_embeddings(position_ids) + embeddings += position_embeddings + embeddings = self.LayerNorm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings +``` + +*** +### 2.2-BertEncoder 包含多层 BertLayer,这一块本身没有特别需要说明的地方,不过有一个细节值得参考:利用 gradient checkpointing 技术以降低训练时的显存占用。 @@ -186,62 +866,107 @@ def forward( 再往深一层走,就进入了 Encoder 的某一层: -##### 2.2.1 BertLayer -这一层包装了 BertAttention 和 BertIntermediate+BertOutput(即 Attention 后的 FFN 部分),以及这里直接忽略的 cross-attention 部分(将 BERT 作为 Decoder 时涉及的部分)。 -理论上,这里顺序调用三个子模块就可以,没有什么值得说明的地方。 +```python +class BertEncoder(nn.Module): + def __init__(self, config): + super().__init__() + self.config = config + self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)]) -然而这里又出现了一个细节: -``` - # 这是forward的一部分 - self_attention_outputs = self.attention( - hidden_states, - attention_mask, - head_mask, - output_attentions=output_attentions, - past_key_value=self_attn_past_key_value, + def forward( + self, + hidden_states, + attention_mask=None, + head_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + past_key_values=None, + use_cache=None, + output_attentions=False, + output_hidden_states=False, + return_dict=True, + ): + all_hidden_states = () if output_hidden_states else None + all_self_attentions = () if output_attentions else None + all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None + + next_decoder_cache = () if use_cache else None + for i, layer_module in enumerate(self.layer): + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + layer_head_mask = head_mask[i] if head_mask is not None else None + past_key_value = past_key_values[i] if past_key_values is not None else None + + if getattr(self.config, "gradient_checkpointing", False) and self.training: + + if use_cache: + logger.warning( + "`use_cache=True` is 
incompatible with `config.gradient_checkpointing=True`. Setting " + "`use_cache=False`..." + ) + use_cache = False + + def create_custom_forward(module): + def custom_forward(*inputs): + return module(*inputs, past_key_value, output_attentions) + + return custom_forward + + layer_outputs = torch.utils.checkpoint.checkpoint( + create_custom_forward(layer_module), + hidden_states, + attention_mask, + layer_head_mask, + encoder_hidden_states, + encoder_attention_mask, + ) + else: + layer_outputs = layer_module( + hidden_states, + attention_mask, + layer_head_mask, + encoder_hidden_states, + encoder_attention_mask, + past_key_value, + output_attentions, + ) + + hidden_states = layer_outputs[0] + if use_cache: + next_decoder_cache += (layer_outputs[-1],) + if output_attentions: + all_self_attentions = all_self_attentions + (layer_outputs[1],) + if self.config.add_cross_attention: + all_cross_attentions = all_cross_attentions + (layer_outputs[2],) + + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if not return_dict: + return tuple( + v + for v in [ + hidden_states, + next_decoder_cache, + all_hidden_states, + all_self_attentions, + all_cross_attentions, + ] + if v is not None + ) + return BaseModelOutputWithPastAndCrossAttentions( + last_hidden_state=hidden_states, + past_key_values=next_decoder_cache, + hidden_states=all_hidden_states, + attentions=all_self_attentions, + cross_attentions=all_cross_attentions, ) - outputs = self_attention_outputs[1:] # add self attentions if we output attention weights - - # 中间省略一部分…… - - layer_output = apply_chunking_to_forward( - self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output - ) - outputs = (layer_output,) + outputs - - # 省略一部分…… - - return outputs - - # 这是feed_forward_chunk的部分 - def feed_forward_chunk(self, attention_output): - intermediate_output = self.intermediate(attention_output) - layer_output = self.output(intermediate_output, attention_output) - return layer_output -``` -看到上面那个`apply_chunking_to_forward`和`feed_forward_chunk`了吗(为什么要整这么复杂,直接调用它不香吗)? -那么这个`apply_chunking_to_forward`到底是啥?深入看看: -``` -def apply_chunking_to_forward( - forward_fn: Callable[..., torch.Tensor], chunk_size: int, chunk_dim: int, *input_tensors -) -> torch.Tensor: - """ - This function chunks the :obj:`input_tensors` into smaller input tensor parts of size :obj:`chunk_size` over the - dimension :obj:`chunk_dim`. It then applies a layer :obj:`forward_fn` to each chunk independently to save memory. - - If the :obj:`forward_fn` is independent across the :obj:`chunk_dim` this function will yield the same result as - directly applying :obj:`forward_fn` to :obj:`input_tensors`. - ... 
- """ ``` -原来又是一个节约显存的技术——包装了一个切分小 batch 或者低维数操作的功能:这里参数chunk_size其实就是切分的 batch 大小,而chunk_dim就是一次计算维数的大小,最后拼接起来返回。 -不过,在默认操作中不会特意设置这两个值(在源代码中默认为 0 和 1),所以会直接等效于正常的 forward 过程。 - -继续往下深入,就是 Transformer 的核心:BertAttention 部分,以及紧随其后的 FFN 部分。 - -##### 2.2.1.1 BertAttention +*** +#### 2.2.1.1 BertAttention 本以为 attention 的实现就在这里,没想到还要再下一层……其中,self 成员就是多头注意力的实现,而 output 成员实现 attention 后的全连接 +dropout+residual+LayerNorm 一系列操作。 @@ -279,6 +1004,58 @@ class BertAttention(nn.Module): - `prune_linear_layer`则负责将 Wk/Wq/Wv 权重矩阵(连同 bias)中按照 index 保留没有被剪枝的维度后转移到新的矩阵。 接下来就到重头戏——Self-Attention 的具体实现。 + +```python +class BertAttention(nn.Module): + def __init__(self, config): + super().__init__() + self.self = BertSelfAttention(config) + self.output = BertSelfOutput(config) + self.pruned_heads = set() + + def prune_heads(self, heads): + if len(heads) == 0: + return + heads, index = find_pruneable_heads_and_indices( + heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads + ) + + # Prune linear layers + self.self.query = prune_linear_layer(self.self.query, index) + self.self.key = prune_linear_layer(self.self.key, index) + self.self.value = prune_linear_layer(self.self.value, index) + self.output.dense = prune_linear_layer(self.output.dense, index, dim=1) + + # Update hyper params and store pruned heads + self.self.num_attention_heads = self.self.num_attention_heads - len(heads) + self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads + self.pruned_heads = self.pruned_heads.union(heads) + + def forward( + self, + hidden_states, + attention_mask=None, + head_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + past_key_value=None, + output_attentions=False, + ): + self_outputs = self.self( + hidden_states, + attention_mask, + head_mask, + encoder_hidden_states, + encoder_attention_mask, + past_key_value, + output_attentions, + ) + attention_output = self.output(self_outputs[0], hidden_states) + outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them + return outputs +``` + +*** ##### 2.2.1.1.1 BertSelfAttention **预警:这一块可以说是模型的核心区域,也是唯一涉及到公式的地方,所以将贴出大量代码。** @@ -495,6 +1272,135 @@ OK,这里涉及到 `BertModel` 的继承细节了:`BertModel`继承自`BertP - context_layer 即 attention 矩阵与 value 矩阵的乘积,原始的大小为:(batch_size, num_attention_heads, sequence_length, attention_head_size) ; - context_layer 进行转置和 view 操作以后,形状就恢复了(batch_size, sequence_length, hidden_size)。 + + +```python +class BertSelfAttention(nn.Module): + def __init__(self, config): + super().__init__() + if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"): + raise ValueError( + f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention " + f"heads ({config.num_attention_heads})" + ) + + self.num_attention_heads = config.num_attention_heads + self.attention_head_size = int(config.hidden_size / config.num_attention_heads) + self.all_head_size = self.num_attention_heads * self.attention_head_size + + self.query = nn.Linear(config.hidden_size, self.all_head_size) + self.key = nn.Linear(config.hidden_size, self.all_head_size) + self.value = nn.Linear(config.hidden_size, self.all_head_size) + + self.dropout = nn.Dropout(config.attention_probs_dropout_prob) + self.position_embedding_type = getattr(config, "position_embedding_type", "absolute") + if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query": + self.max_position_embeddings = 
config.max_position_embeddings + self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size) + + self.is_decoder = config.is_decoder + + def transpose_for_scores(self, x): + new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size) + x = x.view(*new_x_shape) + return x.permute(0, 2, 1, 3) + + def forward( + self, + hidden_states, + attention_mask=None, + head_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + past_key_value=None, + output_attentions=False, + ): + mixed_query_layer = self.query(hidden_states) + + # If this is instantiated as a cross-attention module, the keys + # and values come from an encoder; the attention mask needs to be + # such that the encoder's padding tokens are not attended to. + is_cross_attention = encoder_hidden_states is not None + + if is_cross_attention and past_key_value is not None: + # reuse k,v, cross_attentions + key_layer = past_key_value[0] + value_layer = past_key_value[1] + attention_mask = encoder_attention_mask + elif is_cross_attention: + key_layer = self.transpose_for_scores(self.key(encoder_hidden_states)) + value_layer = self.transpose_for_scores(self.value(encoder_hidden_states)) + attention_mask = encoder_attention_mask + elif past_key_value is not None: + key_layer = self.transpose_for_scores(self.key(hidden_states)) + value_layer = self.transpose_for_scores(self.value(hidden_states)) + key_layer = torch.cat([past_key_value[0], key_layer], dim=2) + value_layer = torch.cat([past_key_value[1], value_layer], dim=2) + else: + key_layer = self.transpose_for_scores(self.key(hidden_states)) + value_layer = self.transpose_for_scores(self.value(hidden_states)) + + query_layer = self.transpose_for_scores(mixed_query_layer) + + if self.is_decoder: + # if cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states. + # Further calls to cross_attention layer can then reuse all cross-attention + # key/value_states (first "if" case) + # if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of + # all previous decoder key/value_states. Further calls to uni-directional self-attention + # can concat previous decoder key/value_states to current projected key/value_states (third "elif" case) + # if encoder bi-directional self-attention `past_key_value` is always `None` + past_key_value = (key_layer, value_layer) + + # Take the dot product between "query" and "key" to get the raw attention scores. 
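+        # query_layer / key_layer: (batch_size, num_attention_heads, seq_length, attention_head_size);
+        # attention_scores: (batch_size, num_attention_heads, seq_length, seq_length)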
+ attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2)) + + if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query": + seq_length = hidden_states.size()[1] + position_ids_l = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(-1, 1) + position_ids_r = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(1, -1) + distance = position_ids_l - position_ids_r + positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1) + positional_embedding = positional_embedding.to(dtype=query_layer.dtype) # fp16 compatibility + + if self.position_embedding_type == "relative_key": + relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding) + attention_scores = attention_scores + relative_position_scores + elif self.position_embedding_type == "relative_key_query": + relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding) + relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding) + attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key + + attention_scores = attention_scores / math.sqrt(self.attention_head_size) + if attention_mask is not None: + # Apply the attention mask is (precomputed for all layers in BertModel forward() function) + attention_scores = attention_scores + attention_mask + + # Normalize the attention scores to probabilities. + attention_probs = nn.Softmax(dim=-1)(attention_scores) + + # This is actually dropping out entire tokens to attend to, which might + # seem a bit unusual, but is taken from the original Transformer paper. + attention_probs = self.dropout(attention_probs) + + # Mask heads if we want to + if head_mask is not None: + attention_probs = attention_probs * head_mask + + context_layer = torch.matmul(attention_probs, value_layer) + + context_layer = context_layer.permute(0, 2, 1, 3).contiguous() + new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,) + context_layer = context_layer.view(*new_context_layer_shape) + + outputs = (context_layer, attention_probs) if output_attentions else (context_layer,) + + if self.is_decoder: + outputs = outputs + (past_key_value,) + return outputs +``` + +*** ##### 2.2.1.1.2 BertSelfOutput ``` class BertSelfOutput(nn.Module): @@ -513,7 +1419,25 @@ class BertSelfOutput(nn.Module): **这里又出现了 LayerNorm 和 Dropout 的组合,只不过这里是先 Dropout,进行残差连接后再进行 LayerNorm。至于为什么要做残差连接,最直接的目的就是降低网络层数过深带来的训练难度,对原始输入更加敏感~** -##### 2.2.1.2 BertIntermediate + +```python + +class BertSelfOutput(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states, input_tensor): + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states +``` + +*** +#### 2.2.1.2 BertIntermediate 看完了 BertAttention,在 Attention 后面还有一个全连接+激活的操作: ``` @@ -535,7 +1459,25 @@ class BertIntermediate(nn.Module): - 这里的全连接做了一个扩展,以 bert-base 为例,扩展维度为 3072,是原始维度 768 的 4 倍之多; - 这里的激活函数默认实现为 gelu(Gaussian Error Linerar Units(GELUS)当然,它是无法直接计算的,可以用一个包含tanh的表达式进行近似(略)。 -##### 2.2.1.3 BertOutput + +```python +class BertIntermediate(nn.Module): 
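+    # Position-wise feed-forward: expands hidden_size to intermediate_size (768 -> 3072 for bert-base)
+    # and applies the activation (GELU by default); BertOutput then projects back to hidden_size.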
+ def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.intermediate_size) + if isinstance(config.hidden_act, str): + self.intermediate_act_fn = ACT2FN[config.hidden_act] + else: + self.intermediate_act_fn = config.hidden_act + + def forward(self, hidden_states): + hidden_states = self.dense(hidden_states) + hidden_states = self.intermediate_act_fn(hidden_states) + return hidden_states +``` + +*** +#### 2.2.1.3 BertOutput 在这里又是一个全连接 +dropout+LayerNorm,还有一个残差连接 residual connect: ``` @@ -556,7 +1498,24 @@ class BertOutput(nn.Module): 这里的操作和 BertSelfOutput 不能说没有关系,只能说一模一样…… 非常容易混淆的两个组件。 以下内容还包含基于 BERT 的应用模型,以及 BERT 相关的优化器和用法,将在下一篇文章作详细介绍。 -##### 2.2.3 BertPooler + +```python +class BertOutput(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.intermediate_size, config.hidden_size) + self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states, input_tensor): + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states +``` + +*** +### 2.2.3 BertPooler 这一层只是简单地取出了句子的第一个token,即`[CLS]`对应的向量,然后过一个全连接层和一个激活函数后输出:(这一部分是可选的,因为pooling有很多不同的操作) ``` @@ -575,7 +1534,44 @@ class BertPooler(nn.Module): return pooled_output ``` -### 小总结 + +```python +class BertPooler(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.activation = nn.Tanh() + + def forward(self, hidden_states): + # We "pool" the model by simply taking the hidden state corresponding + # to the first token. + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(first_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output +from transformers.models.bert.configuration_bert import * +import torch +config = BertConfig.from_pretrained("bert-base-uncased") +bert_pooler = BertPooler(config=config) +print("input to bert pooler size: {}".format(config.hidden_size)) +batch_size = 1 +seq_len = 2 +hidden_size = 768 +x = torch.rand(batch_size, seq_len, hidden_size) +y = bert_pooler(x) +print(y.size()) +``` + + input to bert pooler size: 768 + torch.Size([1, 768]) + + + +```python + +``` + +## 小总结 本小节对Bert模型的实现进行分析了学习,希望读者能对Bert实现有一个更为细致的把握。 值得注意的是,在 HuggingFace 实现的 Bert 模型中,使用了多种节约显存的技术: diff --git a/docs/篇章3-编写一个Transformer模型:BERT/3.2-如何应用一个BERT.ipynb b/docs/篇章3-编写一个Transformer模型:BERT/3.2-如何应用一个BERT.ipynb index e69de29..07a4899 100644 --- a/docs/篇章3-编写一个Transformer模型:BERT/3.2-如何应用一个BERT.ipynb +++ b/docs/篇章3-编写一个Transformer模型:BERT/3.2-如何应用一个BERT.ipynb @@ -0,0 +1,1415 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 前言\n", + "接着上一小节,我们对Huggingface开源代码库中的Bert模型进行了深入学习,这一节我们对如何应用BERT进行详细的讲解。\n", + "\n", + "涉及到的jupyter可以在[代码库:篇章3-编写一个Transformer模型:BERT,下载](https://github.com/datawhalechina/learn-nlp-with-transformers/tree/main/docs/%E7%AF%87%E7%AB%A03-%E7%BC%96%E5%86%99%E4%B8%80%E4%B8%AATransformer%E6%A8%A1%E5%9E%8B%EF%BC%9ABERT)\n", + "\n", + "本文基于 Transformers 版本 4.4.2(2021 年 3 月 19 日发布)项目中,pytorch 版的 BERT 相关代码,从代码结构、具体实现与原理,以及使用的角度进行分析,包含以下内容:\n", + "\n", + "3. BERT-based Models应用模型\n", + "4. BERT训练和优化\n", + "5. Bert解决NLP任务\n", + " - BertForSequenceClassification\n", + " - BertForMultiChoice\n", + " - BertForTokenClassification\n", + " - BertForQuestionAnswering\n", + "6. 
BERT训练与优化\n", + "7. Pre-Training\n", + " - Fine-Tuning\n", + " - AdamW\n", + " - Warmup\n", + "\n", + "## 3-BERT-based Models\n", + "基于 BERT 的模型都写在/models/bert/modeling_bert.py里面,包括 BERT 预训练模型和 BERT 分类等模型。\n", + "\n", + "首先,以下所有的模型都是基于`BertPreTrainedModel`这一抽象基类的,而后者则基于一个更大的基类`PreTrainedModel`。这里我们关注`BertPreTrainedModel`的功能:\n", + "\n", + "用于初始化模型权重,同时维护继承自`PreTrainedModel`的一些标记身份或者加载模型时的类变量。\n", + "下面,首先从预训练模型开始分析。" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*** \n", + "### 3.1 BertForPreTraining\n", + "\n", + "众所周知,BERT 预训练任务包括两个:\n", + "\n", + "- Masked Language Model(MLM):在句子中随机用`[MASK]`替换一部分单词,然后将句子传入 BERT 中编码每一个单词的信息,最终用`[MASK]`的编码信息预测该位置的正确单词,这一任务旨在训练模型根据上下文理解单词的意思;\n", + "- Next Sentence Prediction(NSP):将句子对 A 和 B 输入 BERT,使用`[CLS]`的编码信息进行预测 B 是否 A 的下一句,这一任务旨在训练模型理解预测句子间的关系。\n", + "\n", + "\n", + "![图Bert预训练](./pictures/3-3-bert-lm.png) 图Bert预训练\n", + "\n", + "而对应到代码中,这一融合两个任务的模型就是BertForPreTraining,其中包含两个组件:\n", + "```\n", + "class BertForPreTraining(BertPreTrainedModel):\n", + " def __init__(self, config):\n", + " super().__init__(config)\n", + "\n", + " self.bert = BertModel(config)\n", + " self.cls = BertPreTrainingHeads(config)\n", + "\n", + " self.init_weights()\n", + " # ...\n", + "```\n", + "这里的BertModel在上一章节中已经详细介绍了(注意,这里设置的是默认`add_pooling_layer=True`,即会提取`[CLS]`对应的输出用于 NSP 任务),而`BertPreTrainingHeads`则是负责两个任务的预测模块:\n", + "```\n", + "class BertPreTrainingHeads(nn.Module):\n", + " def __init__(self, config):\n", + " super().__init__()\n", + " self.predictions = BertLMPredictionHead(config)\n", + " self.seq_relationship = nn.Linear(config.hidden_size, 2)\n", + "\n", + " def forward(self, sequence_output, pooled_output):\n", + " prediction_scores = self.predictions(sequence_output)\n", + " seq_relationship_score = self.seq_relationship(pooled_output)\n", + " return prediction_scores, seq_relationship_score \n", + "```\n", + "又是一层封装:`BertPreTrainingHeads`包裹了`BertLMPredictionHead` 和一个代表 NSP 任务的线性层。这里不把 NSP 对应的任务也封装一个`BertXXXPredictionHead`。\n", + "\n", + "**其实是有封装这个类的,不过它叫做BertOnlyNSPHead,在这里用不上**\n", + "\n", + "继续下探`BertPreTrainingHeads`:\n", + "```\n", + "class BertLMPredictionHead(nn.Module):\n", + " def __init__(self, config):\n", + " super().__init__()\n", + " self.transform = BertPredictionHeadTransform(config)\n", + "\n", + " # The output weights are the same as the input embeddings, but there is\n", + " # an output-only bias for each token.\n", + " self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)\n", + "\n", + " self.bias = nn.Parameter(torch.zeros(config.vocab_size))\n", + "\n", + " # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`\n", + " self.decoder.bias = self.bias\n", + "\n", + " def forward(self, hidden_states):\n", + " hidden_states = self.transform(hidden_states)\n", + " hidden_states = self.decoder(hidden_states)\n", + " return hidden_states\n", + "```\n", + "\n", + "这个类用于预测`[MASK]`位置的输出在每个词作为类别的分类输出,注意到:\n", + "\n", + "- 该类重新初始化了一个全 0 向量作为预测权重的 bias;\n", + "- 该类的输出形状为[batch_size, seq_length, vocab_size],即预测每个句子每个词是什么类别的概率值(注意这里没有做 softmax);\n", + "- 又一个封装的类:BertPredictionHeadTransform,用来完成一些线性变换:\n", + "```\n", + "class BertPredictionHeadTransform(nn.Module):\n", + " def __init__(self, config):\n", + " super().__init__()\n", + " self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n", + " if isinstance(config.hidden_act, str):\n", + " self.transform_act_fn = ACT2FN[config.hidden_act]\n", + " else:\n", + " self.transform_act_fn = 
config.hidden_act\n", + " self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n", + "\n", + " def forward(self, hidden_states):\n", + " hidden_states = self.dense(hidden_states)\n", + " hidden_states = self.transform_act_fn(hidden_states)\n", + " hidden_states = self.LayerNorm(hidden_states)\n", + " return hidden_states\n", + "```\n", + "\n", + "回到`BertForPreTraining`,继续看两块 `loss` 是怎么处理的。它的前向传播和BertModel的有所不同,多了`labels`和`next_sentence_label` 两个输入:\n", + "\n", + "- labels:形状为[batch_size, seq_length] ,代表 MLM 任务的标签,注意这里对于原本未被遮盖的词设置为 -100,被遮盖词才会有它们对应的 id,和任务设置是反过来的。\n", + "\n", + " - 例如,原始句子是I want to [MASK] an apple,这里我把单词eat给遮住了输入模型,对应的label设置为[-100, -100, -100, 【eat对应的id】, -100, -100];\n", + " - 为什么要设置为 -100 而不是其他数?因为torch.nn.CrossEntropyLoss默认的ignore_index=-100,也就是说对于标签为 100 的类别输入不会计算 loss。\n", + "\n", + "- next_sentence_label:这一个输入很简单,就是 0 和 1 的二分类标签。\n", + "\n", + "```\n", + "# ...\n", + " def forward(\n", + " self,\n", + " input_ids=None,\n", + " attention_mask=None,\n", + " token_type_ids=None,\n", + " position_ids=None,\n", + " head_mask=None,\n", + " inputs_embeds=None,\n", + " labels=None,\n", + " next_sentence_label=None,\n", + " output_attentions=None,\n", + " output_hidden_states=None,\n", + " return_dict=None,\n", + " ): ...\n", + "```\n", + "\n", + "接下来两部分 loss 的组合:\n", + "```\n", + " # ...\n", + " total_loss = None\n", + " if labels is not None and next_sentence_label is not None:\n", + " loss_fct = CrossEntropyLoss()\n", + " masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n", + " next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))\n", + " total_loss = masked_lm_loss + next_sentence_loss\n", + " # ...\n", + "```\n", + "\n", + "直接相加,就是这么单纯的策略。\n", + "当然,这份代码里面也包含了对于只想对单个目标进行预训练的 BERT 模型(具体细节不作展开):\n", + "- BertForMaskedLM:只进行 MLM 任务的预训练;\n", + " - 基于BertOnlyMLMHead,而后者也是对BertLMPredictionHead的另一层封装;\n", + "- BertLMHeadModel:这个和上一个的区别在于,这一模型是作为 decoder 运行的版本;\n", + " - 同样基于BertOnlyMLMHead;\n", + "- BertForNextSentencePrediction:只进行 NSP 任务的预训练。\n", + " - 基于BertOnlyNSPHead,内容就是一个线性层。" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']\n", + "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n" + ] + } + ], + "source": [ + "_CHECKPOINT_FOR_DOC = \"bert-base-uncased\"\n", + "_CONFIG_FOR_DOC = \"BertConfig\"\n", + "_TOKENIZER_FOR_DOC = \"BertTokenizer\"\n", + "from transformers.models.bert.modeling_bert import *\n", + "from transformers.models.bert.configuration_bert import *\n", + "class BertForPreTraining(BertPreTrainedModel):\n", + " def __init__(self, config):\n", + " super().__init__(config)\n", + "\n", + " self.bert = BertModel(config)\n", + " self.cls = BertPreTrainingHeads(config)\n", + "\n", + " self.init_weights()\n", + "\n", + " def get_output_embeddings(self):\n", + " return self.cls.predictions.decoder\n", + "\n", + " def set_output_embeddings(self, new_embeddings):\n", + " self.cls.predictions.decoder = new_embeddings\n", + "\n", + " @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n", + " @replace_return_docstrings(output_type=BertForPreTrainingOutput, 
config_class=_CONFIG_FOR_DOC)\n", + " def forward(\n", + " self,\n", + " input_ids=None,\n", + " attention_mask=None,\n", + " token_type_ids=None,\n", + " position_ids=None,\n", + " head_mask=None,\n", + " inputs_embeds=None,\n", + " labels=None,\n", + " next_sentence_label=None,\n", + " output_attentions=None,\n", + " output_hidden_states=None,\n", + " return_dict=None,\n", + " ):\n", + " r\"\"\"\n", + " labels (:obj:`torch.LongTensor` of shape ``(batch_size, sequence_length)``, `optional`):\n", + " Labels for computing the masked language modeling loss. Indices should be in ``[-100, 0, ...,\n", + " config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are ignored\n", + " (masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``\n", + " next_sentence_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):\n", + " Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair\n", + " (see :obj:`input_ids` docstring) Indices should be in ``[0, 1]``:\n", + " - 0 indicates sequence B is a continuation of sequence A,\n", + " - 1 indicates sequence B is a random sequence.\n", + " kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):\n", + " Used to hide legacy arguments that have been deprecated.\n", + " Returns:\n", + " Example::\n", + "from transformers import BertTokenizer, BertForPreTraining\n", + "import torch\n", + "tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n", + "model = BertForPreTraining.from_pretrained('bert-base-uncased')\n", + "inputs = tokenizer(\"Hello, my dog is cute\", return_tensors=\"pt\")\n", + "outputs = model(**inputs)\n", + "prediction_logits = outputs.prediction_logits\n", + "seq_relationship_logits = outputs.seq_relationship_logits\n", + " \"\"\"\n", + " return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n", + "\n", + " outputs = self.bert(\n", + " input_ids,\n", + " attention_mask=attention_mask,\n", + " token_type_ids=token_type_ids,\n", + " position_ids=position_ids,\n", + " head_mask=head_mask,\n", + " inputs_embeds=inputs_embeds,\n", + " output_attentions=output_attentions,\n", + " output_hidden_states=output_hidden_states,\n", + " return_dict=return_dict,\n", + " )\n", + "\n", + " sequence_output, pooled_output = outputs[:2]\n", + " prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output)\n", + "\n", + " total_loss = None\n", + " if labels is not None and next_sentence_label is not None:\n", + " loss_fct = CrossEntropyLoss()\n", + " masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n", + " next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))\n", + " total_loss = masked_lm_loss + next_sentence_loss\n", + "\n", + " if not return_dict:\n", + " output = (prediction_scores, seq_relationship_score) + outputs[2:]\n", + " return ((total_loss,) + output) if total_loss is not None else output\n", + "\n", + " return BertForPreTrainingOutput(\n", + " loss=total_loss,\n", + " prediction_logits=prediction_scores,\n", + " seq_relationship_logits=seq_relationship_score,\n", + " hidden_states=outputs.hidden_states,\n", + " attentions=outputs.attentions,\n", + " )\n", + "\n", + "from transformers import BertTokenizer, BertForPreTraining\n", + "import torch\n", + "tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n", + "model = 
BertForPreTraining.from_pretrained('bert-base-uncased')\n", + "inputs = tokenizer(\"Hello, my dog is cute\", return_tensors=\"pt\")\n", + "outputs = model(**inputs)\n", + "prediction_logits = outputs.prediction_logits\n", + "seq_relationship_logits = outputs.seq_relationship_logits" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertLMHeadModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']\n", + "- This IS expected if you are initializing BertLMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n", + "- This IS NOT expected if you are initializing BertLMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n" + ] + } + ], + "source": [ + "@add_start_docstrings(\n", + " \"\"\"Bert Model with a `language modeling` head on top for CLM fine-tuning. \"\"\", BERT_START_DOCSTRING\n", + ")\n", + "class BertLMHeadModel(BertPreTrainedModel):\n", + "\n", + " _keys_to_ignore_on_load_unexpected = [r\"pooler\"]\n", + " _keys_to_ignore_on_load_missing = [r\"position_ids\", r\"predictions.decoder.bias\"]\n", + "\n", + " def __init__(self, config):\n", + " super().__init__(config)\n", + "\n", + " if not config.is_decoder:\n", + " logger.warning(\"If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`\")\n", + "\n", + " self.bert = BertModel(config, add_pooling_layer=False)\n", + " self.cls = BertOnlyMLMHead(config)\n", + "\n", + " self.init_weights()\n", + "\n", + " def get_output_embeddings(self):\n", + " return self.cls.predictions.decoder\n", + "\n", + " def set_output_embeddings(self, new_embeddings):\n", + " self.cls.predictions.decoder = new_embeddings\n", + "\n", + " @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n", + " @replace_return_docstrings(output_type=CausalLMOutputWithCrossAttentions, config_class=_CONFIG_FOR_DOC)\n", + " def forward(\n", + " self,\n", + " input_ids=None,\n", + " attention_mask=None,\n", + " token_type_ids=None,\n", + " position_ids=None,\n", + " head_mask=None,\n", + " inputs_embeds=None,\n", + " encoder_hidden_states=None,\n", + " encoder_attention_mask=None,\n", + " labels=None,\n", + " past_key_values=None,\n", + " use_cache=None,\n", + " output_attentions=None,\n", + " output_hidden_states=None,\n", + " return_dict=None,\n", + " ):\n", + " r\"\"\"\n", + " encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):\n", + " Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n", + " the model is configured as a decoder.\n", + " encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n", + " Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in\n", + " the cross-attention if the model is configured as a decoder. 
Mask values selected in ``[0, 1]``:\n", + " - 1 for tokens that are **not masked**,\n", + " - 0 for tokens that are **masked**.\n", + " labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n", + " Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in\n", + " ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are\n", + " ignored (masked), the loss is only computed for the tokens with labels n ``[0, ..., config.vocab_size]``\n", + " past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):\n", + " Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.\n", + " If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`\n", + " (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`\n", + " instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.\n", + " use_cache (:obj:`bool`, `optional`):\n", + " If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up\n", + " decoding (see :obj:`past_key_values`).\n", + " Returns:\n", + " Example::\n", + " from transformers import BertTokenizer, BertLMHeadModel, BertConfig\n", + " import torch\n", + " tokenizer = BertTokenizer.from_pretrained('bert-base-cased')\n", + " config = BertConfig.from_pretrained(\"bert-base-cased\")\n", + " config.is_decoder = True\n", + " model = BertLMHeadModel.from_pretrained('bert-base-cased', config=config)\n", + " inputs = tokenizer(\"Hello, my dog is cute\", return_tensors=\"pt\")\n", + " outputs = model(**inputs)\n", + " prediction_logits = outputs.logits\n", + " \"\"\"\n", + " return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n", + " if labels is not None:\n", + " use_cache = False\n", + "\n", + " outputs = self.bert(\n", + " input_ids,\n", + " attention_mask=attention_mask,\n", + " token_type_ids=token_type_ids,\n", + " position_ids=position_ids,\n", + " head_mask=head_mask,\n", + " inputs_embeds=inputs_embeds,\n", + " encoder_hidden_states=encoder_hidden_states,\n", + " encoder_attention_mask=encoder_attention_mask,\n", + " past_key_values=past_key_values,\n", + " use_cache=use_cache,\n", + " output_attentions=output_attentions,\n", + " output_hidden_states=output_hidden_states,\n", + " return_dict=return_dict,\n", + " )\n", + "\n", + " sequence_output = outputs[0]\n", + " prediction_scores = self.cls(sequence_output)\n", + "\n", + " lm_loss = None\n", + " if labels is not None:\n", + " # we are doing next-token prediction; shift prediction scores and input ids by one\n", + " shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()\n", + " labels = labels[:, 1:].contiguous()\n", + " loss_fct = CrossEntropyLoss()\n", + " lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n", + "\n", + " if not return_dict:\n", + " output = (prediction_scores,) + outputs[2:]\n", + " return ((lm_loss,) + output) if lm_loss is not None else output\n", + "\n", + " return CausalLMOutputWithCrossAttentions(\n", + " loss=lm_loss,\n", + " logits=prediction_scores,\n", + " past_key_values=outputs.past_key_values,\n", + " 
hidden_states=outputs.hidden_states,\n", + " attentions=outputs.attentions,\n", + " cross_attentions=outputs.cross_attentions,\n", + " )\n", + "\n", + " def prepare_inputs_for_generation(self, input_ids, past=None, attention_mask=None, **model_kwargs):\n", + " input_shape = input_ids.shape\n", + " # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly\n", + " if attention_mask is None:\n", + " attention_mask = input_ids.new_ones(input_shape)\n", + "\n", + " # cut decoder_input_ids if past is used\n", + " if past is not None:\n", + " input_ids = input_ids[:, -1:]\n", + "\n", + " return {\"input_ids\": input_ids, \"attention_mask\": attention_mask, \"past_key_values\": past}\n", + "\n", + " def _reorder_cache(self, past, beam_idx):\n", + " reordered_past = ()\n", + " for layer_past in past:\n", + " reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),)\n", + " return reordered_past\n", + "\n", + "from transformers import BertTokenizer, BertLMHeadModel, BertConfig\n", + "import torch\n", + "tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n", + "config = BertConfig.from_pretrained(\"bert-base-uncased\")\n", + "config.is_decoder = True\n", + "model = BertLMHeadModel.from_pretrained('bert-base-uncased', config=config)\n", + "inputs = tokenizer(\"Hello, my dog is cute\", return_tensors=\"pt\")\n", + "outputs = model(**inputs)\n", + "prediction_logits = outputs.logits" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Downloading: 100%|██████████| 440M/440M [00:30<00:00, 14.5MB/s]\n", + "Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForNextSentencePrediction: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']\n", + "- This IS expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model trained on another task or with another architecture (e.g. 
initializing a BertForSequenceClassification model from a BertForPreTraining model).\n", + "- This IS NOT expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n" + ] + } + ], + "source": [ + "class BertForNextSentencePrediction(BertPreTrainedModel):\n", + " def __init__(self, config):\n", + " super().__init__(config)\n", + "\n", + " self.bert = BertModel(config)\n", + " self.cls = BertOnlyNSPHead(config)\n", + "\n", + " self.init_weights()\n", + "\n", + " @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n", + " @replace_return_docstrings(output_type=NextSentencePredictorOutput, config_class=_CONFIG_FOR_DOC)\n", + " def forward(\n", + " self,\n", + " input_ids=None,\n", + " attention_mask=None,\n", + " token_type_ids=None,\n", + " position_ids=None,\n", + " head_mask=None,\n", + " inputs_embeds=None,\n", + " labels=None,\n", + " output_attentions=None,\n", + " output_hidden_states=None,\n", + " return_dict=None,\n", + " **kwargs,\n", + " ):\n", + " r\"\"\"\n", + " labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n", + " Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair\n", + " (see ``input_ids`` docstring). Indices should be in ``[0, 1]``:\n", + " - 0 indicates sequence B is a continuation of sequence A,\n", + " - 1 indicates sequence B is a random sequence.\n", + " Returns:\n", + " Example::\n", + "from transformers import BertTokenizer, BertForNextSentencePrediction\n", + "import torch\n", + "tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n", + "model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')\n", + "prompt = \"In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.\"\n", + "next_sentence = \"The sky is blue due to the shorter wavelength of blue light.\"\n", + "encoding = tokenizer(prompt, next_sentence, return_tensors='pt')\n", + "outputs = model(**encoding, labels=torch.LongTensor([1]))\n", + "logits = outputs.logits\n", + "assert logits[0, 0] < logits[0, 1] # next sentence was random\n", + " \"\"\"\n", + "\n", + " if \"next_sentence_label\" in kwargs:\n", + " warnings.warn(\n", + " \"The `next_sentence_label` argument is deprecated and will be removed in a future version, use `labels` instead.\",\n", + " FutureWarning,\n", + " )\n", + " labels = kwargs.pop(\"next_sentence_label\")\n", + "\n", + " return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n", + "\n", + " outputs = self.bert(\n", + " input_ids,\n", + " attention_mask=attention_mask,\n", + " token_type_ids=token_type_ids,\n", + " position_ids=position_ids,\n", + " head_mask=head_mask,\n", + " inputs_embeds=inputs_embeds,\n", + " output_attentions=output_attentions,\n", + " output_hidden_states=output_hidden_states,\n", + " return_dict=return_dict,\n", + " )\n", + "\n", + " pooled_output = outputs[1]\n", + "\n", + " seq_relationship_scores = self.cls(pooled_output)\n", + "\n", + " next_sentence_loss = None\n", + " if labels is not None:\n", + " loss_fct = CrossEntropyLoss()\n", + " next_sentence_loss = loss_fct(seq_relationship_scores.view(-1, 2), labels.view(-1))\n", + "\n", + " if not return_dict:\n", + " output = (seq_relationship_scores,) + outputs[2:]\n", + " return ((next_sentence_loss,) + output) if 
next_sentence_loss is not None else output\n", + "\n", + " return NextSentencePredictorOutput(\n", + " loss=next_sentence_loss,\n", + " logits=seq_relationship_scores,\n", + " hidden_states=outputs.hidden_states,\n", + " attentions=outputs.attentions,\n", + " )\n", + "from transformers import BertTokenizer, BertForNextSentencePrediction\n", + "import torch\n", + "tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n", + "model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')\n", + "prompt = \"In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.\"\n", + "next_sentence = \"The sky is blue due to the shorter wavelength of blue light.\"\n", + "encoding = tokenizer(prompt, next_sentence, return_tensors='pt')\n", + "outputs = model(**encoding, labels=torch.LongTensor([1]))\n", + "logits = outputs.logits\n", + "assert logits[0, 0] < logits[0, 1] # next sentence was random" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "接下来介绍的是各种 Fine-tune 模型,基本都是分类任务:\n", + "\n", + "![Bert:finetune](./pictures/3-4-bert-ft.png) 图:Bert:finetune" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*** \n", + "### 3.2 BertForSequenceClassification\n", + "这一模型用于句子分类(也可以是回归)任务,比如 GLUE benchmark 的各个任务。\n", + "- 句子分类的输入为句子(对),输出为单个分类标签。\n", + "\n", + "结构上很简单,就是`BertModel`(有 pooling)过一个 dropout 后接一个线性层输出分类:\n", + "```\n", + "class BertForSequenceClassification(BertPreTrainedModel):\n", + " def __init__(self, config):\n", + " super().__init__(config)\n", + " self.num_labels = config.num_labels\n", + "\n", + " self.bert = BertModel(config)\n", + " self.dropout = nn.Dropout(config.hidden_dropout_prob)\n", + " self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n", + "\n", + " self.init_weights()\n", + " # ...\n", + "```\n", + "\n", + "在前向传播时,和上面预训练模型一样需要传入labels输入。\n", + "\n", + "- 如果初始化的num_labels=1,那么就默认为回归任务,使用 MSELoss;\n", + "\n", + "- 否则认为是分类任务。" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "@add_start_docstrings(\n", + " \"\"\"\n", + " Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled\n", + " output) e.g. 
for GLUE tasks.\n", + " \"\"\",\n", + " BERT_START_DOCSTRING,\n", + ")\n", + "class BertForSequenceClassification(BertPreTrainedModel):\n", + " def __init__(self, config):\n", + " super().__init__(config)\n", + " self.num_labels = config.num_labels\n", + " self.config = config\n", + "\n", + " self.bert = BertModel(config)\n", + " classifier_dropout = (\n", + " config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob\n", + " )\n", + " self.dropout = nn.Dropout(classifier_dropout)\n", + " self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n", + "\n", + " self.init_weights()\n", + "\n", + " @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n", + " @add_code_sample_docstrings(\n", + " tokenizer_class=_TOKENIZER_FOR_DOC,\n", + " checkpoint=_CHECKPOINT_FOR_DOC,\n", + " output_type=SequenceClassifierOutput,\n", + " config_class=_CONFIG_FOR_DOC,\n", + " )\n", + " def forward(\n", + " self,\n", + " input_ids=None,\n", + " attention_mask=None,\n", + " token_type_ids=None,\n", + " position_ids=None,\n", + " head_mask=None,\n", + " inputs_embeds=None,\n", + " labels=None,\n", + " output_attentions=None,\n", + " output_hidden_states=None,\n", + " return_dict=None,\n", + " ):\n", + " r\"\"\"\n", + " labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n", + " Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ...,\n", + " config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n", + " If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n", + " \"\"\"\n", + " return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n", + "\n", + " outputs = self.bert(\n", + " input_ids,\n", + " attention_mask=attention_mask,\n", + " token_type_ids=token_type_ids,\n", + " position_ids=position_ids,\n", + " head_mask=head_mask,\n", + " inputs_embeds=inputs_embeds,\n", + " output_attentions=output_attentions,\n", + " output_hidden_states=output_hidden_states,\n", + " return_dict=return_dict,\n", + " )\n", + "\n", + " pooled_output = outputs[1]\n", + "\n", + " pooled_output = self.dropout(pooled_output)\n", + " logits = self.classifier(pooled_output)\n", + "\n", + " loss = None\n", + " if labels is not None:\n", + " if self.config.problem_type is None:\n", + " if self.num_labels == 1:\n", + " self.config.problem_type = \"regression\"\n", + " elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):\n", + " self.config.problem_type = \"single_label_classification\"\n", + " else:\n", + " self.config.problem_type = \"multi_label_classification\"\n", + "\n", + " if self.config.problem_type == \"regression\":\n", + " loss_fct = MSELoss()\n", + " if self.num_labels == 1:\n", + " loss = loss_fct(logits.squeeze(), labels.squeeze())\n", + " else:\n", + " loss = loss_fct(logits, labels)\n", + " elif self.config.problem_type == \"single_label_classification\":\n", + " loss_fct = CrossEntropyLoss()\n", + " loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n", + " elif self.config.problem_type == \"multi_label_classification\":\n", + " loss_fct = BCEWithLogitsLoss()\n", + " loss = loss_fct(logits, labels)\n", + " if not return_dict:\n", + " output = (logits,) + outputs[2:]\n", + " return ((loss,) + output) if loss is not None else output\n", + "\n", + " return SequenceClassifierOutput(\n", + " 
loss=loss,\n", + " logits=logits,\n", + " hidden_states=outputs.hidden_states,\n", + " attentions=outputs.attentions,\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Downloading: 100%|██████████| 213k/213k [00:00<00:00, 596kB/s]\n", + "Downloading: 100%|██████████| 29.0/29.0 [00:00<00:00, 12.4kB/s]\n", + "Downloading: 100%|██████████| 436k/436k [00:00<00:00, 808kB/s]\n", + "Downloading: 100%|██████████| 433/433 [00:00<00:00, 166kB/s]\n", + "Downloading: 100%|██████████| 433M/433M [00:29<00:00, 14.5MB/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "not paraphrase: 10%\n", + "is paraphrase: 90%\n", + "not paraphrase: 94%\n", + "is paraphrase: 6%\n" + ] + } + ], + "source": [ + "from transformers.models.bert.tokenization_bert import BertTokenizer\n", + "from transformers.models.bert.modeling_bert import BertForSequenceClassification\n", + "tokenizer = BertTokenizer.from_pretrained(\"bert-base-cased-finetuned-mrpc\")\n", + "model = BertForSequenceClassification.from_pretrained(\"bert-base-cased-finetuned-mrpc\")\n", + "\n", + "classes = [\"not paraphrase\", \"is paraphrase\"]\n", + "\n", + "sequence_0 = \"The company HuggingFace is based in New York City\"\n", + "sequence_1 = \"Apples are especially bad for your health\"\n", + "sequence_2 = \"HuggingFace's headquarters are situated in Manhattan\"\n", + "\n", + "# The tokekenizer will automatically add any model specific separators (i.e. and ) and tokens to the sequence, as well as compute the attention masks.\n", + "paraphrase = tokenizer(sequence_0, sequence_2, return_tensors=\"pt\")\n", + "not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors=\"pt\")\n", + "\n", + "paraphrase_classification_logits = model(**paraphrase).logits\n", + "not_paraphrase_classification_logits = model(**not_paraphrase).logits\n", + "\n", + "paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]\n", + "not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]\n", + "\n", + "# Should be paraphrase\n", + "for i in range(len(classes)):\n", + " print(f\"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%\")\n", + "\n", + "# Should not be paraphrase\n", + "for i in range(len(classes)):\n", + " print(f\"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*** \n", + "### 3.3 BertForMultipleChoice\n", + "\n", + "这一模型用于多项选择,如 RocStories/SWAG 任务。\n", + "- 多项选择任务的输入为一组分次输入的句子,输出为选择某一句子的单个标签。\n", + "结构上与句子分类相似,只不过线性层输出维度为 1,即每次需要将每个样本的多个句子的输出拼接起来作为每个样本的预测分数。\n", + "- 实际上,具体操作时是把每个 batch 的多个句子一同放入的,所以一次处理的输入为[batch_size, num_choices]数量的句子,因此相同 batch 大小时,比句子分类等任务需要更多的显存,在训练时需要小心。" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*** \n", + "### 3.4 BertForTokenClassification\n", + "这一模型用于序列标注(词分类),如 NER 任务。\n", + "- 序列标注任务的输入为单个句子文本,输出为每个 token 对应的类别标签。\n", + "由于需要用到每个 token对应的输出而不只是某几个,所以这里的BertModel不用加入 pooling 层;\n", + "- 同时,这里将`_keys_to_ignore_on_load_unexpected`这一个类参数设置为`[r\"pooler\"]`,也就是在加载模型时对于出现不需要的权重不发生报错。" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [], + "source": [ + "class BertForMultipleChoice(BertPreTrainedModel):\n", + " def __init__(self, config):\n", + " super().__init__(config)\n", + "\n", + " self.bert = BertModel(config)\n", + " self.dropout = 
nn.Dropout(config.hidden_dropout_prob)\n", + " self.classifier = nn.Linear(config.hidden_size, 1)\n", + "\n", + " self.init_weights()\n", + "\n", + " @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, num_choices, sequence_length\"))\n", + " @add_code_sample_docstrings(\n", + " tokenizer_class=_TOKENIZER_FOR_DOC,\n", + " checkpoint=_CHECKPOINT_FOR_DOC,\n", + " output_type=MultipleChoiceModelOutput,\n", + " config_class=_CONFIG_FOR_DOC,\n", + " )\n", + " def forward(\n", + " self,\n", + " input_ids=None,\n", + " attention_mask=None,\n", + " token_type_ids=None,\n", + " position_ids=None,\n", + " head_mask=None,\n", + " inputs_embeds=None,\n", + " labels=None,\n", + " output_attentions=None,\n", + " output_hidden_states=None,\n", + " return_dict=None,\n", + " ):\n", + " r\"\"\"\n", + " labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n", + " Labels for computing the multiple choice classification loss. Indices should be in ``[0, ...,\n", + " num_choices-1]`` where :obj:`num_choices` is the size of the second dimension of the input tensors. (See\n", + " :obj:`input_ids` above)\n", + " \"\"\"\n", + " return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n", + " num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]\n", + "\n", + " input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None\n", + " attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n", + " token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n", + " position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None\n", + " inputs_embeds = (\n", + " inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1))\n", + " if inputs_embeds is not None\n", + " else None\n", + " )\n", + "\n", + " outputs = self.bert(\n", + " input_ids,\n", + " attention_mask=attention_mask,\n", + " token_type_ids=token_type_ids,\n", + " position_ids=position_ids,\n", + " head_mask=head_mask,\n", + " inputs_embeds=inputs_embeds,\n", + " output_attentions=output_attentions,\n", + " output_hidden_states=output_hidden_states,\n", + " return_dict=return_dict,\n", + " )\n", + "\n", + " pooled_output = outputs[1]\n", + "\n", + " pooled_output = self.dropout(pooled_output)\n", + " logits = self.classifier(pooled_output)\n", + " reshaped_logits = logits.view(-1, num_choices)\n", + "\n", + " loss = None\n", + " if labels is not None:\n", + " loss_fct = CrossEntropyLoss()\n", + " loss = loss_fct(reshaped_logits, labels)\n", + "\n", + " if not return_dict:\n", + " output = (reshaped_logits,) + outputs[2:]\n", + " return ((loss,) + output) if loss is not None else output\n", + "\n", + " return MultipleChoiceModelOutput(\n", + " loss=loss,\n", + " logits=reshaped_logits,\n", + " hidden_states=outputs.hidden_states,\n", + " attentions=outputs.attentions,\n", + " )\n" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "@add_start_docstrings(\n", + " \"\"\"\n", + " Bert Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. 
for\n", + " Named-Entity-Recognition (NER) tasks.\n", + " \"\"\",\n", + " BERT_START_DOCSTRING,\n", + ")\n", + "class BertForTokenClassification(BertPreTrainedModel):\n", + "\n", + " _keys_to_ignore_on_load_unexpected = [r\"pooler\"]\n", + "\n", + " def __init__(self, config):\n", + " super().__init__(config)\n", + " self.num_labels = config.num_labels\n", + "\n", + " self.bert = BertModel(config, add_pooling_layer=False)\n", + " classifier_dropout = (\n", + " config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob\n", + " )\n", + " self.dropout = nn.Dropout(classifier_dropout)\n", + " self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n", + "\n", + " self.init_weights()\n", + "\n", + " @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n", + " @add_code_sample_docstrings(\n", + " tokenizer_class=_TOKENIZER_FOR_DOC,\n", + " checkpoint=_CHECKPOINT_FOR_DOC,\n", + " output_type=TokenClassifierOutput,\n", + " config_class=_CONFIG_FOR_DOC,\n", + " )\n", + " def forward(\n", + " self,\n", + " input_ids=None,\n", + " attention_mask=None,\n", + " token_type_ids=None,\n", + " position_ids=None,\n", + " head_mask=None,\n", + " inputs_embeds=None,\n", + " labels=None,\n", + " output_attentions=None,\n", + " output_hidden_states=None,\n", + " return_dict=None,\n", + " ):\n", + " r\"\"\"\n", + " labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n", + " Labels for computing the token classification loss. Indices should be in ``[0, ..., config.num_labels -\n", + " 1]``.\n", + " \"\"\"\n", + " return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n", + "\n", + " outputs = self.bert(\n", + " input_ids,\n", + " attention_mask=attention_mask,\n", + " token_type_ids=token_type_ids,\n", + " position_ids=position_ids,\n", + " head_mask=head_mask,\n", + " inputs_embeds=inputs_embeds,\n", + " output_attentions=output_attentions,\n", + " output_hidden_states=output_hidden_states,\n", + " return_dict=return_dict,\n", + " )\n", + "\n", + " sequence_output = outputs[0]\n", + "\n", + " sequence_output = self.dropout(sequence_output)\n", + " logits = self.classifier(sequence_output)\n", + "\n", + " loss = None\n", + " if labels is not None:\n", + " loss_fct = CrossEntropyLoss()\n", + " # Only keep active parts of the loss\n", + " if attention_mask is not None:\n", + " active_loss = attention_mask.view(-1) == 1\n", + " active_logits = logits.view(-1, self.num_labels)\n", + " active_labels = torch.where(\n", + " active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n", + " )\n", + " loss = loss_fct(active_logits, active_labels)\n", + " else:\n", + " loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n", + "\n", + " if not return_dict:\n", + " output = (logits,) + outputs[2:]\n", + " return ((loss,) + output) if loss is not None else output\n", + "\n", + " return TokenClassifierOutput(\n", + " loss=loss,\n", + " logits=logits,\n", + " hidden_states=outputs.hidden_states,\n", + " attentions=outputs.attentions,\n", + " )\n" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Downloading: 100%|██████████| 998/998 [00:00<00:00, 382kB/s]\n", + "Downloading: 100%|██████████| 1.33G/1.33G [01:30<00:00, 14.7MB/s]\n" + ] + } + ], + "source": [ + "from transformers import BertForTokenClassification, 
BertTokenizer\n", + "import torch\n", + "\n", + "model = BertForTokenClassification.from_pretrained(\"dbmdz/bert-large-cased-finetuned-conll03-english\")\n", + "tokenizer = BertTokenizer.from_pretrained(\"bert-base-cased\")\n", + "\n", + "label_list = [\n", + "\"O\", # Outside of a named entity\n", + "\"B-MISC\", # Beginning of a miscellaneous entity right after another miscellaneous entity\n", + "\"I-MISC\", # Miscellaneous entity\n", + "\"B-PER\", # Beginning of a person's name right after another person's name\n", + "\"I-PER\", # Person's name\n", + "\"B-ORG\", # Beginning of an organisation right after another organisation\n", + "\"I-ORG\", # Organisation\n", + "\"B-LOC\", # Beginning of a location right after another location\n", + "\"I-LOC\" # Location\n", + "]\n", + "\n", + "sequence = \"Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge.\"\n", + "\n", + "# Bit of a hack to get the tokens with the special tokens\n", + "tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))\n", + "inputs = tokenizer.encode(sequence, return_tensors=\"pt\")\n", + "\n", + "outputs = model(inputs).logits\n", + "predictions = torch.argmax(outputs, dim=2)" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "('[CLS]', 'O')\n", + "('Hu', 'I-ORG')\n", + "('##gging', 'I-ORG')\n", + "('Face', 'I-ORG')\n", + "('Inc', 'I-ORG')\n", + "('.', 'O')\n", + "('is', 'O')\n", + "('a', 'O')\n", + "('company', 'O')\n", + "('based', 'O')\n", + "('in', 'O')\n", + "('New', 'I-LOC')\n", + "('York', 'I-LOC')\n", + "('City', 'I-LOC')\n", + "('.', 'O')\n", + "('Its', 'O')\n", + "('headquarters', 'O')\n", + "('are', 'O')\n", + "('in', 'O')\n", + "('D', 'I-LOC')\n", + "('##UM', 'I-LOC')\n", + "('##BO', 'I-LOC')\n", + "(',', 'O')\n", + "('therefore', 'O')\n", + "('very', 'O')\n", + "('close', 'O')\n", + "('to', 'O')\n", + "('the', 'O')\n", + "('Manhattan', 'I-LOC')\n", + "('Bridge', 'I-LOC')\n", + "('.', 'O')\n", + "('[SEP]', 'O')\n" + ] + } + ], + "source": [ + "for token, prediction in zip(tokens, predictions[0].numpy()):\n", + " print((token, model.config.id2label[prediction]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*** \n", + "### 3.5 BertForQuestionAnswering\n", + "这一模型用于解决问答任务,例如 SQuAD 任务。\n", + "- 问答任务的输入为问题 +(对于 BERT 只能是一个)回答组成的句子对,输出为起始位置和结束位置用于标出回答中的具体文本。\n", + "这里需要两个输出,即对起始位置的预测和对结束位置的预测,两个输出的长度都和句子长度一样,从其中挑出最大的预测值对应的下标作为预测的位置。\n", + "- 对超出句子长度的非法 label,会将其压缩(torch.clamp_)到合理范围。\n", + "\n", + "以上就是关于 BERT 源码的介绍,下面介绍一些关于 BERT 模型实用的训练细节。" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [], + "source": [ + "@add_start_docstrings(\n", + " \"\"\"\n", + " Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear\n", + " layers on top of the hidden-states output to compute `span start logits` and `span end logits`).\n", + " \"\"\",\n", + " BERT_START_DOCSTRING,\n", + ")\n", + "class BertForQuestionAnswering(BertPreTrainedModel):\n", + "\n", + " _keys_to_ignore_on_load_unexpected = [r\"pooler\"]\n", + "\n", + " def __init__(self, config):\n", + " super().__init__(config)\n", + " self.num_labels = config.num_labels\n", + "\n", + " self.bert = BertModel(config, add_pooling_layer=False)\n", + " self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n", + "\n", + " 
self.init_weights()\n", + "\n", + " @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n", + " @add_code_sample_docstrings(\n", + " tokenizer_class=_TOKENIZER_FOR_DOC,\n", + " checkpoint=_CHECKPOINT_FOR_DOC,\n", + " output_type=QuestionAnsweringModelOutput,\n", + " config_class=_CONFIG_FOR_DOC,\n", + " )\n", + " def forward(\n", + " self,\n", + " input_ids=None,\n", + " attention_mask=None,\n", + " token_type_ids=None,\n", + " position_ids=None,\n", + " head_mask=None,\n", + " inputs_embeds=None,\n", + " start_positions=None,\n", + " end_positions=None,\n", + " output_attentions=None,\n", + " output_hidden_states=None,\n", + " return_dict=None,\n", + " ):\n", + " r\"\"\"\n", + " start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n", + " Labels for position (index) of the start of the labelled span for computing the token classification loss.\n", + " Positions are clamped to the length of the sequence (:obj:`sequence_length`). Position outside of the\n", + " sequence are not taken into account for computing the loss.\n", + " end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n", + " Labels for position (index) of the end of the labelled span for computing the token classification loss.\n", + " Positions are clamped to the length of the sequence (:obj:`sequence_length`). Position outside of the\n", + " sequence are not taken into account for computing the loss.\n", + " \"\"\"\n", + " return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n", + "\n", + " outputs = self.bert(\n", + " input_ids,\n", + " attention_mask=attention_mask,\n", + " token_type_ids=token_type_ids,\n", + " position_ids=position_ids,\n", + " head_mask=head_mask,\n", + " inputs_embeds=inputs_embeds,\n", + " output_attentions=output_attentions,\n", + " output_hidden_states=output_hidden_states,\n", + " return_dict=return_dict,\n", + " )\n", + "\n", + " sequence_output = outputs[0]\n", + "\n", + " logits = self.qa_outputs(sequence_output)\n", + " start_logits, end_logits = logits.split(1, dim=-1)\n", + " start_logits = start_logits.squeeze(-1).contiguous()\n", + " end_logits = end_logits.squeeze(-1).contiguous()\n", + "\n", + " total_loss = None\n", + " if start_positions is not None and end_positions is not None:\n", + " # If we are on multi-GPU, split add a dimension\n", + " if len(start_positions.size()) > 1:\n", + " start_positions = start_positions.squeeze(-1)\n", + " if len(end_positions.size()) > 1:\n", + " end_positions = end_positions.squeeze(-1)\n", + " # sometimes the start/end positions are outside our model inputs, we ignore these terms\n", + " ignored_index = start_logits.size(1)\n", + " start_positions = start_positions.clamp(0, ignored_index)\n", + " end_positions = end_positions.clamp(0, ignored_index)\n", + "\n", + " loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n", + " start_loss = loss_fct(start_logits, start_positions)\n", + " end_loss = loss_fct(end_logits, end_positions)\n", + " total_loss = (start_loss + end_loss) / 2\n", + "\n", + " if not return_dict:\n", + " output = (start_logits, end_logits) + outputs[2:]\n", + " return ((total_loss,) + output) if total_loss is not None else output\n", + "\n", + " return QuestionAnsweringModelOutput(\n", + " loss=total_loss,\n", + " start_logits=start_logits,\n", + " end_logits=end_logits,\n", + " hidden_states=outputs.hidden_states,\n", + " attentions=outputs.attentions,\n", + " )\n" + ] + }, + { + 
"cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Downloading: 100%|██████████| 443/443 [00:00<00:00, 186kB/s]\n", + "Downloading: 100%|██████████| 232k/232k [00:00<00:00, 438kB/s]\n", + "Downloading: 100%|██████████| 466k/466k [00:00<00:00, 845kB/s]\n", + "Downloading: 100%|██████████| 28.0/28.0 [00:00<00:00, 10.5kB/s]\n", + "Downloading: 100%|██████████| 1.34G/1.34G [01:28<00:00, 15.1MB/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Question: How many pretrained models are available in 🤗 Transformers?\n", + "Answer: over 32 +\n", + "Question: What does 🤗 Transformers provide?\n", + "Answer: general - purpose architectures\n", + "Question: 🤗 Transformers provides interoperability between which frameworks?\n", + "Answer: tensorflow 2. 0 and pytorch\n" + ] + } + ], + "source": [ + "from transformers import AutoTokenizer, AutoModelForQuestionAnswering\n", + "import torch\n", + "\n", + "tokenizer = AutoTokenizer.from_pretrained(\"bert-large-uncased-whole-word-masking-finetuned-squad\")\n", + "model = AutoModelForQuestionAnswering.from_pretrained(\"bert-large-uncased-whole-word-masking-finetuned-squad\")\n", + "\n", + "text = \"🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.\"\n", + "\n", + "questions = [\n", + "\"How many pretrained models are available in 🤗 Transformers?\",\n", + "\"What does 🤗 Transformers provide?\",\n", + "\"🤗 Transformers provides interoperability between which frameworks?\",\n", + "]\n", + "\n", + "for question in questions:\n", + " inputs = tokenizer(question, text, add_special_tokens=True, return_tensors=\"pt\")\n", + " input_ids = inputs[\"input_ids\"].tolist()[0]\n", + " outputs = model(**inputs)\n", + " answer_start_scores = outputs.start_logits\n", + " answer_end_scores = outputs.end_logits\n", + " answer_start = torch.argmax(\n", + " answer_start_scores\n", + " ) # Get the most likely beginning of answer with the argmax of the score\n", + " answer_end = torch.argmax(answer_end_scores) + 1 # Get the most likely end of answer with the argmax of the score\n", + " answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))\n", + " print(f\"Question: {question}\")\n", + " print(f\"Answer: {answer}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*** \n", + "## BERT训练和优化\n", + "### 4.1 Pre-Training\n", + "预训练阶段,除了众所周知的 15%、80% mask 比例,有一个值得注意的地方就是参数共享。\n", + "不止 BERT,所有 huggingface 实现的 PLM 的 word embedding 和 masked language model 的预测权重在初始化过程中都是共享的:\n", + "```\n", + "class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin):\n", + " # ...\n", + " def tie_weights(self):\n", + " \"\"\"\n", + " Tie the weights between the input embeddings and the output embeddings.\n", + "\n", + " If the :obj:`torchscript` flag is set in the configuration, can't handle parameter sharing so we are cloning\n", + " the weights instead.\n", + " \"\"\"\n", + " output_embeddings = self.get_output_embeddings()\n", + " if output_embeddings is not None and self.config.tie_word_embeddings:\n", + " self._tie_or_clone_weights(output_embeddings, 
self.get_input_embeddings())\n", + "\n", + " if self.config.is_encoder_decoder and self.config.tie_encoder_decoder:\n", + " if hasattr(self, self.base_model_prefix):\n", + " self = getattr(self, self.base_model_prefix)\n", + " self._tie_encoder_decoder_weights(self.encoder, self.decoder, self.base_model_prefix)\n", + " # ...\n", + "```\n", + "\n", + "至于为什么,应该是因为 word_embedding 和 prediction 权重太大了,以 bert-base 为例,其尺寸为(30522, 768),降低训练难度。\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "***\n", + "### 4.2 Fine-Tuning\n", + "微调也就是下游任务阶段,也有两个值得注意的地方。\n", + "#### 4.2.1 AdamW\n", + "首先介绍一下 BERT 的优化器:AdamW(AdamWeightDecayOptimizer)。\n", + "\n", + "这一优化器来自 ICLR 2017 的 Best Paper:《Fixing Weight Decay Regularization in Adam》中提出的一种用于修复 Adam 的权重衰减错误的新方法。论文指出,L2 正则化和权重衰减在大部分情况下并不等价,只在 SGD 优化的情况下是等价的;而大多数框架中对于 Adam+L2 正则使用的是权重衰减的方式,两者不能混为一谈。\n", + "\n", + "AdamW 是在 Adam+L2 正则化的基础上进行改进的算法,与一般的 Adam+L2 的区别如下:\n", + "\n", + "![图:AdamW](./pictures/3-5-adamw.png) 图:AdamW\n", + "\n", + "关于 AdamW 的分析可以参考:\n", + "\n", + "- AdamW and Super-convergence is now the fastest way to train neural nets [1]\n", + "- paperplanet:都 9102 年了,别再用 Adam + L2 regularization了 [2]\n", + "\n", + "通常,我们会选择模型的 weight 部分参与 decay 过程,而另一部分(包括 LayerNorm 的 weight)不参与(代码最初来源应该是 Huggingface 的示例)\n", + "补充:关于这么做的理由,我暂时没有找到合理的解答,但是找到了一些相关的[讨论](https://forums.fast.ai/t/is-weight-decay-applied-to-the-bias-term/73212/4forums.fast.ai)\n", + "\n", + "```\n", + "# model: a Bert-based-model object\n", + " # learning_rate: default 2e-5 for text classification\n", + " param_optimizer = list(model.named_parameters())\n", + " no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']\n", + " optimizer_grouped_parameters = [\n", + " {'params': [p for n, p in param_optimizer if not any(\n", + " nd in n for nd in no_decay)], 'weight_decay': 0.01},\n", + " {'params': [p for n, p in param_optimizer if any(\n", + " nd in n for nd in no_decay)], 'weight_decay': 0.0}\n", + " ]\n", + " optimizer = AdamW(optimizer_grouped_parameters,\n", + " lr=learning_rate)\n", + " # ...\n", + "```\n", + "\n", + "#### 4.2.2 Warmup\n", + "\n", + "BERT 的训练中另一个特点在于 Warmup,其含义为:\n", + "\n", + "在训练初期使用较小的学习率(从 0 开始),在一定步数(比如 1000 步)内逐渐提高到正常大小(比如上面的 2e-5),避免模型过早进入局部最优而过拟合;\n", + "- 在训练后期再慢慢将学习率降低到 0,避免后期训练还出现较大的参数变化。\n", + "- 在 Huggingface 的实现中,可以使用多种 warmup 策略:\n", + "```\n", + "TYPE_TO_SCHEDULER_FUNCTION = {\n", + " SchedulerType.LINEAR: get_linear_schedule_with_warmup,\n", + " SchedulerType.COSINE: get_cosine_schedule_with_warmup,\n", + " SchedulerType.COSINE_WITH_RESTARTS: get_cosine_with_hard_restarts_schedule_with_warmup,\n", + " SchedulerType.POLYNOMIAL: get_polynomial_decay_schedule_with_warmup,\n", + " SchedulerType.CONSTANT: get_constant_schedule,\n", + " SchedulerType.CONSTANT_WITH_WARMUP: get_constant_schedule_with_warmup,\n", + "}\n", + "```\n", + "具体而言:\n", + "- CONSTANT:保持固定学习率不变;\n", + "- CONSTANT_WITH_WARMUP:在每一个 step 中线性调整学习率;\n", + "- LINEAR:上文提到的两段式调整;\n", + "- COSINE:和两段式调整类似,只不过采用的是三角函数式的曲线调整;\n", + "- COSINE_WITH_RESTARTS:训练中将上面 COSINE 的调整重复 n 次;\n", + "- POLYNOMIAL:按指数曲线进行两段式调整。\n", + "具体使用参考transformers/optimization.py:\n", + "最常用的还是get_linear_scheduler_with_warmup即线性两段式调整学习率的方案。\n", + "\n", + "```\n", + "def get_scheduler(\n", + " name: Union[str, SchedulerType],\n", + " optimizer: Optimizer,\n", + " num_warmup_steps: Optional[int] = None,\n", + " num_training_steps: Optional[int] = None,\n", + "): ...\n", + "\n", + "```\n", + "\n", + "以上即为关于 transformers 库(4.4.2 版本)中 BERT 应用的相关代码的具体实现分析,欢迎与读者共同交流探讨。\n", + "\n", + "## 致谢\n", + 
"本文主要由浙江大学李泺秋撰写,本项目同学负责整理和汇总。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "interpreter": { + "hash": "3bfce0b4c492a35815b5705a19fe374a7eea0baaa08b34d90450caf1fe9ce20b" + }, + "kernelspec": { + "display_name": "Python 3.8.10 64-bit ('venv': virtualenv)", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.10" + }, + "orig_nbformat": 4 + }, + "nbformat": 4, + "nbformat_minor": 2 +} \ No newline at end of file diff --git a/docs/篇章3-编写一个Transformer模型:BERT/3.2-如何应用一个BERT.md b/docs/篇章3-编写一个Transformer模型:BERT/3.2-如何应用一个BERT.md index 49ec472..1b5bf6d 100644 --- a/docs/篇章3-编写一个Transformer模型:BERT/3.2-如何应用一个BERT.md +++ b/docs/篇章3-编写一个Transformer模型:BERT/3.2-如何应用一个BERT.md @@ -1,6 +1,8 @@ ## 前言 接着上一小节,我们对Huggingface开源代码库中的Bert模型进行了深入学习,这一节我们对如何应用BERT进行详细的讲解。 +涉及到的jupyter可以在[代码库:篇章3-编写一个Transformer模型:BERT,下载](https://github.com/datawhalechina/learn-nlp-with-transformers/tree/main/docs/%E7%AF%87%E7%AB%A03-%E7%BC%96%E5%86%99%E4%B8%80%E4%B8%AATransformer%E6%A8%A1%E5%9E%8B%EF%BC%9ABERT) + 本文基于 Transformers 版本 4.4.2(2021 年 3 月 19 日发布)项目中,pytorch 版的 BERT 相关代码,从代码结构、具体实现与原理,以及使用的角度进行分析,包含以下内容: 3. BERT-based Models应用模型 @@ -17,15 +19,14 @@ - Warmup ## 3-BERT-based Models -基于 BERT 的模型都写在/models/bert/modeling_bert.py里面,包括 BERT 预训练模型和 BERT 分类等模型: -BERT模型一图流(建议保存后放大查看): -![Bert模型-图流](./pictures/3-2-bert-flow.png) 图:Bert模型-图流 +基于 BERT 的模型都写在/models/bert/modeling_bert.py里面,包括 BERT 预训练模型和 BERT 分类等模型。 首先,以下所有的模型都是基于`BertPreTrainedModel`这一抽象基类的,而后者则基于一个更大的基类`PreTrainedModel`。这里我们关注`BertPreTrainedModel`的功能: 用于初始化模型权重,同时维护继承自`PreTrainedModel`的一些标记身份或者加载模型时的类变量。 下面,首先从预训练模型开始分析。 +*** ### 3.1 BertForPreTraining 众所周知,BERT 预训练任务包括两个: @@ -158,9 +159,384 @@ class BertPredictionHeadTransform(nn.Module): - BertForNextSentencePrediction:只进行 NSP 任务的预训练。 - 基于BertOnlyNSPHead,内容就是一个线性层。 + +```python +_CHECKPOINT_FOR_DOC = "bert-base-uncased" +_CONFIG_FOR_DOC = "BertConfig" +_TOKENIZER_FOR_DOC = "BertTokenizer" +from transformers.models.bert.modeling_bert import * +from transformers.models.bert.configuration_bert import * +class BertForPreTraining(BertPreTrainedModel): + def __init__(self, config): + super().__init__(config) + + self.bert = BertModel(config) + self.cls = BertPreTrainingHeads(config) + + self.init_weights() + + def get_output_embeddings(self): + return self.cls.predictions.decoder + + def set_output_embeddings(self, new_embeddings): + self.cls.predictions.decoder = new_embeddings + + @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @replace_return_docstrings(output_type=BertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + position_ids=None, + head_mask=None, + inputs_embeds=None, + labels=None, + next_sentence_label=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + labels (:obj:`torch.LongTensor` of shape ``(batch_size, sequence_length)``, `optional`): + Labels for computing the masked language modeling loss. 
Indices should be in ``[-100, 0, ..., + config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are ignored + (masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]`` + next_sentence_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`): + Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair + (see :obj:`input_ids` docstring) Indices should be in ``[0, 1]``: + - 0 indicates sequence B is a continuation of sequence A, + - 1 indicates sequence B is a random sequence. + kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`): + Used to hide legacy arguments that have been deprecated. + Returns: + Example:: +from transformers import BertTokenizer, BertForPreTraining +import torch +tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') +model = BertForPreTraining.from_pretrained('bert-base-uncased') +inputs = tokenizer("Hello, my dog is cute", return_tensors="pt") +outputs = model(**inputs) +prediction_logits = outputs.prediction_logits +seq_relationship_logits = outputs.seq_relationship_logits + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.bert( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + sequence_output, pooled_output = outputs[:2] + prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output) + + total_loss = None + if labels is not None and next_sentence_label is not None: + loss_fct = CrossEntropyLoss() + masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1)) + next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1)) + total_loss = masked_lm_loss + next_sentence_loss + + if not return_dict: + output = (prediction_scores, seq_relationship_score) + outputs[2:] + return ((total_loss,) + output) if total_loss is not None else output + + return BertForPreTrainingOutput( + loss=total_loss, + prediction_logits=prediction_scores, + seq_relationship_logits=seq_relationship_score, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + +from transformers import BertTokenizer, BertForPreTraining +import torch +tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') +model = BertForPreTraining.from_pretrained('bert-base-uncased') +inputs = tokenizer("Hello, my dog is cute", return_tensors="pt") +outputs = model(**inputs) +prediction_logits = outputs.prediction_logits +seq_relationship_logits = outputs.seq_relationship_logits +``` + + Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias'] + You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. + + + +```python +@add_start_docstrings( + """Bert Model with a `language modeling` head on top for CLM fine-tuning. 
""", BERT_START_DOCSTRING +) +class BertLMHeadModel(BertPreTrainedModel): + + _keys_to_ignore_on_load_unexpected = [r"pooler"] + _keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias"] + + def __init__(self, config): + super().__init__(config) + + if not config.is_decoder: + logger.warning("If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`") + + self.bert = BertModel(config, add_pooling_layer=False) + self.cls = BertOnlyMLMHead(config) + + self.init_weights() + + def get_output_embeddings(self): + return self.cls.predictions.decoder + + def set_output_embeddings(self, new_embeddings): + self.cls.predictions.decoder = new_embeddings + + @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @replace_return_docstrings(output_type=CausalLMOutputWithCrossAttentions, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + position_ids=None, + head_mask=None, + inputs_embeds=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + labels=None, + past_key_values=None, + use_cache=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`): + Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if + the model is configured as a decoder. + encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): + Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in + the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``: + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): + Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in + ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are + ignored (masked), the loss is only computed for the tokens with labels n ``[0, ..., config.vocab_size]`` + past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`): + Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. + If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids` + (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)` + instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`. + use_cache (:obj:`bool`, `optional`): + If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up + decoding (see :obj:`past_key_values`). 
+ Returns: + Example:: + from transformers import BertTokenizer, BertLMHeadModel, BertConfig + import torch + tokenizer = BertTokenizer.from_pretrained('bert-base-cased') + config = BertConfig.from_pretrained("bert-base-cased") + config.is_decoder = True + model = BertLMHeadModel.from_pretrained('bert-base-cased', config=config) + inputs = tokenizer("Hello, my dog is cute", return_tensors="pt") + outputs = model(**inputs) + prediction_logits = outputs.logits + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + if labels is not None: + use_cache = False + + outputs = self.bert( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + encoder_hidden_states=encoder_hidden_states, + encoder_attention_mask=encoder_attention_mask, + past_key_values=past_key_values, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + sequence_output = outputs[0] + prediction_scores = self.cls(sequence_output) + + lm_loss = None + if labels is not None: + # we are doing next-token prediction; shift prediction scores and input ids by one + shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous() + labels = labels[:, 1:].contiguous() + loss_fct = CrossEntropyLoss() + lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1)) + + if not return_dict: + output = (prediction_scores,) + outputs[2:] + return ((lm_loss,) + output) if lm_loss is not None else output + + return CausalLMOutputWithCrossAttentions( + loss=lm_loss, + logits=prediction_scores, + past_key_values=outputs.past_key_values, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + cross_attentions=outputs.cross_attentions, + ) + + def prepare_inputs_for_generation(self, input_ids, past=None, attention_mask=None, **model_kwargs): + input_shape = input_ids.shape + # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly + if attention_mask is None: + attention_mask = input_ids.new_ones(input_shape) + + # cut decoder_input_ids if past is used + if past is not None: + input_ids = input_ids[:, -1:] + + return {"input_ids": input_ids, "attention_mask": attention_mask, "past_key_values": past} + + def _reorder_cache(self, past, beam_idx): + reordered_past = () + for layer_past in past: + reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),) + return reordered_past + +from transformers import BertTokenizer, BertLMHeadModel, BertConfig +import torch +tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') +config = BertConfig.from_pretrained("bert-base-uncased") +config.is_decoder = True +model = BertLMHeadModel.from_pretrained('bert-base-uncased', config=config) +inputs = tokenizer("Hello, my dog is cute", return_tensors="pt") +outputs = model(**inputs) +prediction_logits = outputs.logits +``` + + Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertLMHeadModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias'] + - This IS expected if you are initializing BertLMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). 
+ - This IS NOT expected if you are initializing BertLMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). + + + +```python +class BertForNextSentencePrediction(BertPreTrainedModel): + def __init__(self, config): + super().__init__(config) + + self.bert = BertModel(config) + self.cls = BertOnlyNSPHead(config) + + self.init_weights() + + @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @replace_return_docstrings(output_type=NextSentencePredictorOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + position_ids=None, + head_mask=None, + inputs_embeds=None, + labels=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + **kwargs, + ): + r""" + labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): + Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair + (see ``input_ids`` docstring). Indices should be in ``[0, 1]``: + - 0 indicates sequence B is a continuation of sequence A, + - 1 indicates sequence B is a random sequence. + Returns: + Example:: +from transformers import BertTokenizer, BertForNextSentencePrediction +import torch +tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') +model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased') +prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced." +next_sentence = "The sky is blue due to the shorter wavelength of blue light." +encoding = tokenizer(prompt, next_sentence, return_tensors='pt') +outputs = model(**encoding, labels=torch.LongTensor([1])) +logits = outputs.logits +assert logits[0, 0] < logits[0, 1] # next sentence was random + """ + + if "next_sentence_label" in kwargs: + warnings.warn( + "The `next_sentence_label` argument is deprecated and will be removed in a future version, use `labels` instead.", + FutureWarning, + ) + labels = kwargs.pop("next_sentence_label") + + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.bert( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + pooled_output = outputs[1] + + seq_relationship_scores = self.cls(pooled_output) + + next_sentence_loss = None + if labels is not None: + loss_fct = CrossEntropyLoss() + next_sentence_loss = loss_fct(seq_relationship_scores.view(-1, 2), labels.view(-1)) + + if not return_dict: + output = (seq_relationship_scores,) + outputs[2:] + return ((next_sentence_loss,) + output) if next_sentence_loss is not None else output + + return NextSentencePredictorOutput( + loss=next_sentence_loss, + logits=seq_relationship_scores, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) +from transformers import BertTokenizer, BertForNextSentencePrediction +import torch +tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') +model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased') +prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced." 
+next_sentence = "The sky is blue due to the shorter wavelength of blue light." +encoding = tokenizer(prompt, next_sentence, return_tensors='pt') +outputs = model(**encoding, labels=torch.LongTensor([1])) +logits = outputs.logits +assert logits[0, 0] < logits[0, 1] # next sentence was random +``` + + Downloading: 100%|██████████| 440M/440M [00:30<00:00, 14.5MB/s] + Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForNextSentencePrediction: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias'] + - This IS expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). + - This IS NOT expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). + + 接下来介绍的是各种 Fine-tune 模型,基本都是分类任务: + ![Bert:finetune](./pictures/3-4-bert-ft.png) 图:Bert:finetune +*** ### 3.2 BertForSequenceClassification 这一模型用于句子分类(也可以是回归)任务,比如 GLUE benchmark 的各个任务。 - 句子分类的输入为句子(对),输出为单个分类标签。 @@ -186,6 +562,155 @@ class BertForSequenceClassification(BertPreTrainedModel): - 否则认为是分类任务。 + +```python +@add_start_docstrings( + """ + Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled + output) e.g. for GLUE tasks. + """, + BERT_START_DOCSTRING, +) +class BertForSequenceClassification(BertPreTrainedModel): + def __init__(self, config): + super().__init__(config) + self.num_labels = config.num_labels + self.config = config + + self.bert = BertModel(config) + classifier_dropout = ( + config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob + ) + self.dropout = nn.Dropout(classifier_dropout) + self.classifier = nn.Linear(config.hidden_size, config.num_labels) + + self.init_weights() + + @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @add_code_sample_docstrings( + tokenizer_class=_TOKENIZER_FOR_DOC, + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=SequenceClassifierOutput, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + position_ids=None, + head_mask=None, + inputs_embeds=None, + labels=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): + Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ..., + config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss), + If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy). 
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        outputs = self.bert(
+            input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        pooled_output = outputs[1]
+
+        pooled_output = self.dropout(pooled_output)
+        logits = self.classifier(pooled_output)
+
+        loss = None
+        if labels is not None:
+            if self.config.problem_type is None:
+                if self.num_labels == 1:
+                    self.config.problem_type = "regression"
+                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
+                    self.config.problem_type = "single_label_classification"
+                else:
+                    self.config.problem_type = "multi_label_classification"
+
+            if self.config.problem_type == "regression":
+                loss_fct = MSELoss()
+                if self.num_labels == 1:
+                    loss = loss_fct(logits.squeeze(), labels.squeeze())
+                else:
+                    loss = loss_fct(logits, labels)
+            elif self.config.problem_type == "single_label_classification":
+                loss_fct = CrossEntropyLoss()
+                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+            elif self.config.problem_type == "multi_label_classification":
+                loss_fct = BCEWithLogitsLoss()
+                loss = loss_fct(logits, labels)
+        if not return_dict:
+            output = (logits,) + outputs[2:]
+            return ((loss,) + output) if loss is not None else output
+
+        return SequenceClassifierOutput(
+            loss=loss,
+            logits=logits,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+```
+
+
+```python
+from transformers.models.bert.tokenization_bert import BertTokenizer
+from transformers.models.bert.modeling_bert import BertForSequenceClassification
+import torch
+
+tokenizer = BertTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
+model = BertForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
+
+classes = ["not paraphrase", "is paraphrase"]
+
+sequence_0 = "The company HuggingFace is based in New York City"
+sequence_1 = "Apples are especially bad for your health"
+sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
+
+# The tokenizer will automatically add any model specific separators (i.e. [CLS] and [SEP]) to the sequence, as well as compute the attention masks.
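+# 下面构造两个句子对:(sequence_0, sequence_2) 语义相近,期望判为 "is paraphrase";
+# (sequence_0, sequence_1) 语义无关,期望判为 "not paraphrase";
+# 模型输出的 logits 经 softmax 后即为两个类别的概率。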
+paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt") +not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt") + +paraphrase_classification_logits = model(**paraphrase).logits +not_paraphrase_classification_logits = model(**not_paraphrase).logits + +paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0] +not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0] + +# Should be paraphrase +for i in range(len(classes)): + print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%") + +# Should not be paraphrase +for i in range(len(classes)): + print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%") +``` + + Downloading: 100%|██████████| 213k/213k [00:00<00:00, 596kB/s] + Downloading: 100%|██████████| 29.0/29.0 [00:00<00:00, 12.4kB/s] + Downloading: 100%|██████████| 436k/436k [00:00<00:00, 808kB/s] + Downloading: 100%|██████████| 433/433 [00:00<00:00, 166kB/s] + Downloading: 100%|██████████| 433M/433M [00:29<00:00, 14.5MB/s] + + + not paraphrase: 10% + is paraphrase: 90% + not paraphrase: 94% + is paraphrase: 6% + + +*** ### 3.3 BertForMultipleChoice 这一模型用于多项选择,如 RocStories/SWAG 任务。 @@ -193,12 +718,272 @@ class BertForSequenceClassification(BertPreTrainedModel): 结构上与句子分类相似,只不过线性层输出维度为 1,即每次需要将每个样本的多个句子的输出拼接起来作为每个样本的预测分数。 - 实际上,具体操作时是把每个 batch 的多个句子一同放入的,所以一次处理的输入为[batch_size, num_choices]数量的句子,因此相同 batch 大小时,比句子分类等任务需要更多的显存,在训练时需要小心。 +*** ### 3.4 BertForTokenClassification 这一模型用于序列标注(词分类),如 NER 任务。 - 序列标注任务的输入为单个句子文本,输出为每个 token 对应的类别标签。 由于需要用到每个 token对应的输出而不只是某几个,所以这里的BertModel不用加入 pooling 层; - 同时,这里将`_keys_to_ignore_on_load_unexpected`这一个类参数设置为`[r"pooler"]`,也就是在加载模型时对于出现不需要的权重不发生报错。 + +```python +class BertForMultipleChoice(BertPreTrainedModel): + def __init__(self, config): + super().__init__(config) + + self.bert = BertModel(config) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier = nn.Linear(config.hidden_size, 1) + + self.init_weights() + + @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length")) + @add_code_sample_docstrings( + tokenizer_class=_TOKENIZER_FOR_DOC, + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=MultipleChoiceModelOutput, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + position_ids=None, + head_mask=None, + inputs_embeds=None, + labels=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): + Labels for computing the multiple choice classification loss. Indices should be in ``[0, ..., + num_choices-1]`` where :obj:`num_choices` is the size of the second dimension of the input tensors. 
(See + :obj:`input_ids` above) + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1] + + input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None + attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None + token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None + position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None + inputs_embeds = ( + inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1)) + if inputs_embeds is not None + else None + ) + + outputs = self.bert( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + pooled_output = outputs[1] + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + reshaped_logits = logits.view(-1, num_choices) + + loss = None + if labels is not None: + loss_fct = CrossEntropyLoss() + loss = loss_fct(reshaped_logits, labels) + + if not return_dict: + output = (reshaped_logits,) + outputs[2:] + return ((loss,) + output) if loss is not None else output + + return MultipleChoiceModelOutput( + loss=loss, + logits=reshaped_logits, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + +``` + + +```python +@add_start_docstrings( + """ + Bert Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for + Named-Entity-Recognition (NER) tasks. + """, + BERT_START_DOCSTRING, +) +class BertForTokenClassification(BertPreTrainedModel): + + _keys_to_ignore_on_load_unexpected = [r"pooler"] + + def __init__(self, config): + super().__init__(config) + self.num_labels = config.num_labels + + self.bert = BertModel(config, add_pooling_layer=False) + classifier_dropout = ( + config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob + ) + self.dropout = nn.Dropout(classifier_dropout) + self.classifier = nn.Linear(config.hidden_size, config.num_labels) + + self.init_weights() + + @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @add_code_sample_docstrings( + tokenizer_class=_TOKENIZER_FOR_DOC, + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=TokenClassifierOutput, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + position_ids=None, + head_mask=None, + inputs_embeds=None, + labels=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): + Labels for computing the token classification loss. Indices should be in ``[0, ..., config.num_labels - + 1]``. 
+ """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.bert( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + sequence_output = outputs[0] + + sequence_output = self.dropout(sequence_output) + logits = self.classifier(sequence_output) + + loss = None + if labels is not None: + loss_fct = CrossEntropyLoss() + # Only keep active parts of the loss + if attention_mask is not None: + active_loss = attention_mask.view(-1) == 1 + active_logits = logits.view(-1, self.num_labels) + active_labels = torch.where( + active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels) + ) + loss = loss_fct(active_logits, active_labels) + else: + loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) + + if not return_dict: + output = (logits,) + outputs[2:] + return ((loss,) + output) if loss is not None else output + + return TokenClassifierOutput( + loss=loss, + logits=logits, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + +``` + + +```python +from transformers import BertForTokenClassification, BertTokenizer +import torch + +model = BertForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english") +tokenizer = BertTokenizer.from_pretrained("bert-base-cased") + +label_list = [ +"O", # Outside of a named entity +"B-MISC", # Beginning of a miscellaneous entity right after another miscellaneous entity +"I-MISC", # Miscellaneous entity +"B-PER", # Beginning of a person's name right after another person's name +"I-PER", # Person's name +"B-ORG", # Beginning of an organisation right after another organisation +"I-ORG", # Organisation +"B-LOC", # Beginning of a location right after another location +"I-LOC" # Location +] + +sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge." 
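+
+# 上面的 label_list 只是为了说明 CoNLL-2003 BIO 标签的含义;
+# 后面打印预测结果时实际使用的是 model.config.id2label,
+# 两者顺序应当一致,具体以该 checkpoint 自带的 config 为准。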
+ +# Bit of a hack to get the tokens with the special tokens +tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence))) +inputs = tokenizer.encode(sequence, return_tensors="pt") + +outputs = model(inputs).logits +predictions = torch.argmax(outputs, dim=2) +``` + + Downloading: 100%|██████████| 998/998 [00:00<00:00, 382kB/s] + Downloading: 100%|██████████| 1.33G/1.33G [01:30<00:00, 14.7MB/s] + + + +```python +for token, prediction in zip(tokens, predictions[0].numpy()): + print((token, model.config.id2label[prediction])) +``` + + ('[CLS]', 'O') + ('Hu', 'I-ORG') + ('##gging', 'I-ORG') + ('Face', 'I-ORG') + ('Inc', 'I-ORG') + ('.', 'O') + ('is', 'O') + ('a', 'O') + ('company', 'O') + ('based', 'O') + ('in', 'O') + ('New', 'I-LOC') + ('York', 'I-LOC') + ('City', 'I-LOC') + ('.', 'O') + ('Its', 'O') + ('headquarters', 'O') + ('are', 'O') + ('in', 'O') + ('D', 'I-LOC') + ('##UM', 'I-LOC') + ('##BO', 'I-LOC') + (',', 'O') + ('therefore', 'O') + ('very', 'O') + ('close', 'O') + ('to', 'O') + ('the', 'O') + ('Manhattan', 'I-LOC') + ('Bridge', 'I-LOC') + ('.', 'O') + ('[SEP]', 'O') + + +*** ### 3.5 BertForQuestionAnswering 这一模型用于解决问答任务,例如 SQuAD 任务。 - 问答任务的输入为问题 +(对于 BERT 只能是一个)回答组成的句子对,输出为起始位置和结束位置用于标出回答中的具体文本。 @@ -207,6 +992,158 @@ class BertForSequenceClassification(BertPreTrainedModel): 以上就是关于 BERT 源码的介绍,下面介绍一些关于 BERT 模型实用的训练细节。 + +```python +@add_start_docstrings( + """ + Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear + layers on top of the hidden-states output to compute `span start logits` and `span end logits`). + """, + BERT_START_DOCSTRING, +) +class BertForQuestionAnswering(BertPreTrainedModel): + + _keys_to_ignore_on_load_unexpected = [r"pooler"] + + def __init__(self, config): + super().__init__(config) + self.num_labels = config.num_labels + + self.bert = BertModel(config, add_pooling_layer=False) + self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels) + + self.init_weights() + + @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @add_code_sample_docstrings( + tokenizer_class=_TOKENIZER_FOR_DOC, + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=QuestionAnsweringModelOutput, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + position_ids=None, + head_mask=None, + inputs_embeds=None, + start_positions=None, + end_positions=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): + Labels for position (index) of the start of the labelled span for computing the token classification loss. + Positions are clamped to the length of the sequence (:obj:`sequence_length`). Position outside of the + sequence are not taken into account for computing the loss. + end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): + Labels for position (index) of the end of the labelled span for computing the token classification loss. + Positions are clamped to the length of the sequence (:obj:`sequence_length`). Position outside of the + sequence are not taken into account for computing the loss. 
+ """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.bert( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + sequence_output = outputs[0] + + logits = self.qa_outputs(sequence_output) + start_logits, end_logits = logits.split(1, dim=-1) + start_logits = start_logits.squeeze(-1).contiguous() + end_logits = end_logits.squeeze(-1).contiguous() + + total_loss = None + if start_positions is not None and end_positions is not None: + # If we are on multi-GPU, split add a dimension + if len(start_positions.size()) > 1: + start_positions = start_positions.squeeze(-1) + if len(end_positions.size()) > 1: + end_positions = end_positions.squeeze(-1) + # sometimes the start/end positions are outside our model inputs, we ignore these terms + ignored_index = start_logits.size(1) + start_positions = start_positions.clamp(0, ignored_index) + end_positions = end_positions.clamp(0, ignored_index) + + loss_fct = CrossEntropyLoss(ignore_index=ignored_index) + start_loss = loss_fct(start_logits, start_positions) + end_loss = loss_fct(end_logits, end_positions) + total_loss = (start_loss + end_loss) / 2 + + if not return_dict: + output = (start_logits, end_logits) + outputs[2:] + return ((total_loss,) + output) if total_loss is not None else output + + return QuestionAnsweringModelOutput( + loss=total_loss, + start_logits=start_logits, + end_logits=end_logits, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + +``` + + +```python +from transformers import AutoTokenizer, AutoModelForQuestionAnswering +import torch + +tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad") +model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad") + +text = "🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch." 
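+
+# 问答任务中,question 与 text(作为 context)会被 tokenizer 拼接成一个句子对输入;
+# 模型输出的 start_logits / end_logits 是对拼接后每个 token 作为答案起点/终点的打分,
+# 对两者分别取 argmax 即可得到答案在 input_ids 中的跨度。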
+ +questions = [ +"How many pretrained models are available in 🤗 Transformers?", +"What does 🤗 Transformers provide?", +"🤗 Transformers provides interoperability between which frameworks?", +] + +for question in questions: + inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt") + input_ids = inputs["input_ids"].tolist()[0] + outputs = model(**inputs) + answer_start_scores = outputs.start_logits + answer_end_scores = outputs.end_logits + answer_start = torch.argmax( + answer_start_scores + ) # Get the most likely beginning of answer with the argmax of the score + answer_end = torch.argmax(answer_end_scores) + 1 # Get the most likely end of answer with the argmax of the score + answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])) + print(f"Question: {question}") + print(f"Answer: {answer}") +``` + + Downloading: 100%|██████████| 443/443 [00:00<00:00, 186kB/s] + Downloading: 100%|██████████| 232k/232k [00:00<00:00, 438kB/s] + Downloading: 100%|██████████| 466k/466k [00:00<00:00, 845kB/s] + Downloading: 100%|██████████| 28.0/28.0 [00:00<00:00, 10.5kB/s] + Downloading: 100%|██████████| 1.34G/1.34G [01:28<00:00, 15.1MB/s] + + + Question: How many pretrained models are available in 🤗 Transformers? + Answer: over 32 + + Question: What does 🤗 Transformers provide? + Answer: general - purpose architectures + Question: 🤗 Transformers provides interoperability between which frameworks? + Answer: tensorflow 2. 0 and pytorch + + +*** ## BERT训练和优化 ### 4.1 Pre-Training 预训练阶段,除了众所周知的 15%、80% mask 比例,有一个值得注意的地方就是参数共享。 @@ -234,6 +1171,8 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin): 至于为什么,应该是因为 word_embedding 和 prediction 权重太大了,以 bert-base 为例,其尺寸为(30522, 768),降低训练难度。 + +*** ### 4.2 Fine-Tuning 微调也就是下游任务阶段,也有两个值得注意的地方。 #### 4.2.1 AdamW @@ -312,10 +1251,6 @@ def get_scheduler( 本文主要由浙江大学李泺秋撰写,本项目同学负责整理和汇总。 +```python - - - - - - +```
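+
+***
+下面补充一个最小代码示意(非教程原文,基于 transformers 4.4.2 前后的公开 API 整理,具体以实际安装版本为准),用来验证 4.1 中提到的 word_embedding 与 MLM 预测层的权重共享,并演示 4.2 中 AdamW 配合 Warmup 学习率调度的常见写法;其中排除 bias、LayerNorm.weight 不做 weight decay 的参数分组方式是 HuggingFace 官方示例脚本中的惯例写法,训练步数、warmup 步数和学习率均为演示用的假设值。
+
+```python
+import torch
+from transformers import BertConfig, BertForMaskedLM, AdamW, get_linear_schedule_with_warmup
+
+# 1) 验证权重共享:init_weights() 内部会调用 tie_weights(),
+#    使 MLM 预测层 decoder 的权重与输入端 word_embeddings 指向同一个张量
+config = BertConfig()  # 随机初始化即可,无需下载预训练权重
+model = BertForMaskedLM(config)
+print(model.cls.predictions.decoder.weight is model.bert.embeddings.word_embeddings.weight)  # 预期输出 True
+print(model.cls.predictions.decoder.weight.shape)  # 预期输出 torch.Size([30522, 768]),与正文所述一致
+
+# 2) AdamW + 线性 Warmup 调度(假设共训练 1000 步,其中前 100 步 warmup)
+no_decay = ["bias", "LayerNorm.weight"]  # 惯例:bias 与 LayerNorm 参数不做 weight decay
+grouped_parameters = [
+    {"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
+     "weight_decay": 0.01},
+    {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
+     "weight_decay": 0.0},
+]
+optimizer = AdamW(grouped_parameters, lr=2e-5)  # 学习率同样只是演示用的假设值
+scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=100, num_training_steps=1000)
+# 训练循环中每个 step 依次调用 optimizer.step() 与 scheduler.step()
+```
+
+也可以改用正文提到的统一接口 get_scheduler("linear", optimizer, num_warmup_steps=100, num_training_steps=1000) 得到等价的调度器。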