fix
This commit is contained in:
parent 3584f014aa
commit 8ad200eb01
|
@ -97,29 +97,7 @@
|
|||
"output_type": "stream",
|
||||
"name": "stderr",
|
||||
"text": [
|
||||
"Downloading: 8.33kB [00:00, 1.49MB/s] \n",
|
||||
"Downloading: 5.83kB [00:00, 1.77MB/s] \n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"output_type": "stream",
|
||||
"name": "stdout",
|
||||
"text": [
|
||||
"Downloading and preparing dataset wikitext/wikitext-2-raw-v1 (download: 4.50 MiB, generated: 12.91 MiB, post-processed: Unknown size, total: 17.41 MiB) to /Users/niepig/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20...\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"output_type": "stream",
|
||||
"name": "stderr",
|
||||
"text": [
|
||||
"Downloading: 100%|██████████| 4.72M/4.72M [00:02<00:00, 1.91MB/s]\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"output_type": "stream",
|
||||
"name": "stdout",
|
||||
"text": [
|
||||
"Dataset wikitext downloaded and prepared to /Users/niepig/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20. Subsequent calls will reuse this data.\n"
|
||||
"Reusing dataset wikitext (/Users/niepig/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20)\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
|
@ -262,15 +240,15 @@
|
|||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>Plum cakes made with fresh plums came with other migrants from other traditions in which plum cake is prepared using plum as a primary ingredient . In some versions , the plums may become jam @-@ like inside the cake after cooking , or be prepared using plum jam . Plum cake prepared with plums is also a part of Ashkenazi Jewish cuisine , and is referred to as Pflaumenkuchen or Zwetschgenkuchen . Other plum @-@ based cakes are found in French , Italian and Polish cooking . \\n</td>\n",
|
||||
" <td>On 3 March 1967 , parliament decided to build four short take @-@ off and landing airports along the Helgeland coast between Trondheim and Bodø . Braathens placed an order for a de Havilland Canada DHC @-@ 6 Twin Otter and planned to start the company Braathens STOL . It applied to operate the route without subsidies , but the concession was rejected and granted with subsidies to Widerøe , which had been operating the routes using seaplanes . \\n</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>= = = Language = = = \\n</td>\n",
|
||||
" <td></td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td></td>\n",
|
||||
" <td>Rao Ramesh was cast as a tantrik who helps Gill 's character in the present era . Mumaith Khan was selected for another item number , a remix version of the hit song \" Bangaru Kodipetta \" from Gharana Mogudu ( 1992 ) ; Gharana Mogudu 's music was also composed by M. M. Keeravani . Chiranjeevi made a special appearance after the song , making Magadheera the first film he appeared in after his entry into politics . When Rajamouli suggested the idea of a cameo appearance , Chiranjeevi was initially hesitant till the director narrated the complete sequence and the importance of the song . \\n</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
|
@ -278,23 +256,23 @@
|
|||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>The town 's population not only recovered but grew ; the 1906 census of the Canadian Prairies listed the population at 1 @,@ 178 . A new study commissioned by the Dominion government determined that the cracks in the mountain continued to grow and that the risk of another slide remained . Consequently , parts of Frank closest to the mountain were dismantled or relocated to safer areas . \\n</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>5</th>\n",
|
||||
" <td>The Litigators is a 2011 legal thriller novel by John Grisham , his 25th fiction novel overall . The Litigators is about a two @-@ partner Chicago law firm attempting to strike it rich in a class action lawsuit over a cholesterol reduction drug by a major pharmaceutical drug company . The protagonist is a Harvard Law School grad big law firm burnout who stumbles upon the boutique and joins it only to find himself litigating against his old law firm in this case . The book is regarded as more humorous than most of Grisham 's prior novels . \\n</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>6</th>\n",
|
||||
" <td></td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>5</th>\n",
|
||||
" <td>= = = Total Nonstop Action Wrestling ( 2015 – present ) = = = \\n</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>6</th>\n",
|
||||
" <td>The Daily Telegraph gave the visual novel the award for \" Best Script \" in its video game awards of 2011 , stating that \" Love 's layered narrative of a high school teacher embroiled in his student ’ s worries goes places most mainstream video games wouldn 't dare . \" \\n</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>7</th>\n",
|
||||
" <td>On December 7 , 2006 , Headquarters Marine Corps released a message stating that 2nd Battalion 9th Marines would be reactivated during 2007 as part of the continuing Global War on Terror . 2nd Battalion 9th Marines was re @-@ activated on July 13 , 2007 and replaced the Anti @-@ Terrorism Battalion ( ATBn ) . In September 2008 , Marines and Sailors from 2 / 9 deployed to Al Anbar Province in support of Operation Iraqi Freedom . They were based in the city of Ramadi and returned in April 2009 without any Marines or Sailors killed in action . July 2010 Marines and Sailors from 2 / 9 deployed to Marjah , Helmand Province , Afghanistan in support of Operation Enduring Freedom . In December 2010 Echo Company from 2 / 9 were attached to 3 / 5 in Sangin , Afghanistan where they earned the notorious nickname of \" Green Hats . \" They returned February 2011 . They redeployed back to Marjah December 2011 and returned July 2012 . Echo and Weapons companies deployed once more to Afghanistan from January through April 2013 , participating in combat operations out of Camp Leatherneck . On April 1 , 2015 the battalion was deactivated in a ceremony at Camp Lejeune . \\n</td>\n",
|
||||
" <td></td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>8</th>\n",
|
||||
" <td>( i ) = Indoor \\n</td>\n",
|
||||
" <td></td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>9</th>\n",
|
||||
|
@ -383,18 +361,7 @@
|
|||
" \n",
|
||||
"tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)"
|
||||
],
|
||||
"outputs": [
|
||||
{
|
||||
"output_type": "stream",
|
||||
"name": "stderr",
|
||||
"text": [
|
||||
"Downloading: 100%|██████████| 762/762 [00:00<00:00, 358kB/s]\n",
|
||||
"Downloading: 100%|██████████| 1.04M/1.04M [00:04<00:00, 235kB/s]\n",
|
||||
"Downloading: 100%|██████████| 456k/456k [00:02<00:00, 217kB/s]\n",
|
||||
"Downloading: 100%|██████████| 1.36M/1.36M [00:05<00:00, 252kB/s]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {
|
||||
"id": "mQwZ5UssWdB_"
|
||||
}
|
||||
|
@ -431,72 +398,11 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=[\"text\"])"
|
||||
],
|
||||
"outputs": [
|
||||
{
|
||||
"output_type": "stream",
|
||||
"name": "stderr",
|
||||
"text": [
|
||||
"#0: 0%| | 0/2 [00:00<?, ?ba/s]\n",
|
||||
"\u001b[A\n",
|
||||
"\n",
|
||||
"\u001b[A\u001b[A\n",
|
||||
"\n",
|
||||
"#3: 100%|██████████| 2/2 [00:00<00:00, 6.42ba/s]\n",
|
||||
"#1: 100%|██████████| 2/2 [00:00<00:00, 5.87ba/s]\n",
|
||||
"#0: 100%|██████████| 2/2 [00:00<00:00, 5.56ba/s]\n",
|
||||
"\n",
|
||||
"#2: 100%|██████████| 2/2 [00:00<00:00, 4.73ba/s]\n",
|
||||
"#0: 0%| | 0/10 [00:00<?, ?ba/s]\n",
|
||||
"\u001b[A\n",
|
||||
"\n",
|
||||
"#0: 10%|█ | 1/10 [00:00<00:03, 2.87ba/s]\n",
|
||||
"\u001b[A\n",
|
||||
"\n",
|
||||
"#0: 20%|██ | 2/10 [00:00<00:02, 2.89ba/s]\n",
|
||||
"\u001b[A\n",
|
||||
"\n",
|
||||
"#0: 30%|███ | 3/10 [00:00<00:02, 3.08ba/s]\n",
|
||||
"\u001b[A\n",
|
||||
"\n",
|
||||
"#0: 40%|████ | 4/10 [00:01<00:01, 3.14ba/s]\n",
|
||||
"\u001b[A\n",
|
||||
"\n",
|
||||
"#0: 50%|█████ | 5/10 [00:01<00:01, 3.33ba/s]\n",
|
||||
"\u001b[A\n",
|
||||
"\n",
|
||||
"#0: 60%|██████ | 6/10 [00:01<00:01, 3.44ba/s]\n",
|
||||
"\u001b[A\n",
|
||||
"\n",
|
||||
"\u001b[A\u001b[A\n",
|
||||
"#0: 70%|███████ | 7/10 [00:02<00:01, 2.89ba/s]\n",
|
||||
"\n",
|
||||
"\u001b[A\u001b[A\n",
|
||||
"#0: 80%|████████ | 8/10 [00:02<00:00, 2.89ba/s]\n",
|
||||
"\n",
|
||||
"#0: 90%|█████████ | 9/10 [00:02<00:00, 3.04ba/s]\n",
|
||||
"#0: 100%|██████████| 10/10 [00:02<00:00, 3.37ba/s]\n",
|
||||
"#2: 100%|██████████| 10/10 [00:02<00:00, 3.44ba/s]\n",
|
||||
"#1: 100%|██████████| 10/10 [00:02<00:00, 3.33ba/s]\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"#3: 100%|██████████| 10/10 [00:03<00:00, 3.25ba/s]\n",
|
||||
"#0: 0%| | 0/1 [00:00<?, ?ba/s]\n",
|
||||
"\u001b[A\n",
|
||||
"\n",
|
||||
"#0: 100%|██████████| 1/1 [00:00<00:00, 3.70ba/s]\n",
|
||||
"#1: 100%|██████████| 1/1 [00:00<00:00, 2.79ba/s]\n",
|
||||
"\n",
|
||||
"\u001b[A\n",
|
||||
"\n",
|
||||
"#2: 100%|██████████| 1/1 [00:00<00:00, 2.74ba/s]\n",
|
||||
"#3: 100%|██████████| 1/1 [00:00<00:00, 2.82ba/s]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {
|
||||
"id": "rNb1U12YWdCA"
|
||||
}
|
||||
|
@ -607,7 +513,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"lm_datasets = tokenized_datasets.map(\n",
|
||||
" group_texts,\n",
|
||||
|
@ -616,68 +522,7 @@
|
|||
" num_proc=4,\n",
|
||||
")"
|
||||
],
|
||||
"outputs": [
|
||||
{
|
||||
"output_type": "stream",
|
||||
"name": "stderr",
|
||||
"text": [
|
||||
"#0: 0%| | 0/2 [00:00<?, ?ba/s]\n",
|
||||
"\u001b[A\n",
|
||||
"\n",
|
||||
"\u001b[A\u001b[A\n",
|
||||
"\n",
|
||||
"#3: 100%|██████████| 2/2 [00:00<00:00, 6.12ba/s]\n",
|
||||
"#1: 100%|██████████| 2/2 [00:00<00:00, 4.89ba/s]\n",
|
||||
"#0: 100%|██████████| 2/2 [00:00<00:00, 4.60ba/s]\n",
|
||||
"\n",
|
||||
"#2: 100%|██████████| 2/2 [00:00<00:00, 3.94ba/s]\n",
|
||||
"#0: 0%| | 0/10 [00:00<?, ?ba/s]\n",
|
||||
"\u001b[A\n",
|
||||
"\n",
|
||||
"#0: 10%|█ | 1/10 [00:00<00:03, 2.90ba/s]\n",
|
||||
"\u001b[A\n",
|
||||
"\n",
|
||||
"#0: 20%|██ | 2/10 [00:00<00:02, 2.76ba/s]\n",
|
||||
"\u001b[A\n",
|
||||
"\n",
|
||||
"#0: 30%|███ | 3/10 [00:01<00:02, 2.72ba/s]\n",
|
||||
"\u001b[A\n",
|
||||
"\n",
|
||||
"#0: 40%|████ | 4/10 [00:01<00:02, 2.75ba/s]\n",
|
||||
"\u001b[A\n",
|
||||
"\n",
|
||||
"#0: 50%|█████ | 5/10 [00:01<00:01, 2.92ba/s]\n",
|
||||
"#0: 60%|██████ | 6/10 [00:02<00:01, 3.01ba/s]\n",
|
||||
"\n",
|
||||
"\u001b[A\u001b[A\n",
|
||||
"\u001b[A\n",
|
||||
"#0: 70%|███████ | 7/10 [00:02<00:01, 2.69ba/s]\n",
|
||||
"\n",
|
||||
"#0: 80%|████████ | 8/10 [00:02<00:00, 2.67ba/s]\n",
|
||||
"\u001b[A\n",
|
||||
"\n",
|
||||
"#0: 100%|██████████| 10/10 [00:03<00:00, 3.00ba/s]\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\u001b[A\u001b[A\n",
|
||||
"#2: 100%|██████████| 10/10 [00:03<00:00, 3.04ba/s]\n",
|
||||
"#1: 100%|██████████| 10/10 [00:03<00:00, 2.88ba/s]\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"#3: 100%|██████████| 10/10 [00:03<00:00, 2.79ba/s]\n",
|
||||
"#0: 0%| | 0/1 [00:00<?, ?ba/s]\n",
|
||||
"\u001b[A\n",
|
||||
"\n",
|
||||
"#0: 100%|██████████| 1/1 [00:00<00:00, 3.41ba/s]\n",
|
||||
"#1: 100%|██████████| 1/1 [00:00<00:00, 2.61ba/s]\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"#3: 100%|██████████| 1/1 [00:00<00:00, 2.69ba/s]\n",
|
||||
"\n",
|
||||
"#2: 100%|██████████| 1/1 [00:00<00:00, 2.55ba/s]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {
|
||||
"id": "lmoi9YUZWdCC"
|
||||
}
|
||||
|
@ -734,15 +579,7 @@
|
|||
"from transformers import AutoModelForCausalLM\n",
|
||||
"model = AutoModelForCausalLM.from_pretrained(model_checkpoint)"
|
||||
],
|
||||
"outputs": [
|
||||
{
|
||||
"output_type": "stream",
|
||||
"name": "stderr",
|
||||
"text": [
|
||||
"Downloading: 100%|██████████| 353M/353M [00:21<00:00, 16.0MB/s]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {
|
||||
"colab": {
|
||||
"base_uri": "https://localhost:8080/",
|
||||
|
@ -773,7 +610,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 20,
|
||||
"execution_count": 15,
|
||||
"source": [
|
||||
"\n",
|
||||
"import importlib.util\n",
|
||||
|
@ -811,7 +648,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 21,
|
||||
"execution_count": 16,
|
||||
"source": [
|
||||
"from transformers import Trainer, TrainingArguments"
|
||||
],
|
||||
|
@ -822,7 +659,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 22,
|
||||
"execution_count": 17,
|
||||
"source": [
|
||||
"training_args = TrainingArguments(\n",
|
||||
" \"test-clm\",\n",
|
||||
|
@ -847,13 +684,13 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 23,
|
||||
"execution_count": 18,
|
||||
"source": [
|
||||
"trainer = Trainer(\n",
|
||||
" model=model,\n",
|
||||
" args=training_args,\n",
|
||||
" train_dataset=lm_datasets[\"train\"][:1000],\n",
|
||||
" eval_dataset=lm_datasets[\"validation\"][:1000],\n",
|
||||
" train_dataset=lm_datasets[\"train\"],\n",
|
||||
" eval_dataset=lm_datasets[\"validation\"],\n",
|
||||
")"
|
||||
],
|
||||
"outputs": [],
|
||||
|
@ -872,7 +709,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 24,
|
||||
"execution_count": 19,
|
||||
"source": [
|
||||
"trainer.train()"
|
||||
],
|
||||
|
@ -881,23 +718,7 @@
|
|||
"output_type": "stream",
|
||||
"name": "stderr",
|
||||
"text": [
|
||||
" 0%| | 0/3 [00:00<?, ?it/s]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"output_type": "error",
|
||||
"ename": "KeyError",
|
||||
"evalue": "1",
|
||||
"traceback": [
|
||||
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
|
||||
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
|
||||
"\u001b[0;32m/var/folders/2k/x3py0v857kgcwqvvl00xxhxw0000gn/T/ipykernel_12460/4032920361.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mtrainer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtrain\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
|
||||
"\u001b[0;32m~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/transformers/trainer.py\u001b[0m in \u001b[0;36mtrain\u001b[0;34m(self, resume_from_checkpoint, trial, **kwargs)\u001b[0m\n\u001b[1;32m 1032\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcontrol\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcallback_handler\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mon_epoch_begin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstate\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcontrol\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1033\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1034\u001b[0;31m \u001b[0;32mfor\u001b[0m \u001b[0mstep\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minputs\u001b[0m \u001b[0;32min\u001b[0m \u001b[0menumerate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mepoch_iterator\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1035\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1036\u001b[0m \u001b[0;31m# Skip past any already trained steps if resuming training\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
|
||||
"\u001b[0;32m~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py\u001b[0m in \u001b[0;36m__next__\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 519\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_sampler_iter\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 520\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_reset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 521\u001b[0;31m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_next_data\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 522\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_num_yielded\u001b[0m \u001b[0;34m+=\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 523\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_dataset_kind\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0m_DatasetKind\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mIterable\u001b[0m \u001b[0;32mand\u001b[0m\u001b[0;31m \u001b[0m\u001b[0;31m\\\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
|
||||
"\u001b[0;32m~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py\u001b[0m in \u001b[0;36m_next_data\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 559\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_next_data\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 560\u001b[0m \u001b[0mindex\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_next_index\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# may raise StopIteration\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 561\u001b[0;31m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_dataset_fetcher\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfetch\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindex\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# may raise StopIteration\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 562\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_pin_memory\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 563\u001b[0m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_utils\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpin_memory\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpin_memory\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
|
||||
"\u001b[0;32m~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py\u001b[0m in \u001b[0;36mfetch\u001b[0;34m(self, possibly_batched_index)\u001b[0m\n\u001b[1;32m 42\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mfetch\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpossibly_batched_index\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 43\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mauto_collation\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 44\u001b[0;31m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdataset\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0midx\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0midx\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mpossibly_batched_index\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 45\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 46\u001b[0m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdataset\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mpossibly_batched_index\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
|
||||
"\u001b[0;32m~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py\u001b[0m in \u001b[0;36m<listcomp>\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m 42\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mfetch\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpossibly_batched_index\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 43\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mauto_collation\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 44\u001b[0;31m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdataset\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0midx\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0midx\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mpossibly_batched_index\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 45\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 46\u001b[0m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdataset\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mpossibly_batched_index\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
|
||||
"\u001b[0;31mKeyError\u001b[0m: 1"
|
||||
" 0%| | 31/7002 [04:16<14:27:52, 7.47s/it]"
|
||||
]
|
||||
}
|
||||
],
|
||||
|
@ -1070,8 +891,8 @@
|
|||
"trainer = Trainer(\n",
|
||||
" model=model,\n",
|
||||
" args=training_args,\n",
|
||||
" train_dataset=lm_datasets[\"train\"][:1000],\n",
|
||||
" eval_dataset=lm_datasets[\"validation\"][:100],\n",
|
||||
" train_dataset=lm_datasets[\"train\"],\n",
|
||||
" eval_dataset=lm_datasets[\"validation\"],\n",
|
||||
" data_collator=data_collator,\n",
|
||||
")"
|
||||
],
|
||||
|
|
|
@ -41,17 +41,7 @@ from datasets import load_dataset
|
|||
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')
|
||||
```
|
||||
|
||||
Downloading: 8.33kB [00:00, 1.49MB/s]
|
||||
Downloading: 5.83kB [00:00, 1.77MB/s]
|
||||
|
||||
|
||||
Downloading and preparing dataset wikitext/wikitext-2-raw-v1 (download: 4.50 MiB, generated: 12.91 MiB, post-processed: Unknown size, total: 17.41 MiB) to /Users/niepig/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20...
|
||||
|
||||
|
||||
Downloading: 100%|██████████| 4.72M/4.72M [00:02<00:00, 1.91MB/s]
|
||||
|
||||
|
||||
Dataset wikitext downloaded and prepared to /Users/niepig/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20. Subsequent calls will reuse this data.
|
||||
Reusing dataset wikitext (/Users/niepig/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20)
|
||||
|
||||
|
||||
If you encounter the following error:
|
||||
|
@ -127,15 +117,15 @@ show_random_elements(datasets["train"])
|
|||
<tbody>
|
||||
<tr>
|
||||
<th>0</th>
|
||||
<td>Plum cakes made with fresh plums came with other migrants from other traditions in which plum cake is prepared using plum as a primary ingredient . In some versions , the plums may become jam @-@ like inside the cake after cooking , or be prepared using plum jam . Plum cake prepared with plums is also a part of Ashkenazi Jewish cuisine , and is referred to as Pflaumenkuchen or Zwetschgenkuchen . Other plum @-@ based cakes are found in French , Italian and Polish cooking . \n</td>
|
||||
<td>On 3 March 1967 , parliament decided to build four short take @-@ off and landing airports along the Helgeland coast between Trondheim and Bodø . Braathens placed an order for a de Havilland Canada DHC @-@ 6 Twin Otter and planned to start the company Braathens STOL . It applied to operate the route without subsidies , but the concession was rejected and granted with subsidies to Widerøe , which had been operating the routes using seaplanes . \n</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>1</th>
|
||||
<td>= = = Language = = = \n</td>
|
||||
<td></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>2</th>
|
||||
<td></td>
|
||||
<td>Rao Ramesh was cast as a tantrik who helps Gill 's character in the present era . Mumaith Khan was selected for another item number , a remix version of the hit song " Bangaru Kodipetta " from Gharana Mogudu ( 1992 ) ; Gharana Mogudu 's music was also composed by M. M. Keeravani . Chiranjeevi made a special appearance after the song , making Magadheera the first film he appeared in after his entry into politics . When Rajamouli suggested the idea of a cameo appearance , Chiranjeevi was initially hesitant till the director narrated the complete sequence and the importance of the song . \n</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>3</th>
|
||||
|
@ -143,23 +133,23 @@ show_random_elements(datasets["train"])
|
|||
</tr>
|
||||
<tr>
|
||||
<th>4</th>
|
||||
<td>The town 's population not only recovered but grew ; the 1906 census of the Canadian Prairies listed the population at 1 @,@ 178 . A new study commissioned by the Dominion government determined that the cracks in the mountain continued to grow and that the risk of another slide remained . Consequently , parts of Frank closest to the mountain were dismantled or relocated to safer areas . \n</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>5</th>
|
||||
<td>The Litigators is a 2011 legal thriller novel by John Grisham , his 25th fiction novel overall . The Litigators is about a two @-@ partner Chicago law firm attempting to strike it rich in a class action lawsuit over a cholesterol reduction drug by a major pharmaceutical drug company . The protagonist is a Harvard Law School grad big law firm burnout who stumbles upon the boutique and joins it only to find himself litigating against his old law firm in this case . The book is regarded as more humorous than most of Grisham 's prior novels . \n</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>6</th>
|
||||
<td></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>5</th>
|
||||
<td>= = = Total Nonstop Action Wrestling ( 2015 – present ) = = = \n</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>6</th>
|
||||
<td>The Daily Telegraph gave the visual novel the award for " Best Script " in its video game awards of 2011 , stating that " Love 's layered narrative of a high school teacher embroiled in his student ’ s worries goes places most mainstream video games wouldn 't dare . " \n</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>7</th>
|
||||
<td>On December 7 , 2006 , Headquarters Marine Corps released a message stating that 2nd Battalion 9th Marines would be reactivated during 2007 as part of the continuing Global War on Terror . 2nd Battalion 9th Marines was re @-@ activated on July 13 , 2007 and replaced the Anti @-@ Terrorism Battalion ( ATBn ) . In September 2008 , Marines and Sailors from 2 / 9 deployed to Al Anbar Province in support of Operation Iraqi Freedom . They were based in the city of Ramadi and returned in April 2009 without any Marines or Sailors killed in action . July 2010 Marines and Sailors from 2 / 9 deployed to Marjah , Helmand Province , Afghanistan in support of Operation Enduring Freedom . In December 2010 Echo Company from 2 / 9 were attached to 3 / 5 in Sangin , Afghanistan where they earned the notorious nickname of " Green Hats . " They returned February 2011 . They redeployed back to Marjah December 2011 and returned July 2012 . Echo and Weapons companies deployed once more to Afghanistan from January through April 2013 , participating in combat operations out of Camp Leatherneck . On April 1 , 2015 the battalion was deactivated in a ceremony at Camp Lejeune . \n</td>
|
||||
<td></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>8</th>
|
||||
<td>( i ) = Indoor \n</td>
|
||||
<td></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>9</th>
|
||||
|
@ -201,12 +191,6 @@ from transformers import AutoTokenizer
|
|||
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
|
||||
```
|
||||
|
||||
Downloading: 100%|██████████| 762/762 [00:00<00:00, 358kB/s]
|
||||
Downloading: 100%|██████████| 1.04M/1.04M [00:04<00:00, 235kB/s]
|
||||
Downloading: 100%|██████████| 456k/456k [00:02<00:00, 217kB/s]
|
||||
Downloading: 100%|██████████| 1.36M/1.36M [00:05<00:00, 252kB/s]
|
||||
|
||||
|
||||
We can now call the tokenizer on all of our texts. This can be done very simply with the `map` method from the Datasets library. First, we define a function that calls the tokenizer on our texts:
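Only the tail of that cell is visible in the hunk that follows; for context, a minimal sketch of the full step (the `distilgpt2` checkpoint name is an illustrative assumption, not taken from this diff — the notebook defines `model_checkpoint` in an earlier cell):

```python
from transformers import AutoTokenizer

# Illustrative assumption; the notebook sets model_checkpoint earlier.
model_checkpoint = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

def tokenize_function(examples):
    # Tokenize the raw "text" column. No padding or truncation is applied here,
    # because the texts are concatenated and re-chunked into fixed-size blocks
    # in a later step.
    return tokenizer(examples["text"])
```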
|
||||
|
||||
|
||||
|
@ -223,62 +207,6 @@ def tokenize_function(examples):
|
|||
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])
|
||||
```
|
||||
|
||||
#0: 0%| | 0/2 [00:00<?, ?ba/s]
|
||||
[A
|
||||
|
||||
[A[A
|
||||
|
||||
#3: 100%|██████████| 2/2 [00:00<00:00, 6.42ba/s]
|
||||
#1: 100%|██████████| 2/2 [00:00<00:00, 5.87ba/s]
|
||||
#0: 100%|██████████| 2/2 [00:00<00:00, 5.56ba/s]
|
||||
|
||||
#2: 100%|██████████| 2/2 [00:00<00:00, 4.73ba/s]
|
||||
#0: 0%| | 0/10 [00:00<?, ?ba/s]
|
||||
[A
|
||||
|
||||
#0: 10%|█ | 1/10 [00:00<00:03, 2.87ba/s]
|
||||
[A
|
||||
|
||||
#0: 20%|██ | 2/10 [00:00<00:02, 2.89ba/s]
|
||||
[A
|
||||
|
||||
#0: 30%|███ | 3/10 [00:00<00:02, 3.08ba/s]
|
||||
[A
|
||||
|
||||
#0: 40%|████ | 4/10 [00:01<00:01, 3.14ba/s]
|
||||
[A
|
||||
|
||||
#0: 50%|█████ | 5/10 [00:01<00:01, 3.33ba/s]
|
||||
[A
|
||||
|
||||
#0: 60%|██████ | 6/10 [00:01<00:01, 3.44ba/s]
|
||||
[A
|
||||
|
||||
[A[A
|
||||
#0: 70%|███████ | 7/10 [00:02<00:01, 2.89ba/s]
|
||||
|
||||
[A[A
|
||||
#0: 80%|████████ | 8/10 [00:02<00:00, 2.89ba/s]
|
||||
|
||||
#0: 90%|█████████ | 9/10 [00:02<00:00, 3.04ba/s]
|
||||
#0: 100%|██████████| 10/10 [00:02<00:00, 3.37ba/s]
|
||||
#2: 100%|██████████| 10/10 [00:02<00:00, 3.44ba/s]
|
||||
#1: 100%|██████████| 10/10 [00:02<00:00, 3.33ba/s]
|
||||
|
||||
|
||||
#3: 100%|██████████| 10/10 [00:03<00:00, 3.25ba/s]
|
||||
#0: 0%| | 0/1 [00:00<?, ?ba/s]
|
||||
[A
|
||||
|
||||
#0: 100%|██████████| 1/1 [00:00<00:00, 3.70ba/s]
|
||||
#1: 100%|██████████| 1/1 [00:00<00:00, 2.79ba/s]
|
||||
|
||||
[A
|
||||
|
||||
#2: 100%|██████████| 1/1 [00:00<00:00, 2.74ba/s]
|
||||
#3: 100%|██████████| 1/1 [00:00<00:00, 2.82ba/s]
|
||||
|
||||
|
||||
If we now look at one element of the dataset, we will see that the text has been replaced by the `input_ids` the model needs:
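For example (a sketch; the field names are the tokenizer's standard outputs):

```python
sample = tokenized_datasets["train"][1]
print(sample.keys())             # e.g. dict_keys(['input_ids', 'attention_mask'])
print(sample["input_ids"][:10])  # the first few token ids of this example
```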
|
||||
|
||||
|
||||
|
@ -339,62 +267,6 @@ lm_datasets = tokenized_datasets.map(
|
|||
)
|
||||
```
|
||||
|
||||
#0: 0%| | 0/2 [00:00<?, ?ba/s]
|
||||
[A
|
||||
|
||||
[A[A
|
||||
|
||||
#3: 100%|██████████| 2/2 [00:00<00:00, 6.12ba/s]
|
||||
#1: 100%|██████████| 2/2 [00:00<00:00, 4.89ba/s]
|
||||
#0: 100%|██████████| 2/2 [00:00<00:00, 4.60ba/s]
|
||||
|
||||
#2: 100%|██████████| 2/2 [00:00<00:00, 3.94ba/s]
|
||||
#0: 0%| | 0/10 [00:00<?, ?ba/s]
|
||||
[A
|
||||
|
||||
#0: 10%|█ | 1/10 [00:00<00:03, 2.90ba/s]
|
||||
[A
|
||||
|
||||
#0: 20%|██ | 2/10 [00:00<00:02, 2.76ba/s]
|
||||
[A
|
||||
|
||||
#0: 30%|███ | 3/10 [00:01<00:02, 2.72ba/s]
|
||||
[A
|
||||
|
||||
#0: 40%|████ | 4/10 [00:01<00:02, 2.75ba/s]
|
||||
[A
|
||||
|
||||
#0: 50%|█████ | 5/10 [00:01<00:01, 2.92ba/s]
|
||||
#0: 60%|██████ | 6/10 [00:02<00:01, 3.01ba/s]
|
||||
|
||||
[A[A
|
||||
[A
|
||||
#0: 70%|███████ | 7/10 [00:02<00:01, 2.69ba/s]
|
||||
|
||||
#0: 80%|████████ | 8/10 [00:02<00:00, 2.67ba/s]
|
||||
[A
|
||||
|
||||
#0: 100%|██████████| 10/10 [00:03<00:00, 3.00ba/s]
|
||||
|
||||
|
||||
[A[A
|
||||
#2: 100%|██████████| 10/10 [00:03<00:00, 3.04ba/s]
|
||||
#1: 100%|██████████| 10/10 [00:03<00:00, 2.88ba/s]
|
||||
|
||||
|
||||
#3: 100%|██████████| 10/10 [00:03<00:00, 2.79ba/s]
|
||||
#0: 0%| | 0/1 [00:00<?, ?ba/s]
|
||||
[A
|
||||
|
||||
#0: 100%|██████████| 1/1 [00:00<00:00, 3.41ba/s]
|
||||
#1: 100%|██████████| 1/1 [00:00<00:00, 2.61ba/s]
|
||||
|
||||
|
||||
#3: 100%|██████████| 1/1 [00:00<00:00, 2.69ba/s]
|
||||
|
||||
#2: 100%|██████████| 1/1 [00:00<00:00, 2.55ba/s]
|
||||
|
||||
|
||||
Now we can check that the dataset has indeed changed: the samples now contain contiguous chunks of `block_size` tokens, possibly spanning several of the original texts.
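One quick way to verify this, using the `tokenizer` and `lm_datasets` objects built above:

```python
# Decoding one grouped example shows a block_size-long stretch of text that
# may cross the boundaries of the original wikitext articles.
print(tokenizer.decode(lm_datasets["train"][1]["input_ids"]))
```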
|
||||
|
||||
|
||||
|
@ -417,9 +289,6 @@ from transformers import AutoModelForCausalLM
|
|||
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
|
||||
```
|
||||
|
||||
Downloading: 100%|██████████| 353M/353M [00:21<00:00, 16.0MB/s]
|
||||
|
||||
|
||||
Check the torch version
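The exact check is not shown in this hunk (the corresponding notebook cell imports `importlib.util`); a minimal sketch of such a version check might be:

```python
import importlib.util

# Fail early with a clear message if PyTorch is missing; Trainer requires it.
if importlib.util.find_spec("torch") is None:
    raise ImportError("PyTorch is not installed in this environment")

import torch
print(torch.__version__)
```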
|
||||
|
||||
|
||||
|
@ -460,8 +329,8 @@ training_args = TrainingArguments(
|
|||
trainer = Trainer(
|
||||
model=model,
|
||||
args=training_args,
|
||||
train_dataset=lm_datasets["train"][:1000],
|
||||
eval_dataset=lm_datasets["validation"][:1000],
|
||||
train_dataset=lm_datasets["train"],
|
||||
eval_dataset=lm_datasets["validation"],
|
||||
)
|
||||
```
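For context on this change (the substantive fix in this commit): slicing a `datasets.Dataset` with `lm_datasets["train"][:1000]` returns a plain dict of column lists rather than a `Dataset` object, which appears to be what produced the `KeyError: 1` in the traceback removed further below once the `DataLoader` started indexing individual examples. Passing the full `Dataset` avoids this; if a smaller subset is still wanted, `Dataset.select` keeps the type intact. A sketch:

```python
# Subsample while keeping a Dataset object; a Python slice would instead
# return a dict of lists, which the Trainer's DataLoader cannot index.
small_train = lm_datasets["train"].select(range(1000))
small_eval = lm_datasets["validation"].select(range(100))

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train,
    eval_dataset=small_eval,
)
```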
|
||||
|
||||
|
@ -472,59 +341,7 @@ trainer = Trainer(
|
|||
trainer.train()
|
||||
```
|
||||
|
||||
0%| | 0/3 [00:00<?, ?it/s]
|
||||
|
||||
|
||||
---------------------------------------------------------------------------
|
||||
|
||||
KeyError Traceback (most recent call last)
|
||||
|
||||
/var/folders/2k/x3py0v857kgcwqvvl00xxhxw0000gn/T/ipykernel_12460/4032920361.py in <module>
|
||||
----> 1 trainer.train()
|
||||
|
||||
|
||||
~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
|
||||
1032 self.control = self.callback_handler.on_epoch_begin(self.args, self.state, self.control)
|
||||
1033
|
||||
-> 1034 for step, inputs in enumerate(epoch_iterator):
|
||||
1035
|
||||
1036 # Skip past any already trained steps if resuming training
|
||||
|
||||
|
||||
~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py in __next__(self)
|
||||
519 if self._sampler_iter is None:
|
||||
520 self._reset()
|
||||
--> 521 data = self._next_data()
|
||||
522 self._num_yielded += 1
|
||||
523 if self._dataset_kind == _DatasetKind.Iterable and \
|
||||
|
||||
|
||||
~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py in _next_data(self)
|
||||
559 def _next_data(self):
|
||||
560 index = self._next_index() # may raise StopIteration
|
||||
--> 561 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
|
||||
562 if self._pin_memory:
|
||||
563 data = _utils.pin_memory.pin_memory(data)
|
||||
|
||||
|
||||
~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
|
||||
42 def fetch(self, possibly_batched_index):
|
||||
43 if self.auto_collation:
|
||||
---> 44 data = [self.dataset[idx] for idx in possibly_batched_index]
|
||||
45 else:
|
||||
46 data = self.dataset[possibly_batched_index]
|
||||
|
||||
|
||||
~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
|
||||
42 def fetch(self, possibly_batched_index):
|
||||
43 if self.auto_collation:
|
||||
---> 44 data = [self.dataset[idx] for idx in possibly_batched_index]
|
||||
45 else:
|
||||
46 data = self.dataset[possibly_batched_index]
|
||||
|
||||
|
||||
KeyError: 1
|
||||
|
||||
0%| | 31/7002 [04:16<14:27:52, 7.47s/it]
|
||||
|
||||
Once training is complete, we can evaluate our model and obtain its perplexity on the validation set, as follows:
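The evaluation itself lies outside this hunk; the usual pattern (a sketch using the `trainer` built above) is to exponentiate the evaluation loss:

```python
import math

# Perplexity is the exponential of the average cross-entropy loss
# reported by Trainer.evaluate() on the validation set.
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
```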
|
||||
|
||||
|
@ -597,8 +414,8 @@ data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probabi
|
|||
trainer = Trainer(
|
||||
model=model,
|
||||
args=training_args,
|
||||
train_dataset=lm_datasets["train"][:1000],
|
||||
eval_dataset=lm_datasets["validation"][:100],
|
||||
train_dataset=lm_datasets["train"],
|
||||
eval_dataset=lm_datasets["validation"],
|
||||
data_collator=data_collator,
|
||||
)
|
||||
```
|
||||
|