This commit is contained in:
erenup 2021-09-02 00:12:52 +08:00
parent 0f039659f5
commit d34d1f7a14
3 changed files with 1363 additions and 842 deletions


@@ -41,6 +41,19 @@ from datasets import load_dataset
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')
```
Downloading: 8.33kB [00:00, 1.49MB/s]
Downloading: 5.83kB [00:00, 1.77MB/s]
Downloading and preparing dataset wikitext/wikitext-2-raw-v1 (download: 4.50 MiB, generated: 12.91 MiB, post-processed: Unknown size, total: 17.41 MiB) to /Users/niepig/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20...
Downloading: 100%|██████████| 4.72M/4.72M [00:02<00:00, 1.91MB/s]
Dataset wikitext downloaded and prepared to /Users/niepig/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20. Subsequent calls will reuse this data.
If you run into the following error:
![request Error](images/request_error.png)
@@ -114,11 +127,11 @@ show_random_elements(datasets["train"])
<tbody>
<tr>
<th>0</th>
<td>Plum cakes made with fresh plums came with other migrants from other traditions in which plum cake is prepared using plum as a primary ingredient . In some versions , the plums may become jam @-@ like inside the cake after cooking , or be prepared using plum jam . Plum cake prepared with plums is also a part of Ashkenazi Jewish cuisine , and is referred to as Pflaumenkuchen or Zwetschgenkuchen . Other plum @-@ based cakes are found in French , Italian and Polish cooking . \n</td>
</tr>
<tr>
<th>1</th>
<td>= = = Language = = = \n</td>
</tr>
<tr>
<th>2</th>
@@ -126,27 +139,27 @@ show_random_elements(datasets["train"])
</tr>
<tr>
<th>3</th>
<td></td>
</tr>
<tr>
<th>4</th>
<td>The town 's population not only recovered but grew ; the 1906 census of the Canadian Prairies listed the population at 1 @,@ 178 . A new study commissioned by the Dominion government determined that the cracks in the mountain continued to grow and that the risk of another slide remained . Consequently , parts of Frank closest to the mountain were dismantled or relocated to safer areas . \n</td>
</tr>
<tr>
<th>5</th>
<td>The Litigators is a 2011 legal thriller novel by John Grisham , his 25th fiction novel overall . The Litigators is about a two @-@ partner Chicago law firm attempting to strike it rich in a class action lawsuit over a cholesterol reduction drug by a major pharmaceutical drug company . The protagonist is a Harvard Law School grad big law firm burnout who stumbles upon the boutique and joins it only to find himself litigating against his old law firm in this case . The book is regarded as more humorous than most of Grisham 's prior novels . \n</td>
</tr>
<tr>
<th>6</th>
<td></td>
</tr>
<tr>
<th>7</th>
<td>On December 7 , 2006 , Headquarters Marine Corps released a message stating that 2nd Battalion 9th Marines would be reactivated during 2007 as part of the continuing Global War on Terror . 2nd Battalion 9th Marines was re @-@ activated on July 13 , 2007 and replaced the Anti @-@ Terrorism Battalion ( ATBn ) . In September 2008 , Marines and Sailors from 2 / 9 deployed to Al Anbar Province in support of Operation Iraqi Freedom . They were based in the city of Ramadi and returned in April 2009 without any Marines or Sailors killed in action . July 2010 Marines and Sailors from 2 / 9 deployed to Marjah , Helmand Province , Afghanistan in support of Operation Enduring Freedom . In December 2010 Echo Company from 2 / 9 were attached to 3 / 5 in Sangin , Afghanistan where they earned the notorious nickname of " Green Hats . " They returned February 2011 . They redeployed back to Marjah December 2011 and returned July 2012 . Echo and Weapons companies deployed once more to Afghanistan from January through April 2013 , participating in combat operations out of Camp Leatherneck . On April 1 , 2015 the battalion was deactivated in a ceremony at Camp Lejeune . \n</td>
</tr>
<tr>
<th>8</th>
<td>( i ) = Indoor \n</td>
</tr>
<tr>
<th>9</th>
@@ -188,6 +201,12 @@ from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
```
Downloading: 100%|██████████| 762/762 [00:00<00:00, 358kB/s]
Downloading: 100%|██████████| 1.04M/1.04M [00:04<00:00, 235kB/s]
Downloading: 100%|██████████| 456k/456k [00:02<00:00, 217kB/s]
Downloading: 100%|██████████| 1.36M/1.36M [00:05<00:00, 252kB/s]
We can now call the tokenizer on all of our texts. This is easily done with the map method from the Datasets library. First, we define a function that calls the tokenizer on our texts:
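The function body itself falls outside this hunk; a minimal sketch of what it typically looks like for causal language modeling, assuming the `tokenizer` loaded above:

```python
def tokenize_function(examples):
    # Tokenize the raw text; padding and truncation are deferred until
    # the texts are grouped into fixed-size blocks in a later step.
    return tokenizer(examples["text"])
```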
@@ -204,6 +223,62 @@ def tokenize_function(examples):
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])
```
#0: 100%|██████████| 2/2 [00:00<00:00, 5.56ba/s]
#1: 100%|██████████| 2/2 [00:00<00:00, 5.87ba/s]
#2: 100%|██████████| 2/2 [00:00<00:00, 4.73ba/s]
#3: 100%|██████████| 2/2 [00:00<00:00, 6.42ba/s]
#0: 100%|██████████| 10/10 [00:02<00:00, 3.37ba/s]
#1: 100%|██████████| 10/10 [00:02<00:00, 3.33ba/s]
#2: 100%|██████████| 10/10 [00:02<00:00, 3.44ba/s]
#3: 100%|██████████| 10/10 [00:03<00:00, 3.25ba/s]
#0: 100%|██████████| 1/1 [00:00<00:00, 3.70ba/s]
#1: 100%|██████████| 1/1 [00:00<00:00, 2.79ba/s]
#2: 100%|██████████| 1/1 [00:00<00:00, 2.74ba/s]
#3: 100%|██████████| 1/1 [00:00<00:00, 2.82ba/s]
If we now look at one element of the dataset, we will see that the text has been replaced by the input_ids the model needs:
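For example (the index is arbitrary, and the exact ids depend on the tokenizer):

```python
# Each example is now a dict of token ids rather than raw text.
print(tokenized_datasets["train"][1])
# e.g. {'attention_mask': [...], 'input_ids': [...]}
```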
@@ -264,6 +339,62 @@ lm_datasets = tokenized_datasets.map(
)
```
#0: 100%|██████████| 2/2 [00:00<00:00, 4.60ba/s]
#1: 100%|██████████| 2/2 [00:00<00:00, 4.89ba/s]
#2: 100%|██████████| 2/2 [00:00<00:00, 3.94ba/s]
#3: 100%|██████████| 2/2 [00:00<00:00, 6.12ba/s]
#0: 100%|██████████| 10/10 [00:03<00:00, 3.00ba/s]
#1: 100%|██████████| 10/10 [00:03<00:00, 2.88ba/s]
#2: 100%|██████████| 10/10 [00:03<00:00, 3.04ba/s]
#3: 100%|██████████| 10/10 [00:03<00:00, 2.79ba/s]
#0: 100%|██████████| 1/1 [00:00<00:00, 3.41ba/s]
#1: 100%|██████████| 1/1 [00:00<00:00, 2.61ba/s]
#2: 100%|██████████| 1/1 [00:00<00:00, 2.55ba/s]
#3: 100%|██████████| 1/1 [00:00<00:00, 2.69ba/s]
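The grouping function passed to the map call above is outside this hunk; a minimal sketch of what such a `group_texts` function usually looks like, assuming `block_size` is defined earlier in the notebook:

```python
def group_texts(examples):
    # Concatenate all tokenized texts, then split into chunks of block_size.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the small remainder at the end so every chunk is full-length.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling, the labels are the inputs themselves.
    result["labels"] = result["input_ids"].copy()
    return result
```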
Now we can check that the dataset has changed: the samples now contain chunks of `block_size` consecutive tokens, potentially spanning several of the original texts.
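One quick way to check (the index is arbitrary):

```python
# Decoding one chunk should show text that runs across sentence and
# article boundaries, since blocks are cut from the concatenated corpus.
print(tokenizer.decode(lm_datasets["train"][1]["input_ids"]))
```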
@@ -286,11 +417,7 @@ from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
```
Downloading: 100%|██████████| 353M/353M [00:21<00:00, 16.0MB/s]
Check the torch version:
@@ -299,15 +426,14 @@ model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
```python
import importlib.util
# importlib_metadata is only needed on Python < 3.8; on newer versions
# the stdlib importlib.metadata provides the same API.
# import importlib_metadata

# Check whether torch is installed without actually importing it.
torch_installed = importlib.util.find_spec("torch") is not None
print(torch_installed)
# _torch_version = importlib_metadata.version("torch")
# print(_torch_version)
```
True
And some `TrainingArguments`:
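The actual arguments fall outside this hunk; a minimal sketch of what they might look like (all names and values here are illustrative, not necessarily the ones used in the notebook):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="test-clm",        # where checkpoints are written (illustrative)
    evaluation_strategy="epoch",  # evaluate at the end of each epoch
    learning_rate=2e-5,
    weight_decay=0.01,
)
```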
@@ -346,6 +472,60 @@ trainer = Trainer(
trainer.train()
```
0%| | 0/3 [00:00<?, ?it/s]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/var/folders/2k/x3py0v857kgcwqvvl00xxhxw0000gn/T/ipykernel_12460/4032920361.py in <module>
----> 1 trainer.train()
~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
1032 self.control = self.callback_handler.on_epoch_begin(self.args, self.state, self.control)
1033
-> 1034 for step, inputs in enumerate(epoch_iterator):
1035
1036 # Skip past any already trained steps if resuming training
~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py in __next__(self)
519 if self._sampler_iter is None:
520 self._reset()
--> 521 data = self._next_data()
522 self._num_yielded += 1
523 if self._dataset_kind == _DatasetKind.Iterable and \
~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py in _next_data(self)
559 def _next_data(self):
560 index = self._next_index() # may raise StopIteration
--> 561 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
562 if self._pin_memory:
563 data = _utils.pin_memory.pin_memory(data)
~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
42 def fetch(self, possibly_batched_index):
43 if self.auto_collation:
---> 44 data = [self.dataset[idx] for idx in possibly_batched_index]
45 else:
46 data = self.dataset[possibly_batched_index]
~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
42 def fetch(self, possibly_batched_index):
43 if self.auto_collation:
---> 44 data = [self.dataset[idx] for idx in possibly_batched_index]
45 else:
46 data = self.dataset[possibly_batched_index]
KeyError: 1
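One plausible reading of this traceback, offered as a guess rather than a confirmed diagnosis: a `KeyError: 1` raised while the dataloader indexes the dataset by integer position suggests the Trainer was handed a dict-like object, such as the whole `DatasetDict`, instead of a single split. A hedged sketch of the fix, assuming the variable names above:

```python
# Pass a single split, not the whole DatasetDict; indexing a
# DatasetDict with an integer raises KeyError because its keys
# are split names like "train" and "validation".
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)
```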
Once training is complete, we can evaluate our model and get its perplexity on the validation set as follows:
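A minimal sketch of that evaluation, assuming the `trainer` defined above; perplexity is just the exponential of the evaluation loss:

```python
import math

eval_results = trainer.evaluate()
# Cross-entropy loss -> perplexity
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
```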