This commit is contained in:
erenup 2021-09-02 00:12:52 +08:00
parent 0f039659f5
commit d34d1f7a14
3 changed files with 1363 additions and 842 deletions


@@ -41,6 +41,19 @@ from datasets import load_dataset
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')
```
Downloading: 8.33kB [00:00, 1.49MB/s]
Downloading: 5.83kB [00:00, 1.77MB/s]
Downloading and preparing dataset wikitext/wikitext-2-raw-v1 (download: 4.50 MiB, generated: 12.91 MiB, post-processed: Unknown size, total: 17.41 MiB) to /Users/niepig/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20...
Downloading: 100%|██████████| 4.72M/4.72M [00:02<00:00, 1.91MB/s]
Dataset wikitext downloaded and prepared to /Users/niepig/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20. Subsequent calls will reuse this data.
If you run into the following error:
![request Error](images/request_error.png)
@@ -114,11 +127,11 @@ show_random_elements(datasets["train"])
<tbody>
<tr>
<th>0</th>
<td>Plum cakes made with fresh plums came with other migrants from other traditions in which plum cake is prepared using plum as a primary ingredient . In some versions , the plums may become jam @-@ like inside the cake after cooking , or be prepared using plum jam . Plum cake prepared with plums is also a part of Ashkenazi Jewish cuisine , and is referred to as Pflaumenkuchen or Zwetschgenkuchen . Other plum @-@ based cakes are found in French , Italian and Polish cooking . \n</td>
</tr>
<tr>
<th>1</th>
<td>= = = Language = = = \n</td>
</tr>
<tr>
<th>2</th>
@@ -126,27 +139,27 @@ show_random_elements(datasets["train"])
</tr>
<tr>
<th>3</th>
<td></td>
</tr>
<tr>
<th>4</th>
<td>The town 's population not only recovered but grew ; the 1906 census of the Canadian Prairies listed the population at 1 @,@ 178 . A new study commissioned by the Dominion government determined that the cracks in the mountain continued to grow and that the risk of another slide remained . Consequently , parts of Frank closest to the mountain were dismantled or relocated to safer areas . \n</td>
</tr>
<tr>
<th>5</th>
<td>The Litigators is a 2011 legal thriller novel by John Grisham , his 25th fiction novel overall . The Litigators is about a two @-@ partner Chicago law firm attempting to strike it rich in a class action lawsuit over a cholesterol reduction drug by a major pharmaceutical drug company . The protagonist is a Harvard Law School grad big law firm burnout who stumbles upon the boutique and joins it only to find himself litigating against his old law firm in this case . The book is regarded as more humorous than most of Grisham 's prior novels . \n</td>
</tr>
<tr>
<th>6</th>
<td></td>
</tr>
<tr>
<th>7</th>
<td>On December 7 , 2006 , Headquarters Marine Corps released a message stating that 2nd Battalion 9th Marines would be reactivated during 2007 as part of the continuing Global War on Terror . 2nd Battalion 9th Marines was re @-@ activated on July 13 , 2007 and replaced the Anti @-@ Terrorism Battalion ( ATBn ) . In September 2008 , Marines and Sailors from 2 / 9 deployed to Al Anbar Province in support of Operation Iraqi Freedom . They were based in the city of Ramadi and returned in April 2009 without any Marines or Sailors killed in action . July 2010 Marines and Sailors from 2 / 9 deployed to Marjah , Helmand Province , Afghanistan in support of Operation Enduring Freedom . In December 2010 Echo Company from 2 / 9 were attached to 3 / 5 in Sangin , Afghanistan where they earned the notorious nickname of " Green Hats . " They returned February 2011 . They redeployed back to Marjah December 2011 and returned July 2012 . Echo and Weapons companies deployed once more to Afghanistan from January through April 2013 , participating in combat operations out of Camp Leatherneck . On April 1 , 2015 the battalion was deactivated in a ceremony at Camp Lejeune . \n</td>
</tr>
<tr>
<th>8</th>
<td>( i ) = Indoor \n</td>
</tr>
<tr>
<th>9</th>
@@ -188,6 +201,12 @@ from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
```
Downloading: 100%|██████████| 762/762 [00:00<00:00, 358kB/s]
Downloading: 100%|██████████| 1.04M/1.04M [00:04<00:00, 235kB/s]
Downloading: 100%|██████████| 456k/456k [00:02<00:00, 217kB/s]
Downloading: 100%|██████████| 1.36M/1.36M [00:05<00:00, 252kB/s]
We can now call the tokenizer on all of our texts. This is easily done with the map method from the Datasets library. First, we define a function that calls the tokenizer on our texts:
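The function body itself falls outside this hunk; a minimal sketch of what it typically looks like for causal language modeling, assuming the `tokenizer` loaded above:

```python
def tokenize_function(examples):
    # Tokenize the raw text; padding and truncation are deferred until
    # the texts are grouped into fixed-size blocks in a later step.
    return tokenizer(examples["text"])
```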
@@ -204,6 +223,62 @@ def tokenize_function(examples):
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])
```
#0: 100%|██████████| 2/2 [00:00<00:00, 5.56ba/s]
#1: 100%|██████████| 2/2 [00:00<00:00, 5.87ba/s]
#2: 100%|██████████| 2/2 [00:00<00:00, 4.73ba/s]
#3: 100%|██████████| 2/2 [00:00<00:00, 6.42ba/s]
#0: 100%|██████████| 10/10 [00:02<00:00, 3.37ba/s]
#1: 100%|██████████| 10/10 [00:02<00:00, 3.33ba/s]
#2: 100%|██████████| 10/10 [00:02<00:00, 3.44ba/s]
#3: 100%|██████████| 10/10 [00:03<00:00, 3.25ba/s]
#0: 100%|██████████| 1/1 [00:00<00:00, 3.70ba/s]
#1: 100%|██████████| 1/1 [00:00<00:00, 2.79ba/s]
#2: 100%|██████████| 1/1 [00:00<00:00, 2.74ba/s]
#3: 100%|██████████| 1/1 [00:00<00:00, 2.82ba/s]
If we now look at one element of the dataset, we will see that the text has been replaced by the input_ids the model needs:
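For example (the index is arbitrary, and the exact ids depend on the tokenizer):

```python
# Each example is now a dict of token ids rather than raw text.
print(tokenized_datasets["train"][1])
# e.g. {'attention_mask': [...], 'input_ids': [...]}
```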
@@ -264,6 +339,62 @@ lm_datasets = tokenized_datasets.map(
)
```
#0: 100%|██████████| 2/2 [00:00<00:00, 4.60ba/s]
#1: 100%|██████████| 2/2 [00:00<00:00, 4.89ba/s]
#2: 100%|██████████| 2/2 [00:00<00:00, 3.94ba/s]
#3: 100%|██████████| 2/2 [00:00<00:00, 6.12ba/s]
#0: 100%|██████████| 10/10 [00:03<00:00, 3.00ba/s]
#1: 100%|██████████| 10/10 [00:03<00:00, 2.88ba/s]
#2: 100%|██████████| 10/10 [00:03<00:00, 3.04ba/s]
#3: 100%|██████████| 10/10 [00:03<00:00, 2.79ba/s]
#0: 100%|██████████| 1/1 [00:00<00:00, 3.41ba/s]
#1: 100%|██████████| 1/1 [00:00<00:00, 2.61ba/s]
#2: 100%|██████████| 1/1 [00:00<00:00, 2.55ba/s]
#3: 100%|██████████| 1/1 [00:00<00:00, 2.69ba/s]
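The grouping function passed to the map call above is outside this hunk; a minimal sketch of what such a `group_texts` function usually looks like, assuming `block_size` is defined earlier in the notebook:

```python
def group_texts(examples):
    # Concatenate all tokenized texts, then split into chunks of block_size.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the small remainder at the end so every chunk is full-length.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling, the labels are the inputs themselves.
    result["labels"] = result["input_ids"].copy()
    return result
```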
Now we can check that the dataset has changed: the samples now contain chunks of `block_size` consecutive tokens, potentially spanning several of the original texts.
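One quick way to check (the index is arbitrary):

```python
# Decoding one chunk should show text that runs across sentence and
# article boundaries, since blocks are cut from the concatenated corpus.
print(tokenizer.decode(lm_datasets["train"][1]["input_ids"]))
```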
@@ -286,11 +417,7 @@ from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
```
Downloading: 100%|██████████| 353M/353M [00:21<00:00, 16.0MB/s]
Check the torch version:
@@ -299,15 +426,14 @@ model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
```python
import importlib.util
# importlib_metadata is only needed on Python < 3.8; on newer versions
# the stdlib importlib.metadata provides the same API.
# import importlib_metadata

# Check whether torch is installed without actually importing it.
torch_installed = importlib.util.find_spec("torch") is not None
print(torch_installed)
# _torch_version = importlib_metadata.version("torch")
# print(_torch_version)
```
True
And some `TrainingArguments`:
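The actual arguments fall outside this hunk; a minimal sketch of what they might look like (all names and values here are illustrative, not necessarily the ones used in the notebook):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="test-clm",        # where checkpoints are written (illustrative)
    evaluation_strategy="epoch",  # evaluate at the end of each epoch
    learning_rate=2e-5,
    weight_decay=0.01,
)
```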
@@ -346,6 +472,60 @@ trainer = Trainer(
trainer.train()
```
0%| | 0/3 [00:00<?, ?it/s]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/var/folders/2k/x3py0v857kgcwqvvl00xxhxw0000gn/T/ipykernel_12460/4032920361.py in <module>
----> 1 trainer.train()
~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
1032 self.control = self.callback_handler.on_epoch_begin(self.args, self.state, self.control)
1033
-> 1034 for step, inputs in enumerate(epoch_iterator):
1035
1036 # Skip past any already trained steps if resuming training
~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py in __next__(self)
519 if self._sampler_iter is None:
520 self._reset()
--> 521 data = self._next_data()
522 self._num_yielded += 1
523 if self._dataset_kind == _DatasetKind.Iterable and \
~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py in _next_data(self)
559 def _next_data(self):
560 index = self._next_index() # may raise StopIteration
--> 561 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
562 if self._pin_memory:
563 data = _utils.pin_memory.pin_memory(data)
~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
42 def fetch(self, possibly_batched_index):
43 if self.auto_collation:
---> 44 data = [self.dataset[idx] for idx in possibly_batched_index]
45 else:
46 data = self.dataset[possibly_batched_index]
~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
42 def fetch(self, possibly_batched_index):
43 if self.auto_collation:
---> 44 data = [self.dataset[idx] for idx in possibly_batched_index]
45 else:
46 data = self.dataset[possibly_batched_index]
KeyError: 1
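One plausible reading of this traceback, offered as a guess rather than a confirmed diagnosis: a `KeyError: 1` raised while the dataloader indexes the dataset by integer position suggests the Trainer was handed a dict-like object, such as the whole `DatasetDict`, instead of a single split. A hedged sketch of the fix, assuming the variable names above:

```python
# Pass a single split, not the whole DatasetDict; indexing a
# DatasetDict with an integer raises KeyError because its keys
# are split names like "train" and "validation".
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)
```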
Once training is complete, we can evaluate our model and get its perplexity on the validation set as follows:
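A minimal sketch of that evaluation, assuming the `trainer` defined above; perplexity is just the exponential of the evaluation loss:

```python
import math

eval_results = trainer.evaluate()
# Cross-entropy loss -> perplexity
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
```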