```python
from datasets import load_dataset

datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')
```
    Downloading: 8.33kB [00:00, 1.49MB/s]
    Downloading: 5.83kB [00:00, 1.77MB/s]
    Downloading and preparing dataset wikitext/wikitext-2-raw-v1 (download: 4.50 MiB, generated: 12.91 MiB, post-processed: Unknown size, total: 17.41 MiB) to /Users/niepig/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20...
    Downloading: 100%|██████████| 4.72M/4.72M [00:02<00:00, 1.91MB/s]
    Dataset wikitext downloaded and prepared to /Users/niepig/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20. Subsequent calls will reuse this data.
If you encounter the following error:

*(error screenshot not recoverable from this page)*
```python
show_random_elements(datasets["train"])
```

<table border="1" class="dataframe">
  <tbody>
    <tr>
      <th>0</th>
      <td>Plum cakes made with fresh plums came with other migrants from other traditions in which plum cake is prepared using plum as a primary ingredient . In some versions , the plums may become jam @-@ like inside the cake after cooking , or be prepared using plum jam . Plum cake prepared with plums is also a part of Ashkenazi Jewish cuisine , and is referred to as Pflaumenkuchen or Zwetschgenkuchen . Other plum @-@ based cakes are found in French , Italian and Polish cooking . \n</td>
    </tr>
    <tr>
      <th>1</th>
      <td>= = = Language = = = \n</td>
    </tr>
    <tr>
      <th>2</th>
      <td>…</td>
    </tr>
    <tr>
      <th>3</th>
      <td></td>
    </tr>
    <tr>
      <th>4</th>
      <td>The town 's population not only recovered but grew ; the 1906 census of the Canadian Prairies listed the population at 1 @,@ 178 . A new study commissioned by the Dominion government determined that the cracks in the mountain continued to grow and that the risk of another slide remained . Consequently , parts of Frank closest to the mountain were dismantled or relocated to safer areas . \n</td>
    </tr>
    <tr>
      <th>5</th>
      <td>The Litigators is a 2011 legal thriller novel by John Grisham , his 25th fiction novel overall . The Litigators is about a two @-@ partner Chicago law firm attempting to strike it rich in a class action lawsuit over a cholesterol reduction drug by a major pharmaceutical drug company . The protagonist is a Harvard Law School grad big law firm burnout who stumbles upon the boutique and joins it only to find himself litigating against his old law firm in this case . The book is regarded as more humorous than most of Grisham 's prior novels . \n</td>
    </tr>
    <tr>
      <th>6</th>
      <td></td>
    </tr>
    <tr>
      <th>7</th>
      <td>On December 7 , 2006 , Headquarters Marine Corps released a message stating that 2nd Battalion 9th Marines would be reactivated during 2007 as part of the continuing Global War on Terror . 2nd Battalion 9th Marines was re @-@ activated on July 13 , 2007 and replaced the Anti @-@ Terrorism Battalion ( ATBn ) . In September 2008 , Marines and Sailors from 2 / 9 deployed to Al Anbar Province in support of Operation Iraqi Freedom . They were based in the city of Ramadi and returned in April 2009 without any Marines or Sailors killed in action . July 2010 Marines and Sailors from 2 / 9 deployed to Marjah , Helmand Province , Afghanistan in support of Operation Enduring Freedom . In December 2010 Echo Company from 2 / 9 were attached to 3 / 5 in Sangin , Afghanistan where they earned the notorious nickname of " Green Hats . " They returned February 2011 . They redeployed back to Marjah December 2011 and returned July 2012 . Echo and Weapons companies deployed once more to Afghanistan from January through April 2013 , participating in combat operations out of Camp Leatherneck . On April 1 , 2015 the battalion was deactivated in a ceremony at Camp Lejeune . \n</td>
    </tr>
    <tr>
      <th>8</th>
      <td>( i ) = Indoor \n</td>
    </tr>
    <tr>
      <th>9</th>
      <td>…</td>
    </tr>
  </tbody>
</table>
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
```
    Downloading: 100%|██████████| 762/762 [00:00<00:00, 358kB/s]
    Downloading: 100%|██████████| 1.04M/1.04M [00:04<00:00, 235kB/s]
    Downloading: 100%|██████████| 456k/456k [00:02<00:00, 217kB/s]
    Downloading: 100%|██████████| 1.36M/1.36M [00:05<00:00, 252kB/s]
We can now call the tokenizer on all of our texts. This is easily done with the `map` method from the Datasets library. First, we define a function that calls the tokenizer on our texts:
```python
def tokenize_function(examples):
    # Tokenize the raw "text" column of each batch.
    return tokenizer(examples["text"])

tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])
```
    #3: 100%|██████████| 2/2 [00:00<00:00, 6.42ba/s]
    #1: 100%|██████████| 2/2 [00:00<00:00, 5.87ba/s]
    #0: 100%|██████████| 2/2 [00:00<00:00, 5.56ba/s]
    #2: 100%|██████████| 2/2 [00:00<00:00, 4.73ba/s]
    #0: 100%|██████████| 10/10 [00:02<00:00, 3.37ba/s]
    #2: 100%|██████████| 10/10 [00:02<00:00, 3.44ba/s]
    #1: 100%|██████████| 10/10 [00:02<00:00, 3.33ba/s]
    #3: 100%|██████████| 10/10 [00:03<00:00, 3.25ba/s]
    #0: 100%|██████████| 1/1 [00:00<00:00, 3.70ba/s]
    #1: 100%|██████████| 1/1 [00:00<00:00, 2.79ba/s]
    #2: 100%|██████████| 1/1 [00:00<00:00, 2.74ba/s]
    #3: 100%|██████████| 1/1 [00:00<00:00, 2.82ba/s]
If we now look at one element of the dataset, we will see that the text has been replaced by the `input_ids` the model needs:
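For example, a quick check might look like this (a minimal sketch; the index `1` is arbitrary):

```python
# Each sample now carries token ids and an attention mask instead of raw text.
print(tokenized_datasets["train"][1])
```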
```python
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
```
    #3: 100%|██████████| 2/2 [00:00<00:00, 6.12ba/s]
    #1: 100%|██████████| 2/2 [00:00<00:00, 4.89ba/s]
    #0: 100%|██████████| 2/2 [00:00<00:00, 4.60ba/s]
    #2: 100%|██████████| 2/2 [00:00<00:00, 3.94ba/s]
    #0: 100%|██████████| 10/10 [00:03<00:00, 3.00ba/s]
    #2: 100%|██████████| 10/10 [00:03<00:00, 3.04ba/s]
    #1: 100%|██████████| 10/10 [00:03<00:00, 2.88ba/s]
    #3: 100%|██████████| 10/10 [00:03<00:00, 2.79ba/s]
    #0: 100%|██████████| 1/1 [00:00<00:00, 3.41ba/s]
    #1: 100%|██████████| 1/1 [00:00<00:00, 2.61ba/s]
    #3: 100%|██████████| 1/1 [00:00<00:00, 2.69ba/s]
    #2: 100%|██████████| 1/1 [00:00<00:00, 2.55ba/s]
Now we can check how the dataset has changed: each sample now consists of a chunk of `block_size` contiguous tokens, possibly spanning several of the original texts.
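For reference, the `group_texts` function passed to `map` above is not shown in this excerpt; the standard Datasets recipe it follows looks like the sketch below (assuming a `block_size` variable defined earlier in the notebook):

```python
def group_texts(examples):
    # Concatenate every column of the batch into one long list.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the tail so every chunk is exactly block_size tokens long.
    total_length = (total_length // block_size) * block_size
    # Slice the concatenated sequences into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM, labels are a copy of the inputs; the model shifts them internally.
    result["labels"] = result["input_ids"].copy()
    return result
```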
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
```
    Downloading: 100%|██████████| 353M/353M [00:21<00:00, 16.0MB/s]
Check the torch version:
```python
import importlib.util
# import importlib_metadata

# True if a torch distribution can be found on the current path.
a = importlib.util.find_spec("torch") is not None
print(a)

# The version lookup via importlib_metadata is disabled in this run:
# _torch_version = importlib_metadata.version("torch")
# print(_torch_version)
```
    True
And some `TrainingArguments`:
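The full argument list appears in the notebook at this point; a minimal sketch of the usual setup (the output directory `test-clm` and the hyperparameter values here are assumptions following the standard Transformers causal-LM recipe, not values confirmed by this excerpt):

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    "test-clm",                   # hypothetical output directory
    evaluation_strategy="epoch",  # evaluate at the end of every epoch
    learning_rate=2e-5,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)
```

With the `Trainer` built, training is a single call: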
```python
trainer.train()
```
    0%|          | 0/3 [00:00<?, ?it/s]

    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    /var/folders/2k/x3py0v857kgcwqvvl00xxhxw0000gn/T/ipykernel_12460/4032920361.py in <module>
    ----> 1 trainer.train()

    ~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
       1032 self.control = self.callback_handler.on_epoch_begin(self.args, self.state, self.control)
       1033
    -> 1034 for step, inputs in enumerate(epoch_iterator):
       1035
       1036 # Skip past any already trained steps if resuming training

    ~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py in __next__(self)
        519 if self._sampler_iter is None:
        520     self._reset()
    --> 521 data = self._next_data()
        522 self._num_yielded += 1
        523 if self._dataset_kind == _DatasetKind.Iterable and \

    ~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py in _next_data(self)
        559 def _next_data(self):
        560     index = self._next_index()  # may raise StopIteration
    --> 561     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
        562     if self._pin_memory:
        563         data = _utils.pin_memory.pin_memory(data)

    ~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
        42 def fetch(self, possibly_batched_index):
        43     if self.auto_collation:
    ---> 44         data = [self.dataset[idx] for idx in possibly_batched_index]
        45     else:
        46         data = self.dataset[possibly_batched_index]

    ~/Desktop/zhihu/learn-nlp-with-transformers/venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
        42 def fetch(self, possibly_batched_index):
        43     if self.auto_collation:
    ---> 44         data = [self.dataset[idx] for idx in possibly_batched_index]
        45     else:
        46         data = self.dataset[possibly_batched_index]

    KeyError: 1
Once training is complete, we can evaluate our model and obtain its perplexity on the validation set, as follows:
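A minimal sketch of that evaluation step, assuming the `trainer` defined above (perplexity is simply the exponential of the average cross-entropy loss):

```python
import math

eval_results = trainer.evaluate()
# eval_loss is the mean cross-entropy on the validation set.
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
```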