Referring to this Colab notebook (from Hugging Face, here): if I run
tokenized_datasets["train"][:8], the slice returns some data, but its type is dict rather than Dataset. If I pass that slice to the Trainer, I get a KeyError, which I assume is because I am no longer passing a Dataset.
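For context, a minimal sketch (assuming tokenized_datasets comes from the notebook's preprocessing step) of what slicing actually returns:

# Slicing a datasets.Dataset materializes the rows as a plain dict of column lists,
# which no longer implements the dataset interface the Trainer expects.
subset = tokenized_datasets["train"][:8]
print(type(subset))   # <class 'dict'>
print(subset.keys())  # e.g. dict_keys(['input_ids', 'attention_mask', 'label', ...]),
                      # exact column names depend on the preprocessing

The Trainer call that then raises the KeyError: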
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"][:8],
    eval_dataset=tokenized_datasets["validation"],
    # data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()
***** Running training *****
Num examples = 7
Num Epochs = 3
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 3
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-20-3435b262f1ae> in <module>()
----> 1 trainer.train()

Is there a simple way to pass along only a subset of a dataset's rows for training or validation?
Posted on 2021-12-23 22:20:41
You can try using torch's Subset, for example:
from torch.utils.data import Subset
train_dataset = Subset(tokenized_datasets["train"], list(range(8)))
... # init trainer

This gives you a subset of the dataset while still satisfying the dataset interface the Trainer expects. (I assume this is also how HuggingFace's transformers would do it, if they handled it themselves.)
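Putting it together, a minimal end-to-end sketch (model, tokenizer, and training_args are assumed to be set up as in the notebook, and the tokenized columns are assumed to match what the model expects):

from torch.utils.data import Subset
from transformers import Trainer

# Wrap the first 8 rows in a torch Subset; unlike slicing, this still behaves
# like a map-style dataset, so the Trainer can index into it row by row.
train_subset = Subset(tokenized_datasets["train"], list(range(8)))

trainer = Trainer(
    model,
    training_args,
    train_dataset=train_subset,
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
)
trainer.train()

As a side note, the datasets library's own Dataset.select, e.g. tokenized_datasets["train"].select(range(8)), returns a proper Dataset restricted to those rows, which avoids wrapping in a torch Subset altogether.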
https://stackoverflow.com/questions/70467910