Transformers 4.37 Documentation (Part 3)

By ApacheCN_飞龙 · Published 2024-06-26

Original text: huggingface.co/docs/transformers

Question answering

Original text: huggingface.co/docs/transformers/v4.37.2/en/tasks/question_answering

www.youtube-nocookie.com/embed/ajPx5LwJD-I

Question answering tasks return an answer given a question. If you've ever asked a virtual assistant like Alexa, Siri or Google what the weather is, then you've used a question answering model before. There are two common types of question answering tasks:

  • Extractive: extract the answer from the given context.
  • Abstractive: generate an answer from the context that correctly answers the question.

This guide will show you how to:

  1. Finetune DistilBERT on the SQuAD dataset for extractive question answering.
  2. Use your finetuned model for inference.

The task illustrated in this tutorial is supported by the following model architectures:

ALBERT, BART, BERT, BigBird, BigBird-Pegasus, BLOOM, CamemBERT, CANINE, ConvBERT, Data2VecText, DeBERTa, DeBERTa-v2, DistilBERT, ELECTRA, ERNIE, ErnieM, Falcon, FlauBERT, FNet, Funnel Transformer, OpenAI GPT-2, GPT Neo, GPT NeoX, GPT-J, I-BERT, LayoutLMv2, LayoutLMv3, LED, LiLT, Longformer, LUKE, LXMERT, MarkupLM, mBART, MEGA, Megatron-BERT, MobileBERT, MPNet, MPT, MRA, MT5, MVP, Nezha, Nyströmformer, OPT, QDQBert, Reformer, RemBERT, RoBERTa, RoBERTa-PreLayerNorm, RoCBert, RoFormer, Splinter, SqueezeBERT, T5, UMT5, XLM, XLM-RoBERTa, XLM-RoBERTa-XL, XLNet, X-MOD, YOSO

Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

Load SQuAD dataset

Start by loading a smaller subset of the SQuAD dataset from the 🤗 Datasets library. This'll give you a chance to experiment and make sure everything works before spending more time training on the full dataset.

>>> from datasets import load_dataset

>>> squad = load_dataset("squad", split="train[:5000]")

Split the dataset's train split into a train and test set with the train_test_split method:

>>> squad = squad.train_test_split(test_size=0.2)

Then take a look at an example:

>>> squad["train"][0]
{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858\. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'
}

There are several important fields here:

  • answers: the starting location of the answer token and the answer text (the quick check right after this list shows how answer_start indexes into context).
  • context: background information the model needs to extract the answer from.
  • question: the question the model should answer.
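
As a small illustrative check (not from the original guide; it assumes the squad split loaded above), you can confirm that answer_start is a character offset into context, so slicing the context at that offset recovers the answer text:

>>> example = squad["train"][0]
>>> start = example["answers"]["answer_start"][0]
>>> answer = example["answers"]["text"][0]
>>> example["context"][start : start + len(answer)] == answer
True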

Preprocess

www.youtube-nocookie.com/embed/qgaM0weJHpA

The next step is to load a DistilBERT tokenizer to process the question and context fields:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

There are a few preprocessing steps particular to question answering tasks you should be aware of:

  1. Some examples in a dataset may have a very long context that exceeds the maximum input length of the model. To deal with longer sequences, truncate only the context by setting truncation="only_second".
  2. Next, map the start and end positions of the answer to the original context by setting return_offset_mapping=True.
  3. With the mapping in hand, you can now find the start and end tokens of the answer. Use the sequence_ids method to find which part of the offset corresponds to the question and which corresponds to the context.

Here is how you can create a function to truncate and map the start and end tokens of the answer to the context:

>>> def preprocess_function(examples):
...     questions = [q.strip() for q in examples["question"]]
...     inputs = tokenizer(
...         questions,
...         examples["context"],
...         max_length=384,
...         truncation="only_second",
...         return_offsets_mapping=True,
...         padding="max_length",
...     )

...     offset_mapping = inputs.pop("offset_mapping")
...     answers = examples["answers"]
...     start_positions = []
...     end_positions = []

...     for i, offset in enumerate(offset_mapping):
...         answer = answers[i]
...         start_char = answer["answer_start"][0]
...         end_char = answer["answer_start"][0] + len(answer["text"][0])
...         sequence_ids = inputs.sequence_ids(i)

...         # Find the start and end of the context
...         idx = 0
...         while sequence_ids[idx] != 1:
...             idx += 1
...         context_start = idx
...         while sequence_ids[idx] == 1:
...             idx += 1
...         context_end = idx - 1

...         # If the answer is not fully inside the context, label it (0, 0)
...         if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
...             start_positions.append(0)
...             end_positions.append(0)
...         else:
...             # Otherwise it's the start and end token positions
...             idx = context_start
...             while idx <= context_end and offset[idx][0] <= start_char:
...                 idx += 1
...             start_positions.append(idx - 1)

...             idx = context_end
...             while idx >= context_start and offset[idx][1] >= end_char:
...                 idx -= 1
...             end_positions.append(idx + 1)

...     inputs["start_positions"] = start_positions
...     inputs["end_positions"] = end_positions
...     return inputs

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map function. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once. Remove any columns you don't need:

>>> tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)

Now create a batch of examples using DefaultDataCollator. Unlike other data collators in 🤗 Transformers, the DefaultDataCollator does not apply any additional preprocessing such as padding.

Pytorch

>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator()

TensorFlow

>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator(return_tensors="tf")

Train

Pytorch

If you aren't familiar with finetuning a model with the Trainer, take a look at the basic tutorial here!

You're ready to start training your model now! Load DistilBERT with AutoModelForQuestionAnswering:

>>> from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

>>> model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

At this point, only three steps remain:

  1. Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. You'll push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model).
  2. Pass the training arguments to Trainer along with the model, dataset, tokenizer, and data collator.
  3. Call train() to finetune your model.

>>> training_args = TrainingArguments(
...     output_dir="my_awesome_qa_model",
...     evaluation_strategy="epoch",
...     learning_rate=2e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     num_train_epochs=3,
...     weight_decay=0.01,
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_squad["train"],
...     eval_dataset=tokenized_squad["test"],
...     tokenizer=tokenizer,
...     data_collator=data_collator,
... )

>>> trainer.train()

Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use your model:

>>> trainer.push_to_hub()

TensorFlow

If you aren't familiar with finetuning a model with Keras, take a look at the basic tutorial here!

To finetune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters:

>>> from transformers import create_optimizer

>>> batch_size = 16
>>> num_epochs = 2
>>> total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs
>>> optimizer, schedule = create_optimizer(
...     init_lr=2e-5,
...     num_warmup_steps=0,
...     num_train_steps=total_train_steps,
... )

Then you can load DistilBERT with TFAutoModelForQuestionAnswering:

>>> from transformers import TFAutoModelForQuestionAnswering

>>> model = TFAutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():

>>> tf_train_set = model.prepare_tf_dataset(
...     tokenized_squad["train"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_validation_set = model.prepare_tf_dataset(
...     tokenized_squad["test"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )

Configure the model for training with compile:

>>> import tensorflow as tf

>>> model.compile(optimizer=optimizer)

The last thing to set up before you start training is to provide a way to push your model to the Hub. This can be done by specifying where to push your model and tokenizer in the PushToHubCallback:

>>> from transformers.keras_callbacks import PushToHubCallback

>>> callback = PushToHubCallback(
...     output_dir="my_awesome_qa_model",
...     tokenizer=tokenizer,
... )

Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callback to finetune the model:

>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=[callback])

Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!

For a more in-depth example of how to finetune a model for question answering, take a look at the corresponding PyTorch notebook or TensorFlow notebook.

Evaluate

Evaluation for question answering requires a significant amount of postprocessing. To keep this guide from taking up too much of your time, it skips the evaluation step. The Trainer still calculates the evaluation loss during training, so you're not completely in the dark about your model's performance.

If you have more time and you're interested in how to evaluate your model for question answering, take a look at the Question answering chapter from the 🤗 Hugging Face Course!
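
As a rough sketch of what that evaluation looks like (not part of the original guide): once the start and end logits have been post-processed back into answer strings, the squad metric from 🤗 Evaluate scores them against the gold answers. Using the example shown earlier:

>>> import evaluate

>>> squad_metric = evaluate.load("squad")
>>> predictions = [{"id": "5733be284776f41900661182", "prediction_text": "Saint Bernadette Soubirous"}]
>>> references = [
...     {"id": "5733be284776f41900661182", "answers": {"text": ["Saint Bernadette Soubirous"], "answer_start": [515]}}
... ]
>>> squad_metric.compute(predictions=predictions, references=references)
{'exact_match': 100.0, 'f1': 100.0}

The involved part, mapping each example's predicted start/end tokens back to a text span in its context, is exactly the post-processing the course chapter walks through.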

Inference

Great, now that you've finetuned a model, you can use it for inference!

Come up with a question and some context you'd like the model to predict:

>>> question = "How many programming languages does BLOOM support?"
>>> context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."

The simplest way to try out your finetuned model for inference is to use it in a pipeline(). Instantiate a pipeline for question answering with your model, and pass your text to it:

>>> from transformers import pipeline

>>> question_answerer = pipeline("question-answering", model="my_awesome_qa_model")
>>> question_answerer(question=question, context=context)
{'score': 0.2058267742395401,
 'start': 10,
 'end': 95,
 'answer': '176 billion parameters and can generate text in 46 languages natural languages and 13'}

You can also manually replicate the results of the pipeline if you'd like:

Pytorch

Tokenize the text and return PyTorch tensors:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model")
>>> inputs = tokenizer(question, context, return_tensors="pt")

Pass your inputs to the model and return the logits:

>>> import torch
>>> from transformers import AutoModelForQuestionAnswering

>>> model = AutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model")
>>> with torch.no_grad():
...     outputs = model(**inputs)

Get the highest probability from the model output for the start and end positions:

>>> answer_start_index = outputs.start_logits.argmax()
>>> answer_end_index = outputs.end_logits.argmax()

Decode the predicted tokens to get the answer:

>>> predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
>>> tokenizer.decode(predict_answer_tokens)
'176 billion parameters and can generate text in 46 languages natural languages and 13'

TensorFlow

Tokenize the text and return TensorFlow tensors:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model")
>>> inputs = tokenizer(question, context, return_tensors="tf")

Pass your inputs to the model and return the logits:

>>> from transformers import TFAutoModelForQuestionAnswering

>>> model = TFAutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model")
>>> outputs = model(**inputs)

Get the highest probability from the model output for the start and end positions:

>>> answer_start_index = int(tf.math.argmax(outputs.start_logits, axis=-1)[0])
>>> answer_end_index = int(tf.math.argmax(outputs.end_logits, axis=-1)[0])

Decode the predicted tokens to get the answer:

>>> predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
>>> tokenizer.decode(predict_answer_tokens)
'176 billion parameters and can generate text in 46 languages natural languages and 13'

Causal language modeling

Original text: huggingface.co/docs/transformers/v4.37.2/en/tasks/language_modeling

There are two types of language modeling, causal and masked. This guide covers causal language modeling. Causal language models are frequently used for text generation. You can use these models for creative applications like choosing your own text adventure, or for an intelligent coding assistant like Copilot or CodeParrot.

www.youtube-nocookie.com/embed/Vpjb1lu0MDk

Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model.
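
As a minimal sketch of what this objective looks like in code (an illustration only, assuming PyTorch and the distilgpt2 checkpoint used later in this guide): the labels for a causal language model are simply the input ids, and the model shifts them internally so that every position is scored on the token that follows it.

>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
>>> model = AutoModelForCausalLM.from_pretrained("distilgpt2")
>>> inputs = tokenizer("Somatic hypermutation allows the immune system to", return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs, labels=inputs["input_ids"])
>>> outputs.loss          # average cross-entropy of predicting each next token; exact value depends on the checkpoint
>>> outputs.logits.shape  # (batch_size, sequence_length, vocab_size)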

This guide will show you how to:

  1. Finetune DistilGPT2 on the r/askscience subset of the ELI5 dataset.
  2. Use your finetuned model for inference.

You can finetune other architectures for causal language modeling following the same steps in this guide. Choose one of the following architectures:

BART, BERT, Bert Generation, BigBird, BigBird-Pegasus, BioGpt, Blenderbot, BlenderbotSmall, BLOOM, CamemBERT, CodeLlama, CodeGen, CPM-Ant, CTRL, Data2VecText, ELECTRA, ERNIE, Falcon, Fuyu, GIT, GPT-Sw3, OpenAI GPT-2, GPTBigCode, GPT Neo, GPT NeoX, GPT NeoX Japanese, GPT-J, LLaMA, Marian, mBART, MEGA, Megatron-BERT, Mistral, Mixtral, MPT, MusicGen, MVP, OpenLlama, OpenAI GPT, OPT, Pegasus, Persimmon, Phi, PLBart, ProphetNet, QDQBert, Qwen2, Reformer, RemBERT, RoBERTa, RoBERTa-PreLayerNorm, RoCBert, RoFormer, RWKV, Speech2Text2, Transformer-XL, TrOCR, Whisper, XGLM, XLM, XLM-ProphetNet, XLM-RoBERTa, XLM-RoBERTa-XL, XLNet, X-MOD

Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

Load ELI5 dataset

Start by loading a smaller subset of the r/askscience subset of the ELI5 dataset from the 🤗 Datasets library. This'll give you a chance to experiment and make sure everything works before spending more time training on the full dataset.

>>> from datasets import load_dataset

>>> eli5 = load_dataset("eli5", split="train_asks[:5000]")

Split the dataset's train_asks split into a train and test set with the train_test_split method:

>>> eli5 = eli5.train_test_split(test_size=0.2)

Then take a look at an example:

>>> eli5["train"][0]
{'answers': {'a_id': ['c3d1aib', 'c3d4lya'],
  'score': [6, 3],
  'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
   "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]},
 'answers_urls': {'url': []},
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
 'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']},
 'subreddit': 'askscience',
 'title': 'Few questions about this space walk photograph.',
 'title_urls': {'url': []}}

While this may look like a lot, you're only really interested in the text field. What's cool about language modeling tasks is you don't need labels (also known as an unsupervised task) because the next word is the label.

Preprocess

www.youtube-nocookie.com/embed/ma1TrR7gE7I

The next step is to load a DistilGPT2 tokenizer to process the text subfield:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

You'll notice from the example above that the text field is actually nested inside answers. This means you'll need to extract the text subfield from its nested structure with the flatten method:

>>> eli5 = eli5.flatten()
>>> eli5["train"][0]
{'answers.a_id': ['c3d1aib', 'c3d4lya'],
 'answers.score': [6, 3],
 'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
  "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"],
 'answers_urls.url': [],
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
 'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'],
 'subreddit': 'askscience',
 'title': 'Few questions about this space walk photograph.',
 'title_urls.url': []}

Each subfield is now a separate column, as indicated by the answers prefix, and the text field is a list now. Instead of tokenizing each sentence separately, convert the list to a string so you can jointly tokenize them.

Here is a first preprocessing function to join the list of strings for each example and tokenize the result:

>>> def preprocess_function(examples):
...     return tokenizer([" ".join(x) for x in examples["answers.text"]])

To apply this preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once, and by increasing the number of processes with num_proc. Remove any columns you don't need:

>>> tokenized_eli5 = eli5.map(
...     preprocess_function,
...     batched=True,
...     num_proc=4,
...     remove_columns=eli5["train"].column_names,
... )

This dataset contains the token sequences, but some of these are longer than the maximum input length for the model.

You can now use a second preprocessing function to:

  • concatenate all the sequences
  • split the concatenated sequences into shorter chunks defined by block_size, which should be both shorter than the maximum input length and short enough for your GPU RAM.

>>> block_size = 128

>>> def group_texts(examples):
...     # Concatenate all texts.
...     concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
...     total_length = len(concatenated_examples[list(examples.keys())[0]])
...     # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
...     # customize this part to your needs.
...     if total_length >= block_size:
...         total_length = (total_length // block_size) * block_size
...     # Split by chunks of block_size.
...     result = {
...         k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
...         for k, t in concatenated_examples.items()
...     }
...     result["labels"] = result["input_ids"].copy()
...     return result

Apply the group_texts function over the entire dataset:

>>> lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)

Now create a batch of examples using DataCollatorForLanguageModeling. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

Pytorch

Use the end-of-sequence token as the padding token and set mlm=False. This will use the inputs as labels shifted to the right by one element:

>>> from transformers import DataCollatorForLanguageModeling

>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

TensorFlow

Use the end-of-sequence token as the padding token and set mlm=False. This will use the inputs as labels shifted to the right by one element:

>>> from transformers import DataCollatorForLanguageModeling

>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")

Train

Pytorch

If you aren't familiar with finetuning a model with the Trainer, take a look at the basic tutorial!

You're ready to start training your model now! Load DistilGPT2 with AutoModelForCausalLM:

>>> from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

>>> model = AutoModelForCausalLM.from_pretrained("distilgpt2")

At this point, only three steps remain:

  1. Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. You'll push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model).
  2. Pass the training arguments to Trainer along with the model, datasets, and data collator.
  3. Call train() to finetune your model.

>>> training_args = TrainingArguments(
...     output_dir="my_awesome_eli5_clm-model",
...     evaluation_strategy="epoch",
...     learning_rate=2e-5,
...     weight_decay=0.01,
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=lm_dataset["train"],
...     eval_dataset=lm_dataset["test"],
...     data_collator=data_collator,
... )

>>> trainer.train()

Once training is completed, use the evaluate() method to evaluate your model and get its perplexity:

>>> import math

>>> eval_results = trainer.evaluate()
>>> print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
Perplexity: 49.61

Then share your model to the Hub with the push_to_hub() method so everyone can use your model:

>>> trainer.push_to_hub()

TensorFlow

If you aren't familiar with finetuning a model with Keras, take a look at the basic tutorial!

To finetune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters:

>>> from transformers import create_optimizer, AdamWeightDecay

>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

Then you can load DistilGPT2 with TFAutoModelForCausalLM:

>>> from transformers import TFAutoModelForCausalLM

>>> model = TFAutoModelForCausalLM.from_pretrained("distilgpt2")

Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():

>>> tf_train_set = model.prepare_tf_dataset(
...     lm_dataset["train"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_test_set = model.prepare_tf_dataset(
...     lm_dataset["test"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )

Configure the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:

>>> import tensorflow as tf

>>> model.compile(optimizer=optimizer)  # No loss argument!

The last thing to set up before you start training is a way to push your model to the Hub. This can be done by specifying where to push your model and tokenizer in the PushToHubCallback:

>>> from transformers.keras_callbacks import PushToHubCallback

>>> callback = PushToHubCallback(
...     output_dir="my_awesome_eli5_clm-model",
...     tokenizer=tokenizer,
... )

Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callback to finetune the model:

>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback])

Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!

For a more in-depth example of how to finetune a model for causal language modeling, take a look at the corresponding PyTorch notebook or TensorFlow notebook.

Inference

Great, now that you've finetuned a model, you can use it for inference!

Come up with a prompt you'd like to generate text from:

>>> prompt = "Somatic hypermutation allows the immune system to"

The simplest way to try out your finetuned model for inference is to use it in a pipeline(). Instantiate a pipeline for text generation with your model, and pass your text to it:

>>> from transformers import pipeline

>>> generator = pipeline("text-generation", model="my_awesome_eli5_clm-model")
>>> generator(prompt)
[{'generated_text': "Somatic hypermutation allows the immune system to be able to effectively reverse the damage caused by an infection.\n\n\nThe damage caused by an infection is caused by the immune system's ability to perform its own self-correcting tasks."}]

Pytorch

Tokenize the text and return the input_ids as PyTorch tensors:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model")
>>> inputs = tokenizer(prompt, return_tensors="pt").input_ids

Use the generate() method to generate text. For more details about the different text generation strategies and parameters for controlling generation, check out the Text generation strategies page.

>>> from transformers import AutoModelForCausalLM

>>> model = AutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model")
>>> outputs = model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)

Decode the generated token ids back into text:

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
["Somatic hypermutation allows the immune system to react to drugs with the ability to adapt to a different environmental situation. In other words, a system of 'hypermutation' can help the immune system to adapt to a different environmental situation or in some cases even a single life. In contrast, researchers at the University of Massachusetts-Boston have found that 'hypermutation' is much stronger in mice than in humans but can be found in humans, and that it's not completely unknown to the immune system. A study on how the immune system"]

TensorFlow

Tokenize the text and return the input_ids as TensorFlow tensors:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model")
>>> inputs = tokenizer(prompt, return_tensors="tf").input_ids

Use the generate() method to generate text. For more details about the different text generation strategies and parameters for controlling generation, check out the Text generation strategies page.

>>> from transformers import TFAutoModelForCausalLM

>>> model = TFAutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model")
>>> outputs = model.generate(input_ids=inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)

Decode the generated token ids back into text:

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Somatic hypermutation allows the immune system to detect the presence of other viruses as they become more prevalent. Therefore, researchers have identified a high proportion of human viruses. The proportion of virus-associated viruses in our study increases with age. Therefore, we propose a simple algorithm to detect the presence of these new viruses in our samples as a sign of improved immunity. A first study based on this algorithm, which will be published in Science on Friday, aims to show that this finding could translate into the development of a better vaccine that is more effective for']

Masked language modeling

Original text: huggingface.co/docs/transformers/v4.37.2/en/tasks/masked_language_modeling

www.youtube-nocookie.com/embed/mqElG5QJWUg

Masked language modeling predicts a masked token in a sequence, and the model can attend to tokens bidirectionally. This means the model has full access to the tokens on the left and right. Masked language modeling is great for tasks that require a good contextual understanding of an entire sequence. BERT is an example of a masked language model.

This guide will show you how to:

  1. Finetune DistilRoBERTa on the r/askscience subset of the ELI5 dataset.
  2. Use your finetuned model for inference.

You can finetune other architectures for masked language modeling following the same steps in this guide. Choose one of the following architectures:

ALBERT, BART, BERT, BigBird, CamemBERT, ConvBERT, Data2VecText, DeBERTa, DeBERTa-v2, DistilBERT, ELECTRA, ERNIE, ESM, FlauBERT, FNet, Funnel Transformer, I-BERT, LayoutLM, Longformer, LUKE, mBART, MEGA, Megatron-BERT, MobileBERT, MPNet, MRA, MVP, Nezha, Nyströmformer, Perceiver, QDQBert, Reformer, RemBERT, RoBERTa, RoBERTa-PreLayerNorm, RoCBert, RoFormer, SqueezeBERT, TAPAS, Wav2Vec2, XLM, XLM-RoBERTa, XLM-RoBERTa-XL, X-MOD, YOSO

Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

Load ELI5 dataset

Start by loading a smaller subset of the r/askscience subset of the ELI5 dataset from the 🤗 Datasets library. This'll give you a chance to experiment and make sure everything works before spending more time training on the full dataset.

>>> from datasets import load_dataset

>>> eli5 = load_dataset("eli5", split="train_asks[:5000]")

Split the dataset's train_asks split into a train and test set with the train_test_split method:

>>> eli5 = eli5.train_test_split(test_size=0.2)

Then take a look at an example:

>>> eli5["train"][0]
{'answers': {'a_id': ['c3d1aib', 'c3d4lya'],
  'score': [6, 3],
  'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
   "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]},
 'answers_urls': {'url': []},
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
 'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']},
 'subreddit': 'askscience',
 'title': 'Few questions about this space walk photograph.',
 'title_urls': {'url': []}}

While this may look like a lot, you're only really interested in the text field. What's cool about language modeling tasks is you don't need labels (also known as an unsupervised task) because the next word is the label.

Preprocess

www.youtube-nocookie.com/embed/8PmhEIXhBvI

For masked language modeling, the next step is to load a DistilRoBERTa tokenizer to process the text subfield:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

You'll notice from the example above that the text field is actually nested inside answers. This means you'll need to extract the text subfield from its nested structure with the flatten method:

>>> eli5 = eli5.flatten()
>>> eli5["train"][0]
{'answers.a_id': ['c3d1aib', 'c3d4lya'],
 'answers.score': [6, 3],
 'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
  "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"],
 'answers_urls.url': [],
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
 'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'],
 'subreddit': 'askscience',
 'title': 'Few questions about this space walk photograph.',
 'title_urls.url': []}

Each subfield is now a separate column, as indicated by the answers prefix, and the text field is a list now. Instead of tokenizing each sentence separately, convert the list to a string so you can jointly tokenize them.

Here is a first preprocessing function to join the list of strings for each example and tokenize the result:

>>> def preprocess_function(examples):
...     return tokenizer([" ".join(x) for x in examples["answers.text"]])

To apply this preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once, and by increasing the number of processes with num_proc. Remove any columns you don't need:

>>> tokenized_eli5 = eli5.map(
...     preprocess_function,
...     batched=True,
...     num_proc=4,
...     remove_columns=eli5["train"].column_names,
... )

This dataset contains the token sequences, but some of these are longer than the maximum input length for the model.

You can now use a second preprocessing function to:

  • concatenate all the sequences
  • split the concatenated sequences into shorter chunks defined by block_size, which should be both shorter than the maximum input length and short enough for your GPU RAM.

>>> block_size = 128

>>> def group_texts(examples):
...     # Concatenate all texts.
...     concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
...     total_length = len(concatenated_examples[list(examples.keys())[0]])
...     # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
...     # customize this part to your needs.
...     if total_length >= block_size:
...         total_length = (total_length // block_size) * block_size
...     # Split by chunks of block_size.
...     result = {
...         k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
...         for k, t in concatenated_examples.items()
...     }
...     return result

Apply the group_texts function over the entire dataset:

>>> lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)

Now create a batch of examples using DataCollatorForLanguageModeling. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

Pytorch

Use the end-of-sequence token as the padding token and specify mlm_probability to randomly mask tokens each time you iterate over the data:

>>> from transformers import DataCollatorForLanguageModeling

>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

TensorFlow

Use the end-of-sequence token as the padding token and specify mlm_probability to randomly mask tokens each time you iterate over the data:

>>> from transformers import DataCollatorForLanguageModeling

>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15, return_tensors="tf")

Train

Pytorch

If you aren't familiar with finetuning a model with the Trainer, take a look at the basic tutorial here!

You're ready to start training your model now! Load DistilRoBERTa with AutoModelForMaskedLM:

>>> from transformers import AutoModelForMaskedLM

>>> model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

At this point, only three steps remain:

  1. Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. You'll push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model).
  2. Pass the training arguments to Trainer along with the model, datasets, and data collator.
  3. Call train() to finetune your model.

>>> training_args = TrainingArguments(
...     output_dir="my_awesome_eli5_mlm_model",
...     evaluation_strategy="epoch",
...     learning_rate=2e-5,
...     num_train_epochs=3,
...     weight_decay=0.01,
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=lm_dataset["train"],
...     eval_dataset=lm_dataset["test"],
...     data_collator=data_collator,
... )

>>> trainer.train()

Once training is completed, use the evaluate() method to evaluate your model and get its perplexity:

>>> import math

>>> eval_results = trainer.evaluate()
>>> print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
Perplexity: 8.76

Then share your model to the Hub with the push_to_hub() method so everyone can use your model:

>>> trainer.push_to_hub()

TensorFlow

If you aren't familiar with finetuning a model with Keras, take a look at the basic tutorial here!

To finetune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters:

>>> from transformers import create_optimizer, AdamWeightDecay

>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

Then you can load DistilRoBERTa with TFAutoModelForMaskedLM:

>>> from transformers import TFAutoModelForMaskedLM

>>> model = TFAutoModelForMaskedLM.from_pretrained("distilroberta-base")

Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():

>>> tf_train_set = model.prepare_tf_dataset(
...     lm_dataset["train"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_test_set = model.prepare_tf_dataset(
...     lm_dataset["test"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )

Configure the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:

>>> import tensorflow as tf

>>> model.compile(optimizer=optimizer)  # No loss argument!

The last thing to set up before you start training is a way to push your model to the Hub. This can be done by specifying where to push your model and tokenizer in the PushToHubCallback:

>>> from transformers.keras_callbacks import PushToHubCallback

>>> callback = PushToHubCallback(
...     output_dir="my_awesome_eli5_mlm_model",
...     tokenizer=tokenizer,
... )

Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callback to finetune the model:

>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback])

Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!

For a more in-depth example of how to finetune a model for masked language modeling, take a look at the corresponding PyTorch notebook or TensorFlow notebook.

Inference

Great, now that you've finetuned a model, you can use it for inference!

Come up with some text you'd like the model to fill in the blank with, and use the special <mask> token to indicate the blank:

>>> text = "The Milky Way is a <mask> galaxy."

The simplest way to try out your finetuned model for inference is to use it in a pipeline(). Instantiate a pipeline for fill-mask with your model, and pass your text to it. If you like, you can use the top_k parameter to specify how many predictions to return:

>>> from transformers import pipeline

>>> mask_filler = pipeline("fill-mask", "stevhliu/my_awesome_eli5_mlm_model")
>>> mask_filler(text, top_k=3)
[{'score': 0.5150994658470154,
  'token': 21300,
  'token_str': ' spiral',
  'sequence': 'The Milky Way is a spiral galaxy.'},
 {'score': 0.07087188959121704,
  'token': 2232,
  'token_str': ' massive',
  'sequence': 'The Milky Way is a massive galaxy.'},
 {'score': 0.06434620916843414,
  'token': 650,
  'token_str': ' small',
  'sequence': 'The Milky Way is a small galaxy.'}]

Pytorch

Tokenize the text and return the input_ids as PyTorch tensors. You'll also need to specify the position of the <mask> token:

>>> import torch
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_eli5_mlm_model")
>>> inputs = tokenizer(text, return_tensors="pt")
>>> mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

Pass your inputs to the model and return the logits of the masked token:

>>> from transformers import AutoModelForMaskedLM

>>> model = AutoModelForMaskedLM.from_pretrained("stevhliu/my_awesome_eli5_mlm_model")
>>> logits = model(**inputs).logits
>>> mask_token_logits = logits[0, mask_token_index, :]

Then return the three masked tokens with the highest probability and print them out:

>>> top_3_tokens = torch.topk(mask_token_logits, 3, dim=1).indices[0].tolist()

>>> for token in top_3_tokens:
...     print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))
The Milky Way is a spiral galaxy.
The Milky Way is a massive galaxy.
The Milky Way is a small galaxy.

TensorFlow

Tokenize the text and return the input_ids as TensorFlow tensors. You'll also need to specify the position of the <mask> token:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_eli5_mlm_model")
>>> inputs = tokenizer(text, return_tensors="tf")
>>> mask_token_index = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]

Pass your inputs to the model and return the logits of the masked token:

>>> from transformers import TFAutoModelForMaskedLM

>>> model = TFAutoModelForMaskedLM.from_pretrained("stevhliu/my_awesome_eli5_mlm_model")
>>> logits = model(**inputs).logits
>>> mask_token_logits = logits[0, mask_token_index, :]

Then return the three masked tokens with the highest probability and print them out:

>>> top_3_tokens = tf.math.top_k(mask_token_logits, 3).indices.numpy()

>>> for token in top_3_tokens:
...     print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))
The Milky Way is a spiral galaxy.
The Milky Way is a massive galaxy.
The Milky Way is a small galaxy.

Translation

Original text: huggingface.co/docs/transformers/v4.37.2/en/tasks/translation

www.youtube-nocookie.com/embed/1JvfrvZgi6c

Translation converts a sequence of text from one language to another. It is one of several tasks you can formulate as a sequence-to-sequence problem, a powerful framework for returning some output from an input, like translation or summarization. Translation systems are commonly used for translation between different language texts, but they can also be used for speech or some combination in between, like text-to-speech or speech-to-text.

This guide will show you how to:

  1. Finetune T5 on the English-French subset of the OPUS Books dataset to translate English text to French.
  2. Use your finetuned model for inference.

The task illustrated in this tutorial is supported by the following model architectures:

BART, BigBird-Pegasus, Blenderbot, BlenderbotSmall, Encoder decoder, FairSeq Machine-Translation, GPTSAN-japanese, LED, LongT5, M2M100, Marian, mBART, MT5, MVP, NLLB, NLLB-MOE, Pegasus, PEGASUS-X, PLBart, ProphetNet, SeamlessM4T, SeamlessM4Tv2, SwitchTransformers, T5, UMT5, XLM-ProphetNet

Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate sacrebleu

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

Load OPUS Books dataset

Start by loading the English-French subset of the OPUS Books dataset from the 🤗 Datasets library:

>>> from datasets import load_dataset

>>> books = load_dataset("opus_books", "en-fr")

Split the dataset into a train and test set with the train_test_split method:

>>> books = books["train"].train_test_split(test_size=0.2)

Then take a look at an example:

>>> books["train"][0]
{'id': '90560',
 'translation': {'en': 'But this lofty plateau measured only a few fathoms, and soon we reentered Our Element.',
  'fr': 'Mais ce plateau élevé ne mesurait que quelques toises, et bientôt nous fûmes rentrés dans notre élément.'}}

translation: an English and French translation of the text.

Preprocess

www.youtube-nocookie.com/embed/XAR8jnZZuUs

The next step is to load a T5 tokenizer to process the English-French language pairs:

>>> from transformers import AutoTokenizer

>>> checkpoint = "t5-small"
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)

The preprocessing function you want to create needs to:

  1. Prefix the input with a prompt so T5 knows this is a translation task. Some models capable of multiple NLP tasks require prompting for specific tasks.
  2. Tokenize the input (English) and target (French) separately, because you can't tokenize French text with a tokenizer pretrained on an English vocabulary.
  3. Truncate sequences to be no longer than the maximum length set by the max_length parameter.

>>> source_lang = "en"
>>> target_lang = "fr"
>>> prefix = "translate English to French: "

>>> def preprocess_function(examples):
...     inputs = [prefix + example[source_lang] for example in examples["translation"]]
...     targets = [example[target_lang] for example in examples["translation"]]
...     model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
...     return model_inputs

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once:

>>> tokenized_books = books.map(preprocess_function, batched=True)

Now create a batch of examples using DataCollatorForSeq2Seq. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

Pytorch

>>> from transformers import DataCollatorForSeq2Seq

>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)

TensorFlow

>>> from transformers import DataCollatorForSeq2Seq

>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint, return_tensors="tf")

Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the SacreBLEU metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):

>>> import evaluate

>>> metric = evaluate.load("sacrebleu")

Then create a function that passes your predictions and labels to compute to calculate the SacreBLEU score:

>>> import numpy as np

>>> def postprocess_text(preds, labels):
...     preds = [pred.strip() for pred in preds]
...     labels = [[label.strip()] for label in labels]

...     return preds, labels

>>> def compute_metrics(eval_preds):
...     preds, labels = eval_preds
...     if isinstance(preds, tuple):
...         preds = preds[0]
...     decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

...     labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
...     decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

...     decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

...     result = metric.compute(predictions=decoded_preds, references=decoded_labels)
...     result = {"bleu": result["score"]}

...     prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
...     result["gen_len"] = np.mean(prediction_lens)
...     result = {k: round(v, 4) for k, v in result.items()}
...     return result

Your compute_metrics function is ready to go now, and you'll return to it when you set up your training.
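
If you'd like a quick, informal sanity check first (not part of the original guide; it assumes the tokenizer and compute_metrics defined above), you can feed the function a toy prediction/label pair. Identical predictions and labels should produce a BLEU score of 100:

>>> import numpy as np

>>> toy_ids = tokenizer(["Legumes share resources with nitrogen-fixing bacteria."], return_tensors="np").input_ids
>>> compute_metrics((toy_ids, toy_ids))  # expect something like {'bleu': 100.0, 'gen_len': ...}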

Train

Pytorch

If you aren't familiar with finetuning a model with the Trainer, take a look at the basic tutorial here!

You're ready to start training your model now! Load T5 with AutoModelForSeq2SeqLM:

>>> from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

At this point, only three steps remain:

  1. Define your training hyperparameters in Seq2SeqTrainingArguments. The only required parameter is output_dir, which specifies where to save your model. You'll push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the Trainer will evaluate the SacreBLEU metric and save the training checkpoint.
  2. Pass the training arguments to Seq2SeqTrainer along with the model, dataset, tokenizer, data collator, and compute_metrics function.
  3. Call train() to finetune your model.

>>> training_args = Seq2SeqTrainingArguments(
...     output_dir="my_awesome_opus_books_model",
...     evaluation_strategy="epoch",
...     learning_rate=2e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     weight_decay=0.01,
...     save_total_limit=3,
...     num_train_epochs=2,
...     predict_with_generate=True,
...     fp16=True,
...     push_to_hub=True,
... )

>>> trainer = Seq2SeqTrainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_books["train"],
...     eval_dataset=tokenized_books["test"],
...     tokenizer=tokenizer,
...     data_collator=data_collator,
...     compute_metrics=compute_metrics,
... )

>>> trainer.train()

Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use your model:

>>> trainer.push_to_hub()

TensorFlow

If you aren't familiar with finetuning a model with Keras, take a look at the basic tutorial here!

To finetune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters:

>>> from transformers import AdamWeightDecay

>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

Then you can load T5 with TFAutoModelForSeq2SeqLM:

>>> from transformers import TFAutoModelForSeq2SeqLM

>>> model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)

Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():

>>> tf_train_set = model.prepare_tf_dataset(
...     tokenized_books["train"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_test_set = model.prepare_tf_dataset(
...     tokenized_books["test"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )

Configure the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:

>>> import tensorflow as tf

>>> model.compile(optimizer=optimizer)  # No loss argument!

The last two things to set up before you start training are to compute the SacreBLEU metric from the predictions, and to provide a way to push your model to the Hub. Both are done by using Keras callbacks.

Pass your compute_metrics function to KerasMetricCallback:

>>> from transformers.keras_callbacks import KerasMetricCallback

>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_test_set)

Specify where to push your model and tokenizer in the PushToHubCallback:

>>> from transformers.keras_callbacks import PushToHubCallback

>>> push_to_hub_callback = PushToHubCallback(
...     output_dir="my_awesome_opus_books_model",
...     tokenizer=tokenizer,
... )

Then bundle your callbacks together:

>>> callbacks = [metric_callback, push_to_hub_callback]

Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callbacks to finetune the model:

>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=callbacks)

Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!

For a more in-depth example of how to finetune a model for translation, take a look at the corresponding PyTorch notebook or TensorFlow notebook.

Inference

Great, now that you've finetuned a model, you can use it for inference!

Come up with some text you'd like to translate to another language. For T5, you need to prefix your input depending on the task you're working on. For translation from English to French, prefix your input as shown below:

>>> text = "translate English to French: Legumes share resources with nitrogen-fixing bacteria."

The simplest way to try out your finetuned model for inference is to use it in a pipeline(). Instantiate a pipeline for translation with your model, and pass your text to it:

>>> from transformers import pipeline

>>> translator = pipeline("translation", model="my_awesome_opus_books_model")
>>> translator(text)
[{'translation_text': 'Legumes partagent des ressources avec des bactéries azotantes.'}]

You can also manually replicate the results of the pipeline if you'd like:

Pytorch

Tokenize the text and return the input_ids as PyTorch tensors:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_opus_books_model")
>>> inputs = tokenizer(text, return_tensors="pt").input_ids

Use the generate() method to create the translation. For more details about the different text generation strategies and parameters for controlling generation, check out the Text Generation API.

>>> from transformers import AutoModelForSeq2SeqLM

>>> model = AutoModelForSeq2SeqLM.from_pretrained("my_awesome_opus_books_model")
>>> outputs = model.generate(inputs, max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)

Decode the generated token ids back into text:

>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
'Les lignées partagent des ressources avec des bactéries enfixant l'azote.'

TensorFlow

Tokenize the text and return the input_ids as TensorFlow tensors:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_opus_books_model")
>>> inputs = tokenizer(text, return_tensors="tf").input_ids

Use the generate() method to create the translation. For more details about the different text generation strategies and parameters for controlling generation, check out the Text Generation API.

>>> from transformers import TFAutoModelForSeq2SeqLM

>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("my_awesome_opus_books_model")
>>> outputs = model.generate(inputs, max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)

Decode the generated token ids back into text:

>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
'Les lugumes partagent les ressources avec des bactéries fixatrices d'azote.'

Summarization

Original text: huggingface.co/docs/transformers/v4.37.2/en/tasks/summarization

www.youtube-nocookie.com/embed/yHnr5Dk2zCI

Summarization creates a shorter version of a document or an article that captures all the important information. Along with translation, it is another example of a task that can be formulated as a sequence-to-sequence task. Summarization can be:

  • Extractive: extract the most relevant information from a document.
  • Abstractive: generate new text that captures the most relevant information.

This guide will show you how to:

  1. Finetune T5 on the California state bill subset of the BillSum dataset for abstractive summarization.
  2. Use your finetuned model for inference.

The task illustrated in this tutorial is supported by the following model architectures:

BART, BigBird-Pegasus, Blenderbot, BlenderbotSmall, Encoder decoder, FairSeq Machine-Translation, GPTSAN-japanese, LED, LongT5, M2M100, Marian, mBART, MT5, MVP, NLLB, NLLB-MOE, Pegasus, PEGASUS-X, PLBart, ProphetNet, SeamlessM4T, SeamlessM4Tv2, SwitchTransformers, T5, UMT5, XLM-ProphetNet

Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate rouge_score

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

Load BillSum dataset

Start by loading the smaller California state bill subset of the BillSum dataset from the 🤗 Datasets library:

>>> from datasets import load_dataset

>>> billsum = load_dataset("billsum", split="ca_test")

Split the dataset into a train and test set with the train_test_split method:

>>> billsum = billsum.train_test_split(test_size=0.2)

Then take a look at an example:

>>> billsum["train"][0]
{'summary': 'Existing law authorizes state agencies to enter into contracts for the acquisition of goods or services upon approval by the Department of General Services. Existing law sets forth various requirements and prohibitions for those contracts, including, but not limited to, a prohibition on entering into contracts for the acquisition of goods or services of $100,000 or more with a contractor that discriminates between spouses and domestic partners or same-sex and different-sex couples in the provision of benefits. Existing law provides that a contract entered into in violation of those requirements and prohibitions is void and authorizes the state or any person acting on behalf of the state to bring a civil action seeking a determination that a contract is in violation and therefore void. Under existing law, a willful violation of those requirements and prohibitions is a misdemeanor.\nThis bill would also prohibit a state agency from entering into contracts for the acquisition of goods or services of $100,000 or more with a contractor that discriminates between employees on the basis of gender identity in the provision of benefits, as specified. By expanding the scope of a crime, this bill would impose a state-mandated local program.\nThe California Constitution requires the state to reimburse local agencies and school districts for certain costs mandated by the state. Statutory provisions establish procedures for making that reimbursement.\nThis bill would provide that no reimbursement is required by this act for a specified reason.',
* 'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 10295.35 is added to the Public Contract Code, to read:\n10295.35.\n(a) (1) Notwithstanding any other law, a state agency shall not enter into any contract for the acquisition of goods or services in the amount of one hundred thousand dollars ($100,000) or more with a contractor that, in the provision of benefits, discriminates between employees on the basis of an employee’s or dependent’s actual or perceived gender identity, including, but not limited to, the employee’s or dependent’s identification as transgender.\n(2) For purposes of this section, “contract” includes contracts with a cumulative amount of one hundred thousand dollars ($100,000) or more per contractor in each fiscal year.\n(3) For purposes of this section, an employee health plan is discriminatory if the plan is not consistent with Section 1365.5 of the Health and Safety Code and Section 10140 of the Insurance Code.\n(4) The requirements of this section shall apply only to those portions of a contractor’s operations that occur under any of the following conditions:\n(A) Within the state.\n(B) On real property outside the state if the property is owned by the state or if the state has a right to occupy the property, and if the contractor’s presence at that location is connected to a contract with the state.\n(C) Elsewhere in the United States where work related to a state contract is being performed.\n(b) Contractors shall treat as confidential, to the maximum extent allowed by law or by the requirement of the contractor’s insurance provider, any request by an employee or applicant for employment benefits or any documentation of eligibility for benefits submitted by an employee or applicant for employment.\n(c) After taking all reasonable measures to find a contractor that complies with this section, as determined by the state agency, the requirements of this section may be waived under any of the following circumstances:\n(1) There is only one prospective contractor willing to enter into a specific contract with the state agency.\n(2) The contract is necessary to respond to an emergency, as determined by the state agency, that endangers the public health, welfare, or safety, or the contract is necessary for the provision of essential services, and no entity that complies with the requirements of this section capable of responding to the emergency is immediately available.\n(3) The requirements of this section violate, or are inconsistent with, the terms or conditions of a grant, subvention, or agreement, if the agency has made a good faith attempt to change the terms or conditions of any grant, subvention, or agreement to authorize application of this section.\n(4) The contractor is providing wholesale or bulk water, power, or natural gas, the conveyance or transmission of the same, or ancillary services, as required for ensuring reliable services in accordance with good utility practice, if the purchase of the same cannot practically be accomplished through the standard competitive bidding procedures and the contractor is not providing direct retail services to end users.\n(d) (1) A contractor shall not be deemed to discriminate in the provision of benefits if the contractor, in providing the benefits, pays the actual costs incurred in obtaining the benefit.\n(2) If a contractor is unable to provide a certain benefit, despite taking reasonable measures to do so, the contractor shall not be deemed to discriminate in the 
provision of benefits.\n(e) (1) Every contract subject to this chapter shall contain a statement by which the contractor certifies that the contractor is in compliance with this section.\n(2) The department or other contracting agency shall enforce this section pursuant to its existing enforcement powers.\n(3) (A) If a contractor falsely certifies that it is in compliance with this section, the contract with that contractor shall be subject to Article 9 (commencing with Section 10420), unless, within a time period specified by the department or other contracting agency, the contractor provides to the department or agency proof that it has complied, or is in the process of complying, with this section.\n(B) The application of the remedies or penalties contained in Article 9 (commencing with Section 10420) to a contract subject to this chapter shall not preclude the application of any existing remedies otherwise available to the department or other contracting agency under its existing enforcement powers.\n(f) Nothing in this section is intended to regulate the contracting practices of any local jurisdiction.\n(g) This section shall be construed so as not to conflict with applicable federal laws, rules, or regulations. In the event that a court or agency of competent jurisdiction holds that federal law, rule, or regulation invalidates any clause, sentence, paragraph, or section of this code or the application thereof to any person or circumstances, it is the intent of the state that the court or agency sever that clause, sentence, paragraph, or section so that the remainder of this section shall remain in effect.\nSEC. 2.\nSection 10295.35 of the Public Contract Code shall not be construed to create any new enforcement authority or responsibility in the Department of General Services or any other contracting agency.\nSEC. 3.\nNo reimbursement is required by this act pursuant to Section 6 of Article XIII\u2009B of the California Constitution because the only costs that may be incurred by a local agency or school district will be incurred because this act creates a new crime or infraction, eliminates a crime or infraction, or changes the penalty for a crime or infraction, within the meaning of Section 17556 of the Government Code, or changes the definition of a crime within the meaning of Section 6 of Article XIII\u2009B of the California Constitution.',
 'title': 'An act to add Section 10295.35 to the Public Contract Code, relating to public contracts.'}

There are two fields that you'll want to use:

  • text: the text of the bill, which will be the input to the model.
  • summary: a condensed version of text, which will be the model target.

Preprocess

The next step is to load a T5 tokenizer to process text and summary:

>>> from transformers import AutoTokenizer

>>> checkpoint = "t5-small"
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)

The preprocessing function you want to create needs to:

  1. Prefix the input with a prompt so T5 knows this is a summarization task. Some models capable of multiple NLP tasks require prompting for specific tasks.
  2. Use the keyword text_target argument when tokenizing the labels.
  3. Truncate sequences to be no longer than the maximum length set by the max_length parameter.

>>> prefix = "summarize: "

>>> def preprocess_function(examples):
...     inputs = [prefix + doc for doc in examples["text"]]
...     model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

...     labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

...     model_inputs["labels"] = labels["input_ids"]
...     return model_inputs

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once:

>>> tokenized_billsum = billsum.map(preprocess_function, batched=True)

Now create a batch of examples using DataCollatorForSeq2Seq. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

Pytorch

>>> from transformers import DataCollatorForSeq2Seq

>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)

TensorFlow

>>> from transformers import DataCollatorForSeq2Seq

>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint, return_tensors="tf")

评估

在训练过程中包含一个指标通常有助于评估模型的性能。您可以使用🤗 Evaluate库快速加载一个评估方法。对于这个任务,加载ROUGE指标(查看🤗 Evaluate 快速入门以了解如何加载和计算指标):

代码语言:javascript
复制
>>> import evaluate

>>> rouge = evaluate.load("rouge")

然后创建一个函数,将您的预测和标签传递给compute以计算 ROUGE 指标:

代码语言:javascript
复制
>>> import numpy as np

>>> def compute_metrics(eval_pred):
...     predictions, labels = eval_pred
...     decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
...     labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
...     decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

...     result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

...     prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
...     result["gen_len"] = np.mean(prediction_lens)

...     return {k: round(v, 4) for k, v in result.items()}

您的compute_metrics函数现在已经准备就绪,设置训练时会再用到它。
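
在进入训练之前,您也可以用几条虚构的 token id 序列快速验证 compute_metrics 是否按预期工作(下面的句子纯属示意,仅用于演示调用方式):

代码语言:javascript
复制
>>> dummy_preds = np.array(tokenizer(["a bill about public contracts"], padding=True)["input_ids"])
>>> dummy_labels = np.array(tokenizer(["a short bill about contracts"], padding=True)["input_ids"])
>>> compute_metrics((dummy_preds, dummy_labels))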

训练

Pytorch 隐藏 Pytorch 内容

如果您不熟悉使用 Trainer 微调模型,请先查看这里的基本教程!

现在您已经准备好开始训练您的模型了!使用 AutoModelForSeq2SeqLM 加载 T5:

代码语言:javascript
复制
>>> from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

此时,只剩下三个步骤:

  1. 在 Seq2SeqTrainingArguments 中定义您的训练超参数。唯一必需的参数是output_dir,它指定了保存模型的位置。您可以通过设置push_to_hub=True将此模型推送到 Hub(您需要登录 Hugging Face 才能上传模型)。在每个 epoch 结束时,Trainer 将评估 ROUGE 指标并保存训练检查点。
  2. 将训练参数传递给 Seq2SeqTrainer,同时还要传递模型、数据集、分词器、数据整理器和compute_metrics函数。
  3. 调用 train()来微调您的模型。
代码语言:javascript
复制
>>> training_args = Seq2SeqTrainingArguments(
...     output_dir="my_awesome_billsum_model",
...     evaluation_strategy="epoch",
...     learning_rate=2e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     weight_decay=0.01,
...     save_total_limit=3,
...     num_train_epochs=4,
...     predict_with_generate=True,
...     fp16=True,
...     push_to_hub=True,
... )

>>> trainer = Seq2SeqTrainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_billsum["train"],
...     eval_dataset=tokenized_billsum["test"],
...     tokenizer=tokenizer,
...     data_collator=data_collator,
...     compute_metrics=compute_metrics,
... )

>>> trainer.train()

训练完成后,使用 push_to_hub()方法将您的模型共享到 Hub,以便每个人都可以使用您的模型:

代码语言:javascript
复制
>>> trainer.push_to_hub()

TensorFlow 隐藏 TensorFlow 内容

如果您不熟悉使用 Keras 微调模型,请先查看这里的基本教程!

要微调 TensorFlow 模型,首先设置一个优化器函数、学习率调度和一些训练超参数:

代码语言:javascript
复制
>>> from transformers import create_optimizer, AdamWeightDecay

>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
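
如果您还希望像上文描述的那样加入学习率调度(例如带 warmup 的线性衰减),也可以改用 create_optimizer。下面是一个最小草图,其中 batch_size 和 epoch 数只是假设的演示值:

代码语言:javascript
复制
>>> from transformers import create_optimizer

>>> # 假设的超参数,仅用于演示如何根据总训练步数创建带调度的优化器
>>> batch_size = 16
>>> num_train_epochs = 3
>>> total_train_steps = (len(tokenized_billsum["train"]) // batch_size) * num_train_epochs
>>> optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)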

然后您可以使用 TFAutoModelForSeq2SeqLM 加载 T5:

代码语言:javascript
复制
>>> from transformers import TFAutoModelForSeq2SeqLM

>>> model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)

使用 prepare_tf_dataset()将您的数据集转换为tf.data.Dataset格式:

代码语言:javascript
复制
>>> tf_train_set = model.prepare_tf_dataset(
...     tokenized_billsum["train"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_test_set = model.prepare_tf_dataset(
...     tokenized_billsum["test"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )

使用compile为训练配置模型。请注意,Transformers 模型都有一个默认的与任务相关的损失函数,因此除非您想要指定一个,否则不需要:

代码语言:javascript
复制
>>> import tensorflow as tf

>>> model.compile(optimizer=optimizer)  # No loss argument!

在开始训练之前,还有最后两件事要设置:从预测中计算 ROUGE 分数,并提供一种将模型推送到 Hub 的方法。这两个都可以通过使用 Keras 回调来完成。

将您的compute_metrics函数传递给 KerasMetricCallback:

代码语言:javascript
复制
>>> from transformers.keras_callbacks import KerasMetricCallback

>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_test_set)

在 PushToHubCallback 中指定要推送模型和分词器的位置:

代码语言:javascript
复制
>>> from transformers.keras_callbacks import PushToHubCallback

>>> push_to_hub_callback = PushToHubCallback(
...     output_dir="my_awesome_billsum_model",
...     tokenizer=tokenizer,
... )

然后将您的回调捆绑在一起:

代码语言:javascript
复制
>>> callbacks = [metric_callback, push_to_hub_callback]

最后,您已经准备好开始训练您的模型了!使用您的训练和验证数据集、epoch 数量和回调函数调用fit来微调模型:

代码语言:javascript
复制
>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=callbacks)

训练完成后,您的模型将自动上传到 Hub,以便每个人都可以使用它!

有关如何为摘要微调模型的更深入示例,请查看相应的 PyTorch 笔记本TensorFlow 笔记本

推理

很好,现在您已经对模型进行了微调,可以用于推理了!

想出一些您想要总结的文本。对于 T5,您需要根据您正在处理的任务为输入添加前缀。对于摘要,您应该像下面所示为输入添加前缀:

代码语言:javascript
复制
>>> text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

尝试使用微调后的模型进行推理的最简单方法是在 pipeline() 中使用它。为摘要实例化一个带有您的模型的 pipeline,并将文本传递给它:

代码语言:javascript
复制
>>> from transformers import pipeline

>>> summarizer = pipeline("summarization", model="stevhliu/my_awesome_billsum_model")
>>> summarizer(text)
[{"summary_text": "The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country."}]

如果愿意,您也可以手动复制 pipeline 的结果:

Pytorch 隐藏 Pytorch 内容

对文本进行标记化,并将 input_ids 返回为 PyTorch 张量:

代码语言:javascript
复制
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_billsum_model")
>>> inputs = tokenizer(text, return_tensors="pt").input_ids

使用 generate() 方法来创建摘要。有关不同文本生成策略和控制生成的参数的更多详细信息,请查看 Text Generation API。

代码语言:javascript
复制
>>> from transformers import AutoModelForSeq2SeqLM

>>> model = AutoModelForSeq2SeqLM.from_pretrained("stevhliu/my_awesome_billsum_model")
>>> outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)

将生成的标记 id 解码回文本:

代码语言:javascript
复制
>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
'the inflation reduction act lowers prescription drug costs, health care costs, and energy costs. it's the most aggressive action on tackling the climate crisis in american history. it will ask the ultra-wealthy and corporations to pay their fair share.'
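
如果您想尝试贪婪解码以外的生成策略,例如束搜索(beam search),可以向 generate() 传入相应的生成参数。下面是一个最小草图,其中 num_beams 和 length_penalty 的取值只是假设的示例:

代码语言:javascript
复制
>>> # 使用 4 束的束搜索,并用长度惩罚鼓励稍长的摘要(参数值仅为示意)
>>> outputs = model.generate(inputs, max_new_tokens=100, num_beams=4, length_penalty=2.0, early_stopping=True)
>>> tokenizer.decode(outputs[0], skip_special_tokens=True)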

TensorFlow 隐藏 TensorFlow 内容

对文本进行标记化,并将 input_ids 返回为 TensorFlow 张量:

代码语言:javascript
复制
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_billsum_model")
>>> inputs = tokenizer(text, return_tensors="tf").input_ids

使用 generate() 方法来创建摘要。有关不同文本生成策略和控制生成的参数的更多详细信息,请查看 Text Generation API。

代码语言:javascript
复制
>>> from transformers import TFAutoModelForSeq2SeqLM

>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("stevhliu/my_awesome_billsum_model")
>>> outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)

将生成的标记 id 解码回文本:

代码语言:javascript
复制
>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
'the inflation reduction act lowers prescription drug costs, health care costs, and energy costs. it's the most aggressive action on tackling the climate crisis in american history. it will ask the ultra-wealthy and corporations to pay their fair share.'

多项选择

原始文本:huggingface.co/docs/transformers/v4.37.2/en/tasks/multiple_choice

多项选择任务类似于问题回答,不同之处在于提供了几个候选答案以及上下文,模型经过训练后会选择正确的答案。

本指南将向您展示如何:

  1. SWAG数据集的regular配置上对BERT进行微调,以在给定多个选项和一些上下文的情况下选择最佳答案。
  2. 使用您微调过的模型进行推理。

本教程中演示的任务由以下模型架构支持:

ALBERT、BERT、BigBird、CamemBERT、CANINE、ConvBERT、Data2VecText、DeBERTa-v2、DistilBERT、ELECTRA、ERNIE、ErnieM、FlauBERT、FNet、Funnel Transformer、I-BERT、Longformer、LUKE、MEGA、Megatron-BERT、MobileBERT、MPNet、MRA、Nezha、Nyströmformer、QDQBert、RemBERT、RoBERTa、RoBERTa-PreLayerNorm、RoCBert、RoFormer、SqueezeBERT、XLM、XLM-RoBERTa、XLM-RoBERTa-XL、XLNet、X-MOD、YOSO

在开始之前,请确保已安装所有必要的库:

代码语言:javascript
复制
pip install transformers datasets evaluate

我们鼓励您登录您的 Hugging Face 帐户,这样您就可以上传和与社区分享您的模型。在提示时,输入您的令牌以登录:

代码语言:javascript
复制
>>> from huggingface_hub import notebook_login

>>> notebook_login()

加载 SWAG 数据集

首先加载🤗 Datasets 库中 SWAG 数据集的regular配置:

代码语言:javascript
复制
>>> from datasets import load_dataset

>>> swag = load_dataset("swag", "regular")

然后看一个例子:

代码语言:javascript
复制
>>> swag["train"][0]
{'ending0': 'passes by walking down the street playing their instruments.',
 'ending1': 'has heard approaching them.',
 'ending2': "arrives and they're outside dancing and asleep.",
 'ending3': 'turns the lead singer watches the performance.',
 'fold-ind': '3416',
 'gold-source': 'gold',
 'label': 0,
 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
 'sent2': 'A drum line',
 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
 'video-id': 'anetv_jkn6uvmqwh4'}

虽然这里看起来有很多字段,但实际上非常简单:

  • sent1sent2:这些字段显示句子如何开始,如果将它们放在一起,您将得到startphrase字段。
  • ending0ending1ending2ending3:每个字段都给出一个可能的句子结尾,但其中只有一个是正确的。
  • label:标识正确的句子结尾(可参考下面的小示例)。
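
换句话说,label 指向四个候选结尾中正确的那一个。下面的小示例(假设 swag 已按上文加载)展示了如何拼出正确答案:

代码语言:javascript
复制
>>> example = swag["train"][0]
>>> # 取出四个候选结尾,并用 label 选出正确的那个
>>> endings = [example[f"ending{i}"] for i in range(4)]
>>> example["sent2"] + " " + endings[example["label"]]
'A drum line passes by walking down the street playing their instruments.'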

预处理

下一步是加载 BERT 分词器,以处理句子开头和四个可能的结尾:

代码语言:javascript
复制
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

您要创建的预处理函数需要:

  1. 复制sent1字段四次,并将每个字段与sent2组合,以重新创建句子的开头方式。
  2. sent2与四个可能的句子结尾中的每一个结合起来。
  3. 将这两个列表展平,以便对它们进行标记化,然后在之后将它们展开,以便每个示例都有相应的input_idsattention_masklabels字段。
代码语言:javascript
复制
>>> ending_names = ["ending0", "ending1", "ending2", "ending3"]

>>> def preprocess_function(examples):
...     first_sentences = [[context] * 4 for context in examples["sent1"]]
...     question_headers = examples["sent2"]
...     second_sentences = [
...         [f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)
...     ]

...     first_sentences = sum(first_sentences, [])
...     second_sentences = sum(second_sentences, [])

...     tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
...     return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}
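
在映射整个数据集之前,可以先在少量示例上检查返回的结构:每个示例都应对应 4 组 input_ids(下面的切片大小仅为示意):

代码语言:javascript
复制
>>> features = preprocess_function(swag["train"][:2])
>>> # 2 个示例,每个示例有 4 个候选序列
>>> len(features["input_ids"]), len(features["input_ids"][0])
(2, 4)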

要在整个数据集上应用预处理函数,请使用🤗 Datasets map方法。您可以通过设置batched=True来加速map函数,以一次处理数据集的多个元素:

代码语言:javascript
复制
tokenized_swag = swag.map(preprocess_function, batched=True)

🤗 Transformers 没有用于多项选择的数据整理器,因此您需要调整 DataCollatorWithPadding 以创建一批示例。在整理期间,将句子动态填充到批次中的最长长度,而不是将整个数据集填充到最大长度,这样更有效。

DataCollatorForMultipleChoice将所有模型输入展平,应用填充,然后再将结果还原(取消展平):

Pytorch 隐藏 Pytorch 内容

代码语言:javascript
复制
>>> from dataclasses import dataclass
>>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
>>> from typing import Optional, Union
>>> import torch

>>> @dataclass
... class DataCollatorForMultipleChoice:
...     """
...     Data collator that will dynamically pad the inputs for multiple choice received.
...     """

...     tokenizer: PreTrainedTokenizerBase
...     padding: Union[bool, str, PaddingStrategy] = True
...     max_length: Optional[int] = None
...     pad_to_multiple_of: Optional[int] = None

...     def __call__(self, features):
...         label_name = "label" if "label" in features[0].keys() else "labels"
...         labels = [feature.pop(label_name) for feature in features]
...         batch_size = len(features)
...         num_choices = len(features[0]["input_ids"])
...         flattened_features = [
...             [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
...         ]
...         flattened_features = sum(flattened_features, [])

...         batch = self.tokenizer.pad(
...             flattened_features,
...             padding=self.padding,
...             max_length=self.max_length,
...             pad_to_multiple_of=self.pad_to_multiple_of,
...             return_tensors="pt",
...         )

...         batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
...         batch["labels"] = torch.tensor(labels, dtype=torch.int64)
...         return batch
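
下面是一个可选的小检查,用来确认整理后的张量形状为 (batch_size, num_choices, seq_len)。这里只保留分词产生的列和标签,以模拟 Trainer 默认移除未使用列的行为(代码仅为示意):

代码语言:javascript
复制
>>> keep = ("input_ids", "token_type_ids", "attention_mask", "label")
>>> features = [
...     {k: v for k, v in tokenized_swag["train"][i].items() if k in keep}
...     for i in range(2)
... ]
>>> batch = DataCollatorForMultipleChoice(tokenizer=tokenizer)(features)
>>> batch["input_ids"].shape  # 形如 torch.Size([2, 4, seq_len])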

TensorFlow 隐藏 TensorFlow 内容

代码语言:javascript
复制
>>> from dataclasses import dataclass
>>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
>>> from typing import Optional, Union
>>> import tensorflow as tf

>>> @dataclass
... class DataCollatorForMultipleChoice:
...     """
...     Data collator that will dynamically pad the inputs for multiple choice received.
...     """

...     tokenizer: PreTrainedTokenizerBase
...     padding: Union[bool, str, PaddingStrategy] = True
...     max_length: Optional[int] = None
...     pad_to_multiple_of: Optional[int] = None

...     def __call__(self, features):
...         label_name = "label" if "label" in features[0].keys() else "labels"
...         labels = [feature.pop(label_name) for feature in features]
...         batch_size = len(features)
...         num_choices = len(features[0]["input_ids"])
...         flattened_features = [
...             [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
...         ]
...         flattened_features = sum(flattened_features, [])

...         batch = self.tokenizer.pad(
...             flattened_features,
...             padding=self.padding,
...             max_length=self.max_length,
...             pad_to_multiple_of=self.pad_to_multiple_of,
...             return_tensors="tf",
...         )

...         batch = {k: tf.reshape(v, (batch_size, num_choices, -1)) for k, v in batch.items()}
...         batch["labels"] = tf.convert_to_tensor(labels, dtype=tf.int64)
...         return batch

评估

在训练过程中包含一个度量通常有助于评估模型的性能。您可以使用🤗 Evaluate库快速加载评估方法。对于此任务,加载accuracy度量(请参阅🤗 Evaluate quick tour以了解如何加载和计算度量):

代码语言:javascript
复制
>>> import evaluate

>>> accuracy = evaluate.load("accuracy")

然后创建一个函数,将您的预测和标签传递给compute来计算准确性:

代码语言:javascript
复制
>>> import numpy as np

>>> def compute_metrics(eval_pred):
...     predictions, labels = eval_pred
...     predictions = np.argmax(predictions, axis=1)
...     return accuracy.compute(predictions=predictions, references=labels)

您的compute_metrics函数现在已经准备就绪,设置训练时会再用到它。

训练

Pytorch 隐藏 Pytorch 内容

如果您不熟悉使用 Trainer 微调模型,请先查看这里的基本教程!

您现在可以开始训练您的模型了!使用 AutoModelForMultipleChoice 加载 BERT:

代码语言:javascript
复制
>>> from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer

>>> model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")

此时,只剩下三个步骤:

  1. 在 TrainingArguments 中定义您的训练超参数。唯一必需的参数是output_dir,它指定保存模型的位置。您可以通过设置push_to_hub=True将此模型推送到 Hub(您需要登录 Hugging Face 才能上传模型)。在每个 epoch 结束时,Trainer 将评估准确率并保存训练检查点。
  2. 将训练参数传递给 Trainer,同时还包括模型、数据集、标记器、数据整理器和compute_metrics函数。
  3. 调用 train()来微调您的模型。
代码语言:javascript
复制
>>> training_args = TrainingArguments(
...     output_dir="my_awesome_swag_model",
...     evaluation_strategy="epoch",
...     save_strategy="epoch",
...     load_best_model_at_end=True,
...     learning_rate=5e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     num_train_epochs=3,
...     weight_decay=0.01,
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_swag["train"],
...     eval_dataset=tokenized_swag["validation"],
...     tokenizer=tokenizer,
...     data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
...     compute_metrics=compute_metrics,
... )

>>> trainer.train()

训练完成后,使用 push_to_hub()方法将您的模型共享到 Hub,以便每个人都可以使用您的模型:

代码语言:javascript
复制
>>> trainer.push_to_hub()

TensorFlow 隐藏 TensorFlow 内容

如果您不熟悉使用 Keras 微调模型,请先查看这里的基本教程!

要在 TensorFlow 中微调模型,请首先设置优化器函数、学习率调度和一些训练超参数:

代码语言:javascript
复制
>>> from transformers import create_optimizer

>>> batch_size = 16
>>> num_train_epochs = 2
>>> total_train_steps = (len(tokenized_swag["train"]) // batch_size) * num_train_epochs
>>> optimizer, schedule = create_optimizer(init_lr=5e-5, num_warmup_steps=0, num_train_steps=total_train_steps)

然后,您可以使用 TFAutoModelForMultipleChoice 加载 BERT:

代码语言:javascript
复制
>>> from transformers import TFAutoModelForMultipleChoice

>>> model = TFAutoModelForMultipleChoice.from_pretrained("bert-base-uncased")

使用 prepare_tf_dataset()将数据集转换为tf.data.Dataset格式:

代码语言:javascript
复制
>>> data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
>>> tf_train_set = model.prepare_tf_dataset(
...     tokenized_swag["train"],
...     shuffle=True,
...     batch_size=batch_size,
...     collate_fn=data_collator,
... )

>>> tf_validation_set = model.prepare_tf_dataset(
...     tokenized_swag["validation"],
...     shuffle=False,
...     batch_size=batch_size,
...     collate_fn=data_collator,
... )

使用compile配置模型进行训练。请注意,Transformers 模型都具有默认的与任务相关的损失函数,因此除非您想要指定一个,否则不需要:

代码语言:javascript
复制
>>> model.compile(optimizer=optimizer)  # No loss argument!

在开始训练之前,还有最后两件事要设置:从预测中计算准确率,以及提供一种将模型推送到 Hub 的方法。这两者都可以通过 Keras callbacks 来完成。

将您的compute_metrics函数传递给 KerasMetricCallback:

代码语言:javascript
复制
>>> from transformers.keras_callbacks import KerasMetricCallback

>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)

指定在 PushToHubCallback 中推送您的模型和分词器的位置:

代码语言:javascript
复制
>>> from transformers.keras_callbacks import PushToHubCallback

>>> push_to_hub_callback = PushToHubCallback(
...     output_dir="my_awesome_model",
...     tokenizer=tokenizer,
... )

然后将您的回调捆绑在一起:

代码语言:javascript
复制
>>> callbacks = [metric_callback, push_to_hub_callback]

最后,您已经准备好开始训练您的模型了!调用fit,使用您的训练和验证数据集、epoch 数和回调来微调模型:

代码语言:javascript
复制
>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=2, callbacks=callbacks)

训练完成后,您的模型将自动上传到 Hub,以便每个人都可以使用它!

要了解如何为多项选择微调模型的更深入示例,请查看相应的PyTorch 笔记本TensorFlow 笔记本

推理

很好,现在您已经对模型进行了微调,可以用于推理!

想出一些文本和两个候选答案:

代码语言:javascript
复制
>>> prompt = "France has a bread law, Le Décret Pain, with strict rules on what is allowed in a traditional baguette."
>>> candidate1 = "The law does not apply to croissants and brioche."
>>> candidate2 = "The law applies to baguettes."

Pytorch 隐藏 Pytorch 内容

将每个提示和候选答案对进行标记化,并返回 PyTorch 张量。您还应该创建一些标签:

代码语言:javascript
复制
>>> import torch
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_swag_model")
>>> inputs = tokenizer([[prompt, candidate1], [prompt, candidate2]], return_tensors="pt", padding=True)
>>> labels = torch.tensor(0).unsqueeze(0)

将您的输入和标签传递给模型,并返回logits

代码语言:javascript
复制
>>> from transformers import AutoModelForMultipleChoice

>>> model = AutoModelForMultipleChoice.from_pretrained("my_awesome_swag_model")
>>> outputs = model(**{k: v.unsqueeze(0) for k, v in inputs.items()}, labels=labels)
>>> logits = outputs.logits

获取具有最高概率的类:

代码语言:javascript
复制
>>> predicted_class = logits.argmax().item()
>>> predicted_class
0

TensorFlow 隐藏 TensorFlow 内容

对每个提示和候选答案对进行标记化,并返回 TensorFlow 张量:

代码语言:javascript
复制
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_swag_model")
>>> inputs = tokenizer([[prompt, candidate1], [prompt, candidate2]], return_tensors="tf", padding=True)

将您的输入传递给模型,并返回logits

代码语言:javascript
复制
>>> from transformers import TFAutoModelForMultipleChoice

>>> model = TFAutoModelForMultipleChoice.from_pretrained("my_awesome_swag_model")
>>> inputs = {k: tf.expand_dims(v, 0) for k, v in inputs.items()}
>>> outputs = model(inputs)
>>> logits = outputs.logits

获取具有最高概率的类:

代码语言:javascript
复制
>>> predicted_class = int(tf.math.argmax(logits, axis=-1)[0])
>>> predicted_class
0