Question Answering with a Fine-Tuned BERT

磐创AI

Published on 2021-11-10 11:03:39


This article is included in the column: 磐创AI技术团队的专栏

Author | Chetna Khanna  Translated and compiled by | VK

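Whenever I think about a question answering system, the first thing that comes to mind is a classroom: a teacher answering questions from one or several students who have raised their hands.

Answering questions is a trivial task for humans, but it is far from trivial for machines. To answer any question, a machine has to overcome many different challenges, such as lexical gaps, coreference resolution, and language ambiguity.

To do this, machines need large amounts of training data and smart architectures that can understand and store the important information in a text. Recent advances in NLP have unlocked the ability of machines to understand text and perform a range of tasks.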

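In this article, we will work through a question answering system together. We will take a BERT model from the Hugging Face Transformers library that has already been fine-tuned for question answering, and use it to answer questions about text from the CoQA dataset.

I am sure that once you see the code, you will realize how easy it is to use a fine-tuned model for our purpose.

Note: this article does not go into the details of the BERT architecture. However, I will provide an explanation wherever it is needed or possible.

Versions used in this article: torch-1.7.1, transformers-4.4.2

Let's first answer a few important questions related to this article.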

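What are Hugging Face and Transformers?

Hugging Face is an open-source provider of natural language processing (NLP) technologies.

You can use their state-of-the-art models to build, train, and deploy your own models. Transformers is their NLP library.

I highly recommend checking out the amazing work done by the Hugging Face team and their large collection of pretrained NLP models.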

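What is CoQA?

CoQA is a conversational question answering dataset released by Stanford NLP in 2019. It is a large-scale dataset for building conversational question answering systems.

The dataset aims to measure the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation. What makes this dataset unique is that it was collected in question-and-answer form, so the questions are conversational.

To see the format of the JSON data, refer to this link: http://downloads.cs.stanford.edu/nlp/data/coqa/coqa-train-v1.0.json. We will use the story, questions, and answers from the JSON dataset to build our data frame.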

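What is BERT? 🤔

BERT stands for Bidirectional Encoder Representations from Transformers. It is one of the most popular and widely used NLP models.

A BERT model can take the full context of a word into account by looking at the words that come before and after it, which is particularly useful for understanding the intent behind a query.

Because of its bidirectionality, it has a deeper sense of language context and flow, and is therefore used in many NLP tasks today.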
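To make "bidirectional" a little more concrete, here is a minimal sketch that compares which positions each token may use as context in a left-to-right model and in a bidirectional encoder. This is a toy illustration, not BERT's actual implementation; the function names and the example sentence are assumptions made up for this sketch.

```python
# Toy illustration only: which positions can each token "see"?
# visibility[i][j] == 1 means token i may use token j as context.

def left_to_right_visibility(n):
    # Each token sees itself and the tokens before it.
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_visibility(n):
    # Each token sees every token in the sequence, before and after.
    return [[1] * n for _ in range(n)]

tokens = ["who", "is", "the", "director", "?"]  # hypothetical example
n = len(tokens)

for name, mask in [("left-to-right", left_to_right_visibility(n)),
                   ("bidirectional", bidirectional_visibility(n))]:
    print(name)
    for token, row in zip(tokens, mask):
        print("  {:10}{}".format(token, row))
```

In the bidirectional case every row is all ones, which is the property the paragraph above describes: the representation of each word can depend on the words on both sides of it.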

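The Transformers library offers many different models. It is easy to find a task-specific model in the library and use it for our task.

So let's get started, but first let's look at our dataset.

The JSON data has many fields. For our purpose, we will use "story", plus "input_text" from "questions" and "answers", to form our data frame.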
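As a rough picture of the fields mentioned above, the sketch below builds a tiny, hand-written record shaped like the parts of the CoQA JSON that this article reads, and flattens it into one row per question-answer pair. The story and the question and answer strings are invented for illustration, and the real file contains more fields than shown here.

```python
import pandas as pd

# Hypothetical, heavily shortened record that mimics the fields this
# article reads from the CoQA JSON; the real file has more fields and
# much longer stories.
sample = {
    "version": "1.0",
    "data": [
        {
            "story": "The Vatican Library was formally established in 1475.",
            "questions": [
                {"input_text": "When was it established?", "turn_id": 1},
                {"input_text": "What is it?", "turn_id": 2},
            ],
            "answers": [
                {"input_text": "1475", "turn_id": 1},
                {"input_text": "a library", "turn_id": 2},
            ],
        }
    ],
}

# One row per question-answer pair, each carrying its story.
rows = [
    {"text": item["story"], "question": q["input_text"], "answer": a["input_text"]}
    for item in sample["data"]
    for q, a in zip(item["questions"], item["answers"])
]

sample_df = pd.DataFrame(rows, columns=["text", "question", "answer"])
print(sample_df)
```

The same story is repeated on every row, which is what lets each question be answered on its own against its passage.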

Installing Transformers

!pip install transformers

Importing the libraries

import pandas as pd
import numpy as np
import torch
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer

Loading the data from the Stanford website

coqa = pd.read_json('http://downloads.cs.stanford.edu/nlp/data/coqa/coqa-train-v1.0.json')
coqa.head()

Data cleaning

We will be working with the "data" column, so let's delete the "version" column.

del coqa["version"]

For every question-answer pair, we will attach the associated story.

# Required columns in our data frame
cols = ["text","question","answer"]

# List of lists used to create the data frame
comp_list = []
for index, row in coqa.iterrows():
    for i in range(len(row["data"]["questions"])):
        temp_list = []
        temp_list.append(row["data"]["story"])
        temp_list.append(row["data"]["questions"][i]["input_text"])
        temp_list.append(row["data"]["answers"][i]["input_text"])
        comp_list.append(temp_list)

new_df = pd.DataFrame(comp_list, columns=cols)

# Save the data frame to a CSV file for later loading
new_df.to_csv("CoQA_data.csv", index=False)

Loading the data from the local CSV file

data = pd.read_csv("CoQA_data.csv")
data.head()

This is the cleaned version of our data.

The dataset has a lot of questions and answers, so let's count how many we have.

print("Number of question and answers: ", len(data))

Number of question and answers:  108647

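Building the chatbot

The best part of using these pretrained models is that you can load a model and its tokenizer in just two simple lines of code. Isn't that simple?

For tasks like text classification, we need to fine-tune BERT on our dataset. But for question answering, we can use a model that has already been fine-tuned and still get decent results, even when our text comes from a completely different domain.

To get good results, we use a BERT model that was fine-tuned on the SQuAD benchmark.

For our task, we will use the BertForQuestionAnswering class from the Transformers library.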

model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

Expect the download to take a few minutes, since BERT large is a really big model: it has 24 layers and 340M parameters, which makes it a 1.34GB model.
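As a quick sanity check on these numbers, the sketch below estimates the checkpoint size from the parameter count. It assumes every parameter is stored as a 32-bit float (4 bytes) and ignores any file-format overhead; the helper name is made up for this example.

```python
# Back-of-the-envelope estimate; assumes every parameter is stored as a
# 32-bit float (4 bytes) and ignores file-format overhead.
def checkpoint_size_gb(num_params, bytes_per_param=4):
    return num_params * bytes_per_param / 1e9  # decimal gigabytes

# 340M is the rounded parameter count quoted above.
print(checkpoint_size_gb(340_000_000))
```

The estimate comes out at roughly 1.36 GB, close to the 1.34GB figure; a small gap is expected because 340M is itself a rounded number. Once the model is loaded, sum(p.numel() for p in model.parameters()) should give the exact parameter count.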

Asking a question

Let's pick a question at random.

random_num = np.random.randint(0,len(data))

question = data["question"][random_num]
text = data["text"][random_num]

Let's tokenize the question and the text as a pair, and see how many tokens the pair has.

input_ids = tokenizer.encode(question, text)
print("The input has a total of {} tokens.".format(len(input_ids)))

The input has a total of 427 tokens.
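One thing to keep in mind with token counts: BERT models of this family accept at most 512 tokens, so a long story plus a question may not fit. The sketch below shows one simple way to trim the passage while keeping the question intact. The helper name and the toy token IDs are assumptions for illustration, and cutting the passage this way can of course cut off the answer.

```python
# Sketch only: keep "[CLS] question [SEP]" intact and trim the passage so
# the whole input fits in max_len tokens.
def truncate_pair(input_ids, sep_id, max_len=512):
    if len(input_ids) <= max_len:
        return input_ids
    first_sep = input_ids.index(sep_id)
    head = input_ids[:first_sep + 1]        # [CLS] question [SEP]
    passage = input_ids[first_sep + 1:-1]   # passage tokens without the final [SEP]
    room = max_len - len(head) - 1          # leave one slot for the final [SEP]
    return head + passage[:room] + [sep_id]

# Made-up IDs standing in for [CLS], [SEP], a 2-token question and a
# 600-token passage.
CLS, SEP = 101, 102
toy_ids = [CLS, 7, 8, SEP] + list(range(1000, 1600)) + [SEP]

trimmed = truncate_pair(toy_ids, SEP)
print(len(toy_ids), "->", len(trimmed))
```

With the real tokenizer, passing max_length=512 and truncation="only_second" to tokenizer.encode should have a similar effect, since it truncates only the second sequence (the text).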

To see what our tokenizer is doing, let's print the tokens along with their IDs.

tokens = tokenizer.convert_ids_to_tokens(input_ids)

for token, token_id in zip(tokens, input_ids):
    print('{:8}{:8,}'.format(token, token_id))

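BERT has a unique way of processing tokenized input.

In the output printed by the code above, we can see two special tokens, [CLS] and [SEP].

The [CLS] token stands for classification. It represents the whole input for sentence-level classification and is used when classifying.

The other token BERT uses is [SEP], which separates two pieces of text. You can see two [SEP] tokens in the output, one after the question and another after the text.

Apart from the token embeddings, BERT internally also uses segment embeddings and position embeddings. Segment embeddings help BERT tell the question apart from the text. In practice, we use a vector of 0s for tokens from sentence 1 and a vector of 1s for tokens from sentence 2. Position embeddings specify the position of each word in the sequence. All of these embeddings are fed to the input layer.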
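To illustrate how the three kinds of embeddings come together, here is a small NumPy sketch. In BERT the token, segment, and position embeddings are added element-wise to give one vector per input token. Everything below is made up for illustration: the table sizes are tiny, the tables are random rather than learned, and the token IDs are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny made-up sizes so the arrays are easy to inspect.
vocab_size, max_positions, hidden = 50, 16, 4

token_table = rng.normal(size=(vocab_size, hidden))        # one row per vocabulary entry
segment_table = rng.normal(size=(2, hidden))               # row 0: segment A, row 1: segment B
position_table = rng.normal(size=(max_positions, hidden))  # one row per position

toy_input_ids = [1, 7, 9, 2, 11, 12, 2]   # made-up IDs: [CLS] q q [SEP] t t [SEP]
toy_segment_ids = [0, 0, 0, 0, 1, 1, 1]   # 0 for the question, 1 for the text
positions = np.arange(len(toy_input_ids))

# The three lookups are added element-wise, giving one vector per token.
embeddings = (token_table[toy_input_ids]
              + segment_table[toy_segment_ids]
              + position_table[positions])

print(embeddings.shape)
```

Note that the two [SEP] tokens share the same token ID here but end up with different vectors, because their segment and position differ.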

The Transformers library can create segment embeddings on its own using PreTrainedTokenizer.encode_plus(). But we can also create our own: we just need to assign a 0 or a 1 to each token.

# First occurrence of the [SEP] token
sep_idx = input_ids.index(tokenizer.sep_token_id)
print("SEP token index: ", sep_idx)

# Number of tokens in segment A (question); this is one more than sep_idx because Python indexing starts at 0
num_seg_a = sep_idx+1
print("Number of tokens in segment A: ", num_seg_a)

# Number of tokens in segment B (text)
num_seg_b = len(input_ids) - num_seg_a
print("Number of tokens in segment B: ", num_seg_b)

# Create the segment IDs
segment_ids = [0]*num_seg_a + [1]*num_seg_b

# Make sure every input token has a segment ID
assert len(segment_ids) == len(input_ids)

Here is the output.

SEP token index: 8
Number of tokens in segment A: 9
Number of tokens in segment B: 418

Now let's feed this to our model.

# input_ids represent the input tokens; segment_ids tell the model which segment each token belongs to
output = model(torch.tensor([input_ids]),  token_type_ids=torch.tensor([segment_ids]))

We look at the most probable start and end tokens, and give an answer only if the end token does not come before the start token.

# Tokens with the highest start and end scores
answer_start = torch.argmax(output.start_logits)
answer_end = torch.argmax(output.end_logits)

if answer_end >= answer_start:
    answer = " ".join(tokens[answer_start:answer_end+1])
else:
    # Assign the fallback message so that answer is always defined below
    answer = "I am unable to find the answer to this question. Can you please ask another question?"

print("\nQuestion:\n{}".format(question.capitalize()))
print("\nAnswer:\n{}.".format(answer.capitalize()))

Here are our question and answer.

Question:
Who is the acas director?

Answer:
Agnes karin ##gu.

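BERT predicted the correct answer: "Agnes Karingu". But what is the "##" in the reply? Keep reading!

BERT uses wordpiece tokenization. In BERT, rare words are broken down into subwords/pieces. Wordpiece tokenization uses ## to mark the pieces that continue a split word.

For example, "Karin" is a common word, so wordpiece does not split it. "Karingu", however, is a rare word, so wordpiece splits it into "Karin" and "##gu". Notice that it adds ## in front of gu to indicate that it is the second piece of a split word.

The idea behind using wordpiece is to reduce the size of the vocabulary, which improves training performance.

Consider the words run, running, and runner. Without wordpiece, the model has to store and learn the meaning of all three words independently.

With wordpiece tokenization, however, each of the three words would be split into "run" and the related "##suffix" (if there is a suffix at all). Now the model learns the context of the word "run", and the rest of the meaning is encoded in the suffixes, which are also learned from other words with similar suffixes.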
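The splitting described above can be sketched as a greedy longest-match-first search over a vocabulary. The code below is a simplified illustration, not the tokenizer's actual implementation: the vocabulary is a toy one invented for this example, and BERT's real vocabulary (roughly 30,000 entries) may split these particular words differently.

```python
# Toy vocabulary, invented for this example; BERT's real vocabulary has
# roughly 30,000 entries and may split these words differently.
TOY_VOCAB = {"agnes", "karin", "##gu", "run", "##ning", "##ner"}

def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

for w in ["karin", "karingu", "run", "running", "runner"]:
    print(w, "->", wordpiece(w))
```

With this toy vocabulary, "karingu" comes out as "karin" followed by "##gu", matching the pieces seen in the model's answer.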

Interesting, right? We can reconstruct the words with the simple code below.

answer = tokens[answer_start]

for i in range(answer_start+1, answer_end+1):
    if tokens[i][0:2] == "##":
        answer += tokens[i][2:]
    else:
        answer += " " + tokens[i]

The answer above now becomes: Agnes karingu

Now let's turn this question-answering process into a simple function.
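For a list of wordpiece tokens, the same reconstruction can also be written as a one-line string operation. This is a compact alternative sketch rather than the code used in the rest of the article, and the helper name is made up.

```python
def join_wordpieces(pieces):
    # Join with spaces, then glue "##" continuations onto the previous piece.
    return " ".join(pieces).replace(" ##", "")

print(join_wordpieces(["agnes", "karin", "##gu"]))
```

The tokenizer's own convert_tokens_to_string method should do essentially the same thing for BERT tokens.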

def question_answer(question, text):

    # Tokenize the question and the text as a pair
    input_ids = tokenizer.encode(question, text)

    # String version of the tokenized IDs
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # Segment IDs
    # First occurrence of the [SEP] token
    sep_idx = input_ids.index(tokenizer.sep_token_id)

    # Number of tokens in segment A (question)
    num_seg_a = sep_idx+1

    # Number of tokens in segment B (text)
    num_seg_b = len(input_ids) - num_seg_a

    # List of 0s and 1s for the segment embeddings
    segment_ids = [0]*num_seg_a + [1]*num_seg_b
    assert len(segment_ids) == len(input_ids)

    # Model output using input_ids and segment_ids
    output = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))

    # Reconstruct the answer
    answer_start = torch.argmax(output.start_logits)
    answer_end = torch.argmax(output.end_logits)

    # Default reply, so that answer is defined even when end < start
    answer = "Unable to find the answer to your question."

    if answer_end >= answer_start:
        answer = tokens[answer_start]
        for i in range(answer_start+1, answer_end+1):
            if tokens[i][0:2] == "##":
                answer += tokens[i][2:]
            else:
                answer += " " + tokens[i]

    if answer.startswith("[CLS]"):
        answer = "Unable to find the answer to your question."

    print("\nPredicted answer:\n{}".format(answer.capitalize()))

Let's test this function using a text and a question from our dataset. 😛

text = """New York (CNN) -- More than 80 Michael Jackson collectibles -- including the late pop star's famous rhinestone-studded glove from a 1983 performance -- were auctioned off Saturday, reaping a total $2 million. Profits from the auction at the Hard Rock Cafe in New York's Times Square crushed pre-sale expectations of only $120,000 in sales. The highly prized memorabilia, which included items spanning the many stages of Jackson's career, came from more than 30 fans, associates and family members, who contacted Julien's Auctions to sell their gifts and mementos of the singer. Jackson's flashy glove was the big-ticket item of the night, fetching $420,000 from a buyer in Hong Kong, China. Jackson wore the glove at a 1983 performance during \"Motown 25,\" an NBC special where he debuted his revolutionary moonwalk. Fellow Motown star Walter \"Clyde\" Orange of the Commodores, who also performed in the special 26 years ago, said he asked for Jackson's autograph at the time, but Jackson gave him the glove instead. "The legacy that [Jackson] left behind is bigger than life for me,\" Orange said. \"I hope that through that glove people can see what he was trying to say in his music and what he said in his music.\" Orange said he plans to give a portion of the proceeds to charity. Hoffman Ma, who bought the glove on behalf of Ponte 16 Resort in Macau, paid a 25 percent buyer's premium, which was tacked onto all final sales over $50,000. Winners of items less than $50,000 paid a 20 percent premium."""

question = "Where was the Auction held?"

question_answer(question, text)

# Original answer from the dataset
print("Original answer:\n", data.loc[data["question"] == question]["answer"].values[0])

Output:

Predicted answer:
Hard rock cafe in new york ' s times square

Original answer:
Hard Rock Cafe

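Not bad at all. In fact, our BERT model gave a more detailed answer.

Here is a small loop to test how well BERT understands context. I simply wrapped the question-answering process in a loop to play around with the model.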

text = input("Please enter your text: \n")
question = input("\nPlease enter your question: \n")

while True:
    question_answer(question, text)

    # Keep asking until the reply starts with Y or N (case-insensitive);
    # an empty reply is simply asked again instead of raising an IndexError
    response = ""
    while response not in ("Y", "N"):
        response = input("\nDo you want to ask another question based on this text (Y/N)? ").strip()[:1].upper()

    if response == "N":
        print("\nBye!")
        break

    question = input("\nPlease enter your question: \n")

Result:

Please enter your text: 
The Vatican Apostolic Library (), more commonly called the Vatican Library or simply the Vat, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is much older, it is one of the oldest libraries in the world and contains one of the most significant collections of historical texts. It has 75,000 codices from throughout history, as well as 1.1 million printed books, which include some 8,500 incunabula.   The Vatican Library is a research library for history, law, philosophy, science and theology. The Vatican Library is open to anyone who can document their qualifications and research needs. Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail.   In March 2014, the Vatican Library began an initial four-year project of digitising its collection of manuscripts, to be made available online.   The Vatican Secret Archives were separated from the library at the beginning of the 17th century; they contain another 150,000 items.   Scholars have traditionally divided the history of the library into five periods, Pre-Lateran, Lateran, Avignon, Pre-Vatican and Vatican.   The Pre-Lateran period, comprising the initial days of the library, dated from the earliest days of the Church. Only a handful of volumes survive from this period, though some are very significant.

Please enter your question: 
When was the Vat formally opened?

Answer:
1475

Do you want to ask another question based on this text (Y/N)? Y

Please enter your question: 
what is the library for?

Answer:
Research library for history , law , philosophy , science and theology

Do you want to ask another question based on this text (Y/N)? Y

Please enter your question: 
for what subjects?

Answer:
History , law , philosophy , science and theology

Do you want to ask another question based on this text (Y/N)? N

Bye!

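Voila! It works well!

I hope this article gave you a sense of how easily we can use pretrained models from the Hugging Face Transformers library to perform our task.

GitHub link: https://github.com/chetnakhanna16/CoQA_QuesAns_BERT/blob/main/CoQA_BERT_QuestionAnswering.ipynb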
