
Practical AI: Automatically Generate True or False Questions from Any Content with OpenAI GPT2, Sentence BERT, and the Berkeley Constituency Parser

代码医生工作室
Published 2020-04-02 11:32:08
Included in the column: 相约机器人

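Author | Ramsri Goutham

Source | Medium

Editor | 代码医生团队

In this article, we will see how to use recent AI algorithms to automatically generate true-or-false questions like the ones you see in school textbooks.

Input: The input to the program is any article, such as the following: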

There is a lot of volcanic activity at divergent plate boundaries in the oceans. For example, many undersea volcanoes are found along the Mid-Atlantic Ridge. This is a divergent plate boundary that runs north-south through the middle of the Atlantic Ocean. As tectonic plates pull away from each other at a divergent plate boundary, they create deep fissures, or cracks, in the crust. Molten rock, called magma, erupts through these cracks onto Earth’s surface. At the surface, the molten rock is called lava. It cools and hardens, forming rock. Divergent plate boundaries also occur in the continental crust. Volcanoes form at these boundaries, but less often than in ocean crust. That’s because continental crust is thicker than oceanic crust. This makes it more difficult for molten rock to push up through the crust. Many volcanoes form along convergent plate boundaries where one tectonic plate is pulled down beneath another at a subduction zone. The leading edge of the plate melts as it is pulled into the mantle, forming magma that erupts as volcanoes. When a line of volcanoes forms along a subduction zone, they make up a volcanic arc. The edges of the Pacific plate are long subduction zones lined with volcanoes. This is why the Pacific rim is called the “Pacific Ring of Fire.”

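Output: The output is a set of automatically generated true and false sentences. The true sentences are taken directly from the article above, and the false sentences are generated by OpenAI GPT2 from those true sentences.

True sentence (from the story):

Divergent plate boundaries also occur in the continental crust

False sentences (generated by GPT-2):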

a) Divergent plate boundaries also occur in the low and high latitudes.

b) Divergent plate boundaries also occur in regions with more frequent rainfall.

c) Divergent plate boundaries also occur in the brain of mammals and vertebrates.

d) Divergent plate boundaries also have been proposed.

e) Divergent plate boundaries also may be used to map and reduce traffic congestion.

f) Divergent plate boundaries also had to be adjusted and the data collected from different cities was sent on a regular basis.

These can be rearranged and presented as true-or-false questions like the following:

Divergent plate boundaries also occur in the continental crust

a) True

b) False

Divergent plate boundaries also occur in regions with more frequent rainfall.

a) True

b) False
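
The rearrangement above can be sketched in a few lines of plain Python. This is only an illustration, not code from the notebook: the function name `build_true_false_quiz`, the dictionary layout, and the fixed random seed are all assumptions made for the example.

```python
import random

def build_true_false_quiz(true_to_false, seed=0):
    """Turn {true sentence: [false sentences]} into a shuffled list of quiz items."""
    rng = random.Random(seed)  # seeded so the order is reproducible
    items = []
    for true_sentence, false_sentences in true_to_false.items():
        items.append({"statement": true_sentence, "answer": "True"})
        for false_sentence in false_sentences:
            items.append({"statement": false_sentence, "answer": "False"})
    rng.shuffle(items)
    return items

quiz = build_true_false_quiz({
    "Divergent plate boundaries also occur in the continental crust": [
        "Divergent plate boundaries also occur in regions with more frequent rainfall.",
    ],
})
for item in quiz:
    print(item["statement"])
    print("a) True")
    print("b) False")
    print("Answer:", item["answer"])
```

Shuffling matters here because a quiz that always lists the true statement first would give the answer away.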

--------------------------------------------------------------------

All the code and the Jupyter notebook are available here:

https://github.com/ramsrigouthamg/Generate_True_or_False_OpenAI_GPT2_Sentence_BERT

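Now that we know what we are going to build, let's get started.

True or false statements

First, let's look at a few ways to generate True or False statements from a given statement. We will see how GPT2 can help in some of these cases.

1) Add or remove negation

2) Change a named entity

3) Change an adjective

4) Change the main verb

5) Split a compound or complex sentence into simple sentences

6) Change a noun phrase or verb phrase

WordNet, ConceptNet, and word vectors can be used to find similar named entities and antonyms of verbs. These resources can be used to tackle 2) and 4) above.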
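
As a rough illustration of approach 4), the sketch below swaps a verb for its antonym. It is not taken from the notebook: the `ANTONYMS` dictionary is a hypothetical, hand-written stand-in for a real WordNet lookup, and the function simply replaces the first word it recognizes rather than identifying the grammatical main verb.

```python
# Hypothetical stand-in for a WordNet antonym lookup (not real WordNet data).
ANTONYMS = {
    "rises": "falls",
    "melts": "freezes",
    "cools": "heats",
}

def swap_first_known_verb(sentence, antonyms=ANTONYMS):
    """Replace the first word that has a known antonym; return None if nothing matches."""
    words = sentence.split()
    for i, word in enumerate(words):
        if word in antonyms:
            return " ".join(words[:i] + [antonyms[word]] + words[i + 1:])
    return None

false_statement = swap_first_known_verb("Lava cools and hardens, forming rock")
print(false_statement)  # Lava heats and hardens, forming rock
```

A real implementation would need part-of-speech tagging to find the main verb and a proper lexical resource for the antonyms.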

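In this article, we use 6), changing a noun phrase or verb phrase, to generate True and False statements.

Follow along with the Jupyter notebook shared at the beginning of this article.

First install the following libraries. Make sure everything installs correctly, since many of these packages are heavy. Resolve any errors before continuing.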

!pip install tensorflow==1.14.0
!pip install torch==1.4.0
!pip install sentence-transformers==0.2.5.1
!pip install transformers==2.6.0
!pip install benepar==0.1.2
!pip install summa
!pip install nltk==3.4.5
!pip install spacy==2.1.0
!python3 -m spacy download en
!pip install scipy

Import the necessary libraries and download the NLTK and benepar model files.

import requests
import json
from summa.summarizer import summarize
import benepar
import string
import nltk
from nltk import tokenize
from nltk.tokenize import sent_tokenize
import re
from random import shuffle
import spacy
nlp = spacy.load('en')
#this package is required for the summa summarizer
nltk.download('punkt')
benepar.download('benepar_en2')
benepar_parser = benepar.Parser("benepar_en2")

Step 1: Load the content from a text file

file_path = "volcano.txt"
 
def read_file(file_path):
    with open(file_path, 'r') as content_file:
        content = content_file.read()
        return content
    
text = read_file(file_path)
print(text)

The output will be:

There is a lot of volcanic activity at divergent plate boundaries in the oceans. For example, many undersea volcanoes are found along the Mid-Atlantic Ridge. This is a divergent plate boundary that runs north-south through the middle of the Atlantic Ocean. As tectonic plates pull away from each other at a divergent plate boundary, they create deep fissures, or cracks, in the crust. Molten rock, called magma, erupts through these cracks onto Earth’s surface. At the surface, the molten rock is called lava. It cools and hardens, forming rock. Divergent plate boundaries also occur in the continental crust. Volcanoes form at these boundaries, but less often than in ocean crust. That’s because continental crust is thicker than oceanic crust. This makes it more difficult for molten rock to push up through the crust. Many volcanoes form along convergent plate boundaries where one tectonic plate is pulled down beneath another at a subduction zone. The leading edge of the plate melts as it is pulled into the mantle, forming magma that erupts as volcanoes. When a line of volcanoes forms along a subduction zone, they make up a volcanic arc. The edges of the Pacific plate are long subduction zones lined with volcanoes. This is why the Pacific rim is called the “Pacific Ring of Fire.”

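Step 2: Summarize the loaded content

Use the summa extractive summarizer library to summarize the loaded content. Also drop any summary sentences that contain single quotes, double quotes, or question marks, since they are not suitable for a true-or-false quiz.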

from string import punctuation
 
def preprocess(sentences):
    output = []
    for sent in sentences:
        single_quotes_present = len(re.findall(r"['][\w\s.:;,!?\\-]+[']",sent))>0
        double_quotes_present = len(re.findall(r'["][\w\s.:;,!?\\-]+["]',sent))>0
        question_present = "?" in sent
        if single_quotes_present or double_quotes_present or question_present :
            continue
        else:
            output.append(sent.strip(punctuation))
    return output
        
        
def get_candidate_sents(resolved_text, ratio=0.3):
    candidate_sents = summarize(resolved_text, ratio=ratio)
    candidate_sents_list = tokenize.sent_tokenize(candidate_sents)
    candidate_sents_list = [re.split(r'[:;]+',x)[0] for x in candidate_sents_list ]
    # Remove very short sentences less than 30 characters and long sentences greater than 150 characters
    filtered_list_short_sentences = [sent for sent in candidate_sents_list if len(sent)>30 and len(sent)<150]
    return filtered_list_short_sentences
 
cand_sents = get_candidate_sents(text)
filter_quotes_and_questions = preprocess(cand_sents)
for each_sentence in filter_quotes_and_questions:
    print (each_sentence)
    print ("\n")
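
To see what the quote and question filter in preprocess actually removes, here is a small standalone sketch that reuses the same two regular expressions on a few made-up sentences. The helper name `keep_sentence` and the sample sentences are assumptions for illustration only.

```python
import re
from string import punctuation

# Same patterns as in preprocess: a quoted span made of word characters,
# whitespace, and basic punctuation between a pair of quote characters.
SINGLE_QUOTED = re.compile(r"['][\w\s.:;,!?\\-]+[']")
DOUBLE_QUOTED = re.compile(r'["][\w\s.:;,!?\\-]+["]')

def keep_sentence(sent):
    """Return True when the sentence has no quoted span and no question mark."""
    if SINGLE_QUOTED.search(sent) or DOUBLE_QUOTED.search(sent):
        return False
    return "?" not in sent

samples = [
    "Molten rock, called magma, erupts through these cracks.",
    'The region is called the "Ring of Fire".',
    "Why do volcanoes form here?",
    "It's thicker than the plate's edge.",  # two apostrophes look like a quoted span
]
kept = [s.strip(punctuation) for s in samples if keep_sentence(s)]
print(kept)
```

Note the last sample: a sentence with two straight apostrophes is treated as containing a single-quoted span and is dropped, even though nothing is actually quoted. The volcano text uses curly apostrophes, so it is not affected.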

The summary output is just 4 sentences selected from the text.

As tectonic plates pull away from each other at a divergent plate boundary, they create deep fissures, or cracks, in the crust

Divergent plate boundaries also occur in the continental crust

Volcanoes form at these boundaries, but less often than in ocean crust

Many volcanoes form along convergent plate boundaries where one tectonic plate is pulled down beneath another at a subduction zone

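We will use the 4 sentences above and generate False sentences from them by changing the verb phrase, the noun phrase, or both.

Step 3: Split each sentence at the appropriate place using the Berkeley constituency parser

Here we use the Berkeley constituency parser to split each sentence at its final verb phrase or noun phrase. For example, given the input sentence “Divergent plate boundaries also occur in the continental crust”, splitting at the final noun phrase gives “Divergent plate boundaries also occur in”, and splitting at the final verb phrase gives “Divergent plate boundaries also”. We then feed a partial sentence such as “Divergent plate boundaries also occur in” to OpenAI GPT-2 to generate sentences with different endings. This is how we generate False sentences with a different final verb phrase or noun phrase.

Figure: parse tree of the sentence “Divergent plate boundaries also occur in the continental crust”

In the parse tree in the figure, which was generated with the AllenNLP constituency parser demo, the last verb phrase is “occur in the continental crust” and the last noun phrase is “the continental crust”. Splitting the original sentence at the last verb phrase gives “Divergent plate boundaries also”. Splitting it at the last noun phrase gives “Divergent plate boundaries also occur in”.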

from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))
    
def get_flattened(t):
    sent_str_final = None
    if t is not None:
        sent_str = [" ".join(x.leaves()) for x in list(t)]
        sent_str_final = [" ".join(sent_str)]
        sent_str_final = sent_str_final[0]
    return sent_str_final
    
 
def get_termination_portion(main_string,sub_string):
    combined_sub_string = sub_string.replace(" ","")
    main_string_list = main_string.split()
    last_index = len(main_string_list)
    for i in range(last_index):
        check_string_list = main_string_list[i:]
        check_string = "".join(check_string_list)
        check_string = check_string.replace(" ","")
        if check_string == combined_sub_string:
            return " ".join(main_string_list[:i])
                     
    return None
    
def get_right_most_VP_or_NP(parse_tree,last_NP = None,last_VP = None):
    if len(parse_tree.leaves()) == 1:
        return get_flattened(last_NP),get_flattened(last_VP)
    last_subtree = parse_tree[-1]
    if last_subtree.label() == "NP":
        last_NP = last_subtree
    elif last_subtree.label() == "VP":
        last_VP = last_subtree
    
    return get_right_most_VP_or_NP(last_subtree,last_NP,last_VP)
 
 
def get_sentence_completions(key_sentences):
    sentence_completion_dict = {}
    # Iterate over the sentences passed in rather than a global variable
    for individual_sentence in key_sentences:
        sentence = individual_sentence.rstrip('?:!.,;')
        tree = benepar_parser.parse(sentence)
        last_nounphrase, last_verbphrase =  get_right_most_VP_or_NP(tree)
        phrases= []
        if last_verbphrase is not None:
            verbphrase_string = get_termination_portion(sentence,last_verbphrase)
            # get_termination_portion returns None when the phrase cannot be matched
            if verbphrase_string is not None:
                phrases.append(verbphrase_string)
        if last_nounphrase is not None:
            nounphrase_string = get_termination_portion(sentence,last_nounphrase)
            if nounphrase_string is not None:
                phrases.append(nounphrase_string)
 
        # Longest partial sentence first
        longest_phrase =  sorted(phrases, key=len,reverse= True)
        if len(longest_phrase) == 2:
            first_sent_len = len(longest_phrase[0].split())
            second_sentence_len = len(longest_phrase[1].split())
            # Drop the shorter split when it is more than 4 words shorter
            if (first_sent_len - second_sentence_len) > 4:
                del longest_phrase[1]
                
        if len(longest_phrase)>0:
            sentence_completion_dict[sentence]=longest_phrase
    return sentence_completion_dict
 
 
 
sent_completion_dict = get_sentence_completions(filter_quotes_and_questions)
 
print (sent_completion_dict)

The output of the code above is:

{'As tectonic plates pull away from each other at a divergent plate boundary, they create deep fissures, or cracks, in the crust': ['As tectonic plates pull away from each other at a divergent plate boundary, they create deep fissures, or cracks, in'], 'Divergent plate boundaries also occur in the continental crust': ['Divergent plate boundaries also occur in', 'Divergent plate boundaries also'], 'Volcanoes form at these boundaries, but less often than in ocean crust': ['Volcanoes form at these boundaries, but less often than in'], 'Many volcanoes form along convergent plate boundaries where one tectonic plate is pulled down beneath another at a subduction zone': ['Many volcanoes form along convergent plate boundaries where one tectonic plate is pulled down beneath another at']}

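In the code above, we pass in each sentence and get back a dictionary whose keys are the sentences and whose values are lists of the partial sentences obtained by splitting at the final verb phrase and the final noun phrase.

For example: “Divergent plate boundaries also occur in the continental crust”: [“Divergent plate boundaries also occur in”, “Divergent plate boundaries also”]

Note that a sentence does not always end in both a verb phrase and a noun phrase, and the code also drops the shorter split when it is more than four words shorter than the longer one. As a result, not every sentence in the dictionary above has two values.

The function get_right_most_VP_or_NP in the code above does the main work. It recursively walks down the rightmost branch of the parse tree and records the last verb phrase and the last noun phrase at which the sentence can be split.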
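
The traversal can be illustrated without benepar or NLTK. In the sketch below the parse tree is written by hand as nested tuples; this tree, its labels, and the function names are assumptions for illustration, not actual parser output or notebook code.

```python
# A parse tree as nested tuples: (label, child, child, ...); leaves are plain strings.
tree = ("S",
        ("NP", ("JJ", "Divergent"), ("NN", "plate"), ("NNS", "boundaries")),
        ("ADVP", ("RB", "also")),
        ("VP", ("VBP", "occur"),
               ("PP", ("IN", "in"),
                      ("NP", ("DT", "the"), ("JJ", "continental"), ("NN", "crust")))))

def leaves(node):
    """Collect the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

def rightmost_np_vp(node, last_np=None, last_vp=None):
    """Follow the last child at each level, remembering the deepest NP and VP seen."""
    if isinstance(node, str) or len(leaves(node)) == 1:
        return last_np, last_vp
    last_child = node[-1]
    if not isinstance(last_child, str):
        if last_child[0] == "NP":
            last_np = " ".join(leaves(last_child))
        elif last_child[0] == "VP":
            last_vp = " ".join(leaves(last_child))
    return rightmost_np_vp(last_child, last_np, last_vp)

np_text, vp_text = rightmost_np_vp(tree)
print(np_text)  # the continental crust
print(vp_text)  # occur in the continental crust
```

Only the last child is ever visited, so the function touches one node per tree level rather than the whole tree.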

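The parse tree object returned by benepar (an NLTK Tree object) does not preserve whitespace and other details. So for a sentence such as “Mary ate John's apple pie”, the noun phrase is identified and get_flattened renders it as “John 's apple pie”. Notice the space between “John” and “'s”. A plain string match therefore cannot find and remove this noun phrase in the main sentence “Mary ate John's apple pie”. That is why the code above includes a helper function, get_termination_portion, with custom logic that ignores spaces when matching and returns “Mary ate” after removing the noun phrase “John's apple pie”.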
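
The space-insensitive matching can be tried on its own. The sketch below follows the same logic as get_termination_portion in a compact form; the name `strip_trailing_phrase` is chosen for this example only.

```python
def strip_trailing_phrase(sentence, phrase):
    """Return the part of `sentence` before `phrase`, comparing with all spaces removed."""
    target = phrase.replace(" ", "")
    words = sentence.split()
    for i in range(len(words)):
        # Join the remaining words without spaces and compare against the phrase
        if "".join(words[i:]) == target:
            return " ".join(words[:i])
    return None

prefix = strip_trailing_phrase("Mary ate John's apple pie", "John 's apple pie")
print(prefix)  # Mary ate
print(strip_trailing_phrase("Mary ate John's apple pie", "banana bread"))  # None
```

The function returns None when the phrase is not a suffix of the sentence, so callers should check for that before using the result.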

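Step 4: Load OpenAI GPT2 and Sentence BERT

This is just some initialization: we load OpenAI GPT2 and Sentence BERT for the next step, which generates text from the partially split sentences above.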

# https://huggingface.co/transformers/main_classes/model.html?highlight=no_repeat_ngram_size
 
from transformers import GPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
 
# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("gpt2",pad_token_id=tokenizer.eos_token_id)
 
from sentence_transformers import SentenceTransformer
# Load the BERT model. Various models trained on Natural Language Inference (NLI) https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/nli-models.md and
# Semantic Textual Similarity are available https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/sts-models.md
model_BERT = SentenceTransformer('bert-base-nli-mean-tokens')

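Step 5: Generate false sentences and filter them by similarity to the original sentence

Use OpenAI GPT2 to generate multiple sentences, then use Sentence BERT to filter out the ones that are too similar to the original, since we only want to keep dissimilar sentences as False sentences.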

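from nltk import tokenize
import scipy.spatial.distance  # makes scipy.spatial.distance.cdist available
import torch  # used below for manual_seed and tensor
torch.manual_seed(2020)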
 
 
def sort_by_similarity(original_sentence,generated_sentences_list):
    # Each sentence is encoded as a 1-D vector with 768 columns
    sentence_embeddings = model_BERT.encode(generated_sentences_list)
 
    queries = [original_sentence]
    query_embeddings = model_BERT.encode(queries)
    # Find the top sentences of the corpus for each query sentence based on cosine similarity
    number_top_matches = len(generated_sentences_list)
 
    dissimilar_sentences = []
 
    for query, query_embedding in zip(queries, query_embeddings):
        distances = scipy.spatial.distance.cdist([query_embedding], sentence_embeddings, "cosine")[0]
 
        results = zip(range(len(distances)), distances)
        results = sorted(results, key=lambda x: x[1])
 
 
        for idx, distance in reversed(results[0:number_top_matches]):
            score = 1-distance
            if score < 0.9:
                dissimilar_sentences.append(generated_sentences_list[idx].strip())
           
    sorted_dissimilar_sentences = sorted(dissimilar_sentences, key=len)
    
    return sorted_dissimilar_sentences[:3]
    
 
def generate_sentences(partial_sentence,full_sentence):
    input_ids = torch.tensor([tokenizer.encode(partial_sentence)])
    maximum_length = len(partial_sentence.split())+80
 
    # Activate top_k and top_p (nucleus) sampling: sample from the 50 most likely tokens, limited to those covering 90% of the probability mass
    sample_outputs = model.generate(
        input_ids,
        do_sample=True,
        max_length=maximum_length,
        top_p=0.90, # 0.85
        top_k=50,   #0.30
        repetition_penalty  = 10.0,
        num_return_sequences=10
    )
    generated_sentences=[]
    for i, sample_output in enumerate(sample_outputs):
        decoded_sentences = tokenizer.decode(sample_output, skip_special_tokens=True)
        decoded_sentences_list = tokenize.sent_tokenize(decoded_sentences)
        generated_sentences.append(decoded_sentences_list[0])
        
    top_3_sentences = sort_by_similarity(full_sentence,generated_sentences)
    
    return top_3_sentences
 
index = 1
choice_list = ["a)","b)","c)","d)","e)","f)"]
for key_sentence in sent_completion_dict:
    partial_sentences = sent_completion_dict[key_sentence]
    false_sentences =[]
    print_string = "**%s) True Sentence (from the story) :**"%(str(index))
    printmd(print_string)
    print ("  ",key_sentence)
    for partial_sent in partial_sentences:
        false_sents = generate_sentences(partial_sent,key_sentence)
        false_sentences.extend(false_sents)
    printmd("  **False Sentences (GPT-2 Generated)**")
    for ind,false_sent in enumerate(false_sentences):
        print_string_choices = "**%s** %s"%(choice_list[ind],false_sent)
        printmd(print_string_choices)
    index = index+1
    
    print ("\n\n")

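Let's walk through a concrete example to understand the code above.

Suppose the original sentence is “Many years ago, there was a holy man who lived in a monastery.” Step 3 above then splits it at the final noun phrase to give “Many years ago, there was a holy man who lived in a”.

Passing the partial sentence “Many years ago, there was a holy man who lived in a” to the generate_sentences function above produces generated sentences such as the following: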

1 Many years ago, there was a holy man who lived in a monastery that had been built on ruins of an ancient building.

2 Many years ago, there was a holy man who lived in a cave and said that the only thing which he did not know is God.

3 Many years ago, there was a holy man who lived in a monastery.

4 Many years ago, there was a holy man who lived in a mountain area.

5 Many years ago, there was a holy man who lived in a temple with no people.

6 Many years ago, there was a holy man who lived in a small village.

7 Many years ago, there was a holy man who lived in a town called Foula.

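All of these generated sentences, together with the original sentence “Many years ago, there was a holy man who lived in a monastery.”, are passed to the function sort_by_similarity, which computes the cosine similarity between the original sentence and each of the seven sentences above. Each sentence is encoded with Sentence BERT, and SciPy is used to compute the cosine similarity scores. We keep the dissimilar sentences (in the code above, those with a similarity score below 0.9), because we want False sentences that do not match the original. Overly long sentences are also filtered out: the code sorts the remaining candidates by length and returns the shortest ones.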
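
The filtering logic can be sketched without loading Sentence BERT. In the example below the three-dimensional vectors are made-up stand-ins for real sentence embeddings, and the function names are assumptions; only the 0.9 threshold, the sort by length, and the limit of three follow the code above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pick_false_candidates(original_vec, candidates, threshold=0.9, keep=3):
    """Keep sentences whose similarity to the original is below the threshold, shortest first."""
    dissimilar = [sentence for sentence, vec in candidates
                  if cosine_similarity(original_vec, vec) < threshold]
    return sorted(dissimilar, key=len)[:keep]

# Toy embeddings (hypothetical values, not real Sentence BERT output).
original = [1.0, 0.0, 0.0]
candidates = [
    ("... lived in a monastery.", [1.0, 0.0, 0.0]),
    ("... lived in a monastery that had been built on ruins.", [0.95, 0.1, 0.0]),
    ("... lived in a cave and said that he did not know God.", [0.0, 1.0, 0.0]),
    ("... lived in a small village.", [0.5, 0.5, 0.0]),
]
selected = pick_false_candidates(original, candidates)
print(selected)
```

The two monastery sentences score above 0.9 and are rejected as too close to the original, which is the behavior the threshold is meant to produce.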

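After applying these techniques to the 7 sentences above, we get the filtered output below. These look like reasonable False sentences for the original.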

Original : Many years ago, there was a holy man who lived in a monastery.

======================

Generated and filtered sentences:

1 ) Many years ago, there was a holy man who lived in a small village.

2 ) Many years ago, there was a holy man who lived in a mountain area.

3 ) Many years ago, there was a holy man who lived in a town called Foula.

4 ) Many years ago, there was a holy man who lived in a temple with no people.

5 ) Many years ago, there was a holy man who lived in a cave and said that the only thing which he did not know is God.

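Finally, the output of the program after step 5 is a set of False sentences generated from the True sentences originally selected from the story.

Final output: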

1) True Sentence (from the story) :

As tectonic plates pull away from each other at a divergent plate boundary, they create deep fissures, or cracks, in the crust

False Sentences (GPT-2 Generated)

a) As tectonic plates pull away from each other at a divergent plate boundary, they create deep fissures, or cracks, in the crust that provide access to oxygen-rich water.

b) As tectonic plates pull away from each other at a divergent plate boundary, they create deep fissures, or cracks, in the seafloor that are more sensitive to wind velocity and pressure than most continental surfaces.

2) True Sentence (from the story) :

Divergent plate boundaries also occur in the continental crust

False Sentences (GPT-2 Generated)

a) Divergent plate boundaries also occur in the low and high latitudes.

b) Divergent plate boundaries also occur in regions with more frequent rainfall.

c) Divergent plate boundaries also occur in the brain of mammals and vertebrates.

d) Divergent plate boundaries also have been proposed.

e) Divergent plate boundaries also may be used to map and reduce traffic congestion.

f) Divergent plate boundaries also had to be adjusted and the data collected from different cities was sent on a regular basis.

3) True Sentence (from the story) :

Volcanoes form at these boundaries, but less often than in ocean crust

False Sentences (GPT-2 Generated)

a) Volcanoes form at these boundaries, but less often than in any other country," he says.

b) Volcanoes form at these boundaries, but less often than in a large coastal country (such as the USA) they must be found along well-defined water bodies such that their location can become clear.

4) True Sentence (from the story) :

Many volcanoes form along convergent plate boundaries where one tectonic plate is pulled down beneath another at a subduction zone

False Sentences (GPT-2 Generated)

a) Many volcanoes form along convergent plate boundaries where one tectonic plate is pulled down beneath another at a rate of 1.1, 3–4 km/h (5).

b) Many volcanoes form along convergent plate boundaries where one tectonic plate is pulled down beneath another at about 25% of its original location.

c) Many volcanoes form along convergent plate boundaries where one tectonic plate is pulled down beneath another at a rate of about 100 million km/year.

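Some of the generated False sentences look almost like ones a human would write, while others are less logical. You can experiment with the GPT-2 text generation parameters and tune them further.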
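
To get a feel for what the top_k and top_p parameters do before tuning them, here is a simplified sketch of the filtering step on a toy next-token distribution. It is not the transformers implementation: the function name, the dictionary representation, and the probabilities are all assumptions for illustration.

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.9):
    """Keep the top_k most likely tokens, then the smallest prefix of them whose
    cumulative probability reaches top_p, and renormalize what is left."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Toy distribution over the next word (hypothetical values).
next_word = {"crust": 0.5, "ocean": 0.3, "mantle": 0.15, "brain": 0.05}
wide = top_k_top_p_filter(next_word, top_k=3, top_p=0.9)
narrow = top_k_top_p_filter(next_word, top_k=3, top_p=0.7)
print(sorted(wide))
print(sorted(narrow))
```

Lowering top_p or top_k shrinks the pool of candidate words, which tends to make completions more predictable; raising them allows more unusual endings, including the off-topic ones seen in the output above.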

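What can be improved?

Run coreference resolution (neural coreference resolution) on the full text before passing it to the summa summarizer. Any sentence containing a pronoun would then be resolved, so that it reads as complete and self-contained when presented as a true-or-false statement. Because coreference resolution is not perfect, however, this will also introduce some unwanted errors.

Instead of giving GPT2 only the unfinished sentence (for example, “Divergent plate boundaries also occur in”) and asking it to generate a completion, also give it a few of the sentences that precede the unfinished one. GPT-2 will then have more context from which to generate coherent text.

Use the benepar constituency parser to split compound and complex sentences into simple sentences. The simple sentences can then be presented as True statements. This addresses point 5) in the list of approaches to generating True or False statements given at the beginning.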
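
The second idea, giving GPT-2 some preceding sentences as context, could be prototyped along the following lines. This is a sketch under stated assumptions: `build_prompt`, `strip_context`, and the `context_size` parameter are hypothetical helpers, and the call to the model itself is left out.

```python
def build_prompt(sentences, target_index, partial_sentence, context_size=2):
    """Prefix the partial sentence with up to `context_size` preceding sentences."""
    start = max(0, target_index - context_size)
    context = " ".join(sentences[start:target_index])
    prompt = f"{context} {partial_sentence}".strip()
    return context, prompt

def strip_context(generated_text, context):
    """Remove the context prefix so only the completed sentence remains."""
    if context and generated_text.startswith(context):
        return generated_text[len(context):].strip()
    return generated_text.strip()

story = [
    "At the surface, the molten rock is called lava.",
    "It cools and hardens, forming rock.",
    "Divergent plate boundaries also occur in the continental crust.",
]
partial = "Divergent plate boundaries also occur in"
context, prompt = build_prompt(story, 2, partial)
print(prompt)

# Pretend the model continued the prompt with " the ocean floor."
completed = strip_context(prompt + " the ocean floor.", context)
print(completed)
```

The prompt would be passed to generate_sentences in place of the bare partial sentence, and the context would be stripped from each generated text before the similarity filter is applied.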

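We have walked through a very practical project that combines recent NLP tools (OpenAI GPT2, Sentence BERT, and the Berkeley Neural Parser) to generate true-or-false questions for educational content.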

This article is part of the Tencent Cloud self-media sharing program and was shared from the 相约机器人 WeChat official account.
Originally published: 2020-03-31
