One of the most useful applications of NLP is extracting information from unstructured text (contracts, financial documents, medical records, and so on), which makes automated data querying possible.
Traditionally, named entity recognition (NER) has been widely used to identify entities in text and store the data for advanced querying and filtering. However, if we want to understand unstructured text semantically, NER alone is not enough, because it does not tell us how the entities relate to each other.
Performing NER together with relation extraction opens up a whole new way of retrieving information: through a knowledge graph, you can navigate between nodes to discover hidden relationships. Performing these two tasks jointly is therefore beneficial.
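To make the idea concrete, here is a minimal plain-Python sketch (not part of the spaCy pipeline; the triples are made up for illustration) of how extracted (entity, relation, entity) triples can be stored and traversed as a small knowledge graph:

```python
from collections import defaultdict

# Hypothetical triples as produced by joint NER + relation extraction
triples = [
    ("2+ years", "EXPERIENCE_IN", "Java"),
    ("2+ years", "EXPERIENCE_IN", "C++"),
    ("Bachelor", "DEGREE_IN", "Computer Science"),
]

# Adjacency-list representation: node -> list of (relation, neighbor)
graph = defaultdict(list)
for head, rel, tail in triples:
    graph[head].append((rel, tail))

def neighbors(node, relation=None):
    """Traverse from a node to its related entities, optionally filtered by relation."""
    return [t for r, t in graph[node] if relation is None or r == relation]

print(neighbors("2+ years", "EXPERIENCE_IN"))  # ['Java', 'C++']
```

With NER alone you would only know that "2+ years" and "Java" occur in the text; the relation edges are what make this kind of traversal possible.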
Building on my previous article, where we fine-tuned a BERT model for NER using spaCy 3, we will now add relation extraction to the pipeline using spaCy's Thinc library.
We train the relation extraction model following the steps outlined in the spaCy documentation, and compare the performance of relation classifiers based on the transformer and tok2vec architectures. Finally, we test the model on a job description found online.
At its core, a relation extraction model is a classifier that predicts a relation r for a given pair of entities {e1, e2}. In the case of transformers, this classifier is added on top of the output hidden states. For more on relation extraction, read this excellent article covering the theory of fine-tuning transformer models for relation classification: https://towardsdatascience.com/bert-s-for-relation-extraction-in-nlp-2c7c3ab487c4
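As a rough toy illustration (this is not spaCy's actual Thinc implementation; the vectors, weights, and function names below are all made up), the classifier can be pictured as a linear layer over the concatenated entity representations, followed by a per-label sigmoid:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score_relations(e1_vec, e2_vec, weights, bias):
    """Score each relation label for an entity pair {e1, e2}.
    The pair representation is the concatenation [e1; e2]; each label
    gets an independent linear score squashed through a sigmoid."""
    pair = e1_vec + e2_vec  # list concatenation stands in for vector concat
    return {
        label: sigmoid(sum(wi * xi for wi, xi in zip(w, pair)) + bias[label])
        for label, w in weights.items()
    }

# Toy 2-dimensional entity vectors and hand-picked weights
scores = score_relations(
    e1_vec=[1.0, 0.0],
    e2_vec=[0.0, 1.0],
    weights={"EXPERIENCE_IN": [2.0, 0.0, 0.0, 2.0],
             "DEGREE_IN": [-2.0, 0.0, 0.0, -2.0]},
    bias={"EXPERIENCE_IN": 0.0, "DEGREE_IN": 0.0},
)
print(scores)  # EXPERIENCE_IN scores high, DEGREE_IN low for this pair
```

In the real model, the entity representations come from the transformer's hidden states, and the weights are learned during fine-tuning rather than hand-set.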
The pretrained model we will fine-tune is roberta-base, but you can use any pretrained model available in the Hugging Face library; simply enter its name in the config file (see below).
In this tutorial, we will extract the relations between the entity pair {EXPERIENCE, SKILLS} (experience) and the entity pair {DIPLOMA, DIPLOMA_MAJOR} (degree).
The goal is to extract the years of experience required for a specific skill, as well as the required diploma and diploma major. You can of course train your own relation classifier for your own use case, such as finding the cause/effect of symptoms in health records or company acquisitions in financial documents.
This tutorial covers only the entity relation extraction part. For fine-tuning BERT NER with spaCy 3, see my previous article: https://towardsdatascience.com/how-to-fine-tune-bert-transformer-with-spacy-3-6a90bfe57647
In my previous article, we used the UBIAI text annotation tool to perform joint entity and relation annotation, because its versatile interface lets us switch easily between entity and relation annotation (see below):
Demo of the UBIAI annotation interface: http://qiniu.aihubs.net/1_USiz_vUfk0nLRN4GxVQ3AA.gif
For this tutorial, I annotated only about 100 documents containing entities and relations. For production, you would certainly need more annotated data.
Before training the model, we need to convert the annotated data into binary spaCy files. We first split the annotations generated by UBIAI into training/dev/test sets and save them separately. We then modify the code provided in spaCy's tutorial repo to create binary files from our own annotations (conversion code).
We repeat this step for the training, dev, and test datasets to generate three binary spaCy files (available on GitHub).
For training, we provide the entities from our gold corpus and train the classifier on these entities.
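The split itself is straightforward. A minimal sketch (the document list and split ratios are hypothetical, not the actual UBIAI export format):

```python
import random

def split_annotations(docs, train=0.7, dev=0.15, seed=42):
    """Shuffle annotated documents and split into train/dev/test sets."""
    docs = list(docs)
    random.Random(seed).shuffle(docs)  # seeded for reproducibility
    n_train = int(len(docs) * train)
    n_dev = int(len(docs) * dev)
    return docs[:n_train], docs[n_train:n_train + n_dev], docs[n_train + n_dev:]

# With ~100 annotated documents, a 70/15/15 split:
train_set, dev_set, test_set = split_annotations(range(100))
print(len(train_set), len(dev_set), len(test_set))  # 70 15 15
```

Each resulting list is then converted to its own binary .spacy file via the conversion code.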
!pip install -U spacy-nightly --pre
!pip install -U pip setuptools wheel
!python -m spacy project clone tutorials/rel_component
!python -m spacy download en_core_web_trf
!pip install -U spacy transformers
train_file: "data/relations_training.spacy"
dev_file: "data/relations_dev.spacy"
test_file: "data/relations_test.spacy"
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "roberta-base" # 来自huggingface的Transformer模型
tokenizer_config = {"use_fast": true}
[components.relation_extractor.model.create_instance_tensor.get_instances]
@misc = "rel_instance_generator.v1"
max_length = 20
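The get_instances function decides which candidate entity pairs the classifier will see; max_length = 20 discards pairs whose entities are more than 20 tokens apart. A pure-Python sketch of that idea (entities represented as (start, end) token offsets; this is an illustration, not spaCy's actual rel_instance_generator):

```python
def get_candidate_instances(ents, max_length=20):
    """Yield ordered entity pairs whose start tokens are within max_length of each other."""
    pairs = []
    for e1 in ents:
        for e2 in ents:
            if e1 != e2 and abs(e1[0] - e2[0]) <= max_length:
                pairs.append((e1, e2))
    return pairs

# Entities at token positions (start, end)
ents = [(0, 2), (5, 6), (40, 42)]
print(get_candidate_instances(ents))
# [((0, 2), (5, 6)), ((5, 6), (0, 2))] -- (40, 42) is too far from both
```

Limiting the pair distance keeps the number of candidate instances manageable and reflects the assumption that related entities usually appear close together, e.g. within the same sentence.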
!spacy project run train_gpu # trains the transformer model
!spacy project run evaluate # evaluates on the test set
You should start seeing the P, R, and F scores being updated:
Once training is done, evaluation on the test dataset starts immediately and displays the predictions alongside the gold labels. The model is saved, along with its scores, in a folder named "training".
To train the tok2vec model, run the following command instead:
!spacy project run train_cpu # trains the tok2vec model
!spacy project run evaluate
We can compare the performance of the two models:
# Transformer model
"performance":{
"rel_micro_p":0.8476190476,
"rel_micro_r":0.9468085106,
"rel_micro_f":0.8944723618,
}
# Tok2vec model
"performance":{
"rel_micro_p":0.8604651163,
"rel_micro_r":0.7872340426,
"rel_micro_f":0.8222222222,
}
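As a sanity check, the micro F-score is the harmonic mean of micro precision and recall; recomputing it from the reported P and R reproduces the scores above:

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

print(round(f1(0.8476190476, 0.9468085106), 4))  # transformer: 0.8945
print(round(f1(0.8604651163, 0.7872340426), 4))  # tok2vec: 0.8222
```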
The transformer-based model's precision and recall are clearly better than tok2vec's, demonstrating how effective transformers are when only a small amount of annotated data is available.
Assuming we have already trained a transformer NER model, as in my previous article, we will extract entities from a job description found online (one that was not part of the training or dev set) and feed them to the relation extraction model to classify their relations.
import spacy
nlp = spacy.load("NER Model Repo/model-best")
text = ['''2+ years of non-internship professional software development experience
Programming experience with at least one modern language such as Java, C++, or C# including object-oriented design.
1+ years of experience contributing to the architecture and design (architecture, design patterns, reliability and scaling) of new and current systems.
Bachelor / MS Degree in Computer Science. Preferably a PhD in data science.
8+ years of professional experience in software development. 2+ years of experience in project management.
Experience in mentoring junior software engineers to improve their skills, and make them more effective, product software engineers.
Experience in data structures, algorithm design, complexity analysis, object-oriented design.
3+ years experience in at least one modern programming language such as Java, Scala, Python, C++, C#
Experience in professional software engineering practices & best practices for the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
Experience in communicating with users, other technical teams, and management to collect requirements, describe software product features, and technical designs.
Experience with building complex software systems that have been successfully delivered to customers
Proven ability to take a project from scoping requirements through actual launch of the project, with experience in the subsequent operation of the system in production''']
for doc in nlp.pipe(text, disable=["tagger"]):
    print(f"spans: {[(e.start, e.text, e.label_) for e in doc.ents]}")
We print the extracted entities:
spans: [(0, '2+ years', 'EXPERIENCE'), (7, 'professional software development', 'SKILLS'), (12, 'Programming', 'SKILLS'), (22, 'Java', 'SKILLS'), (24, 'C++', 'SKILLS'), (27, 'C#', 'SKILLS'), (30, 'object-oriented design', 'SKILLS'), (36, '1+ years', 'EXPERIENCE'), (41, 'contributing to the', 'SKILLS'), (46, 'design', 'SKILLS'), (48, 'architecture', 'SKILLS'), (50, 'design patterns', 'SKILLS'), (55, 'scaling', 'SKILLS'), (60, 'current systems', 'SKILLS'), (64, 'Bachelor', 'DIPLOMA'), (68, 'Computer Science', 'DIPLOMA_MAJOR'), (75, '8+ years', 'EXPERIENCE'), (82, 'software development', 'SKILLS'), (88, 'mentoring junior software engineers', 'SKILLS'), (103, 'product software engineers', 'SKILLS'), (110, 'data structures', 'SKILLS'), (113, 'algorithm design', 'SKILLS'), (116, 'complexity analysis', 'SKILLS'), (119, 'object-oriented design', 'SKILLS'), (135, 'Java', 'SKILLS'), (137, 'Scala', 'SKILLS'), (139, 'Python', 'SKILLS'), (141, 'C++', 'SKILLS'), (143, 'C#', 'SKILLS'), (148, 'professional software engineering', 'SKILLS'), (151, 'practices', 'SKILLS'), (153, 'best practices', 'SKILLS'), (158, 'software development', 'SKILLS'), (164, 'coding', 'SKILLS'), (167, 'code reviews', 'SKILLS'), (170, 'source control management', 'SKILLS'), (174, 'build processes', 'SKILLS'), (177, 'testing', 'SKILLS'), (180, 'operations', 'SKILLS'), (184, 'communicating', 'SKILLS'), (193, 'management', 'SKILLS'), (199, 'software product', 'SKILLS'), (204, 'technical designs', 'SKILLS'), (210, 'building complex software systems', 'SKILLS'), (229, 'scoping requirements', 'SKILLS')]
We successfully extracted all the skills, years of experience, diplomas, and diploma majors from the text! Next, we load the relation extraction model and classify the relations between the entities.
Note: make sure to copy rel_pipe and rel_model from the "scripts" folder into your main folder:
import random
import typer
from pathlib import Path
import spacy
from spacy.tokens import DocBin, Doc
from spacy.training.example import Example
from rel_pipe import make_relation_extractor, score_relations
from rel_model import create_relation_model, create_classification_layer, create_instances, create_tensors
# Load the relation extraction (REL) model
nlp2 = spacy.load("training/model-best")

# Take the entities generated by the NER pipeline and feed them to the REL pipeline
for name, proc in nlp2.pipeline:
    doc = proc(doc)

# Here we split the paragraph into sentences and extract the relation
# for each pair of entities found in each sentence.
for value, rel_dict in doc._.rel.items():
    for sent in doc.sents:
        for e in sent.ents:
            for b in sent.ents:
                if e.start == value[0] and b.start == value[1]:
                    if rel_dict['EXPERIENCE_IN'] >= 0.9:
                        print(f" entities: {e.text, b.text} --> predicted relation: {rel_dict}")
Here, we display all entity pairs with an EXPERIENCE_IN score above 90%:
"entities":("2+ years", "professional software development"") --> predicted relation":
{"DEGREE_IN":1.2778723e-07,"EXPERIENCE_IN":0.9694631}
"entities":"(""1+ years", "contributing to the"") -->
predicted relation":
{"DEGREE_IN":1.4581254e-07,"EXPERIENCE_IN":0.9205434}
"entities":"(""1+ years","design"") -->
predicted relation":
{"DEGREE_IN":1.8895419e-07,"EXPERIENCE_IN":0.94121873}
"entities":"(""1+ years","architecture"") -->
predicted relation":
{"DEGREE_IN":1.9635708e-07,"EXPERIENCE_IN":0.9399484}
"entities":"(""1+ years","design patterns"") -->
predicted relation":
{"DEGREE_IN":1.9823732e-07,"EXPERIENCE_IN":0.9423302}
"entities":"(""1+ years", "scaling"") -->
predicted relation":
{"DEGREE_IN":1.892173e-07,"EXPERIENCE_IN":0.96628445}
entities: ('2+ years', 'project management') -->
predicted relation:
{'DEGREE_IN': 5.175297e-07, 'EXPERIENCE_IN': 0.9911635}
"entities":"(""8+ years","software development"") -->
predicted relation":
{"DEGREE_IN":4.914319e-08,"EXPERIENCE_IN":0.994812}
"entities":"(""3+ years","Java"") -->
predicted relation":
{"DEGREE_IN":9.288566e-08,"EXPERIENCE_IN":0.99975795}
"entities":"(""3+ years","Scala"") -->
predicted relation":
{"DEGREE_IN":2.8477e-07,"EXPERIENCE_IN":0.99982494}
"entities":"(""3+ years","Python"") -->
predicted relation":
{"DEGREE_IN":3.3149718e-07,"EXPERIENCE_IN":0.9998517}
"entities":"(""3+ years","C++"") -->
predicted relation":
{"DEGREE_IN":2.2569053e-07,"EXPERIENCE_IN":0.99986637}
Remarkably, we correctly extracted almost all the years of experience together with their respective skills!
Let's also look at the degree relations between entities:
entities: ('Bachelor / MS', 'Computer Science') -->
predicted relation:
{'DEGREE_IN': 0.9943974, 'EXPERIENCE_IN': 1.8361954e-09}
entities: ('PhD', 'data science') --> predicted relation: {'DEGREE_IN': 0.98883855, 'EXPERIENCE_IN': 5.2092592e-09}
Again, we successfully extracted all the relations between diplomas and diploma majors!
This demonstrates once more how easy it is to fine-tune transformer models for a specific domain with a small amount of annotated data, whether for NER or relation extraction.
With only about a hundred annotated documents, we were able to train a relation classifier with good performance. Furthermore, we can use this initial model to automatically annotate hundreds more unlabeled examples, requiring only minimal correction. This can significantly speed up the annotation process and improve model performance.
Transformers have truly transformed the field of NLP, and I am particularly excited about their application to information extraction. The open-source solutions from Explosion AI (the spaCy developers) and Hugging Face have accelerated the adoption of transformers.