One of the most useful applications of NLP is extracting information from unstructured text (contracts, financial documents, medical records, and so on), which makes automated data querying possible.
Traditionally, named entity recognition (NER) has been widely used to identify entities in text and store the data for advanced querying and filtering. However, if we want to understand unstructured text semantically, NER alone is not enough, because it does not tell us how the entities relate to each other.
Performing NER together with relation extraction opens up a whole new way of retrieving information: through a knowledge graph, you can navigate between nodes to discover hidden relationships. Performing these two tasks jointly is therefore beneficial.
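To make the idea concrete, here is a minimal plain-Python sketch (not part of the spaCy pipeline; the triples are made up for illustration) of how extracted (entity, relation, entity) triples can be stored and traversed as a small knowledge graph:

```python
from collections import defaultdict

# Hypothetical triples as produced by joint NER + relation extraction
triples = [
    ("2+ years", "EXPERIENCE_IN", "Java"),
    ("2+ years", "EXPERIENCE_IN", "C++"),
    ("Bachelor", "DEGREE_IN", "Computer Science"),
]

# Adjacency-list representation: node -> list of (relation, neighbor)
graph = defaultdict(list)
for head, rel, tail in triples:
    graph[head].append((rel, tail))

def neighbors(node, relation=None):
    """Traverse from a node to its related entities, optionally filtered by relation."""
    return [t for r, t in graph[node] if relation is None or r == relation]

print(neighbors("2+ years", "EXPERIENCE_IN"))  # ['Java', 'C++']
```

With NER alone you would only know that "2+ years" and "Java" occur in the text; the relation edges are what make this kind of traversal possible.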
Building on my previous article, where we fine-tuned a BERT model for NER using spaCy 3, we will now add relation extraction to the pipeline using spaCy's Thinc library.
We train the relation extraction model following the steps outlined in the spaCy documentation, and compare the performance of relation classifiers based on the transformer and tok2vec architectures. Finally, we test the model on a job description found online.
At its core, a relation extraction model is a classifier that predicts a relation r for a given pair of entities {e1, e2}. In the case of transformers, this classifier is added on top of the output hidden states. For more on relation extraction, read this excellent article covering the theory of fine-tuning transformer models for relation classification: https://towardsdatascience.com/bert-s-for-relation-extraction-in-nlp-2c7c3ab487c4
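As a rough toy illustration (this is not spaCy's actual Thinc implementation; the vectors, weights, and function names below are all made up), the classifier can be pictured as a linear layer over the concatenated entity representations, followed by a per-label sigmoid:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score_relations(e1_vec, e2_vec, weights, bias):
    """Score each relation label for an entity pair {e1, e2}.
    The pair representation is the concatenation [e1; e2]; each label
    gets an independent linear score squashed through a sigmoid."""
    pair = e1_vec + e2_vec  # list concatenation stands in for vector concat
    return {
        label: sigmoid(sum(wi * xi for wi, xi in zip(w, pair)) + bias[label])
        for label, w in weights.items()
    }

# Toy 2-dimensional entity vectors and hand-picked weights
scores = score_relations(
    e1_vec=[1.0, 0.0],
    e2_vec=[0.0, 1.0],
    weights={"EXPERIENCE_IN": [2.0, 0.0, 0.0, 2.0],
             "DEGREE_IN": [-2.0, 0.0, 0.0, -2.0]},
    bias={"EXPERIENCE_IN": 0.0, "DEGREE_IN": 0.0},
)
print(scores)  # EXPERIENCE_IN scores high, DEGREE_IN low for this pair
```

In the real model, the entity representations come from the transformer's hidden states, and the weights are learned during fine-tuning rather than hand-set.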
The pretrained model we will fine-tune is roberta-base, but you can use any pretrained model available in the Hugging Face library; simply enter its name in the config file (see below).
In this tutorial, we will extract the relations between the entity pair {EXPERIENCE, SKILLS} (experience) and the entity pair {DIPLOMA, DIPLOMA_MAJOR} (degree).
The goal is to extract the years of experience required for a specific skill, as well as the required diploma and diploma major. You can of course train your own relation classifier for your own use case, such as finding the cause/effect of symptoms in health records or company acquisitions in financial documents.
This tutorial covers only the entity relation extraction part. For fine-tuning BERT NER with spaCy 3, see my previous article: https://towardsdatascience.com/how-to-fine-tune-bert-transformer-with-spacy-3-6a90bfe57647
In my previous article, we used the UBIAI text annotation tool to perform joint entity and relation annotation, because its versatile interface lets us switch easily between entity and relation annotation (see below):
Demo of the UBIAI annotation interface: http://qiniu.aihubs.net/1_USiz_vUfk0nLRN4GxVQ3AA.gif
For this tutorial, I annotated only about 100 documents containing entities and relations. For production, you would certainly need more annotated data.
Before training the model, we need to convert the annotated data into binary spaCy files. We first split the annotations generated by UBIAI into training/dev/test sets and save them separately. We then modify the code provided in spaCy's tutorial repo to create binary files from our own annotations (conversion code).
We repeat this step for the training, dev, and test datasets to generate three binary spaCy files (available on GitHub).
For training, we provide the entities from our gold corpus and train the classifier on these entities.
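The split itself is straightforward. A minimal sketch (the document list and split ratios are hypothetical, not the actual UBIAI export format):

```python
import random

def split_annotations(docs, train=0.7, dev=0.15, seed=42):
    """Shuffle annotated documents and split into train/dev/test sets."""
    docs = list(docs)
    random.Random(seed).shuffle(docs)  # seeded for reproducibility
    n_train = int(len(docs) * train)
    n_dev = int(len(docs) * dev)
    return docs[:n_train], docs[n_train:n_train + n_dev], docs[n_train + n_dev:]

# With ~100 annotated documents, a 70/15/15 split:
train_set, dev_set, test_set = split_annotations(range(100))
print(len(train_set), len(dev_set), len(test_set))  # 70 15 15
```

Each resulting list is then converted to its own binary .spacy file via the conversion code.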
!pip install -U spacy-nightly --pre
!pip install -U pip setuptools wheel
!python -m spacy project clone tutorials/rel_component
!python -m spacy download en_core_web_trf
!pip install -U spacy transformers
train_file: "data/relations_training.spacy"
dev_file: "data/relations_dev.spacy"
test_file: "data/relations_test.spacy"
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "roberta-base" # 来自huggingface的Transformer模型
tokenizer_config = {"use_fast": true}
[components.relation_extractor.model.create_instance_tensor.get_instances]
@misc = "rel_instance_generator.v1"
max_length = 20
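The get_instances function decides which candidate entity pairs the classifier will see; max_length = 20 discards pairs whose entities are more than 20 tokens apart. A pure-Python sketch of that idea (entities represented as (start, end) token offsets; this is an illustration, not spaCy's actual rel_instance_generator):

```python
def get_candidate_instances(ents, max_length=20):
    """Yield ordered entity pairs whose start tokens are within max_length of each other."""
    pairs = []
    for e1 in ents:
        for e2 in ents:
            if e1 != e2 and abs(e1[0] - e2[0]) <= max_length:
                pairs.append((e1, e2))
    return pairs

# Entities at token positions (start, end)
ents = [(0, 2), (5, 6), (40, 42)]
print(get_candidate_instances(ents))
# [((0, 2), (5, 6)), ((5, 6), (0, 2))] -- (40, 42) is too far from both
```

Limiting the pair distance keeps the number of candidate instances manageable and reflects the assumption that related entities usually appear close together, e.g. within the same sentence.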
!spacy project run train_gpu # trains the transformer model
!spacy project run evaluate # evaluates on the test set
You should start seeing the P, R, and F scores being updated:
Once training is done, evaluation on the test dataset starts immediately and displays the predictions alongside the gold labels. The model is saved, along with its scores, in a folder named "training".
To train the tok2vec model, run the following command instead:
!spacy project run train_cpu # trains the tok2vec model
!spacy project run evaluate
We can compare the performance of the two models:
# Transformer model
"performance":{
"rel_micro_p":0.8476190476,
"rel_micro_r":0.9468085106,
"rel_micro_f":0.8944723618,
}
# Tok2vec model
"performance":{
"rel_micro_p":0.8604651163,
"rel_micro_r":0.7872340426,
"rel_micro_f":0.8222222222,
}
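As a sanity check, the micro F-score is the harmonic mean of micro precision and recall; recomputing it from the reported P and R reproduces the scores above:

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

print(round(f1(0.8476190476, 0.9468085106), 4))  # transformer: 0.8945
print(round(f1(0.8604651163, 0.7872340426), 4))  # tok2vec: 0.8222
```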
The transformer-based model's precision and recall are clearly better than tok2vec's, demonstrating how effective transformers are when only a small amount of annotated data is available.
Assuming we have already trained a transformer NER model, as in my previous article, we will extract entities from a job description found online (one that was not part of the training or dev set) and feed them to the relation extraction model to classify their relations.
import spacy
nlp = spacy.load("NER Model Repo/model-best")
text = ['''2+ years of non-internship professional software development experience
Programming experience with at least one modern language such as Java, C++, or C# including object-oriented design.
1+ years of experience contributing to the architecture and design (architecture, design patterns, reliability and scaling) of new and current systems.
Bachelor / MS Degree in Computer Science. Preferably a PhD in data science.
8+ years of professional experience in software development. 2+ years of experience in project management.
Experience in mentoring junior software engineers to improve their skills, and make them more effective, product software engineers.
Experience in data structures, algorithm design, complexity analysis, object-oriented design.
3+ years experience in at least one modern programming language such as Java, Scala, Python, C++, C#
Experience in professional software engineering practices & best practices for the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
Experience in communicating with users, other technical teams, and management to collect requirements, describe software product features, and technical designs.
Experience with building complex software systems that have been successfully delivered to customers
Proven ability to take a project from scoping requirements through actual launch of the project, with experience in the subsequent operation of the system in production''']
for doc in nlp.pipe(text, disable=["tagger"]):
    print(f"spans: {[(e.start, e.text, e.label_) for e in doc.ents]}")
We print the extracted entities:
spans: [(0, '2+ years', 'EXPERIENCE'), (7, 'professional software development', 'SKILLS'), (12, 'Programming', 'SKILLS'), (22, 'Java', 'SKILLS'), (24, 'C++', 'SKILLS'), (27, 'C#', 'SKILLS'), (30, 'object-oriented design', 'SKILLS'), (36, '1+ years', 'EXPERIENCE'), (41, 'contributing to the', 'SKILLS'), (46, 'design', 'SKILLS'), (48, 'architecture', 'SKILLS'), (50, 'design patterns', 'SKILLS'), (55, 'scaling', 'SKILLS'), (60, 'current systems', 'SKILLS'), (64, 'Bachelor', 'DIPLOMA'), (68, 'Computer Science', 'DIPLOMA_MAJOR'), (75, '8+ years', 'EXPERIENCE'), (82, 'software development', 'SKILLS'), (88, 'mentoring junior software engineers', 'SKILLS'), (103, 'product software engineers', 'SKILLS'), (110, 'data structures', 'SKILLS'), (113, 'algorithm design', 'SKILLS'), (116, 'complexity analysis', 'SKILLS'), (119, 'object-oriented design', 'SKILLS'), (135, 'Java', 'SKILLS'), (137, 'Scala', 'SKILLS'), (139, 'Python', 'SKILLS'), (141, 'C++', 'SKILLS'), (143, 'C#', 'SKILLS'), (148, 'professional software engineering', 'SKILLS'), (151, 'practices', 'SKILLS'), (153, 'best practices', 'SKILLS'), (158, 'software development', 'SKILLS'), (164, 'coding', 'SKILLS'), (167, 'code reviews', 'SKILLS'), (170, 'source control management', 'SKILLS'), (174, 'build processes', 'SKILLS'), (177, 'testing', 'SKILLS'), (180, 'operations', 'SKILLS'), (184, 'communicating', 'SKILLS'), (193, 'management', 'SKILLS'), (199, 'software product', 'SKILLS'), (204, 'technical designs', 'SKILLS'), (210, 'building complex software systems', 'SKILLS'), (229, 'scoping requirements', 'SKILLS')]
We successfully extracted all the skills, years of experience, diplomas, and diploma majors from the text! Next, we load the relation extraction model and classify the relations between the entities.
Note: make sure to copy rel_pipe and rel_model from the "scripts" folder into your main folder:
import random
import typer
from pathlib import Path
import spacy
from spacy.tokens import DocBin, Doc
from spacy.training.example import Example
from rel_pipe import make_relation_extractor, score_relations
from rel_model import create_relation_model, create_classification_layer, create_instances, create_tensors
# Load the relation extraction (REL) model
nlp2 = spacy.load("training/model-best")

# Take the entities generated by the NER pipeline and feed them to the REL pipeline
for name, proc in nlp2.pipeline:
    doc = proc(doc)

# Here we split the paragraph into sentences and extract the relation
# for each pair of entities found in each sentence.
for value, rel_dict in doc._.rel.items():
    for sent in doc.sents:
        for e in sent.ents:
            for b in sent.ents:
                if e.start == value[0] and b.start == value[1]:
                    if rel_dict['EXPERIENCE_IN'] >= 0.9:
                        print(f" entities: {e.text, b.text} --> predicted relation: {rel_dict}")
Here, we display all entity pairs with an EXPERIENCE_IN score above 90%:
"entities":("2+ years", "professional software development"") --> predicted relation":
{"DEGREE_IN":1.2778723e-07,"EXPERIENCE_IN":0.9694631}
"entities":"(""1+ years", "contributing to the"") -->
predicted relation":
{"DEGREE_IN":1.4581254e-07,"EXPERIENCE_IN":0.9205434}
"entities":"(""1+ years","design"") -->
predicted relation":
{"DEGREE_IN":1.8895419e-07,"EXPERIENCE_IN":0.94121873}
"entities":"(""1+ years","architecture"") -->
predicted relation":
{"DEGREE_IN":1.9635708e-07,"EXPERIENCE_IN":0.9399484}
"entities":"(""1+ years","design patterns"") -->
predicted relation":
{"DEGREE_IN":1.9823732e-07,"EXPERIENCE_IN":0.9423302}
"entities":"(""1+ years", "scaling"") -->
predicted relation":
{"DEGREE_IN":1.892173e-07,"EXPERIENCE_IN":0.96628445}
entities: ('2+ years', 'project management') -->
predicted relation:
{'DEGREE_IN': 5.175297e-07, 'EXPERIENCE_IN': 0.9911635}
"entities":"(""8+ years","software development"") -->
predicted relation":
{"DEGREE_IN":4.914319e-08,"EXPERIENCE_IN":0.994812}
"entities":"(""3+ years","Java"") -->
predicted relation":
{"DEGREE_IN":9.288566e-08,"EXPERIENCE_IN":0.99975795}
"entities":"(""3+ years","Scala"") -->
predicted relation":
{"DEGREE_IN":2.8477e-07,"EXPERIENCE_IN":0.99982494}
"entities":"(""3+ years","Python"") -->
predicted relation":
{"DEGREE_IN":3.3149718e-07,"EXPERIENCE_IN":0.9998517}
"entities":"(""3+ years","C++"") -->
predicted relation":
{"DEGREE_IN":2.2569053e-07,"EXPERIENCE_IN":0.99986637}
Remarkably, we correctly extracted almost all the years of experience together with their respective skills!
Let's also look at the degree relations between entities:
entities: ('Bachelor / MS', 'Computer Science') -->
predicted relation:
{'DEGREE_IN': 0.9943974, 'EXPERIENCE_IN': 1.8361954e-09}
entities: ('PhD', 'data science') --> predicted relation: {'DEGREE_IN': 0.98883855, 'EXPERIENCE_IN': 5.2092592e-09}
Again, we successfully extracted all the relations between diplomas and diploma majors!
This demonstrates once more how easy it is to fine-tune transformer models for a specific domain with a small amount of annotated data, whether for NER or relation extraction.
With only about a hundred annotated documents, we were able to train a relation classifier with good performance. Furthermore, we can use this initial model to automatically annotate hundreds more unlabeled examples, requiring only minimal correction. This can significantly speed up the annotation process and improve model performance.
Transformers have truly transformed the field of NLP, and I am particularly excited about their application to information extraction. The open-source solutions from Explosion AI (the spaCy developers) and Hugging Face have accelerated the adoption of transformers.