
How do I read an .ann file produced by the brat annotation tool and convert it to a DataFrame in Python?

Stack Overflow user
Asked 2020-10-28 04:40:48
3 answers · 820 views · 0 followers · score 1

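I am working on sequence labeling classification based on the IOB scheme.

First I want to read my corpus and its labels, but the corpus is stored in a format called .ann, which I have never worked with before. It was annotated with https://brat.nlplab.org/, and when I open a file I see this: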

T1  Claim 78 140    competition can effectively promote the development of economy
A1  Stance T1 Against
T2  MajorClaim 503 550  we should attach more importance to cooperation
T3  Premise 142 283 In order to survive in the competition, companies continue to improve their products and service, and as a result, the whole society prospers
T4  Claim 591 714   through cooperation, children can learn about interpersonal skills which are significant in the future life of all students
A2  Stance T4 For
T5  Premise 716 851 What we acquired from team work is not only how to achieve the same goal with others but more importantly, how to get along with others
T6  Premise 853 1086    During the process of cooperation, children can learn about how to listen to opinions of others, how to communicate with others, how to think comprehensively, and even how to compromise with other team members when conflicts occurred
T7  Premise 1088 1191   All of these skills help them to get on well with other people and will benefit them for the whole life
T8  Claim 1332 1376 competition makes the society more effective
A3  Stance T8 Against
T9  Premise 1212 1301   the significance of competition is that how to become more excellence to gain the victory
T10 Premise 1387 1492   when we consider about the question that how to win the game, we always find that we need the cooperation
T11 Premise 1549 1846   Take Olympic games which is a form of competition for instance, it is hard to imagine how an athlete could win the game without the training of his or her coach, and the help of other professional staffs such as the people who take care of his diet, and those who are in charge of the medical care
T12 Premise 1848 1915   The winner is the athlete but the success belongs to the whole team
T13 Claim 1927 1992 without the cooperation, there would be no victory of competition
A4  Stance T13 For
T14 Claim 2154 2231 a more cooperative attitudes towards life is more profitable in one's success
A5  Stance T14 For
R1  supports Arg1:T3 Arg2:T1    
R2  attacks Arg1:T1 Arg2:T2 
R3  supports Arg1:T5 Arg2:T4    
R4  supports Arg1:T6 Arg2:T4    
R5  supports Arg1:T7 Arg2:T4    
R6  supports Arg1:T9 Arg2:T8    
R7  supports Arg1:T11 Arg2:T12  
R8  supports Arg1:T12 Arg2:T13  
R9  supports Arg1:T10 Arg2:T13  
R10 supports Arg1:T4 Arg2:T2    
R11 attacks Arg1:T8 Arg2:T2 
R12 supports Arg1:T13 Arg2:T2   
R13 supports Arg1:T14 Arg2:T2   

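I want to decode it easily and save my data as a DataFrame in the following format:

sentences with their labels (Claim, Premise, or MajorClaim, as you can see in the text)

Something like the following:

sentence with label

I tried using this code to read the .txt files: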

import os
import pandas as pd

# `path` is the corpus root directory, defined elsewhere.
myList = []  # full text of every .txt file found under `path`
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith('.txt'):
            with open(os.path.join(root, file), 'r', encoding="utf-8") as f:
                myList.append(f.read())

# One row per file, indexed from 1.
df = pd.DataFrame({"Paragraph": myList}, index=range(1, len(myList) + 1))

But I don't know how to handle the .ann files that brat produces.

3 Answers

Stack Overflow user

Answered 2020-10-28 05:21:04

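I'm not sure exactly how you need this DataFrame formatted, but if two columns are enough, you can use a regular-expression separator that matches up to the first whitespace character and pass it as the delimiter when reading the file into pandas.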

import pandas as pd
df = pd.read_csv('test.ann', sep=r'^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)

This works on the sample you posted above when it is saved as an .ann file.
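To see what that separator does, here is a small sketch that applies the same pattern to one line from the question using only the standard `re` module. It assumes the fields are tab-separated, as in brat's standoff format (the tabs appear as spaces in the listing above); the variable names are illustrative only.

```python
import re

# One text-bound annotation line from the question.
# Assumption: the real file separates ID, "type start end", and text with tabs.
line = "T1\tClaim 78 140\tcompetition can effectively promote the development of economy"

# The pattern matches the leading ID plus the first whitespace character.
# Because the ID is in a capture group, re.split keeps it in the result.
parts = re.split(r'^([^\s]*)\s', line)

# parts[0] is the empty string before the match (the column the answer drops),
# parts[1] is the ID, and parts[2] is everything after the first whitespace.
ann_id, remainder = parts[1], parts[2]
```

Note that `remainder` still bundles the label, the character offsets, and the sentence together, so further splitting is needed if only the label and sentence are wanted.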

Score: 1

Stack Overflow user

Answered 2020-10-28 05:58:28

My question received this answer:

df = pd.read_csv(r'..\data\corpus02\essay01.ann', sep=r'^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)

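But this is not what I want.

I only want to extract the sentences and their labels (Claim, Premise, MajorClaim), but this only splits off the T1, T2, ... identifiers, which I don't want.

Only the sentences and their labels.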
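One way to get just the sentences and labels is to skip pandas' CSV reader and split the lines by hand. The sketch below is not from the thread: it assumes tab-separated fields as in brat's standoff format, and the helper names `parse_text_bound` and `to_dataframe` are hypothetical.

```python
def parse_text_bound(ann_text):
    """Return one {"sentence", "label"} dict per text-bound (T...) annotation."""
    rows = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip attribute (A...) and relation (R...) lines
        fields = line.split("\t", 2)
        if len(fields) < 3:
            continue  # skip malformed lines with no text field
        _ann_id, type_and_span, text = fields
        label = type_and_span.split(" ", 1)[0]  # first token is the label
        rows.append({"sentence": text, "label": label})
    return rows


def to_dataframe(rows):
    """Build the DataFrame; pandas is imported here so parsing needs no dependencies."""
    import pandas as pd
    return pd.DataFrame(rows, columns=["sentence", "label"])


# A few lines from the question, with tabs restored (an assumption about the file).
sample = (
    "T1\tClaim 78 140\tcompetition can effectively promote the development of economy\n"
    "A1\tStance T1 Against\n"
    "T2\tMajorClaim 503 550\twe should attach more importance to cooperation\n"
    "R2\tattacks Arg1:T1 Arg2:T2\t\n"
)
rows = parse_text_bound(sample)
```

Calling `to_dataframe(rows)` would then give a two-column frame with one row per annotated sentence.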

Score: 0

Stack Overflow user

Answered 2020-10-28 06:03:36

This seems to be the best approach.

from brat_parser import get_entities_relations_attributes_groups

entities, relations, attributes, groups = get_entities_relations_attributes_groups(r"..\data\corpus02\essay01.ann")

With this approach I can read the .ann file! The entities dictionary looks like this:

{'T1': Entity(id='T1', type='Claim', span=((78, 140),), text='competition can effectively promote the development of economy'),
 'T2': Entity(id='T2', type='MajorClaim', span=((503, 550),), text='we should attach more importance to cooperation'),
 'T3': Entity(id='T3', type='Premise', span=((142, 283),), text='In order to survive in the competition, companies continue to improve their products and service, and as a result, the whole society prospers'),
 'T4': Entity(id='T4', type='Claim', span=((591, 714),), text='through cooperation, children can learn about interpersonal skills which are significant in the future life of all students'),
 'T5': Entity(id='T5', type='Premise', span=((716, 851),), text='What we acquired from team work is not only how to achieve the same goal with others but more importantly, how to get along with others'),
 'T6': Entity(id='T6', type='Premise', span=((853, 1086),), text='During the process of cooperation, children can learn about how to listen to opinions of others, how to communicate with others, how to think comprehensively, and even how to compromise with other team members when conflicts occurred'),
 'T7': Entity(id='T7', type='Premise', span=((1088, 1191),), text='All of these skills help them to get on well with other people and will benefit them for the whole life'),
 'T8': Entity(id='T8', type='Claim', span=((1332, 1376),), text='competition makes the society more effective'),
 'T9': Entity(id='T9', type='Premise', span=((1212, 1301),), text='the significance of competition is that how to become more excellence to gain the victory'),
 'T10': Entity(id='T10', type='Premise', span=((1387, 1492),), text='when we consider about the question that how to win the game, we always find that we need the cooperation'),
 'T11': Entity(id='T11', type='Premise', span=((1549, 1846),), text='Take Olympic games which is a form of competition for instance, it is hard to imagine how an athlete could win the game without the training of his or her coach, and the help of other professional staffs such as the people who take care of his diet, and those who are in charge of the medical care'),
 'T12': Entity(id='T12', type='Premise', span=((1848, 1915),), text='The winner is the athlete but the success belongs to the whole team'),
 'T13': Entity(id='T13', type='Claim', span=((1927, 1992),), text='without the cooperation, there would be no victory of competition'),
 'T14': Entity(id='T14', type='Claim', span=((2154, 2231),), text="a more cooperative attitudes towards life is more profitable in one's success")}

That is the result. Converting it to a DataFrame should not be hard.
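As a rough sketch of that last step, the dictionary can be flattened into one row per entity. To stay runnable without brat_parser installed, the example uses a stand-in `Entity` namedtuple that mirrors only the fields visible in the printed output; the helper names `entities_to_rows` and `rows_to_dataframe` are hypothetical, and sorting by start offset is an added choice to restore document order.

```python
from collections import namedtuple

# Stand-in for the Entity objects printed above (assumption: only these
# four fields are needed; the real class comes from brat_parser).
Entity = namedtuple("Entity", ["id", "type", "span", "text"])

entities = {
    "T1": Entity("T1", "Claim", ((78, 140),),
                 "competition can effectively promote the development of economy"),
    "T2": Entity("T2", "MajorClaim", ((503, 550),),
                 "we should attach more importance to cooperation"),
    "T3": Entity("T3", "Premise", ((142, 283),),
                 "In order to survive in the competition, companies continue to improve "
                 "their products and service, and as a result, the whole society prospers"),
}


def entities_to_rows(entities):
    """Flatten the entity dict into plain dicts, ordered by start offset."""
    rows = [
        {
            "id": e.id,
            "label": e.type,
            "start": e.span[0][0],   # start of the first span fragment
            "end": e.span[-1][1],    # end of the last span fragment
            "sentence": e.text,
        }
        for e in entities.values()
    ]
    return sorted(rows, key=lambda r: r["start"])


def rows_to_dataframe(rows):
    """Build the DataFrame; pandas is imported lazily so the rest runs without it."""
    import pandas as pd
    columns = ["id", "label", "start", "end", "sentence"]
    return pd.DataFrame(rows, columns=columns).set_index("id")


rows = entities_to_rows(entities)
```

With the real parser output, passing `entities` straight to `entities_to_rows` and then to `rows_to_dataframe` should work the same way, provided the attribute names match those shown in the printed result.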

Score: 0

Original page content provided by Stack Overflow; translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/64562582
