I want to format the data stored in df so that it can be used in an NER model. I'm starting with data in 2 columns, for example:
df['text']          df['annotation']
some text           [('Consequence', 23, 47)]
some other text     [('Consequence', 33, 46), ('Cause', 101, 150)]

and need to format it as:
TRAIN_DATA = [('some text', {'entities': [(23, 47, 'Consequence')]}), ('some other text', {'entities': [(33, 46, 'Consequence'), (101, 150, 'Cause')]})]

I have been trying to iterate over each row, for example:
TRAIN_DATA = []
for row in df['annotation']:
    entities = []
    label, start, end = entity
    entities.append((start, end, label))
    # add to dataset
    TRAIN_DATA.append((df['text'], {'entities': entities}))

However, I can't get it to iterate over each row to populate TRAIN_DATA. Sometimes the annotation column contains multiple entities.
If anyone could point out where I went wrong and how to correct it, I'd be very grateful!
Posted on 2022-02-16 18:25:17
You can use the zip() function:
TRAIN_DATA = [
    (t, {"entities": [(s, e, l) for (l, s, e) in a]})
    for t, a in zip(df["text"], df["annotation"])
]
print(TRAIN_DATA)

Prints:
[
    ("some text", {"entities": [(23, 47, "Consequence")]}),
    (
        "some other text",
        {"entities": [(33, 46, "Consequence"), (101, 150, "Cause")]},
    ),
]

https://stackoverflow.com/questions/71147296
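
For comparison with the loop in the question, here is a minimal sketch of the same conversion written as explicit loops. The attempt in the question never loops over the tuples inside each row (so `entity` is undefined), and it appends the whole `df['text']` column instead of the text of a single row. Note that `build_train_data` is a hypothetical helper name, and plain lists stand in for the two DataFrame columns so the snippet runs on its own; passing `df["text"]` and `df["annotation"]` should work the same way, assuming each annotation cell holds a list of `(label, start, end)` tuples as shown in the question.

```python
def build_train_data(texts, annotations):
    """Pair each text with its entities, reordered to (start, end, label)."""
    train_data = []
    # zip walks both columns in step, so each text stays with its own annotations
    for text, row_entities in zip(texts, annotations):
        entities = []
        # inner loop: a row may hold one or several (label, start, end) tuples
        for label, start, end in row_entities:
            entities.append((start, end, label))
        # append this row's text only, not the whole column
        train_data.append((text, {"entities": entities}))
    return train_data


# Plain lists stand in for df["text"] and df["annotation"] (an assumption for this sketch)
texts = ["some text", "some other text"]
annotations = [
    [("Consequence", 23, 47)],
    [("Consequence", 33, 46), ("Cause", 101, 150)],
]
TRAIN_DATA = build_train_data(texts, annotations)
print(TRAIN_DATA)
```

The list comprehension in the answer does the same work in one expression; the explicit version is mainly useful if per-row validation or logging needs to be added later.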