Find keyword + 1 and create a new column

Stack Overflow user

Asked 2019-05-28 02:02:58

Answers: 3, Views: 65, Followers: 0, Score: 1

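Goal:

1) Find the word next to a keyword (e.g. brca)

2) Create a new column containing that word

Background:

1) I have a list l from which I build a DataFrame df, and I extract the word brca from it with the following code: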

import pandas as pd

l = ['carcinoma brca positive completion mastectomy',
     'clinical brca gene mutation',
     'carcinoma brca positive chemotherapy']
df = pd.DataFrame(l, columns=['Text'])
df['Gene'] = df['Text'].str.extract(r"(brca)")

Output:

                                                Text    Gene
0   breast invasive lobular carcinoma brca positiv...   brca
1   clinical history brca gene mutation . gross de...   brca
2   left breast invasive ductal carcinoma brca pos...   brca

Problem:

However, I am now trying to find, for each row, the word that follows brca and put it in a new column.

Desired output:

                                                Text    Gene  NextWord
0   breast invasive lobular carcinoma brca positiv...   brca  positive
1   clinical history brca gene mutation . gross de...   brca  gene
2   left breast invasive ductal carcinoma brca pos...   brca  positive

I have looked at "python pandas dataframe words in context: get 3 words before and after" and "PANDAS Finding the exact word and before word in a column of string and append that new column in python (pandas) column", but neither quite fits my case.

Question:

How can I achieve my goal?

Tags: regex, pandas, text, nlp, keyword

3 Answers

Stack Overflow user

Accepted answer

Answered 2019-05-28 02:24:26

We can use Python's built-in str.partition method:

df['NextWord'] = df['Text'].apply(lambda x: x.partition('brca')[2]).str.split().str[0]

Output

                                            Text  Gene  NextWord
0  carcinoma brca positive completion mastectomy  brca  positive
1                    clinical brca gene mutation  brca      gene
2           carcinoma brca positive chemotherapy  brca  positive

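Explanation

.partition splits on the first occurrence of the keyword and returns three values:

the string before the keyword
the keyword itself
the string after the keyword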

string = 'carcinoma brca positive completion mastectomy'

before, keyword, after = string.partition('brca')

print(before)
print(keyword)
print(after)

Output

carcinoma 
brca
 positive completion mastectomy
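
The partition behaviour described above can be wrapped in a small helper to see what happens when the keyword is missing or is the last word. This is only a sketch: next_word is a hypothetical name, not part of the answer. When the keyword is absent, partition returns the whole string followed by two empty strings, so nothing is left to split.

```python
def next_word(text, keyword):
    """Return the first whitespace-delimited token after `keyword`, or None.

    Hypothetical helper for illustration only.
    """
    # partition splits on the first occurrence; if the keyword is absent
    # it returns (text, '', ''), so `after` is an empty string.
    _before, _found, after = text.partition(keyword)
    tokens = after.split()
    return tokens[0] if tokens else None


print(next_word('carcinoma brca positive completion mastectomy', 'brca'))  # positive
print(next_word('no keyword in this sentence', 'brca'))                    # None
print(next_word('ends with brca', 'brca'))                                 # None
```

Because partition matches substrings, a text such as 'brca1 positive' would yield '1' rather than 'positive'.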

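Speed

I was curious how the answers compare on speed, since mine uses .apply, albeit with a built-in string method. Surprisingly, my answer turned out to be the fastest: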

dfbig = pd.concat([df]*10000, ignore_index=True)
dfbig.shape

(30000, 2)

%%timeit
dfbig['Text'].apply(lambda x: x.partition('brca')[2]).str.split().str[0]
31.5 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
dfbig['NextWord'] = dfbig['Text'].str.split('brca').str[1].str.split(r'\s').str[1]
74.5 ms ± 2.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
dfbig['NextWord'] = dfbig['Text'].str.extract(r"(?<=brca)(.+?) ")
40.7 ms ± 2.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
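
%%timeit is only available in IPython and Jupyter. Outside of them, a comparable measurement could be sketched with the standard timeit module, as below. The via_* function names are illustrative, and the numbers will vary with the machine and the pandas version, so no timings are claimed here.

```python
import timeit

import pandas as pd

texts = ['carcinoma brca positive completion mastectomy',
         'clinical brca gene mutation',
         'carcinoma brca positive chemotherapy']
dfbig = pd.DataFrame(texts * 10000, columns=['Text'])  # 30000 rows


def via_partition():
    return dfbig['Text'].apply(lambda x: x.partition('brca')[2]).str.split().str[0]


def via_split():
    return dfbig['Text'].str.split('brca').str[1].str.split(r'\s').str[1]


def via_extract():
    # expand=False returns a Series instead of a one-column DataFrame
    return dfbig['Text'].str.extract(r"(?<=brca)(.+?) ", expand=False)


for fn in (via_partition, via_split, via_extract):
    # Average wall-clock time over 10 calls
    seconds = timeit.timeit(fn, number=10) / 10
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms per call")
```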

Score: 0

Stack Overflow user

Answered 2019-05-28 02:24:38

Leaning heavily on the pandas Series.str accessor:

df['NextWord'] = df['Text'].str.split('brca').str[1].str.split(r'\s').str[1]
df

                                            Text  Gene  NextWord
0  carcinoma brca positive completion mastectomy  brca  positive
1                    clinical brca gene mutation  brca      gene
2           carcinoma brca positive chemotherapy  brca  positive
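
To see why the chain ends in .str[1] rather than .str[0], it may help to look at the intermediate results on a single row. This is a minimal sketch; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(['carcinoma brca positive completion mastectomy'])

# Everything after the first 'brca' keeps its leading space
after_keyword = s.str.split('brca').str[1]
print(after_keyword.tolist())

# Splitting on a single whitespace character leaves an empty first token
tokens = after_keyword.str.split(r'\s')
print(list(tokens.iloc[0]))

# So the next word sits at position 1, not 0
print(tokens.str[1].tolist())
```

The empty first token comes from the space directly after the keyword, which is why this approach depends on the keyword being followed by exactly one whitespace character.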

Score: 0

Stack Overflow user

Answered 2019-05-28 02:29:50

Use:

import pandas as pd

l = ['carcinoma brca positive completion mastectomy',
     'clinical brca gene mutation',
     'carcinoma brca positive chemotherapy']
df = pd.DataFrame(l, columns=['Text'])

df['NextWord'] = df['Text'].str.extract(r"(?<=brca)(.+?) ")
print(df)

Output:

                                            Text   NextWord
0  carcinoma brca positive completion mastectomy   positive
1                    clinical brca gene mutation       gene
2           carcinoma brca positive chemotherapy   positive
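
Note that the pattern above captures the space after brca as part of the group, and it only matches when the next word is itself followed by a space. A possible variant, not taken from the answer, is sketched below; the \b anchor and the third sample row are assumptions added for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Text': [
    'carcinoma brca positive completion mastectomy',
    'clinical brca gene mutation',
    'ends with brca positive',   # extra row: next word is the last word
]})

# Pattern from the answer: the captured group keeps the leading space,
# and the third row does not match because no space follows 'positive'.
loose = df['Text'].str.extract(r"(?<=brca)(.+?) ", expand=False)

# Variant: consume the whitespace outside the group and capture the
# following run of non-whitespace characters.
strict = df['Text'].str.extract(r"\bbrca\s+(\S+)", expand=False)

print(loose.tolist())
print(strict.tolist())   # ['positive', 'gene', 'positive']
```

The \b keeps brca from matching in the middle of a longer word; drop it if substring matches are wanted.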

Score: 0

Original page content provided by Stack Overflow; translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/56330620
