Q: Find dictionary values in paragraphs, and return NA if a paragraph contains none

Stack Overflow user

Asked 2019-07-19 11:23:08

2 answers, 93 views, 0 followers, score 1

Suppose I have some random paragraphs stored in a list:

t = ['protein and carbohydrates Its is a little heavier pulsus widely used and is a versatile ingredient',
 'Tea contains the goodness of  Natural Ingredients Cardamom Ginger Tea bags Disclaimers As per Ayurvedic texts',
 'almonds are all natural supreme sized nuts they are highly nutritious and extremely healthy',
 'Camel milk can be consumed by lactose intolerant people and those allergic to cows milk',
 'Healthy Crunch  Almond with honey is an extra crunchy breakfast cereal for a delightful start to your mornings']

and a dictionary:

d = {'First': ['Tea','Coffee'],
     'Second':  ['Noodles','Pasta'],
     'Third': ['sandwich','honey'],
     'Fourth': ['Almond','apricot','blueberry']
    }

The code I wrote is very slow, and I would also like to show 'NA' for paragraphs that do not match any dictionary value.

Code:

import pandas as pd

get_labels = []
get_text = []

for txt in t:
    for dictrow in d.values():
        for i in dictrow:
            for j in txt.split():
                if i == j:
                    print(j)
                    print(txt)
                    get_labels.append(j)
                    get_text.append(txt)


pd.DataFrame(list(zip(get_text,get_labels)),columns=["whole_text","matched_text"])

After creating the DataFrame, the output is:

     whole_text                                       matched_text
0   Tea contains the goodness of Natural Ingredie...    Tea
1   Tea contains the goodness of Natural Ingredie...    Tea
2   Healthy Crunch Almond with honey is an extra ...    honey
3   Healthy Crunch Almond with honey is an extra ...    Almond

But the output I want is:

     whole_text                                       matched_text
0   protein and carbohydrates Its is a little ....      NA 
1   Tea contains the goodness of Natural Ingredie...    Tea
2   Tea contains the goodness of Natural Ingredie...    Tea
3   almonds are all natural supreme sized nuts th...    NA
4   Camel milk can be consumed by lactose intoler...    NA
5   Healthy Crunch Almond with honey is an extra ...    honey
6   Healthy Crunch Almond with honey is an extra ...    Almond

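I have two questions:

a) I want to add 'NA' for paragraphs that do not match any dictionary value, as in the table above.

b) How can I optimize this code so that it runs faster, since I am using it on a large dataset?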

Tags: list, performance, dictionary, python

2 Answers

Stack Overflow user

Accepted answer

Answered 2019-07-19 11:40:12

Using the power of set intersection:

import pandas as pd

paragraphs = ['protein and carbohydrates Its is a little heavier pulsus widely used and is a versatile ingredient',
 'Tea contains the goodness of  Natural Ingredients Cardamom Ginger Tea bags Disclaimers As per Ayurvedic texts',
 'almonds are all natural supreme sized nuts they are highly nutritious and extremely healthy',
 'Camel milk can be consumed by lactose intolerant people and those allergic to cows milk',
 'Healthy Crunch  Almond with honey is an extra crunchy breakfast cereal for a delightful start to your mornings']

d = {'First': ['Tea', 'Coffee'],
     'Second':  ['Noodles', 'Pasta'],
     'Third': ['sandwich', 'honey'],
     'Fourth': ['Almond', 'apricot','blueberry']
}

words = set(w for lst in d.values() for w in lst)
match_stats = {'whole_text': [], 'matched_text': []}
for p in paragraphs:
    common_words = set(p.split()) & words
    if not common_words:
        match_stats['whole_text'].append(p)
        match_stats['matched_text'].append('NA')
    else:
        for w in common_words:
            match_stats['whole_text'].append(p)
            match_stats['matched_text'].append(w)

df = pd.DataFrame(match_stats)
print(df)

Output:

                                          whole_text matched_text
0  protein and carbohydrates Its is a little heav...           NA
1  Tea contains the goodness of  Natural Ingredie...          Tea
2  almonds are all natural supreme sized nuts the...           NA
3  Camel milk can be consumed by lactose intolera...           NA
4  Healthy Crunch  Almond with honey is an extra ...        honey
5  Healthy Crunch  Almond with honey is an extra ...       Almond
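Note that the set intersection collapses repeated matches (the 'Tea' paragraph yields one row here, while the desired output in the question lists it twice), and the iteration order of a set is not guaranteed. If repeats and reading order matter, a minimal sketch along the following lines may help. The helper name match_paragraphs and the shortened sample paragraphs are assumptions made for illustration; they are not part of the original answer.

```python
def match_paragraphs(paragraphs, words):
    """Return (paragraph, matched_word) rows; 'NA' when nothing matches."""
    rows = []
    for p in paragraphs:
        # One set lookup per token; keeps reading order and repeated tokens.
        hits = [tok for tok in p.split() if tok in words]
        if hits:
            rows.extend((p, tok) for tok in hits)
        else:
            rows.append((p, 'NA'))
    return rows


d = {'First': ['Tea', 'Coffee'],
     'Third': ['sandwich', 'honey'],
     'Fourth': ['Almond', 'apricot', 'blueberry']}
words = {w for lst in d.values() for w in lst}

# Shortened, hypothetical paragraphs for illustration only.
paragraphs = ['Tea contains Cardamom Ginger Tea bags',
              'almonds are all natural',
              'Healthy Crunch  Almond with honey']

rows = match_paragraphs(paragraphs, words)
for row in rows:
    print(row)
```

The resulting rows can be passed to pd.DataFrame(rows, columns=["whole_text", "matched_text"]). Matches appear in the order the words occur in each paragraph, which differs from the dictionary order used by the loop in the question.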

Score: 1

Stack Overflow user

Answered 2019-07-19 13:04:11

You can use the in operator:

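import pandas as pd

# t and d are the paragraph list and dictionary from the question.
values = set(v for l in d.values() for v in l)
txt_and_label = []  # list of (text, label) tuples

for line in t:
    # collect (line, value) for every dictionary value that occurs in line as a substring
    match = [(line, v) for v in values if v in line]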
    if not match:
        match = [(line, 'NA')]
    txt_and_label.extend(match)

pd.DataFrame(txt_and_label, columns=["whole_text", "matched_text"])
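One caveat: applied to strings, the in operator tests for substrings, so a value such as 'Tea' would also match inside a longer word. If whole-word matching is wanted, one possible alternative (a swapped-in technique, not what this answer uses) is a single compiled regular expression with word boundaries. The sketch below assumes the dictionary values consist only of word characters; the sample line is hypothetical.

```python
import re

words = ['Tea', 'Coffee', 'honey', 'Almond']

# One alternation of all escaped words, anchored by word boundaries.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b')

# Hypothetical sample line chosen to show the difference.
line = 'Teapot with honeycomb and Almond'

substring_hits = [w for w in words if w in line]   # substring test
whole_word_hits = pattern.findall(line)            # whole-word test

print(substring_hits)   # ['Tea', 'honey', 'Almond']
print(whole_word_hits)  # ['Almond']
```

The substring test reports 'Tea' and 'honey' because they occur inside 'Teapot' and 'honeycomb', while the word-boundary pattern reports only 'Almond'.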

Score: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/57111483
