Q: Remove a list of phrases from a string

Stack Overflow user

Asked 2020-06-18 06:16:22

Answers: 4 · Views: 58 · Followers: 0 · Votes: 0

I have a list of phrases (n-grams) that need to be removed from a given sentence.

    removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
    sentence = 'Oranges are the main ingredient for a wide range of food and drinks'

I want to get:

    new_sentence = 'Oranges are the main ingredient for a wide of'

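I tried the approach from "Remove list of phrases from string", but it didn't work: 'Oranges' became 'O', and 'drinks' was removed on its own instead of the whole phrase 'food and drinks'.

Does anyone know how to solve this? Thanks!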

python

string

text

4 Answers

Stack Overflow user

Accepted answer

Posted 2020-06-18 06:37:39

Regex time!

In [116]: import re
     ...: removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
     ...: # Longest phrases first, so 'food and drinks' is removed before 'drinks'
     ...: removed = sorted(removed, key=len, reverse=True)
     ...: sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
     ...: new_sentence = sentence
     ...: # \b anchors each phrase at word boundaries; re.escape guards against regex metacharacters
     ...: removals = [r'\b' + re.escape(phrase) + r'\b' for phrase in removed]
     ...: for removal in removals:
     ...:     new_sentence = re.sub(removal, '', new_sentence)
     ...: # Collapse the leftover runs of whitespace
     ...: new_sentence = ' '.join(new_sentence.split())
     ...: print(sentence)
     ...: print(new_sentence)
Oranges are the main ingredient for a wide range of food and drinks
Oranges are the main ingredient for a wide of

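Votes: 0

Stack Overflow user

Posted 2020-06-18 06:34:08

Since you only want to match whole words, I think the first step is to turn everything into lists of words, then iterate over the phrases from longest to shortest to find what to remove: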

>>> removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
>>> sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
>>> words = sentence.split()
>>> for ngram in sorted([r.split() for r in removed], key=len, reverse=True):
...     for i in range(len(words) - len(ngram)+1):
...         if words[i:i+len(ngram)] == ngram:
...             words = words[:i] + words[i+len(ngram):]
...             break
...
>>> " ".join(words)
'Oranges are the main ingredient for a wide of'

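Note that this simple approach has some flaws. Only the first occurrence of each n-gram is removed, because the loop breaks after the first match. You also can't simply keep iterating after modifying words, since the range was computed from the old length. If you want to handle repeated occurrences, you need to batch the updates.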
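One way to batch the updates, as a minimal sketch rather than part of the original answer: rebuild the word list in a single scan per n-gram, skipping every match instead of slicing in place. The function name `remove_ngrams` is hypothetical.

```python
def remove_ngrams(sentence, phrases):
    """Remove every whole-word occurrence of each phrase (hypothetical helper)."""
    words = sentence.split()
    # Longest n-grams first, so 'food and drinks' is handled before 'drinks'.
    ngrams = sorted((p.split() for p in phrases), key=len, reverse=True)
    for ngram in ngrams:
        n = len(ngram)
        if n == 0:
            continue  # an empty phrase would match everywhere; skip it
        kept = []
        i = 0
        while i < len(words):
            if words[i:i + n] == ngram:
                i += n  # skip the matched n-gram entirely
            else:
                kept.append(words[i])
                i += 1
        words = kept
    return " ".join(words)


removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
print(remove_ngrams(sentence, removed))
```

Because this compares whitespace-separated tokens, a word with attached punctuation such as 'drinks.' would not match 'drinks'; the regex-based answers behave differently there.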

Votes: 1

Stack Overflow user

Posted 2020-06-18 06:51:21

    import re

    removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
    sentence = 'Oranges are the main ingredient for a wide range of food and drinks'

    # sort the phrases by length, longest first
    removed = sorted(removed, key=len, reverse=True)

    # match whole words only; re.escape guards against regex metacharacters
    for r in removed:
        sentence = re.sub(r"\b{}\b".format(re.escape(r)), " ", sentence)

    # replace multiple whitespaces with a single one and trim the ends
    sentence = re.sub(' +', ' ', sentence).strip()

I hope this helps. First, you need to sort the strings to remove by length, so that 'food and drinks' is replaced before 'drinks'.
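The same longest-first idea can also be folded into a single compiled pattern, so the sentence is scanned once instead of once per phrase. This is a sketch under that assumption, not the answerer's code; `remove_phrases` is a hypothetical name.

```python
import re


def remove_phrases(sentence, phrases):
    """Remove whole-word phrases in one regex pass (hypothetical helper)."""
    if not phrases:
        return sentence
    # Alternation tries branches left to right, so list longer phrases first.
    ordered = sorted(phrases, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    pattern = re.compile(r"\b(?:" + alternation + r")\b")
    cleaned = pattern.sub(" ", sentence)
    # Collapse the whitespace left behind by the removals.
    return " ".join(cleaned.split())


removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
print(remove_phrases(sentence, removed))
```

In this form the ordering matters mainly when one phrase is a prefix of another (for example 'food' and 'food and drinks'), since the regex engine takes the first branch that matches at a given position.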

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link:

https://stackoverflow.com/questions/62439244
