Q: Python: untokenize a sentence

Stack Overflow user

Asked 2014-02-22 08:42:05

9 answers, 39.7K views, 0 followers, 38 votes

There are plenty of guides on how to tokenize a sentence, but I didn't find any on how to do the opposite.

import nltk
words = nltk.word_tokenize("I've found a medicine for my disease.")
# The result I get is:
# ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']

Is there any function that reverts the tokenized sentence to its original state? The function tokenize.untokenize() for some reason doesn't work.
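One likely explanation, assuming the function meant here is `tokenize.untokenize()` from Python's standard library: that module tokenizes Python source code, not natural-language text, so it expects the token tuples produced by its own tokenizer rather than a list of words from nltk. A minimal sketch of what it is designed for (the `source` string is just an illustrative input):

```python
import io
import tokenize

# A line of Python source code, not an English sentence.
source = "x = 1 + 2\n"

# generate_tokens() yields TokenInfo tuples (type, string, start, end, line).
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# untokenize() rebuilds the source text from those tuples.
restored = tokenize.untokenize(tokens)
print(restored == source)
```

Reversing `nltk.word_tokenize` therefore needs a different tool, since its output is a plain list of strings with no position information.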

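Edit:

For example, I know I can do something like this, which probably solves the problem, but I am curious whether there is a built-in function for it: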

# `words` is the token list returned by nltk.word_tokenize above
result = ' '.join(words).replace(' , ',',').replace(' .','.').replace(' !','!')
result = result.replace(' ?','?').replace(' : ',': ').replace(' \'', '\'')

python-2.7

nltk

python

9 Answers

Stack Overflow user

Answered 2014-02-25 22:17:12

To reverse word_tokenize from nltk, I suggest looking at http://www.nltk.org/_modules/nltk/tokenize/punkt.html#PunktLanguageVars.word_tokenize and doing some reverse engineering.

Short of doing crazy hacks on nltk, you can try this:

>>> import nltk
>>> import string
>>> nltk.word_tokenize("I've found a medicine for my disease.")
['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
>>> tokens = nltk.word_tokenize("I've found a medicine for my disease.")
>>> "".join([" "+i if not i.startswith("'") and i not in string.punctuation else i for i in tokens]).strip()
"I've found a medicine for my disease."
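The one-liner above can be wrapped in a small function. The sketch below is a heuristic, not part of nltk: the name `join_tokens` is made up for illustration, and the extra `n't` check is an added assumption to cover the way the Treebank-style tokenizer splits contractions such as "don't" into `do` and `n't`. The token lists are hard-coded so the example runs without nltk installed.

```python
import string


def join_tokens(tokens):
    """Rebuild a sentence from Treebank-style tokens (heuristic, lossy)."""
    pieces = []
    for tok in tokens:
        # Clitics ('ve, 's, n't) and single punctuation marks are glued
        # to the previous token; everything else gets a leading space.
        attach = (
            tok.startswith("'")
            or tok == "n't"
            or tok in string.punctuation
        )
        pieces.append(tok if attach else " " + tok)
    return "".join(pieces).strip()


tokens = ["I", "'ve", "found", "a", "medicine", "for", "my", "disease", "."]
print(join_tokens(tokens))
print(join_tokens(["I", "do", "n't", "know", "."]))
```

Because every punctuation mark is attached to the preceding token, opening brackets and opening quotes will still come out wrong; this approach only suits simple sentences.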

Votes: 12

Stack Overflow user

Answered 2016-01-09 01:36:14

Use token_utils.untokenize from here:

import re
def untokenize(words):
    """
    Untokenizing a text undoes the tokenizing operation, restoring
    punctuation and spaces to the places that people expect them to be.
    Ideally, `untokenize(tokenize(text))` should be identical to `text`,
    except for line breaks.
    """
    text = ' '.join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .',  '...')
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
    step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
         "can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()

tokenized = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
untokenize(tokenized)
# "I've found a medicine for my disease."

Votes: 5

Stack Overflow user

Answered 2018-06-24 14:48:07

from nltk.tokenize.treebank import TreebankWordDetokenizer
TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
# 'the quick brown'

Votes: 4

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/21948019
