文章/答案/技术大牛

发布

社区首页 >问答首页 >为什么CoreNLP ner和ner要将分离的数字连接在一起？

问为什么CoreNLP ner和ner要将分离的数字连接在一起？
EN

Stack Overflow用户

提问于 2018-09-10 02:19:35

回答 1查看 397关注 0票数 2

下面是代码片段：

In [390]: t
Out[390]: ['my', 'phone', 'number', 'is', '1111', '1111', '1111']

In [391]: ner_tagger.tag(t)
Out[391]: 
[('my', 'O'),
 ('phone', 'O'),
 ('number', 'O'),
 ('is', 'O'),
 ('1111\xa01111\xa01111', 'NUMBER')]

我期望的是：

Out[391]: 
[('my', 'O'),
 ('phone', 'O'),
 ('number', 'O'),
 ('is', 'O'),
 ('1111', 'NUMBER'),
 ('1111', 'NUMBER'),
 ('1111', 'NUMBER')]

如您所见，人工电话号码由\xa0连接，这据说是一个不破坏的空间。我可以通过在不改变其他默认规则的情况下设置CoreNLP来区分这一点。

ner_tagger被定义为：

ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')

python

nlp

nltk

stanford-nlp

pycorenlp

回答 1

Stack Overflow用户

回答已采纳

发布于 2018-09-10 07:22:41

TL;DR

NLTK正在将令牌列表读取到字符串中，然后将其传递给CoreNLP服务器。CoreNLP对输入进行重新标记，并将类似数字的令牌与\xa0 (非破缺空间)连接起来。

长时间

让我们看一看代码，如果我们从tag()中看到CoreNLPParser函数，我们会看到它调用了tag_sents()函数，并在调用允许CoreNLPParser重新标记输入的raw_tag_sents()之前将输入字符串列表转换为字符串，请参阅https://github.com/nltk/nltk/blob/develop/nltk/parse/corenlp.py#L348。

def tag_sents(self, sentences):
    """
    Tag multiple sentences.
    Takes multiple sentences as a list where each sentence is a list of
    tokens.

    :param sentences: Input sentences to tag
    :type sentences: list(list(str))
    :rtype: list(list(tuple(str, str))
    """
    # Converting list(list(str)) -> list(str)
    sentences = (' '.join(words) for words in sentences)
    return [sentences[0] for sentences in self.raw_tag_sents(sentences)]

def tag(self, sentence):
    """
    Tag a list of tokens.
    :rtype: list(tuple(str, str))
    >>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
    >>> tokens = 'Rami Eid is studying at Stony Brook University in NY'.split()
    >>> parser.tag(tokens)
    [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'),
    ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]
    >>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
    >>> tokens = "What is the airspeed of an unladen swallow ?".split()
    >>> parser.tag(tokens)
    [('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'),
    ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'),
    ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]
    """
    return self.tag_sents([sentence])[0]

调用时，raw_tag_sents()使用api_call()将输入传递给服务器。

def raw_tag_sents(self, sentences):
    """
    Tag multiple sentences.
    Takes multiple sentences as a list where each sentence is a string.

    :param sentences: Input sentences to tag
    :type sentences: list(str)
    :rtype: list(list(list(tuple(str, str)))
    """
    default_properties = {'ssplit.isOneSentence': 'true',
                          'annotators': 'tokenize,ssplit,' }

    # Supports only 'pos' or 'ner' tags.
    assert self.tagtype in ['pos', 'ner']
    default_properties['annotators'] += self.tagtype
    for sentence in sentences:
        tagged_data = self.api_call(sentence, properties=default_properties)
        yield [[(token['word'], token[self.tagtype]) for token in tagged_sentence['tokens']]
                for tagged_sentence in tagged_data['sentences']]

，所以问题是如何解决问题，并在传入令牌时获得令牌？

如果我们查看CoreNLP中托卡器的选项，就会看到tokenize.whitespace选项：

如果在调用properties之前对api_call()进行了一些更改，则可以在将令牌传递给由空格连接的CoreNLP服务器时强制执行，例如对代码的更改：

def tag_sents(self, sentences, properties=None):
    """
    Tag multiple sentences.

    Takes multiple sentences as a list where each sentence is a list of
    tokens.

    :param sentences: Input sentences to tag
    :type sentences: list(list(str))
    :rtype: list(list(tuple(str, str))
    """
    # Converting list(list(str)) -> list(str)
    sentences = (' '.join(words) for words in sentences)
    if properties == None:
        properties = {'tokenize.whitespace':'true'}
    return [sentences[0] for sentences in self.raw_tag_sents(sentences, properties)]

def tag(self, sentence, properties=None):
    """
    Tag a list of tokens.

    :rtype: list(tuple(str, str))

    >>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
    >>> tokens = 'Rami Eid is studying at Stony Brook University in NY'.split()
    >>> parser.tag(tokens)
    [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'),
    ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]

    >>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
    >>> tokens = "What is the airspeed of an unladen swallow ?".split()
    >>> parser.tag(tokens)
    [('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'),
    ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'),
    ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]
    """
    return self.tag_sents([sentence], properties)[0]

def raw_tag_sents(self, sentences, properties=None):
    """
    Tag multiple sentences.

    Takes multiple sentences as a list where each sentence is a string.

    :param sentences: Input sentences to tag
    :type sentences: list(str)
    :rtype: list(list(list(tuple(str, str)))
    """
    default_properties = {'ssplit.isOneSentence': 'true',
                          'annotators': 'tokenize,ssplit,' }

    default_properties.update(properties or {})

    # Supports only 'pos' or 'ner' tags.
    assert self.tagtype in ['pos', 'ner']
    default_properties['annotators'] += self.tagtype
    for sentence in sentences:
        tagged_data = self.api_call(sentence, properties=default_properties)
        yield [[(token['word'], token[self.tagtype]) for token in tagged_sentence['tokens']]
                for tagged_sentence in tagged_data['sentences']]

更改上述代码后：

>>> from nltk.parse.corenlp import CoreNLPParser
>>> ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
>>> sent = ['my', 'phone', 'number', 'is', '1111', '1111', '1111']
>>> ner_tagger.tag(sent)
[('my', 'O'), ('phone', 'O'), ('number', 'O'), ('is', 'O'), ('1111', 'DATE'), ('1111', 'DATE'), ('1111', 'DATE')]

票数 1

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/52250268

复制

相似问题

问为什么CoreNLP ner和ner要将分离的数字连接在一起？
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问为什么CoreNLP ner和ner要将分离的数字连接在一起？EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问为什么CoreNLP ner和ner要将分离的数字连接在一起？
EN