Q: Pattern matching with tregex in Stanza's implementation doesn't seem to find the right subtrees

Stack Overflow user

Asked on 2020-09-23 03:23:53

Answers: 1 · Views: 205 · Followers: 0 · Votes: 0

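I'm new to NLP and am currently trying to extract different phrase structures from German text. For this I'm using the Stanford CoreNLP implementation in Stanza, with its Tregex feature for pattern matching on trees.

So far I have had no problems and was able to match simple patterns such as "NPs" or "S > CS". Now I am trying to match S nodes that are either immediately dominated by ROOT, or immediately dominated by a CS node that is itself immediately dominated by ROOT. For this I'm using the pattern "S > (CS > TOP) | > TOP", but it doesn't seem to work properly. I used the following code: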

from stanza.server import CoreNLPClient

text = "Peter kommt und Paul geht."

def linguistic_units(_client, _text, _pattern):
    # Run the Tregex pattern; the response holds one dict of matches per sentence.
    matches = _client.tregex(_text, _pattern)
    sentences = matches['sentences']
    print('+++++Tree++++')
    print(sentences[0])
    count_units = 0
    for sentence in sentences:
        for match_id in sentence:
            # Print the text span covered by each matched subtree.
            print(sentence[match_id]['spanString'])
            count_units += 1
    return count_units


with CoreNLPClient(properties='./corenlp/StanfordCoreNLP-german.properties',
                   annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'parse', 'depparse', 'coref'],
                   timeout=300000,
                   be_quiet=True,
                   endpoint='http://localhost:9001',
                   memory='16G') as client:

    result = linguistic_units(client, text, 'S > (CS > ROOT) | > ROOT')
    print(result)

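For the example text "Peter kommt und Paul geht.", the pattern I'm using should match the two phrases "Peter kommt" and "Paul geht", but it doesn't. I then looked at the tree itself, and the parser output is as follows: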

constituency parse of first sentence
child {
  child {
    child {
      child {
        child {
          value: "Peter"
        }
        value: "PROPN"
      }
      child {
        child {
          value: "kommt"
        }
        value: "VERB"
      }
      value: "S"
    }
    child {
      child {
        value: "und"
      }
      value: "CCONJ"
    }
    child {
      child {
        child {
          value: "Paul"
        }
        value: "PROPN"
      }
      child {
        child {
          value: "geht"
        }
        value: "VERB"
      }
      value: "S"
    }
    value: "CS"
  }
  child {
    child {
      value: "."
    }
    value: "PUNCT"
  }
  value: "NUR"
}
value: "ROOT"
score: 5466.83349609375

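I now suspect this is caused by the ROOT node, since it is the last node in the tree. Shouldn't the ROOT node be at the beginning of the tree? Does anyone know what I'm doing wrong?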

stanford-nlp

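1 Answer

Stack Overflow user

Accepted answer

Posted on 2020-09-23 06:01:02

A few comments:

1.) Assuming you are using a recent version of CoreNLP (4.0.0+), you need to use the mwt annotator for German. So your annotator list should be tokenize,ssplit,mwt,pos,parse

2.) For clarity, here is your sentence in PTB format: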

(ROOT
  (NUR
    (CS
      (S (PROPN Peter) (VERB kommt))
      (CCONJ und)
      (S (PROPN Paul) (VERB geht)))))

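As you can see, ROOT is the root node of the tree, and CS sits under NUR rather than directly under ROOT, so your pattern does not match this sentence. Personally, I find that the PTB format makes the tree structure easier to see and Tregex patterns easier to write. You can get it via the json or text output formats (rather than the serialized object). In the client request, set output_format="text".

3.) Here is the latest documentation on using the Stanza client: https://stanfordnlp.github.io/stanza/client_properties.html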
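To make point 2 concrete, here is a minimal pure-Python sketch that needs no CoreNLP server. It rebuilds the parse from the question by hand as nested tuples (a stand-in, not the protobuf object Stanza returns), prints it in bracket notation with the label first, and checks chains of immediate dominance. The names `TREE`, `to_brackets`, `find_chain` and `words` are hypothetical helpers for illustration. They only mimic the `>` relation and are not part of Stanza or Tregex.

```python
# Hand-built (label, children) tuples mirroring the parse shown in the question.
TREE = ("ROOT", [
    ("NUR", [
        ("CS", [
            ("S", [("PROPN", [("Peter", [])]), ("VERB", [("kommt", [])])]),
            ("CCONJ", [("und", [])]),
            ("S", [("PROPN", [("Paul", [])]), ("VERB", [("geht", [])])]),
        ]),
        ("PUNCT", [(".", [])]),
    ]),
])


def to_brackets(node):
    """Render a node in PTB-style bracket notation, label first."""
    label, children = node
    if not children:
        return label
    return "(" + label + " " + " ".join(to_brackets(c) for c in children) + ")"


def find_chain(node, chain, ancestors=()):
    """Yield nodes labelled chain[0] whose parent, grandparent, ... are
    labelled chain[1], chain[2], ... (immediate dominance only)."""
    label, children = node
    path = (label,) + ancestors  # this node, its parent, its grandparent, ...
    if path[:len(chain)] == tuple(chain):
        yield node
    for child in children:
        yield from find_chain(child, chain, path)


def words(node):
    """Collect the leaf tokens under a node, left to right."""
    label, children = node
    if not children:
        return [label]
    return [w for c in children for w in words(c)]


print(to_brackets(TREE))
# S directly under CS directly under ROOT: no match, because NUR intervenes.
print(list(find_chain(TREE, ["S", "CS", "ROOT"])))
# Adding the NUR level finds both clauses.
for match in find_chain(TREE, ["S", "CS", "NUR", "ROOT"]):
    print(" ".join(words(match)))
```

If the same reasoning carries over to Tregex, a pattern that names the intermediate node, such as 'S > (CS > (NUR > ROOT))', would be the analogous query. That is an assumption and has not been run against a CoreNLP server here.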

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/64016461
