
Querying part-of-speech tags with Lucene 7 OpenNLP

Stack Overflow user
Asked 2018-09-16 11:05:03
1 answer · 643 views · 0 followers · 0 votes

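For fun and learning, I am trying to build a part-of-speech (POS) tagger with OpenNLP and Lucene 7.4. The goal is that, once the text is indexed, I can search for a sequence of POS tags and find all sentences that match that sequence. I already have the indexing part, but I am stuck on the query part. I am aware that Solr might have some functionality for this, and I have already checked its code (which was not so self-explanatory after all). But my goal is to understand and implement this in Lucene 7, not in Solr, because I want to be independent of any search engine built on top of it.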

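Idea

Input sentence 1: The quick brown fox jumped over the lazy dogs. Applying the Lucene OpenNLP tokenizer splits it into one token per word and punctuation mark. Applying Lucene OpenNLP POS tagging next yields one tag per token: DT JJ … VBD … DT … NNS …

Input sentence 2: Give it to me, baby! Applying the Lucene OpenNLP tokenizer results in: Give … to … ! Applying Lucene OpenNLP POS tagging next results in: VB … TO … , .

Query: JJ matches part of sentence 1, so sentence 1 should be returned. (At this point I am only interested in exact matches, i.e. let's leave aside partial matches, wildcards, etc.)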
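The matching rule described above (a query tag sequence must occur as a contiguous run inside a sentence's tag sequence) can be sketched in plain Java, independent of Lucene. This is only an illustration of the intended semantics: the class name `TagSequenceMatch`, the helper `containsSequence`, and the hand-written tag list are assumptions, not tagger output.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TagSequenceMatch {
    /** Returns true if query occurs as a contiguous run inside tags. */
    static boolean containsSequence(List<String> tags, List<String> query) {
        return Collections.indexOfSubList(tags, query) >= 0;
    }

    public static void main(String[] args) {
        // Hand-written Penn Treebank style tags, assumed for illustration only.
        List<String> sentence1 = Arrays.asList(
                "DT", "JJ", "JJ", "NN", "VBD", "IN", "DT", "JJ", "NNS", ".");

        // prints "true": JJ NN VBD appears as a contiguous run
        System.out.println(containsSequence(sentence1, Arrays.asList("JJ", "NN", "VBD")));
        // prints "false": NN is never directly followed by JJ
        System.out.println(containsSequence(sentence1, Arrays.asList("NN", "JJ")));
    }
}
```

A search index has to answer the same question without scanning every sentence, which is why the way tags are stored in the index matters.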

Indexing

First, I created my own class com.example.OpenNLPAnalyzer:

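import java.io.IOException;

import opennlp.tools.postag.POSModel;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerModel;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.opennlp.OpenNLPPOSFilter;
import org.apache.lucene.analysis.opennlp.OpenNLPTokenizer;
import org.apache.lucene.analysis.opennlp.tools.NLPPOSTaggerOp;
import org.apache.lucene.analysis.opennlp.tools.NLPSentenceDetectorOp;
import org.apache.lucene.analysis.opennlp.tools.NLPTokenizerOp;
import org.apache.lucene.analysis.opennlp.tools.OpenNLPOpsFactory;
import org.apache.lucene.analysis.payloads.TypeAsPayloadTokenFilter;
import org.apache.lucene.analysis.util.ClasspathResourceLoader;
import org.apache.lucene.analysis.util.ResourceLoader;
import org.apache.lucene.util.AttributeFactory;

public class OpenNLPAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {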
    try {

        ResourceLoader resourceLoader = new ClasspathResourceLoader(ClassLoader.getSystemClassLoader());


        TokenizerModel tokenizerModel = OpenNLPOpsFactory.getTokenizerModel("en-token.bin", resourceLoader);
        NLPTokenizerOp tokenizerOp = new NLPTokenizerOp(tokenizerModel);


        SentenceModel sentenceModel = OpenNLPOpsFactory.getSentenceModel("en-sent.bin", resourceLoader);
        NLPSentenceDetectorOp sentenceDetectorOp = new NLPSentenceDetectorOp(sentenceModel);

        Tokenizer source = new OpenNLPTokenizer(
                AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, sentenceDetectorOp, tokenizerOp);

        POSModel posModel = OpenNLPOpsFactory.getPOSTaggerModel("en-pos-maxent.bin", resourceLoader);
        NLPPOSTaggerOp posTaggerOp = new NLPPOSTaggerOp(posModel);

        // Perhaps we should also use a lower-case filter here?

        TokenFilter posFilter = new OpenNLPPOSFilter(source, posTaggerOp);

        // Very important: token types (the POS tags) are not indexed by default; we need to store them as payloads, otherwise we cannot search on them
        TypeAsPayloadTokenFilter payloadFilter = new TypeAsPayloadTokenFilter(posFilter);

        return new TokenStreamComponents(source, payloadFilter);
    }
    catch (IOException e) {
        // Keep the original exception as the cause instead of only its message
        throw new RuntimeException(e);
    }
  }
}

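Note that we are using a TypeAsPayloadTokenFilter wrapped around the OpenNLPPOSFilter. This means our POS tags will be indexed as payloads, and our query, whatever it ends up looking like, will have to search on payloads as well.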
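As a rough mental model of what "indexed as payloads" implies, the plain-Java sketch below treats each position as a searchable term (the word) with an opaque payload attached (assumed here to be the UTF-8 bytes of the POS tag). It is a toy, not Lucene code: `PayloadModel`, `Posting`, `countMatches`, and the hand-written word/tag pairs are all hypothetical. The point it illustrates is that a payload check only filters positions that were already found by their term.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PayloadModel {
    /** One indexed position: the searchable term plus an opaque payload. */
    static final class Posting {
        final String term;
        final byte[] payload;

        Posting(String term, String type) {
            this.term = term;
            // Assumption: the payload is the UTF-8 encoding of the token type (the POS tag).
            this.payload = type.getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Counts positions whose term equals `term` AND whose payload equals `payload`. */
    static int countMatches(List<Posting> field, String term, String payload) {
        byte[] wanted = payload.getBytes(StandardCharsets.UTF_8);
        int hits = 0;
        for (Posting p : field) {
            if (p.term.equals(term) && Arrays.equals(p.payload, wanted)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hand-written word/tag pairs, assumed for illustration only.
        List<Posting> body = new ArrayList<>();
        body.add(new Posting("quick", "JJ"));
        body.add(new Posting("brown", "JJ"));
        body.add(new Posting("fox", "NN"));

        // prints 0: "JJ" is a payload value here, never a term
        System.out.println(countMatches(body, "JJ", "type"));
        // prints 1: the word is the term, the tag is the payload to check
        System.out.println(countMatches(body, "brown", "JJ"));
    }
}
```

In this model a tag can only narrow down matches for a known word; it cannot be looked up on its own, which is the limitation to keep in mind when choosing between payloads and other ways of indexing the tags.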

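Querying

This is where I am stuck. I have no clue how to query on payloads, and whatever I try does not work. Note that I am using Lucene 7; it seems that querying on payloads has changed several times across older versions. Documentation is extremely scarce. It is not even clear what the proper field name to query is: is it "word", "type", or something else? For example, I tried the following code, which does not return any search results: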

    // Step 1: Indexing
    final String body = "The quick brown fox jumped over the lazy dogs.";
    Directory index = new RAMDirectory();
    OpenNLPAnalyzer analyzer = new OpenNLPAnalyzer();
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
    IndexWriter writer = new IndexWriter(index, indexWriterConfig);
    Document document = new Document();
    document.add(new TextField("body", body, Field.Store.YES));
    writer.addDocument(document);
    writer.close();


    // Step 2: Querying
    final int topN = 10;
    DirectoryReader reader = DirectoryReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);

    final String fieldName = "body"; // What is the correct field name here? "body", or "type", or "word" or anything else?
    final String queryText = "JJ";
    Term term = new Term(fieldName, queryText);
    SpanQuery match = new SpanTermQuery(term);
    BytesRef pay = new BytesRef("type"); // Don't understand what to put here as an argument
    SpanPayloadCheckQuery query = new SpanPayloadCheckQuery(match, Collections.singletonList(pay));

    System.out.println(query.toString());

    TopDocs topDocs = searcher.search(query, topN);

Any help here is much appreciated.

1 Answer

Stack Overflow user

Answered 2018-09-18 04:36:36

Why not use TypeAsSynonymFilter instead of TypeAsPayloadTokenFilter and just run a normal query? So, in your analyzer:

// ... (earlier lines of createComponents unchanged)
TokenFilter posFilter = new OpenNLPPOSFilter(source, posTaggerOp);
TypeAsSynonymFilter typeAsSynonymFilter = new TypeAsSynonymFilter(posFilter);
return new TokenStreamComponents(source, typeAsSynonymFilter);
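To see why this swap lets an ordinary phrase query find tag sequences, here is a toy plain-Java model of the idea: each position holds both the word and its tag as equally searchable terms, and a phrase matches when its terms sit at consecutive positions. This is a sketch of the concept, not of Lucene internals; `TypeAsSynonymModel`, `index`, `matchesPhrase`, and the hand-written tags are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TypeAsSynonymModel {
    /** Builds one term set per position, holding both the word and its tag. */
    static List<Set<String>> index(String[] words, String[] tags) {
        List<Set<String>> positions = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            positions.add(new HashSet<>(Arrays.asList(words[i], tags[i])));
        }
        return positions;
    }

    /** Exact phrase match: each query term must be found at consecutive positions. */
    static boolean matchesPhrase(List<Set<String>> positions, String... phrase) {
        for (int start = 0; start + phrase.length <= positions.size(); start++) {
            boolean ok = true;
            for (int k = 0; k < phrase.length && ok; k++) {
                ok = positions.get(start + k).contains(phrase[k]);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hand-written tags, assumed for illustration only.
        String[] words = {"quick", "brown", "fox", "jumped"};
        String[] tags = {"JJ", "JJ", "NN", "VBD"};
        List<Set<String>> positions = index(words, tags);

        System.out.println(matchesPhrase(positions, "JJ", "NN", "VBD")); // prints "true"
        System.out.println(matchesPhrase(positions, "brown", "fox"));    // prints "true"
        System.out.println(matchesPhrase(positions, "fox", "brown"));    // prints "false"
    }
}
```

One consequence of this model is that mixed phrases such as "brown NN" also match, because words and tags share the same positions and the same term space.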

And on the indexing side:

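// Field name shared by indexing and search; "body" matches the "body:" prefix in the printed queries
static final String FIELD = "body";

static Directory index() throws Exception {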
  Directory index = new RAMDirectory();
  OpenNLPAnalyzer analyzer = new OpenNLPAnalyzer();
  IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
  IndexWriter writer = new IndexWriter(index, indexWriterConfig);
  writer.addDocument(doc("The quick brown fox jumped over the lazy dogs."));
  writer.addDocument(doc("Give it to me, baby!"));
  writer.close();

  return index;
}

static Document doc(String body){
  Document document = new Document();
  document.add(new TextField(FIELD, body, Field.Store.YES));
  return document;
}

On the searching side:

static void search(Directory index, String searchPhrase) throws Exception {
  final int topN = 10;
  DirectoryReader reader = DirectoryReader.open(index);
  IndexSearcher searcher = new IndexSearcher(reader);

  QueryParser parser = new QueryParser(FIELD, new WhitespaceAnalyzer());
  Query query = parser.parse(searchPhrase);
  System.out.println(query);

  TopDocs topDocs = searcher.search(query, topN);
  System.out.printf("%s => %d hits\n", searchPhrase, topDocs.totalHits);
  for(ScoreDoc scoreDoc: topDocs.scoreDocs){
    Document doc = searcher.doc(scoreDoc.doc);
    System.out.printf("\t%s\n", doc.get(FIELD));
  }
}

Then use them like this:

public static void main(String[] args) throws Exception {
  Directory index = index();
  search(index, "\"JJ NN VBD\"");    // search the sequence of POS tags
  search(index, "\"brown fox\"");    // search a phrase
  search(index, "\"fox brown\"");    // search a phrase (no hits)
  search(index, "baby");             // search a word
  search(index, "\"TO PRP\"");       // search the sequence of POS tags
}

The result looks like this:

body:"JJ NN VBD"
"JJ NN VBD" => 1 hits
    The quick brown fox jumped over the lazy dogs.
body:"brown fox"
"brown fox" => 1 hits
    The quick brown fox jumped over the lazy dogs.
body:"fox brown"
"fox brown" => 0 hits
body:baby
baby => 1 hits
    Give it to me, baby!
body:"TO PRP"
"TO PRP" => 1 hits
    Give it to me, baby!
1 vote
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/52353452
