RegexpQuery in Lucene is not working

Stack Overflow user

Asked 2016-02-17 13:20:44

1 answer · 491 views · 0 followers · 0 votes

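I am using Lucene 5.4 to search the text of files with a regex, but RegexpQuery does not work. Phrase queries and ordinary queries work and find the files that contain the search string, but when I run the regex query, Lucene does not find any file matching the regex.

Index creation code: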

public IndexWriter generateIndex(String docsPath) throws IOException {

    String tmpDirProperty = System.getProperty("java.io.tmpdir");
    if (tmpDirProperty == null) {
        throw new IOException("System property 'java.io.tmpdir' does not specify a tmp dir");
    }
    // Build the index path only after the null check; a concatenated string is never null.
    String indexPath = tmpDirProperty + File.separator + "indexDirectory";
    File tmpDir = new File(indexPath);
    if (!tmpDir.exists()) {
        boolean created = tmpDir.mkdirs();
        if (!created) {
            throw new IOException("Unable to create tmp dir " + tmpDir);
        }
    }

    boolean create = true;
    final Path docDir = Paths.get(docsPath);
    if (!Files.isReadable(docDir)) {
        System.out.println("Document directory '" + docDir.toAbsolutePath()
                + "' does not exist or is not readable, please check the path");
        System.exit(1);
    }

    Date start = new Date();
    try {
        System.out.println("Indexing to directory '" + indexPath + "'...");

        Directory dir = FSDirectory.open(Paths.get(indexPath));
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);

        if (create) {
            iwc.setOpenMode(OpenMode.CREATE);
        } else {
            iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
        }

        IndexWriter writer = new IndexWriter(dir, iwc);
        indexDocs(writer, docDir);
        setIndexWriter(writer);

        Date end = new Date();
        System.out.println(end.getTime() - start.getTime() + " total milliseconds");
        writer.close();
    } catch (IOException e) {
        System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
    }

    return getIndexWriter();
}

static void indexDocs(final IndexWriter writer, Path path) throws IOException {
    if (Files.isDirectory(path)) {
        Files.walkFileTree(path, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                try {
                    indexDoc(writer, file, attrs.lastModifiedTime().toMillis());
                } catch (IOException ignore) {
                    // don't index files that can't be read.
                }
                return FileVisitResult.CONTINUE;
            }
        });
    } else {
        indexDoc(writer, path, Files.getLastModifiedTime(path).toMillis());
    }
}
static void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOException {
    try (InputStream stream = Files.newInputStream(file)) {
        Document doc = new Document();
        Field pathField = new StringField("path", file.toString(), Field.Store.YES);
        doc.add(pathField);

        doc.add(new LongField("modified", lastModified, Field.Store.NO));
        doc.add(new TextField("contents",
                new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8))));

        if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
            System.out.println("adding " + file);
            writer.addDocument(doc);
        } else {
            System.out.println("updating " + file);
            writer.updateDocument(new Term("path", file.toString()), doc);
        }
    }
}

Code that searches the text with a regular expression:

IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(index)));
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer();

BufferedReader in = null;

Query query = new RegexpQuery(new Term("contents", "program-id\\."));
query = query.rewrite(reader);

System.out.println("Searching for: " + query.toString(field));
searcher.search(query, null, 100);

Query parser code that does work:

QueryParser parser = new QueryParser(field, analyzer);
Query query = parser.parse("+program-id");

The source text being searched:

IDENTIFICATION DIVISION.
   PROGRAM-ID.  ACINSTAL.

   ENVIRONMENT DIVISION.

   DATA DIVISION.
   WORKING-STORAGE SECTION.

Please help.

java

lucene

1 Answer

Stack Overflow user

Answered 2016-02-18 00:48:44

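As stated in the comments, a regexp query must match a single token. There is no query type that allows a single regex to span multiple terms. In this case, StandardAnalyzer lowercases the text and splits PROGRAM-ID. into the separate terms program and id, so no indexed term can match program-id\. as a whole. In my opinion, regex queries on full-text content should generally be avoided (if the field is a simple identifier, that is a different story). If you are relying on them, it probably indicates that you are failing to provide effective full-text search. You should prefer more typical full-text search tools, such as wildcard, fuzzy, proximity, and range queries, or tune the analysis to produce more useful search results.

However, if you insist on doing it this way, there are two ways to support this kind of search.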
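To make the single-token constraint concrete, here is a minimal sketch that uses only the Java standard library. It does not call Lucene: `tokenize` is a crude, hypothetical stand-in for StandardAnalyzer (lowercase, split on anything that is not a letter or digit), and `anyTermMatches` imitates a term-level regexp query by requiring the regex to match one whole term. The class and method names are assumptions made for illustration, and Java's regex syntax differs from Lucene's in some details.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

final class TermRegexDemo {
    // Hypothetical stand-in for StandardAnalyzer: lowercase the text,
    // then split on every run of non-alphanumeric characters.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece);
            }
        }
        return terms;
    }

    // Imitates a term-level regexp query: the regex must match an entire term.
    static boolean anyTermMatches(List<String> terms, String regex) {
        Pattern pattern = Pattern.compile(regex);
        for (String term : terms) {
            if (pattern.matcher(term).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> terms = tokenize("PROGRAM-ID.  ACINSTAL.");
        System.out.println(terms);                                  // [program, id, acinstal]
        System.out.println(anyTermMatches(terms, "program-id\\.")); // false
        System.out.println(anyTermMatches(terms, "prog.*"));        // true
    }
}
```

The regex from the question fails because the hyphen and the period never reach the index, while a regex that stays inside a single term, such as prog.*, can match.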

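1. You could change your analysis to a tokenization that supports your search needs. Using StringField creates a single token, so regex queries will behave more like you expect. Of course, this will result in poor performance, and support for more standard query styles will be much worse. If the field is some kind of string identifier, this may be the best solution. If it is a full-text field for which you want robust full-text search support, this is almost certainly a bad solution.
2. You could use a more meaningful query. In the example you provided, a simple phrase query does the job just fine, as you have already indicated, so it is hard to tell what you actually need here. Generally, for complex regex queries that span multiple terms, you would have to use the SpanQuery API, usually combining multiple SpanMultiTermQueryWrappers with a SpanNearQuery. Also worth noting: the SurroundQueryParser is available, and it is designed to work with the SpanQuery API. It does not support regexes, but if combining wildcard queries into phrases with SpanNears is ultimately what you need, that query parser may prove to be a handy tool.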
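The idea behind the second option can be sketched without Lucene: instead of one regex over the raw text, use one regex per term and require the terms to be adjacent and in order. The class below is a hypothetical, standard-library-only illustration of that matching rule (roughly what an ordered SpanNearQuery with zero slop over per-term regex clauses expresses). It is not the Lucene API, and the names `AdjacentTermRegexDemo` and `matchesInOrder` are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class AdjacentTermRegexDemo {
    // Returns true if some run of consecutive terms matches the regexes
    // one-to-one and in order; each regex must match its whole term.
    static boolean matchesInOrder(List<String> terms, List<String> regexes) {
        if (regexes.isEmpty()) {
            return false;
        }
        for (int start = 0; start + regexes.size() <= terms.size(); start++) {
            boolean allMatch = true;
            for (int i = 0; i < regexes.size(); i++) {
                if (!Pattern.matches(regexes.get(i), terms.get(start + i))) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Terms as a lowercasing, punctuation-splitting analyzer might index them.
        List<String> terms = Arrays.asList(
                "identification", "division", "program", "id", "acinstal");
        System.out.println(matchesInOrder(terms, Arrays.asList("progr.*", "id"))); // true
        System.out.println(matchesInOrder(terms, Arrays.asList("id", "progr.*"))); // false
    }
}
```

Splitting the pattern at term boundaries is the key design choice: each clause only ever sees one indexed term, which is the constraint described above.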

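0 votes

The original page content is provided by Stack Overflow. Translation support is provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link: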

https://stackoverflow.com/questions/35448522
