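A Handy Tool for Full-Text Search in Python: Getting Started with the Whoosh Library

Source: 企鹅号 (Penguin Account) - 孙导TV

Search is an increasingly important feature of modern applications. Whoosh is a full-text search engine library implemented in pure Python. It has no external dependencies, so it is easy to integrate into a Python project. This article covers Whoosh's core concepts, its basic usage, and how it compares with other search engine libraries.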

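「What is full-text search?」

Full-text search looks for the given keywords in the entire content of each document, not only in metadata such as titles and tags. A document can therefore match even when the keyword appears only in its body, so the results cover more of the relevant material.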
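The difference can be shown with a small dependency-free sketch. The sample documents and the two helper functions (`metadata_search`, `full_text_search`) are made up for illustration and are not part of Whoosh. They use plain substring matching, which a real engine replaces with an index.

```python
# Hypothetical sample data: titles, tags, and bodies are invented for this sketch.
documents = [
    {"title": "Release notes", "tags": ["changelog"],
     "body": "This release fixes a crash in the indexing pipeline."},
    {"title": "Indexing guide", "tags": ["howto"],
     "body": "Step-by-step instructions for building a search index."},
    {"title": "Team update", "tags": ["news"],
     "body": "Nothing technical this week."},
]


def metadata_search(docs, keyword):
    """Match the keyword against titles and tags only."""
    keyword = keyword.lower()
    return [
        d["title"] for d in docs
        if keyword in d["title"].lower()
        or any(keyword in tag.lower() for tag in d["tags"])
    ]


def full_text_search(docs, keyword):
    """Match the keyword against titles, tags, and the full body text."""
    keyword = keyword.lower()
    return [
        d["title"] for d in docs
        if keyword in " ".join([d["title"], *d["tags"], d["body"]]).lower()
    ]


metadata_hits = metadata_search(documents, "indexing")
full_text_hits = full_text_search(documents, "indexing")
print(metadata_hits)   # ['Indexing guide']
print(full_text_hits)  # ['Release notes', 'Indexing guide']
```

The metadata-only search misses "Release notes" because the keyword appears only in that document's body.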

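「Core concepts in Whoosh」

「Schema:」 A Schema defines which fields an index contains and the type of each field.

「Index:」 An Index is the data structure that stores the indexed data, so that a search does not have to scan every document.

「Analyzer:」 An Analyzer tokenizes and filters text so that indexing and searching work on the same set of terms.

「Searcher:」 A Searcher runs search queries against the index and returns the results.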
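To make the four roles concrete, here is a toy inverted index written with the standard library only. It is a sketch of the general idea, not Whoosh's implementation: `SCHEMA`, `analyze`, and `TinyIndex` are hypothetical names, and the `path` field is only stored here, which is a simplification.

```python
import re
from collections import defaultdict

# Schema: which fields exist, and whether each is indexed and/or stored.
SCHEMA = {
    "title": {"indexed": True, "stored": True},
    "content": {"indexed": True, "stored": False},
    "path": {"indexed": False, "stored": True},  # simplification for this sketch
}


def analyze(text):
    """Analyzer: split text into lowercase word tokens."""
    return [m.group(0).lower() for m in re.finditer(r"\w+", text)]


class TinyIndex:
    """Index: an inverted index mapping (field, term) to document numbers."""

    def __init__(self, schema):
        self.schema = schema
        self.postings = defaultdict(set)
        self.stored = []  # stored field values, one dict per document

    def add_document(self, **fields):
        docnum = len(self.stored)
        self.stored.append(
            {name: value for name, value in fields.items()
             if self.schema[name]["stored"]}
        )
        for name, value in fields.items():
            if self.schema[name]["indexed"]:
                for term in analyze(value):
                    self.postings[(name, term)].add(docnum)

    def search(self, field, query):
        """Searcher: return stored fields of documents containing every query term."""
        terms = analyze(query)
        if not terms:
            return []
        matches = set.intersection(
            *(self.postings.get((field, term), set()) for term in terms)
        )
        return [self.stored[docnum] for docnum in sorted(matches)]


index = TinyIndex(SCHEMA)
index.add_document(title="My first document",
                   content="This is the content of my first document.", path="/a")
index.add_document(title="My second document",
                   content="The second document contains some more text.", path="/b")

hits = index.search("content", "second document")
all_hits = index.search("content", "document")
print(hits)  # [{'title': 'My second document', 'path': '/b'}]
```

Because `content` is indexed but not stored, a hit can be found through its content while only `title` and `path` come back in the result.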

「Installing Whoosh」

Whoosh can be installed with pip:

pip install Whoosh

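「Example 1: Creating an index and adding documents」

import os.path

from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID

# Define the schema: title and path are stored so they can be shown in results;
# content is indexed for searching but not stored
schema = Schema(title=TEXT(stored=True), content=TEXT, path=ID(stored=True))

# Create the index directory if it does not exist yet
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")

# Create a new index in that directory (replaces any index already there)
ix = create_in("indexdir", schema)

# Create a writer object
writer = ix.writer()

# Add documents
writer.add_document(title="My first document", content="This is the content of my first document.", path="/a")
writer.add_document(title="My second document", content="The second document contains some more text.", path="/b")
writer.add_document(title="Third document is here", content="This is the third document in the index.", path="/c")

# Commit the changes
writer.commit()

print("Index created")

In this example we define a Schema with the fields title, content, and path, and create an index from it. We then use a writer object to add three documents to the index. The documents only become searchable after writer.commit() is called.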

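「Example 2: Running a search query」

from whoosh.index import open_dir
from whoosh.qparser import QueryParser

# Open the index
ix = open_dir("indexdir")

# Create a searcher; the with block closes it when we are done
with ix.searcher() as searcher:
    # Create a query parser that searches the content field by default
    query_parser = QueryParser("content", ix.schema)

    # Parse the query string
    query = query_parser.parse("document")

    # Run the search
    results = searcher.search(query)

    # Print the results (must happen while the searcher is still open)
    print(f"Found {len(results)} result(s):")
    for result in results:
        print(f"Title: {result['title']}, path: {result['path']}")

In this example we use QueryParser to parse the query string and a Searcher object to run the search. We then iterate over the results and print the title and path of each one. Only stored fields (title and path here) can be read from a result, and searcher.search returns at most 10 hits unless a different limit is passed.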

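「Chinese word segmentation」

Whoosh's default analyzer is designed for languages such as English, where words are separated by spaces and punctuation. Chinese text needs a Chinese word segmenter. A commonly used one is jieba:

pip install jieba

An example that uses jieba for tokenization:

import os.path

import jieba
from whoosh.index import create_in, open_dir
from whoosh.fields import Schema, TEXT, ID
from whoosh.analysis import Tokenizer, Token

class ChineseTokenizer(Tokenizer):
    def __call__(self, value, positions=False, chars=False, mode="", **kwargs):
        token = Token(positions, chars, mode=mode, **kwargs)
        pos = 0
        # jieba.tokenize yields (word, start offset, end offset)
        for word, start, end in jieba.tokenize(value):
            if not word.strip():
                continue  # skip whitespace-only segments
            token.text = word
            if positions:
                token.pos = pos  # TEXT fields need positions for phrase queries
            if chars:
                token.startchar, token.endchar = start, end
            pos += 1
            yield token

analyzer = ChineseTokenizer()
schema = Schema(title=TEXT(analyzer=analyzer, stored=True), content=TEXT(analyzer=analyzer), path=ID(stored=True))

# ... (the rest of the code is similar to Example 1)

In this example we define a custom ChineseTokenizer class that segments text with jieba and fills in the token positions and character offsets when Whoosh asks for them. We then set ChineseTokenizer as the analyzer for the title and content fields in the Schema.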
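To see why word-based tokenization is not enough for Chinese, consider the dependency-free sketch below. The `\w+` pattern is an assumption that only approximates a regex-based English tokenizer, and the bigram fallback is a simpler, swapped-in technique rather than what jieba does. Both helper names are hypothetical.

```python
import re

# Assumption: this pattern approximates word-based tokenization;
# it is not copied from Whoosh.
WORD_PATTERN = re.compile(r"\w+")


def regex_tokens(text):
    """Split text into runs of word characters."""
    return [m.group(0) for m in WORD_PATTERN.finditer(text)]


def bigram_tokens(text):
    """Fallback: overlapping two-character tokens within each run."""
    tokens = []
    for run in regex_tokens(text):
        if len(run) < 2:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


english = regex_tokens("full text search")
chinese = regex_tokens("全文搜索引擎")
bigrams = bigram_tokens("全文搜索引擎")

print(english)  # ['full', 'text', 'search']
print(chinese)  # ['全文搜索引擎']
print(bigrams)  # ['全文', '文搜', '搜索', '索引', '引擎']
```

Chinese has no spaces between words, so the whole phrase becomes a single token and a query for 搜索 alone would not match it. Bigrams make the phrase searchable but also produce non-words such as 文搜, which is why a dictionary-based segmenter like jieba usually gives cleaner terms.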

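「Whoosh compared with other search engine libraries」

Overall, Whoosh suits small and medium-sized projects, projects without demanding performance requirements, and projects that need a pure-Python solution. Elasticsearch and Solr are a better fit for large projects and for projects with higher demands on performance and features.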

Published: 2025-01-05 10:45:35
Original link: https://page.om.qq.com/page/OqnqHP4HMVEfgOVjBmtr-awQ0
