Elasticsearch Analyzers

空洞的盒子

Published 2023-11-24 14:16:27

I. What is an analyzer

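In Elasticsearch, an analyzer is the component that analyzes and processes text. An analyzer consists of zero or more character filters, exactly one tokenizer, and zero or more token filters, applied in that order. The terms it produces are stored in Elasticsearch's inverted index, where they can be searched. Elasticsearch ships with a set of built-in analyzers such as "standard"; third-party analyzers such as IK and pinyin are installed as plugins.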
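To make the three stages concrete, here is a minimal sketch in plain Python. It does not use Elasticsearch; the functions `strip_html`, `whitespace_tokenizer`, `lowercase_filter`, and `build_inverted_index` are illustrative names invented for this example and only mimic the roles of a character filter, a tokenizer, a token filter, and the inverted index.

```python
import re


def strip_html(text):
    """Character filter: remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def whitespace_tokenizer(text):
    """Tokenizer: split the filtered text on runs of whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter: normalize every token to lower case."""
    return [token.lower() for token in tokens]


def analyze(text):
    """Run the stages in analyzer order: char filter, tokenizer, token filter."""
    return lowercase_filter(whitespace_tokenizer(strip_html(text)))


def build_inverted_index(docs):
    """Map each term to the ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            postings = index.setdefault(term, [])
            if doc_id not in postings:
                postings.append(doc_id)
    return index


docs = {1: "<b>Quick</b> brown fox", 2: "quick RED fox"}
index = build_inverted_index(docs)
print(index)
```

Both documents end up under the term "quick" even though one was written with markup and a capital letter, which is the effect the analysis chain is meant to achieve.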

II. Installing an analyzer plugin

1. Prepare the plugin package

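First, download the plugin package from the plugin's Git repository or official website. It is usually distributed as a zip file, and it should be the build made for the Elasticsearch version you are running.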

2. Install the plugin

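Upload the zip file to each node of the Elasticsearch cluster, then install it on every node with the following command. After the installation finishes, restart the Elasticsearch service on that node so that the installed plugin takes effect.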

bin/elasticsearch-plugin install file:///path/to/my-plugin.zip

III. Using an analyzer

1. Verify the analyzer's output

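Take the IK analyzer as an example. IK is a widely used community plugin for Chinese text rather than a plugin maintained by Elastic. Once installed, it integrates with the Elasticsearch search service and needs only simple configuration. It also offers more than one analysis mode, including the "ik_smart" and "ik_max_word" modes used below, so you can choose the one that fits your business needs.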

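In the following example we use IK's "ik_smart" mode to check how a piece of text is analyzed. The response shows that the analyzer split the value of the "text" parameter into several words and phrases.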

GET _analyze/?pretty
{
  "analyzer":"ik_smart",
  "text":"庆祝祖国六十岁生日快乐"
}

{
  "tokens" : [
    {
      "token" : "庆祝",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "祖国",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "六",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "十岁",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "生日快乐",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}
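The same check can be scripted outside the console. The sketch below builds the `_analyze` request with Python's standard library. It assumes a cluster reachable at http://localhost:9200 without authentication and with the IK plugin installed; the URL and the helper names `build_analyze_request` and `fetch_tokens` are assumptions made for this example.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local cluster, no authentication


def build_analyze_request(analyzer, text, base_url=ES_URL):
    """Build (but do not send) a POST request for the _analyze API."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_tokens(analyzer, text, base_url=ES_URL):
    """Send the request and return only the token strings from the response."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [entry["token"] for entry in payload["tokens"]]


# Building the request needs no running cluster; fetch_tokens() does.
request = build_analyze_request("ik_smart", "庆祝祖国六十岁生日快乐")
print(request.get_method(), request.full_url)
```

Calling `fetch_tokens("ik_smart", "庆祝祖国六十岁生日快乐")` against such a cluster should return the token strings listed in the response above.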

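When we analyze the same text with "ik_max_word", the response shows that the analyzer performs the finest-grained segmentation. It emits overlapping candidate words, such as "六十" and "十岁", or "生日快乐" together with "生日" and "快乐", so it covers more possible search terms than the "ik_smart" mode does.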

GET _analyze/?pretty
{
  "analyzer":"ik_max_word",
  "text":"庆祝祖国六十岁生日快乐"
}

{
  "tokens" : [
    {
      "token" : "庆祝",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "祖国",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "六十",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "十岁",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "生日快乐",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "生日",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "快乐",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 6
    }
  ]
}
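The difference between the two modes shows up in the token offsets. The following sketch copies the tokens and offsets from the two responses above into plain Python tuples and looks for tokens whose character ranges overlap; the helper name `overlapping_pairs` is invented for this example.

```python
# (token, start_offset, end_offset) copied from the two responses above
ik_smart = [
    ("庆祝", 0, 2), ("祖国", 2, 4), ("六", 4, 5),
    ("十岁", 5, 7), ("生日快乐", 7, 11),
]
ik_max_word = [
    ("庆祝", 0, 2), ("祖国", 2, 4), ("六十", 4, 6), ("十岁", 5, 7),
    ("生日快乐", 7, 11), ("生日", 7, 9), ("快乐", 9, 11),
]


def overlapping_pairs(tokens):
    """Return pairs of tokens whose [start, end) character ranges intersect."""
    pairs = []
    for i, (a, a_start, a_end) in enumerate(tokens):
        for b, b_start, b_end in tokens[i + 1:]:
            if a_start < b_end and b_start < a_end:
                pairs.append((a, b))
    return pairs


smart_terms = {token for token, _, _ in ik_smart}
only_in_max_word = [t for t, _, _ in ik_max_word if t not in smart_terms]

print(overlapping_pairs(ik_smart))
print(overlapping_pairs(ik_max_word))
print(only_in_max_word)
```

In this data "ik_smart" produces no overlapping tokens, while "ik_max_word" produces three overlapping pairs, which is why a query for "生日" or "快乐" can match text indexed with "ik_max_word".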

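Take the pinyin analyzer as another example. The response shows that the pinyin analyzer converted the text into the pinyin of each individual character, plus one token made of the first letters of all the syllables. This is why typing pinyin into a search box can often still find matching results.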

GET _analyze/?pretty
{
  "analyzer":"pinyin",
  "text":"庆祝祖国六十岁生日快乐"
}

{
  "tokens" : [
    {
      "token" : "qing",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "qzzglsssrkl",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "zhu",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "zu",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "guo",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "liu",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "shi",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "sui",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "sheng",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "ri",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "kuai",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "le",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 10
    }
  ]
}
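The first-letter token in this response can be reproduced from the per-character syllables. The sketch below is plain Python that only illustrates the relationship between the tokens; it is not how the plugin is implemented, and the names `first_letters` and `pinyin_terms` are invented for this example.

```python
# Per-character syllables copied from the response above, in position order
syllables = ["qing", "zhu", "zu", "guo", "liu", "shi",
             "sui", "sheng", "ri", "kuai", "le"]


def first_letters(parts):
    """Join the first letter of every syllable into one abbreviation token."""
    return "".join(part[0] for part in parts)


def pinyin_terms(parts):
    """All terms a pinyin-style analysis would index for this text."""
    return set(parts) | {first_letters(parts)}


terms = pinyin_terms(syllables)
print(first_letters(syllables))
print("guo" in terms, "qzzglsssrkl" in terms)
```

A user typing either a full syllable such as "guo" or the abbreviation "qzzglsssrkl" hits a term that was indexed for the original Chinese text.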

2. Specifying an analyzer for a mapping field when creating an index

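The following example assigns an analyzer to a field when the index is created. Once an analyzer is assigned, data written to that field is analyzed with it at index time. Unless a separate search analyzer is configured, the same analyzer is also applied to query text at search time. Note that an analyzer provided by a plugin must be installed on the cluster before it can be used; the "standard" analyzer in this example is built in.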

PUT /test
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "field1": {
        "type": "text",
        "analyzer": "standard"
      }
    }
  }
}
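With IK installed, a pairing that is often seen is "ik_max_word" at index time and "ik_smart" at search time, set through the mapping's `analyzer` and `search_analyzer` parameters. The sketch below only builds and prints such a request body in Python; the index name `test_ik` and the field name `content` are hypothetical, and whether this pairing suits your data is something to verify with `_analyze`.

```python
import json

INDEX_NAME = "test_ik"  # hypothetical index name
FIELD_NAME = "content"  # hypothetical field name


def build_index_body(index_analyzer, search_analyzer):
    """Create-index body with separate index-time and search-time analyzers."""
    return {
        "settings": {"number_of_shards": 1},
        "mappings": {
            "properties": {
                FIELD_NAME: {
                    "type": "text",
                    "analyzer": index_analyzer,          # used when writing
                    "search_analyzer": search_analyzer,  # used for query text
                }
            }
        },
    }


body = build_index_body("ik_max_word", "ik_smart")
print(f"PUT /{INDEX_NAME}")
print(json.dumps(body, indent=2))
```

The printed output has the same shape as the console request above and can be pasted into the console on a cluster where the IK plugin is installed.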

3. Specifying an analyzer for the search keywords at query time

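In the following example, the match query on the "name" field sets its "analyzer" parameter to "standard". At query time the search keywords are therefore analyzed with the "standard" analyzer instead of the analyzer configured for the field.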

GET /_search
{
	"query": {
		"match": {
			"name": {
				"query": "mark",
				"analyzer": "standard"
			}
		}
	}
}
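Why the query-time analyzer matters can be shown with a toy model: a query matches only if the terms produced from the keywords equal terms that were produced at index time. The sketch below is plain Python; `standard_like` and `keyword_like` are rough stand-ins written for this example, not the real Elasticsearch analyzers.

```python
import re


def standard_like(text):
    """Rough stand-in for "standard": split on non-word characters, lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def keyword_like(text):
    """Rough stand-in for a no-op analyzer: the whole input is one term."""
    return [text]


# Terms stored in the toy index for the field value "Mark Twain"
indexed_terms = set(standard_like("Mark Twain"))


def matches(query_text, query_analyzer):
    """A match needs at least one query term that exists in the index."""
    return any(term in indexed_terms for term in query_analyzer(query_text))


print(matches("MARK", standard_like))
print(matches("MARK", keyword_like))
```

With the same analysis on both sides the keyword "MARK" is reduced to "mark" and matches; with the no-op stand-in it stays "MARK" and finds nothing.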

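Original content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

In case of infringement, please contact cloudcommunity@tencent.com for removal.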