ElasticSearch (7.2.2): An Introduction to Analyzers and How to Use Them

cwl_java
Published 2019-11-04 16:23:32
Included in the column: cwl_Java

Summary: what an analyzer is, and which built-in analyzers are available

What is an analyzer
  • A tool that takes a piece of text entered by the user and, following a set of rules, breaks it into multiple terms
  • example: The best 3-points shooter is Curry!
Commonly used built-in analyzers
  • standard analyzer
  • simple analyzer
  • whitespace analyzer
  • stop analyzer
  • language analyzer
  • pattern analyzer
standard analyzer
  • The standard analyzer is the default analyzer; it is used whenever no analyzer is specified.
  • POST localhost:9200/_analyze
{
	 "analyzer": "standard",
	 "text": "The best 3-points shooter is Curry!"
}
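To get a feel for what this request returns without a running cluster, here is a rough Python sketch. It is only an approximation: the real standard analyzer segments text according to the Unicode Text Segmentation rules (UAX #29), while this stand-in simply collects runs of word characters and lowercases them. The function name `approx_standard_analyze` is made up for illustration. For plain English text like the example sentence the two should coincide, but that is an assumption, not a guarantee.

```python
import re


def approx_standard_analyze(text):
    """Rough stand-in for the standard analyzer (NOT the real algorithm).

    Collects runs of word characters and lowercases them. The real
    analyzer uses Unicode text segmentation, so results can differ
    for non-trivial input.
    """
    return [token.lower() for token in re.findall(r"\w+", text)]


tokens = approx_standard_analyze("The best 3-points shooter is Curry!")
print(tokens)
# ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```

Note that the hyphen and the exclamation mark disappear, and "3" and "points" become separate terms.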
simple analyzer
  • The simple analyzer splits text into terms whenever it encounters a character that is not a letter, and it lowercases every term.
  • POST localhost:9200/_analyze
{
	 "analyzer": "simple",
	 "text": "The best 3-points shooter is Curry!"
}
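The rule described above (break on anything that is not a letter, then lowercase) can be sketched in a few lines of Python. This is an illustrative approximation rather than Elasticsearch's implementation, and `approx_simple_analyze` is a hypothetical name.

```python
import re


def approx_simple_analyze(text):
    """Approximate the simple analyzer: keep runs of letters, lowercased."""
    # [^\W\d_]+ means "word characters minus digits and underscore",
    # which leaves only letters.
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


tokens = approx_simple_analyze("The best 3-points shooter is Curry!")
print(tokens)
# ['the', 'best', 'points', 'shooter', 'is', 'curry']
```

Because digits are not letters, the "3" in "3-points" is dropped entirely, which is the visible difference from the standard analyzer on this sentence.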
whitespace analyzer
  • The whitespace analyzer splits text into terms whenever it encounters a whitespace character.
  • POST localhost:9200/_analyze
{
	 "analyzer": "whitespace",
	 "text": "The best 3-points shooter is Curry!"
}
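Splitting on whitespace alone is easy to mimic in Python, which makes the behavior simple to check by hand. The sketch below is an approximation (the helper name `approx_whitespace_analyze` is invented here); the key point it shows is that case and punctuation are left untouched.

```python
def approx_whitespace_analyze(text):
    """Approximate the whitespace analyzer: split on whitespace only."""
    # str.split() with no arguments splits on any run of whitespace
    # and never yields empty strings.
    return text.split()


tokens = approx_whitespace_analyze("The best 3-points shooter is Curry!")
print(tokens)
# ['The', 'best', '3-points', 'shooter', 'is', 'Curry!']
```

The terms keep their original case ("The", "Curry!") and their punctuation ("3-points", "Curry!"), so a search has to supply exactly those terms to match.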
stop analyzer
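  • The stop analyzer is very similar to the simple analyzer; the only difference is that it adds support for removing stop words. By default it uses the English stop word list.
  • Stop words are a predefined list of words, for example the, a, an, this, of, at, and so on.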
  • POST localhost:9200/_analyze
{
	 "analyzer": "stop",
	 "text": "The best 3-points shooter is Curry!"
}
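As a rough illustration of the stop analyzer, the sketch below takes letter-only, lowercased terms (as the simple analyzer would) and then filters out stop words. The word set is assumed to mirror the default English stop word list; treat it as an assumption and check the Elasticsearch documentation for your version. `approx_stop_analyze` is a hypothetical helper name.

```python
import re

# Assumed to mirror the default English stop word list; verify against
# the documentation for the Elasticsearch version you are running.
ENGLISH_STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
})


def approx_stop_analyze(text, stop_words=ENGLISH_STOP_WORDS):
    """Approximate the stop analyzer: simple-analyzer terms minus stop words."""
    lowered_letters = (t.lower() for t in re.findall(r"[^\W\d_]+", text))
    return [t for t in lowered_letters if t not in stop_words]


tokens = approx_stop_analyze("The best 3-points shooter is Curry!")
print(tokens)
# ['best', 'points', 'shooter', 'curry']
```

Compared with the simple analyzer, "the" and "is" are gone; everything else is unchanged.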
language analyzer
  • Language-specific analyzers (for example, english for English text). Built-in languages: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai
  • POST localhost:9200/_analyze
{
	 "analyzer": "english",
	 "text": "The best 3-points shooter is Curry!"
}
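Language analyzers also apply stemming, which is not practical to imitate in a few lines, so it is more useful to ask a real node. The sketch below shows one way to send the request above from Python using only the standard library. It assumes a node reachable at `http://localhost:9200` without authentication; the helper names `build_analyze_request` and `fetch_tokens` are invented for this example. Only the request construction runs here; `fetch_tokens` needs a live cluster.

```python
import json
import urllib.request


def build_analyze_request(analyzer, text, base_url="http://localhost:9200"):
    """Build (but do not send) a POST request for the _analyze API."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_tokens(request):
    """Send the request and return the token strings from the response.

    Requires a reachable Elasticsearch node; not called in this sketch.
    """
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [entry["token"] for entry in payload["tokens"]]


request = build_analyze_request("english", "The best 3-points shooter is Curry!")
print(request.get_method(), request.full_url)
# POST http://localhost:9200/_analyze
```

With a node running, `fetch_tokens(request)` would return the list of terms the english analyzer produces; swapping the first argument lets you compare any of the analyzers in this article.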
pattern analyzer
  • Uses a regular expression to split text into terms; the default regular expression is \W+ (one or more non-word characters).
  • POST localhost:9200/_analyze
{
	 "analyzer": "pattern",
	 "text": "The best 3-points shooter is Curry!"
}
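Splitting on a regular expression translates almost directly to Python's `re.split`. The sketch below is an approximation under two assumptions: that the analyzer lowercases terms by default, and that Python's `\W` behaves like the Java regex class for this input (the two differ on some Unicode text). `approx_pattern_analyze` is a hypothetical name.

```python
import re


def approx_pattern_analyze(text, pattern=r"\W+", lowercase=True):
    """Approximate the pattern analyzer: split on a regex, drop empty pieces."""
    # re.split can yield empty strings at the edges (e.g. after the
    # trailing "!"), so those are filtered out.
    tokens = [piece for piece in re.split(pattern, text) if piece]
    return [t.lower() for t in tokens] if lowercase else tokens


tokens = approx_pattern_analyze("The best 3-points shooter is Curry!")
print(tokens)
# ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```

Passing a different `pattern`, for example `r","` for comma-separated values, changes where the text is cut, which is the main reason to pick this analyzer.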
Choosing an analyzer
  • PUT localhost:9200/my_index
{
	"settings": {
		"analysis": {
			"analyzer": {
				"my_analyzer": {
					"type": "whitespace"
				}
			}
		}
	},
	"mappings": {
		"properties": {
			"name": {
				"type": "text"
			},
			"team_name": {
				"type": "text"
			},
			"position": {
				"type": "text"
			},
			"play_year": {
				"type": "long"
			},
			"jerse_no": {
				"type": "keyword"
			},
			"title": {
				"type": "text",
				"analyzer": "my_analyzer"
			}
		}
	}
}
  • PUT localhost:9200/my_index/_doc/1
{
	 "name": "库里",
	 "team_name": "勇士",
	 "position": "控球后卫",
	 "play_year": 10,
	 "jerse_no": "30",
	 "title": "The best 3-points shooter is Curry!"
 }
  • POST localhost:9200/my_index/_search
{
	"query": {
		"match": {
			"title": "Curry!"
		}
	}
}
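Why does searching for "Curry!" find the document? The title field is analyzed with the whitespace-based `my_analyzer`, so the indexed term is literally "Curry!", and the query text goes through the same analyzer before lookup. The toy inverted index below sketches that idea in Python. It is a deliberately simplified model (no scoring, a single field, whitespace splitting only), and all function names are made up for illustration.

```python
def whitespace_analyze(text):
    """Toy analyzer: split on whitespace, keep case and punctuation."""
    return text.split()


def build_inverted_index(docs):
    """Map each analyzed term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in whitespace_analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def match(index, query):
    """Return ids of documents containing at least one analyzed query term."""
    hits = set()
    for term in whitespace_analyze(query):
        hits |= index.get(term, set())
    return hits


index = build_inverted_index({"1": "The best 3-points shooter is Curry!"})
print(match(index, "Curry!"))  # {'1'}
print(match(index, "curry"))   # set()
```

The second lookup comes back empty because the whitespace analyzer neither lowercases nor strips the "!", so "curry" is a different term from "Curry!". With the standard analyzer on the field, both queries would reduce to the same term.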
Originally published: 2019-10-29