ES Tokenizers and Analyzers

郑小超.

Published 2022-09-21 09:24:41

Included in the column: GreenLeaves

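1. Normalization

Normalization is the second pass that ES applies to a text field after it has been split into tokens: each analyzer post-processes the tokens according to its own rules. Typical examples are reducing a word to a base form (for example was => is, though the built-in analyzers rely on rule-based stemming, so not every such conversion actually happens), stripping the possessive (brother's => brother), and lowercasing (Watch => watch). An analyzer may also drop stop words such as a, an, and is, which contribute little to search relevance. The exact steps differ from one analyzer to another.

Summary: normalization makes tokens more uniform and easier to match, which improves search efficiency. The normalization steps differ between analyzers.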
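To make the idea concrete, here is a minimal pure-Python sketch of three of the normalization steps mentioned above: lowercasing, possessive stripping, and stop-word removal. It only imitates the concept and is not how Elasticsearch implements it; the function name `normalize` and the stop-word list are assumptions made for this illustration.

```python
import re

# Illustrative stop-word list (an assumption); real lists are analyzer-specific.
STOP_WORDS = {"a", "an", "is", "the"}

def normalize(tokens):
    """Lowercase each token, strip a trailing possessive 's, drop stop words."""
    result = []
    for token in tokens:
        token = token.lower()                  # Watch -> watch
        token = re.sub(r"['’]s$", "", token)   # brother's -> brother
        if token in STOP_WORDS:                # a, an, is carry little search value
            continue
        result.append(token)
    return result

print(normalize(["Watch", "a", "brother's", "movie"]))
# prints ['watch', 'brother', 'movie']
```

In a real analyzer each of these steps is a separate token filter, which is why different analyzers can combine them differently.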

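2. ES character filters & token filters

3. Tokenizers (official documentation)

ES ships with more than ten built-in tokenizers. The default is the standard tokenizer: it divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm, and removes most punctuation. It is the best choice for most languages.

3.1 Common tokenizers (only two are shown here; see the documentation for the rest)

standard tokenizer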

GET _analyze
{
  "text": "Xiao chao was a good man",
  "tokenizer": "standard"
}

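The resulting tokens are:

Xiao, chao, was, a, good, man

The tokenizer by itself only splits the text and keeps the original case. Lowercasing comes from the lowercase token filter, which the standard analyzer applies on top of this tokenizer.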
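As a rough illustration of word-boundary splitting, the sketch below uses a regular expression to pull out runs of word characters. This is only an approximation made for this article: the real standard tokenizer follows the Unicode Text Segmentation rules, which handle many cases a simple regex does not. The name `simple_word_tokenize` is hypothetical.

```python
import re

def simple_word_tokenize(text):
    """Return runs of word characters, dropping punctuation and keeping case."""
    return re.findall(r"\w+", text)

print(simple_word_tokenize("Xiao chao was a good man"))
# prints ['Xiao', 'chao', 'was', 'a', 'good', 'man']
print(simple_word_tokenize("Hello, world!"))
# prints ['Hello', 'world']
```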

english analyzer

GET _analyze
{
  "text": "Xiao chao was a good man",
  "analyzer": "english"
}

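The resulting tokens are:

xiao, chao, good, man

Unlike the standard tokenizer, the english analyzer lowercases the tokens and drops words such as was and a, which have little bearing on search relevance.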

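3.2 Chinese tokenizers

For Chinese text, see the article on the ik Chinese analyzer (ES 中文分词器ik).

4. Custom analyzer

Building on the sections above, let's put together a custom analyzer.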

PUT test_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "custom_char_filter":{
          "type":"mapping",
          "mappings":[
            "&=>and",
            "|=>or",
            "!=>not"
          ]
        },
        "custom_html_strip_filter":{
          "type":"html_strip",
          "escaped_tags":["a"]
        },
        "custom_pattern_replace_filter":{
          "type":"pattern_replace",
          "pattern": "(\\d{3})\\d{4}(\\d{4})",
          "replacement": "$1****$2"
        }
      },
      "filter": {
        "custom_stop_filter":{
          "type": "stop",
          "ignore_case": true,
          "stopwords": [ "and", "is","friend" ]
        }
      },
      "tokenizer": {
        "custom_tokenizer":{
          "type":"pattern",
          "pattern":"[ ,!.?]"
        }
      }, 
      "analyzer": {
        "custom_analyzer":{
          "type":"custom",
          "tokenizer":"custom_tokenizer",
          "char_filter":["custom_char_filter","custom_html_strip_filter","custom_pattern_replace_filter"],
          "filter":["custom_stop_filter"]
        }
      }
    }
  }
}

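This custom analyzer uses three character filters (mapping, html_strip, and pattern_replace) and one token filter (a stop filter).

For the filters, see ES character filters & token filters; for tokenizers, see the ES tokenizers article. The example uses the pattern tokenizer; see the official documentation for details.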
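The effect of the three character filters can be imitated in plain Python to see what text reaches the tokenizer. The sketch below is a simplified stand-in under stated assumptions, not the Elasticsearch implementation: the mapping step uses sequential string replacement, and the html_strip step simply turns every non-escaped tag into a newline. All function names are hypothetical.

```python
import re

def mapping_filter(text, mappings):
    """Stand-in for the mapping char filter: replace each key with its value."""
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text

def html_strip_filter(text, escaped_tags=("a",)):
    """Crude stand-in for html_strip: non-escaped tags become a newline."""
    def replace_tag(match):
        name = match.group(1).lower()
        return match.group(0) if name in escaped_tags else "\n"
    return re.sub(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>", replace_tag, text)

def pattern_replace_filter(text):
    """Mask the middle four digits of an 11-digit number."""
    return re.sub(r"(\d{3})\d{4}(\d{4})", r"\1****\2", text)

# Apply the filters in the order listed in the analyzer's char_filter array.
sample = "&.|,!?13366666666.You and me is Friend <p>超链接</p>"
filtered = mapping_filter(sample, {"&": "and", "|": "or", "!": "not"})
filtered = html_strip_filter(filtered)
filtered = pattern_replace_filter(filtered)
print(repr(filtered))
# prints 'and.or,not?133****6666.You and me is Friend \n超链接\n'
```

Character filters run on the raw string before tokenization, so the tokenizer never sees the original &, |, !, the <p> tags, or the unmasked number.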

After creating the index with the request above, run the following _analyze request:

GET test_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text":"&.|,!?13366666666.You and me is Friend <p>超链接</p>"
}

The result is as follows:

{
  "tokens" : [
    {
      "token" : "or",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "not",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "133****6666",
      "start_offset" : 6,
      "end_offset" : 17,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "You",
      "start_offset" : 18,
      "end_offset" : 21,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "me",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : """
超链接
""",
      "start_offset" : 39,
      "end_offset" : 49,
      "type" : "word",
      "position" : 9
    }
  ]
}
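The gaps in the position values (5, 7, and 8 are missing, and there is no position 0) come from the stop filter: removed tokens still occupy a position. The sketch below imitates the tokenizer and stop-filter stages in plain Python to show this. It is an approximation for illustration only; the input string is assumed to be the text after the three character filters have run, and the function names are hypothetical.

```python
import re

STOP_WORDS = {"and", "is", "friend"}   # same list as custom_stop_filter

def tokenize(text):
    """Split on any of [ ,!.?] and drop empty pieces, like the pattern tokenizer."""
    return [piece for piece in re.split(r"[ ,!.?]", text) if piece]

def stop_filter(tokens):
    """Drop stop words case-insensitively, keeping each token's original position."""
    return [(position, token)
            for position, token in enumerate(tokens)
            if token.lower() not in STOP_WORDS]

# Assumed output of the character filters for the sample text.
filtered_text = "and.or,not?133****6666.You and me is Friend \n超链接\n"
for position, token in stop_filter(tokenize(filtered_text)):
    print(position, repr(token))
# prints:
# 1 'or'
# 2 'not'
# 3 '133****6666'
# 4 'You'
# 6 'me'
# 9 '\n超链接\n'
```

The last token spans several lines in the response because the <p> tags were replaced with newlines, and a newline is not one of the split characters of the pattern tokenizer.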

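This article is part of the Tencent Cloud self-media syndication program and is shared from the author's personal site/blog.

Originally published: 2022-08-09. In case of infringement, contact cloudcommunity@tencent.com for removal.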