Elasticsearch Analysis: Custom Analyzers

Original article

HLee

Last modified: 2021-08-13 17:38:41

Published in the column: 房东的猫

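Introduction

Elasticsearch ships with a number of ready-made analyzers, but its real strength in analysis is that you can build custom analyzers by combining character filters, tokenizers, and token filters in a configuration that suits your particular data.

As discussed in "Analysis and Analyzers", an analyzer is a wrapper that bundles three kinds of functions into a single package. They run in sequence: character filters first, then the tokenizer, then the token filters.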

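Character filters

Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-charfilters.html

Character filters tidy up a string before it is tokenized. For example, if the text is HTML, it contains tags such as <p> or <div> that we do not want to index. The html_strip character filter removes all HTML tags and also decodes HTML entities, for instance turning &Aacute; into the corresponding Unicode character Á.

An analyzer may have zero or more character filters. They operate on the raw text, are applied in the order listed, and affect the position and offset information reported by the tokenizer. Elasticsearch ships with the following built-in character filters.

html_strip: strips HTML tags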

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<br>you know, for search</br>"
}

Result:
{
  "tokens" : [
    {
      "token" : """
                you know, for search
                """,
      "start_offset" : 0,
      "end_offset" : 29,
      "type" : "word",
      "position" : 0
    }
  ]
}

mapping: string replacement

GET /_analyze
{
  "tokenizer": "whitespace",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "- => "
      ]
    },
    "html_strip"
  ],
  "text": "<br>中国-北京 中国-台湾 中国-人民</br>"
}

Result:
{
  "tokens" : [
    {
      "token" : "中国北京",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "中国台湾",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "中国人民",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "word",
      "position" : 2
    }
  ]
}

pattern_replace: regular-expression replacement

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "https?://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "https://www.elastic.co"
}

Result:
{
  "tokens" : [
    {
      "token" : "www.elastic.co",
      "start_offset" : 0,
      "end_offset" : 22,
      "type" : "word",
      "position" : 0
    }
  ]
}
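
To see what the pattern and replacement do on their own, the same substitution can be reproduced outside Elasticsearch. The following is a minimal Python sketch, not Elasticsearch code; the function name strip_scheme is made up for illustration. Elasticsearch uses Java regular expressions, where the first capture group is written $1, while Python's re module writes it as \1.

```python
import re


def strip_scheme(text: str) -> str:
    """Mimic the pattern_replace request above: keep only capture group 1."""
    # Same pattern as in the request; r"\1" is Python's spelling of Java's "$1".
    return re.sub(r"https?://(.*)", r"\1", text)


print(strip_scheme("https://www.elastic.co"))  # www.elastic.co
```

Text that does not match the pattern passes through unchanged, which is also how a replacement-style character filter behaves on non-matching input.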

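Tokenizers

Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html

An analyzer must have exactly one tokenizer. The tokenizer breaks the string into individual terms, or tokens. The standard tokenizer, which the standard analyzer uses, splits a string into terms on word boundaries and removes most punctuation, but other tokenizers behave differently.

For example, the keyword tokenizer outputs exactly the string it received, without any tokenization. The whitespace tokenizer splits text on whitespace only. The pattern tokenizer splits text wherever a regular expression matches.

In short, the tokenizer splits the text into terms according to some rule. Elasticsearch ships with the following built-in tokenizers.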

standard
letter
lowercase
whitespace
uax_url_email
classic
thai
ngram
edge_ngram
keyword
pattern
simple_pattern
char_group
simple_pattern_split
path_hierarchy

GET /_analyze
{
  "tokenizer": "path_hierarchy",
  "text": ["/usr/local/bin/java"]
}

Result:
{
  "tokens" : [
    {
      "token" : "/usr",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/local",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/local/bin",
      "start_offset" : 0,
      "end_offset" : 14,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/local/bin/java",
      "start_offset" : 0,
      "end_offset" : 19,
      "type" : "word",
      "position" : 0
    }
  ]
}
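
The output above contains one token per path prefix. As a rough illustration of that idea, here is a small Python sketch. It is not the Elasticsearch implementation: the function name path_hierarchy is chosen only to echo the tokenizer name, and it handles only the simple case of a single-character delimiter with no trailing delimiter.

```python
from typing import List


def path_hierarchy(text: str, delimiter: str = "/") -> List[str]:
    """Return each prefix that ends right before a delimiter, then the full text."""
    tokens = []
    for i, ch in enumerate(text):
        # A delimiter that is not the very first character closes a prefix.
        if ch == delimiter and i > 0:
            tokens.append(text[:i])
    # The complete path is the last token (skipped for empty input or a trailing delimiter).
    if text and not text.endswith(delimiter):
        tokens.append(text)
    return tokens


print(path_hierarchy("/usr/local/bin/java"))
# ['/usr', '/usr/local', '/usr/local/bin', '/usr/local/bin/java']
```

Indexing every prefix this way is what lets a query for /usr/local match documents stored under any deeper path.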

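Token filters

Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html

After tokenization, the resulting token stream passes through the configured token filters in the order in which they are listed.

Token filters can modify, add, or remove tokens. We have already mentioned the lowercase and stop token filters, but Elasticsearch offers many more. Stemming token filters reduce words to their stems. The asciifolding filter removes diacritics, converting a word like "très" into "tres". The ngram and edge_ngram token filters produce tokens suitable for partial matching or autocomplete.

In short, token filters add, modify, or delete the terms emitted by the tokenizer. Some of the built-in token filters are listed below.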

lowercase
stop

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"],
  "text": ["how are you i am fine thank you"]
}

uppercase
reverse
length
ngram
edge_ngram
pattern_replace
trim
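
The stop request above is shown without its response. As a hedged illustration of what a whitespace tokenizer followed by a stop filter does, here is a pure-Python sketch. The word list is an assumption: it is what I understand the default _english_ stop word list to contain, so check the stop token filter documentation for your version. The helper names are made up for this example.

```python
# Assumed contents of the default _english_ stop word list (verify against the docs).
ENGLISH_STOP_WORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)


def whitespace_tokenize(text):
    """Split on runs of whitespace, like a whitespace tokenizer."""
    return text.split()


def stop_filter(tokens, stopwords=ENGLISH_STOP_WORDS):
    """Drop every token that appears in the stop word list (case-sensitive)."""
    return [t for t in tokens if t not in stopwords]


tokens = stop_filter(whitespace_tokenize("how are you i am fine thank you"))
print(tokens)  # ['how', 'you', 'i', 'am', 'fine', 'thank', 'you']
```

Under that assumed list, only "are" is dropped from the sample sentence; the remaining words are not stop words.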

Custom analyzers

Character filters, tokenizers, and token filters are each configured in their own section under analysis:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": { ... custom character filters ... },
            "tokenizer":   { ... custom tokenizers ... },
            "filter":      { ... custom token filters ... },
            "analyzer":    { ... custom analyzers ... }
        }
    }
}

As a demonstration, let's build a custom analyzer that does the following:

Removes HTML using the html_strip character filter.
Replaces & with " and " using a custom mapping character filter:

"char_filter": {
    "&_to_and": {
        "type":       "mapping",
        "mappings": [ "&=> and "]
    }
}

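Tokenizes with the standard tokenizer.
Lowercases terms with the lowercase token filter.
Removes the words in a custom stop word list with a custom stop token filter: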

"filter": {
    "my_stopwords": {
        "type":        "stop",
        "stopwords": [ "the", "a" ]
    }
}

The analyzer definition then combines the predefined tokenizer and filters with the custom filters configured above:

"analyzer": {
    "my_analyzer": {
        "type":           "custom",
        "char_filter":  [ "html_strip", "&_to_and" ],
        "tokenizer":      "standard",
        "filter":       [ "lowercase", "my_stopwords" ]
    }
}

The complete create-index request looks like this. In my_analyzer, html_strip removes the HTML tags, &_to_and turns & into " and ", and lowercase converts terms to lowercase:

curl -X PUT "localhost:9200/my_index?pretty" -H 'Content-Type: application/json' -d'
{
    "settings":{
        "analysis":{
            "char_filter":{
                "&_to_and":{
                    "type":"mapping",
                    "mappings":[
                        "&=> and "
                    ]
                }
            },
            "filter":{
                "my_stopwords":{
                    "type":"stop",
                    "stopwords":[
                        "the",
                        "a"
                    ]
                }
            },
            "analyzer":{
                "my_analyzer":{
                    "type":"custom",
                    "char_filter":[
                        "html_strip",
                        "&_to_and"
                    ],
                    "tokenizer":"standard",
                    "filter":[
                        "lowercase",
                        "my_stopwords"
                    ]
                }
            }
        }
    }
}
'

Once the index has been created, use the analyze API to test the new analyzer:

GET /my_index/_analyze
{
    "analyzer": "my_analyzer",
    "text": "The quick & brown fox"
}

Result:
{
  "tokens" : [
    {
      "token" : "quick",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "and",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "brown",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "fox",
      "start_offset" : 18,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}
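
To make the order of the three stages concrete, the sketch below imitates my_analyzer in plain Python. It is only an approximation under stated assumptions: the regex-based html_strip and the \w+ tokenizer are crude stand-ins for the real html_strip character filter and standard tokenizer, and all function names are invented for this example.

```python
import re
from typing import List


def html_strip(text: str) -> str:
    # Crude stand-in for html_strip: drop anything that looks like a tag.
    return re.sub(r"<[^>]+>", "", text)


def amp_to_and(text: str) -> str:
    # Same rule as the "&_to_and" mapping character filter.
    return text.replace("&", " and ")


def standard_like_tokenize(text: str) -> List[str]:
    # Rough approximation of the standard tokenizer: runs of word characters.
    return re.findall(r"\w+", text)


def my_analyzer(text: str) -> List[str]:
    for char_filter in (html_strip, amp_to_and):  # 1. character filters, in order
        text = char_filter(text)
    tokens = standard_like_tokenize(text)         # 2. exactly one tokenizer
    tokens = [t.lower() for t in tokens]          # 3a. lowercase token filter
    stopwords = {"the", "a"}                      # 3b. my_stopwords token filter
    return [t for t in tokens if t not in stopwords]


print(my_analyzer("The quick & brown fox"))  # ['quick', 'and', 'brown', 'fox']
```

The order matters: because lowercase runs before the stop filter, "The" is lowercased to "the" first and is then removed by the stop word list.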

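The following second example defines a custom character filter, a custom tokenizer, and a custom token filter in a single create-index request: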

PUT /my_analysis
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [
            "my_char_filter"
            ],
          "tokenizer": "my_tokenizer",
          "filter": [
            "my_tokenizer_filter"
            ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": ["_ => "]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[,.!? ]"
        }
      },
      "filter": {
        "my_tokenizer_filter": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

Once the index has been created, use the analyze API to test the new analyzer:

POST /my_analysis/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["Hello Kitty!, A_n_d you?"]
}
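
The response to this request is not shown here. The Python sketch below walks through the three configured stages by hand, as an approximation rather than actual Elasticsearch output. It assumes the stop filter uses the predefined _english_ list with its default case-sensitive matching, and the word list itself is my assumption of that list's contents; the function name is hypothetical.

```python
import re

# Assumed contents of the _english_ stop word list (verify against the docs).
ENGLISH_STOP_WORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)


def my_analysis_sketch(text):
    # my_char_filter: the mapping "_ => " deletes every underscore.
    text = text.replace("_", "")
    # my_tokenizer: split wherever [,.!? ] matches and discard empty pieces.
    tokens = [t for t in re.split(r"[,.!? ]", text) if t]
    # my_tokenizer_filter: case-sensitive stop word removal.
    return [t for t in tokens if t not in ENGLISH_STOP_WORDS]


print(my_analysis_sketch("Hello Kitty!, A_n_d you?"))
# ['Hello', 'Kitty', 'And', 'you']
```

In this sketch "And" survives because the comparison is case-sensitive and the analyzer has no lowercase filter; adding lowercase before the stop filter would be one way to remove it.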

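Original content notice: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

In case of infringement, please contact cloudcommunity@tencent.com to request removal.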
