Elasticsearch Analyzer

Full-text search in Elasticsearch is built on text analysis, and text analysis is carried out by an analyzer.

1 Analyzer Types

1.1 Built-in Analyzer

Elasticsearch ships with a number of analyzers that work out of the box. The Standard Analyzer is the default and covers most scenarios; a quick comparison of two built-in analyzers with the Analyze API is sketched after the list.

  • Standard Analyzer: splits text into terms at word boundaries, as determined by the Unicode Text Segmentation algorithm. It removes most punctuation and lowercases the resulting terms.
  • Simple Analyzer: splits text into terms at any non-letter character and lowercases the resulting terms.
  • Whitespace Analyzer: splits text into terms at whitespace characters and does not lowercase the terms.
  • Stop Analyzer: like the Simple Analyzer, but also removes stop words.
  • Keyword Analyzer: a noop analyzer that does not split the text at all; the whole input is emitted as a single term.
  • Pattern Analyzer: splits text according to a regular expression.
  • Language Analyzers: language-specific analyzers such as english and french.
  • Fingerprint Analyzer: a specialist analyzer mainly used for duplicate detection.
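As a quick illustration, the Analyze API (covered in section 4) can be used to compare built-in analyzers on the same input. This is a sketch; the sample text is arbitrary:

GET /_analyze
{
    "analyzer": "standard",
    "text": "The QUICK Brown-Foxes."
}

GET /_analyze
{
    "analyzer": "whitespace",
    "text": "The QUICK Brown-Foxes."
}

The standard analyzer should return the lowercased terms [the, quick, brown, foxes] with the punctuation removed, while the whitespace analyzer should return [The, QUICK, Brown-Foxes.] unchanged apart from the split on whitespace.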

1.2 Custom Analyzer

If none of the built-in analyzers fits your needs, you can create an analyzer of type custom, composed of:

  • zero or more character filters
  • exactly one tokenizer
  • zero or more token filters

Character filters, tokenizers and token filters all come in two flavors: built-in and custom.
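A minimal sketch of such a definition, using only built-in building blocks (the index name and analyzer name here are arbitrary):

PUT /my-index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip"],
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop"]
                }
            }
        }
    }
}

This analyzer strips HTML first, then tokenizes at word boundaries, and finally lowercases the tokens and removes English stop words.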

2 The Structure of an Elasticsearch Analyzer

In general, an Elasticsearch analyzer is a pipeline of character filters, a tokenizer and token filters, and it contains exactly one tokenizer.

2.1 Character filter

A character filter preprocesses the character stream before it reaches the tokenizer. Elasticsearch ships with three built-in character filters; the first one is demonstrated after the list.

  • HTML Strip Character Filter: strips HTML elements such as <b> from the text and decodes HTML entities such as &amp; into &.
  • Mapping Character Filter: replaces every occurrence of the configured keys with their associated values, much like applying a lookup Map to the character stream.
  • Pattern Replace Character Filter: replaces characters that match a regular expression.
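For example, the html_strip character filter can be tried directly with the Analyze API. This is a sketch; the keyword tokenizer is used so that the filtered text comes back as a single token:

GET /_analyze
{
    "tokenizer": "keyword",
    "char_filter": ["html_strip"],
    "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

The expected output is a single token reading I'm so happy!, with the tags stripped and &apos; decoded into an apostrophe.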

2.2 Tokenizer

The tokenizer splits the character stream into individual tokens and also records each token's type, its position, and the start and end character offsets of the token in the original text. Elasticsearch ships with more than ten tokenizers, which fall into three groups: word oriented tokenizers, partial word tokenizers and structured text tokenizers.

2.2.1 Word Oriented Tokenizer

Word oriented tokenizers split full text into individual words. The most commonly used ones are listed below; a short comparison via the Analyze API follows the list.

  • Standard Tokenizer: splits text into terms at word boundaries, as determined by the Unicode Text Segmentation algorithm, and removes most punctuation.
  • Letter Tokenizer: splits text into terms at any non-letter character.
  • Lowercase Tokenizer: like the Letter Tokenizer, but also lowercases every term.
  • Whitespace Tokenizer: splits text into terms at whitespace characters.
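A small comparison via the Analyze API (a sketch with arbitrary sample text):

GET /_analyze
{
    "tokenizer": "letter",
    "text": "Hello-World v2"
}

GET /_analyze
{
    "tokenizer": "whitespace",
    "text": "Hello-World v2"
}

The letter tokenizer should return [Hello, World, v], splitting at the hyphen, the space and the digit, while the whitespace tokenizer should return [Hello-World, v2].
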
2.2.2 Partial Word Tokenizer

Partial word tokenizers break text or words into small fragments, typically for partial-word matching; an example follows the list.

  • N-Gram Tokenizer: for example, quick → [qu, ui, ic, ck].
  • Edge N-Gram Tokenizer: for example, quick → [q, qu, qui, quic, quick].
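The edge n-gram example above can be reproduced with the Analyze API; the tokenizer is defined inline because its default max_gram is only 2 (a sketch):

GET /_analyze
{
    "tokenizer": {
        "type": "edge_ngram",
        "min_gram": 1,
        "max_gram": 5
    },
    "text": "quick"
}

This should return the tokens q, qu, qui, quic and quick.
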
2.2.3 Structured Text Tokenizer

Structured text tokenizers are meant for structured text such as identifiers, email addresses and file paths rather than for full text; a path_hierarchy example follows the list.

  • Keyword Tokenizer: does not split at all; the entire input is emitted as a single term.
  • Pattern Tokenizer: splits text according to a regular expression.
  • Path Hierarchy Tokenizer: for example, /foo/bar/baz → [/foo, /foo/bar, /foo/bar/baz].
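The path example corresponds to the path_hierarchy tokenizer and can be checked like this (a sketch):

GET /_analyze
{
    "tokenizer": "path_hierarchy",
    "text": "/foo/bar/baz"
}

The expected tokens are /foo, /foo/bar and /foo/bar/baz.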

2.3 Token filter

A token filter post-processes the token stream produced by the tokenizer, adding, removing or modifying tokens. Elasticsearch ships with more than forty built-in token filters, too many to cover one by one here.
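As a small example (a sketch), the lowercase and stop token filters can be chained after the standard tokenizer in the Analyze API:

GET /_analyze
{
    "tokenizer": "standard",
    "filter": ["lowercase", "stop"],
    "text": "The Quick Brown Foxes"
}

The expected tokens are quick, brown and foxes: everything is lowercased first, and the English stop word the is then removed.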

3 Specify the analyzer for a text field

The analyzer mapping parameter assigns an analyzer to a specific field. Once it is set, that analyzer is used for text analysis of the field at both index time and search time (unless a separate search_analyzer is configured).
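A minimal sketch, with an arbitrary index and field name:

PUT /my-index
{
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "whitespace"
            }
        }
    }
}

With this mapping, the whitespace analyzer is applied to title when documents are indexed and, because no search_analyzer is configured, to query text against title as well.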

4 Analyze API

Text analysis can be performed directly with the Analyze API, which returns the resulting tokens.

4.1 Request

Request Method      URL
GET or POST         /_analyze
GET or POST         /{index}/_analyze

4.2 Path parameters

  • index (optional): perform text analysis with the analyzer of a specific field in this index.

4.3 Query parameters

  • analyzer (optional): the analyzer to apply, built from character filters, a tokenizer and token filters.
  • char_filter (optional): character filters to apply before the tokenizer.
  • tokenizer (optional): the tokenizer to use.
  • filter (optional): token filters to apply after the tokenizer.
  • text (required): the text to analyze.
  • field (optional): derive the analyzer from this field; when field is used, the index path parameter must also be provided.
  • normalizer (optional): a normalizer to apply, which converts the text into a single term.

Note: a normalizer is a simplified analyzer without a tokenizer; in other words, a normalizer always produces a single token.
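A minimal normalizer sketch (index, normalizer and field names are arbitrary); a normalizer is defined in the index settings and can only be attached to keyword fields:

PUT /my-index
{
    "settings": {
        "analysis": {
            "normalizer": {
                "my_normalizer": {
                    "type": "custom",
                    "char_filter": [],
                    "filter": ["lowercase", "asciifolding"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "code": {
                "type": "keyword",
                "normalizer": "my_normalizer"
            }
        }
    }
}

GET /my-index/_analyze
{
    "normalizer": "my_normalizer",
    "text": "Sline-WEBAPP"
}

The second request should return the single term sline-webapp.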

4.4 Trying it out

GET /_analyze
{
    "tokenizer": "standard",
    "text": "sline-admin-webapp"
}

The response looks like this:

{
    "tokens": [
        {
            "token": "sline",
            "start_offset": 0,
            "end_offset": 5,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "admin",
            "start_offset": 6,
            "end_offset": 11,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "webapp",
            "start_offset": 12,
            "end_offset": 18,
            "type": "<ALPHANUM>",
            "position": 2
        }
    ]
}

5 A Custom Analyzer

5.1 Requirement

Microservice logs are collected and stored with Filebeat, Logstash and Elasticsearch. We need fuzzy (substring) search on the moduleName field, which holds the microservice instance name; that name contains only English letters and the - separator.

5.2 Implementation

The built-in standard tokenizer does not split text into individual characters on its own, so we arrange character-level splitting with a character filter: a mapping character filter rewrites every letter x as -x-, turning moduleName into a -single character- form; sline-webapp, for example, becomes -s-l-i-n-e---w-e-b-a-p-p-. The standard tokenizer then splits this on the hyphens and emits one token per character. Next, we update the index template so that this custom analyzer is applied to the moduleName field at both index time and search time. Finally, the fuzzy search itself is simply a match_phrase query.
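Before touching the template, the character filter plus standard tokenizer chain can be previewed with the Analyze API. This is a sketch with the mapping filter defined inline and trimmed down to the letters that occur in the sample text:

GET /_analyze
{
    "tokenizer": "standard",
    "char_filter": [
        {
            "type": "mapping",
            "mappings": [
                "a => -a-",
                "b => -b-",
                "e => -e-",
                "i => -i-",
                "l => -l-",
                "n => -n-",
                "p => -p-",
                "s => -s-",
                "w => -w-"
            ]
        }
    ],
    "text": "sline-webapp"
}

The filter rewrites the text to -s-l-i-n-e---w-e-b-a-p-p-, and the standard tokenizer splits it on the hyphens, so the expected tokens are the single characters s, l, i, n, e, w, e, b, a, p and p, in order. Because the same analyzer is applied to query text, a match_phrase query for web is analyzed into the consecutive tokens w, e, b and therefore matches any moduleName containing that substring.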

5.2.1 Update the index template
PUT /_index_template/sline-system-log-template
{
    "index_patterns": [
        "elk-*"
    ],
    "template": {
        "settings": {
            "lifecycle": {
                "name": "sline-system-log-ilm-policy",
                "rollover_alias": "sline-system-log-ilm-policy-alias"
            },
            "number_of_shards": "1",
            "max_result_window": "1000000",
            "number_of_replicas": "1",
            "analysis": {
                "analyzer": {
                    "sline_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "char_filter": [
                            "sline_char_filter"
                        ]
                    }
                },
                "char_filter": {
                    "sline_char_filter": {
                        "type": "mapping",
                        "mappings": [
                            "a => -a-",
                            "b => -b-",
                            "c => -c-",
                            "d => -d-",
                            "e => -e-",
                            "f => -f-",
                            "g => -g-",
                            "h => -h-",
                            "i => -i-",
                            "j => -j-",
                            "k => -k-",
                            "l => -l-",
                            "m => -m-",
                            "n => -n-",
                            "o => -o-",
                            "p => -p-",
                            "q => -q-",
                            "r => -r-",
                            "s => -s-",
                            "t => -t-",
                            "u => -u-",
                            "v => -v-",
                            "w => -w-",
                            "x => -x-",
                            "y => -y-",
                            "z => -z-"
                        ]
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                "@timestamp": {
                    "type": "date"
                },
                "@version": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "className": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "lineNum": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "logLevel": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "message": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "methodName": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    },
                    "analyzer": "sline_analyzer"
                },
                "moduleName": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    },
                    "analyzer": "sline_analyzer"
                },
                "systemName": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    },
                    "analyzer": "sline_analyzer"
                },
                "threadName": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "timestamp": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                }
            }
        }
    }
}
5.2.2 Verify that the index template took effect

The check is simply to inspect a newly created log index and confirm that its mappings and settings picked up the custom analyzer.

GET /elk-2021.01.31
{
    "elk-2021.01.31": {
        "aliases": {},
        "mappings": {
            "properties": {
                "@timestamp": {
                    "type": "date"
                },
                "@version": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "className": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "lineNum": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "logLevel": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "message": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "methodName": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    },
                    "analyzer": "sline_analyzer"
                },
                "moduleName": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    },
                    "analyzer": "sline_analyzer"
                },
                "systemName": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    },
                    "analyzer": "sline_analyzer"
                },
                "threadName": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "timestamp": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                }
            }
        },
        "settings": {
            "index": {
                "lifecycle": {
                    "name": "sline-system-log-ilm-policy",
                    "rollover_alias": "sline-system-log-ilm-policy-alias"
                },
                "number_of_shards": "1",
                "provided_name": "elk-2021.01.31",
                "max_result_window": "1000000",
                "creation_date": "1612051512356",
                "analysis": {
                    "analyzer": {
                        "sline_analyzer": {
                            "type": "custom",
                            "char_filter": [
                                "sline_char_filter"
                            ],
                            "tokenizer": "standard"
                        }
                    },
                    "char_filter": {
                        "sline_char_filter": {
                            "type": "mapping",
                            "mappings": [
                                "a => -a-",
                                "b => -b-",
                                "c => -c-",
                                "d => -d-",
                                "e => -e-",
                                "f => -f-",
                                "g => -g-",
                                "h => -h-",
                                "i => -i-",
                                "j => -j-",
                                "k => -k-",
                                "l => -l-",
                                "m => -m-",
                                "n => -n-",
                                "o => -o-",
                                "p => -p-",
                                "q => -q-",
                                "r => -r-",
                                "s => -s-",
                                "t => -t-",
                                "u => -u-",
                                "v => -v-",
                                "w => -w-",
                                "x => -x-",
                                "y => -y-",
                                "z => -z-"
                            ]
                        }
                    }
                },
                "number_of_replicas": "1",
                "uuid": "C7rmtPmVTAeaN_6dW0vwfA",
                "version": {
                    "created": "7090199"
                }
            }
        }
    }
}
5.2.3 Fuzzy search
GET /elk-2021.01.31/_search
{
    "from": 0,
    "size": 10,
    "timeout": "10s",
    "_source": {
        "exclude": [
            "@version",
            "@timestamp"
        ]
    },
    "track_total_hits": true,
    "query": {
        "bool": {
            "must": [
                {
                    "match_phrase": {
                        "moduleName": "web"
                    }
                }
            ]
        }
    },
    "sort": [
        {
            "timestamp.keyword": {
                "order": "desc"
            }
        }
    ]
}

The search results are as follows:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 2,
            "relation": "eq"
        },
        "max_score": null,
        "hits": [
            {
                "_index": "elk-2021.01.31",
                "_type": "_doc",
                "_id": "JB9aYHcBbZJ5iJayD4Mj",
                "_score": null,
                "_source": {
                    "systemName": "ccn",
                    "logLevel": "INFO",
                    "moduleName": "ccn-webapp",
                    "lineNum": "1093",
                    "methodName": "getAndStoreFullRegistry",
                    "className": "com.netflix.discovery.DiscoveryClient",
                    "message": "Getting all instance registry info from the eureka server",
                    "threadName": "main",
                    "timestamp": "2021-01-31 09:27:29"
                },
                "sort": [
                    "2021-01-31 09:27:29"
                ]
            },
            {
                "_index": "elk-2021.01.31",
                "_type": "_doc",
                "_id": "KB9aYHcBbZJ5iJayD4M2",
                "_score": null,
                "_source": {
                    "systemName": "ccn",
                    "logLevel": "INFO",
                    "moduleName": "ccn-webapp",
                    "lineNum": "60",
                    "methodName": "<init>",
                    "className": "com.netflix.discovery.InstanceInfoReplicator",
                    "message": "InstanceInfoReplicator onDemand update allowed rate per min is 4",
                    "threadName": "main",
                    "timestamp": "2021-01-31 09:27:29"
                },
                "sort": [
                    "2021-01-31 09:27:29"
                ]
            }
        ]
    }
}

6 References

  1. https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-overview.html
  2. https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
  3. https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-charfilters.html
  4. https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html
  5. https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html
  6. https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html