
Elasticsearch Analysis: The Pinyin Analyzer

Original post by HLee, modified 2021-07-16 18:06:44. From the column: 房东的猫 (The Landlord's Cat).

Introduction

GitHub repository: https://github.com/medcl/elasticsearch-analysis-pinyin

Optional Parameters

  • lowercase: lowercase non-Chinese letters. Default: true
  • keep_first_letter: when enabled, keeps the first letter of each syllable, e.g. 刘德华 > ldh. Default: true
  • keep_separate_first_letter: when enabled, keeps first letters as separate terms, e.g. 刘德华 > l,d,h. Default: false. NOTE: query results may become too fuzzy because single-letter terms are extremely frequent
  • limit_first_letter_length: maximum length of the first_letter result. Default: 16
  • keep_full_pinyin: when enabled, keeps the full pinyin of each character, e.g. 刘德华 > [liu,de,hua]. Default: true
  • keep_joined_full_pinyin: when enabled, joins the full pinyin into one term, e.g. 刘德华 > [liudehua]. Default: false
  • keep_none_chinese: keep non-Chinese letters and numbers in the result. Default: true
  • keep_none_chinese_together: keep non-Chinese letters together, e.g. DJ音乐家 -> DJ,yin,yue,jia; when set to false: DJ音乐家 -> D,J,yin,yue,jia. Default: true. NOTE: keep_none_chinese must be enabled first
  • keep_none_chinese_in_first_letter: keep non-Chinese letters in the first_letter result, e.g. 刘德华AT2016 -> ldhat2016. Default: true
  • keep_none_chinese_in_joined_full_pinyin: keep non-Chinese letters in the joined full pinyin, e.g. 刘德华2016 -> liudehua2016. Default: false
  • none_chinese_pinyin_tokenize: break non-Chinese letters into separate pinyin terms if they are valid pinyin, e.g. liudehuaalibaba13zhuanghan -> liu,de,hua,a,li,ba,ba,13,zhuang,han. Default: true. NOTE: keep_none_chinese and keep_none_chinese_together must be enabled first
  • keep_original: when enabled, also keeps the original input as a term. Default: false
  • trim_whitespace: trim leading and trailing whitespace. Default: true
  • remove_duplicated_term: when enabled, duplicate terms are removed to save index space, e.g. de的 > de. Default: false. NOTE: position-related queries may be affected
  • ignore_pinyin_offset: since Elasticsearch 6.0, offsets are strictly constrained and overlapping tokens are not allowed. With this parameter, overlapping tokens are allowed by ignoring the offset. Note that all position- and offset-related queries and highlighting will then be incorrect; use multi-fields with different settings for different query purposes. If you need offsets, set this to false. Default: true
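To make the interplay of the main options concrete, here is a minimal Python sketch. This is not the plugin's real implementation: it uses a hardcoded pinyin table for 刘德华 and only models keep_full_pinyin, keep_joined_full_pinyin, keep_first_letter, and limit_first_letter_length.

```python
# Hardcoded pinyin table for illustration only; the real plugin uses a
# full character-to-pinyin dictionary (nlp-lang).
PINYIN = {"刘": "liu", "德": "de", "华": "hua"}

def pinyin_tokens(text, keep_first_letter=True, keep_full_pinyin=True,
                  keep_joined_full_pinyin=False, limit_first_letter_length=16):
    """Rough model of which terms the pinyin tokenizer would emit."""
    syllables = [PINYIN[ch] for ch in text if ch in PINYIN]
    tokens = []
    if keep_full_pinyin:
        tokens.extend(syllables)                 # liu, de, hua
    if keep_joined_full_pinyin:
        tokens.append("".join(syllables))        # liudehua
    if keep_first_letter:
        first = "".join(s[0] for s in syllables)
        tokens.append(first[:limit_first_letter_length])  # ldh
    return tokens

print(pinyin_tokens("刘德华", keep_joined_full_pinyin=True))
# ['liu', 'de', 'hua', 'liudehua', 'ldh']
```

Flipping the flags shows why fuzziness grows: every enabled option adds more index terms for the same input, which is also why the plugin recommends multi-fields with different settings per query purpose.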

Setting

PUT /doctor
{
    "settings":{
        "number_of_shards":1,
        "analysis":{
            "analyzer":{
                "pinyin_analyzer":{
                    "type":"custom",
                    "tokenizer":"my_pinyin"
                }
            },
            "tokenizer":{
                "my_pinyin":{
                    "type":"pinyin",
                    "lowercase":true,
                    "keep_original":false,
                    "keep_first_letter":true,
                    "keep_separate_first_letter":true,
                    "limit_first_letter_length":16,
                    "keep_full_pinyin":true,
                    "keep_none_chinese_in_joined_full_pinyin":true,
                    "keep_joined_full_pinyin":true
                }
            }
        }
    }
}
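With the index created, the analyzer still has to be bound to a field before it affects search. The plugin's README does this with a multi-field mapping; the name field below is illustrative, so adapt it to your own schema:

```json
PUT /doctor/_mapping
{
    "properties": {
        "name": {
            "type": "keyword",
            "fields": {
                "pinyin": {
                    "type": "text",
                    "store": false,
                    "term_vector": "with_offsets",
                    "analyzer": "pinyin_analyzer",
                    "boost": 10
                }
            }
        }
    }
}
```

Keeping the raw keyword alongside the pinyin sub-field lets exact-match and pinyin-match queries coexist on one source value.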

Analyzer

POST /doctor/_analyze
{
  "analyzer": "pinyin_analyzer",
  "text": "刘德华2019"
}

Result:
{
  "tokens" : [
    {
      "token" : "l",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "liu",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "d",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "de",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "h",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "hua",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "2019",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "liudehua2019",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "ldh2019",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 3
    }
  ]
}
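The token list above shows why pinyin search works: the index holds single letters (l, d, h), full syllables (liu, de, hua), the joined form (liudehua2019), and the first-letter form (ldh2019) for the same input. Assuming a text sub-field such as name.pinyin mapped to pinyin_analyzer (as in the plugin's README), a document can then be found by any of those forms, e.g. by first letters alone:

```json
GET /doctor/_search
{
    "query": {
        "match": {
            "name.pinyin": "ldh"
        }
    }
}
```

A query for liudehua or 刘德华 against the same field would match as well, since the query string is analyzed with the same pinyin analyzer.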

Original statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

In case of infringement, please contact cloudcommunity@tencent.com for removal.
