ElasticSearch 6.x 学习笔记：14.mapping参数

程裕强

发布于 2022-05-06 19:12:56

1.3K0

发布于 2022-05-06 19:12:56

14.1 mapping参数概述

官方文档 https://www.elastic.co/guide/en/elasticsearch/reference/6.1/mapping-params.html ElasticSearch提供了丰富的映射参数对字段的映射进行参数设计，比如字段的分词器、字段权重、日期格式、检索模型等等。

14.2 analyzer

https://www.elastic.co/guide/en/elasticsearch/reference/6.1/analyzer.html

指定分词器(分析器更合理)，对索引和查询都有效。如下，指定ik分词的配置（1）定义索引

DELETE my_index
PUT my_index

（2）ik_smart分词

GET my_index/_analyze
{
  "analyzer": "ik_smart",
  "text":"安徽省长江流域"
}

{
  "tokens": [
    {
      "token": "安徽省",
      "start_offset": 0,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "长江流域",
      "start_offset": 3,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}

（3）定义mapping

POST my_index/fulltext/_mapping
{
  "properties": {
      "content": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
      }
  }
}

（4）插入数据

POST my_index/fulltext/1
{"content":"美国留给伊拉克的是个烂摊子吗"}

POST my_index/fulltext/2
{"content":"公安部：各地校车将享最高路权"}

POST my_index/fulltext/3
{"content":"中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"}

POST my_index/fulltext/4
{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}

（5）查询

POST /index/fulltext/_search
{
    "query" : { "match" : { "content" : "中国" }}
}

查询结果

{
  "took": 135,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.6489038,
    "hits": [
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "4",
        "_score": 0.6489038,
        "_source": {
          "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
        }
      },
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "content": "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"
        }
      }
    ]
  }
}

14.3 normalizer

https://www.elastic.co/guide/en/elasticsearch/reference/6.1/normalizer.html normalizer用于解析前的标准化配置，比如把所有的字符转化为小写等。

DELETE my_index

PUT my_index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "foo": {
          "type": "keyword",
          "normalizer": "my_normalizer"
        }
      }
    }
  }
}

PUT my_index/type/1 
{"foo": "BÀR"}

PUT my_index/type/2
{"foo": "bar"}

PUT my_index/type/3
{"foo": "baz"}

POST my_index/_refresh

GET my_index/_search
{
  "query": {
    "match": {
      "foo": "BAR"
    }
  }
}

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my_index",
        "_type": "type",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "foo": "bar"
        }
      },
      {
        "_index": "my_index",
        "_type": "type",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "foo": "BÀR"
        }
      }
    ]
  }
}

14.4 boost

https://www.elastic.co/guide/en/elasticsearch/reference/6.1/mapping-boost.html

官方建议：index time boost is deprecated. Instead, the field mapping boost is applied at query time. 也就是说，官方推荐在查询时指定boost。

我们可以通过指定一个boost值来控制每个查询子句的相对权重，该值默认为1。一个大于1的boost会增加该查询子句的相对权重。

DELETE my_index

put my_index

PUT my_index/my_type/1
{
  "title":"quick brown fox"

}

POST _search
{
    "query": {
        "match" : {
            "title": {
                "query": "quick brown fox",
                "boost": 2
            }
        }
    }
}

查询结果

{
  "took": 48,
  "timed_out": false,
  "_shards": {
    "total": 45,
    "successful": 45,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.7260926,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1.7260926,
        "_source": {
          "title": "quick brown fox"
        }
      }
    ]
  }
}

boost参数被用来增加一个子句的相对权重(当boost大于1时)，或者减小相对权重(当boost介于0到1时)，但是增加或者减小不是线性的。换言之，boost设为2并不会让最终的_score加倍。相反，新的_score会在适用了boost后被归一化(Normalized)。每种查询都有自己的归一化算法(Normalization Algorithm)。但是能够说一个高的boost值会产生一个高的_score。

14.5 coerce

https://www.elastic.co/guide/en/elasticsearch/reference/6.1/coerce.html#coerce

coerce属性用于清除脏数据，coerce的默认值是true。整型数字5有可能会被写成字符串“5”或者浮点数5.0.coerce属性可以用来清除脏数据：

字符串会被强制转换为整数
浮点数被强制转换为整数【例子】（1）重新创建my_index

DELETE my_index

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "number_one": {
          "type": "integer"
        },
        "number_two": {
          "type": "integer",
          "coerce": false
        }
      }
    }
  }
}

（2）写入一条测试文档

PUT my_index/my_type/1
{
  "number_one": "10" 
}

{
  "_index": "my_index",
  "_type": "my_type",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}

（3）写入另一条测试文档

PUT my_index/my_type/2
{
  "number_two": "10" 
}

{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse [number_two]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse [number_two]",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "Integer value passed as String"
    }
  },
  "status": 400
}

14.6 copy-to

https://www.elastic.co/guide/en/elasticsearch/reference/6.1/copy-to.html copy_to属性用于配置自定义的_all字段。换言之，就是多个字段可以合并成一个超级字段。比如，first_name和last_name可以合并为full_name字段。【例子】（1）

DELETE my_index

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "first_name": {
          "type": "text",
          "copy_to": "full_name" 
        },
        "last_name": {
          "type": "text",
          "copy_to": "full_name" 
        },
        "full_name": {
          "type": "text"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "first_name": "John",
  "last_name": "Smith"
}

（2）查询

GET my_index/_search
{
  "query": {
    "match": {
      "full_name": { 
        "query": "John Smith",
        "operator": "and"
      }
    }
  }
}

{
  "took": 22,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "first_name": "John",
          "last_name": "Smith"
        }
      }
    ]
  }
}

14.7 doc_values

https://www.elastic.co/guide/en/elasticsearch/reference/6.1/doc-values.html doc_values是为了加快排序、聚合操作，在建立倒排索引的时候，额外增加一个列式存储映射，是一个空间换时间的做法。默认是开启的，对于确定不需要聚合或者排序的字段可以关闭。

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "status_code": { 
          "type":       "keyword"
        },
        "session_id": { 
          "type":       "keyword",
          "doc_values": false
        }
      }
    }
  }
}

14.8 dynamic

https://www.elastic.co/guide/en/elasticsearch/reference/6.1/dynamic.html

dynamic属性用于检测新发现的字段，有三个取值：

true:新发现的字段添加到映射中。（默认）
flase:新检测的字段被忽略。必须显式添加新字段。
strict:如果检测到新字段，就会引发异常并拒绝文档

【例子】（1）新建索引取值为strict，非布尔值要加引号

DELETE my_index

PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic": "strict", 
      "properties": {
        "title": { "type": "text"}
      }
    }
  }
}

（2）插入新文档

PUT my_index/my_type/1
{
  "title": "test",
  "content": "test dynamic"
}

抛出异常

{
  "error": {
    "root_cause": [
      {
        "type": "strict_dynamic_mapping_exception",
        "reason": "mapping set to strict, dynamic introduction of [content] within [my_type] is not allowed"
      }
    ],
    "type": "strict_dynamic_mapping_exception",
    "reason": "mapping set to strict, dynamic introduction of [content] within [my_type] is not allowed"
  },
  "status": 400
}

14.9 enabled

https://www.elastic.co/guide/en/elasticsearch/reference/6.1/enabled.html ELasticseaech默认会索引所有的字段，enabled设为false的字段，es会跳过字段内容，该字段只能从_source中获取，但是不可搜。而且字段可以是任意类型。

【例子】（1）新建索引，插入文档

DELETE my_index

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "name":{"enabled": false}
      } 
    }
  }
}
PUT my_index/my_type/1
{
  "title": "test enabled",
  "name":"chengyuqiang"
}

（2）查看文档

GET my_index/my_type/1

{
  "_index": "my_index",
  "_type": "my_type",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "title": "test enabled",
    "name": "chengyuqiang"
  }
}

（3）搜索字段

GET my_index/_search
{
  "query": {
    "match": {
      "name": "chengyuqiang"
    }
  }
}

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

14.10 fielddata

14.11 format

在《12.5 date类型》一节已经介绍了日期格式化。这里需要强调的是：epoch_millis表示毫秒数，epoch_second表示秒数。

14.12 ignore_above

ignore_above用于指定字段索引和存储的长度最大值，超过最大值的会被忽略

DELETE my_index

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "message": {
          "type": "keyword",
          "ignore_above": 20 
        }
      }
    }
  }
}

PUT my_index/my_type/1 
{
  "message": "Syntax error"
}

PUT my_index/my_type/2 
{
  "message": "Syntax error with some long stacktrace"
}

GET my_index/_search 
{
  "size":0,
  "aggs": {
    "messages": {
      "terms": {
        "field": "message"
      }
    }
  }
}

mapping中指定了ignore_above字段的最大长度为20，第一个文档的字段长小于20，因此索引成功，第二个超过20，因此不索引，返回结果只有”Syntax error”,结果如下

{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "messages": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Syntax error",
          "doc_count": 1
        }
      ]
    }
  }
}

14.13 ignore_malformed

https://www.elastic.co/guide/en/elasticsearch/reference/6.1/ignore-malformed.html ignore_malformed可以忽略不规则数据。对于账号userid字段，有人可能填写的是整数类型，也有人填写的是邮件格式。给一个字段索引不合适的数据类型发生异常，导致整个文档索引失败。如果ignore_malformed参数设为true，异常会被忽略，出异常的字段不会被索引，其它字段正常索引。

DELETE my_index

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "number_one": {
          "type": "integer",
          "ignore_malformed": true
        },
        "number_two": {
          "type": "integer"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "text":       "Some text value",
  "number_one": "foo" 
}

PUT my_index/my_type/2
{
  "text":       "Some text value",
  "number_two": "foo" 
}

上面的例子中number_one接受integer类型，ignore_malformed属性设为true，因此文档一种number_one字段虽然是字符串但依然能写入成功；number_two接受integer类型，默认ignore_malformed属性为false，因此写入失败。

{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse [number_two]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse [number_two]",
    "caused_by": {
      "type": "number_format_exception",
      "reason": "For input string: \"foo\""
    }
  },
  "status": 400
}

14.14 index_options

https://www.elastic.co/guide/en/elasticsearch/reference/6.1/index-options.html The index_options parameter controls what information is added to the inverted index, for search and highlighting purposes. index_options参数控制将哪些信息添加到倒排索引，用于搜索和突出显示目的。

参数	说明
docs	Only the doc number is indexed. Can answer the question Does this term exist in this field?
freqs	Doc number and term frequencies are indexed. Term frequencies are used to score repeated terms higher than single terms.
positions	Doc number, term frequencies, and term positions (or order) are indexed. Positions can be used for proximity or phrase queries.
offsets	Doc number, term frequencies, positions, and start and end character offsets (which map the term back to the original string) are indexed. Offsets are used by the unified highlighter to speed up highlighting.

注意：The index_options parameter has been deprecated for Numeric fields in 6.0.0。6.0.0中的数字字段已弃用index_options参数。

DELETE my_index

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "text",
          "index_options": "offsets"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "text": "Quick brown fox"
}

GET my_index/_search
{
  "query": {
    "match": {
      "text": "brown fox"
    }
  },
  "highlight": {
    "fields": {
      "text": {} 
    }
  }
}

{
  "took": 50,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "text": "Quick brown fox"
        },
        "highlight": {
          "text": [
            "Quick <em>brown</em> <em>fox</em>"
          ]
        }
      }
    ]
  }
}

14.15 index

The index option controls whether field values are indexed. It accepts true or false and defaults to true. Fields that are not indexed are not queryable. index属性指定字段是否索引，不索引也就不可搜索，取值可以为true或者false。

14.16 fields

https://www.elastic.co/guide/en/elasticsearch/reference/6.1/multi-fields.html

It is often useful to index the same field in different ways for different purposes. This is the purpose of multi-fields. For instance, a string field could be mapped as a text field for full-text search, and as a keyword field for sorting or aggregations。 fields可以让同一文本有多种不同的索引方式，比如一个String类型的字段，可以使用text类型做全文检索，使用keyword类型做聚合和排序。

DELETE my_index

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "raw": { 
              "type":  "keyword"
            }
          }
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "city": "New York"
}

PUT my_index/my_type/2
{
  "city": "York"
}

GET my_index/_search
{
  "query": {
    "match": {
      "city": "york" 
    }
  },
  "sort": {
    "city.raw": "asc" 
  },
  "aggs": {
    "Cities": {
      "terms": {
        "field": "city.raw" 
      }
    }
  }
}

{
  "took": 31,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": null,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": null,
        "_source": {
          "city": "New York"
        },
        "sort": [
          "New York"
        ]
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": null,
        "_source": {
          "city": "York"
        },
        "sort": [
          "York"
        ]
      }
    ]
  },
  "aggregations": {
    "Cities": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "New York",
          "doc_count": 1
        },
        {
          "key": "York",
          "doc_count": 1
        }
      ]
    }
  }
}

The city.raw field is a keyword version of the city field (.city.raw字段是城市字段的关键字版本。) The city field can be used for full text search.( city字段可用于全文搜索。) The city.raw field can be used for sorting and aggregations.( city.raw字段可用于排序和聚合)

本文参与腾讯云自媒体分享计划，分享自作者个人站点/博客。

原始发表：2018-01-14，如有侵权请联系 cloudcommunity@tencent.com 删除

https