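Here is a sample document with the following content: Pharmaceutical Marketing Building â responsibilities. Mass. â Aug. 13, 2020 âÂ
How can I remove special characters or non-ASCII Unicode characters from the content at indexing time? I am using ES 7.x and StormCrawler 1.17.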
Posted on 2020-10-16 15:32:52
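This looks like the character set was detected incorrectly. You can normalize the content before indexing by writing a custom parse filter and removing the unwanted characters there.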
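To illustrate the two ideas in this answer, here is a minimal Python sketch, not StormCrawler code. It assumes the stray `â` characters come from UTF-8 bytes being decoded as Windows-1252, which the thread does not confirm. The function names `simulate_mojibake`, `repair_mojibake`, and `strip_non_ascii` are hypothetical; a real StormCrawler parse filter would be written in Java and would need the same logic ported.

```python
import unicodedata


def simulate_mojibake(text: str) -> str:
    """Encode as UTF-8, then decode with the wrong charset (Windows-1252)."""
    return text.encode("utf-8").decode("cp1252")


def repair_mojibake(text: str) -> str:
    """Try to undo a UTF-8-read-as-Windows-1252 mix-up; return text unchanged on failure."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        return text


def strip_non_ascii(text: str) -> str:
    """Fold accented letters to ASCII where possible and drop everything else."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse the whitespace runs left behind by removed characters.
    return " ".join(ascii_only.split())


original = "Building \u2014 responsibilities"  # \u2014 is a long dash
garbled = simulate_mojibake(original)

print(repr(garbled))
print(repr(repair_mojibake(garbled)))
print(strip_non_ascii(garbled))
```

Stripping after the fact still leaves a stray `a` where the dash used to be, because the `â` folds to a plain letter. If the charset assumption holds, repairing or correctly detecting the encoding before indexing gives cleaner text than removing characters afterwards.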
Posted on 2020-10-16 15:41:00
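If writing a custom parse filter and normalizing the content yourself is too difficult, you can simply add the asciifolding token filter to your analyzer definition. It converts non-ASCII characters to their ASCII equivalents where one exists, as shown below: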
POST http://{{hostname}}:{{port}}/_analyze
{
"tokenizer": "standard",
"filter": [
"asciifolding"
],
"text": "Pharmaceutical Marketing Building â responsibilities.  Mass. â Aug. 13, 2020 âÂ"
}
This generates the following tokens for your text:
{
"tokens": [
{
"token": "Pharmaceutical",
"start_offset": 0,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "Marketing",
"start_offset": 15,
"end_offset": 24,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "Building",
"start_offset": 25,
"end_offset": 33,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "a",
"start_offset": 34,
"end_offset": 35,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "responsibilities.A",
"start_offset": 36,
"end_offset": 54,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "A",
"start_offset": 55,
"end_offset": 56,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "Mass",
"start_offset": 57,
"end_offset": 61,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "a",
"start_offset": 63,
"end_offset": 64,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "Aug",
"start_offset": 65,
"end_offset": 68,
"type": "<ALPHANUM>",
"position": 8
},
{
"token": "13",
"start_offset": 70,
"end_offset": 72,
"type": "<NUM>",
"position": 9
},
{
"token": "2020",
"start_offset": 74,
"end_offset": 78,
"type": "<NUM>",
"position": 10
},
{
"token": "aA",
"start_offset": 79,
"end_offset": 81,
"type": "<ALPHANUM>",
"position": 11
}
]
}
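The `_analyze` call above only tests the filter. To apply it at indexing time, the filter has to be part of an analyzer attached to the field. The following is a sketch of such an index body, built in Python for readability. The analyzer name `folding_analyzer` and the field name `content` are hypothetical placeholders, not names taken from the question or from StormCrawler's own mapping.

```python
import json


def build_index_body(field: str = "content",
                     analyzer: str = "folding_analyzer") -> dict:
    """Build index settings and mappings that use the asciifolding token filter.

    Both default names are illustrative assumptions; adjust them to your index.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer: {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Same filter chain as the _analyze request above.
                        "filter": ["asciifolding"],
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                field: {"type": "text", "analyzer": analyzer}
            }
        },
    }


body = build_index_body()
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of a create-index request (`PUT` to the index URL). Note that in the token output above the stray characters are folded to `a`, `A`, and `aA` rather than removed, and that the filter affects only the indexed terms, not the stored `_source`. If the characters must disappear entirely, cleaning the text before indexing is still needed.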
https://stackoverflow.com/questions/64384571