Q: How to stop storing special characters in content while indexing

Stack Overflow user

Asked 2020-10-16 15:16:33

Answers: 2, Views: 40, Followers: 0, Votes: 1

This is a sample document with the following content: Pharmaceutical Marketing Building â responsibilities.Â Â Mass. â Aug. 13, 2020 âÂ

How can I remove special characters or non-ASCII Unicode characters from the content at index time? I am using ES 7.x and StormCrawler 1.17.

elasticsearch-analyzers

elasticsearch

stormcrawler

2 Answers

Stack Overflow user

Accepted answer

Posted 2020-10-16 15:32:52

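This looks like incorrect charset detection. You could normalize the content before indexing by writing a custom parse filter and removing the unwanted characters there.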
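As a rough illustration of the clean-up step such a parse filter might perform, here is a minimal sketch in plain Java. The class `ContentCleaner` and the method `stripNonAscii` are hypothetical names, not part of StormCrawler, and wiring the helper into an actual StormCrawler parse filter is not shown. It assumes that dropping every non-ASCII character is acceptable for the content, which would also remove legitimate accented text.

```java
/** Hypothetical helper; not part of StormCrawler. */
final class ContentCleaner {
    private ContentCleaner() {}

    /**
     * Replaces every character outside printable ASCII (0x20-0x7E) with a
     * space, then collapses runs of whitespace into a single space.
     */
    static String stripNonAscii(String text) {
        if (text == null) {
            return "";
        }
        String ascii = text.replaceAll("[^\\x20-\\x7E]", " ");
        return ascii.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Sample resembling the mis-decoded text from the question.
        String raw = "Marketing Building \u00e2 responsibilities.\u00c2\u00a0Mass. \u00e2 Aug. 13, 2020";
        System.out.println(stripNonAscii(raw));
    }
}
```

Stripping characters only hides the symptom. If the root cause is a wrongly detected charset, fixing the detection would preserve characters that this approach throws away.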

Votes: 1

Stack Overflow user

Posted 2020-10-16 15:41:00

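If writing a custom parse filter and normalization is difficult for you, you can simply add the asciifolding token filter to your analyzer definition. It converts non-ASCII characters to their ASCII equivalents, as shown below:

POST http://{{hostname}}:{{port}}/_analyze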

{
    "tokenizer": "standard",
    "filter": [
        "asciifolding"
    ],
    "text": "Pharmaceutical Marketing Building â responsibilities.Â Â Mass. â Aug. 13, 2020 âÂ"
}

This generates the following tokens for your text:

{
    "tokens": [
        {
            "token": "Pharmaceutical",
            "start_offset": 0,
            "end_offset": 14,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "Marketing",
            "start_offset": 15,
            "end_offset": 24,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "Building",
            "start_offset": 25,
            "end_offset": 33,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "a",
            "start_offset": 34,
            "end_offset": 35,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "responsibilities.A",
            "start_offset": 36,
            "end_offset": 54,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "A",
            "start_offset": 55,
            "end_offset": 56,
            "type": "<ALPHANUM>",
            "position": 5
        },
        {
            "token": "Mass",
            "start_offset": 57,
            "end_offset": 61,
            "type": "<ALPHANUM>",
            "position": 6
        },
        {
            "token": "a",
            "start_offset": 63,
            "end_offset": 64,
            "type": "<ALPHANUM>",
            "position": 7
        },
        {
            "token": "Aug",
            "start_offset": 65,
            "end_offset": 68,
            "type": "<ALPHANUM>",
            "position": 8
        },
        {
            "token": "13",
            "start_offset": 70,
            "end_offset": 72,
            "type": "<NUM>",
            "position": 9
        },
        {
            "token": "2020",
            "start_offset": 74,
            "end_offset": 78,
            "type": "<NUM>",
            "position": 10
        },
        {
            "token": "aA",
            "start_offset": 79,
            "end_offset": 81,
            "type": "<ALPHANUM>",
            "position": 11
        }
    ]
}
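The `_analyze` call above only previews the tokens. For the filter to apply at index time, it would need to be part of an analyzer attached to the field. The sketch below builds (but does not send) a create-index request for ES 7.x using Java's built-in `java.net.http` API. The index name `content-index`, the analyzer name `folding_analyzer`, the field name `content`, and the host are all assumptions made for illustration. Note that an analyzer changes the indexed terms, not the stored `_source`, so the original characters would still appear in returned documents.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical sketch: builds a create-index request with an asciifolding analyzer. */
final class FoldingIndexRequest {
    private FoldingIndexRequest() {}

    // Request body: a custom analyzer using the standard tokenizer plus
    // asciifolding, applied to an assumed text field named "content".
    static final String BODY = String.join("\n",
        "{",
        "  \"settings\": {",
        "    \"analysis\": {",
        "      \"analyzer\": {",
        "        \"folding_analyzer\": {",
        "          \"type\": \"custom\",",
        "          \"tokenizer\": \"standard\",",
        "          \"filter\": [\"asciifolding\"]",
        "        }",
        "      }",
        "    }",
        "  },",
        "  \"mappings\": {",
        "    \"properties\": {",
        "      \"content\": {\"type\": \"text\", \"analyzer\": \"folding_analyzer\"}",
        "    }",
        "  }",
        "}");

    /** Builds the PUT request; sending it would require an HttpClient. */
    static HttpRequest build(String baseUrl, String index) {
        return HttpRequest.newBuilder()
            .uri(URI.create(baseUrl + "/" + index))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(BODY))
            .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("http://localhost:9200", "content-index");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(BODY);
    }
}
```

An analyzer can only be set this way when the index is created, so an existing index would typically need to be recreated and reindexed for the change to take effect.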

Votes: 0

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/64384571
