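I have the following synonym expansion:

suco => suco, refresco, bebida de soja

What I want is for a search to be tokenized like this:
The search "suco de laranja" would be tokenized as "suco", "laranja", "refresco", "bebida de soja".
But what I actually get is "suco", "laranja", "refresco", "bebida", "soja".
Note that "de" is a stop word. I want it to be ignored in queries, so that "bebida de laranja" becomes "bebida", "laranja". But I do not want it to be considered when the synonyms are tokenized, so "bebida de soja" should remain the single token "bebida de soja".

My settings: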
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_br": {
          "type": "synonym",
          "synonyms": [
            "suco => suco, refresco, bebida de soja"
          ]
        },
        "brazilian_stop": {
          "type": "stop",
          "stopwords": "_brazilian_"
        }
      },
      "analyzer": {
        "synonyms": {
          "filter": [
            "synonym_br",
            "lowercase",
            "brazilian_stop",
            "asciifolding"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}

Posted on 2019-05-02 03:45:26
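I suggest the following two changes. The first addresses your question directly; the second is a recommendation.

1. Change "suco => suco, refresco, bebida de soja" to "suco, refresco, bebida de soja => suco". With this rule, each of the three terms, including the multi-word phrase, is rewritten to the single token suco, so the stop filter never sees the de inside bebida de soja.
2. Change the order of the filters in the synonyms analyzer: place lowercase before synonym_br. This ensures that letter case does not affect the synonym_br token filter.

The final settings are therefore: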
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_br": {
          "type": "synonym",
          "synonyms": [
            "suco, refresco, bebida de soja => suco"
          ]
        },
        "brazilian_stop": {
          "type": "stop",
          "stopwords": "_brazilian_"
        }
      },
      "analyzer": {
        "synonyms": {
          "filter": [
            "lowercase",
            "synonym_br",
            "brazilian_stop",
            "asciifolding"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}

How does this work?
For the input bebida de soja, the filters are applied in the following order:

Filter            Result tokens
====================================
lowercase         bebida, de, soja
synonym_br        suco               <------- all of the above tokens (including their positions) exactly match a synonym
brazilian_stop    suco
asciifolding      suco

Now let's see brazilian_stop in action. For that we need an input that does not match any synonym but does contain de. For example, de soja:
Filter            Result tokens
=================================
lowercase         de, soja
synonym_br        de, soja           <------- none of the tokens (independently or combined, including position) matches any synonym
brazilian_stop    soja               <------- de is removed as it is a stopword
asciifolding      soja

https://stackoverflow.com/questions/55944061
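The two filter traces above can be reproduced with a small toy model. This is only a sketch in plain Python, not Elasticsearch itself: whitespace splitting stands in for the standard tokenizer, STOPWORDS is a one-word stand-in for the _brazilian_ list, the accent stripping only approximates asciifolding, and every name here (SYNONYMS, analyze, FIXED_ORDER, ORIGINAL_ORDER) is hypothetical. It also assumes the synonym step matches the longest phrase first.

```python
import unicodedata

# Assumption: a tiny stand-in for the built-in _brazilian_ stopword list.
STOPWORDS = {"de"}

# The contraction rule from the answer: every left-hand phrase becomes "suco".
SYNONYMS = {
    ("suco",): ["suco"],
    ("refresco",): ["suco"],
    ("bebida", "de", "soja"): ["suco"],
}


def lowercase(tokens):
    return [t.lower() for t in tokens]


def synonym_br(tokens):
    """Replace the longest matching phrase at each position with its target."""
    out, i = [], 0
    max_len = max(len(phrase) for phrase in SYNONYMS)
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in SYNONYMS:
                out.extend(SYNONYMS[phrase])
                i += n
                break
        else:
            # No phrase starts here: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


def brazilian_stop(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def asciifolding(tokens):
    # Approximation: decompose characters and drop combining accents.
    return [
        "".join(c for c in unicodedata.normalize("NFKD", t)
                if not unicodedata.combining(c))
        for t in tokens
    ]


def analyze(text, filters):
    tokens = text.split()  # crude stand-in for the standard tokenizer
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


FIXED_ORDER = [lowercase, synonym_br, brazilian_stop, asciifolding]
ORIGINAL_ORDER = [synonym_br, lowercase, brazilian_stop, asciifolding]

print(analyze("bebida de soja", FIXED_ORDER))     # ['suco']
print(analyze("de soja", FIXED_ORDER))            # ['soja']
print(analyze("suco de laranja", FIXED_ORDER))    # ['suco', 'laranja']
print(analyze("Bebida de Soja", ORIGINAL_ORDER))  # ['bebida', 'soja']
```

The last call hints at why the filter order matters: when the synonym step runs before lowercasing, a capitalized phrase no longer matches the rule and falls through to the stop filter. On a real cluster, the same check can be made by sending each text to the index's _analyze API with the synonyms analyzer.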