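I am looking for the best way to group data in Elasticsearch. Elasticsearch does not support anything like SQL's 'group by'.
Let's say I have 1k categories and millions of products. What do you think is the best way to render a complete category tree? Of course, you need some metadata (icon, link target, SEO titles, ...) and custom sorting for the categories.
Building a category tree with these 3 "solutions" sucks. Solution 1 may work (ES 1 is not stable right now). Solution 2 does not work. Solution 3 is a pain because it feels ugly, you need to prepare a lot of data, and the facets blow up.
Maybe an alternative is to store no category data in ES at all, only the id: https://found.no/play/gist/a53e46c91e2bf077f2e1
You could then fetch the associated categories from another system such as Redis, Memcache or a database.
This would result in clean code, but performance could become a problem. For example, loading 1k categories from Memcache / Redis / a database could be slow. Another problem is that keeping 2 databases in sync is harder than keeping one.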
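To make the id-only idea concrete, here is a rough sketch of the join step. It assumes Elasticsearch returns terms-aggregation buckets keyed by category id, and that the metadata has already been loaded from the other system. The function name, the `position` sort field and the sample data are all made up for illustration; a plain dict stands in for Redis, Memcache or a database.

```python
def attach_category_metadata(buckets, metadata_by_id):
    """Join aggregation buckets (keyed by category id) with category metadata."""
    rows = []
    for bucket in buckets:
        meta = metadata_by_id.get(bucket["key"])
        if meta is None:
            continue  # category id unknown to the metadata store, skip it
        row = dict(meta)  # copy so the metadata store is not mutated
        row["id"] = bucket["key"]
        row["doc_count"] = bucket["doc_count"]
        rows.append(row)
    # Custom ordering comes from the metadata, not from Elasticsearch.
    rows.sort(key=lambda row: row["position"])
    return rows


# Hypothetical metadata, standing in for Redis / Memcache / a database.
metadata_by_id = {
    1: {"title": "Books", "icon": "book.png", "position": 2},
    2: {"title": "Music", "icon": "note.png", "position": 1},
}
# Same shape as the "buckets" list of a terms aggregation on a category id field.
buckets = [{"key": 1, "doc_count": 120}, {"key": 2, "doc_count": 45}]

categories = attach_category_metadata(buckets, metadata_by_id)
```

With this split, Elasticsearch only supplies ids and counts, and everything presentational stays in the other store.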
How do you handle these problems?
Sorry for the links, but I cannot post more than 2 in a single post.
Posted on 2014-01-23 07:35:55
The aggregations API allows grouping by multiple fields using sub-aggregations. Suppose you want to group by the fields field1, field2 and field3:

{
    "aggs": {
        "agg1": {
            "terms": {
                "field": "field1"
            },
            "aggs": {
                "agg2": {
                    "terms": {
                        "field": "field2"
                    },
                    "aggs": {
                        "agg3": {
                            "terms": {
                                "field": "field3"
                            }
                        }
                    }
                }
            }
        }
    }
}

Of course, this can be nested to as many levels as you like.
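
The nesting above can also be generated for any list of fields. A minimal sketch, assuming each aggregation is simply named after its field rather than agg1, agg2, ... (the helper name `nested_terms_aggs` is made up for illustration):

```python
def nested_terms_aggs(fields):
    """Build a request body with one terms aggregation nested per field."""
    if not fields:
        raise ValueError("at least one field is required")
    body = {}
    level = body  # dict that receives the next "aggs" entry
    for field in fields:
        node = {"terms": {"field": field}}
        level["aggs"] = {field: node}
        level = node  # the next field nests under this aggregation
    return body


query = nested_terms_aggs(["field1", "field2", "field3"])
```

The resulting `query` has the same shape as the hand-written body and can be sent as the search request body.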
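Update:
For completeness, here is the output of the query above. Further below is Python code that generates the aggregation query and flattens the result into a list of dictionaries.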

{
    "aggregations": {
        "agg1": {
            "buckets": [{
                "doc_count": <count>,
                "key": <value of field1>,
                "agg2": {
                    "buckets": [{
                        "doc_count": <count>,
                        "key": <value of field2>,
                        "agg3": {
                            "buckets": [{
                                "doc_count": <count>,
                                "key": <value of field3>
                            },
                            {
                                "doc_count": <count>,
                                "key": <value of field3>
                            }, ...
                            ]
                        }
                    },
                    {
                        "doc_count": <count>,
                        "key": <value of field2>,
                        "agg3": {
                            "buckets": [{
                                "doc_count": <count>,
                                "key": <value of field3>
                            },
                            {
                                "doc_count": <count>,
                                "key": <value of field3>
                            }, ...
                            ]
                        }
                    }, ...
                    ]
                }
            },
            {
                "doc_count": <count>,
                "key": <value of field1>,
                "agg2": {
                    "buckets": [{
                        "doc_count": <count>,
                        "key": <value of field2>,
                        "agg3": {
                            "buckets": [{
                                "doc_count": <count>,
                                "key": <value of field3>
                            },
                            {
                                "doc_count": <count>,
                                "key": <value of field3>
                            }, ...
                            ]
                        }
                    },
                    {
                        "doc_count": <count>,
                        "key": <value of field2>,
                        "agg3": {
                            "buckets": [{
                                "doc_count": <count>,
                                "key": <value of field3>
                            },
                            {
                                "doc_count": <count>,
                                "key": <value of field3>
                            }, ...
                            ]
                        }
                    }, ...
                    ]
                }
            }, ...
            ]
        }
    }
}

The following Python code performs the group-by for a given list of fields. If you pass include_missing=True, it also includes combinations of values where some of the fields are missing (you do not need this if you have version 2.0 of Elasticsearch).

def group_by(es, fields, include_missing):
    """Group by `fields` with nested terms aggregations and flatten the result."""
    current_level_terms = {'terms': {'field': fields[0]}}
    agg_spec = {fields[0]: current_level_terms}

    if include_missing:
        # Sibling "missing" aggregation counts documents lacking the field.
        current_level_missing = {'missing': {'field': fields[0]}}
        agg_spec[fields[0] + '_missing'] = current_level_missing

    for field in fields[1:]:
        next_level_terms = {'terms': {'field': field}}
        current_level_terms['aggs'] = {
            field: next_level_terms,
        }

        if include_missing:
            next_level_missing = {'missing': {'field': field}}
            current_level_terms['aggs'][field + '_missing'] = next_level_missing

            # The "missing" branch gets the same sub-aggregations as the terms branch.
            current_level_missing['aggs'] = {
                field: next_level_terms,
                field + '_missing': next_level_missing,
            }
            current_level_missing = next_level_missing

        current_level_terms = next_level_terms

    agg_result = es.search(body={'aggs': agg_spec})['aggregations']
    return get_docs_from_agg_result(agg_result, fields, include_missing)


def get_docs_from_agg_result(agg_result, fields, include_missing):
    """Recursively flatten nested buckets into a list of dictionaries."""
    current_field = fields[0]
    buckets = agg_result[current_field]['buckets']
    if include_missing:
        # The "missing" result is a single bucket without a 'key'.
        buckets.append(agg_result[(current_field + '_missing')])

    if len(fields) == 1:
        return [
            {
                current_field: bucket.get('key'),
                'doc_count': bucket['doc_count'],
            }
            for bucket in buckets if bucket['doc_count'] > 0
        ]

    result = []
    for bucket in buckets:
        records = get_docs_from_agg_result(bucket, fields[1:], include_missing)
        value = bucket.get('key')  # None for the "missing" bucket
        for record in records:
            record[current_field] = value
        result.extend(records)

    return result

https://stackoverflow.com/questions/20775040