
Elasticsearch Aggregations

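Characteristics

Aggregations and searches use the same data structures, so they can be executed together.
This means a single JSON request can search or filter a data set and analyze it at the same time.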

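Buckets and metrics

Buckets

1. A bucket groups documents by some criterion but performs no calculation itself, so a bucket usually has another kind of aggregation nested inside it: a metrics aggregation.

2. Buckets can be nested inside other buckets.

3. Commonly used bucket aggregations:
- Terms Aggregation: groups by term; documents whose term values match exactly fall into the same group
- filter: a bucket that filters documents; it is used exactly like the filter in the main query
- top_hits: returns the top few hits within a bucket, in the same format as the hits returned by the main query
- Date Histogram Aggregation: groups by date interval; for example, with a weekly interval one group is created per week
- Histogram Aggregation: groups by numeric interval, analogous to the date histogram
- Range Aggregation: groups numeric or date values into ranges, each defined by a start and an end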
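
To make the difference between the histogram and range groupings concrete, here is a minimal plain-Python sketch of how each one assigns a value to a bucket. It is a model of the semantics, not Elasticsearch code: the function names `histogram_key` and `range_labels`, the 500 interval, and the "cheap"/"expensive" range keys are illustrative assumptions, and the prices are those of the three sample documents indexed later in this document.

```python
import math
from collections import Counter

# Prices of the three sample documents used later in this document.
prices = [100, 500, 1000]


def histogram_key(value, interval, offset=0):
    """Histogram bucket key: the value rounded down to a multiple of interval."""
    return math.floor((value - offset) / interval) * interval + offset


def range_labels(value, ranges):
    """Keys of every range containing value ("from" inclusive, "to" exclusive)."""
    labels = []
    for r in ranges:
        lower = r.get("from", float("-inf"))
        upper = r.get("to", float("inf"))
        if lower <= value < upper:
            labels.append(r["key"])
    return labels


# Histogram: fixed-width steps derived from the data.
histogram = Counter(histogram_key(p, interval=500) for p in prices)

# Range: explicit, user-chosen boundaries (hypothetical keys).
ranges = [{"key": "cheap", "to": 500}, {"key": "expensive", "from": 500}]
by_range = Counter(label for p in prices for label in range_labels(p, ranges))

print(dict(histogram))
print(dict(by_range))
```

The histogram derives its bucket boundaries from the interval, while the range grouping only ever produces the buckets that were listed explicitly.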

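Metrics

Once documents are grouped, we usually run a calculation over each group, such as an average, maximum, minimum, or sum. Elasticsearch calls these metrics.

Commonly used metrics aggregations:
- Avg Aggregation: average
- Max Aggregation: maximum
- Min Aggregation: minimum
- Percentiles Aggregation: percentiles
- Stats Aggregation: returns avg, max, min, sum, and count together
- Sum Aggregation: sum
- Top hits Aggregation: the top few documents
- Value Count Aggregation: number of values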
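
As a rough illustration of what these metrics compute, the following plain-Python sketch derives the same numbers from the prices of the three sample documents indexed later in this document. It only models the arithmetic; the dictionary layout is an assumption for readability, and Elasticsearch may compute percentiles approximately on large data sets, whereas the median here is exact.

```python
from statistics import mean, median

# Prices of the three sample documents used later in this document.
prices = [100, 500, 1000]

# Roughly what a stats aggregation reports in a single pass.
stats = {
    "count": len(prices),   # value_count
    "min": min(prices),     # min
    "max": max(prices),     # max
    "avg": mean(prices),    # avg
    "sum": sum(prices),     # sum
}

# The 50th percentile (median) of the same values.
p50 = median(prices)

print(stats)
print(p50)
```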

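Template for aggs

When query and aggs appear together, the main query runs first, and only the documents it matches are passed on to aggs for aggregation.
Note that directly under aggs comes a user-defined name for the aggregation, and only under that name comes the bucket to use.
If you only care about the aggs results and not the query hits, set size to 0 so that the returned hits list is empty, which speeds up the response.
A single aggs can contain many aggregations, each independent of the others, so one can count documents, another can analyze the data, another can compute a standard deviation, and so on. A single search can therefore do everything at once.
aggs can be nested inside other aggs, and a nested bucket operates only on the documents output by its enclosing bucket.
Template
GET /test/_search
{
  "query": { ... },
  "size": 0,
  "aggs": {
    "custom_name1": {            // a user-defined name comes directly under aggs
      "bucket_type": { ... }     // then the bucket
    },
    "custom_name2": {            // one aggs can hold many aggregations
      "bucket_type": { ... }
    },
    "custom_name3": {
      "bucket_type": { ..... },
      "aggs": {                  // aggs can be nested inside another aggregation
        "in_name": {             // a nested aggs also needs a user-defined name first
          "bucket_type": { ... } // this bucket operates on the output of custom_name3's bucket
        }
      }
    }
  }
}
Result template
{
  "hits": {
    "total": 8,
    "max_score": 0,
    "hits": []                   // empty because size is 0
  },
  "aggregations": {
    "custom_name1": { ... },
    "custom_name2": { ... },
    "custom_name3": {
      ... ,
      "in_name": { .... }
    }
  }
}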
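
The template can also be assembled programmatically. The sketch below builds such a request body as a Python dictionary and prints it as JSON; the helper name `build_search_body` and the aggregation names `by_color`, `avg_price`, and `max_price` are hypothetical, and the `color` and `price` fields are those of the sample index defined later in this document. Sending the body to a cluster is left out on purpose.

```python
import json


def build_search_body(query, aggs, size=0):
    """Assemble a search body; size=0 keeps the hits list empty."""
    return {"query": query, "size": size, "aggs": aggs}


body = build_search_body(
    query={"match_all": {}},
    aggs={
        "by_color": {                     # user-defined aggregation name
            "terms": {"field": "color"},  # the bucket and its options
            "aggs": {                     # nested: evaluated inside each color bucket
                "avg_price": {"avg": {"field": "price"}},
            },
        },
        # An independent sibling aggregation over all matched documents.
        "max_price": {"max": {"field": "price"}},
    },
)

print(json.dumps(body, indent=2))
```

Keeping the body as a dictionary makes the three levels visible: the custom name, the bucket or metric under it, and the optional nested `aggs` next to the bucket.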

Data preparation

PUT /test
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "color": {
        "type": "keyword"
      },
      "price": {
        "type": "long"
      }
    }
  }
}

POST /test/_doc/1
{"color":"red","price":100}

POST /test/_doc/2
{"color":"green","price":500}

POST /test/_doc/3
{"color":["red","blue"],"price":1000}

Examples

terms bucket
- Find how many color groups there are and how many documents each group contains
  GET /test/_search
  {
    "size": 0,
    "aggs": {
      "my_terms": {
        "terms": { "field": "color" }
      }
    }
  }
  Aggregation result
  {
    "aggregations" : {
      "my_terms" : {
        "doc_count_error_upper_bound" : 0,
        "sum_other_doc_count" : 0,
        "buckets" : [
          { "key" : "red", "doc_count" : 2 },
          { "key" : "blue", "doc_count" : 1 },
          { "key" : "green", "doc_count" : 1 }
        ]
      }
    }
  }
- Building on the previous example, compute the average and minimum price for each color group
  GET /test/_search
  {
    "size": 0,
    "aggs": {
      "my_terms": {
        "terms": { "field": "color" },
        "aggs": {
          "my_avg_price": { "avg": { "field": "price" } },
          "my_min_price": { "min": { "field": "price" } }
        }
      }
    }
  }
  Aggregation result
  {
    "aggregations" : {
      "my_terms" : {
        "doc_count_error_upper_bound" : 0,
        "sum_other_doc_count" : 0,
        "buckets" : [
          {
            "key" : "red",
            "doc_count" : 2,
            "my_avg_price" : { "value" : 550.0 },
            "my_min_price" : { "value" : 100.0 }
          },
          {
            "key" : "blue",
            "doc_count" : 1,
            "my_avg_price" : { "value" : 1000.0 },
            "my_min_price" : { "value" : 1000.0 }
          },
          {
            "key" : "green",
            "doc_count" : 1,
            "my_avg_price" : { "value" : 500.0 },
            "my_min_price" : { "value" : 500.0 }
          }
        ]
      }
    }
  }
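
The bucket counts above add up to 4 even though only 3 documents were indexed, because the document with two colors lands in two buckets. The following plain-Python sketch models that behavior on the sample documents; it is not how Elasticsearch is implemented, the helper names `as_list` and `terms_with_metrics` are hypothetical, and the tie-breaking by key is an assumption chosen to match the order seen in the result above.

```python
from collections import defaultdict

# The three sample documents from the data preparation section.
docs = [
    {"color": "red", "price": 100},
    {"color": "green", "price": 500},
    {"color": ["red", "blue"], "price": 1000},
]


def as_list(value):
    """Treat a single value and an array of values uniformly."""
    return value if isinstance(value, list) else [value]


def terms_with_metrics(documents, field, metric_field):
    """Group by each distinct term, then compute avg and min per group."""
    groups = defaultdict(list)
    for doc in documents:
        # A multi-valued document joins one bucket per distinct value.
        for term in set(as_list(doc[field])):
            groups[term].append(doc[metric_field])
    buckets = [
        {"key": key, "doc_count": len(vals),
         "avg": sum(vals) / len(vals), "min": min(vals)}
        for key, vals in groups.items()
    ]
    # Largest buckets first; ties ordered by key (assumption, see lead-in).
    return sorted(buckets, key=lambda b: (-b["doc_count"], b["key"]))


buckets = terms_with_metrics(docs, "color", "price")
for bucket in buckets:
    print(bucket)
```

The red bucket averages 100 and 1000, which is where the 550.0 in the result comes from.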
filter bucket
Keep only the documents whose color includes red and count them
GET /test/_search
{
  "size": 0,
  "aggs": {
    "my_filter": {
      "filter": {
        "bool": {
          "must": {
            "terms": { "color": [ "red" ] }
          }
        }
      }
    }
  }
}
Aggregation result
{
  "aggregations" : {
    "my_filter" : {
      "doc_count" : 2
    }
  }
}
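Nesting a terms bucket inside a filter bucket
Keep the documents that contain red, then group them by the colors they contain
GET /test/_search
{
  "size": 0,
  "aggs": {
    "my_filter": {
      "filter": {
        "bool": {
          "must": {
            "terms": { "color": [ "red" ] }
          }
        }
      },
      "aggs": {
        "my_terms": {
          "terms": { "field": "color" }
        }
      }
    }
  }
}
Aggregation result
- Because the terms bucket is nested inside the filter bucket, the documents matched by the query pass through the filter bucket first, and only those that match the filter reach the terms bucket
- Only two documents pass the filter here, {"color": "red"} and {"color": ["red", "blue"]}, so the terms bucket groups only those two
- That is why the terms bucket has no green group: that document was already stopped at the filter bucket
- Note that aggregations operate on the documents left after the query; if the query restricted the results to green documents only, this aggregation would have nothing to show
{
  "aggregations" : {
    "my_filter" : {
      "doc_count" : 2,
      "my_terms" : {
        "doc_count_error_upper_bound" : 0,
        "sum_other_doc_count" : 0,
        "buckets" : [
          { "key" : "red", "doc_count" : 2 },
          { "key" : "blue", "doc_count" : 1 }
        ]
      }
    }
  }
}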
Of course, you can also nest the filter bucket inside the terms bucket, which means grouping first and then filtering within each group
GET /test/_search
{
  "size": 0,
  "aggs": {
    "my_terms": {
      "terms": { "field": "color" },
      "aggs": {
        "my_filter": {
          "filter": {
            "bool": {
              "must": {
                "terms": { "color": [ "red" ] }
              }
            }
          }
        }
      }
    }
  }
}
Aggregation result
- The filter is applied inside each group, so my_filter under green has a doc_count of 0
- blue has doc_count = 1 in my_filter because its only document is {"color":["red","blue"]}, which also contains red
```
{
  "aggregations" : {
    "my_terms" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "red",
          "doc_count" : 2,
          "my_filter" : {
            "doc_count" : 2
          }
        },
        {
          "key" : "blue",
          "doc_count" : 1,
          "my_filter" : {
            "doc_count" : 1
          }
        },
        {
          "key" : "green",
          "doc_count" : 1,
          "my_filter" : {
            "doc_count" : 0
          }
        }
      ]
    }
  }
}
```
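
The effect of the nesting order can be reproduced with a small plain-Python model on the sample documents. This is only a sketch of the semantics under the assumption that each nested aggregation sees exactly the documents of its enclosing bucket; the helper names `has_color` and `terms_counts` and the result layout are hypothetical.

```python
# The three sample documents from the data preparation section.
docs = [
    {"color": "red", "price": 100},
    {"color": "green", "price": 500},
    {"color": ["red", "blue"], "price": 1000},
]


def as_list(value):
    """Treat a single value and an array of values uniformly."""
    return value if isinstance(value, list) else [value]


def has_color(doc, color):
    """Model of the filter: does the document contain this color?"""
    return color in as_list(doc["color"])


def terms_counts(documents):
    """Model of the terms bucket: documents per distinct color."""
    counts = {}
    for doc in documents:
        for term in set(as_list(doc["color"])):
            counts[term] = counts.get(term, 0) + 1
    return counts


# filter -> terms: filter first, then group only the survivors.
red_docs = [d for d in docs if has_color(d, "red")]
filter_then_terms = {"doc_count": len(red_docs), "terms": terms_counts(red_docs)}

# terms -> filter: group everything, then filter inside each group.
terms_then_filter = {}
for term in terms_counts(docs):
    members = [d for d in docs if has_color(d, term)]
    terms_then_filter[term] = {
        "doc_count": len(members),
        "filtered": sum(1 for d in members if has_color(d, "red")),
    }

print(filter_then_terms)
print(terms_then_filter)
```

With the filter outermost, green never reaches the grouping step; with the terms bucket outermost, green still gets a group, but its filtered count is 0.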
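top_hits bucket
Returns the top few hits within a bucket, in the same format as the hits returned by the main query.
Note that no sub-aggregations can be nested inside it (Elasticsearch classifies top_hits as a metric aggregation); trying to nest one fails with:
Aggregator [my_top_hit] of type [top_hits] cannot accept sub-aggregations
- Parameters supported by top_hits
  - from, size
  - sort: sets the order of the returned hits
    Note that a sort set in the main query does not affect the data inside aggs. The documents matched by the main query are handed to aggs without passing through that sort first, so a sort on the main query does not change the order of documents inside aggs.
    Therefore, to order the hits returned by top_hits, set sort inside top_hits itself.
    If no sort is set, the hits are ordered by the _score computed by the main query.
  - _source: sets which fields are returned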
Sort by price and take the first two documents
GET /test/_search
{
  "size": 0,
  "aggs": {
    "my_top_hit": {
      "top_hits": {
        "size": 2,
        "sort": ["price"]  # ascending (asc) by default; "sort": {"price": "desc"} also works
      }
    }
  }
}
Aggregation result
{
  "aggregations" : {
    "my_top_hit" : {
      "hits" : {
        "total" : { "value" : 3, "relation" : "eq" },
        "max_score" : null,
        "hits" : [
          {
            "_index" : "test",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : null,
            "_source" : { "color" : "red", "price" : 100 },
            "sort" : [ 100 ]
          },
          {
            "_index" : "test",
            "_type" : "_doc",
            "_id" : "2",
            "_score" : null,
            "_source" : { "color" : "green", "price" : 500 },
            "sort" : [ 500 ]
          }
        ]
      }
    }
  }
}
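
What top_hits does with `sort` and `size` can be modeled in a few lines of plain Python on the sample documents. This is a sketch under simplifying assumptions: the function name `top_hits` and its `order` parameter are hypothetical, `from` and `_source` are left out, and the `_id` values are the ones used when the documents were indexed above.

```python
# The three sample documents, with the ids they were indexed under.
docs = [
    {"_id": "1", "color": "red", "price": 100},
    {"_id": "2", "color": "green", "price": 500},
    {"_id": "3", "color": ["red", "blue"], "price": 1000},
]


def top_hits(documents, size, sort_field, order="asc"):
    """Sort the bucket's documents by one field and keep the first `size`."""
    ranked = sorted(documents, key=lambda d: d[sort_field],
                    reverse=(order == "desc"))
    return ranked[:size]


cheapest = top_hits(docs, size=2, sort_field="price")
priciest = top_hits(docs, size=2, sort_field="price", order="desc")

print([d["_id"] for d in cheapest])
print([d["_id"] for d in priciest])
```

The sort belongs to the aggregation itself, which mirrors the point that a sort on the main query does not carry over into aggs.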