简介
本文介绍如何根据某一个字段进行去重。包括:获取去重后的结果,统计去重后的数量。
在SQL中,我们可以用dinstinct语句进行去重,例如:
- 获取去重后的结果:SELECT DISTINCT name, sex FROM person;
- 统计去重后的数量:SELECT COUNT(DISTINCT name, sex) FROM person;
Elasticsearch也可以做到获取去重后的结果,统计去重后的数量,例如:
- 获取去重后的结果
- 方案1:collapse折叠功能(ES5.3之后支持)
- 推荐。原因:性能高,占内存小
- 方案2:字段聚合+top_hits聚合
- 不推荐。原因:性能差,占内存大
- 方案1:collapse折叠功能(ES5.3之后支持)
- 统计去重后的数量
- 聚合+cardinality聚合函数
索引结构及数据
索引结构
http://localhost:9200/
PUT blog
{ "mappings": { "properties": { "id":{ "type":"long" }, "title": { "type": "text" }, "content": { "type": "text" }, "author":{ "type": "text", "fields": { "keyword": { "type": "keyword" } } }, "category":{ "type": "keyword" }, "createTime": { "type": "date", "format":"yyyy-MM-dd HH:mm:ss.SSS||yyyy-MM-dd'T'HH:mm:ss.SSS||yyyy-MM-dd HH:mm:ss||epoch_millis" }, "updateTime": { "type": "date", "format":"yyyy-MM-dd HH:mm:ss.SSS||yyyy-MM-dd'T'HH:mm:ss.SSS||yyyy-MM-dd HH:mm:ss||epoch_millis" }, "status":{ "type":"integer" }, "serialNum": { "type": "keyword" } } } }
数据
- 每个文档必须独占一行,不能换行。
- 此命令要放到postman中去执行,如果用head执行会失败
http://localhost:9200/
POST _bulk
{"index":{"_index":"blog","_id":1}} {"blogId":1,"title":"Spring Data ElasticSearch学习教程1","content":"这是批量添加的文档1","author":"Tony","category":"ElasticSearch","status":1,"serialNum":"1","createTime":"2021-10-10 11:52:01.249","updateTime":null} {"index":{"_index":"blog","_id":2}} {"blogId":2,"title":"Spring Data ElasticSearch学习教程2","content":"这是批量添加的文档2","author":"Tony","category":"ElasticSearch","status":1,"serialNum":"2","createTime":"2021-10-10 11:52:02.249","updateTime":null} {"index":{"_index":"blog","_id":3}} {"blogId":3,"title":"Spring Data ElasticSearch学习教程3","content":"这是批量添加的文档3","author":"Tony","category":"ElasticSearch","status":1,"serialNum":"3","createTime":"2021-10-10 11:52:03.249","updateTime":null} {"index":{"_index":"blog","_id":4}} {"blogId":4,"title":"Spring Data ElasticSearch学习教程4","content":"这是批量添加的文档4","author":"Tony","category":"ElasticSearch","status":1,"serialNum":"4","createTime":"2021-10-10 11:52:04.249","updateTime":null} {"index":{"_index":"blog","_id":5}} {"blogId":5,"title":"Spring Data ElasticSearch学习教程5","content":"这是批量添加的文档5","author":"Tony","category":"ElasticSearch","status":1,"serialNum":"5","createTime":"2021-10-10 11:52:05.249","updateTime":null} {"index":{"_index":"blog","_id":6}} {"blogId":6,"title":"Java学习教程6","content":"这是批量添加的文档6","author":"Tony","category":"ElasticSearch","status":1,"serialNum":"6","createTime":"2021-10-10 11:52:06.249","updateTime":null} {"index":{"_index":"blog","_id":7}} {"blogId":7,"title":"Java学习教程7","content":"这是批量添加的文档7","author":"Pepper","category":"ElasticSearch","status":1,"serialNum":"7","createTime":"2021-10-10 11:52:07.249","updateTime":null} {"index":{"_index":"blog","_id":8}} {"blogId":8,"title":"Java学习教程8","content":"这是批量添加的文档8","author":"Pepper","category":"ElasticSearch","status":1,"serialNum":"8","createTime":"2021-10-10 11:52:08.249","updateTime":null} {"index":{"_index":"blog","_id":9}} {"blogId":9,"title":"Java学习教程9","content":"这是批量添加的文档9","author":"Pepper","category":"ElasticSearch","status":1,"serialNum":"9","createTime":"2021-10-10 11:52:09.249","updateTime":null} {"index":{"_index":"blog","_id":10}} {"blogId":10,"title":"Java学习教程10","content":"这是批量添加的文档10","author":"Pepper","category":"ElasticSearch","status":1,"serialNum":"10","createTime":"2021-10-10 11:52:10.249","updateTime":null}
执行之后的结果
实例:手写DSL
去重的字段不能是text类型。所以,author的mapping要有keyword,且通过author.keyword去重。
如果去重字段是其他可以直接去重的类型,比如:数字类型、keyword、日期等,则直接用字段名就可以。即:如果本处author是keyword,则author.keyword处写成author就行。
collapse获取去重结果
POST /blog/_search
{ "query": { "match": { "title":{ "query": "java" } } }, "collapse":{ "field": "author.keyword" } }
结果:
我把结果全部贴出来
{ "took": 2, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 5, "relation": "eq" }, "max_score": null, "hits": [{ "_index": "blog", "_type": "_doc", "_id": "9", "_score": 1.0596458, "_source": { "blogId": 9, "title": "Java学习教程9", "content": "这是批量添加的文档9", "author": "Pepper", "category": "ElasticSearch", "status": 1, "serialNum": "9", "createTime": "2021-10-10 11:52:09.249", "updateTime": null }, "fields": { "author.keyword": [ "Pepper" ] } }, { "_index": "blog", "_type": "_doc", "_id": "6", "_score": 0.7361701, "_source": { "blogId": 6, "title": "Java学习教程6", "content": "这是批量添加的文档6", "author": "Tony", "category": "ElasticSearch", "status": 1, "serialNum": "6", "createTime": "2021-10-10 11:52:06.249", "updateTime": null }, "fields": { "author.keyword": [ "Tony" ] } } ] } }
聚合获取去重结果
POST /blog/_search
{ "query": { "match": { "title": { "query": "java" } } }, "size": 0, "aggs": { "author_aggs": { "terms": { "field": "author.keyword", "size": 10 }, "aggs": { "author_top": { "top_hits": { "sort": [{ "author.keyword": { "order": "desc" } } ], "size": 1 } } } } } }
结果
我把结果全部贴出来
{ "took": 2, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 5, "relation": "eq" }, "max_score": null, "hits": [] }, "aggregations": { "author_aggs": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [{ "key": "Pepper", "doc_count": 4, "author_top": { "hits": { "total": { "value": 4, "relation": "eq" }, "max_score": null, "hits": [{ "_index": "blog", "_type": "_doc", "_id": "9", "_score": null, "_source": { "blogId": 9, "title": "Java学习教程9", "content": "这是批量添加的文档9", "author": "Pepper", "category": "ElasticSearch", "status": 1, "serialNum": "9", "createTime": "2021-10-10 11:52:09.249", "updateTime": null }, "sort": [ "Pepper" ] } ] } } }, { "key": "Tony", "doc_count": 1, "author_top": { "hits": { "total": { "value": 1, "relation": "eq" }, "max_score": null, "hits": [{ "_index": "blog", "_type": "_doc", "_id": "6", "_score": null, "_source": { "blogId": 6, "title": "Java学习教程6", "content": "这是批量添加的文档6", "author": "Tony", "category": "ElasticSearch", "status": 1, "serialNum": "6", "createTime": "2021-10-10 11:52:06.249", "updateTime": null }, "sort": [ "Tony" ] } ] } } } ] } } }
聚合获取去重数量
POST /blog/_search
{ "query": { "match": { "title": { "query": "java" } } }, "size": 0, "aggs": { "author_aggs": { "cardinality": { "field": "author.keyword" } } } }
结果
collapse获取去重结果
见:Spring Data Elasticsearch-ElasticsearchRestTemplate-使用实例 – 自学精灵
请先
!