PostgreSQL 哈希链接和哈希聚合

AustinDatabases

发布于 2024-03-02 10:35:45

2350

发布于 2024-03-02 10:35:45

文章被收录于专栏：AustinDatabases

在PostgreSQL中，表和表之间进行关联关系的情况下，在等值链接中，两个表如果一个是大表一个是小表，PostgreSQL 更倾向与使用 hash join 的方式来解决问题。主要的原因在于通过hash join 会利用内存来进行等值链接的对比针对这种链接的方式，效率更高，

SELECT customer.first_name, customer.last_name, SUM(rental.return_date - rental.rental_date) AS total_rental_duration
FROM customer
JOIN rental ON customer.customer_id = rental.customer_id
GROUP BY customer.customer_id
ORDER BY total_rental_duration DESC
LIMIT 5;

我们用一个示例的语句来说明这个问题，上面的语句是针对两个表进行Join操作，并且针对小表的中的客户来进行分组，找到在dvdrental 中租赁最多的前五。

dvdrental=# select count(*) from rental;
 count 
-------
 16044
(1 row)

dvdrental=# select count(*) from customer;
 count 
-------
   599
(1 row)n Time: 9.835 ms
(15 rows)

 Limit  (cost=511.60..511.62 rows=5 width=33) (actual time=9.752..9.755 rows=5 loops=1)
   ->  Sort  (cost=511.60..513.10 rows=599 width=33) (actual time=9.751..9.753 rows=5 loops=1)
         Sort Key: (sum((rental.return_date - rental.rental_date))) DESC
         Sort Method: top-N heapsort  Memory: 25kB
         ->  HashAggregate  (cost=495.66..501.65 rows=599 width=33) (actual time=9.555..9.645 rows=599 loops=1)
               Group Key: customer.customer_id
               Batches: 1  Memory Usage: 169kB
               ->  Hash Join  (cost=22.48..375.33 rows=16044 width=33) (actual time=0.186..5.451 rows=16044 loops=1)
                     Hash Cond: (rental.customer_id = customer.customer_id)
                     ->  Seq Scan on rental  (cost=0.00..310.44 rows=16044 width=18) (actual time=0.006..1.233 rows=16044 loops=1)
                     ->  Hash  (cost=14.99..14.99 rows=599 width=17) (actual time=0.160..0.161 rows=599 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 39kB
                           ->  Seq Scan on customer  (cost=0.00..14.99 rows=599 width=17) (actual time=0.004..0.073 rows=599 loops=1)
 Planning Time: 0.545 ms
 Execution Time: 9.835 ms
(15 rows)

在实际运行中，通过执行计划，可以很清晰的看到在执行中，先对小表customer 表进行hash buckets 操作，然后在对rental 表进行hash buckets 操作，最后进行了hash join 的表连接，hash buckets 主要的作用是存储具有相同哈希值的键值连接条件。当进行hash 链接，系统将遍历每个hash buckets，搜索具有匹配hash 值的连接，最后返回匹配行。

hash 连接在使用中需要注意，在使用中两个数据集合都需要加载到内存中，来构建hash 表进行hash 操作，并且在使用hash 桶的情况下，需要注意值的倾斜的问题，如果表中的大部分值都是一致的则使用这样的算法会导致一个hash 桶的数据量远远大于其他的桶。其中优化的方式一般通过充足的内存，优化hash 函数，让值更加的分散到hash 桶中等，此为hash 连接。

hash 聚合，哈希聚合是种常用的数据处理算法，他会对如sum, avg max, min 等group by 操作进行数据的分组和聚合计算，在处理的过程中，会将数据分成多个组，每个组具有相同的分组键，聚合计算会对该组中的数据进行合并计算。

hash 聚合的优点减少了磁盘的IO 消耗，将大部分聚合计算都在内存中进行，同时基于hash聚合可以使用并行的能力，充分利用多核心的CPU 来进行计算加速数据的处理。

在PostgreSQL中有四个参数与这部分有关

postgres=# select name,setting from pg_settings where name like '%hash%' or name like 'hash%' or name like '%hash';
         name         | setting 
----------------------+---------
 enable_hashagg       | on
 enable_hashjoin      | on
 enable_parallel_hash | on
 hash_mem_multiplier  | 1

这里有是哪个部分，hash 聚合，hash join ，hash 并行三个部分都可以进行开关，默认是开启的，另一个部分hash_mem_multiplier ，这个参数定义了操作中hash 表所需要的内存倍数，其中指定了 hash表使用内存大小与work_mem 的比例，其中这里有两种设置参数值。

hash_mem_multiplier 可以设置的值为 1 OR 2 ，1 为产生了hash内存值 = work_mem , 2 为 hash内存值 = work_mem /2 , 这里需要注意，如果work_mem 设置的较大，可以采用 2 ,如果 work_mem 本身不大则默认为1 即可。

                                                                      QUERY PLAN                                                     
                 
-------------------------------------------------------------------------------------------------------------------------------------
-----------------
 Limit  (cost=1847.14..1847.15 rows=5 width=33) (actual time=17.283..17.287 rows=5 loops=1)
   ->  Sort  (cost=1847.14..1848.64 rows=599 width=33) (actual time=17.281..17.284 rows=5 loops=1)
         Sort Key: (sum((rental.return_date - rental.rental_date))) DESC
         Sort Method: top-N heapsort  Memory: 25kB
         ->  GroupAggregate  (cost=1431.37..1837.19 rows=599 width=33) (actual time=9.231..17.123 rows=599 loops=1)
               Group Key: customer.customer_id
               ->  Merge Join  (cost=1431.37..1710.87 rows=16044 width=33) (actual time=9.197..14.684 rows=16044 loops=1)
                     Merge Cond: (customer.customer_id = rental.customer_id)
                     ->  Index Scan using customer_pkey on customer  (cost=0.28..37.63 rows=599 width=17) (actual time=0.008..1.470 r
ows=599 loops=1)
                     ->  Sort  (cost=1431.09..1471.20 rows=16044 width=18) (actual time=9.178..10.859 rows=16044 loops=1)
                           Sort Key: rental.customer_id
                           Sort Method: quicksort  Memory: 1632kB
                           ->  Seq Scan on rental  (cost=0.00..310.44 rows=16044 width=18) (actual time=0.020..2.945 rows=16044 loops
=1)
 Planning Time: 1.963 ms
 Execution Time: 17.963 ms
(15 rows)

dvdrental=# select name,setting from pg_settings where name like '%hash%' or name like 'hash%' or name like '%hash';
         name         | setting 
----------------------+---------
 enable_hashagg       | off
 enable_hashjoin      | off
 enable_parallel_hash | on
 hash_mem_multiplier  | 1
(4 rows)

dvdrental=#

从这个结果我们可以看到在hash join ,hash 聚合关闭的情况下，对我们整体的执行计划至少慢了1倍。

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2024-02-27，如有侵权请联系 cloudcommunity@tencent.com 删除

内存