前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >PG中的查询:2.统计--(2)

PG中的查询:2.统计--(2)

作者头像
yzsDBA
发布2022-03-29 11:36:11
7260
发布2022-03-29 11:36:11
举报
文章被收录于专栏:PostgreSQL研究与原理解析

直方图

当不同值的数量变得太大而无法将他们全部存储在数组时,系统开始使用直方图表示。直方图使用多个存储桶来存储值。存储桶的数量受相同的default_statistics_target参数限制。每个桶的宽度以这样一种方式选择,即在他们之间均匀分布值(如图上具有大致相同面积的矩形表示)。这种表示使系统能够只存储直方图边界,而不是浪费空间来存储每个桶的频率。直方图不包括MCV列表中的值。边界存储在pg_stats的histogram_bounds字段。任何桶的汇总频率等于1/桶数。

直方图存储为存储桶边界数组:

代码语言:javascript
复制
SELECT left(histogram_bounds::text,60) || '...' AS histogram_bounds
FROM pg_stats s
WHERE s.tablename = 'boarding_passes' AND s.attname = 'seat_no';

                        histogram_bounds
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 {10B,10D,10D,10F,11B,11C,11H,12H,13B,14B,14H,15H,16D,16D,16H...
(1 row)

除其他外,直方图与MCV列表一起用于估计“大于”和“小于”操作的选择性。示例:计算为后排座位签发登机牌数量:

代码语言:javascript
复制
EXPLAIN SELECT * FROM boarding_passes WHERE seat_no > '30C';

                             QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Seq Scan on boarding_passes  (cost=0.00..157353.30 rows=2943394 ...
   Filter: ((seat_no)::text > '30C'::text)

(2 rows)截止座位号专门选择在两个桶之间的边缘。此条件的选择性是N/桶数。其中N是具有匹配值的桶数(在截止点右侧)。请记住,直方图没有考虑最常见的值和未定义的值。我们先看看匹配最常见值的分数:

代码语言:javascript
复制
SELECT sum(s.most_common_freqs[
  array_position((s.most_common_vals::text::text[]),v)
])
FROM pg_stats s, unnest(s.most_common_vals::text::text[]) v
WHERE s.tablename = 'boarding_passes' AND s.attname = 'seat_no'
AND v > '30C';

  sum
−−−−−−−−
 0.2127
(1 row)

看看最常见值的分数(从直方图中排除):

代码语言:javascript
复制
SELECT sum(s.most_common_freqs[
  array_position((s.most_common_vals::text::text[]),v)
])
FROM pg_stats s, unnest(s.most_common_vals::text::text[]) v
WHERE s.tablename = 'boarding_passes' AND s.attname = 'seat_no';

  sum
−−−−−−−−
 0.6762
(1 row)

seat_no列中没有NULL值:

代码语言:javascript
复制
SELECT s.null_frac
FROM pg_stats s
WHERE s.tablename = 'boarding_passes' AND s.attname = 'seat_no';

 null_frac
−−−−−−−−−−−
         0
(1 row)

间隔正好涵盖49个桶(总共100个),结果估算:

代码语言:javascript
复制
SELECT round( reltuples * (
   0.2127 -- from most common values
 + (1 - 0.6762 - 0) * (49 / 100.0) -- from histogram
))
FROM pg_class WHERE relname = 'boarding_passes';

  round
−−−−−−−−−
 2943394
(1 row)

真实值是2986429.当截止值不再桶边缘时,该桶的匹配分数是使用线性插值计算的。

较高的default_statistis_target值可能会提高估计精度,但直方图与MCV列表一起已经产生了很好的结果,即使有大量不同值:

代码语言:javascript
复制
SELECT n_distinct FROM pg_stats
WHERE tablename = 'boarding_passes' AND attname = 'seat_no';

 n_distinct
−−−−−−−−−−−−
        461
(1 row)

更高的估计精度只有在提高规划质量时才有用。在没有正当理由下增加default_statistics_target可能会减慢分析和计划,但对优化没有影响。另一方面,降低参数(一直降到0)可能会提高分析和计划速度,但也可能导致计划质量地下,因此这种“节省时间”很少被证明是合理的。

非标量数据类型的统计

非标量数据类型的统计信息可能包括非标量值本身的分布数据,还包括组成元素的分布数据。这允许查询非第一范式中的列时进行更准确的计划。

1) 数组most_common_elems和most_common_elem_freqs 包含最常见的元素及其频率。这些统计数据被采集,并用于估计阵列和tsvector数据的选择性。

2) elem_count_histogram数组是不同值的数量的直方图。采集这些数据,并仅用于估计阵列的选择性

3) 对于范围数据类型,直方图用于表示范围长度的分布及下限和上限的分布。然后,这些直方图有助于估计使用这些数据类型的各种操作的选择性。他们未显示在pg_stats中。

这些统计信息也用于PG14中引入的多范围数据类型。

平均字段宽度

pg_stats中的avg_width字段表示一列中平均字段宽度。数据类型比如integer或char(3)的字段宽度是固定的,但是当使用没有设置宽度的数据类型(例如text)时,值可能会因列而异。

代码语言:javascript
复制
SELECT attname, avg_width FROM pg_stats
WHERE (tablename, attname) IN ( VALUES
  ('tickets', 'passenger_name'), ('ticket_flights','fare_conditions')
);

     attname     | avg_width
−−−−−−−−−−−−−−−−−+−−−−−−−−−−−
 fare_conditions |         8
 passenger_name  |        16
(2 rows)

这些信息帮助评估排序或哈希等操作的内存使用。

相关性

pg_stats中correlation字段表示磁盘上物理行序和列值的逻辑序(大于或者小于)之间的关系,值从-1到1。如果值存储有序,相关值接近1。如果值反向有序,相关值接近于-1。数据在磁盘上分布越混乱,值越接近于0。

代码语言:javascript
复制
SELECT attname, correlation
FROM pg_stats WHERE tablename = 'airports_data'
ORDER BY abs(correlation) DESC;
   attname    | correlation
−−−−−−−−−−−−−−+−−−−−−−−−−−−−
 coordinates  |
 airport_code | −0.21120238
 city         |  −0.1970127
 airport_name | −0.18223621
 timezone     |  0.17961165
(5 rows)

coordinates无法收集列的统计信息,因为没有为点数据类型定义比较操作(小于和大于)。相关行用于索引扫描成本估计。

表达式统计

通常来讲,列统计信息只在操作调用列本身时使用,而不是用于以列为参数的表达式。规划器并不知道函数如何影响列的统计信息。因此向“function-call = constant”这样的条件,总是估计在0.5%。

代码语言:javascript
复制
EXPLAIN SELECT * FROM flights
WHERE extract(
  month FROM scheduled_departure AT TIME ZONE 'Europe/Moscow'
) = 1;

                             QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Seq Scan on flights  (cost=0.00..6384.17 rows=1074 width=63)
   Filter: (EXTRACT(month FROM (scheduled_departure AT TIME ZONE ...
(2 rows)

SELECT round(reltuples * 0.005)
FROM pg_class WHERE relname = 'flights';

 round
−−−−−−−
  1074
(1 row)

规划器不能处理标准函数,而对我们来说,很明显1月份的航班比例将约为总航班的1/12:

代码语言:javascript
复制
SELECT count(*) AS total,
  count(*) FILTER (WHERE extract(
    month FROM scheduled_departure AT TIME ZONE 'Europe/Moscow'
  ) = 1) AS january
FROM flights;
 total  | january
−−−−−−−−+−−−−−−−−−
 214867 |   16831
(1 row)

这就是表达式统计的用武之地。

扩展表达式统计

PG14引入了一种称为扩展表达式统计的特性。扩展表达式统计信息不会自动收集。要手动收集他们,请使用CREATE STATISTICS命令创建扩展统计数据库对象。

代码语言:javascript
复制
CREATE STATISTICS flights_expr ON (extract(
  month FROM scheduled_departure AT TIME ZONE 'Europe/Moscow'
))
FROM flights;

新的统计信息会提高估算准确性:

代码语言:javascript
复制
ANALYZE flights;
EXPLAIN SELECT * FROM flights
WHERE extract(
  month FROM scheduled_departure AT TIME ZONE 'Europe/Moscow'
) = 1;

                             QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Seq Scan on flights  (cost=0.00..6384.17 rows=16222 width=63)
   Filter: (EXTRACT(month FROM (scheduled_departure AT TIME ZONE ...
(2 rows)

要使统计信息起作用,统计信息生成命令中的表达式必须与原始查询中表达式相同。扩展统计信息元数据存储在pg_statistic_ext系统表中,而统计数据本身存储在单独的表pg_statistic_ext_data(PG12及更高版本)中。如有必要,与元数据分开存储,以限制用户访问敏感信息。

有些视图以用户友好的形式显示收集的统计信息。可以使用以下命令显示扩展表达式统计信息:

代码语言:javascript
复制
SELECT left(expr,50) || '...' AS expr,
  null_frac, avg_width, n_distinct,
  most_common_vals AS mcv,
  left(most_common_freqs::text,50) || '...' AS mcf,
  correlation
FROM pg_stats_ext_exprs WHERE statistics_name = 'flights_expr' \gx
-[ RECORD 1 ]−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
expr        | EXTRACT(month FROM (scheduled_departure AT TIME ZO...
null_frac   | 0
avg_width   | 8
n_distinct  | 12
mcv         | {8,9,3,5,12,4,10,7,11,1,6,2}
mcf         | {0.12526667,0.11016667,0.07903333,0.07903333,0.078..
correlation | 0.095407926

可以使用以下命令ALTER STATISTICS命令更改收集的统计数据量:

代码语言:javascript
复制
ALTER STATISTICS flights_expr SET STATISTICS 42;

表达式索引统计

正如使用常规表一样,建立表达式索引时,系统会收集其统计信息。计划者也可以使用这些统计数据。这很方便,但前提是我们真正关系索引。

代码语言:javascript
复制
DROP STATISTICS flights_expr;
CREATE INDEX ON flights(extract(
  month FROM scheduled_departure AT TIME ZONE 'Europe/Moscow'
)); => ANALYZE flights;

EXPLAIN SELECT * FROM flights WHERE extract(
  month FROM scheduled_departure AT TIME ZONE 'Europe/Moscow'
) = 1;

                             QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Bitmap Heap Scan on flights  (cost=318.42..3235.96 rows=16774 wi...
   Recheck Cond: (EXTRACT(month FROM (scheduled_departure AT TIME...
   −> Bitmap Index Scan on flights_extract_idx  (cost=0.00..314.2...
       Index Cond: (EXTRACT(month FROM (scheduled_departure AT TI...
(4 rows)

表达式索引统计信息的存储方式与表统计信息相同。例如,这是不同值的数量:

代码语言:javascript
复制
SELECT n_distinct FROM pg_stats
WHERE tablename = 'flights_extract_idx';

 n_distinct
−−−−−−−−−−−−
         12
(1 row)

PG11及更高版本中,可以使用ALTER INDEX命令更改索引统计的准确性。您可能需要引用该表达式的列名。例如:

代码语言:javascript
复制
SELECT attname FROM pg_attribute
WHERE attrelid = 'flights_extract_idx'::regclass;

 attname
−−−−−−−−−
 extract
(1 row)

ALTER INDEX flights_extract_idx
ALTER COLUMN extract SET STATISTICS 42;

多元统计

PG10引入了同时从多个列收集统计信息的能力,也成为multivariate statistics。需要手动生成必应的扩展统计信息。multivariate statistics分为三种类型。

列之间的功能依赖关系

当一列中的值(完全或部分)由另一列中的值确定时,并且在查询中存在引用两列的条件,则结果基数将被低估。这是一个具有2个条件的示例:

代码语言:javascript
复制
SELECT count(*) FROM flights
WHERE flight_no = 'PG0007' AND departure_airport = 'VKO';

 count
−−−−−−−
   396
(1 row)

估计值明显低于应有值,只有26行:

代码语言:javascript
复制
EXPLAIN SELECT * FROM flights
WHERE flight_no = 'PG0007' AND departure_airport = 'VKO';

                             QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Bitmap Heap Scan on flights  (cost=12.03..1238.70 rows=26 width=63)
   Recheck Cond: (flight_no = 'PG0007'::bpchar)
   Filter: (departure_airport = 'VKO'::bpchar)
   −> Bitmap Index Scan on flights_flight_no_scheduled_departure_key
       (cost=0.00..12.02 rows=480 width=0)
       Index Cond: (flight_no = 'PG0007'::bpchar)
(6 rows)

这就是臭名昭著的correlated predicates问题。规划器期望谓词独立,并计算得到的选择性集合AND作为条件选择性的乘积。应用bitmap heap scan中的departure_airport条件后,为计算flight_no条件的Bitmap索引扫描评估值显著下降。当然,航班号已经明确定义了出发机场,因此第二个条件实际上是多余的。这是扩展功能依赖统计可以帮助改进估计的地方。让我们为两列创建扩展的函数依赖统计信息:

代码语言:javascript
复制
CREATE STATISTICS flights_dep(dependencies)
ON flight_no, departure_airport FROM flights;

再次分析,现在使用新的统计数据,估计得到改善:

代码语言:javascript
复制
ANALYZE flights;

EXPLAIN SELECT * FROM flights
WHERE flight_no = 'PG0007' AND departure_airport = 'VKO';

                             QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Bitmap Heap Scan on flights  (cost=10.56..816.91 rows=276 width=63)
   Recheck Cond: (flight_no = 'PG0007'::bpchar)
   Filter: (departure_airport = 'VKO'::bpchar)
   −> Bitmap Index Scan on flights_flight_no_scheduled_departure_key
       (cost=0.00..10.49 rows=276 width=0)
       Index Cond: (flight_no = 'PG0007'::bpchar)
(6 rows)

统计信息存储在系统表中,可以使用以下命令显示:

代码语言:javascript
复制
SELECT dependencies
FROM pg_stats_ext WHERE statistics_name = 'flights_dep';

               dependencies
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 {"2 => 5": 1.000000, "5 => 2": 0.010567}
(1 row)

数字2和5是表pg_attribute中的列号。他们旁边的值表示函数依赖程度。从0(独立)到1(第2列中的值完全由第一列中的值定义)

不同值的多元数量

对于来自多列的值的不同组合数量的统计将显著提高GROUP BY对多列操作的基数。在此示例中,规划器将离开和抵达机场的对数估计值作为机场总数的平方。然而真正成对数量要低的多。因为并非每2个机场都通过直飞航班连接:

代码语言:javascript
复制
SELECT count(*) FROM (
  SELECT DISTINCT departure_airport, arrival_airport FROM flights
) t;

 count
−−−−−−−
   618
(1 row)

EXPLAIN
SELECT DISTINCT departure_airport, arrival_airport FROM flights;

                             QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 HashAggregate  (cost=5847.01..5955.16 rows=10816 width=8)
   Group Key: departure_airport, arrival_airport
   −> Seq Scan on flights  (cost=0.00..4772.67 rows=214867 width=8)
(3 rows)

让我们为不同值的数量创建一个统计信息:

代码语言:javascript
复制
CREATE STATISTICS flights_nd(ndistinct)
ON departure_airport, arrival_airport FROM flights;
ANALYZE flights;

EXPLAIN
SELECT DISTINCT departure_airport, arrival_airport FROM flights;

                             QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 HashAggregate (cost=5847.01..5853.19 rows=618 width=8)
   Group Key: departure_airport, arrival_airport
   −> Seq Scan on flights  (cost=0.00..4772.67 rows=214867 width=8)
(3 rows)

统计信息存储在系统表:

代码语言:javascript
复制
SELECT n_distinct
FROM pg_stats_ext WHERE statistics_name = 'flights_nd';
  n_distinct
−−−−−−−−−−−−−−−
 {"5, 6": 618}
(1 row)

多元最常见值列表

当值分布不均匀时,仅功能依赖数据可能不够,因为估计值将根据特定的值对而变化。考虑这个例子,计划者错误地估计了波音737从Sheremetyevo机场起飞的航班数量:

代码语言:javascript
复制
SELECT count(*) FROM flights
WHERE departure_airport = 'SVO' AND aircraft_code = '733'
 count
−−−−−−−
  2037
(1 row)

EXPLAIN SELECT * FROM flights
WHERE departure_airport = 'SVO' AND aircraft_code = '733';

                             QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Seq Scan on flights  (cost=0.00..5847.00 rows=733 width=63)
   Filter: ((departure_airport = 'SVO'::bpchar) AND (aircraft_cod...
(2 rows)

我们可以使用多元MCV列表统计来改进估计:

代码语言:javascript
复制
CREATE STATISTICS flights_mcv(mcv)
ON departure_airport, aircraft_code FROM flights;
ANALYZE flights;
EXPLAIN SELECT * FROM flights
WHERE departure_airport = 'SVO' AND aircraft_code = '733';

                             QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Seq Scan on flights  (cost=0.00..5847.00 rows=2077 width=63)
   Filter: ((departure_airport = 'SVO'::bpchar) AND (aircraft_cod...
(2 rows)

现有系统表有频率数据供规划器使用:

代码语言:javascript
复制
SELECT values, frequency
FROM pg_statistic_ext stx
  JOIN pg_statistic_ext_data stxd ON stx.oid = stxd.stxoid,
  pg_mcv_list_items(stxdmcv) m
WHERE stxname = 'flights_mcv'
AND values = '{SVO,773}';
  values   |      frequency
−−−−−−−−−−−+−−−−−−−−−−−−−−−−−−−−−−
 {SVO,773} | 0.005733333333333333
(1 row)

一个multivariate most common值列表存储default_statistics_target 值,就像一个常规 MCV 列表一样。如果参数是在列级别定义,则使用最大值。与扩展表达式统计信息一样,您可以更改列表大小(PG13及更高版本):

代码语言:javascript
复制
ALTER STATISTICS ... SET STATISTICS ...;

在这些示例中,仅为2列收集了multivariate statistics,但是可以根据需要为任意多的列收集他们。还可以将不同类型的统计信息收集到单个扩展统计信息对象中。为此,只需在创建对象时列出逗号分隔的所需统计类型。如果没有定义特定的统计类型,系统将一次收集所有可用的统计信息。

PG14在做multivariate和表达式统计时,不仅允许使用列名,也可以使用任意表达式。

原文

https://postgrespro.com/blog/pgsql/5969296

本文参与 腾讯云自媒体同步曝光计划,分享自微信公众号。
原始发表:2022-03-16,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 yanzongshuaiDBA 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 直方图
  • 非标量数据类型的统计
  • 平均字段宽度
  • 相关性
  • 表达式统计
  • 扩展表达式统计
  • 表达式索引统计
  • 多元统计
  • 原文
相关产品与服务
对象存储
对象存储(Cloud Object Storage,COS)是由腾讯云推出的无目录层次结构、无数据格式限制,可容纳海量数据且支持 HTTP/HTTPS 协议访问的分布式存储服务。腾讯云 COS 的存储桶空间无容量上限,无需分区管理,适用于 CDN 数据分发、数据万象处理或大数据计算与分析的数据湖等多种场景。
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档