Hive语法内关于With as 的数据是否会缓存到内存分析

挽风

发布于 2021-12-07 17:42:58

1.1K0

发布于 2021-12-07 17:42:58

文章被收录于专栏：小道

WITH AS短语，也叫做子查询部分（subquery factoring），可以定义一个SQL片断，该SQL片断可以被后面的SQL语句引用，从而使SQL语句的可读性更高。

在关系型数据库中，对于UNION ALL，使用WITH AS定义了一个UNION ALL语句，当该片断被调用2次以上，优化器会自动将该WITH AS短语所获取的数据放入一个Temp表中。

Hive在后面的版本也引入了WITH AS 这个公用表表达式（CTE）语法糖，但是对于后面语句的多次引用是否会继续将该WITH AS短语所获取的数据放入一个Temp表中呢？下面将通过对SQL的执行计划进行分析得到结论。

explain 
with top10 as 
(
select
	a.member_id,
	a.store_id,
	a.store_name,
	count(a.order_id) freq
from
	ads_member_label_p_ordertop10 a 
where
	dt = '20210614'
group by
	a.member_id,
	a.store_id,
	a.store_name
)
select
	*
from
	top10
union all
select
	*
from
	top10

通过执行上述HQL可以得到如下执行计划：

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1, Stage-3
  Stage-3 is a root stage
  Stage-0 depends on stages: Stage-2

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            filterExpr: (dt = '20210614') (type: boolean)
            Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: member_id (type: string), order_id (type: string), store_id (type: string), store_name (type: string)
              outputColumnNames: member_id, order_id, store_id, store_name
              Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: count(order_id)
                keys: member_id (type: string), store_id (type: string), store_name (type: string)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string)
                  sort order: +++
                  Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string)
                  Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col3 (type: bigint)
      Execution mode: vectorized
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3
          Statistics: Num rows: 12321230 Data size: 8341066198 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            Union
              Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
          TableScan
            Union
              Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            filterExpr: (dt = '20210614') (type: boolean)
            Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: member_id (type: string), order_id (type: string), store_id (type: string), store_name (type: string)
              outputColumnNames: member_id, order_id, store_id, store_name
              Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: count(order_id)
                keys: member_id (type: string), store_id (type: string), store_name (type: string)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string)
                  sort order: +++
                  Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string)
                  Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col3 (type: bigint)
      Execution mode: vectorized
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3
          Statistics: Num rows: 12321230 Data size: 8341066198 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1, Stage-3
  Stage-3 is a root stage
  Stage-0 depends on stages: Stage-2

通过对执行计划进行分析可以得知Stage-1和Stage-3对表ads_member_label_p_ordertop10分别执行了两次WITH AS内逻辑，由STAGE DEPENDENCIES得知而且在Stage-2依赖于Stage-1和Stage-3，且Stage-2中UNOIN操作对表的扫描操作（TableScan）并未进行数据过滤，Stage1-3影响的数据行数均为24642460行。

所以在Hive内 WITH AS语法默认并不会将执行结果作为临时表存入内存，而是每一次引用都会执行一次WITH AS内的计算逻辑，而MySQL和SqlServer这种关系型数据库执行WITH AS时会将计算结果作为临时表。

在hive中有一个参数

	hive.optimize.cte.materialize.threshold

这个参数在默认情况下是-1（关闭的）；当开启（大于0），比如设置为2，则如果with…as语句被引用2次及以上时，会把with…as语句生成的table物化，从而做到with…as语句只执行一次，来提高效率。

将hive.optimize.cte.materialize.threshold设置为1，可以得到如下执行计划

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
  Stage-3 depends on stages: Stage-0
  Stage-2 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            filterExpr: (dt = '20210620') (type: boolean)
            Statistics: Num rows: 32891415 Data size: 22266423831 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: member_id (type: string), order_id (type: string), store_id (type: string), store_name (type: string)
              outputColumnNames: member_id, order_id, store_id, store_name
              Statistics: Num rows: 32891415 Data size: 22266423831 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: count(order_id)
                keys: member_id (type: string), store_id (type: string), store_name (type: string)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 32891415 Data size: 22266423831 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string)
                  sort order: +++
                  Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string)
                  Statistics: Num rows: 32891415 Data size: 22266423831 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col3 (type: bigint)
      Execution mode: vectorized
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3
          Statistics: Num rows: 16445707 Data size: 11133211577 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 16445707 Data size: 11133211577 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                name: dv_report.top10

  Stage: Stage-0
    Move Operator
      files:
          hdfs directory: true
          destination: hdfs://iphost/tmp/hive/hive/16166a7f-d924-442a-9763-24b59ce050cb/_tmp_space.db/249708d7-bbc6-491b-b871-6b11385798d0

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: top10
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: member_id (type: string), store_id (type: string), store_name (type: string), freq (type: bigint)
              outputColumnNames: _col0, _col1, _col2, _col3
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Union
                Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
          TableScan
            alias: top10
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: member_id (type: string), store_id (type: string), store_name (type: string), freq (type: bigint)
              outputColumnNames: _col0, _col1, _col2, _col3
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Union
                Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-2
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

由执行计划可知，在设置参数之后计算结果作为一张临时表存储在hdfs://iphost/tmp/hive/hive/16166a7f-d924-442a-9763-24b59ce050cb/_tmp_space.db/249708d7-bbc6-491b-b871-6b11385798d0路径下,即通过Stage-0阶段将计算结果存入一张临时表内。

本文参与腾讯云自媒体同步曝光计划，分享自作者个人站点/博客。

原始发表：2021/06/21 ，如有侵权请联系 cloudcommunity@tencent.com 删除

hive