前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >Hive语法内关于With as 的数据是否会缓存到内存分析

Hive语法内关于With as 的数据是否会缓存到内存分析

作者头像
挽风
发布2021-12-07 17:42:58
9240
发布2021-12-07 17:42:58
举报
文章被收录于专栏:小道小道

  WITH AS短语,也叫做子查询部分(subquery factoring),可以定义一个SQL片断,该SQL片断可以被后面的SQL语句引用,从而使SQL语句的可读性更高。

  在关系型数据库中,对于UNION ALL,使用WITH AS定义了一个UNION ALL语句,当该片断被调用2次以上,优化器会自动将该WITH AS短语所获取的数据放入一个Temp表中。

  Hive在后面的版本也引入了WITH AS 这个公用表表达式(CTE)语法糖,但是对于后面语句的多次引用是否会继续将该WITH AS短语所获取的数据放入一个Temp表中呢?下面将通过对SQL的执行计划进行分析得到结论。

代码语言:javascript
复制
explain 
with top10 as 
(
select
	a.member_id,
	a.store_id,
	a.store_name,
	count(a.order_id) freq
from
	ads_member_label_p_ordertop10 a 
where
	dt = '20210614'
group by
	a.member_id,
	a.store_id,
	a.store_name
)
select
	*
from
	top10
union all
select
	*
from
	top10

  通过执行上述HQL可以得到如下执行计划:

代码语言:javascript
复制
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1, Stage-3
  Stage-3 is a root stage
  Stage-0 depends on stages: Stage-2

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            filterExpr: (dt = '20210614') (type: boolean)
            Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: member_id (type: string), order_id (type: string), store_id (type: string), store_name (type: string)
              outputColumnNames: member_id, order_id, store_id, store_name
              Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: count(order_id)
                keys: member_id (type: string), store_id (type: string), store_name (type: string)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string)
                  sort order: +++
                  Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string)
                  Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col3 (type: bigint)
      Execution mode: vectorized
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3
          Statistics: Num rows: 12321230 Data size: 8341066198 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            Union
              Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
          TableScan
            Union
              Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            filterExpr: (dt = '20210614') (type: boolean)
            Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: member_id (type: string), order_id (type: string), store_id (type: string), store_name (type: string)
              outputColumnNames: member_id, order_id, store_id, store_name
              Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: count(order_id)
                keys: member_id (type: string), store_id (type: string), store_name (type: string)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string)
                  sort order: +++
                  Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string)
                  Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col3 (type: bigint)
      Execution mode: vectorized
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3
          Statistics: Num rows: 12321230 Data size: 8341066198 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
代码语言:javascript
复制
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1, Stage-3
  Stage-3 is a root stage
  Stage-0 depends on stages: Stage-2

  通过对执行计划进行分析可以得知Stage-1和Stage-3对表ads_member_label_p_ordertop10分别执行了两次WITH AS内逻辑,由STAGE DEPENDENCIES得知而且在Stage-2依赖于Stage-1和Stage-3,且Stage-2中UNOIN操作对表的扫描操作(TableScan)并未进行数据过滤,Stage1-3影响的数据行数均为24642460行。

  所以在Hive内 WITH AS语法默认并不会将执行结果作为临时表存入内存,而是每一次引用都会执行一次WITH AS内的计算逻辑,而MySQL和SqlServer这种关系型数据库执行WITH AS时会将计算结果作为临时表。

  在hive中有一个参数

代码语言:javascript
复制
	hive.optimize.cte.materialize.threshold

  这个参数在默认情况下是-1(关闭的);当开启(大于0),比如设置为2,则如果with…as语句被引用2次及以上时,会把with…as语句生成的table物化,从而做到with…as语句只执行一次,来提高效率。

  将hive.optimize.cte.materialize.threshold设置为1,可以得到如下执行计划

代码语言:javascript
复制
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
  Stage-3 depends on stages: Stage-0
  Stage-2 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            filterExpr: (dt = '20210620') (type: boolean)
            Statistics: Num rows: 32891415 Data size: 22266423831 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: member_id (type: string), order_id (type: string), store_id (type: string), store_name (type: string)
              outputColumnNames: member_id, order_id, store_id, store_name
              Statistics: Num rows: 32891415 Data size: 22266423831 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: count(order_id)
                keys: member_id (type: string), store_id (type: string), store_name (type: string)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 32891415 Data size: 22266423831 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string)
                  sort order: +++
                  Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string)
                  Statistics: Num rows: 32891415 Data size: 22266423831 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col3 (type: bigint)
      Execution mode: vectorized
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3
          Statistics: Num rows: 16445707 Data size: 11133211577 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 16445707 Data size: 11133211577 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                name: dv_report.top10

  Stage: Stage-0
    Move Operator
      files:
          hdfs directory: true
          destination: hdfs://iphost/tmp/hive/hive/16166a7f-d924-442a-9763-24b59ce050cb/_tmp_space.db/249708d7-bbc6-491b-b871-6b11385798d0

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: top10
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: member_id (type: string), store_id (type: string), store_name (type: string), freq (type: bigint)
              outputColumnNames: _col0, _col1, _col2, _col3
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Union
                Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
          TableScan
            alias: top10
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: member_id (type: string), store_id (type: string), store_name (type: string), freq (type: bigint)
              outputColumnNames: _col0, _col1, _col2, _col3
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Union
                Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-2
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

  由执行计划可知,在设置参数之后计算结果作为一张临时表存储在hdfs://iphost/tmp/hive/hive/16166a7f-d924-442a-9763-24b59ce050cb/_tmp_space.db/249708d7-bbc6-491b-b871-6b11385798d0路径下,即通过Stage-0阶段将计算结果存入一张临时表内。

本文参与 腾讯云自媒体分享计划,分享自作者个人站点/博客。
原始发表:2021-06-21 ,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 作者个人站点/博客 前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
相关产品与服务
数据库
云数据库为企业提供了完善的关系型数据库、非关系型数据库、分析型数据库和数据库生态工具。您可以通过产品选择和组合搭建,轻松实现高可靠、高可用性、高性能等数据库需求。云数据库服务也可大幅减少您的运维工作量,更专注于业务发展,让企业一站式享受数据上云及分布式架构的技术红利!
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档