HAWQ vs. Hive Query Performance Comparison

I. Purpose

        This experiment simulates a typical application scenario with a realistic data volume to test and compare the query performance of HAWQ internal tables, HAWQ external tables, and Hive.

II. Hardware Environment

1. A Hadoop cluster of four VMware virtual machines.

2. Each machine is configured as follows:

(1) 100 GB 15K RPM SAS disk

(2) Intel(R) Xeon(R) E5-2620 v2 @ 2.10GHz, two dual-core CPUs

(3) 8 GB RAM, 8 GB swap

(4) 10,000 Mb/s virtual NIC

III. Software Environment

1. Linux: CentOS release 6.4, kernel 2.6.32-358.el6.x86_64

2. Ambari: 2.4.1

3. Hadoop: HDP 2.5.0

4. Hive (Hive on Tez): 2.1.0

5. HAWQ: 2.1.1.0

6. HAWQ PXF: 3.1.1

IV. Data Model

1. Table structure

        The experiment simulates an application that records page-click data. The data model contains five dimension tables (date, page, browser, referrer, and status) and one page-click fact table. The table structures and relationships are shown in Figure 1.

Figure 1

2. Row counts

        The row count of each table is shown in Table 1.

Table name          Rows
page_click_fact     100,000,000
page_dim            200,000
referrer_dim        1,000,000
browser_dim         20,000
status_code_dim     70
date_dim            366

Table 1

V. Create Tables and Generate Data

1. Create the Hive database and tables

create database test;
use test;

create table browser_dim(
  browser_sk bigint, 
  browser_nm varchar(100), 
  browser_version_no varchar(100), 
  flash_version_no varchar(100), 
  flash_enabled_flg int, 
  java_version_no varchar(100), 
  platform_desc string, 
  java_enabled_flg int, 
  java_script_enabled_flg int, 
  cookies_enabled_flg int, 
  user_language_cd varchar(100), 
  screen_color_depth_no varchar(100), 
  screen_size_txt string)
row format delimited 
  fields terminated by ',' 
stored as orc; 

create table date_dim(
  cal_dt date, 
  day_in_cal_yr_no int, 
  day_of_week_no int, 
  start_of_month_dt date, 
  start_of_quarter_dt date, 
  start_of_week_dt date, 
  start_of_year_dt date)
row format delimited 
  fields terminated by ',' 
stored as orc;

create table page_dim(
  page_sk bigint, 
  domain_nm varchar(200), 
  reachability_cd string, 
  page_desc string, 
  protocol_nm varchar(20))
row format delimited 
  fields terminated by ',' 
stored as orc;
 
create table referrer_dim(
  referrer_sk bigint, 
  referrer_txt string, 
  referrer_domain_nm varchar(200))
row format delimited 
  fields terminated by ',' 
stored as orc;

create table status_code_dim(
  status_cd varchar(100), 
  client_error_flg int, 
  status_cd_desc string, 
  server_error_flg int)
row format delimited 
  fields terminated by ',' 
stored as orc;

create table page_click_fact(
  visitor_id varchar(100), 
  detail_tm timestamp, 
  page_click_dt date, 
  page_sk bigint, 
  client_session_dt date, 
  previous_page_sk bigint, 
  referrer_sk bigint, 
  next_page_sk bigint, 
  status_cd varchar(100), 
  browser_sk bigint, 
  bytes_received_cnt bigint, 
  bytes_sent_cnt bigint, 
  client_detail_tm timestamp, 
  entry_point_flg int, 
  exit_point_flg int, 
  ip_address varchar(20), 
  query_string_txt string, 
  seconds_spent_on_page_cnt int, 
  sequence_no int, 
  requested_file_txt string)
row format delimited 
  fields terminated by ',' 
stored as orc;

        Note: The Hive tables use the ORC storage format.
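        The DDL above does not choose an ORC compression codec, so Hive's default (ZLIB) presumably applies. If a specific codec were wanted, it could be set per table through table properties; a minimal hypothetical sketch (the _snappy table is not part of this test):

-- Hypothetical variant: ORC with an explicitly chosen compression codec.
-- "orc.compress" accepts NONE, ZLIB, or SNAPPY; ZLIB is the Hive default.
create table page_click_fact_snappy
stored as orc tblproperties ("orc.compress"="SNAPPY")
as select * from page_click_fact;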

2. Generate the Hive table data with a Java program

        After ORC compression, the HDFS files of the tables have the following sizes:

2.2 M   /apps/hive/warehouse/test.db/browser_dim
641     /apps/hive/warehouse/test.db/date_dim
4.1 G   /apps/hive/warehouse/test.db/page_click_fact
16.1 M  /apps/hive/warehouse/test.db/page_dim
22.0 M  /apps/hive/warehouse/test.db/referrer_dim
1.1 K   /apps/hive/warehouse/test.db/status_code_dim

3. Analyze the Hive tables

analyze table date_dim compute statistics;
analyze table browser_dim compute statistics;
analyze table page_dim compute statistics;
analyze table referrer_dim compute statistics;
analyze table status_code_dim compute statistics;
analyze table page_click_fact compute statistics;
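        These statements collect table-level statistics (row counts and sizes). If Hive's cost-based optimizer should also see column-level statistics, they can be gathered separately; a minimal sketch of an extra step assumed here, not part of the original test:

-- Optionally collect column-level statistics for the cost-based optimizer.
analyze table page_click_fact compute statistics for columns;
analyze table page_dim compute statistics for columns;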

4. Create the HAWQ external tables

create schema ext;
set search_path=ext;

create external table date_dim(
cal_dt              date,                                    
day_in_cal_yr_no    int4,                                     
day_of_week_no      int4,                                     
start_of_month_dt   date,                                    
start_of_quarter_dt date,                                    
start_of_week_dt    date,                                    
start_of_year_dt    date  
)
location ('pxf://hdp1:51200/test.date_dim?profile=hiveorc')  
format 'custom' (formatter='pxfwritable_import'); 

create external table browser_dim(
browser_sk              int8,                                  
browser_nm              varchar(100),                            
browser_version_no      varchar(100),                            
flash_version_no        varchar(100),                            
flash_enabled_flg       int,                                     
java_version_no         varchar(100),                            
platform_desc           text,                                  
java_enabled_flg        int,    
java_script_enabled_flg int,    
cookies_enabled_flg     int,                                     
user_language_cd        varchar(100),         
screen_color_depth_no   varchar(100),            
screen_size_txt         text   
)  
location ('pxf://hdp1:51200/test.browser_dim?profile=hiveorc')  
format 'custom' (formatter='pxfwritable_import'); 

create external table page_dim(
page_sk             int8,                                  
domain_nm           varchar(200),                            
reachability_cd     text,                           
page_desc           text,                                  
protocol_nm         varchar(20) 
) 
location ('pxf://hdp1:51200/test.page_dim?profile=hiveorc')  
format 'custom' (formatter='pxfwritable_import'); 

create external table referrer_dim(
referrer_sk         int8,                 
referrer_txt        text,                
referrer_domain_nm  varchar(200) 
)
location ('pxf://hdp1:51200/test.referrer_dim?profile=hiveorc')  
format 'custom' (formatter='pxfwritable_import'); 

create external table status_code_dim(
status_cd           varchar(100),                            
client_error_flg    int4,                                     
status_cd_desc      text,                                  
server_error_flg    int4
) 
location ('pxf://hdp1:51200/test.status_code_dim?profile=hiveorc')  
format 'custom' (formatter='pxfwritable_import'); 

create external table page_click_fact(
visitor_id                varchar(100),                            
detail_tm                 timestamp,                               
page_click_dt             date,                                    
page_sk                   int8,                                  
client_session_dt         date, 
previous_page_sk          int8,                                  
referrer_sk               int8,                                  
next_page_sk              int8,                                  
status_cd                 varchar(100),                            
browser_sk                int8,                                  
bytes_received_cnt        int8,                                  
bytes_sent_cnt            int8,                                  
client_detail_tm          timestamp,                               
entry_point_flg           int4,                                     
exit_point_flg            int4,                                     
ip_address                varchar(20),                             
query_string_txt          text,                                  
seconds_spent_on_page_cnt int4,                                     
sequence_no               int4,                                     
requested_file_txt        text    
) 
location ('pxf://hdp1:51200/test.page_click_fact?profile=hiveorc')  
format 'custom' (formatter='pxfwritable_import');

        Note: The HAWQ external tables use the PXF protocol and point to the corresponding Hive tables.
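        As a quick sanity check (assumed here, not described in the original test), each external table can be probed with a simple count to confirm that PXF reaches the Hive ORC data:

-- Verify that PXF can read the underlying Hive table.
select count(*) from ext.page_click_fact;  -- expected: 100,000,000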

5. Create the HAWQ internal tables

set search_path=public;

create table date_dim(
cal_dt              date,      
day_in_cal_yr_no    int4,
day_of_week_no      int4,
start_of_month_dt   date, 
start_of_quarter_dt date,     
start_of_week_dt    date,         
start_of_year_dt    date) with (compresstype=snappy,appendonly=true); 

create table browser_dim(
browser_sk              int8,                                  
browser_nm              varchar(100),                            
browser_version_no      varchar(100),                            
flash_version_no        varchar(100),                            
flash_enabled_flg       int,                                     
java_version_no         varchar(100),                            
platform_desc           text,                                  
java_enabled_flg        int,    
java_script_enabled_flg int,    
cookies_enabled_flg     int,                                     
user_language_cd        varchar(100),         
screen_color_depth_no   varchar(100),            
screen_size_txt         text   
) with (compresstype=snappy,appendonly=true); 

create table page_dim(
page_sk             int8,                                  
domain_nm           varchar(200),                            
reachability_cd     text,                           
page_desc           text,                                  
protocol_nm         varchar(20) 
) with (compresstype=snappy,appendonly=true); 

create table referrer_dim(
referrer_sk         int8,                 
referrer_txt        text,                
referrer_domain_nm  varchar(200) 
) with (compresstype=snappy,appendonly=true); 

create table status_code_dim(
status_cd           varchar(100),                            
client_error_flg    int4,                                     
status_cd_desc      text,                                  
server_error_flg    int4
) with (compresstype=snappy,appendonly=true); 

create table page_click_fact(
visitor_id                varchar(100),                            
detail_tm                 timestamp,                               
page_click_dt             date,                                    
page_sk                   int8,                                  
client_session_dt         date, 
previous_page_sk          int8,                                  
referrer_sk               int8,                                  
next_page_sk              int8,                                  
status_cd                 varchar(100),                            
browser_sk                int8,                                  
bytes_received_cnt        int8,                                  
bytes_sent_cnt            int8,                                  
client_detail_tm          timestamp,                               
entry_point_flg           int4,                                     
exit_point_flg            int4,                                     
ip_address                varchar(20),                             
query_string_txt          text,                                  
seconds_spent_on_page_cnt int4,                                     
sequence_no               int4,                                     
requested_file_txt        text    
) with (compresstype=snappy,appendonly=true);

        Note: The internal tables are defined equivalently to the Hive tables and use a snappy-compressed, append-only row storage format.
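        No distributed by clause is given, so the tables presumably fall back to HAWQ's default random distribution. Two hypothetical variants of the fact table, neither used in this test, sketch the main storage alternatives: hash distribution on a join key, and column-oriented Parquet storage.

-- Hypothetical: a copy of the fact table hash-distributed on a join key ...
create table page_click_fact_hash
with (compresstype=snappy, appendonly=true)
as select * from page_click_fact
distributed by (page_sk);

-- ... or a column-oriented Parquet copy instead of row-oriented AO storage.
create table page_click_fact_parquet
with (appendonly=true, orientation=parquet, compresstype=snappy)
as select * from page_click_fact;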

6. Generate the HAWQ internal table data

insert into date_dim select * from hcatalog.test.date_dim;
insert into browser_dim select * from hcatalog.test.browser_dim; 
insert into page_dim select * from hcatalog.test.page_dim; 
insert into referrer_dim select * from hcatalog.test.referrer_dim; 
insert into status_code_dim select * from hcatalog.test.status_code_dim; 
insert into page_click_fact select * from hcatalog.test.page_click_fact;

        Note: The Hive tables are queried directly through HCatalog and the results are inserted into the HAWQ internal tables. After snappy compression, the HDFS files of the tables have the following sizes:

6.2 K   /hawq_data/16385/177422/177677
3.3 M   /hawq_data/16385/177422/177682
23.9 M  /hawq_data/16385/177422/177687
39.3 M  /hawq_data/16385/177422/177707
1.8 K   /hawq_data/16385/177422/177726
7.9 G   /hawq_data/16385/177422/177731
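        To confirm the loads are complete, the row count of each internal table can be compared with its HCatalog-exposed Hive source (a sanity check assumed here):

-- Row counts should match between the HAWQ copy and the Hive source.
select count(*) from page_click_fact;                -- HAWQ internal table
select count(*) from hcatalog.test.page_click_fact;  -- Hive table via HCatalog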

7. Analyze the HAWQ internal tables

analyze date_dim;
analyze browser_dim;
analyze page_dim;
analyze referrer_dim;
analyze status_code_dim;
analyze page_click_fact;
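        Whether the statistics landed can be spot-checked in the catalog: after analyze, reltuples in pg_class should roughly match the row counts in Table 1 (a verification step assumed here):

-- Spot-check collected statistics; reltuples is the planner's row estimate.
select relname, reltuples
  from pg_class
 where relname in ('page_click_fact', 'page_dim', 'referrer_dim',
                   'browser_dim', 'status_code_dim', 'date_dim');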

VI. Execute the Queries

        The following five queries are run against the Hive tables, the HAWQ external tables, and the HAWQ internal tables, and the execution times are recorded.
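        The article does not state how the times were captured; a plausible setup is psql's \timing switch on the HAWQ side (the Hive CLI prints elapsed time on its own):

-- In psql, report the wall-clock time of every statement.
\timing on
-- Each query is then run three times and the reported times are averaged.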

1. The most-visited directories on support.sas.com in a given week

-- Hive query
select top_directory, count(*) as unique_visits     
  from (select distinct visitor_id, substr(requested_file_txt,1,10) top_directory  
          from page_click_fact, page_dim, browser_dim  
         where domain_nm = 'support.sas.com'   
           and flash_enabled_flg=1   
           and weekofyear(detail_tm) = 19   
           and year(detail_tm) = 2017  
       ) directory_summary  
 group by top_directory  
 order by unique_visits;  

-- HAWQ query; equivalent to the Hive query, with extract() replacing Hive's weekofyear and year functions.
select top_directory, count(*) as unique_visits     
  from (select distinct visitor_id, substr(requested_file_txt,1,10) top_directory  
          from page_click_fact, page_dim, browser_dim  
         where domain_nm = 'support.sas.com'   
           and flash_enabled_flg=1   
           and extract(week from detail_tm) = 19   
           and extract(year from detail_tm) = 2017  
       ) directory_summary  
 group by top_directory  
 order by unique_visits;

2. Pages visited from www.google.com, by month

-- Hive query
select domain_nm, requested_file_txt, count(*) as unique_visitors, month  
  from (select distinct domain_nm, requested_file_txt, visitor_id, month(detail_tm) as month  
          from page_click_fact, page_dim, referrer_dim   
         where domain_nm = 'support.sas.com'   
           and referrer_domain_nm = 'www.google.com'  
       ) visits_pp_ph_summary  
 group by domain_nm, requested_file_txt, month  
 order by domain_nm, requested_file_txt, unique_visitors desc, month asc;  

-- HAWQ query; equivalent to the Hive query, with extract() replacing Hive's month function.
select domain_nm, requested_file_txt, count(*) as unique_visitors, month  
  from (select distinct domain_nm, requested_file_txt, visitor_id, extract(month from detail_tm) as month  
          from page_click_fact, page_dim, referrer_dim   
         where domain_nm = 'support.sas.com'   
           and referrer_domain_nm = 'www.google.com'  
       ) visits_pp_ph_summary  
 group by domain_nm, requested_file_txt, month  
 order by domain_nm, requested_file_txt, unique_visitors desc, month asc;

3. Counts of search strings on support.sas.com in a given year

-- Hive query
select query_string_txt, count(*) as count  
  from page_click_fact, page_dim  
 where query_string_txt <> ''   
   and domain_nm='support.sas.com'   
   and year(detail_tm) = '2017'  
 group by query_string_txt  
 order by count desc;
 
-- HAWQ query; equivalent to the Hive query, with extract() replacing Hive's year function.
select query_string_txt, count(*) as count  
  from page_click_fact, page_dim  
 where query_string_txt <> ''   
   and domain_nm='support.sas.com'   
   and extract(year from detail_tm) = '2017'  
 group by query_string_txt  
 order by count desc;

4. The number of visitors to each page using the Safari browser

-- Hive query
select domain_nm, requested_file_txt, count(*) as unique_visitors  
  from (select distinct domain_nm, requested_file_txt, visitor_id  
          from page_click_fact, page_dim, browser_dim  
         where domain_nm='support.sas.com'   
           and browser_nm like '%Safari%'   
           and weekofyear(detail_tm) = 19   
           and year(detail_tm) = 2017  
       ) uv_summary  
 group by domain_nm, requested_file_txt  
 order by unique_visitors desc;  
 
-- HAWQ query; equivalent to the Hive query, with extract() replacing Hive's weekofyear and year functions.
select domain_nm, requested_file_txt, count(*) as unique_visitors  
  from (select distinct domain_nm, requested_file_txt, visitor_id  
          from page_click_fact, page_dim, browser_dim  
         where domain_nm='support.sas.com'   
           and browser_nm like '%Safari%'   
           and extract(week from detail_tm) = 19   
           and extract(year from detail_tm) = 2017  
       ) uv_summary  
 group by domain_nm, requested_file_txt  
 order by unique_visitors desc;

5. Pages on support.sas.com viewed for more than 10 seconds in a given week

-- Hive query
select domain_nm, requested_file_txt, count(*) as unique_visits  
  from (select distinct domain_nm, requested_file_txt, visitor_id  
          from page_click_fact, page_dim  
         where domain_nm='support.sas.com'   
           and weekofyear(detail_tm) = 19   
           and year(detail_tm) = 2017   
           and seconds_spent_on_page_cnt > 10 
       ) visits_summary  
 group by domain_nm, requested_file_txt  
 order by unique_visits desc;  

-- HAWQ query; equivalent to the Hive query, with extract() replacing Hive's weekofyear and year functions.
select domain_nm, requested_file_txt, count(*) as unique_visits  
  from (select distinct domain_nm, requested_file_txt, visitor_id  
          from page_click_fact, page_dim  
         where domain_nm='support.sas.com'   
           and extract(week from detail_tm) = 19   
           and extract(year from detail_tm) = 2017   
           and seconds_spent_on_page_cnt > 10 
       ) visits_summary  
 group by domain_nm, requested_file_txt  
 order by unique_visits desc;

VII. Test Results

        Table 2 compares the query times of Hive, the HAWQ external tables, and the HAWQ internal tables. Each case was executed three times and the average is reported.

Query    Hive (s)    HAWQ external table (s)    HAWQ internal table (s)
1        74.337      304.134                    19.232
2        169.521     150.882                    3.446
3        73.482      101.216                    18.565
4        66.367      359.778                    1.217
5        60.341      118.329                    2.789

Table 2

        As the comparison in Figure 2 shows, the HAWQ internal tables are much faster than Hive on Tez (roughly 4 to 50 times). The same queries, however, run very slowly against the HAWQ external tables over Hive. Analytical queries are therefore best run against HAWQ internal tables. Where external tables are unavoidable, keep the external data volume as small as possible to get acceptable performance, and keep the queries simple, avoiding expensive operations such as aggregation, grouping, and sorting on external tables.

Figure 2
