前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >Impala TPC-DS基准测试

Impala TPC-DS基准测试

作者头像
Fayson
发布2018-07-12 15:12:57
2K0
发布2018-07-12 15:12:57
举报
文章被收录于专栏:Hadoop实操Hadoop实操

温馨提示:要看高清无码套图,请使用手机打开并单击图片放大查看。

Fayson的github:https://github.com/fayson/cdhproject

提示:代码块部分可以左右滑动查看噢

1.文档编写目的


在前面Fayson介绍了《如何编译及使用hive-testbench生成Hive基准测试数据》,在生成的Hive基准测试数据的基础上,如何进行Impala的TPC-DS基准测试,本篇文章主要介绍如何准备Impala基准测试数据及使用99条SQL对Impala进行基准测试。

  • 内容概述

1.环境准备

2.Impala基准测试数据准备

3.TPC-DS测试

  • 测试环境

1.RedHat7.3

2.CM和CDH版本为5.13.1

  • 前置条件

1.已使用hive-testbench生成好Hive的基准测试数据

2.环境准备


1.在《如何编译及使用hive-testbench生成Hive基准测试数据》文章中Fayson已生成好了2GB的测试数据,并创建好了Hive的外部表

2.登录Hue查看tpcds_text_2库下表

以上的环境准备具体可以参考Fayson前面的文章《如何编译及使用hive-testbench生成Hive基准测试数据

3.准备Impala基准测试数据


我们基于hive-testbench生成的2GB的Hive基准测试数据来生成Impala Parquet格式的表,并对标执行统计分析操作,具体操作步骤及脚本如下:

1.创建SQL脚本alltables_parquet.sql用于生成parquet格式的表,内容如下

代码语言:javascript
复制
drop database if exists ${VAR:DB} cascade;
create database ${VAR:DB};
use ${VAR:DB};
set parquet_file_size=512M;
set COMPRESSION_CODEC=snappy;
drop table if exists call_center;
create table ${VAR:DB}.call_center
stored as parquet
as select * from ${VAR:HIVE_DB}.call_center;
drop table if exists catalog_page;
create table ${VAR:DB}.catalog_page
stored as parquet
as select * from ${VAR:HIVE_DB}.catalog_page;
drop table if exists catalog_returns;
create table ${VAR:DB}.catalog_returns
stored as parquet
as select * from ${VAR:HIVE_DB}.catalog_returns;
drop table if exists catalog_sales;
create table ${VAR:DB}.catalog_sales
stored as parquet
as select * from ${VAR:HIVE_DB}.catalog_sales;
drop table if exists customer_address;
create table ${VAR:DB}.customer_address
stored as parquet
as select * from ${VAR:HIVE_DB}.customer_address;
drop table if exists customer_demographics;
create table ${VAR:DB}.customer_demographics
stored as parquet
as select * from ${VAR:HIVE_DB}.customer_demographics;
drop table if exists customer;
create table ${VAR:DB}.customer
stored as parquet
as select * from ${VAR:HIVE_DB}.customer;
drop table if exists date_dim;
create table ${VAR:DB}.date_dim
stored as parquet
as select * from ${VAR:HIVE_DB}.date_dim;
drop table if exists household_demographics;
create table ${VAR:DB}.household_demographics
stored as parquet
as select * from ${VAR:HIVE_DB}.household_demographics;
drop table if exists income_band;
create table ${VAR:DB}.income_band
stored as parquet
as select * from ${VAR:HIVE_DB}.income_band;
drop table if exists inventory;
create table ${VAR:DB}.inventory
stored as parquet
as select * from ${VAR:HIVE_DB}.inventory;
drop table if exists item;
create table ${VAR:DB}.item
stored as parquet
as select * from ${VAR:HIVE_DB}.item;
drop table if exists promotion;
create table ${VAR:DB}.promotion
stored as parquet
as select * from ${VAR:HIVE_DB}.promotion;
drop table if exists reason;
create table ${VAR:DB}.reason
stored as parquet
as select * from ${VAR:HIVE_DB}.reason;
drop table if exists ship_mode;
create table ${VAR:DB}.ship_mode
stored as parquet
as select * from ${VAR:HIVE_DB}.ship_mode;
drop table if exists store_returns;
create table ${VAR:DB}.store_returns
stored as parquet
as select * from ${VAR:HIVE_DB}.store_returns;
drop table if exists store_sales;
create table ${VAR:DB}.store_sales
stored as parquet
as select * from ${VAR:HIVE_DB}.store_sales;
drop table if exists store;
create table ${VAR:DB}.store
stored as parquet
as select * from ${VAR:HIVE_DB}.store;
drop table if exists time_dim;
create table ${VAR:DB}.time_dim
stored as parquet
as select * from ${VAR:HIVE_DB}.time_dim;
drop table if exists warehouse;
create table ${VAR:DB}.warehouse
stored as parquet
as select * from ${VAR:HIVE_DB}.warehouse;
drop table if exists web_page;
create table ${VAR:DB}.web_page
stored as parquet
as select * from ${VAR:HIVE_DB}.web_page;
drop table if exists web_returns;
create table ${VAR:DB}.web_returns
stored as parquet
as select * from ${VAR:HIVE_DB}.web_returns;
drop table if exists web_sales;
create table ${VAR:DB}.web_sales
stored as parquet
as select * from ${VAR:HIVE_DB}.web_sales;
drop table if exists web_site;
create table ${VAR:DB}.web_site
stored as parquet
as select * from ${VAR:HIVE_DB}.web_site;

(可左右滑动)

注意:在脚本中使用了${VAR:variable_name}动态传参的方式指定Hive数据库及Impala的数据库。

2.在Impala Daemon节点执行如下命令,生成Impala基准测试数据

代码语言:javascript
复制
[root@ip-172-31-30-69 ~]# impala-shell -i ip-172-31-30-69.ap-southeast-1.compute.internal --var=DB=tpcds_parquet_2 --var=HIVE_DB=tpcds_text_2 -f alltables_parquet.sql

(可左右滑动)

等待命令执行成功,登录Impala查看数据库及对应的表是否创建成功

tpcds_parquet_2的数据库已创建

tpcds_parquet_2库中的表与tpcds_text_2库中表一致,数据量一致。

3.创建analyze.sql,用于统计分析Impala的表

代码语言:javascript
复制
use ${VAR:DB};
compute stats call_center ;
compute stats catalog_page ;
compute stats catalog_returns ;
compute stats catalog_sales ;
compute stats customer_address ;
compute stats customer_demographics ;
compute stats customer ;
compute stats date_dim ;
compute stats household_demographics ;
compute stats income_band ;
compute stats inventory ;
compute stats item ;
compute stats promotion ;
compute stats reason ;
compute stats ship_mode ;
compute stats store_returns ;
compute stats store_sales ;
compute stats store ;
compute stats time_dim ;
compute stats warehouse ;
compute stats web_page ;
compute stats web_returns ;
compute stats web_sales ;
compute stats web_site ;

(可左右滑动)

4.执行如下命令对Impala的表进行统计分析

代码语言:javascript
复制
[root@ip-172-31-30-69 ddl-tpcds]# impala-shell -i ip-172-31-30-69.ap-southeast-1.compute.internal --var=DB=tpcds_parquet_2 -f analyze.sql

(可左右滑动)

至此已完成Impala基准测试的环境准备。

4.TPC-DS测试


1.准备好TPC-DS的99条SQL语句,这里的99条SQL就不贴出来了Fayson会放到GitHub上

2.编写批量运行脚本run_all_queries.sh,将结果输出到日志文件

代码语言:javascript
复制
[root@ip-172-31-30-69 impala-tpcds]# vim run_all_queries.sh 
#!/bin/bash

impala_demon=ip-172-31-30-69.ap-southeast-1.compute.internal
database_name=tpcds_parquet_2
current_path=`pwd`
queries_dir=${current_path}/queries
rm -rf logs
mkdir logs
for t in `ls ${queries_dir}`
do
    echo "current query will be ${queries_dir}/${t}"
    impala-shell --database=$database_name -i $impala_demon -f ${queries_dir}/${t} &>logs/${t}.log
done
echo "all queries execution are finished, please check logs for the result!"

(可左右滑动)

将脚本中impala_daemon和database_name修改为你自己环境的配置即可。

3.脚本执行成功后可以在logs目录下查看执行结果及运行时间

SQL执行结果

4.通过logs目录下的log文件可以查看每条SQL的执行结果和执行时间

5.总结


Fayson这里只是介绍了如何进行TPC-DS来实现Impala的基准测试,以2GB的测试数据为例来说明。

GitHub地址:

https://github.com/fayson/cdhproject/tree/master/impala-tpcds

目录结构说明

init.sh脚本用于初始化impala基准测试表、数据及统计分析。

run_all_queries.sh脚本用于执行99条SQL并输出日志目录

logs目录主要存放执行的SQL结果

ddl-tpcds目录创建Impala基准测试表的SQL文件。

queries目录存放了99条基准测试的SQL。

提示:代码块部分可以左右滑动查看噢

为天地立心,为生民立命,为往圣继绝学,为万世开太平。 温馨提示:要看高清无码套图,请使用手机打开并单击图片放大查看。

推荐关注Hadoop实操,第一时间,分享更多Hadoop干货,欢迎转发和分享。

原创文章,欢迎转载,转载请注明:转载自微信公众号Hadoop实操

本文参与 腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2018-04-19,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 Hadoop实操 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
相关产品与服务
专用宿主机
专用宿主机(CVM Dedicated Host,CDH)提供用户独享的物理服务器资源,满足您资源独享、资源物理隔离、安全、合规需求。专用宿主机搭载了腾讯云虚拟化系统,购买之后,您可在其上灵活创建、管理多个自定义规格的云服务器实例,自主规划物理资源的使用。
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档