TPC-H 测试方法

最近更新时间:2024-08-22 15:30:51

我的收藏
TDSQL-C MySQL 版的只读分析引擎主要针对复杂的 SQL 和数据分析而设计。对于此类场景的性能评估手段在业界常用 TPC-H 的基准测试集来进行。因此,本文主要基于 TPC-H 基准来测试只读分析引擎的性能情况,以供用户参考。

前提条件

准备好实例形态为预置资源的集群,请参见 创建集群
集群状态为运行中,并创建只读分析引擎实例。
准备一个数据库账号,用于数据查询,请参见 创建账号
为集群中的只读分析引擎实例设置一个合理的实例规格。
说明:
测试结果与实例计算规格的大小成正比。在数据分析场景中,想要性能好,则需要更多的资源支撑,可根据实际的情况选用对应的实例规格。

构建测试表

TPC-H 测试集中包含有八张数据表与一个视图,请根据以下创建语句,将表创建在 TDSQL-C MySQL 版集群中。
create database tpch;
use tpch;

drop table if exists customer;
create table `customer` (
`c_custkey` bigint(20) not null,
`c_name` varchar(25) not null,
`c_address` varchar(40) not null,
`c_nationkey` bigint(20) not null,
`c_phone` char(15) not null,
`c_acctbal` decimal(15,2) not null,
`c_mktsegment` char(10) not null,
`c_comment` varchar(117) not null,
primary key (`c_custkey`)
) engine=innodb default charset=utf8mb4 collate=utf8mb4_bin;

drop table if exists lineitem;
create table `lineitem` (
`l_orderkey` bigint(20) not null,
`l_partkey` bigint(20) not null,
`l_suppkey` bigint(20) not null,
`l_linenumber` bigint(20) not null,
`l_quantity` decimal(15,2) not null,
`l_extendedprice` decimal(15,2) not null,
`l_discount` decimal(15,2) not null,
`l_tax` decimal(15,2) not null,
`l_returnflag` char(1) not null,
`l_linestatus` char(1) not null,
`l_shipdate` date not null,
`l_commitdate` date not null,
`l_receiptdate` date not null,
`l_shipinstruct` char(25) not null,
`l_shipmode` char(10) not null,
`l_comment` varchar(44) not null,
primary key (`l_orderkey`,`l_linenumber`)
) engine=innodb default charset=utf8mb4 collate=utf8mb4_bin;

drop table if exists nation;
create table `nation` (
`n_nationkey` bigint(20) not null,
`n_name` char(25) not null,
`n_regionkey` bigint(20) not null,
`n_comment` varchar(152) default null,
primary key (`n_nationkey`)
) engine=innodb default charset=utf8mb4 collate=utf8mb4_bin;

drop table if exists orders;
create table `orders` (
`o_orderkey` bigint(20) not null,
`o_custkey` bigint(20) not null,
`o_orderstatus` char(1) not null,
`o_totalprice` decimal(15,2) not null,
`o_orderdate` date not null,
`o_orderpriority` char(15) not null,
`o_clerk` char(15) not null,
`o_shippriority` bigint(20) not null,
`o_comment` varchar(79) not null,
primary key (`o_orderkey`)
) engine=innodb default charset=utf8mb4 collate=utf8mb4_bin;

drop table if exists part;
create table `part` (
`p_partkey` bigint(20) not null,
`p_name` varchar(55) not null,
`p_mfgr` char(25) not null,
`p_brand` char(10) not null,
`p_type` varchar(25) not null,
`p_size` bigint(20) not null,
`p_container` char(10) not null,
`p_retailprice` decimal(15,2) not null,
`p_comment` varchar(23) not null,
primary key (`p_partkey`)
) engine=innodb default charset=utf8mb4 collate=utf8mb4_bin;

drop table if exists partsupp;
create table `partsupp` (
`ps_partkey` bigint(20) not null,
`ps_suppkey` bigint(20) not null,
`ps_availqty` bigint(20) not null,
`ps_supplycost` decimal(15,2) not null,
`ps_comment` varchar(199) not null,
primary key (`ps_partkey`,`ps_suppkey`)
) engine=innodb default charset=utf8mb4 collate=utf8mb4_bin;

drop table if exists region;
create table `region` (
`r_regionkey` bigint(20) not null,
`r_name` char(25) not null,
`r_comment` varchar(152) default null,
primary key (`r_regionkey`)
) engine=innodb default charset=utf8mb4 collate=utf8mb4_bin;

drop table if exists supplier;
create table `supplier` (
`s_suppkey` bigint(20) not null,
`s_name` char(25) not null,
`s_address` varchar(40) not null,
`s_nationkey` bigint(20) not null,
`s_phone` char(15) not null,
`s_acctbal` decimal(15,2) not null,
`s_comment` varchar(101) not null,
primary key (`s_suppkey`)
) engine=innodb default charset=utf8mb4 collate=utf8mb4_bin;

## 视图
create view revenue0 (supplier_no, total_revenue) as
select l_suppkey, sum(l_extendedprice * (1 - l_discount))
from
lineitem
where
l_shipdate >= date '1995-02-01'
and l_shipdate < date '1995-02-01' + interval '3' month
group by
l_suppkey;

生成测试数据

TPC-H 测试工具可进入到 TPC 官方网站 中进行下载,下载到服务器后执行 make 将 dbgen 进行编译。
cd TPC-H_Tools_v3.0.0
make
编译完成后,即可生成 dbgen 工具,执行如下命令。
scale=100
chunk=10
for i in `seq 1 $chunk`
do
./dbgen -s $scale -C $chunk -S $i -f
done
上述命令中的 scale 代表生成的数据大小,如填写100,则代表生成100G数据,chunk 指生成的数据文件被拆分为多少份。这两个参数可根据实际情况进行调整。

导入数据

下面介绍如何将 TPC-H 测试数据分别导入 TDSQL-C MySQL 版中。下表列出了 TPC-H 100G 场景测试数据集中的表数据条数。
表名
数据条数
customer
15000000
lineitem
600037902
nation
25
orders
150000000
part
20000000
partsupp
80000000
region
5
supplier
1000000
通过 TPC-H 的 dbgen 工具生成的表文件存放于工具所在的目录,文件名以 tbl 后缀结尾。若在生成数据时,进行了分割,则会以数字结尾。
ls *.tbl
customer.tbl lineitem.tbl nation.tbl orders.tbl partsupp.tbl part.tbl region.tbl supplier.tbl
对于 TDSQL-C MySQL 版集群,我们可以通过 load data 工具将数据导入。首先进入存放表数据的目录,然后使用 MySQL 客户端登录至 TDSQL-C MySQL 版集群中执行如下命令。
注意:
下列命令是使用一条语句将测试数据插入到集群中。如果测试数据较大,这对客户端的资源消耗也较多。因为 load data 的逻辑是将数据加载到内存中再插入到数据库。如果单一文件过大则会出现客户端所在的服务器内存占满。建议使用 dbgen 工具,在生成数据的时候将数据文件拆分为多个文件,再通过多次执行 load data 将数据导入。
cd /data/tpchdata
mysql -uroot -h172.16.0.22 -P3306 -p''
mysql> use tpch;

LOAD DATA LOCAL INFILE 'region.tbl' INTO TABLE REGION FIELDS TERMINATED BY '|' LINES TERMINATED BY '\\n';
LOAD DATA LOCAL INFILE 'customer.tbl' INTO TABLE CUSTOMER FIELDS TERMINATED BY '|' LINES TERMINATED BY '\\n';
LOAD DATA LOCAL INFILE 'orders.tbl' INTO TABLE ORDERS FIELDS TERMINATED BY '|' LINES TERMINATED BY '\\n';
LOAD DATA LOCAL INFILE 'lineitem.tbl' INTO TABLE LINEITEM FIELDS TERMINATED BY '|' LINES TERMINATED BY '\\n';
LOAD DATA LOCAL INFILE 'nation.tbl' INTO TABLE NATION FIELDS TERMINATED BY '|' LINES TERMINATED BY '\\n';
LOAD DATA LOCAL INFILE 'partsupp.tbl' INTO TABLE PARTSUPP FIELDS TERMINATED BY '|' LINES TERMINATED BY '\\n';
LOAD DATA LOCAL INFILE 'part.tbl' INTO TABLE PART FIELDS TERMINATED BY '|' LINES TERMINATED BY '\\n';
LOAD DATA LOCAL INFILE 'supplier.tbl' INTO TABLE SUPPLIER FIELDS TERMINATED BY '|' LINES TERMINATED BY '\\n';
完成数据导入到读写实例后,可直接设置 tpch 库整库加载到只读分析引擎中。您可以通过执行命令或者通过控制台查看表加载状态。
select * from information_schema.libra_table_status;
当表的 REPLICATION_STEP 字段状态为 Change Propagation 时,则代表当前数据已经加载完成。
在只读分析引擎中数据完成加载后,即可对表进行统计信息收集。
ANALYZE TABLE customer;
ANALYZE TABLE lineitem;
ANALYZE TABLE nation;
ANALYZE TABLE orders;
ANALYZE TABLE part;
ANALYZE TABLE partsupp;
ANALYZE TABLE region;
ANALYZE TABLE supplier;
收集完统计信息,则可执行 TPC-H 的测试 SQL。TPC-H 的测试 SQL 一共有 22 条。详细的 SQL 文本参考如下。您可以复制 SQL并访问到只读分析引擎中进行粘贴执行:
# Q1
select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price, sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*) as count_order from lineitem where l_shipdate <= date_sub('1998-12-01', interval 108 day) group by l_returnflag, l_linestatus order by l_returnflag, l_linestatus;
# Q2
select /*+ SET_VAR(libra_enable_runtime_filter=1) SET_VAR(libra_enable_cost_based_runtime_filter=0) PX_JOIN_FILTER_ID(0[BLOOM_FILTER],2[BLOOM_FILTER]) */ s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment from part, supplier, partsupp, nation, region where p_partkey = ps_partkey and s_suppkey = ps_suppkey and p_size = 30 and p_type like '%STEEL'and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'ASIA'and ps_supplycost = (select min(ps_supplycost) from partsupp, supplier, nation, region where p_partkey = ps_partkey and s_suppkey = ps_suppkey and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'ASIA') order by s_acctbal desc, n_name, s_name, p_partkey limit 100;
# Q3
select l_orderkey, sum(l_extendedprice * (1 - l_discount)) as revenue, o_orderdate, o_shippriority from customer, orders, lineitem where c_mktsegment = 'AUTOMOBILE' and c_custkey = o_custkey and l_orderkey = o_orderkey and o_orderdate < '1995-03-13' and l_shipdate > '1995-03-13' group by l_orderkey, o_orderdate, o_shippriority order by revenue desc, o_orderdate limit 10;
# Q4
select o_orderpriority, count(*) as order_count from orders where o_orderdate >= '1995-01-01' and o_orderdate < date_add('1995-01-01', interval '3' month) and exists ( select * from lineitem where l_orderkey = o_orderkey and l_commitdate < l_receiptdate ) group by o_orderpriority order by o_orderpriority;
# Q5
select /*+ SET_VAR(libra_enable_runtime_filter=1) SET_VAR(libra_enable_cost_based_runtime_filter=0) PX_JOIN_FILTER_ID(3[BLOOM_FILTER],4[BLOOM_FILTER]) */ n_name, sum(l_extendedprice * (1 - l_discount)) as revenue from customer, orders, lineitem, supplier, nation, region where c_custkey = o_custkey and l_orderkey = o_orderkey and l_suppkey = s_suppkey and c_nationkey = s_nationkey and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'MIDDLE EAST' and o_orderdate >= '1994-01-01' and o_orderdate < date_add('1994-01-01', interval '1' year) group by n_name order by revenue desc;
# Q6
select sum(l_extendedprice * l_discount) as revenue from lineitem where l_shipdate >= '1994-01-01' and l_shipdate < date_add('1994-01-01', interval '1' year) and l_discount between 0.06 - 0.01 and 0.06 + 0.01 and l_quantity < 24;
# Q7
select /*+ SET_VAR(libra_enable_runtime_filter=1) SET_VAR(libra_enable_cost_based_runtime_filter=0) PX_JOIN_FILTER_ID(1[BLOOM_FILTER], 3[BLOOM_FILTER]) */ supp_nation, cust_nation, l_year, sum(volume) as revenue from ( select /*+ leading((n1,supplier),lineitem,((n2,customer),orders)) */ n1.n_name as supp_nation, n2.n_name as cust_nation, extract(year from l_shipdate) as l_year, l_extendedprice * (1 - l_discount) as volume from supplier, lineitem, orders, customer, nation n1, nation n2 where s_suppkey = l_suppkey and o_orderkey = l_orderkey and c_custkey = o_custkey and s_nationkey = n1.n_nationkey and c_nationkey = n2.n_nationkey and ( (n1.n_name = 'JAPAN' and n2.n_name = 'INDIA') or (n1.n_name = 'INDIA' and n2.n_name = 'JAPAN') ) and l_shipdate between '1995-01-01' and '1996-12-31' ) as shipping group by supp_nation, cust_nation, l_year order by supp_nation, cust_nation, l_year;
# Q8
select /*+ SET_VAR(libra_enable_runtime_filter=1) SET_VAR(libra_enable_cost_based_runtime_filter=0) PX_JOIN_FILTER_ID(1[BLOOM_FILTER], 2[BLOOM_FILTER],5[BLOOM_FILTER]) */ o_year, sum(case when nation = 'INDIA' then volume else 0 end) / sum(volume) as mkt_share from ( select /*+ leading(region,n1,customer,orders,(part,lineitem),(supplier,n2)) */ extract(year from o_orderdate) as o_year, l_extendedprice * (1 - l_discount) as volume, n2.n_name as nation from part, supplier, lineitem, orders, customer, nation n1, nation n2, region where p_partkey = l_partkey and s_suppkey = l_suppkey and l_orderkey = o_orderkey and o_custkey = c_custkey and c_nationkey = n1.n_nationkey and n1.n_regionkey = r_regionkey and r_name = 'ASIA' and s_nationkey = n2.n_nationkey and o_orderdate between '1995-01-01' and '1996-12-31' and p_type = 'SMALL PLATED COPPER' ) as all_nations group by o_year order by o_year;
# Q9
select /*+ SET_VAR(libra_enable_runtime_filter=1) SET_VAR(libra_enable_cost_based_runtime_filter=0) PX_JOIN_FILTER_ID(4[BLOOM_FILTER]) JOIN_FILTER_NDV_AMP_RATIO(4:0.1) */ nation, o_year, sum(amount) as sum_profit from ( select /*+ hash_join_probe(orders) */n_name as nation, extract(year from o_orderdate) as o_year, l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount from part, supplier, lineitem, partsupp, orders, nation where s_suppkey = l_suppkey and ps_suppkey = l_suppkey and ps_partkey = l_partkey and p_partkey = l_partkey and o_orderkey = l_orderkey and s_nationkey = n_nationkey and p_name like '%dim%' ) as profit group by nation, o_year order by nation, o_year desc;
# Q10
select /*+ HASH_JOIN_PROBE((nation,customer)), leading(nation, customer,(orders, lineitem)) */ c_custkey, c_name, sum(l_extendedprice * (1 - l_discount)) as revenue, c_acctbal, n_name, c_address, c_phone, c_comment from customer, orders, lineitem, nation where c_custkey = o_custkey and l_orderkey = o_orderkey and o_orderdate >= '1993-08-01' and o_orderdate < date_add('1993-08-01', interval '3' month) and l_returnflag = 'R' and c_nationkey = n_nationkey group by c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment order by revenue desc limit 20;
# Q11
select ps_partkey, sum(ps_supplycost * ps_availqty) as value from partsupp, supplier, nation where ps_suppkey = s_suppkey and s_nationkey = n_nationkey and n_name = 'MOZAMBIQUE' group by ps_partkey having sum(ps_supplycost * ps_availqty) > ( select sum(ps_supplycost * ps_availqty) * 0.0001000000 from partsupp, supplier, nation where ps_suppkey = s_suppkey and s_nationkey = n_nationkey and n_name = 'MOZAMBIQUE' ) order by value desc;
# Q12
select /*+ SET_VAR(libra_enable_runtime_filter=1) SET_VAR(libra_enable_cost_based_runtime_filter=0) PX_JOIN_FILTER_ID(0[BLOOM_FILTER]) */ l_shipmode, sum(case when o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH' then 1 else 0 end) as high_line_count, sum(case when o_orderpriority <> '1-URGENT' and o_orderpriority <> '2-HIGH' then 1 else 0 end) as low_line_count from orders, lineitem where o_orderkey = l_orderkey and l_shipmode in ('RAIL', 'FOB') and l_commitdate < l_receiptdate and l_shipdate < l_commitdate and l_receiptdate >= '1997-01-01' and l_receiptdate < date_add('1997-01-01', interval '1' year) group by l_shipmode order by l_shipmode;
# Q13
select c_count, count(*) as custdist from (select c_custkey, count(o_orderkey) as c_count from customer left outer join orders on c_custkey = o_custkey and o_comment not like '%pending%deposits%'group by c_custkey ) c_orders group by c_count order by custdist desc, c_count desc
# Q14
select 100.00 * sum(case when p_type like 'PROMO%'then l_extendedprice * (1 - l_discount) else 0 end) / sum(l_extendedprice * (1 - l_discount)) as promo_revenue from lineitem, part where l_partkey = p_partkey and l_shipdate >= '1996-12-01'and l_shipdate < date_add('1996-12-01', interval '1' month);
# Q15
select s_suppkey, s_name, s_address, s_phone, total_revenue from supplier, revenue0 where s_suppkey = supplier_no and total_revenue = (select max(total_revenue) from revenue0 ) order by s_suppkey;
# Q16
select p_brand, p_type, p_size, count(distinct ps_suppkey) as supplier_cnt from partsupp, part where p_partkey = ps_partkey and p_brand <> 'Brand#34'and p_type not like 'LARGE BRUSHED%'and p_size in (48, 19, 12, 4, 41, 7, 21, 39) and ps_suppkey not in (select s_suppkey from supplier where s_comment like '%Customer%Complaints%') group by p_brand, p_type, p_size order by supplier_cnt desc, p_brand, p_type, p_size;
# Q17
select /*+ SET_VAR(libra_enable_runtime_filter=1) SET_VAR(libra_enable_cost_based_runtime_filter=0) PX_JOIN_FILTER_ID(0[BLOOM_FILTER],1[BLOOM_FILTER]) */ sum(l_extendedprice) / 7.0 as avg_yearly from lineitem, part where p_partkey = l_partkey and p_brand = 'Brand#44'and p_container = 'WRAP PKG'and l_quantity < (select 0.2 * avg(l_quantity) from lineitem where l_partkey = p_partkey );
# Q18
select /*+ PX_JOIN_FILTER_ID(0,1,2) leading(customer, (orders, lineitem@sel_2)) hash_join_probe(customer) JOIN_FILTER_NDV_AMP_RATIO(0:0.00001) JOIN_FILTER_NDV_AMP_RATIO(2:0.00001) JOIN_FILTER_NDV_AMP_RATIO(1:0.0001) */ c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, sum(l_quantity) from customer, orders, lineitem where o_orderkey in (select l_orderkey from lineitem group by l_orderkey having sum(l_quantity) > 314 ) and c_custkey = o_custkey and o_orderkey = l_orderkey group by c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice order by o_totalprice desc, o_orderdate limit 100;
# Q19
select /*+ SET_VAR(libra_enable_runtime_filter=1) SET_VAR(libra_enable_cost_based_runtime_filter=0) PX_JOIN_FILTER_ID(0[BLOOM_FILTER]) */ sum(l_extendedprice* (1 - l_discount)) as revenue from lineitem, part where (p_partkey = l_partkey and p_brand = 'Brand#52'and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') and l_quantity >= 4 and l_quantity <= 4 + 10 and p_size between 1 and 5 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON') or (p_partkey = l_partkey and p_brand = 'Brand#11'and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') and l_quantity >= 18 and l_quantity <= 18 + 10 and p_size between 1 and 10 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON') or (p_partkey = l_partkey and p_brand = 'Brand#51'and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') and l_quantity >= 29 and l_quantity <= 29 + 10 and p_size between 1 and 15 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON');
# Q20
select s_name, s_address from supplier, nation where s_suppkey in (select ps_suppkey from partsupp where ps_partkey in (select p_partkey from part where p_name like 'green%') and ps_availqty > (select 0.5 * sum(l_quantity) from lineitem where l_partkey = ps_partkey and l_suppkey = ps_suppkey and l_shipdate >= '1993-01-01'and l_shipdate < date_add('1993-01-01', interval '1' year) ) ) and s_nationkey = n_nationkey and n_name = 'ALGERIA'order by s_name;
# Q21
select /*+ SET_VAR(libra_enable_runtime_filter=1) SET_VAR(libra_enable_cost_based_runtime_filter=0) JOIN_FILTER_NDV_AMP_RATIO(0:0.5) JOIN_FILTER_NDV_AMP_RATIO(1:0.5) PX_JOIN_FILTER_ID(3[BLOOM_FILTER],1,0) */ s_name, count(*) as numwait from supplier, lineitem l1, orders, nation where s_suppkey = l1.l_suppkey and o_orderkey = l1.l_orderkey and o_orderstatus = 'F' and l1.l_receiptdate > l1.l_commitdate and exists ( select * from lineitem l2 where l2.l_orderkey = l1.l_orderkey and l2.l_suppkey <> l1.l_suppkey ) and not exists ( select * from lineitem l3 where l3.l_orderkey = l1.l_orderkey and l3.l_suppkey <> l1.l_suppkey and l3.l_receiptdate > l3.l_commitdate ) and s_nationkey = n_nationkey and n_name = 'EGYPT' group by s_name order by numwait desc, s_name limit 100;
# Q22
select cntrycode, count(*) as numcust, sum(c_acctbal) as totacctbal from (select substring(c_phone from 1 for 2) as cntrycode, c_acctbal from customer where substring(c_phone from 1 for 2) in ('20', '40', '22', '30', '39', '42', '21') and c_acctbal > (select avg(c_acctbal) from customer where c_acctbal > 0.00 and substring(c_phone from 1 for 2) in ('20', '40', '22', '30', '39', '42', '21') ) and not exists (select * from orders where o_custkey = c_custkey ) ) as custsale group by cntrycode order by cntrycode;