Benchmarking the Performance of SQL-on-Hadoop Systems with TPC-DS

大数据杂货铺

Published 2020-01-14 15:09:58


Collected in the column: 大数据杂货铺

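We are often asked questions about the performance of SQL-on-Hadoop systems:

• How fast is Hive-LLAP compared with Presto, SparkSQL, or Hive on Tez?

• Since Presto is an MPP-style system, does it run the fastest whenever it successfully executes a query?

• Since it stores intermediate data in memory, does SparkSQL usually run much faster than Hive on Tez?

• What is the best system for running concurrent queries?

• …

These questions are interesting in their own right, but they matter most to industry practitioners who want to adopt the technology that best fits their needs.

There are plenty of benchmark results on the internet, but we still need new ones. All SQL-on-Hadoop systems keep evolving, so the landscape gradually changes and earlier benchmark results may already be obsolete. Moreover, the hardware used in a benchmark may favor only certain systems, and a system may not have been configured at all for its best performance. On the other hand, the TPC-DS benchmark remains the de facto standard for measuring the performance of SQL-on-Hadoop systems.

We report experimental results that answer some of these questions. The results are by no means definitive, but they should shed light on where each system stands and in which direction it is moving in the dynamic landscape of SQL-on-Hadoop. In particular, the results may contradict some common beliefs about Hive, Presto, and SparkSQL.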

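Clusters used in the experiment

We run the experiment in three different clusters: Red, Gold, and Indigo. All machines in the clusters run HDP (Hortonworks Data Platform) and share the following properties:

• 2 Intel(R) Xeon(R) X5650 CPUs

• 192GB of memory on Red, 96GB on Gold and Indigo

• Six 500GB hard disks

• 10 Gigabit network connection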

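	Red	Gold	Indigo
Hadoop version	Hadoop 2.7.3 (HDP 2.6.4)	Hadoop 2.7.3 (HDP 2.6.4)	Hadoop 3.1.0 (HDP 3.0.1)
Number of master nodes	1	2	2
Number of slave nodes	10	40	19
Scale factor for the TPC-DS benchmark	1TB	10TB	3TB
Memory size for Yarn on a slave node	168GB	84GB	84GB
Security	Kerberos	None	None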

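The total amount of memory on the slave nodes is:

• 10 * 196GB = 1960GB on the Red cluster

• 40 * 96GB = 3840GB on the Gold cluster

• 19 * 96GB = 1824GB on the Indigo cluster

We use an HDFS replication factor of 3 on Hadoop 2.7.3.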

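SQL-on-Hadoop systems to compare

We compare the following SQL-on-Hadoop systems. Note that Hive 3.1.0 is officially supported only on Hadoop 3, so we modified its source code so that it can also run on Hadoop 2.7.3.

On the Red and Gold clusters (which run HDP 2.6.4 based on Hadoop 2.7.3):

• Hive-LLAP included in HDP 2.6.4

• Presto 0.203e (with cost-based optimization enabled)

• SparkSQL 2.2.0 included in HDP 2.6.4

• Hive 3.1.0 on Tez

On the Indigo cluster (which runs HDP 3.0.1 based on Hadoop 3.1.0):

• Hive-LLAP included in HDP 3.0.1

• Presto 0.208e (with cost-based optimization enabled)

• SparkSQL 2.3.1 included in HDP 3.0.1

• Hive on Tez included in HDP 3.0.1

For Hive-LLAP, we use the default configuration set by Ambari. An LLAP daemon uses 160GB on the Red cluster and 76GB on the Gold and Indigo clusters. An ApplicationMaster uses 4GB on all clusters.

For Presto, we use the following configuration (chosen after performance tuning):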

# for the Red cluster
query.initial-hash-partitions 10
query.max-memory-per-node 120GB
query.max-total-memory-per-node 120GB
memory.heap-headroom-per-node 16GB
resources.reserved-system-memory 24GB
sink.max-buffer-size 20GB
node-scheduler.min-candidates 10
# for the Gold and Indigo clusters
query.initial-hash-partitions 40
query.max-memory-per-node 60GB
query.max-total-memory-per-node 60GB
memory.heap-headroom-per-node 8GB
resources.reserved-system-memory 12GB
sink.max-buffer-size 10GB
node-scheduler.min-candidates 40
# for all clusters
task.writer-count 4
node-scheduler.network-topology flat
optimizer.optimize-metadata-queries TRUE
join-distribution-type AUTOMATIC
optimizer.join-reordering-strategy COST_BASED/AUTOMATIC

A Presto worker uses 144GB on the Red cluster and 72GB on the Gold and Indigo clusters (for JVM -Xmx).
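The Presto memory settings can be sanity-checked against the JVM heap size. The sketch below assumes, as a rule not stated in the article, that Presto of this generation expects query.max-total-memory-per-node plus memory.heap-headroom-per-node to fit within the heap given by -Xmx. The dictionary keys and the function name are illustrative; only the numbers come from the configuration above.

```python
# Values in GB, copied from the Presto configuration and -Xmx sizes above.
PRESTO_SETTINGS = {
    "Red": {"max_total_memory_per_node": 120, "heap_headroom_per_node": 16, "jvm_xmx": 144},
    "Gold/Indigo": {"max_total_memory_per_node": 60, "heap_headroom_per_node": 8, "jvm_xmx": 72},
}


def heap_slack_gb(settings):
    """Return the heap left after query memory and headroom, in GB.

    Assumption: query memory plus headroom should not exceed the JVM heap.
    """
    reserved = settings["max_total_memory_per_node"] + settings["heap_headroom_per_node"]
    return settings["jvm_xmx"] - reserved


for cluster, settings in PRESTO_SETTINGS.items():
    slack = heap_slack_gb(settings)
    status = "fits" if slack >= 0 else "exceeds heap"
    print(f"{cluster}: {slack} GB of heap left ({status})")
```

Under that assumption, both configurations leave a small margin: 8GB on Red and 4GB on Gold and Indigo.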

For SparkSQL, we use the default configuration set by Ambari, and additionally set spark.sql.cbo.enabled and spark.sql.cbo.joinReorder.enabled to true. Spark Thrift Server uses the following options:

• --num-executors 19 --executor-memory 74g --conf spark.yarn.am.memory=74g on the Red cluster

• --num-executors 39 --executor-memory 72g --conf spark.yarn.am.memory=72g on the Gold cluster

• --num-executors 18 --executor-memory 72g --conf spark.yarn.am.memory=72g on the Indigo cluster

For Hive on Tez, each container uses 16GB on the Red cluster, 10GB on the Gold cluster, and 8GB on the Indigo cluster.

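Results of the experiment

In the experiment, we submit the 99 queries of the TPC-DS benchmark using Beeline or the Presto client. For the Red and Gold clusters, we report the results of running 103 queries because queries 14, 23, 24, and 39 each proceed in two stages. For the Indigo cluster, we report the results of running 99 queries because Presto 0.208e does not split these four queries. If a query fails, we record the time to failure and move on to the next query. We set a timeout of 3600 seconds for each query on the Red and Indigo clusters (but not on the Gold cluster).

For the reader's convenience, we attach three tables containing the raw data of the experiment. A running time of 0 seconds means that the query does not compile, and a negative running time (e.g., -639.367) means that the query fails after 639.367 seconds. Here is a link to [Google Docs].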
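The timing convention above can be sketched in a few lines of Python. This is an illustration rather than the harness used in the experiment: the function name `timed_run` is hypothetical, the commands are stand-ins for a Beeline or Presto client invocation, and the sketch does not try to detect the "does not compile" case that the raw data records as 0.

```python
import subprocess
import sys
import time


def timed_run(command, timeout_seconds=3600):
    """Run one query command and return its running time in seconds.

    The time is negated when the command exits with a non-zero status
    or is killed by the timeout, matching the raw-data convention.
    """
    start = time.monotonic()
    try:
        completed = subprocess.run(
            command,
            timeout=timeout_seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        succeeded = completed.returncode == 0
    except subprocess.TimeoutExpired:
        succeeded = False
    elapsed = time.monotonic() - start
    return elapsed if succeeded else -elapsed


# Stand-in commands: one that succeeds and one that fails.
ok = timed_run([sys.executable, "-c", "pass"])
failed = timed_run([sys.executable, "-c", "raise SystemExit(1)"])
print(f"success: {ok:.3f}s, failure: {failed:.3f}s")
```

Keeping the sign in the same column as the time lets a single number per query carry both the outcome and the duration, which is how the three raw-data tables are described.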

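Analysis of the number of completed queries

We count the number of queries that successfully return answers:

Summary:

• On the Red cluster, Hive on Tez 3.1.0 and SparkSQL 2.2.0 complete all 103 queries.

• On the Gold cluster, Hive on Tez 3.1.0 fails only on query 16, which makes it the system that completes the largest number of queries.

• On the Indigo cluster, Hive-LLAP of HDP 3.0.1 fails on query 78 because it gets stuck after the compilation step.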

Analysis of the total running time

We measure the total running time of all queries, whether successful or not:

Total Running time (seconds)	HDP 2.6.4 LLAP	Presto 0.203e	Spark 2.2.0	Hive 3.1.0/TEZ
Cluster Red	5293.164	8943.295	23571.871	6001.077
Cluster Gold	21388.742	21634.643	96143.904	19887.76

Total Running time (seconds)	HDP 3.0.1 LLAP	Presto 0.208e	Spark 2.3.1	HDP 3.0.1 Tez
Cluster Indigo	5516.89	12948.224	26247.471	12227.317

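Unfortunately, it is hard to make a fair comparison from this result because not all systems behave consistently across the complete set of queries. For example, Hive on Tez of HDP 3.0.1 spends over 12,000 seconds on the Indigo cluster because query 78 fails with a timeout after 3600 seconds, which accounts for nearly one third of its total running time. Nevertheless, we can make a few interesting observations:

• Hive-LLAP of HDP 2.6.4 has the shortest total running time on the Red cluster, and Hive 3.1.0 on Tez has the shortest total running time on the Gold cluster. Measured by total running time, the Hive engines (LLAP and Hive on Tez) are roughly four to five times faster than SparkSQL.

• On the Indigo cluster, Hive-LLAP of HDP 3.0.1 is the fastest system. Note, however, that Hive-LLAP of HDP 3.0.1 fails on query 78.

• On all three clusters, SparkSQL is the slowest. This is not because some queries fail with a timeout, but because almost all queries run slowly.

• Hive on Tez 3.1.0 is fast enough to outperform Presto 0.203e and SparkSQL 2.2.0. Likewise, Hive on Tez of HDP 3.0.1 is fast enough to outperform Presto 0.208e and SparkSQL 2.3.1.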
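The relative speeds behind these observations can be recomputed from the two tables of total running times. The following is a minimal sketch: the short system labels and the function name are illustrative, and the totals still include the time spent on failed queries, so the ratios are only as fair as the totals themselves.

```python
# Total running times in seconds, copied from the tables above.
TOTALS = {
    "Red": {"LLAP": 5293.164, "Presto": 8943.295, "Spark": 23571.871, "Tez": 6001.077},
    "Gold": {"LLAP": 21388.742, "Presto": 21634.643, "Spark": 96143.904, "Tez": 19887.76},
    "Indigo": {"LLAP": 5516.89, "Presto": 12948.224, "Spark": 26247.471, "Tez": 12227.317},
}


def slowdown_vs_fastest(totals):
    """Return each system's total time divided by the cluster's best total."""
    best = min(totals.values())
    return {name: seconds / best for name, seconds in totals.items()}


for cluster, totals in TOTALS.items():
    ratios = slowdown_vs_fastest(totals)
    fastest = min(totals, key=totals.get)
    summary = ", ".join(f"{name} {ratio:.2f}x" for name, ratio in ratios.items())
    print(f"{cluster} (fastest: {fastest}): {summary}")
```

Normalizing against the fastest system in each cluster makes the three clusters comparable even though their scale factors and node counts differ.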

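Analysis of ranks for individual queries

To see which system answers queries quickly, we rank all systems by their running time on each query. For a given query, the system that completes it the fastest is assigned the highest place (1st). If a system does not compile the query or fails to complete it, it is assigned the lowest place (4th) for that query. In this way, we can evaluate the systems more accurately from the perspective of end users rather than system administrators.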
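The ranking rule can be written down as a short function. This is a sketch under stated assumptions: the function name is hypothetical, the input uses the signed-time convention of the raw data (0 for a query that does not compile, a negative value for a failure), and systems with exactly equal times are ordered by name, which is a choice the article does not describe.

```python
def rank_systems(times, last_place=None):
    """Rank systems for one query.

    times maps a system name to its signed running time in seconds.
    Systems with a positive time are ranked from fastest (1st) to slowest;
    systems with a zero or negative time get the lowest place.
    """
    if last_place is None:
        last_place = len(times)
    finished = sorted((seconds, name) for name, seconds in times.items() if seconds > 0)
    ranks = {name: place for place, (_, name) in enumerate(finished, start=1)}
    for name, seconds in times.items():
        if seconds <= 0:
            ranks[name] = last_place
    return ranks


# Made-up times for one query; only -639.367 echoes the example in the text.
example = {"LLAP": 12.5, "Presto": 30.1, "Spark": -639.367, "Tez": 0}
print(rank_systems(example))
```

In this example the two systems that do not finish share the 4th place, so no system is ranked 3rd for that query.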

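Here is the result for the Red cluster:

• From left to right, the columns correspond to: Hive-LLAP of HDP 2.6.4, Presto 0.203e, SparkSQL 2.2.0, Hive on Tez 3.1.0.

• The first through last places are colored blue (first), gray, light blue, and dark blue (last).

• Hive on Tez is usually slow on the first few queries because it starts with no active containers and allocates new containers only after the first query is submitted. The other systems, in contrast, start with pre-warmed containers or workers and therefore tend to run fast on the first few queries.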

query	HDP 2.6.4 LLAP	Presto 0.203e	Spark 2.2.0	Hive 3.1.0/TEZ
query 1	1	2	3	4
query 2	1	4	3	2
query 3	1	3	4	2
query 4	1	2	4	3
query 5	1	4	3	2
query 6	1	3	4	2
query 7	1	3	4	2
query 8	1	3	4	2
query 9	1	3	4	2
query 10	1	3	4	2
query 11	1	2	4	3
query 12	1	2	4	3
query 13	1	3	4	2
query 14	1	3	4	2
query 14 - 2	2	3	4	1
query 15	1	2	4	3
query 16	2	1	4	3
query 17	2	1	4	3
query 18	3	2	4	1
query 19	1	3	4	2
query 20	1	2	4	3
query 21	1	2	3	4
query 22	4	1	2	3
query 23	1	3	4	2
query 23 - 2	2	3	4	1
query 24	1	2	3	4
query 24 - 2	1	3	4	2
query 25	2	3	4	1
query 26	1	3	4	2
query 27	4	2	3	1
query 28	1	3	4	2
query 29	2	3	4	1
query 30	1	2	4	3
query 31	1	3	4	2
query 32	1	3	4	2
query 33	1	3	4	2
query 34	1	2	4	3
query 35	1	2	4	3
query 36	4	1	3	2
query 37	1	2	4	3
query 38	1	2	4	3
query 39	3	1	4	2
query 39 - 2	2	3	4	1
query 40	1	3	4	2
query 41	2	4	1	3
query 42	1	3	4	2
query 43	1	3	4	2
query 44	1	3	4	2
query 45	3	1	4	2
query 46	1	3	4	2
query 47	2	3	4	1
query 48	1	3	4	2
query 49	2	3	4	1
query 50	1	3	4	2
query 51	2	1	4	3
query 52	1	3	4	2
query 53	1	3	4	2
query 54	1	3	4	2
query 55	1	3	4	2
query 56	1	3	4	2
query 57	1	3	4	2
query 58	1	3	4	2
query 59	2	3	4	1
query 60	1	3	4	2
query 61	1	3	4	2
query 62	1	3	4	2
query 63	1	3	4	2
query 64	1	3	4	2
query 65	4	1	3	2
query 66	1	3	4	2
query 67	4	3	2	1
query 68	1	3	4	2
query 69	1	3	4	2
query 70	3	2	4	1
query 71	1	3	4	2
query 72	3	2	4	1
query 73	1	2	4	3
query 74	1	3	4	2
query 75	2	3	4	1
query 76	1	3	4	2
query 77	1	4	3	2
query 78	1	2	3	4
query 79	1	3	4	2
query 80	1	3	4	2
query 81	1	3	4	2
query 82	1	2	4	3
query 83	1	3	4	2
query 84	1	2	4	3
query 85	1	4	3	2
query 86	3	2	4	1
query 87	2	1	4	3
query 88	2	3	4	1
query 89	1	3	4	2
query 90	1	4	3	2
query 91	1	3	4	2
query 92	1	3	4	2
query 93	1	3	4	2
query 94	2	1	4	3
query 95	2	1	4	3
query 96	1	3	4	2
query 97	4	1	3	2
query 98	1	2	4	3
query 99	1	3	4	2

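We observe that Hive-LLAP of HDP 2.6.4 dominates the competition: it places first for 74 queries and second for 17 queries. Next comes Hive 3.1.0 on Tez, which places first for 16 queries and second for 61 queries. Presto 0.203e places first for 12 queries and second for 23 queries. Note that Spark places last for 87 queries.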
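Counts like these can be tallied directly from the rank table. The sketch below uses only the first five rows of the Red cluster table as sample input. The function name is illustrative, and the parser assumes that the last four whitespace-separated fields of each row are the places, which also holds for rows such as "query 14 - 2".

```python
from collections import Counter

SYSTEMS = ["HDP 2.6.4 LLAP", "Presto 0.203e", "Spark 2.2.0", "Hive 3.1.0/TEZ"]

# First five rows of the Red cluster table above.
ROWS = """\
query 1 1 2 3 4
query 2 1 4 3 2
query 3 1 3 4 2
query 4 1 2 4 3
query 5 1 4 3 2
"""


def tally(rows, systems=SYSTEMS):
    """Count, for each system, how many queries it finished in each place."""
    counts = {name: Counter() for name in systems}
    for line in rows.strip().splitlines():
        places = [int(field) for field in line.split()[-len(systems):]]
        for name, place in zip(systems, places):
            counts[name][place] += 1
    return counts


for name, places in tally(ROWS).items():
    print(f"{name}: first {places[1]}, second {places[2]}, last {places[4]}")
```

Feeding all 103 rows of the table into `tally` should give the per-system counts quoted in the paragraph above.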

On the Gold cluster, noticeable changes appear:

query	HDP 2.6.4 LLAP	Presto 0.203e	Spark 2.2.0	Hive 3.1.0/TEZ
query 1	1	3	2	4
query 2	1	4	3	2
query 3	2	3	4	1
query 4	3	2	4	1
query 5	2	3	4	1
query 6	2	3	4	1
query 7	2	3	4	1
query 8	1	3	4	2
query 9	1	3	4	2
query 10	1	2	4	3
query 11	1	3	4	2
query 12	1	3	4	2
query 13	1	3	4	2
query 14	1	3	4	2
query 14 - 2	3	4	1	2
query 15	2	3	4	1
query 16	2	1	4	2
query 17	1	3	4	2
query 18	4	1	3	2
query 19	2	3	4	1
query 20	1	3	4	2
query 21	1	2	3	4
query 22	4	1	2	3
query 23	2	3	4	1
query 23 - 2	2	3	4	1
query 24	2	1	4	3
query 24 - 2	2	1	4	3
query 25	1	3	4	2
query 26	1	3	4	2
query 27	3	2	4	1
query 28	1	2	4	3
query 29	1	3	4	2
query 30	2	1	4	3
query 31	1	3	4	2
query 32	1	3	4	2
query 33	1	4	3	2
query 34	1	3	4	2
query 35	2	1	4	3
query 36	4	2	3	1
query 37	1	2	4	3
query 38	2	1	4	3
query 39	2	1	4	3
query 39 - 2	2	1	4	3
query 40	1	3	4	2
query 41	2	3	1	4
query 42	1	3	4	2
query 43	1	3	4	2
query 44	2	3	4	1
query 45	3	2	4	1
query 46	2	4	3	1
query 47	2	4	3	1
query 48	1	3	4	2
query 49	4	2	3	1
query 50	2	3	4	1
query 51	1	3	4	2
query 52	1	3	4	2
query 53	1	3	4	2
query 54	2	4	3	1
query 55	1	3	4	2
query 56	1	3	4	2
query 57	1	4	3	2
query 58	1	4	3	2
query 59	1	3	4	2
query 60	1	4	3	2
query 61	1	4	3	2
query 62	1	2	3	4
query 63	1	4	2	3
query 64	1	3	4	2
query 65	3	1	2	4
query 66	1	3	4	2
query 67	4	3	2	1
query 68	1	3	4	2
query 69	1	2	4	3
query 70	4	2	3	1
query 71	1	3	4	2
query 72	2	3	4	1
query 73	1	4	3	2
query 74	3	3	2	1
query 75	2	4	3	1
query 76	2	1	4	3
query 77	1	3	4	2
query 78	3	1	2	4
query 79	1	3	4	2
query 80	1	3	4	2
query 81	4	2	3	1
query 82	2	1	4	3
query 83	1	4	3	2
query 84	1	2	4	3
query 85	1	3	4	2
query 86	4	2	3	1
query 87	2	1	3	4
query 88	2	3	4	1
query 89	1	3	4	2
query 90	1	3	4	2
query 91	1	3	4	2
query 92	1	3	4	2
query 93	1	2	4	3
query 94	4	1	3	2
query 95	1	3	4	2
query 96	3	2	4	1
query 97	2	1	3	4
query 98	1	3	4	2
query 99	2	1	4	3

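Hive-LLAP of HDP 2.6.4 still places first for the largest number of queries (57 queries, compared with 74 on the Red cluster), but places last for 6 queries (compared with 6 on the Red cluster). Incidentally, SparkSQL 2.2.0 places first only for query 41 on both the Red and Gold clusters, and places last for 71 queries.

The result on the Indigo cluster is particularly important for comparing Hive-LLAP and Hive on Tez because both systems are based on the same version of Hive, namely Hive 3.1.0. Presto and SparkSQL are also newer versions, so the result reflects the current state of each SQL-on-Hadoop system more accurately than the results on the Red and Gold clusters. Here is the result for the Indigo cluster:

• From left to right, the columns correspond to: Hive-LLAP of HDP 3.0.1, Presto 0.208e, SparkSQL 2.3.1, Hive on Tez 3.1.0.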

query	HDP 3.0.1 LLAP	Presto 0.208e	Spark 2.3.1	HDP 3.0.1 Tez
query 1	1	2	4	3
query 2	1	4	2	3
query 3	1	2	3	4
query 4	1	2	4	3
query 5	1	4	3	2
query 6	1	2	4	3
query 7	1	3	4	2
query 8	1	2	3	4
query 9	1	3	4	2
query 10	2	1	4	3
query 11	1	2	4	3
query 12	1	2	3	4
query 13	1	3	4	2
query 14	1	3	4	2
query 14 - 2	3	1	1	4
query 15	1	2	4	3
query 16	1	2	4	3
query 17	1	2	4	3
query 18	1	2	4	3
query 19	1	2	4	3
query 20	1	2	3	4
query 21	2	1	3	4
query 22	2	1	4	3
query 23	1	2	3	4
query 23 - 2	3	1	4	1
query 24	1	2	4	3
query 24 - 2	1	4	3	2
query 25	1	2	4	3
query 26	1	2	4	3
query 27	1	3	4	2
query 28	1	4	3	2
query 29	1	2	4	3
query 30	2	1	3	4
query 31	1	3	4	2
query 32	1	2	4	3
query 33	1	3	4	2
query 34	1	2	4	3
query 35	2	1	3	4
query 36	1	2	4	3
query 37	2	1	3	4
query 38	2	1	4	3
query 39	2	1	3	4
query 39 - 2	1	4	3	2
query 40	1	3	4	2
query 41	2	3	1	4
query 42	1	3	4	2
query 43	1	2	3	4
query 44	1	3	2	4
query 45	1	2	4	3
query 46	1	2	4	3
query 47	1	4	2	3
query 48	1	3	4	2
query 49	1	4	3	2
query 50	1	2	4	3
query 51	2	1	4	3
query 52	1	2	4	3
query 53	1	2	3	4
query 54	1	4	3	2
query 55	1	2	4	3
query 56	1	3	4	2
query 57	1	4	2	3
query 58	1	4	3	2
query 59	1	4	2	3
query 60	1	3	4	2
query 61	1	3	4	2
query 62	1	2	3	4
query 63	1	2	3	4
query 64	1	3	4	2
query 65	3	1	2	4
query 66	1	2	3	4
query 67	1	4	3	2
query 68	1	3	4	2
query 69	1	2	4	3
query 70	1	3	4	2
query 71	1	3	4	2
query 72	1	3	3	2
query 73	1	2	4	3
query 74	1	2	4	3
query 75	1	2	4	3
query 76	3	2	1	4
query 77	1	3	4	2
query 78	1	3	2	3
query 79	1	3	4	2
query 80	1	3	4	2
query 81	1	2	4	3
query 82	2	1	4	3
query 83	1	4	2	3
query 84	2	1	4	3
query 85	1	4	3	2
query 86	1	2	3	4
query 87	2	1	3	4
query 88	1	3	4	2
query 89	1	2	4	3
query 90	2	1	3	4
query 91	1	2	3	4
query 92	1	2	3	4
query 93	1	3	4	2
query 94	1	2	4	3
query 95	1	3	4	2
query 96	1	2	3	4
query 97	2	4	1	3
query 98	1	2	3	4
query 99	1	2	3	4

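We observe that Hive 3.1.0 LLAP performs the best, placing first for 84 queries and second for 15 queries, with no query placed last. For Presto 0.208e, there is not much difference from the earlier results based on Presto 0.203e. SparkSQL 2.3.1 is still the slowest of all the systems. Incidentally, it still places first for query 41. The performance of Hive on Tez is roughly on par with that of Presto.

Summary

From the analysis above, we see that Hive-based systems are indeed strong competitors in the SQL-on-Hadoop landscape, not only for their stability and versatility but now also for their speed. We summarize the results of the sequential tests as follows:

• Hive-LLAP is the fastest of the SQL-on-Hadoop systems we tested.

• Under the same configuration, Hive-LLAP runs somewhat faster than Hive on Tez.

• Presto is stable and runs much faster than SparkSQL, but on average it is not as fast as Hive-LLAP.

• SparkSQL running on Spark is very slow compared with Hive and Presto. Our experimental results suggest that there is no need to use SparkSQL in a computing environment where Hive and Presto are readily available.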

Source: https://mr3.postech.ac.kr/blog/2018/10/30/performance-evaluation-0.4/

Oct 30, 2018 • Sungwoo Park

Originally published 2020-01-08; shared from the WeChat official account 大数据杂货铺.
