Let's walk through three cases; see whether you have run into similar scenarios.
Case 1:
This is a scenario I run into all the time. A colleague once brought me a query like this and asked whether writing it this way would hurt performance:
select
tmp.A
from (select A,B from testdata2) tmp
The conclusion:
No need to worry; this pattern gets fully optimized away.
== Analyzed Logical Plan ==
Project [A#3]
+- SubqueryAlias tmp
+- Project [A#3, B#4]
+- SubqueryAlias testdata2
+- View (`testData2`, [a#3,b#4])
+- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#3, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#4]
+- ExternalRDD [obj#2]
== Optimized Logical Plan ==
Project [A#3]
+- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#3]
+- ExternalRDD [obj#2]
The execution plan clearly shows that the query is ultimately optimized into
select A from testdata2
Two optimizer rules do the work here: ColumnPruning (column pruning) and CollapseProject (merging adjacent Projects).
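If you want to reproduce these plan dumps yourself, here is a minimal sketch; the local session setup and the toy rows behind testdata2 are my assumptions, not from the original example:

// Minimal sketch: print Spark's analyzed and optimized plans for the query.
// The schema of testdata2 (two int columns) is an assumption for illustration.
import org.apache.spark.sql.SparkSession

object PlanDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("plan-demo").getOrCreate()
    import spark.implicits._

    // Stand-in for the testData2 view used in the plan dumps above.
    Seq((1, 1), (1, 2), (2, 1)).toDF("A", "B").createOrReplaceTempView("testdata2")

    // explain(true) prints the parsed, analyzed, optimized, and physical plans.
    spark.sql("select tmp.A from (select A, B from testdata2) tmp").explain(true)

    spark.stop()
  }
}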
Case 2:
I never paid much attention to this one, until I noticed that some of my own code had quietly been doing it all along:
-- myudf is a user-defined function (UDF) that returns an array
select
myudf(A,B)[0] as a1,
myudf(A,B)[1] as a2,
myudf(A,B)[2] as a3
from testdata2
How many times does myudf(A,B) execute here?
The conclusion:
Three times.
If myudf is an expensive function, say it merges two very complex strings A and B (an actual scenario from my own work), then executing it three times is plainly unreasonable.
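You can confirm the triple execution from the plan itself. Below is a sketch under my own assumptions (a toy UDF over two int columns, not the real string-merging one); the optimized plan prints myudf(A, B) once for each of the three projected columns:

// Sketch: register a stand-in UDF and count its appearances in the plan.
// The UDF body here is a hypothetical placeholder, not the original function.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object UdfPlanDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("udf-demo").getOrCreate()
    import spark.implicits._

    Seq((1, 1), (1, 2)).toDF("A", "B").createOrReplaceTempView("testdata2")

    // Toy stand-in for the expensive UDF; returns an array as in the post.
    spark.udf.register("myudf", udf((a: Int, b: Int) => Array(a, b, a + b)))

    // The optimized plan shows three separate myudf(A, B) invocations.
    spark.sql(
      """select myudf(A,B)[0] as a1,
        |       myudf(A,B)[1] as a2,
        |       myudf(A,B)[2] as a3
        |from testdata2""".stripMargin).explain(true)

    spark.stop()
  }
}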
So what can we do?
Rewrite it:
-- myudf is a user-defined function (UDF) that returns an array
select
atmp[0] as a1,
atmp[1] as a2,
atmp[2] as a3
from
(select
myudf(A,B) as atmp
from testdata2
) tmp
Does this rewrite settle things once and for all?
On Spark SQL branch-3.3 the rewrite works exactly as intended. But 3.3 is a fresh release that most people are not on yet, and on any earlier version the optimizer will promptly "optimize" it right back into the first form, the one that executes the UDF three times.
branch-3.3 (works as intended: the inner Project computes myudf once, and the outer Project indexes into the precomputed value):
== Analyzed Logical Plan ==
Project [atmp#10[0] AS a1#11, atmp#10[1] AS a2#12, atmp#10[2] AS a3#13]
+- SubqueryAlias tmp
+- Project [myudf(A#3,B#4) AS atmp#10]
+- SubqueryAlias testdata2
+- View (`testData2`, [a#3,b#4])
+- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#3, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#4]
+- ExternalRDD [obj#2]
== Optimized Logical Plan ==
Project [atmp#10[0] AS a1#11, atmp#10[1] AS a2#12, atmp#10[2] AS a3#13]
+- Project [myudf(A#3,B#4) AS atmp#10]
+- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#3, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#4]
+- ExternalRDD [obj#2]
Versions before branch-3.3 (the two Projects get collapsed, whether we like it or not):
== Analyzed Logical Plan ==
Project [atmp#10[0] AS a1#11, atmp#10[1] AS a2#12, atmp#10[2] AS a3#13]
+- SubqueryAlias tmp
+- Project [concat(array(A#3), array(B#4)) AS atmp#10]
+- SubqueryAlias testdata2
+- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#3, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#4]
+- ExternalRDD [obj#2]
== Optimized Logical Plan ==
Project [myudf(A#3,B#4)[0] AS a1#11, myudf(A#3,B#4)[1] AS a2#12, myudf(A#3,B#4)[2] AS a3#13]
+- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#3, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#4]
+- ExternalRDD [obj#2]
It is hard to believe the optimizer could be this un-smart. What exactly is the cause, and how do we avoid it? I covered that in last Saturday's livestream.
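For readers who missed the livestream, here is one workaround sketch of my own (an assumption on my part, not necessarily the fix presented there): CollapseProject refuses to merge a Project whose non-deterministic output is consumed by the Project above it, so marking the UDF non-deterministic preserves the subquery rewrite on pre-3.3 versions. The cost is that non-determinism also blocks other optimizations, so treat it as a targeted escape hatch:

// Workaround sketch (my assumption, not necessarily the livestream's fix):
// a non-deterministic UDF cannot be collapsed into the outer Project, so the
// rewritten query keeps a single myudf call per row even before branch-3.3.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object NondeterministicUdfDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("nondet-demo").getOrCreate()
    import spark.implicits._

    Seq(("x", "y")).toDF("A", "B").createOrReplaceTempView("testdata2")

    // asNondeterministic() (available since Spark 2.3) tells the optimizer
    // it must not duplicate or freely move this expression.
    val myudf = udf((a: String, b: String) => Array(a + b, b + a, a + a))
      .asNondeterministic()
    spark.udf.register("myudf", myudf)

    // The optimized plan now keeps the inner Project with one myudf call.
    spark.sql(
      """select atmp[0] as a1, atmp[1] as a2, atmp[2] as a3
        |from (select myudf(A,B) as atmp from testdata2) tmp""".stripMargin).explain(true)

    spark.stop()
  }
}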
Case 3:
This one also comes up often, and people regularly ask me whether it gets optimized:
-- collect_set here stands in for aggregate functions in general
select
collect_set(a)[0] as c1,
collect_set(a)[1] as c2,
collect_set(a)[3] as c3
from testdata2
group by b
How many times does collect_set(a) execute here?
The conclusion:
Just once.
The same goes for count(xxx), count(distinct xxx), and so on: when an aggregate function appears repeatedly, Spark SQL deduplicates it for us. So when writing such queries there is no need to wrap an extra outer layer just to reuse the aggregate result; that extra layer would only push the data through one more pass for nothing.
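A quick way to verify this, sketched below under the same toy-data assumptions as before: in the optimized plan the Aggregate node computes collect_set(a) a single time, and the Project above it simply indexes into that one result three times.

// Sketch: confirm that repeated aggregate expressions are deduplicated.
// The optimized plan contains exactly one collect_set(a) in the Aggregate.
import org.apache.spark.sql.SparkSession

object AggDedupDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("agg-demo").getOrCreate()
    import spark.implicits._

    Seq((1, 1), (2, 1), (3, 2)).toDF("a", "b").createOrReplaceTempView("testdata2")

    spark.sql(
      """select collect_set(a)[0] as c1,
        |       collect_set(a)[1] as c2,
        |       collect_set(a)[3] as c3
        |from testdata2
        |group by b""".stripMargin).explain(true)

    spark.stop()
  }
}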
So there you have it: different patterns lead to different optimization outcomes, and knowing how the optimizer works under the hood helps you steer around these pitfalls.