记录一下个人对sparkSql的catalyst这个函数式的可扩展的查询优化器的理解,目录如下:
0. Overview
1. Catalyst工作流程
2. Parser模块
3. Analyzer模块
4. Optimizer模块
5. SparkPlanner模块
6. Job UI
7. Reference
Spark SQL的核心是Catalyst优化器,是以一种新颖的方式利用Scala的的模式匹配和quasiquotes机制来构建的可扩展查询优化器。
sparkSql pipeline
sparkSql的catalyst优化器是整个sparkSql pipeline的中间核心部分,其执行策略主要两方向,
Optimizer是catalyst工作最后阶段了,后面生成physical plan以及执行,主要是由sparkSql来完成。
将sparkSql字符串切分成一个一个token,再根据一定语义规则解析为一个抽象语法树/AST。Parser模块目前基本都使用第三方类库ANTLR来实现,比如Hive,presto,sparkSql等。
parser切词
Spark 1.x版本使用的是Scala原生的Parser Combinator构建词法和语法分析器,而Spark 2.x版本使用的是第三方语法解析器工具ANTLR4。
Spark2.x SQL语句的解析采用的是ANTLR4,ANTLR4根据语法文件SqlBase.g4自动解析生成两个Java类:词法解析器SqlBaseLexer和语法解析器SqlBaseParser。
SqlBaseLexer和SqlBaseParser都是使用ANTLR4自动生成的Java类。使用这两个解析器将SQL字符串语句解析成了ANTLR4的ParseTree语法树结构。然后在parsePlan过程中,使用AstBuilder.scala将ParseTree转换成catalyst表达式逻辑计划LogicalPlan。
通过解析后ULP有了基本骨架,但是系统对表的字段信息是不知道的。如sum,select,join,where还有score,people都表示什么含义,此时需要基本的元数据信息schema catalog
来表达这些token。最重要的元数据信息就是,
Analyzer会再次遍历整个AST,对树上的每个节点进行数据类型绑定以及函数绑定,比如people词素会根据元数据表信息解析为包含age、id以及name三列的表,people.age会被解析为数据类型为int的变量,sum会被解析为特定的聚合函数,
词义注入
//org.apache.spark.sql.catalyst.analysis.Analyzer.scala
lazy val batches: Seq[Batch] = Seq( //不同Batch代表不同的解析策略
Batch("Substitution", fixedPoint,
CTESubstitution,
WindowsSubstitution,
EliminateUnions,
new SubstituteUnresolvedOrdinals(conf)),
Batch("Resolution", fixedPoint,
ResolveTableValuedFunctions ::
ResolveRelations :: //通过catalog解析表或列基本数据类型,命名等信息
ResolveReferences :: //解析从子节点的操作生成的属性,一般是别名引起的,比如people.age
ResolveCreateNamedStruct ::
ResolveDeserializer ::
ResolveNewInstance ::
ResolveUpCast ::
ResolveGroupingAnalytics ::
ResolvePivot ::
ResolveOrdinalInOrderByAndGroupBy ::
ResolveMissingReferences ::
ExtractGenerator ::
ResolveGenerate ::
ResolveFunctions :: //解析基本函数,如max,min,agg
ResolveAliases ::
ResolveSubquery :: //解析AST中的字查询信息
ResolveWindowOrder ::
ResolveWindowFrame ::
ResolveNaturalAndUsingJoin ::
ExtractWindowExpressions ::
GlobalAggregates :: //解析全局的聚合函数,比如select sum(score) from table
ResolveAggregateFunctions ::
TimeWindowing ::
ResolveInlineTables ::
TypeCoercion.typeCoercionRules ++
extendedResolutionRules : _*),
Batch("Nondeterministic", Once,
PullOutNondeterministic),
Batch("UDF", Once,
HandleNullInputsForUDF),
Batch("FixNullability", Once,
FixNullability),
Batch("Cleanup", fixedPoint,
CleanupAliases)
)
Optimizer是catalyst的核心,分为RBO和CBO两种。 RBO的优化策略就是对语法树进行一次遍历,模式匹配能够满足特定规则的节点,再进行相应的等价转换,即将一棵树等价地转换为另一棵树。SQL中经典的常见优化规则有,
由下往上走,从join后再filter优化为filter再join
从`100+80`优化为`180`,避免每一条record都需要执行一次`100+80`的操作
剪裁不需要的字段,特别是嵌套里面的不需要字段。如只需people.age,不需要people.address,那么可以将address字段丢弃
//@see http://blog.csdn.net/oopsoom/article/details/38121259
//org.apache.spark.sql.catalyst.optimizer.Optimizer.scala
def batches: Seq[Batch] = {
// Technically some of the rules in Finish Analysis are not optimizer rules and belong more
// in the analyzer, because they are needed for correctness (e.g. ComputeCurrentTime).
// However, because we also use the analyzer to canonicalized queries (for view definition),
// we do not eliminate subqueries or compute current time in the analyzer.
Batch("Finish Analysis", Once,
EliminateSubqueryAliases,
ReplaceExpressions,
ComputeCurrentTime,
GetCurrentDatabase(sessionCatalog),
RewriteDistinctAggregates) ::
//////////////////////////////////////////////////////////////////////////////////////////
// Optimizer rules start here
//////////////////////////////////////////////////////////////////////////////////////////
// - Do the first call of CombineUnions before starting the major Optimizer rules,
// since it can reduce the number of iteration and the other rules could add/move
// extra operators between two adjacent Union operators.
// - Call CombineUnions again in Batch("Operator Optimizations"),
// since the other rules might make two separate Unions operators adjacent.
Batch("Union", Once,
CombineUnions) ::
Batch("Subquery", Once,
OptimizeSubqueries) ::
Batch("Replace Operators", fixedPoint,
ReplaceIntersectWithSemiJoin,
ReplaceExceptWithAntiJoin,
ReplaceDistinctWithAggregate) ::
Batch("Aggregate", fixedPoint,
RemoveLiteralFromGroupExpressions,
RemoveRepetitionFromGroupExpressions) ::
Batch("Operator Optimizations", fixedPoint,
// Operator push down
PushProjectionThroughUnion,
ReorderJoin,
EliminateOuterJoin,
PushPredicateThroughJoin, //谓词下推之一
PushDownPredicate, //谓词下推之一
LimitPushDown,
ColumnPruning, //列值剪裁,常用于聚合操作,join左右孩子操作,合并相邻project列
InferFiltersFromConstraints,
// Operator combine
CollapseRepartition,
CollapseProject,
CollapseWindow,
CombineFilters, //谓词下推之一,合并两个相邻的Filter。合并2个节点,就可以减少树的深度从而减少重复执行过滤的代价
CombineLimits, //合并Limits
CombineUnions,
// Constant folding and strength reduction
NullPropagation,
FoldablePropagation,
OptimizeIn(conf),
ConstantFolding, //常量累加之一
ReorderAssociativeOperator,
LikeSimplification,
BooleanSimplification, //常量累加之一,布尔表达式的提前短路
SimplifyConditionals,
RemoveDispensableExpressions,
SimplifyBinaryComparison,
PruneFilters,
EliminateSorts,
SimplifyCasts,
SimplifyCaseConversionExpressions,
RewriteCorrelatedScalarSubquery,
EliminateSerialization,
RemoveRedundantAliases,
RemoveRedundantProject) ::
Batch("Check Cartesian Products", Once,
CheckCartesianProducts(conf)) ::
Batch("Decimal Optimizations", fixedPoint,
DecimalAggregates) ::
Batch("Typed Filter Optimization", fixedPoint,
CombineTypedFilters) ::
Batch("LocalRelation", fixedPoint,
ConvertToLocalRelation,
PropagateEmptyRelation) ::
Batch("OptimizeCodegen", Once,
OptimizeCodegen(conf)) ::
Batch("RewriteSubquery", Once,
RewritePredicateSubquery,
CollapseProject) :: Nil
}
至此,OLP已经得到了比较完善的优化,然而此时OLP依然没有办法真正执行,它们只是逻辑上可行,实际上spark并不知道如何去执行这个OLP。
optimized logical plan -> physical plan
此时就需要将左边的OLP转换为physical plan物理执行计划,将逻辑上可行的执行计划变为spark可以真正执行的计划。
CBO off
CBO on
CBO中常见的优化是join换位
,以便尽量减少中间shuffle数据集大小,达到最优输出。