前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >Apache Calcite 文档翻译 - 关系代数

Apache Calcite 文档翻译 - 关系代数

作者头像
tyrantlucifer
发布2022-04-19 10:23:49
1.2K0
发布2022-04-19 10:23:49
举报
文章被收录于专栏:Tyrant LuciferTyrant Lucifer

代数

关系代数是Calcite的核心。每个查询都被表示为一棵关系运算符的树。你可以将一条SQL语句翻译为关系代数,也可以直接建立树状结构。

规则器规则使用保留语义的数学特性来转换表达树。例如,如果一个过滤操作没有引入其他输入的列,那么可以将一个过滤器下推至连接之前。

Calcite通过对关系表达式进行反复应用规划器规则来优化查询。一个成本模型指导了优化的整个过程,规划器生成一个替代的表达式,语义与之前表达式相同,但具有更低的成本。

规划过程是可扩展的。你可以添加自己的关系运算符、规划器规则、成本模型和统计数据。

代数构建

建立关系表达式的最简单方法时使用代数构建器RelBuilder,下面是一个示例:

表扫描
代码语言:javascript
复制
final FrameworkConfig config;
final RelBuilder builder = RelBuilder.create(config);
final RelNode node = builder
  .scan("EMP")
  .build();
System.out.println(RelOptUtil.toString(node));

(你可以从这个文件RelBUilderExample.java中获取更多的例子)

以上代码输出下列内容:

代码语言:javascript
复制
LogicalTableScan(table=[[scott, EMP]])

它相当于对EMP表进行了全表扫描,相当于这句SQL:

代码语言:javascript
复制
SELECT *
FROM scott.EMP;
增加项目

现在,让我们添加一个项目,相当于这句SQL:

代码语言:javascript
复制
SELECT ename, deptno
FROM scott.EMP;

我们仅仅只需要在调用build方法之前调用project方法即可:

代码语言:javascript
复制
final RelNode node = builder
  .scan("EMP")
  .project(builder.field("DEPTNO"), builder.field("ENAME"))
  .build();
System.out.println(RelOptUtil.toString(node));

代码输出以下内容:

代码语言:javascript
复制
LogicalProject(DEPTNO=[$7], ENAME=[$1])
  LogicalTableScan(table=[[scott, EMP]])

两次调用builder.field方法创建了简单的表达式,从输入的通过调用scan方法产生的关系表达式(TableScan)中返回字段。

Calcite将字段按照序号进行转换,7和1

增加过滤和聚合

一个查询包含了过滤和聚合:

代码语言:javascript
复制
final RelNode node = builder
  .scan("EMP")
  .aggregate(builder.groupKey("DEPTNO"),
      builder.count(false, "C"),
      builder.sum(false, "S", builder.field("SAL")))
  .filter(
      builder.call(SqlStdOperatorTable.GREATER_THAN,
          builder.field("C"),
          builder.literal(10)))
  .build();
System.out.println(RelOptUtil.toString(node));

相等于下列SQL语句:

代码语言:javascript
复制
SELECT deptno, count(*) AS c, sum(sal) AS s
FROM emp
GROUP BY deptno
HAVING count(*) > 10

同时代码输出以下内容:

代码语言:javascript
复制
LogicalFilter(condition=[>($1, 10)])
  LogicalAggregate(group=[{7}], C=[COUNT()], S=[SUM($5)])
    LogicalTableScan(table=[[scott, EMP]])
入栈和出栈

构建器使用栈来存储一个步骤所产生的关系表达式,并将其作为输入传递给下一个步骤,这使得产生关系表达式的方法可以产生一个构造器。

大多数时候,你将使用的唯一栈方法是build(),以获得最后的关系表达式,即树的根部。

有时栈会嵌套的很深,让人感到困惑。为了让事情变得更加简单一些,你可以从栈中移除表达式。例如,在这里我们建立了一个灌木状的连接:

代码语言:javascript
复制
.
               join
             /      \
        join          join
      /      \      /      \
CUSTOMERS ORDERS LINE_ITEMS PRODUCTS

我们分三个步骤来构建它。将中间结果存储在leftright两个变量中,当创建最终的Join时,使用push()方法将它们放回栈中:

代码语言:javascript
复制
final RelNode left = builder
  .scan("CUSTOMERS")
  .scan("ORDERS")
  .join(JoinRelType.INNER, "ORDER_ID")
  .build();

final RelNode right = builder
  .scan("LINE_ITEMS")
  .scan("PRODUCTS")
  .join(JoinRelType.INNER, "PRODUCT_ID")
  .build();

final RelNode result = builder
  .push(left)
  .push(right)
  .join(JoinRelType.INNER, "ORDER_ID")
  .build();
切换公约

默认的Relbuilder创建的是没有公约的逻辑RelNode,但你可以通过adoptConvention()切换到使用一个不同的约定:

代码语言:javascript
复制
final RelNode result = builder
  .push(input)
  .adoptConvention(EnumerableConvention.INSTANCE)
  .sort(toCollation)
  .build();

在这种情况下,我们在输入的RelNode之上创建了一个EnumerableSort

字段名和序号

你可以通过名称或者序号来引用一个字段。

序号是从0开始。每个运算符都保证其输出字段的出现顺序。例如,Project返回由每个标量表达式产生的字段。

操作符的字段名保证是唯一的,但有时这意味着名字并不完全是你期望的那样。例如,当你将EMP连接到DEPT时,其中一个输出字段被称为DEPTNO,而另一个字段被称为DEPTNO_1

一些关系表达式方法让你对字段名称有更多的控制:

  • project让你用alias(expr, fieldName)来包装表达式。它删除了包装器但是保留了建议的名称(只要它是唯一的)
  • values(String[] fieldNames, Object... values) 接受一个字段名的数组。如果数组中的任何元素为空,构建器将生成一个唯一的名称。

如果一个表达式使用了一个输入字段,或者转换了一个输入字段,那么它将使用该输入字段的名称。

一旦唯一的字段名被分配,这些名字是不可更改的。如果你有一个特定的RelNode实例,你可以相信字段名不会改变。事实上,整个关系表达式不可变的。

但是如果一个关系表达式已经通过了几个重写规则(详见RelOptRule),那么产生的表达式的字段名可能与原来的不太一样。在这一点上,最好使用序号来引用字段。

当你建立一个接收多个输入的关系表达式时,你需要建立考虑到这一点的字段引用。这在构建连接条件的时候常常发生。

假设你正在建立一个关于EMP的连接,EMP有8个字段[EMNO, ENAME, JOB, MGR, HIREDATE, SAL, COMM, DEPTNO],而DEPT有3个字段[DEPTNO, DNAME, LOC]。在内部,Calcite将这些字段表示为一个有11个字段的组合输入行:左边输入的第一个字段是字段#0(基于0,记住),右边输入的第一个字段是字段#8

但是通过 builder API,你可以指定哪个字段的输入。要引用 "SAL",内部字段#5,可以写 builder.field(2, 0, "SAL"),builder.field(2, "EMP", "SAL"),或者 builder.field(2, 0, 5)。这意味着 "两个输入中的输入#0的字段#5"。(为什么它需要知道有两个输入?因为它们被存储在堆栈中;输入 #1 在堆栈的顶部,而输入 #0 在它的下面。如果我们不告诉构建器有两个输入,它就不知道要到多深的地方去找0号输入)。

同样,要引用 "DNAME",内部字段 #9 (8 + 1),请写 builder.field(2, 1, "DNAME"), builder.field(2, "DEPT", "DNAME"), 或者 builder.field(2, 1, 1)。

递归查询

警告。目前的API是实验性的,如有变化,恕不另行通知。一个SQL递归查询,例如这个产生序列1, 2, 3, ...10的查询:

代码语言:javascript
复制
WITH RECURSIVE aux(i) AS (
  VALUES (1)
  UNION ALL
  SELECT i+1 FROM aux WHERE i < 10
)
SELECT * FROM aux

可以通过对TransientTableRepeatUnion的扫描来生成:

代码语言:javascript
复制
final RelNode node = builder
  .values(new String[] { "i" }, 1)
  .transientScan("aux")
  .filter(
      builder.call(
          SqlStdOperatorTable.LESS_THAN,
          builder.field(0),
          builder.literal(10)))
  .project(
      builder.call(
          SqlStdOperatorTable.PLUS,
          builder.field(0),
          builder.literal(1)))
  .repeatUnion("aux", true)
  .build();
System.out.println(RelOptUtil.toString(node));

以上代码输出以下内容:

代码语言:javascript
复制
LogicalRepeatUnion(all=[true])
  LogicalTableSpool(readType=[LAZY], writeType=[LAZY], tableName=[aux])
    LogicalValues(tuples=[[{ 1 }]])
  LogicalTableSpool(readType=[LAZY], writeType=[LAZY], tableName=[aux])
    LogicalProject($f0=[+($0, 1)])
      LogicalFilter(condition=[<($0, 10)])
        LogicalTableScan(table=[[aux]])
API汇总
关系操作符

以下方法创建一个关系表达式(RelNode),将其推入栈,并返回RelBuilder。

METHOD

DESCRIPTION

scan(tableName)

Creates a TableScan.

functionScan(operator, n, expr...) functionScan(operator, n, exprList)

Creates a TableFunctionScan of the n most recent relational expressions.

transientScan(tableName [, rowType])

Creates a TableScan on a TransientTable with the given type (if not specified, the most recent relational expression’s type will be used).

values(fieldNames, value...) values(rowType, tupleList)

Creates a Values.

filter([variablesSet, ] exprList) filter([variablesSet, ] expr...)

Creates a Filter over the AND of the given predicates; if variablesSet is specified, the predicates may reference those variables.

project(expr...) project(exprList [, fieldNames])

Creates a Project. To override the default name, wrap expressions using alias, or specify the fieldNames argument.

projectPlus(expr...) projectPlus(exprList)

Variant of project that keeps original fields and appends the given expressions.

projectExcept(expr...) projectExcept(exprList)

Variant of project that keeps original fields and removes the given expressions.

permute(mapping)

Creates a Project that permutes the fields using mapping.

convert(rowType [, rename])

Creates a Project that converts the fields to the given types, optionally also renaming them.

aggregate(groupKey, aggCall...) aggregate(groupKey, aggCallList)

Creates an Aggregate.

distinct()

Creates an Aggregate that eliminates duplicate records.

pivot(groupKey, aggCalls, axes, values)

Adds a pivot operation, implemented by generating an Aggregate with a column for each combination of measures and values

unpivot(includeNulls, measureNames, axisNames, axisMap)

Adds an unpivot operation, implemented by generating a Join to a Values that converts each row to several rows

sort(fieldOrdinal...) sort(expr...) sort(exprList)

Creates a Sort. In the first form, field ordinals are 0-based, and a negative ordinal indicates descending; for example, -2 means field 1 descending. In the other forms, you can wrap expressions in as, nullsFirst or nullsLast.

sortLimit(offset, fetch, expr...) sortLimit(offset, fetch, exprList)

Creates a Sort with offset and limit.

limit(offset, fetch)

Creates a Sort that does not sort, only applies with offset and limit.

exchange(distribution)

Creates an Exchange.

sortExchange(distribution, collation)

Creates a SortExchange.

correlate(joinType, correlationId, requiredField...) correlate(joinType, correlationId, requiredFieldList)

Creates a Correlate of the two most recent relational expressions, with a variable name and required field expressions for the left relation.

join(joinType, expr...) join(joinType, exprList) join(joinType, fieldName...)

Creates a Join of the two most recent relational expressions. The first form joins on a boolean expression (multiple conditions are combined using AND). The last form joins on named fields; each side must have a field of each name.

semiJoin(expr)

Creates a Join with SEMI join type of the two most recent relational expressions.

antiJoin(expr)

Creates a Join with ANTI join type of the two most recent relational expressions.

union(all [, n])

Creates a Union of the n (default two) most recent relational expressions.

intersect(all [, n])

Creates an Intersect of the n (default two) most recent relational expressions.

minus(all)

Creates a Minus of the two most recent relational expressions.

repeatUnion(tableName, all [, n])

Creates a RepeatUnion associated to a TransientTable of the two most recent relational expressions, with n maximum number of iterations (default -1, i.e. no limit).

snapshot(period)

Creates a Snapshot of the given snapshot period.

match(pattern, strictStart, strictEnd, patterns, measures, after, subsets, allRows, partitionKeys, orderKeys, interval)

Creates a Match.

参数类型详解:

  • exprinterval,RexNode
  • expr..., requiredField... Array of RexNode
  • exprList, measureList, partitionKeys, orderKeys, requiredFieldList Iterable of RexNode
  • fieldOrdinal Ordinal of a field within its row (starting from 0)
  • fieldName Name of a field, unique within its row
  • fieldName... Array of String
  • fieldNames Iterable of String
  • rowType RelDataType
  • groupKey RelBuilder.GroupKey
  • aggCall... Array of RelBuilder.AggCall
  • aggCallList Iterable of RelBuilder.AggCall
  • value... Array of Object
  • value Object
  • tupleList Iterable of List of RexLiteral
  • all, distinct, strictStart, strictEnd, allRows boolean
  • alias String
  • correlationId CorrelationId
  • variablesSet Iterable of CorrelationId
  • varHolder Holder of RexCorrelVariable
  • patterns Map whose key is String, value is RexNode
  • subsets Map whose key is String, value is a sorted set of String
  • distribution RelDistribution
  • collation RelCollation
  • operator SqlOperator
  • joinType JoinRelType

构造器的方法包含各种优化,包括:

  • 如果被要求按顺序映射所有列,project则返回输入值
  • filter会扁平化简化筛选操作(所以一个andor操作可能有多个操作数),例如(将x = 1 AND TRUE转换为x = 1)
  • 如果你先sortlimit,效果和你使用sortLimit一样

有一些注释方法可以为栈上的顶级关系表达式添加注释信息:

METHOD

DESCRIPTION

as(alias)

Assigns a table alias to the top relational expression on the stack

variable(varHolder)

Creates a correlation variable referencing the top relational expression

栈方法

METHOD

DESCRIPTION

build()

Pops the most recently created relational expression off the stack

push(rel)

Pushes a relational expression onto the stack. Relational methods such as scan, above, call this method, but user code generally does not

pushAll(collection)

Pushes a collection of relational expressions onto the stack

peek()

Returns the relational expression most recently put onto the stack, but does not remove it

标量表达方法

以下方法返回一个标量表达式(RexNode)。

它们中的许多都使用栈中的内容。例如,field("DEPTNO")返回一个对刚刚添加到堆栈中的关系表达式的"DEPTNO"字段的引用。

METHOD

DESCRIPTION

literal(value)

Constant

field(fieldName)

Reference, by name, to a field of the top-most relational expression

field(fieldOrdinal)

Reference, by ordinal, to a field of the top-most relational expression

field(inputCount, inputOrdinal, fieldName)

Reference, by name, to a field of the (inputCount - inputOrdinal)th relational expression

field(inputCount, inputOrdinal, fieldOrdinal)

Reference, by ordinal, to a field of the (inputCount - inputOrdinal)th relational expression

field(inputCount, alias, fieldName)

Reference, by table alias and field name, to a field at most inputCount - 1 elements from the top of the stack

field(alias, fieldName)

Reference, by table alias and field name, to a field of the top-most relational expressions

field(expr, fieldName)

Reference, by name, to a field of a record-valued expression

field(expr, fieldOrdinal)

Reference, by ordinal, to a field of a record-valued expression

fields(fieldOrdinalList)

List of expressions referencing input fields by ordinal

fields(mapping)

List of expressions referencing input fields by a given mapping

fields(collation)

List of expressions, exprList, such that sort(exprList) would replicate collation

call(op, expr...) call(op, exprList)

Call to a function or operator

and(expr...) and(exprList)

Logical AND. Flattens nested ANDs, and optimizes cases involving TRUE and FALSE.

or(expr...) or(exprList)

Logical OR. Flattens nested ORs, and optimizes cases involving TRUE and FALSE.

not(expr)

Logical NOT

equals(expr, expr)

Equals

isNull(expr)

Checks whether an expression is null

isNotNull(expr)

Checks whether an expression is not null

alias(expr, fieldName)

Renames an expression (only valid as an argument to project)

cast(expr, typeName) cast(expr, typeName, precision) cast(expr, typeName, precision, scale)

Converts an expression to a given type

desc(expr)

Changes sort direction to descending (only valid as an argument to sort or sortLimit)

nullsFirst(expr)

Changes sort order to nulls first (only valid as an argument to sort or sortLimit)

nullsLast(expr)

Changes sort order to nulls last (only valid as an argument to sort or sortLimit)

cursor(n, input)

Reference to inputth (0-based) relational input of a TableFunctionScan with n inputs (see functionScan)

子查询方法

以下方法将子查询转换为标量值(在in、exists、some、all、unique的情况下是BOOLEAN;对于scalarQuery是任何标量类型。对于arrayQuery是ARRAY,对于mapQuery是MAP,对于multisetQuery是MULTISET)

就如下所示,relFn是一个函数,它接收一个RelBuilder参数并返回一个RelNode。你通常把它作为一个lambda表达式来实现;该方法用一个具有正确上下文的RelBuilder来调用你的代码,而你的代码则返回将成为子查询的RelNode

METHOD

DESCRIPTION

all(expr, op, relFn)

Returns whether expr has a particular relation to all of the values of the sub-query

arrayQuery(relFn)

Returns the rows of a sub-query as an ARRAY

exists(relFn)

Tests whether sub-query is non-empty

in(expr, relFn) in(exprList, relFn)

Tests whether a value occurs in a sub-query

mapQuery(relFn)

Returns the rows of a sub-query as a MAP

multisetQuery(relFn)

Returns the rows of a sub-query as a MULTISET

scalarQuery(relFn)

Returns the value of the sole column of the sole row of a sub-query

some(expr, op, relFn)

Returns whether expr has a particular relation to one or more of the values of the sub-query

unique(relFn)

Returns whether the rows of a sub-query are unique

模式匹配方法

以下方法返回用于匹配的模式。

METHOD

DESCRIPTION

patternConcat(pattern...)

Concatenates patterns

patternAlter(pattern...)

Alternates patterns

patternQuantify(pattern, min, max)

Quantifies a pattern

patternPermute(pattern...)

Permutes a pattern

patternExclude(pattern)

Excludes a pattern

分组方法

以下方法返回RelBuilder.GroupKey。

METHOD

DESCRIPTION

groupKey(fieldName...) groupKey(fieldOrdinal...) groupKey(expr...) groupKey(exprList)

Creates a group key of the given expressions

groupKey(exprList, exprListList)

Creates a group key of the given expressions with grouping sets

groupKey(bitSet [, bitSets])

Creates a group key of the given input columns, with multiple grouping sets if bitSets is specified

聚合方法

以下方法返回RelBuilder.AggCall。

METHOD

DESCRIPTION

aggregateCall(op, expr...) aggregateCall(op, exprList)

Creates a call to a given aggregate function

count([ distinct, alias, ] expr...) count([ distinct, alias, ] exprList)

Creates a call to the COUNT aggregate function

countStar(alias)

Creates a call to the COUNT(*) aggregate function

sum([ distinct, alias, ] expr)

Creates a call to the SUM aggregate function

min([ alias, ] expr)

Creates a call to the MIN aggregate function

max([ alias, ] expr)

Creates a call to the MAX aggregate function

要进一步修改AggCall,请调用以下方法:

METHOD

DESCRIPTION

approximate(approximate)

Allows approximate value for the aggregate of approximate

as(alias)

Assigns a column alias to this expression (see SQL AS)

distinct()

Eliminates duplicate values before aggregating (see SQL DISTINCT)

distinct(distinct)

Eliminates duplicate values before aggregating if distinct

filter(expr)

Filters rows before aggregating (see SQL FILTER (WHERE ...))

sort(expr...) sort(exprList)

Sorts rows before aggregating (see SQL WITHIN GROUP)

unique(expr...) unique(exprList)

Makes rows unique before aggregating (see SQL WITHIN DISTINCT)

over()

Converts this AggCall into a windowed aggregate (see OverCall below)

窗口聚合方法

创建一个 RelBuilder.OverCall,它代表了对窗口聚合函数的调用,创建一个聚合调用,然后调用其over()方法,例如count().over()

要进一步修改OverCall,调用以下方法:

METHOD

DESCRIPTION

rangeUnbounded()

Creates an unbounded range-based window, RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

rangeFrom(lower)

Creates a range-based window bounded below, RANGE BETWEEN lower AND CURRENT ROW

rangeTo(upper)

Creates a range-based window bounded above, RANGE BETWEEN CURRENT ROW AND upper

rangeBetween(lower, upper)

Creates a range-based window, RANGE BETWEEN lower AND upper

rowsUnbounded()

Creates an unbounded row-based window, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

rowsFrom(lower)

Creates a row-based window bounded below, ROWS BETWEEN lower AND CURRENT ROW

rowsTo(upper)

Creates a row-based window bounded above, ROWS BETWEEN CURRENT ROW AND upper

rowsBetween(lower, upper)

Creates a rows-based window, ROWS BETWEEN lower AND upper

partitionBy(expr...) partitionBy(exprList)

Partitions the window on the given expressions (see SQL PARTITION BY)

orderBy(expr...) sort(exprList)

Sorts the rows in the window (see SQL ORDER BY)

allowPartial(b)

Sets whether to allow partial width windows; default true

nullWhenCountZero(b)

Sets whether whether the aggregate function should evaluate to null if no rows are in the window; default false

as(alias)

Assigns a column alias (see SQL AS) and converts this OverCall to a RexNode

toRex()

Converts this OverCall to a RexNode

本文参与 腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2022-03-19,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 Tyrant Lucifer 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 代数
    • 代数构建
      • 表扫描
      • 增加项目
      • 增加过滤和聚合
      • 入栈和出栈
      • 切换公约
      • 字段名和序号
      • 递归查询
      • API汇总
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档