Presto对ORC格式的优化

哒呵呵

发布于 2019-05-16 10:39:24

2.5K0

发布于 2019-05-16 10:39:24

文章被收录于专栏：鸿的学习笔记

参考文章：https://prestosql.io/blog/2019/04/23/even-faster-orc.html

最近Presto的官网发表了一篇文章，叙述了新版本的Presto对ORC格式读取的性能优化过程，包含了很多代码细节，非常有趣，故进行简单编译。

Why is this important?

在 TPC-DS benchmark 测试中，对于 ORC 格式新的读取方式 Presto 总的查询耗费时间减少了约5%，CPU使用量减少了约9%。

What improved?

ORC格式对数据的解码分为两个步骤：第一步是使用传统的压缩格式（例如，gzip）去减少数据的存储空间；第二步是针对特定的数据类型使用特定的压缩算法去将原生的byte类型变成Value（例如text、number、timestamp）。本文主要是针对第二步过程进行了优化。

How much faster is the decoder?

对于ORC各个数据类型的优化

Why exactly is this faster?

Presto的老版本代码可以简化成如下逻辑：

if (dataStream == null) {
    presentStream.skip(nextBatchSize);
    return RunLengthEncodedBlock.create(type, null, nextBatchSize);
}

BlockBuilder builder = type.createBlockBuilder(null, nextBatchSize);
if (presentStream == null) {
    for (int i = 0; i < nextBatchSize; i++) {
        type.writeLong(builder, dataStream.next());
    }
}
else {
    for (int i = 0; i < nextBatchSize; i++) {
        if (presentStream.nextBit()) {
            type.writeLong(builder, dataStream.next());
        }
        else {
            builder.appendNull();
        }
    }
}
return builder.build();

这段代码做的好的地方：

对于所有值都是null的情况，会返回一个Run-Length Encoding的block块（http://www.fileformat.info/mirror/egff/ch09_03.htm）
将unconditional no nulls循环从conditional mixed nulls分离。这里主要是针对column中没有null值的情况，因为unconditional loops要比conditional loops快。

但这段代码也包含了一些性能问题：

这段代码reads one value at a time，而很多编码格式read in bulk性能会更好。
当调用一些不同的类型实例（type instances），会导致非常慢的 dynamic dispatch call。
从一个 null 值循环中读取 Value 的代价非常昂贵。

Optimize for bulk reads

在之前的老版本代码中，Presto 对于每种数据类型都是用同一个的 batch size ，也就是说每次都会读取1024个固定的 Value。但是ORC格式对于一些数据类型，例如 booleans、numbers、bytes 等，使用不同的 batch size 性能会更好（将对 float 和 double 类型的读取从loading a value byte at a time变成直接从输入 bulk loading an entire array of values，性能提升了10倍），但是代码会变得更加复杂（参考对Boolean类型的优化，https://github.com/prestosql/presto/blob/308/presto-orc/src/main/java/io/prestosql/orc/stream/BooleanInputStream.java#L218）。

Avoid dynamic dispatch in loops

dynamic dispatch的性能问题在老版本的代码中不容易被发现，也很容易在benchmarks测试中被忽略掉。dynamic dispatch的性能问题往往发生在一个循环中目标类（target class）生命周期中拥有不同的数据类型的。例如下面的例子：

for (int i = 0; i < nextBatchSize; i++) {
    type.writeLong(builder, dataStream.next());
}

此时type可能表示着不同的类（但拥有着同一种方法），其的性能问题取决于type在执行周期中会有多少次对应的不同的类；在ORC格式中，大多数的调用一般都只会调用有一次数据类型，但是对于Long格式的调用，对应着BIGINT、INTEGER、SMALLINT、TINYINT 和 DATE五种类型；这就会导致JVM产生dynamic dispatch，从而对性能产生影响。因此Presto选择了直接指明type的实例，避免dynamic dispatch的性能问题：

if (type instanceof BigintType) {
    BlockBuilder builder = type.createBlockBuilder(null, nextBatchSize);
    for (int i = 0; i < nextBatchSize; i++) {
        type.writeLong(builder, dataStream.next());
    }
    return builder.build();
}
if (type instanceof IntegerType) {
    BlockBuilder builder = type.createBlockBuilder(null, nextBatchSize);
    for (int i = 0; i < nextBatchSize; i++) {
        type.writeLong(builder, dataStream.next());
    }
    return builder.build();
}
...

关于dynamic dispatch的性能问题详细的讨论请参考：https://shipilev.net/blog/2015/black-magic-method-dispatch/

Improve null reading

在做完上面的优化后，Presto在大多数不带null值的数据类型的测试中获得了约（0.5ns到3ns）/Value的提升，但是对于带null值的数据类型的测试反倒下降了 6ns/Value。Presto做了很多努力，并最终找到一种性能提升的方法。抽象代码如下（在bulk read过程区分null和Value值，并针对null值进行特别的优化）：

// bulk read and count null values
boolean[] isNull = new boolean[nextBatchSize];
int nullCount = presentStream.getUnsetBits(nextBatchSize, isNull);

// bulk read non-values into an array large enough for full results
long[] result = new long[nextBatchSize];
dataStream.next(longNonNullValueTemp, nextBatchSize - nullCount);

// copy non-null values into output position (in reverse order)
int nullSuppressedPosition = nextBatchSize - nullCount - 1;
for (int outputPosition = isNull.length - 1; outputPosition >= 0; outputPosition--) {
    if (isNull[outputPosition]) {
        result[outputPosition] = 0;
    }
    else {
        result[outputPosition] = result[nullSuppressedPosition];
        nullSuppressedPosition--;
    }
}

这样bulk read时就只会读到value值，但是在性能测试中依然发现对于每一次读取value都有约4ns的性能下降。Presto的开发者注意到可以使用一个临时的buffer去缓存result[outputPosition]，以减少in-place的读取。

// bulk read and count null values
boolean[] isNull = new boolean[nextBatchSize];
int nullCount = presentStream.getUnsetBits(nextBatchSize, isNull);

// bulk read non-values into a temporary array
dataStream.next(tempBuffer, nextBatchSize - nullCount);

// copy values into result
long[] result = new long[isNull.length];
int position = 0;
for (int i = 0; i < isNull.length; i++) {
    result[i] = tempBuffer[position];
    if (!isNull[i]) {
        position++;
    }
}

在做了这些改变后，null值的对于性能的影响下降到约1.5ns/value。但是这样的做法有两个弊端：

会存在一个额外的temporary buffer，但是因为读取时是单线程，所以这个buffer可以全局使用。
null值不再是0，这可能会引发一些隐性的bug；为了解决这个bug，开发者尝试不设置null值，但是这个相对于上面的方法更慢，且又增加了一个临时的buffer，在各方权衡下Presto选择了前者。