1. Background
2. The Ad Log Data Lake
3. Challenges with the Lakehouse Solution and Our Improvements
spark.sql.iceberg.write.commit-by-manifest = true; // defaults to false
22/08/19 14:50:04 INFO util.CloseableIterableWithFilterMetrics: MANIFEST_DATA File Filter (Filtered: 1684, Total: 1731)
-- Filter by: ManifestMetrics, filtered: 1684
22/08/19 14:50:04 INFO util.CloseableIterableWithFilterMetrics: DATAFILE File Filter (Filtered: 11062771, Total: 11063575)
-- Filter by: PartitionFilter, filtered: 97763
-- Filter by: ManifestMetrics, filtered: 10965008
SELECT * FROM iceberg.db.table WHERE start_with('addr', 'some_value');
DATAFILE File Filter (Filtered: 20, Total: 25)
-- Filter by: SchemaFilter, filtered: 15
-- Filter by: DataFileMetricsFilter, filtered: 5
2022-07-27 17:09:17,173 INFO util.CloseableIterableWithFilterMetrics: MANIFEST_DATA File Filter (Filtered: 0, Total: 5)
spark.sql.iceberg.enable-dynamic-partition-pruning = true; // enabled by default
[1, 2, 2, 5, 5, 5, 5, 7, 7, 7] -> [1 1, 2 2, 4 5, 3 7]
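The mapping above is plain run-length encoding: each run of repeated values collapses into a "count value" pair. A minimal sketch of that transformation, using a hypothetical `RleExample` helper that is not part of the Iceberg or Parquet codebase:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class (not from Iceberg/Parquet): shows how
// run-length encoding collapses repeated values into (count, value) pairs.
public class RleExample {
    // Encode the input as "count value" pairs, e.g. four consecutive 5s -> "4 5".
    public static String encode(int[] values) {
        List<String> runs = new ArrayList<>();
        int i = 0;
        while (i < values.length) {
            int j = i;
            // Advance j past the end of the current run of equal values.
            while (j < values.length && values[j] == values[i]) {
                j++;
            }
            runs.add((j - i) + " " + values[i]);
            i = j;
        }
        return "[" + String.join(", ", runs) + "]";
    }

    public static void main(String[] args) {
        // Reproduces the example from the text.
        System.out.println(encode(new int[]{1, 2, 2, 5, 5, 5, 5, 7, 7, 7}));
        // -> [1 1, 2 2, 4 5, 3 7]
    }
}
```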
ColumnVector columnVector = ...;
int numRecords = readInt();
int bitPackedValue = unBitPack(readValue());
for (int i = 0; i < numRecords; i++) {
    columnVector[startOffset + i] = bitPackedValue;
}
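A runnable sketch of that decode loop, under the simplifying assumption that a plain `int[]` stands in for `ColumnVector` and that the run lengths and values have already been read and unpacked (the `RleDecodeExample` class and its method names are illustrative, not from the real codebase):

```java
// Hypothetical standalone version of the RLE decode step: expand each
// (count, value) run into a flat int[] "column vector".
public class RleDecodeExample {
    // Write numRecords copies of value into columnVector starting at startOffset.
    public static void decodeRun(int[] columnVector, int startOffset, int numRecords, int value) {
        for (int i = 0; i < numRecords; i++) {
            columnVector[startOffset + i] = value;
        }
    }

    public static int[] decode(int[][] runs, int totalRecords) {
        int[] columnVector = new int[totalRecords];
        int offset = 0;
        for (int[] run : runs) {
            decodeRun(columnVector, offset, run[0], run[1]);
            offset += run[0];
        }
        return columnVector;
    }

    public static void main(String[] args) {
        // Runs from the encoding example: [1 1, 2 2, 4 5, 3 7]
        int[][] runs = {{1, 1}, {2, 2}, {4, 5}, {3, 7}};
        System.out.println(java.util.Arrays.toString(decode(runs, 10)));
        // -> [1, 2, 2, 5, 5, 5, 5, 7, 7, 7]
    }
}
```

Decoding a whole run with one tight loop (rather than re-reading a value per record) is what makes the RLE path cheap for vectorized readers.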
4. Project Outcomes
5. Future Plans
This is an original article by 「bajiebajie2333」, a blogger at 从大数据到人工智能 (From Big Data to AI), released under the CC 4.0 BY-SA license; when reposting, please include a link to the original source and this notice.