Iceberg

Original

jasong

Last modified: 2024-11-26 11:17:54


I Iceberg

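1 Problems with Hive tables

1. Changing data is inefficient
2. Data in multiple partitions cannot be changed safely in a single operation
3. In practice, having multiple jobs modify the same dataset is not a safe operation
4. The directory listings needed for large tables take a long time
5. Users have to know the actual physical layout of every table
6. Poor performance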

2 Why Iceberg

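1. Provide an always-correct and always-consistent view of a table
2. Enable faster query planning and execution
3. Give users good response times without requiring them to know the physical layout of the data
4. Enable better and safer table evolution
5. Achieve all of the above at the scale of the data, users and applications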

3 What is Iceberg

Iceberg = metadata + data

metadata = metadata file + manifest list + manifest file
data = data files

II Metadata

metadata/metadata.json: the metadata file

v1.metadata.json
v2.metadata.json

vx.metadata.json = {tableid + location + x + schema + snapshots + current-snapshot-id}

    "snapshot_id": {
        "long": 1257424822184505371
    },

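One metadata file stores multiple snapshots; snapshot : manifest_list = 1:1.

Each commit that writes data produces a new metadata file and, with it, a new snapshot (manifest list) file.

Each metadata file, identified by its timestamp, has one current snapshot file.

A snapshot maps one-to-one to a manifest list: the manifest list holds part of the snapshot's information, and the rest lives in metadata.json.

manifest_list : manifest_file is n:n, but each manifest file records the ID of the snapshot that created it, so the metadata file that first produced it also records how it was produced. The order of creation is what you would expect: the Parquet data files come first, then the manifest file, the manifest list and the metadata file.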

The snapshots array

    "snapshots" : [ {
        "snapshot-id" : 8271497753230544300,
        "timestamp-ms" : 1611694406483,
        "summary" : {
            "operation" : "append",
            "spark.app.id" : "application_1611687743277_0002",
            "added-data-files" : "1",
            "added-records" : "1",
            "added-files-size" : "960",
            "changed-partition-count" : "1",
            "total-records" : "1",
            "total-data-files" : "1",
            "total-delete-files" : "0",
            "total-position-deletes" : "0",
            "total-equality-deletes" : "0"
        },
        "manifest-list" : "/home/hadoop/warehouse/db2/part_table2/metadata/snap-8271497753230544300-1-d8a778f9-ad19-4e9c-88ff-28f49ec939fa.avro"
    }, 
    {
        "snapshot-id" : 1257424822184505371,
        "parent-snapshot-id" : 8271497753230544300,
        "timestamp-ms" : 1611694436618,
        "summary" : {
            "operation" : "append",
            "spark.app.id" : "application_1611687743277_0002",
            "added-data-files" : "1",
            "added-records" : "1",
            "added-files-size" : "973",
            "changed-partition-count" : "1",
            "total-records" : "2",
            "total-data-files" : "2",
            "total-delete-files" : "0",
            "total-position-deletes" : "0",
            "total-equality-deletes" : "0"
        },
        "manifest-list" : "/home/hadoop/warehouse/db2/part_table2/metadata/snap-1257424822184505371-1-eab8490b-8d16-4eb1-ba9e-0dede788ff08.avro"
    } ]
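
To make the parent/child relationship between these two snapshots concrete, here is a minimal Python sketch that walks from the current snapshot back through `parent-snapshot-id`. The JSON is a trimmed copy of the array above (summaries and manifest-list paths dropped), and `snapshot_lineage` is a helper name invented for this illustration, not part of any Iceberg API.

```python
import json

# Trimmed copy of the snapshots array above, plus the current-snapshot-id
# field from the same metadata file.
METADATA_JSON = """
{
  "current-snapshot-id": 1257424822184505371,
  "snapshots": [
    {"snapshot-id": 8271497753230544300,
     "timestamp-ms": 1611694406483},
    {"snapshot-id": 1257424822184505371,
     "parent-snapshot-id": 8271497753230544300,
     "timestamp-ms": 1611694436618}
  ]
}
"""


def snapshot_lineage(metadata):
    """Return snapshot ids from the current snapshot back to its oldest ancestor."""
    by_id = {s["snapshot-id"]: s for s in metadata["snapshots"]}
    lineage = []
    current = metadata.get("current-snapshot-id")
    while current is not None and current in by_id:
        lineage.append(current)
        current = by_id[current].get("parent-snapshot-id")
    return lineage


metadata = json.loads(METADATA_JSON)
lineage = snapshot_lineage(metadata)
print(lineage)
```

Python's `json` module keeps 64-bit snapshot ids exact; tools that parse them as floating-point numbers may round the last digits.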

metadata/manifestlist.avro: the manifest list

manifestlist = n * (manifest_file) {manifest_path.avro 
+ length + snapshot... + deleted_data_files_count}

[
{

},
{
    "manifest_path": "/home/hadoop/warehouse/db2/part_table2/metadata/eab8490b-8d16-4eb1-ba9e-0dede788ff08-m0.avro",
    "manifest_length": 4884,
    "partition_spec_id": 0,
    "added_snapshot_id": {
        "long": 1257424822184505371
    },
    "added_data_files_count": {
        "int": 1
    },
    "existing_data_files_count": {
        "int": 0
    },
    "deleted_data_files_count": {
        "int": 0
    },
    "partitions": {
        "array": [ {
            "contains_null": false,
            "lower_bound": {
                "bytes": "¹Ô\\u0006\\u0000"
            },
            "upper_bound": {
                "bytes": "¹Ô\\u0006\\u0000"
            }
        } ]
    },
    "added_rows_count": {
        "long": 1
    },
    "existing_rows_count": {
        "long": 0
    },
    "deleted_rows_count": {
        "long": 0
    }
}
]
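
The `lower_bound` and `upper_bound` values above look like garbage because they are raw bytes printed as text. The sketch below decodes them, assuming the hour partition value is serialized as a 4-byte little-endian integer and that each printed character stands for one byte; the variable names are illustrative only.

```python
import struct
from datetime import datetime, timedelta

# The bound is printed above as the characters U+00B9, U+00D4, U+0006, U+0000.
# Read one byte per character, that is b9 d4 06 00.
bound_text = "\u00b9\u00d4\u0006\u0000"
bound_bytes = bound_text.encode("latin-1")

# Assumption: the value is a 4-byte little-endian signed integer.
(hours_since_epoch,) = struct.unpack("<i", bound_bytes)

# An hour partition value counts hours since the Unix epoch.
partition_hour = datetime(1970, 1, 1) + timedelta(hours=hours_since_epoch)
print(hours_since_epoch, partition_hour)
```

The decoded value agrees with the `ts_hour` value 447673 in the manifest file entry and with the `ts_hour=2021-01-26-01` directory in the data file path.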

metadata/manifest.avro: the manifest file (tracks Avro/ORC/Parquet data files)

manifest = stats + snapshot_id + data_file{file_path.parquet}

{
    "status": 1,
    "snapshot_id": {
        "long": 1257424822184505371
    },
    "data_file": {
        "file_path": "/home/hadoop/warehouse/db2/part_table2/data/ts_hour=2021-01-26-01/00000-6-7c6cf3c0-8090-4f15-a4cc-3a3a562eed7b-00001.parquet",
        "file_format": "PARQUET",
        "partition": {
            "ts_hour": {
                "int": 447673
            }
        },
        "record_count": 1,
        "file_size_in_bytes": 973,
        "block_size_in_bytes": 67108864,
        "column_sizes": {
            "array": [ {
                "key": 1,
                "value": 47
            },
            {
                "key": 2,
                "value": 57
            },
            {
                "key": 3,
                "value": 60
            } ]
        },

Putting it together, the complete metadata chain looks like this:

  //v1.metadata.json
  "table-uuid" : "4b96b6e8-9838-48df-a111-ec1ff6422816",
  "location" : "/home/hadoop/warehouse/db2/part_table2"
  "current-snapshot-id" : 1257424822184505371,
   snapshot   {
        "snapshot-id" : 1257424822184505371,
        "parent-snapshot-id" : 8271497753230544300,
        "timestamp-ms" : 1611694436618,
        "summary" : {
            "operation" : "append",
            "spark.app.id" : "application_1611687743277_0002",
            "added-data-files" : "1",
            "added-records" : "1",
            "added-files-size" : "973",
            "changed-partition-count" : "1",
            "total-records" : "2",
            "total-data-files" : "2",
            "total-delete-files" : "0",
            "total-position-deletes" : "0",
            "total-equality-deletes" : "0"
        },
        "manifest-list" : "/home/hadoop/warehouse/db2/part_table2/metadata/snap-1257424822184505371-1-eab8490b-8d16-4eb1-ba9e-0dede788ff08.avro"  //snap-{1257424822184505371}-1-{manifest-list-x} 1 
    } 
   
   
   //snap-1257424822184505371-1-eab8490b-8d16-4eb1-ba9e-0dede788ff08.avro
   manifest-list[{
       "manifest_path": "/home/hadoop/warehouse/db2/part_table2/metadata/eab8490b-8d16-4eb1-ba9e-0dede788ff08-m0.avro",  //2
        "added_snapshot_id": {
        "long": 1257424822184505371 //snap_id
    },
   }]
   
   manifestfile  = eab8490b-8d16-4eb1-ba9e-0dede788ff08-m0.avro
   {
      "status": 1,
      "snapshot_id": {
        "long": 1257424822184505371
      },
      "data_file": {
        "file_path": "/home/hadoop/warehouse/db2/part_table2/data/ts_hour=2021-01-26-01/00000-6-7c6cf3c0-8090-4f15-a4cc-3a3a562eed7b-00001.parquet",
        "file_format": "PARQUET",
   }

III Examples

1 CREATE

January 26, 2021

CREATE TABLE table1 (
    order_id BIGINT,
    customer_id BIGINT,
    order_amount DECIMAL(10, 2),
    order_ts TIMESTAMP
)
USING iceberg
PARTITIONED BY ( HOUR(order_ts) );

2 INSERT

January 26, 2021

INSERT INTO table1 VALUES (
    123,
    456,
    36.17,
    '2021-01-26 08:10:23'
);

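When we execute this INSERT statement, the following happens:
1 First, the data is written as a Parquet file: table1/data/order_ts_hour=2021-01-26-08/00000-5-cae2d.parquet
2 Then a manifest file pointing to this data file is created: table1/metadata/1234-m0.avro
3 Then a manifest list pointing to this manifest file is created: table1/metadata/snap-2938-1-1234.avro
4 Then a new metadata file is created from the previous current metadata file. It contains a new snapshot s1 that points to this manifest list (with additional details and statistics) and still tracks the previous snapshot s0: table1/metadata/v2.metadata.json
5 Finally, the current metadata pointer for db1.table1 in the catalog is atomically updated to point to this new metadata file.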
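
The five steps above can be sketched in Python as a bottom-up write followed by a single compare-and-swap on the catalog pointer. Everything here is a simplified in-memory assumption: `Catalog`, `commit_append`, the `storage` dict and the generated file names are invented for illustration and are not Iceberg's API or naming scheme.

```python
import threading


class Catalog:
    """Hypothetical in-memory catalog: table name -> current metadata file path."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        return self._pointers.get(table)

    def compare_and_swap(self, table, expected, new):
        """Move the pointer only if it still equals `expected` (step 5)."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False  # another writer committed first
            self._pointers[table] = new
            return True


def commit_append(catalog, storage, table, base, version, data_file):
    """Write the files bottom-up, then publish them with one pointer swap."""
    manifest = f"table1/metadata/{version}-m0.avro"
    manifest_list = f"table1/metadata/snap-{version}.avro"
    metadata = f"table1/metadata/v{version}.metadata.json"
    storage[data_file] = "rows"                  # 1. data file
    storage[manifest] = [data_file]              # 2. manifest -> data file
    storage[manifest_list] = [manifest]          # 3. manifest list -> manifest
    storage[metadata] = {                        # 4. new metadata file
        "manifest-list": manifest_list,
        "previous": base,
    }
    return catalog.compare_and_swap(table, base, metadata)  # 5. atomic swap


catalog, storage = Catalog(), {}
catalog.compare_and_swap("db1.table1", None, "table1/metadata/v1.metadata.json")
base = catalog.current("db1.table1")
data_file = "table1/data/order_ts_hour=2021-01-26-08/00000-5-cae2d.parquet"
committed = commit_append(catalog, storage, "db1.table1", base, 2, data_file)
# A second writer that still holds the old base loses the race.
stale = commit_append(catalog, storage, "db1.table1", base, 3, data_file)
print(committed, stale, catalog.current("db1.table1"))
```

Because readers only ever follow the catalog pointer, the files written in steps 1 to 4 stay invisible until the swap succeeds, and files left behind by a failed commit are never referenced.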

3 MERGE INTO / UPSERT

Copy-on-write

January 27, 2021 at 10:21:46.

MERGE INTO table1
USING ( SELECT * FROM table1_stage ) s
    ON table1.order_id = s.order_id
WHEN MATCHED THEN
    UPDATE SET table1.order_amount = s.order_amount
WHEN NOT MATCHED THEN
    INSERT *;

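1 Following the read path (see 4 SELECT ALL), all records in table1 and table1_stage that have the same order_id are determined.
2 Copy-on-write: the file from table1 that contains the record with order_id=123 (00000-5-cae2d.parquet) is read into the query engine's memory. In this in-memory copy, the order_amount field of the order_id=123 record is updated to the order_amount of the matching record in table1_stage. The modified copy of the original file is then written to a new Parquet file: table1/data/order_ts_hour=2021-01-26-08/00000-1-aef71.parquet. Even if the file contained other records that did not match the order_id condition, the entire file would still be copied, with the matching records updated during the copy, and written out as a new file. This strategy is called copy-on-write. Iceberg also has a second data-change strategy, merge-on-read (available for format version 2 tables), which behaves differently under the covers but provides the same update and delete functionality.
3 The record in table1_stage that does not match any record in table1 is written as a new Parquet file, because it belongs to a different partition than the matched record: table1/data/order_ts_hour=2021-01-27-10/00000-3-0fa3a.parquet
4 Then a new manifest file pointing to these two data files is created: table1/metadata/2345-m0.avro
In this case, the only record in the only data file of snapshot s1 was changed, so no manifest files or data files were reused. Normally this is not the case, and manifest files and data files are reused across snapshots.
5 Then a new manifest list pointing to this manifest file is created: table1/metadata/snap-9fa1-3-2345.avro
6 Then a new metadata file is created from the previous current metadata file. It contains a new snapshot s2 that points to this manifest list (with additional details and statistics) and still tracks the previous snapshots s0 and s1: table1/metadata/v3.metadata.json
7 Finally, the current metadata pointer for db1.table1 in the catalog is atomically updated to point to this new metadata file.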
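
As a rough illustration of step 2, the Python sketch below rewrites a whole "file" (here just a list of rows) while updating only the matching record. The new amount 50.00 and the second row with order_id 124 are hypothetical values added to show that unmatched records are copied unchanged; `copy_on_write_update` is an invented helper name.

```python
def copy_on_write_update(rows, updates):
    """Return a full rewritten copy of a data file's rows with matches updated.

    `updates` maps order_id -> new order_amount. The input rows are not
    modified: the old file stays as it is and remains readable by older
    snapshots.
    """
    new_rows = []
    for row in rows:
        new_row = dict(row)  # every row is copied, matched or not
        if row["order_id"] in updates:
            new_row["order_amount"] = updates[row["order_id"]]
        new_rows.append(new_row)
    return new_rows


old_file = [
    {"order_id": 123, "order_amount": 36.17},
    {"order_id": 124, "order_amount": 10.00},  # hypothetical unmatched row
]
new_file = copy_on_write_update(old_file, {123: 50.00})
print(new_file)
```

The cost of an update is therefore proportional to the size of the files touched, not to the number of rows changed, which is the trade-off that merge-on-read is meant to address.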

4 SELECT ALL

SELECT *
FROM db1.table1

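The query engine asks the catalog for the current metadata file of db1.table1, opens it, and looks up the manifest list of the current snapshot. It then opens this manifest list, retrieving the location of the only manifest file. Finally, it opens that manifest file to get the data file paths and reads those data files.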
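
A minimal sketch of this read path in Python, using dictionaries as stand-ins for the catalog and for the files on storage. The file names are the ones from the MERGE example; the snapshot id 2 and the `plan_scan` helper are assumptions made for the illustration.

```python
# Hypothetical in-memory stand-ins for the catalog and the files on storage.
catalog = {"db1.table1": "table1/metadata/v3.metadata.json"}
storage = {
    "table1/metadata/v3.metadata.json": {
        "current-snapshot-id": 2,
        "snapshots": [
            {"snapshot-id": 2,
             "manifest-list": "table1/metadata/snap-9fa1-3-2345.avro"},
        ],
    },
    "table1/metadata/snap-9fa1-3-2345.avro": ["table1/metadata/2345-m0.avro"],
    "table1/metadata/2345-m0.avro": [
        "table1/data/order_ts_hour=2021-01-26-08/00000-1-aef71.parquet",
        "table1/data/order_ts_hour=2021-01-27-10/00000-3-0fa3a.parquet",
    ],
}


def plan_scan(table):
    """Follow catalog -> metadata file -> manifest list -> manifests -> data files."""
    metadata = storage[catalog[table]]
    current_id = metadata["current-snapshot-id"]
    snapshot = next(
        s for s in metadata["snapshots"] if s["snapshot-id"] == current_id
    )
    data_files = []
    for manifest_path in storage[snapshot["manifest-list"]]:
        data_files.extend(storage[manifest_path])
    return data_files


data_files = plan_scan("db1.table1")
print(data_files)
```

No directory listing is involved at any point: every file the query needs is named explicitly by the level above it.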

5 Hidden Partitioning

Let’s say a user wants to see all records for a single day, say January 26, 2021, so they issue this query:

SELECT *
FROM table1
WHERE order_ts >= TIMESTAMP '2021-01-26 00:00:00'
  AND order_ts <  TIMESTAMP '2021-01-27 00:00:00'

The engine follows the same read path down to the manifest file. It opens this manifest file, looking at each data file's entry to compare the partition value the data file belongs to with the one requested by the user's query. The value in this file corresponds to the number of hours since the Unix epoch, which the engine then uses to determine that only the events in one of the data files occurred on January 26, 2021 (or in other words, between January 26, 2021 at 00:00:00 and January 26, 2021 at 23:59:59).
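
The pruning described above can be sketched in a few lines of Python. The two file names come from the MERGE example; `hour_partition` and `prune` are hypothetical helpers, and timestamps are treated as timezone-naive UTC values for simplicity.

```python
from datetime import datetime

EPOCH = datetime(1970, 1, 1)


def hour_partition(ts):
    """Hours since the Unix epoch, the value an hour partition stores."""
    return int((ts - EPOCH).total_seconds() // 3600)


# Partition value of each data file, as a manifest entry would record it.
files = {
    "00000-1-aef71.parquet": hour_partition(datetime(2021, 1, 26, 8)),
    "00000-3-0fa3a.parquet": hour_partition(datetime(2021, 1, 27, 10)),
}


def prune(files, first, last):
    """Keep files whose hour partition can hold rows in [first, last]."""
    lo, hi = hour_partition(first), hour_partition(last)
    return [name for name, hour in files.items() if lo <= hour <= hi]


matching = prune(
    files,
    datetime(2021, 1, 26, 0, 0, 0),
    datetime(2021, 1, 26, 23, 59, 59),
)
print(matching)
```

The user filters on `order_ts` only; the translation from a timestamp range to partition values happens inside the engine, which is why the partitioning is called hidden.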

6 Time Travel

SELECT *
FROM table1 TIMESTAMP AS OF '2021-01-27 00:00:00'
-- (timestamp is from before UPSERT operation)

The engine first gets the location of the current metadata file from the catalog. It then opens this metadata file and looks at the entries in the snapshots array (which contains the millisecond Unix epoch time the snapshot was created, and therefore became the most current snapshot), determines which snapshot was active as of the requested point in time (January 27, 2021 at midnight), and retrieves the entry for the manifest list location for that snapshot, which is s1.
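
Selecting the snapshot for a point in time amounts to picking the latest snapshot whose `timestamp-ms` is not after the requested time. The sketch below assumes three snapshots s0, s1 and s2; only the s2 time (the MERGE at 10:21:46 on January 27) comes from the example, while the s0 and s1 times are made up for illustration, and all times are treated as UTC.

```python
from datetime import datetime, timezone


def ms(*parts):
    """Milliseconds since the Unix epoch for a UTC date and time."""
    return int(datetime(*parts, tzinfo=timezone.utc).timestamp() * 1000)


# Hypothetical snapshot log, oldest first.
snapshots = [
    {"name": "s0", "timestamp-ms": ms(2021, 1, 26, 8, 0, 0)},
    {"name": "s1", "timestamp-ms": ms(2021, 1, 26, 8, 10, 23)},
    {"name": "s2", "timestamp-ms": ms(2021, 1, 27, 10, 21, 46)},
]


def snapshot_as_of(snapshots, as_of_ms):
    """Return the most recent snapshot created at or before `as_of_ms`."""
    candidates = [s for s in snapshots if s["timestamp-ms"] <= as_of_ms]
    if not candidates:
        raise ValueError("no snapshot exists at or before the requested time")
    return max(candidates, key=lambda s: s["timestamp-ms"])


selected = snapshot_as_of(snapshots, ms(2021, 1, 27, 0, 0, 0))
print(selected["name"])
```

From there the query continues along the normal read path, starting from the manifest list of the selected snapshot instead of the current one.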

7 Compaction

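Data processing is always a trade-off.

Write side: writers want low latency, but that produces many small files, which is not desirable.

Read side: readers want high throughput, which means large files, but large files make data changes more expensive.

Effect on reads: after compaction, 99% of the data can be read with high throughput, while the most recent 1% is still read in the low-latency, low-throughput way.

Effect on files: the file format before and after compaction can differ. For example, data written by a streaming job can end up as Parquet files after compaction.

Iceberg is not an engine; all of the above is actually carried out by the tools or engines that integrate with Iceberg.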
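
To give a feel for what a compaction job has to decide, here is a simple greedy grouping of small files in Python. It is only a sketch under stated assumptions: the sizes (in MB), the 128 MB target and the `plan_compaction` helper are hypothetical, and real engines use their own planning strategies.

```python
def plan_compaction(file_sizes, target_size):
    """Group small files so that each group fits into one file of target_size.

    Files at or above the target are left alone. Returns a list of groups,
    each group being the list of file sizes to rewrite together.
    """
    groups, current, current_size = [], [], 0
    for size in sorted(s for s in file_sizes if s < target_size):
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    # A group of one file gains nothing from being rewritten.
    return [g for g in groups if len(g) > 1]


plan = plan_compaction([5, 8, 120, 3, 512, 60, 70], target_size=128)
print(plan)
```

Each group would be read and rewritten as one larger file, and the result committed as a new snapshot in the same way as any other write.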

Partially translated from: https://www.dremio.com/resources/guides/apache-iceberg-an-architectural-look-under-the-covers/

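Original content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

In case of infringement, contact cloudcommunity@tencent.com to have it removed.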
