Help & Documentation>Data Lake Compute>Getting Started>Quick Start with Partition Table

Quick Start with Partition Table

Last updated: 2024-01-10 15:53:16

Partitioned Table in Data Lake Compute

With the partition catalog feature, you can store data with different characteristics in different catalogs. In this way, when exploring data, you can filter data by partition through the where condition. This greatly reduces the scanned data volume and improves the query efficiency.
Note
Partitions within the same table should utilize the same data type and format.
Data Lake Compute's native tables implement implicit partitioning, allowing you to disregard the partition directory structure.

Creating a Partitioned Table

Specify the partition field through the PARTITIONED BY parameter in the table creation statement. Example: Creating the test_part partition table
CREATE EXTERNAL TABLE IF NOT EXISTS DataLakeCatalog.test_a_db.test_part (
_c0 int,
_c1 int,
_c2 string,
dt string
) USING PARQUET PARTITIONED BY (dt) LOCATION 'cosn://testbucket/data/';

Add Partition

Add partitions using the alter table add partition command.

If your data partition catalog uses the Hive partitioning rule (partition column name=partition column value), the rule can be used to add partitions. The catalog is organized as follows:

ALTER TABLE DataLakeCatalog.test_a_db.test_part add PARTITION (dt = '202206')
ALTER TABLE DataLakeCatalog.test_a_db.test_part add PARTITION (dt = '202207')
ALTER TABLE DataLakeCatalog.test_a_db.test_part add PARTITION (dt = '202208')
ALTER TABLE DataLakeCatalog.test_a_db.test_part add PARTITION (dt = '202209')
ALTER TABLE DataLakeCatalog.test_a_db.test_part add PARTITION (dt = '202210')

Add partitions by specifying the location with the alter table command.

If your data adopts a general COS catalog (not in the "partition column name=partition column value" format), you can specify a catalog when adding a partition. Sample SQL:
ALTER TABLE DataLakeCatalog.test_a_db.test_part add PARTITION (dt = '202211') LOCATION 'cosn://testbucket/data2/202211'
ALTER TABLE DataLakeCatalog.test_a_db.test_part add PARTITION (dt = '202212') LOCATION 'cosn://testbucket/data2/202212'

Utilizing MSCK REPAIR for Automatic Partition Addition

By using the MSCK REPAIR TABLE statement, the system scans the data directory specified during table creation. If new partition directories exist, the system will automatically add these partitions to the metadata information of the data table. SQL Reference:
MSCK REPAIR TABLE DataLakeCatalog.test_a_db.test_part
We recommend adding partitions primarily through the 'alter table' method. If you choose to use 'msck repair' for automatic partition addition, the following constraints apply:
The MSCK REPAIR TABLE command only adds partitions to the table metadata, it does not delete partitions.
When dealing with large volumes of data, it is not recommended to use the MSCK REPAIR TABLE method, as it scans the entire data volume and may lead to timeouts.
If the partition directory does not follow Hive's partitioning rule: partition column name = partition column value, the MSCK REPAIR TABLE method cannot be used.