The content of this page has been automatically translated by AI. If you encounter any problems while reading, you can view the corresponding content in Chinese.

Data Table Structure Design

Last updated: 2025-04-18 15:46:32

Business Research

Source Data Storage Location

Through researching the existing technical architecture of the mall system, it was found that the data is stored in a MySQL database.
Here, it is assumed that the TencentDB for MySQL database is used.

Target Business Scenario Analysis

By analyzing the target business scenarios: sales performance in various cities and categories, we need to obtain the following tables:
Order Table: (At this point, ignore the design of subtables such as order details, assuming the order table includes product ID, product quantity, product price, shipping address, order time, etc.).
Product Table: (At this point, ignore the design of subtables such as SKU, assuming the product table includes product ID, product category, etc.).
City Table: (Assuming the geographic location coding table only goes down to the city level, the city table includes: city code, city name).
Product Category Table: (Assuming there is only one level of category, the category table includes: category code, category name).

Actual Structure

The following are the actual structures of the researched Order Table and Product Table:

1. Order Table (orders)

Field name
Field Type
Field Length
Field Description
Sample Code
order_id
INT
10
Order ID, primary key, auto-increment
10001
product_id
INT
10
Product ID, foreign key
1001
quantity
INT
5
Quantity of Goods, positive integer
2
unit_price
DECIMAL(10,2)
-
Unit Price of Goods, round to two decimal places
99.99
amount
DECIMAL(10,2)
-
Subtotal Amount of Goods, which is quantity multiplied by the unit price
199.98
order_time
DATETIME
-
Order Time, accurate to the minute
"2024-04-04 10:30:00"
shipping_city_id
INT
10
Shipping Address City ID, foreign key
1101
shipping_address
TEXT
-
Shipping Address, includes province, city, district, and detailed address
"Chaoyang District, Beijing, XXX Community"

2. Product Table (products)

Field Code
Field Type
Field Length
Field Description
Sample Code
product_id
INT
10
Product ID,primary key,Auto-increment
1001
category_id
INT
10
Category ID,Foreign key
101
product_name
VARCHAR(100)
-
Product Name
"Smartphone"

3. City Table (cities)

Field Code
Field Type
Field Length
Field Description
Sample Code
city_id
INT
10
City Code,primary key,Auto-increment
1101
city_name
VARCHAR(50)
-
City Name
"Beijing City"

Product Category Table (categories)

Field Code
Field Type
Field Length
Field Description
Sample Code
category_id
INT
10
Category ID,primary key,Auto-increment
1
category_name
VARCHAR(50)
-
Category Name
"Electronic Products"

Architecture Design

Based on business scenario needs, the final business output involves Data Warehouse Layering and Data Table Structure.

Model Specification

Model specifications help teams unify data warehouse design rules, streamline the data development process, better accumulate data assets, and lay a foundation for building data services and data marts.
During the design of data warehouse model specifications, multiple categories are included, such as data domain and principal domain.
In this scenario, the core objective is DataInLong and the data development process, and therefore detailed teaching of data model specifications is not covered in this tutorial.
Below are examples of model specifications related to this scenario:
Category
Chinese Description
English Name
Business Category
Sales
trade
Data Domain
Order
Product
order
product
Business process
Order Creation
ordercreate
Subject Domain
Product
product
Dimension
Date
Region
Category
date
city
category
Metrics
Sales Volume
Sales Quantity
amount
quantity

Data Warehouse Layering

1. Data Ingestion Layer ODS

Import raw data that hasn't undergone any processing into the data warehouse. The table structure in the ODS layer is consistent with the table structure in the original data system.
Therefore, we need to create 4 Hive tables based on the raw data (table creation operations are not needed here; this will be covered in subsequent lessons), with table structures identical to the MySQL source data tables.
The naming of the four tables is as follows:
Order Table: ods_order_order
Product Table: ods_product_product
Category Table: ods_product_category
City Table: ods_order_city
Note
The suggested naming convention is: ods_{data domain}_{self-definition content}.

2. Common Dimension Layer DIM

This section focuses on data synchronization logic, temporarily ignoring the design of the dimension layer.
Note
It is recommended to redundantly store the field attributes from the dimension tables in the detailed data tables.

3. Detailed Data Layer DWD

Build the most granular detailed data table. It is advisable to redundantly store some fields in this table to reduce the association between the detailed data table and the dimension table.
Build a detailed table: Table creation is not required here; it will be covered in subsequent lessons.
Product Sales Detail Report: dwd_trade_order_ordercreate_productsales.
Note
The suggested naming convention is: dwd_{business category}_{data domain}_{business process}_{self-definition content}.
Field Code
Field Type
Field Length
Field Description
Sample Code
order_id
INT
10
Order ID, primary key
10001
product_id
INT
10
Product ID
1001
category_id
INT
10
Category ID
101
category_name
STRING
50
Category Name
"Electronic Products"
product_name
STRING
50
Product Name
"Smartphone"
quantity
INT
5
Quantity of Goods, positive integer
2
unit_price
DECIMAL
10,2
Unit Price of Goods, round to two decimal places
99.99
amount
DECIMAL
10,2

Subtotal Amount of Goods, which is quantity multiplied by the unit price
199.98
order_time
DATETIME
-
Order Time, accurate to the minute
"2024-04-04 10:30:00"
shipping_city_id
INT
10
Shipping Address City ID, foreign key
1101
shipping_city_name
STRING
50
City Name
"Beijing City"
shipping_address
TEXT
-
Shipping Address, includes province, city, district, and detailed address
"Chaoyang District, Beijing, XXX Community"
pt_date
STRING
50
Partition Field
"2024-04-01"

Additional Notes: Hive Table Partitioning

Partitioned Table Overview
Partitioning is an essential database optimization technique that improves performance, simplifies management, reduces costs, and enhances data availability and security by dividing datasets into smaller, logically independent parts.
Partitioning is especially important in big data scenarios.
Benefits of Hive Table Partitioning
The benefits of partitioning storage in Hive can be reflected in several aspects:
Advantage
Description
Improve Query Performance
By storing data in different partitions, queries can target specific partitions, avoiding the need to scan the entire table's data and significantly reducing query time.
Optimize Data Management
Partitioning is a logical way of organizing data, making it easier to maintain, clean, and perform bulk operations such as backup and recovery.
Horizontal Scaling
Partitions can horizontally distribute data storage pressure, physically distributing data to different storage units, enhancing system scalability.
Reduce Data Skew
In cases of uneven data distribution, partitioning can prevent data skew by avoiding situations where some partitions have too much data while others have too little.
Data Isolation
Partitions can be used for data isolation. For example, data can be divided into different partitions based on time, facilitating version control and historical data management.
Reduce Data Loading Time
During data loading or ETL processes, data can be loaded into specific partitions more quickly without operating on the entire table.
Save Storage Space
Partitions can help delete or archive old partition data, thereby saving storage space.
Parallel Processing
Partition tables can better leverage Hadoop's MapReduce parallel processing capabilities, as queries can be executed in parallel on different partitions.
Data Security and Access Control
Partitions can be used to implement finer-grained data access control, such as setting stricter access permissions for certain partitions.
Maintain Data Integrity
Partitions can ensure data integrity because each partition can have its own data integrity constraints.
Support Hot and Cold Data Layering
By partitioning, data can be classified into "hot data" and "cold data" based on its frequency of use, and different storage strategies can be applied.
Simplify Data ETL Process
During data extraction, transformation, and loading, partitions can simplify data organization and processing workflows.
Improve Data Availability
Partitions can improve data availability, as the unavailability of one partition does not affect access to other partitions.
Therefore, plan partition fields as early as possible when creating Hive tables.

4. Summary Data Layer DWS

Construct summary indicator data tables with business-use granularity.
Build a summary table: Table creation is not required here; it will be covered in subsequent lessons.
Daily Product Sales Summary Table: dws_trade_order_productsales_1d.
The suggested naming convention is: dws_{business category}_{data domain}_{self-definition content}_{time period}.
Field Code
Field Type
Field Length
Field Description
Sample Code
order_date
DATE
-
Date
2021-04-01
city_id
INT
10
City ID
1
category_id
INT
10
Category ID
1
city_name
STRING
50
City Name
" Beijing "
category_name
STRING
50
Category Name
"Electronic Products"
quantity
INT
10
Total Product Sales
100
amount
DECIMAL
(10, 2)
Total Product Sales Volume
9999.99
pt_date
STRING
50
Partition Field
"2021-04-01"

5. Application Data Layer (ADS)

Build an indicator table for final business analysis requirements. Since this scenario is relatively simple, it will be temporarily ignored here.