Business Research
Source Data Storage Location
Through researching the existing technical architecture of the mall system, it was found that the data is stored in a MySQL database.
Here, it is assumed that the TencentDB for MySQL database is used.
Target Business Scenario Analysis
By analyzing the target business scenarios: sales performance in various cities and categories, we need to obtain the following tables:
Order Table: (At this point, ignore the design of subtables such as order details, assuming the order table includes product ID, product quantity, product price, shipping address, order time, etc.).
Product Table: (At this point, ignore the design of subtables such as SKU, assuming the product table includes product ID, product category, etc.).
City Table: (Assuming the geographic location coding table only goes down to the city level, the city table includes: city code, city name).
Product Category Table: (Assuming there is only one level of category, the category table includes: category code, category name).
Actual Structure
The following are the actual structures of the researched Order Table and Product Table:
1. Order Table (orders)
Field name | Field Type | Field Length | Field Description | Sample Code |
order_id | INT | 10 | Order ID, primary key, auto-increment | 10001 |
product_id | INT | 10 | Product ID, foreign key | 1001 |
quantity | INT | 5 | Quantity of Goods, positive integer | 2 |
unit_price | DECIMAL(10,2) | - | Unit Price of Goods, round to two decimal places | 99.99 |
amount | DECIMAL(10,2) | - | Subtotal Amount of Goods, which is quantity multiplied by the unit price | 199.98 |
order_time | DATETIME | - | Order Time, accurate to the minute | "2024-04-04 10:30:00" |
shipping_city_id | INT | 10 | Shipping Address City ID, foreign key | 1101 |
shipping_address | TEXT | - | Shipping Address, includes province, city, district, and detailed address | "Chaoyang District, Beijing, XXX Community" |
2. Product Table (products)
Field Code | Field Type | Field Length | Field Description | Sample Code |
product_id | INT | 10 | Product ID,primary key,Auto-increment | 1001 |
category_id | INT | 10 | Category ID,Foreign key | 101 |
product_name | VARCHAR(100) | - | Product Name | "Smartphone" |
3. City Table (cities)
Field Code | Field Type | Field Length | Field Description | Sample Code |
city_id | INT | 10 | City Code,primary key,Auto-increment | 1101 |
city_name | VARCHAR(50) | - | City Name | "Beijing City" |
Product Category Table (categories)
Field Code | Field Type | Field Length | Field Description | Sample Code |
category_id | INT | 10 | Category ID,primary key,Auto-increment | 1 |
category_name | VARCHAR(50) | - | Category Name | "Electronic Products" |
Architecture Design
Based on business scenario needs, the final business output involves Data Warehouse Layering and Data Table Structure.
Model Specification
Model specifications help teams unify data warehouse design rules, streamline the data development process, better accumulate data assets, and lay a foundation for building data services and data marts.
During the design of data warehouse model specifications, multiple categories are included, such as data domain and principal domain.
In this scenario, the core objective is DataInLong and the data development process, and therefore detailed teaching of data model specifications is not covered in this tutorial.
Below are examples of model specifications related to this scenario:
Category | Chinese Description | English Name |
Business Category | Sales | trade |
Data Domain | Order Product | order product |
Business process | Order Creation | ordercreate |
Subject Domain | Product | product |
Dimension | Date Region Category | date city category |
Metrics | Sales Volume Sales Quantity | amount quantity |
Data Warehouse Layering
1. Data Ingestion Layer ODS
Import raw data that hasn't undergone any processing into the data warehouse. The table structure in the ODS layer is consistent with the table structure in the original data system.
Therefore, we need to create 4 Hive tables based on the raw data (table creation operations are not needed here; this will be covered in subsequent lessons), with table structures identical to the MySQL source data tables.
The naming of the four tables is as follows:
Order Table: ods_order_order
Product Table: ods_product_product
Category Table: ods_product_category
City Table: ods_order_city
Note
The suggested naming convention is: ods_{data domain}_{self-definition content}.
2. Common Dimension Layer DIM
This section focuses on data synchronization logic, temporarily ignoring the design of the dimension layer.
Note
It is recommended to redundantly store the field attributes from the dimension tables in the detailed data tables.
3. Detailed Data Layer DWD
Build the most granular detailed data table. It is advisable to redundantly store some fields in this table to reduce the association between the detailed data table and the dimension table.
Build a detailed table: Table creation is not required here; it will be covered in subsequent lessons.
Product Sales Detail Report: dwd_trade_order_ordercreate_productsales.
Note
The suggested naming convention is: dwd_{business category}_{data domain}_{business process}_{self-definition content}.
Field Code | Field Type | Field Length | Field Description | Sample Code |
order_id | INT | 10 | Order ID, primary key | 10001 |
product_id | INT | 10 | Product ID | 1001 |
category_id | INT | 10 | Category ID | 101 |
category_name | STRING | 50 | Category Name | "Electronic Products" |
product_name | STRING | 50 | Product Name | "Smartphone" |
quantity | INT | 5 | Quantity of Goods, positive integer | 2 |
unit_price | DECIMAL | 10,2 | Unit Price of Goods, round to two decimal places | 99.99 |
amount | DECIMAL | 10,2 | Subtotal Amount of Goods, which is quantity multiplied by the unit price | 199.98 |
order_time | DATETIME | - | Order Time, accurate to the minute | "2024-04-04 10:30:00" |
shipping_city_id | INT | 10 | Shipping Address City ID, foreign key | 1101 |
shipping_city_name | STRING | 50 | City Name | "Beijing City" |
shipping_address | TEXT | - | Shipping Address, includes province, city, district, and detailed address | "Chaoyang District, Beijing, XXX Community" |
pt_date | STRING | 50 | Partition Field | "2024-04-01" |
Additional Notes: Hive Table Partitioning
Partitioned Table Overview
Partitioning is an essential database optimization technique that improves performance, simplifies management, reduces costs, and enhances data availability and security by dividing datasets into smaller, logically independent parts.
Partitioning is especially important in big data scenarios.
Benefits of Hive Table Partitioning
The benefits of partitioning storage in Hive can be reflected in several aspects:
Advantage | Description |
Improve Query Performance | By storing data in different partitions, queries can target specific partitions, avoiding the need to scan the entire table's data and significantly reducing query time. |
Optimize Data Management | Partitioning is a logical way of organizing data, making it easier to maintain, clean, and perform bulk operations such as backup and recovery. |
Horizontal Scaling | Partitions can horizontally distribute data storage pressure, physically distributing data to different storage units, enhancing system scalability. |
Reduce Data Skew | In cases of uneven data distribution, partitioning can prevent data skew by avoiding situations where some partitions have too much data while others have too little. |
Data Isolation | Partitions can be used for data isolation. For example, data can be divided into different partitions based on time, facilitating version control and historical data management. |
Reduce Data Loading Time | During data loading or ETL processes, data can be loaded into specific partitions more quickly without operating on the entire table. |
Save Storage Space | Partitions can help delete or archive old partition data, thereby saving storage space. |
Parallel Processing | Partition tables can better leverage Hadoop's MapReduce parallel processing capabilities, as queries can be executed in parallel on different partitions. |
Data Security and Access Control | Partitions can be used to implement finer-grained data access control, such as setting stricter access permissions for certain partitions. |
Maintain Data Integrity | Partitions can ensure data integrity because each partition can have its own data integrity constraints. |
Support Hot and Cold Data Layering | By partitioning, data can be classified into "hot data" and "cold data" based on its frequency of use, and different storage strategies can be applied. |
Simplify Data ETL Process | During data extraction, transformation, and loading, partitions can simplify data organization and processing workflows. |
Improve Data Availability | Partitions can improve data availability, as the unavailability of one partition does not affect access to other partitions. |
Therefore, plan partition fields as early as possible when creating Hive tables.
4. Summary Data Layer DWS
Construct summary indicator data tables with business-use granularity.
Build a summary table: Table creation is not required here; it will be covered in subsequent lessons.
Daily Product Sales Summary Table: dws_trade_order_productsales_1d.
The suggested naming convention is: dws_{business category}_{data domain}_{self-definition content}_{time period}.
Field Code | Field Type | Field Length | Field Description | Sample Code |
order_date | DATE | - | Date | 2021-04-01 |
city_id | INT | 10 | City ID | 1 |
category_id | INT | 10 | Category ID | 1 |
city_name | STRING | 50 | City Name | " Beijing " |
category_name | STRING | 50 | Category Name | "Electronic Products" |
quantity | INT | 10 | Total Product Sales | 100 |
amount | DECIMAL | (10, 2) | Total Product Sales Volume | 9999.99 |
pt_date | STRING | 50 | Partition Field | "2021-04-01" |
5. Application Data Layer (ADS)
Build an indicator table for final business analysis requirements. Since this scenario is relatively simple, it will be temporarily ignored here.