The content of this page has been automatically translated by AI. If you encounter any problems while reading, you can view the corresponding content in Chinese.

Offline development

Last updated: 2025-04-18 15:49:07

Through the previous steps, we have synchronized all the raw data to the Hive tables in the EMR cluster.
However, these data are all in their raw structures and cannot be directly used for business purposes.
Combining the content from the Data table structure design step, we have analyzed the business requirements and divided the data warehouse hierarchy.
Next, we will complete the generation and data processing of the detail and summary tables through data development.

Offline development task design





Offline development task development

Detail Table Development

Completing the detail table development mainly includes the following 4 steps:

Create a shopping mall detail table

1. In the Orchestration Space module, under any folder in the shopping mall detail table development workflow directory, click the Hive SQL icon under Data Calculation, and create a new Hive SQL task. Name the task 'Create a shopping mall detail table', task type is Hive SQL, then click Confirm.

2. After confirmation, the Hive SQL development script page will pop up, follow these steps to complete the Hive SQL development task.
Data Source: Select hive_emr-XXX.
Scheduling Resource Group: Select the resource group we purchased, it is recommended to be in the same domain.
Development Script: Write the table creation SQL script for the shopping mall detail table on the script page.
Scheduling Settings: In the right taskbar, click Scheduling Settings, select the scheduling cycle as One-time, and default the execution event.
One-time: This Hive SQL executes only once according to the execution time.
Cycle: This Hive SQL executes periodically according to the execution time.
Click Save button: Save this Hive SQL task.
Click Run button: Execute the task once to verify the script task.
Click Submit button: Officially submit this task to the scheduling resource server. The task will be executed according to the scheduling cycle at the specified time.

3. Create Shopping Mall Detail Table HiveQL Statement:
--Create Detail Table HiveQL Statement CREATE TABLE IF NOT EXISTS emall.dwd_trade_order_ordercreate_productsales ( order_id INT COMMENT 'Order ID, primary key', product_id INT COMMENT 'Product ID',
product_name STRING COMMENT 'Product Name', category_id INT COMMENT 'Product Category ID', category_name STRING COMMENT 'Product Category Name', quantity INT COMMENT 'Quantity, positive integer', unit_price DECIMAL(10, 2) COMMENT 'Unit Price, with two decimal places', amount DECIMAL(10, 2) COMMENT 'Subtotal Amount, i.e., quantity times unit price', order_time TIMESTAMP COMMENT 'Order Time, accurate to the minute', shipping_city_id INT COMMENT 'Shipping City ID, foreign key', shipping_city_name STRING COMMENT 'City Name', shipping_address STRING COMMENT 'Shipping Address, including province, city, district, and detailed address' ) COMMENT 'Product Sales Detail Table, records the sales details of each order' PARTITIONED BY (pt_date STRING) row format delimited fields terminated by '\t' STORED AS PARQUET;
By completing steps 7 - 15, you have created a Hive table in the Hive data source within the EMR cluster.

Write data to the Detail Table

Next, we will start writing data into the detail table:
Please repeat steps 7 - 15, paying attention to the following:
Note:
The following steps correspond to the step numbers indicated in the illustrations
In the same workflow, create a new HiveQL node
Step 8: Name it: Insert data into the Detail Table;
Steps 9 and 10: These two steps are often overlooked;
Step 11: The HiveQL statement is as follows;
Step 12:
Change the schedule cycle to: Cycle
Set the execution time to: 01:00
Description: This task needs to run once daily
Step 14: No matter how familiar you are with the operation, remember to test run before and after submission to ensure the task runs correctly.
HiveQL statement to insert data into the Detail Table:
--Insert data into the Detail Table HiveQL statement SET hive.exec.dynamic.partition.mode=nonstrict; INSERT INTO TABLE emall.dwd_trade_order_ordercreate_productsales PARTITION (pt_date) SELECT o.order_id, o.product_id, p.product_name, p.category_id, ca.category_name, o.quantity, o.unit_price, o.amount, o.order_time, o.shipping_city_id, ci.city_name, o.shipping_address, o.pt_date FROM emall.ods_order_order o JOIN emall.ods_product_product p ON o.product_id = p.product_id JOIN emall.ods_product_category ca ON p.category_id = ca.category_id JOIN emall.ods_order_city ci ON o.shipping_city_id = ci.city_id WHERE o.pt_date = '${yyyy-MM-dd-1d}';
At this point, we have completed the development task for the Detail Table. After the data is synchronized from the original table to the Hive cluster at midnight, the system will automatically link the four tables and compile the data into the Detail Table.
Note
There are some redundant fields in the Detail Table at this point, which is intended to minimize joins when processing the Summary Table to improve computational efficiency.

Summary Table Development

Create the Mall Summary Table

Next, we will start developing the Summary Table
First, please repeat steps 5 - 6 to create a new workflow: Mall Summary Table Development.
Next, repeat steps 7 - 15 to create the summary table. Pay attention to the following points:
Note:
The following steps correspond to the step numbers indicated in the illustrations
Create a new HiveQL node in the same workflow;
Step 8: Name it: Create Mall Summary Table;
Steps 9 and 10: These two steps are often overlooked;
Step 11: The table creation SQL statement is as follows;
Step 12:
Set the scheduling cycle to:Once
Set the execution time to: Default
Description: This task only needs to run once
Step 14: No matter how familiar you are with the operation, remember to test run before and after submission to ensure the task runs correctly.
Create the emall summary table HiveQL statement:
--Create summary table HiveQL statement CREATE TABLE IF NOT EXISTS emall.dws_trade_order_productsales_1d ( order_date DATE COMMENT 'Date of statistics, primary key', city_id INT COMMENT 'City ID',
city_name STRING COMMENT 'City name', category_id INT COMMENT 'Product category ID', category_name STRING COMMENT 'Product category name', quantity INT COMMENT 'Total product sales, positive integer', amount DECIMAL(10, 2) COMMENT 'Total product sales in monetary terms, two decimal places kept' ) COMMENT 'Daily summary table of product sales situations' PARTITIONED BY (pt_date STRING) row format delimited fields terminated by '\t' STORED AS PARQUET;

Write data into the summary table

Next, repeat steps 7 - 15 to insert data into the summary table. Pay attention to the following points:
Note:
The following steps correspond to the step numbers indicated in the illustrations
Create a new HiveQL node in the same workflow;
Step 8: Name it: Insert Data into the Summary Table;
Steps 9 and 10: These two steps are often overlooked;
Step 11: The HiveQL statement is as follows;
Step 12:
Change the schedule cycle to: Cycle
Set the execution time to: 01:00
Description: This task needs to run once daily
Step 14: No matter how familiar you are with the operation, remember to test run before and after submission to ensure the task runs correctly.
Insert data into the Summary Table HiveQL statement:
--Insert data into the Summary Table HiveQL statement SET hive.exec.dynamic.partition.mode=nonstrict; INSERT INTO TABLE emall.dws_trade_order_productsales_1d PARTITION (pt_date) SELECT p.pt_date AS order_date, p.shipping_city_id,
p.shipping_city_name, p.category_id, p.category_name, SUM(p.quantity) AS quantity, SUM(p.amount) AS amount, p.pt_date FROM emall.dwd_trade_order_ordercreate_productsales p WHERE p.pt_date = '${yyyy-MM-dd-1d}' GROUP BY p.pt_date, p.shipping_city_id,
p.shipping_city_name, p.category_id, p.category_name;
We have completed the offline development tasks for the detail table and summary table. Every early morning, WeData will automatically perform the calculation tasks for the detail and summary tables.

Establish dependencies and submit

1. Double-click the mall data development workflow. A workflow canvas will pop up, displaying all the tasks under the workflow. Establish the dependencies among the workflows in sequence. Once completed, click the submit button.
Establish dependencies in sequence:
Mall_Synchronize_Product_Category → Mall_Synchronize_Order_Data.
Mall_Synchronize_City_Information → Mall_Synchronize_Order_Data.
Mall_Synchronize_Product_Information → Mall_Synchronize_Order_Data.
Mall_Synchronize_Order_Data → Insert Data into the Detail Table.
Create Mall Detail Table → Insert Data into the Detail Table.
Insert Data into the Detail Table → Insert Data into the Summary Table.
Create Mall Summary Table → Insert Data into the Summary Table.
Finally, submit the workflow. The tasks will run in an orderly manner according to the set dependencies.


Offline Development Task Ops

You can view the running status of workflows or offline tasks in Task Ops.