The content of this page has been automatically translated by AI. If you encounter any problems while reading, you can view the corresponding content in Chinese.
Through the previous steps, we have synchronized all the raw data to the Hive tables in the EMR cluster.
However, these data are all in their raw structures and cannot be directly used for business purposes.
Combining the content from the Data table structure design step, we have analyzed the business requirements and divided the data warehouse hierarchy.
Next, we will complete the generation and data processing of the detail and summary tables through data development.
Offline development task design
Offline development task development
Detail Table Development
Completing the detail table development mainly includes the following 4 steps:
Create a shopping mall detail table
1. In the Orchestration Space module, under any folder in the shopping mall detail table development workflow directory, click the Hive SQL icon under Data Calculation, and create a new Hive SQL task. Name the task 'Create a shopping mall detail table', task type is Hive SQL, then click Confirm.
2. After confirmation, the Hive SQL development script page will pop up, follow these steps to complete the Hive SQL development task.
Data Source: Select hive_emr-XXX.
Scheduling Resource Group: Select the resource group we purchased, it is recommended to be in the same domain.
Development Script: Write the table creation SQL script for the shopping mall detail table on the script page.
Scheduling Settings: In the right taskbar, click Scheduling Settings, select the scheduling cycle as One-time, and default the execution event.
One-time: This Hive SQL executes only once according to the execution time.
Cycle: This Hive SQL executes periodically according to the execution time.
Click Save button: Save this Hive SQL task.
Click Run button: Execute the task once to verify the script task.
Click Submit button: Officially submit this task to the scheduling resource server. The task will be executed according to the scheduling cycle at the specified time.
product_name STRING COMMENT'Product Name',
category_id INTCOMMENT'Product Category ID',
category_name STRING COMMENT'Product Category Name',
quantity INTCOMMENT'Quantity, positive integer',
unit_price DECIMAL(10,2)COMMENT'Unit Price, with two decimal places',
amount DECIMAL(10,2)COMMENT'Subtotal Amount, i.e., quantity times unit price',
order_time TIMESTAMPCOMMENT'Order Time, accurate to the minute',
shipping_city_id INTCOMMENT'Shipping City ID, foreign key',
shipping_city_name STRING COMMENT'City Name',
shipping_address STRING COMMENT'Shipping Address, including province, city, district, and detailed address')COMMENT'Product Sales Detail Table, records the sales details of each order'
PARTITIONED BY(pt_date STRING)row format delimited fieldsterminatedby'\t'
STORED AS PARQUET;
By completing steps 7 - 15, you have created a Hive table in the Hive data source within the EMR cluster.
Write data to the Detail Table
Next, we will start writing data into the detail table:
Please repeat steps 7 - 15, paying attention to the following:
Note:
The following steps correspond to the step numbers indicated in the illustrations
In the same workflow, create a new HiveQL node
Step 8: Name it: Insert data into the Detail Table;
Steps 9 and 10: These two steps are often overlooked;
Step 11: The HiveQL statement is as follows;
Step 12:
Change the schedule cycle to: Cycle
Set the execution time to: 01:00
Description: This task needs to run once daily
Step 14: No matter how familiar you are with the operation, remember to test run before and after submission to ensure the task runs correctly.
HiveQL statement to insert data into the Detail Table:
--Insert data into the Detail Table HiveQL statementSET hive.exec.dynamic.partition.mode=nonstrict;INSERTINTOTABLE emall.dwd_trade_order_ordercreate_productsales PARTITION(pt_date)SELECT
o.order_id,
o.product_id,
p.product_name,
p.category_id,
ca.category_name,
o.quantity,
o.unit_price,
o.amount,
o.order_time,
o.shipping_city_id,
ci.city_name,
o.shipping_address,
o.pt_date
FROM
emall.ods_order_order o
JOIN
emall.ods_product_product p ON o.product_id = p.product_id
JOIN
emall.ods_product_category ca ON p.category_id = ca.category_id
JOIN
emall.ods_order_city ci ON o.shipping_city_id = ci.city_id
WHERE o.pt_date ='${yyyy-MM-dd-1d}';
At this point, we have completed the development task for the Detail Table. After the data is synchronized from the original table to the Hive cluster at midnight, the system will automatically link the four tables and compile the data into the Detail Table.
Note
There are some redundant fields in the Detail Table at this point, which is intended to minimize joins when processing the Summary Table to improve computational efficiency.
Summary Table Development
Create the Mall Summary Table
Next, we will start developing the Summary Table
First, please repeat steps 5 - 6 to create a new workflow: Mall Summary Table Development.
Next, repeat steps 7 - 15 to create the summary table. Pay attention to the following points:
Note:
The following steps correspond to the step numbers indicated in the illustrations
Create a new HiveQL node in the same workflow;
Step 8: Name it: Create Mall Summary Table;
Steps 9 and 10: These two steps are often overlooked;
Step 11: The table creation SQL statement is as follows;
Step 12:
Set the scheduling cycle to:Once
Set the execution time to: Default
Description: This task only needs to run once
Step 14: No matter how familiar you are with the operation, remember to test run before and after submission to ensure the task runs correctly.
city_name STRING COMMENT'City name',
category_id INTCOMMENT'Product category ID',
category_name STRING COMMENT'Product category name',
quantity INTCOMMENT'Total product sales, positive integer',
amount DECIMAL(10,2)COMMENT'Total product sales in monetary terms, two decimal places kept')COMMENT'Daily summary table of product sales situations'
PARTITIONED BY(pt_date STRING)row format delimited fieldsterminatedby'\t'
STORED AS PARQUET;
Write data into the summary table
Next, repeat steps 7 - 15 to insert data into the summary table. Pay attention to the following points:
Note:
The following steps correspond to the step numbers indicated in the illustrations
Create a new HiveQL node in the same workflow;
Step 8: Name it: Insert Data into the Summary Table;
Steps 9 and 10: These two steps are often overlooked;
Step 11: The HiveQL statement is as follows;
Step 12:
Change the schedule cycle to: Cycle
Set the execution time to: 01:00
Description: This task needs to run once daily
Step 14: No matter how familiar you are with the operation, remember to test run before and after submission to ensure the task runs correctly.
Insert data into the Summary Table HiveQL statement:
--Insert data into the Summary Table HiveQL statementSET hive.exec.dynamic.partition.mode=nonstrict;INSERTINTOTABLE emall.dws_trade_order_productsales_1d PARTITION(pt_date)SELECT
p.pt_date AS order_date,
p.shipping_city_id,
p.shipping_city_name,
p.category_id,
p.category_name,SUM(p.quantity)AS quantity,SUM(p.amount)AS amount,
p.pt_date
FROM
emall.dwd_trade_order_ordercreate_productsales p
WHERE p.pt_date ='${yyyy-MM-dd-1d}'GROUPBY
p.pt_date,
p.shipping_city_id,
We have completed the offline development tasks for the detail table and summary table. Every early morning, WeData will automatically perform the calculation tasks for the detail and summary tables.
Establish dependencies and submit
1. Double-click the mall data development workflow. A workflow canvas will pop up, displaying all the tasks under the workflow. Establish the dependencies among the workflows in sequence. Once completed, click the submit button.