前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >维度模型数据仓库(八) —— 维度子集

维度模型数据仓库(八) —— 维度子集

作者头像
用户1148526
发布2022-12-02 14:12:05
4100
发布2022-12-02 14:12:05
举报
文章被收录于专栏:Hadoop数据仓库

(五)进阶技术         3. 维度子集         有些需求不需要最细节的数据。例如更想要某个月而不是某天的记录。再比如相对于全部的销售数据,可能对某些特定状态的数据更感兴趣等。这些特定维度包含在从细节维度选择的行中,所以叫维度子集。维度子集比细节维度小,因此更易使用,查询也更快。         本篇中将准备两个特定维度,它们均取自现有的维度:月份维度(日期维度的子集),Pennsylvania州客户维度(客户维度的子集)。清单(五)-3-1里的脚本用于建立月份维度,并从日期维度初始装载月份维度。注意月份维度不包含promo_ind列,该列不适用月层次上,因为一个月中可能有多个促销期。促销标记适用于日层次。

代码语言:javascript
复制
USE dw;

CREATE TABLE month_dim (
    month_sk INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    month_name VARCHAR(9),
    month INT(2),
    quarter INT(1),
    year INT(4),
    effective_date DATE,
    expiry_date DATE
) ;

INSERT INTO month_dim
SELECT DISTINCT
  NULL
, month_name
, month
, quarter
, year
, effective_date
, expiry_date
FROM date_dim ;

COMMIT ;

清单(五)-3-1

        为了从日期维度同步导入月份维度,要把月份装载嵌入到日期维度的预装载脚本中。需要修改“准备数据仓库模拟环境”里生成日期维度数据的存储过程。清单(五)-3-2中显示了修改后的存储过程。无论何时用修改后的脚本增加日期记录时,如果这个日期所在的月份没在月份维度中,那么该月份会被装载到月份维度中。

代码语言:javascript
复制
USE dw;

DELIMITER // ;

DROP PROCEDURE IF EXISTS pre_populate_date //

CREATE PROCEDURE pre_populate_date (IN start_dt DATE, IN end_dt
       DATE)
BEGIN
       WHILE start_dt <= end_dt DO
             INSERT INTO date_dim(
               date_sk
             , date
             , month_name
             , month
             , quarter
             , year
             , effective_date
             , expiry_date
             )
             VALUES(
               NULL
             , start_dt
             , MONTHNAME(start_dt)
             , MONTH(start_dt)
             , QUARTER(start_dt)
             , YEAR (start_dt)
             , '0000-00-00'
             , '9999-12-31'
     )
             ;
             SET start_dt = ADDDATE(start_dt, 1);
      END WHILE;

      INSERT INTO month_dim
      SELECT DISTINCT
        NULL
        , month_name
        , month
        , quarter
        , year
        , effective_date
        , expiry_date
      FROM date_dim
     WHERE CONCAT(month, year) NOT IN (SELECT CONCAT(month, year) FROM month_dim) ;

END
//

DELIMITER ; //

清单(五)-3-2

为测试修改后的日期预装载,使用下面的命令运行存储过程增加从2021年1月1日到2021年12月31日的日期。 USE dw ; call pre_populate_date ('2021-01-01', '2021-12-31'); COMMIT ; 使用下面的语句查询month_dim表,确认有12个月份被正确装载。 select * from month_dim where year = 2021 ;

        在“准备数据仓库模拟环境”里除了使用存储过程,还使用了Kettle转换预装载日期维度数据。这里对转换也做一些修改,使之同时预装载日期维度和月份维度数据。修改后的Kettle转换步骤如图(五)- 3-1到图(五)- 3-13所示。其中JavaScript步骤的代码没有变化,参见“准备数据仓库模拟环境”里的清单(二)- 2。

图(五)- 3-1

图(五)- 3-2

图(五)- 3-3

图(五)- 3-4

图(五)- 3-5

图(五)- 3-6

图(五)- 3-7

图(五)- 3-8

图(五)- 3-9

图(五)- 3-10

图(五)- 3-11

图(五)- 3-12

图(五)- 3-13

        月份维度是一个上卷维度,它包含基本维度的上层数据。而特定维度子集是选择基本维度的一个特定子集。清单(五)-3-3里的脚本建立特定维度表,并导入Pennsylvania (PA)客户维度子集。

代码语言:javascript
复制
USE dw;

CREATE TABLE pa_customer_dim (
    customer_sk INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    customer_number INT,
    customer_name VARCHAR(50),
    customer_street_address VARCHAR(50),
    customer_zip_code INT(5),
    customer_city VARCHAR(30),
    customer_state VARCHAR(2),
    shipping_address VARCHAR(50),
    shipping_zip_code INT(5),
    shipping_city VARCHAR(30),
    shipping_state VARCHAR(2),
	version INT,
    effective_date DATE,
    expiry_date DATE
);

INSERT INTO pa_customer_dim
SELECT
  customer_sk
, customer_number
, customer_name
, customer_street_address
, customer_zip_code
, customer_city
, customer_state
, shipping_address
, shipping_zip_code
, shipping_city
, shipping_state
, version
, effective_date
, expiry_date
FROM customer_dim
WHERE customer_state = 'PA' ;

COMMIT ;

清单(五)-3-3

        注意 PA客户维度子集与月份维度子集有两点区别:

  • pa_customer_dim表和customer_dim表有完全相同的列。month_dim不包含date_dim表的日期列。
  • pa_customer_dim表的代理键就是客户维度的代理键。month_dim表里的月份维度代理键并不来自日期维度。

为了测试清单(五)-3-3的脚本,先用下面的命令添加三个Ohio的客户。 USE dw; INSERT INTO customer_dim ( customer_sk , customer_number , customer_name , customer_street_address , customer_zip_code , customer_city , customer_state , shipping_address , shipping_zip_code , shipping_city , shipping_state , version , effective_date , expiry_date ) VALUES   (NULL, 10, 'Bigger Customers', '7777 Ridge Rd.', '44102',        'Cleveland', 'OH', '7777 Ridge Rd.', '44102', 'Cleveland',        'OH', 1, '2015-03-02', '2200-01-01') , (NULL, 11, 'Smaller Stores', '8888 Jennings Fwy.', '44102',        'Cleveland', 'OH', '8888 Jennings Fwy.', '44102',        'Cleveland', 'OH', 1, '2015-03-02', '2200-01-01') , (NULL, 12, 'Small-Medium Retailers', '9999 Memphis Ave.', '44102',        'Cleveland', 'OH', '9999 Memphis Ave.', '44102', 'Cleveland',        'OH', 1, '2015-03-02', '2200-01-01') ; COMMIT ; 执行清单(五)-3-3里的脚本,使用下面的查询语句确认三个OH用户信息被正确装载。确认三个OH用户信息被正确装载,pa_customer_dim表中只有PA客户的信息。 select customer_name, customer_state, effective_date from customer_dim; select customer_name, customer_state, effective_date from pa_customer_dim;         修改定期装载         无论何时当一个新的PA客户信息插入到客户维度表,它也应该插入PA客户维度表。所以应该把PA客户维度子集的装载合并到数据仓库定期装载中。清单(五)-3-4显示了修改后的定期装载脚本。改变(增加)注意,在每次运行按天定期导入脚本时会重建(清除,然后添加所有的PA客户)PA客户维度子集。

代码语言:javascript
复制
USE dw;

-- 设置SCD的截止时间和生效时间
SET @pre_date = SUBDATE(CURRENT_DATE,1) ;

-- 设置CDC的上限时间
UPDATE cdc_time SET current_load = CURRENT_DATE ;

-- 装载客户维度
TRUNCATE TABLE customer_stg;
INSERT INTO customer_stg
SELECT 
  customer_number
, customer_name
, customer_street_address
, customer_zip_code
, customer_city
, customer_state
, shipping_address
, shipping_zip_code
, shipping_city
, shipping_state
FROM source.customer ;
/* 在所有地址列上 SCD2                           */
/* 置过期                          */
UPDATE customer_dim a,
    customer_stg b 
SET 
    expiry_date = @pre_date
WHERE
    a.customer_number = b.customer_number
        AND (a.customer_street_address <> b.customer_street_address
        OR a.customer_city <> b.customer_city
        OR a.customer_zip_code <> b.customer_zip_code
        OR a.customer_state <> b.customer_state
        OR a.shipping_address <> b.shipping_address
        OR a.shipping_city <> b.shipping_city
        OR a.shipping_zip_code <> b.shipping_zip_code
        OR a.shipping_state <> b.shipping_state
        OR a.shipping_address IS NULL
        OR a.shipping_city IS NULL
        OR a.shipping_zip_code IS NULL
        OR a.shipping_state IS NULL)
        AND expiry_date = '2200-01-01';
/* 加新行                          */
INSERT INTO customer_dim
SELECT
  NULL
, b.customer_number
, b.customer_name
, b.customer_street_address
, b.customer_zip_code
, b.customer_city
, b.customer_state
, b.shipping_address
, b.shipping_zip_code
, b.shipping_city
, b.shipping_state
, a.version + 1
, @pre_date
, '2200-01-01'
FROM
  customer_dim a
, customer_stg b
WHERE
    a.customer_number = b.customer_number
AND ( a.customer_street_address <> b.customer_street_address
     OR a.customer_city <> b.customer_city
     OR a.customer_zip_code <> b.customer_zip_code
     OR a.customer_state <> b.customer_state
     OR a.shipping_address <> b.shipping_address
     OR a.shipping_city <> b.shipping_city
     OR a.shipping_zip_code <> b.shipping_zip_code
     OR a.shipping_state <> b.shipping_state
     OR a.shipping_address IS NULL
     OR a.shipping_city IS NULL
     OR a.shipping_zip_code IS NULL
     OR a.shipping_state IS NULL)
AND EXISTS(
SELECT *
FROM customer_dim x
WHERE
    b.customer_number=x.customer_number
AND a.expiry_date = @pre_date )
AND NOT EXISTS (
SELECT *
FROM customer_dim y
WHERE
    b.customer_number = y.customer_number
AND y.expiry_date = '2200-01-01') ;
/* 在 customer_name 列上 SCD1                                             */
UPDATE customer_dim a, customer_stg b
SET a.customer_name = b.customer_name
WHERE a.customer_number = b.customer_number
      AND a.customer_name <> b.customer_name ;
/* 新增的客户                                                   */
INSERT INTO customer_dim
SELECT
  NULL
, customer_number
, customer_name
, customer_street_address
, customer_zip_code
, customer_city
, customer_state
, shipping_address
, shipping_zip_code
, shipping_city
, shipping_state
, 1
, @pre_date
,'2200-01-01'
FROM customer_stg
WHERE customer_number NOT IN(
SELECT y.customer_number
FROM customer_dim x, customer_stg y
WHERE x.customer_number = y.customer_number) ;

/* 重建PA客户维度                               */
TRUNCATE pa_customer_dim;
INSERT INTO pa_customer_dim
SELECT
  customer_sk
, customer_number
, customer_name
, customer_street_address
, customer_zip_code
, customer_city
, customer_state
, shipping_address
, shipping_zip_code
, shipping_city
, shipping_state
, version
, effective_date
, expiry_date
FROM customer_dim
WHERE customer_state = 'PA' ;

/* 装载产品维度                                           */
TRUNCATE TABLE product_stg ;
INSERT INTO product_stg
SELECT 
  product_code
, product_name
, product_category
FROM source.product ;
/* 在 product_name 和 product_category 列上 SCD2                                    */
/* 置过期                                 */
UPDATE
  product_dim a
, product_stg b
SET
  expiry_date = @pre_date
WHERE
    a.product_code = b.product_code
AND (   a.product_name <> b.product_name
     OR a.product_category <> b.product_category)
AND expiry_date = '2200-01-01';
/* 加新行                                */
INSERT INTO product_dim
SELECT
  NULL
, b.product_code
, b.product_name
, b.product_category
, a.version + 1
, @pre_date
,'2200-01-01'
FROM
  product_dim a
, product_stg b
WHERE
    a.product_code = b.product_code
AND (   a.product_name <> b.product_name
     OR a.product_category <> b.product_category)
AND EXISTS(
SELECT *
FROM product_dim x
WHERE     b.product_code = x.product_code
      AND a.expiry_date = @pre_date)
AND NOT EXISTS (
SELECT *
FROM product_dim y
WHERE     b.product_code = y.product_code
      AND y.expiry_date = '2200-01-01') ;
/* 新增的产品                                                    */
INSERT INTO product_dim
SELECT
  NULL
, product_code
, product_name
, product_category
, 1
, @pre_date
, '2200-01-01'
FROM product_stg
WHERE product_code NOT IN(
SELECT y.product_code
FROM product_dim x, product_stg y
WHERE x.product_code = y.product_code) ;

-- 装载订单维度,新增前一天的订单号
INSERT INTO order_dim (
  order_number
, effective_date
, expiry_date)
SELECT
  order_number
, order_date
, '2200-01-01'
FROM source.sales_order, cdc_time
WHERE entry_date >= last_load AND entry_date < current_load ;

-- 装载事实表,新增前一天的订单
INSERT INTO sales_order_fact
SELECT
  order_sk
, customer_sk
, product_sk
, date_sk
, order_amount
, order_quantity
FROM
  source.sales_order a
, order_dim b
, customer_dim c
, product_dim d
, date_dim e
, cdc_time f
WHERE
    a.order_number = b.order_number
AND a.customer_number = c.customer_number
AND a.order_date >= c.effective_date
AND a.order_date < c.expiry_date
AND a.product_code = d.product_code
AND a.order_date >= d.effective_date
AND a.order_date < d.expiry_date
AND a.order_date = e.date
AND a.entry_date >= f.last_load AND a.entry_date < f.current_load ;

-- 更新时间戳表的last_load字段
UPDATE cdc_time SET last_load = current_load ;

COMMIT ;

清单(五)-3-4

        使用Kettle步骤装PA客户维度子集,只需要增加一个表输入和一个表输出步骤即可,如图(五)- 3-14到(五)- 3-18所示。

图(五)- 3-14

图(五)- 3-15

图(五)- 3-16

图(五)- 3-17

图(五)- 3-18

测试修改后的定期装载 先运行下面的命令往客户维度里添加一个PA的客户和一个OH的客户。 USE dw; INSERT INTO customer_dim ( customer_sk , customer_number , customer_name , customer_street_address , customer_zip_code , customer_city , customer_state , shipping_address , shipping_zip_code , shipping_city , shipping_state , version , effective_date , expiry_date ) VALUES   (NULL, 13, 'PA Customer', '1111 Louise Dr.', '17050',        'Mechanicsburg', 'PA', '1111 Louise Dr.', '17050',        'Mechanicsburg', 'PA', 1, '2015-03-03', '2200-01-01') , (NULL, 14, 'OH Customer', '6666 Ridge Rd.', '44102',        'Cleveland', 'OH', '6666 Ridge Rd.', '44102',        'Cleveland', 'OH', 1, '2015-03-03', '2200-01-01') ; COMMIT ;         现在设置系统日期为2015年3月3日,这样老的数据就不会再次转载,然后使用清单(五)-3-4脚本或修改后的Kettle步骤执行定期转换。执行结果只是增加了pa_customer_dim表的19条记录,其它表的数据不变。 验证结果应该如下所示: mysql> select customer_name, customer_state, effective_date from pa_customer_dim; +------------------------+----------------+----------------+ | customer_name          | customer_state | effective_date | +------------------------+----------------+----------------+ | Really Large Customers | PA             | 2013-03-01     | | Small Stores           | PA             | 2013-03-01     | | Medium Retailers       | PA             | 2013-03-01     | | Good Companies         | PA             | 2013-03-01     | | Wonderful Shops        | PA             | 2013-03-01     | | Loyal Clients          | PA             | 2013-03-01     | | Distinguished Agencies | PA             | 2013-03-01     | | Loyal Clients          | PA             | 2015-03-01     | | Subsidiaries           | PA             | 2015-03-01     | | Really Large Customers | PA             | 2015-03-02     | | Small Stores           | PA             | 2015-03-02     | | Medium Retailers       | PA             | 2015-03-02     | | Good Companies         | PA             | 2015-03-02     | | Wonderful Shops        | PA             | 2015-03-02     | | Loyal Clients          | PA             | 2015-03-02     | | Distinguished Agencies | PA             | 2015-03-02     | | Subsidiaries           | PA             | 2015-03-02     | | Online Distributors    | PA             | 2015-03-02     | | PA Customer            | PA             | 2015-03-03     | +------------------------+----------------+----------------+ 19 rows in set (0.00 sec)

本文参与 腾讯云自媒体同步曝光计划,分享自作者个人站点/博客。
原始发表:2015-11-06,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 作者个人站点/博客 前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档