这个内容是群友问我的一个问题,他已经解决了,但是通过表关联,使用笛卡尔积的方式进行处理,为了进一步优化,他问我有没有不用笛卡尔积的方式计算。
有用户盈利记录表 t_case1_ods_user_profit,包含三个字段,分别为交易日期,用户ID,用户当日盈利金额(如果亏损则为负数);样例数据如下:
+-------------+----------+---------+
| trade_date | user_id | profit |
+-------------+----------+---------+
| 2026-03-12 | 1001 | 3.00 |
| 2026-03-13 | 1002 | 5.00 |
| 2026-03-13 | 1001 | -5.00 |
| 2026-03-14 | 1003 | 7.00 |
| 2026-03-14 | 1004 | 7.00 |
| 2026-03-14 | 1002 | 7.00 |
| 2026-03-15 | 1002 | -7.00 |
| 2026-03-15 | 1001 | -7.00 |
+-------------+----------+---------+
建表语句
create table t_case1_ods_user_profit(
trade_date string COMMENT '交易日期 yyyy-mm-dd',
user_id bigint comment '用户ID',
profit decimal(10,2) COMMENT '盈利金额'
) COMMENT='用户每日盈亏表';
INSERT INTO TABLE t_case1_ods_user_profit VALUES
('2026-03-12', 1001, 3),
('2026-03-13', 1002, 5),
('2026-03-13', 1001, -5),
('2026-03-14', 1003, 7),
('2026-03-14', 1004, 7),
('2026-03-14', 1002, 7),
('2026-03-15', 1002, -7),
('2026-03-15', 1001, -7);
需求内容: 希望能够查询下面数据
用户盈利记录表中,记录了每日有交易过程的用户当期盈利金额,如果亏损则记为负数。现希望查询出截止到每日的累积盈利的用户数;
分析: 因为盈利记录表中仅存在当日有交易记录,这样我们进行累积求和的结果是不能满足要求的,这是该题目困难的原因。
我们先使用SQL完成需求内容。
既然题目中没有,那么我们将其补全,即为每个用户补全所有日期的交易记录,如果用户不存在交易记录,则补当日盈亏金额为0;
select t_date.trade_date,
t_user.user_id,
0 as profit
from (select trade_date
from t_case1_ods_user_profit
group by trade_date) t_date
join (select user_id
from t_case1_ods_user_profit
group by user_id) t_user;
执行结果
+-------------+----------+---------+
| trade_date | user_id | profit |
+-------------+----------+---------+
| 2026-03-12 | 1002 | 0 |
| 2026-03-12 | 1001 | 0 |
| 2026-03-12 | 1003 | 0 |
| 2026-03-12 | 1004 | 0 |
| 2026-03-14 | 1002 | 0 |
| 2026-03-14 | 1001 | 0 |
| 2026-03-14 | 1003 | 0 |
| 2026-03-14 | 1004 | 0 |
| 2026-03-13 | 1002 | 0 |
| 2026-03-13 | 1001 | 0 |
| 2026-03-13 | 1003 | 0 |
| 2026-03-13 | 1004 | 0 |
| 2026-03-15 | 1002 | 0 |
| 2026-03-15 | 1001 | 0 |
| 2026-03-15 | 1003 | 0 |
| 2026-03-15 | 1004 | 0 |
+-------------+----------+---------+
16 rows selected (10.268 seconds)(dwsql.com)
select trade_date,
user_id,
sum(profit) as profit
from (select t_date.trade_date,
t_user.user_id,
0 as profit
from (select trade_date
from t_case1_ods_user_profit
group by trade_date) t_date
join (select user_id
from t_case1_ods_user_profit
group by user_id) t_user
union all
select trade_date, user_id, profit
from t_case1_ods_user_profit) t
group by trade_date,
user_id
order by trade_date,
user_id --为方便查看数据添加
执行结果
+-------------+----------+---------+
| trade_date | user_id | profit |
+-------------+----------+---------+
| 2026-03-12 | 1001 | 3.00 |
| 2026-03-12 | 1002 | 0.00 |
| 2026-03-12 | 1003 | 0.00 |
| 2026-03-12 | 1004 | 0.00 |
| 2026-03-13 | 1001 | -5.00 |
| 2026-03-13 | 1002 | 5.00 |
| 2026-03-13 | 1003 | 0.00 |
| 2026-03-13 | 1004 | 0.00 |
| 2026-03-14 | 1001 | 0.00 |
| 2026-03-14 | 1002 | 7.00 |
| 2026-03-14 | 1003 | 7.00 |
| 2026-03-14 | 1004 | 7.00 |
| 2026-03-15 | 1001 | -7.00 |
| 2026-03-15 | 1002 | -7.00 |
| 2026-03-15 | 1003 | 0.00 |
| 2026-03-15 | 1004 | 0.00 |
+-------------+----------+---------+
16 rows selected (1.249 seconds)(dwsql.com)
select trade_date,
user_id,
profit,
sum(profit) over (partition by user_id order by trade_date asc) as total_profit --关键开窗函数
from (select trade_date,
user_id,
sum(profit) as profit
from (select t_date.trade_date,
t_user.user_id,
0 as profit
from (select trade_date
from t_case1_ods_user_profit
group by trade_date) t_date
join (select user_id
from t_case1_ods_user_profit
group by user_id) t_user
union all
select trade_date, user_id, profit
from t_case1_ods_user_profit) t
group by trade_date,
user_id) tt
执行结果
+-------------+----------+---------+---------------+
| trade_date | user_id | profit | total_profit |
+-------------+----------+---------+---------------+
| 2026-03-12 | 1001 | 3.00 | 3.00 |
| 2026-03-13 | 1001 | -5.00 | -2.00 |
| 2026-03-14 | 1001 | 0.00 | -2.00 |
| 2026-03-15 | 1001 | -7.00 | -9.00 |
| 2026-03-12 | 1002 | 0.00 | 0.00 |
| 2026-03-13 | 1002 | 5.00 | 5.00 |
| 2026-03-14 | 1002 | 7.00 | 12.00 |
| 2026-03-15 | 1002 | -7.00 | 5.00 |
| 2026-03-12 | 1003 | 0.00 | 0.00 |
| 2026-03-13 | 1003 | 0.00 | 0.00 |
| 2026-03-14 | 1003 | 7.00 | 7.00 |
| 2026-03-15 | 1003 | 0.00 | 7.00 |
| 2026-03-12 | 1004 | 0.00 | 0.00 |
| 2026-03-13 | 1004 | 0.00 | 0.00 |
| 2026-03-14 | 1004 | 7.00 | 7.00 |
| 2026-03-15 | 1004 | 0.00 | 7.00 |
+-------------+----------+---------+---------------+
16 rows selected (0.952 seconds)(dwsql.com)
select trade_date, count(case when total_profit > 0 then user_id end) as user_cnt
from (select trade_date,
user_id,
profit,
sum(profit) over (partition by user_id order by trade_date asc) as total_profit
from (select trade_date,
user_id,
sum(profit) as profit
from (select t_date.trade_date,
t_user.user_id,
0 as profit
from (select trade_date
from t_case1_ods_user_profit
group by trade_date) t_date
join (select user_id
from t_case1_ods_user_profit
group by user_id) t_user
union all
select trade_date, user_id, profit
from t_case1_ods_user_profit) t
group by trade_date,
user_id) tt) ttt
group by trade_date
order by trade_date asc;
执行结果
+-------------+-----------+
| trade_date | user_cnt |
+-------------+-----------+
| 2026-03-12 | 1 |
| 2026-03-13 | 1 |
| 2026-03-14 | 3 |
| 2026-03-15 | 3 |
+-------------+-----------+
4 rows selected (1.252 seconds)(dwsql.com)
前面先做笛卡尔积,太过“暴力” ,会产生大量的数据,产生大量计算;我们稍微优化一下;
select trade_date,
user_id,
profit,
sum(profit) over (partition by user_id order by trade_date asc) as total_profit --累积求和
from t_case1_ods_user_profit
执行结果
+-------------+----------+---------+---------------+
| trade_date | user_id | profit | total_profit |
+-------------+----------+---------+---------------+
| 2026-03-12 | 1001 | 3.00 | 3.00 |
| 2026-03-13 | 1001 | -5.00 | -2.00 |
| 2026-03-15 | 1001 | -7.00 | -9.00 |
| 2026-03-13 | 1002 | 5.00 | 5.00 |
| 2026-03-14 | 1002 | 7.00 | 12.00 |
| 2026-03-15 | 1002 | -7.00 | 5.00 |
| 2026-03-14 | 1003 | 7.00 | 7.00 |
| 2026-03-14 | 1004 | 7.00 | 7.00 |
+-------------+----------+---------+---------------+
8 rows selected (0.352 seconds)(dwsql.com)
**说明:**我们没有办法对该数据进行直接统计,因为如果用户在某天不存在交易,则当日不会有其记录,e.g. 1001 用户在3月14日没有记录,所以接下来我们要使用笛卡尔积来完成缺失数据的补足;
select t_date.trade_date,
t_profit.trade_date,
t_profit.user_id,
t_profit.profit,
t_profit.total_profit,
row_number() over (partition by t_date.trade_date, t_profit.user_id order by t_profit.trade_date desc) as rn --注意倒排序
from (select trade_date
from t_case1_ods_user_profit
group by trade_date) t_date
join (select trade_date,
user_id,
profit,
sum(profit) over (partition by user_id order by trade_date asc) as total_profit
from t_case1_ods_user_profit) t_profit
--注意没有on条件,所以是笛卡尔积
where t_profit.trade_date <= t_date.trade_date
执行结果
+-------------+-------------+----------+---------+---------------+-----+
| trade_date | trade_date | user_id | profit | total_profit | rn |
+-------------+-------------+----------+---------+---------------+-----+
| 2026-03-12 | 2026-03-12 | 1001 | 3.00 | 3.00 | 1 |
| 2026-03-13 | 2026-03-13 | 1001 | -5.00 | -2.00 | 1 |
| 2026-03-13 | 2026-03-12 | 1001 | 3.00 | 3.00 | 2 |
| 2026-03-13 | 2026-03-13 | 1002 | 5.00 | 5.00 | 1 |
| 2026-03-14 | 2026-03-13 | 1001 | -5.00 | -2.00 | 1 |
| 2026-03-14 | 2026-03-12 | 1001 | 3.00 | 3.00 | 2 |
| 2026-03-14 | 2026-03-14 | 1002 | 7.00 | 12.00 | 1 |
| 2026-03-14 | 2026-03-13 | 1002 | 5.00 | 5.00 | 2 |
| 2026-03-14 | 2026-03-14 | 1003 | 7.00 | 7.00 | 1 |
| 2026-03-14 | 2026-03-14 | 1004 | 7.00 | 7.00 | 1 |
| 2026-03-15 | 2026-03-15 | 1001 | -7.00 | -9.00 | 1 |
| 2026-03-15 | 2026-03-13 | 1001 | -5.00 | -2.00 | 2 |
| 2026-03-15 | 2026-03-12 | 1001 | 3.00 | 3.00 | 3 |
| 2026-03-15 | 2026-03-15 | 1002 | -7.00 | 5.00 | 1 |
| 2026-03-15 | 2026-03-14 | 1002 | 7.00 | 12.00 | 2 |
| 2026-03-15 | 2026-03-13 | 1002 | 5.00 | 5.00 | 3 |
| 2026-03-15 | 2026-03-14 | 1003 | 7.00 | 7.00 | 1 |
| 2026-03-15 | 2026-03-14 | 1004 | 7.00 | 7.00 | 1 |
+-------------+-------------+----------+---------+---------------+-----+
18 rows selected (0.809 seconds)
select new_trade_date,
count(case when total_profit > 0 then user_id end) as user_cnt
from (select t_date.trade_date as new_trade_date,
t_profit.trade_date,
t_profit.user_id,
t_profit.profit,
t_profit.total_profit,
row_number() over (partition by t_date.trade_date, t_profit.user_id order by t_profit.trade_date desc) as rn --注意倒排序
from (select trade_date
from t_case1_ods_user_profit
group by trade_date) t_date
join (select trade_date,
user_id,
profit,
sum(profit) over (partition by user_id order by trade_date asc) as total_profit
from t_case1_ods_user_profit) t_profit
--注意没有on条件,所以是笛卡尔积
where t_profit.trade_date <= t_date.trade_date) t
where t.rn = 1
group by new_trade_date
order by new_trade_date;
执行结果
+-----------------+-----------+
| new_trade_date | user_cnt |
+-----------------+-----------+
| 2026-03-12 | 1 |
| 2026-03-13 | 1 |
| 2026-03-14 | 3 |
| 2026-03-15 | 3 |
+-----------------+-----------+
4 rows selected (11.858 seconds)
方法二还是笛卡尔积,数据还是很大,我们换一种思路来计算,避免笛卡尔积。因为只是想要盈利用户数量,并不需要找出具体是哪个用户,所以可以使用累加的方式。
select trade_date,
user_id,
profit,
sum(profit) over (partition by user_id order by trade_date asc) as total_profit --累积求和
from t_case1_ods_user_profit
执行结果
+-------------+----------+---------+---------------+
| trade_date | user_id | profit | total_profit |
+-------------+----------+---------+---------------+
| 2026-03-12 | 1001 | 3.00 | 3.00 |
| 2026-03-13 | 1001 | -5.00 | -2.00 |
| 2026-03-15 | 1001 | -7.00 | -9.00 |
| 2026-03-13 | 1002 | 5.00 | 5.00 |
| 2026-03-14 | 1002 | 7.00 | 12.00 |
| 2026-03-15 | 1002 | -7.00 | 5.00 |
| 2026-03-14 | 1003 | 7.00 | 7.00 |
| 2026-03-14 | 1004 | 7.00 | 7.00 |
+-------------+----------+---------+---------------+
8 rows selected (0.352 seconds)(dwsql.com)
select trade_date,
user_id,
profit,
total_profit,
lag(total_profit) over (partition by user_id order by trade_date) as last_total_profit
from (select trade_date,
user_id,
profit,
sum(profit) over (partition by user_id order by trade_date asc) as total_profit --累积求和
from t_case1_ods_user_profit) t
执行结果
+-------------+----------+---------+---------------+--------------------+
| trade_date | user_id | profit | total_profit | last_total_profit |
+-------------+----------+---------+---------------+--------------------+
| 2026-03-12 | 1001 | 3.00 | 3.00 | NULL |
| 2026-03-13 | 1001 | -5.00 | -2.00 | 3.00 |
| 2026-03-15 | 1001 | -7.00 | -9.00 | -2.00 |
| 2026-03-13 | 1002 | 5.00 | 5.00 | NULL |
| 2026-03-14 | 1002 | 7.00 | 12.00 | 5.00 |
| 2026-03-15 | 1002 | -7.00 | 5.00 | 12.00 |
| 2026-03-14 | 1003 | 7.00 | 7.00 | NULL |
| 2026-03-14 | 1004 | 7.00 | 7.00 | NULL |
+-------------+----------+---------+---------------+--------------------+
8 rows selected (8.925 seconds)
select trade_date,
user_id,
profit,
total_profit,
last_total_profit,
case
when last_total_profit is null and total_profit > 0 then 1
when last_total_profit is null and total_profit < 0 then 0
when last_total_profit > 0 and total_profit > 0 then 0
when last_total_profit <= 0 and total_profit <= 0 then 0
when last_total_profit > 0 and total_profit <= 0 then -1
when last_total_profit <= 0 and total_profit > 0 then 1
end as cnt_change
from (select trade_date,
user_id,
profit,
total_profit,
lag(total_profit) over (partition by user_id order by trade_date) as last_total_profit
from (select trade_date,
user_id,
profit,
sum(profit) over (partition by user_id order by trade_date asc) as total_profit --累积求和
from t_case1_ods_user_profit) t) tt
执行结果
+-------------+----------+---------+---------------+--------------------+-------------+
| trade_date | user_id | profit | total_profit | last_total_profit | cnt_change |
+-------------+----------+---------+---------------+--------------------+-------------+
| 2026-03-12 | 1001 | 3.00 | 3.00 | NULL | 1 |
| 2026-03-13 | 1001 | -5.00 | -2.00 | 3.00 | -1 |
| 2026-03-15 | 1001 | -7.00 | -9.00 | -2.00 | 0 |
| 2026-03-13 | 1002 | 5.00 | 5.00 | NULL | 1 |
| 2026-03-14 | 1002 | 7.00 | 12.00 | 5.00 | 0 |
| 2026-03-15 | 1002 | -7.00 | 5.00 | 12.00 | 0 |
| 2026-03-14 | 1003 | 7.00 | 7.00 | NULL | 1 |
| 2026-03-14 | 1004 | 7.00 | 7.00 | NULL | 1 |
+-------------+----------+---------+---------------+--------------------+-------------+
8 rows selected (0.52 seconds)(dwsql.com)
select trade_date,
sum(day_cnt_change) over (order by trade_date) as user_cnt
from (select trade_date,
sum(cnt_change) as day_cnt_change
from (select trade_date,
user_id,
profit,
total_profit,
last_total_profit,
case
when last_total_profit is null and total_profit > 0 then 1
when last_total_profit is null and total_profit < 0 then 0
when last_total_profit > 0 and total_profit > 0 then 0
when last_total_profit <= 0 and total_profit <= 0 then 0
when last_total_profit > 0 and total_profit <= 0 then -1
when last_total_profit <= 0 and total_profit > 0 then 1
end as cnt_change
from (select trade_date,
user_id,
profit,
total_profit,
lag(total_profit) over (partition by user_id order by trade_date) as last_total_profit
from (select trade_date,
user_id,
profit,
sum(profit) over (partition by user_id order by trade_date asc) as total_profit --累积求和
from t_case1_ods_user_profit) t) tt) ttt
group by trade_date) tttt
执行结果
+-------------+-----------+
| trade_date | user_cnt |
+-------------+-----------+
| 2026-03-12 | 1 |
| 2026-03-13 | 1 |
| 2026-03-14 | 3 |
| 2026-03-15 | 3 |
+-------------+-----------+
4 rows selected (1.883 seconds)
SQL我们写完了,即便按照方法三,没有笛卡尔积的方式,每次查询我们都需要不断的查询所有历史数据,进行一遍遍的开窗和聚合,这在生产过程是不可接受的。所以如果日常需要查看该数据、或者经常性的分析,则需要通过建模的方式减少类似的查询计算。
如果我们一张表记录了截止到前天所有用户的用户累计盈利金额,叠加昨天用户盈利金额,就可以算出截止到昨天的用户累计盈利金额了。 这种方式叫做滚表的方式,这样加工时每天仅需要处理当天的日期和全量用户前一天的状态,查询是直接根据日期进行统计即可拿到统计结果。 这也是通过存储换计算的方式,也可以理解为空间换时间(存储空间换查询时的查询时间)
为了方便加工,我们创建两张表 用户每日盈亏表(分区表)(注:该表在实际生产环境,用户每日盈亏表应该就已经是分区表了,无需我们额外处理),用户每日累积盈利金额状态表(分区表)。
CREATE TABLE t_case1_dwd_user_profit (
user_id BIGINT COMMENT '用户ID',
profit DECIMAL(10,2) COMMENT '盈利金额'
)
COMMENT '用户每日盈亏表'
PARTITIONED BY (trade_date STRING COMMENT '交易日期 yyyy-mm-dd');
CREATE TABLE IF NOT EXISTS t_case1_dwd_user_profit_status (
user_id BIGINT COMMENT '用户ID',
profit DECIMAL(10,2) COMMENT '当日盈利金额',
total_profit DECIMAL(18,2) COMMENT '累计盈利金额'
)
COMMENT '用户盈利状态明细表'
PARTITIONED BY (stat_date STRING COMMENT '统计日期 yyyy-mm-dd');
默认原始日志数据是分区表内数据,我们将数据恢复到分区表中,方便进行下一步处理。(当然对于写SQL来讲,是否做分区无所谓,但是在工作中,一定会有)
insert overwrite table t_case1_dwd_user_profit partition(trade_date)
select
user_id,
profit,
trade_date
from t_case1_ods_user_profit;
滚表的初始数据、初始加工过程和后面的加工过程不相同,所以需要写两个逻辑,初始逻辑和滚动逻辑。
select
insert overwrite table t_case1_dwd_user_profit_status partition (stat_date)
select user_id,
profit,
profit as total_profit
trade_date as stat_date
from t_case1_dwd_user_profit
where trade_date = '2026-03-12'
insert overwrite table t_case1_dwd_user_profit_status partition (stat_date = '2026-03-13')
select coalesce(t1.user_id, t2.user_id) as user_id,
t2.profit,
coalesce(t1.total_profit,0) + coalesce(t2.profit, 0) as total_profit
from (
--昨日状态数据
select user_id,
profit,
total_profit,
stat_date
from t_case1_dwd_user_profit_status
where stat_date = '2026-03-12') t1
full outer join
(select user_id,
profit,
trade_date as stat_date
from t_case1_dwd_user_profit
where trade_date = '2026-03-13'
) t2
on t1.user_id = t2.user_id
14日加工
insert overwrite table t_case1_dwd_user_profit_status partition (stat_date = '2026-03-14')
select coalesce(t1.user_id, t2.user_id) as user_id,
t2.profit,
coalesce(t1.total_profit,0) + coalesce(t2.profit, 0) as total_profit
from (
--昨日状态数据
select user_id,
profit,
total_profit,
stat_date
from t_case1_dwd_user_profit_status
where stat_date = '2026-03-13') t1
full outer join
(select user_id,
profit,
trade_date as stat_date
from t_case1_dwd_user_profit
where trade_date = '2026-03-14'
) t2
on t1.user_id = t2.user_id
15日加工
insert overwrite table t_case1_dwd_user_profit_status partition (stat_date = '2026-03-15')
select coalesce(t1.user_id, t2.user_id) as user_id,
t2.profit,
coalesce(t1.total_profit,0) + coalesce(t2.profit, 0) as total_profit
from (
--昨日状态数据
select user_id,
profit,
total_profit,
stat_date
from t_case1_dwd_user_profit_status
where stat_date = '2026-03-14') t1
full outer join
(select user_id,
profit,
trade_date as stat_date
from t_case1_dwd_user_profit
where trade_date = '2026-03-15'
) t2
on t1.user_id = t2.user_id
后续日期根据实际情况加工,生产环境使用参数每日处理即可,最终分区表内的数据如下
结果数据
+----------+---------+---------------+-------------+
| user_id | profit | total_profit | stat_date |
+----------+---------+---------------+-------------+
| 1001 | 3.00 | 3.00 | 2026-03-12 |
| 1001 | -5.00 | -2.00 | 2026-03-13 |
| 1002 | 5.00 | 5.00 | 2026-03-13 |
| 1001 | NULL | -2.00 | 2026-03-14 |
| 1002 | 7.00 | 12.00 | 2026-03-14 |
| 1003 | 7.00 | 7.00 | 2026-03-14 |
| 1004 | 7.00 | 7.00 | 2026-03-14 |
| 1001 | -7.00 | -9.00 | 2026-03-15 |
| 1002 | -7.00 | 5.00 | 2026-03-15 |
| 1003 | NULL | 7.00 | 2026-03-15 |
| 1004 | NULL | 7.00 | 2026-03-15 |
+----------+---------+---------------+-------------+
11 rows selected (0.525 seconds)
select stat_date,
count(case
when total_profit > 0 then user_id end) as user_cnt
from t_case1_dwd_user_profit_status
group by stat_date
查询结果
+-------------+-----------+
| stat_date | user_cnt |
+-------------+-----------+
| 2026-03-12 | 1 |
| 2026-03-13 | 1 |
| 2026-03-14 | 3 |
| 2026-03-15 | 3 |
+-------------+-----------+
4 rows selected (0.725 seconds)