前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >字节华为大数据面试SQL-合并日期重叠的活动

字节华为大数据面试SQL-合并日期重叠的活动

作者头像
数据仓库晨曦
发布2024-05-27 12:42:42
1050
发布2024-05-27 12:42:42
举报
文章被收录于专栏:数据仓库技术

昨天文章发出后,发现写错了所以删除文章,今天修复后重新发出来,感谢指出错误的朋友。这里还是按照原有的解题思路进行处理,细节处会给出昨天错误地方的对比。

一、题目

已知有表记录了每个大厅的活动开始日期和结束日期,每个大厅可以有多个活动。请编写一个SQL查询合并在同一个大厅举行的所有重叠的活动,如果两个活动至少有一天相同,那他们就是重叠的

样例数据

代码语言:javascript
复制
+----------+-------------+-------------+
| hall_id  | start_date  |  end_date   |
+----------+-------------+-------------+
| 1        | 2023-01-13  | 2023-01-20  |
| 1        | 2023-01-14  | 2023-01-17  |
| 1        | 2023-01-14  | 2023-01-16  |
| 1        | 2023-01-18  | 2023-01-25  |
| 1        | 2023-01-20  | 2023-01-26  |
| 2        | 2022-12-09  | 2022-12-23  |
| 2        | 2022-12-13  | 2022-12-17  |
| 2        | 2022-12-20  | 2022-12-24  |
| 2        | 2022-12-25  | 2022-12-30  |
| 3        | 2022-12-01  | 2023-01-30  |
+----------+-------------+-------------+

结果

代码语言:javascript
复制
+----------+-------------+-------------+
| hall_id  | start_date  |  end_date   |
+----------+-------------+-------------+
| 1        | 2023-01-13  | 2023-01-26  |
| 2        | 2022-12-09  | 2022-12-24  |
| 2        | 2022-12-25  | 2022-12-30  |
| 3        | 2022-12-01  | 2023-01-30  |
+----------+-------------+-------------+

解释:有三个大厅。 大厅 1:

  • 两个活动["2823-01-13","2023-01-20"]和[“2023-01-14","2023-01-17"]重叠,我们将它们合并到一个活动中[“2023-01-13","2023-01-20"]。

大厅 2:

  • ["2022-12-25","2022-12-30"]不与任何其他活动重叠,所以我们保持原样。

二、分析

我们首先按照hall_id分组,根据start_date、end_date 升序排列,按照start_date 进行了升序排列,所以当前行的start_date一定晚于前一行的start_date,对当前行的start_date 和截止到上一行的最大end_date进行比较,如果当前行的start_date 小于等于截止到前一行最大end_date 代表有交叉,可以合并,否则代表不可合并。判断出是否可以合并之后,具体操作合并就转化成类似连续问题了。

维度

评分

题目难度

⭐️⭐️⭐️⭐️⭐️

题目清晰度

⭐️⭐️⭐️⭐️⭐️

业务常见度

⭐️⭐️⭐️⭐️

三、SQL

1.使用max()函数开窗,获得截止到当前行之前的活动最后日期

这里我们使用max(),按照hall_id进行分组,然后根据start_date和end_date进行排序,利用聚合函数开窗,开窗函数内有排序则聚合到当前行的特性,进行处理。其中我们这边需要聚合到当前行的上一行

执行SQL

代码语言:javascript
复制
select
    hall_id,
    start_date,
    end_date,
    max(end_date) over (partition by hall_id order by start_date asc,end_date asc rows between unbounded preceding and 1 preceding) as max_end_date
from t_hall_event

执行结果

代码语言:javascript
复制
+----------+-------------+-------------+---------------+
| hall_id  | start_date  |  end_date   | max_end_date  |
+----------+-------------+-------------+---------------+
| 1        | 2023-01-13  | 2023-01-20  | NULL          |
| 1        | 2023-01-14  | 2023-01-16  | 2023-01-20    |
| 1        | 2023-01-14  | 2023-01-17  | 2023-01-20    |
| 1        | 2023-01-18  | 2023-01-25  | 2023-01-20    |
| 1        | 2023-01-20  | 2023-01-26  | 2023-01-25    |
| 2        | 2022-12-09  | 2022-12-23  | NULL          |
| 2        | 2022-12-13  | 2022-12-17  | 2022-12-23    |
| 2        | 2022-12-20  | 2022-12-24  | 2022-12-23    |
| 2        | 2022-12-25  | 2022-12-30  | 2022-12-24    |
| 3        | 2022-12-01  | 2023-01-30  | NULL          |
+----------+-------------+-------------+---------------+

昨天错误写法

这里先看下结果,这一步还不太明显,在下一步判断是否应该被合并时,可以方便看出结果。

执行SQL

代码语言:javascript
复制
select
    hall_id,
    start_date,
    end_date,
    lag(end_date) over (partition by hall_id order by start_date asc,end_date asc) as last_end_date
from t_hall_event

执行结果

代码语言:javascript
复制
+----------+-------------+-------------+----------------+
| hall_id  | start_date  |  end_date   | last_end_date  |
+----------+-------------+-------------+----------------+
| 1        | 2023-01-13  | 2023-01-20  | NULL           |
| 1        | 2023-01-14  | 2023-01-16  | 2023-01-20     |
| 1        | 2023-01-14  | 2023-01-17  | 2023-01-16     |
| 1        | 2023-01-18  | 2023-01-25  | 2023-01-17     |
| 1        | 2023-01-20  | 2023-01-26  | 2023-01-25     |
| 2        | 2022-12-09  | 2022-12-23  | NULL           |
| 2        | 2022-12-13  | 2022-12-17  | 2022-12-23     |
| 2        | 2022-12-20  | 2022-12-24  | 2022-12-17     |
| 2        | 2022-12-25  | 2022-12-30  | 2022-12-24     |
| 3        | 2022-12-01  | 2023-01-30  | NULL           |
+----------+-------------+-------------+----------------+

2.根据当前行的start_day与max_end_date进行比较,得出是否可以合并标记;

执行SQL

代码语言:javascript
复制
select hall_id,
       start_date,
       end_date,
       max_end_date,
       if(start_date <= max_end_date, 0, 1) as is_merge --0:合并,1:不合并
from (select hall_id,
             start_date,
             end_date,
             max(end_date)
                 over (partition by hall_id order by start_date asc,end_date asc rows between unbounded preceding and 1 preceding) as max_end_date
      from t_hall_event) t

执行结果

代码语言:javascript
复制
+----------+-------------+-------------+---------------+-----------+
| hall_id  | start_date  |  end_date   | max_end_date  | is_merge  |
+----------+-------------+-------------+---------------+-----------+
| 1        | 2023-01-13  | 2023-01-20  | NULL          | 1         |
| 1        | 2023-01-14  | 2023-01-16  | 2023-01-20    | 0         |
| 1        | 2023-01-14  | 2023-01-17  | 2023-01-20    | 0         |
| 1        | 2023-01-18  | 2023-01-25  | 2023-01-20    | 0         |
| 1        | 2023-01-20  | 2023-01-26  | 2023-01-25    | 0         |
| 2        | 2022-12-09  | 2022-12-23  | NULL          | 1         |
| 2        | 2022-12-13  | 2022-12-17  | 2022-12-23    | 0         |
| 2        | 2022-12-20  | 2022-12-24  | 2022-12-23    | 0         |
| 2        | 2022-12-25  | 2022-12-30  | 2022-12-24    | 1         |
| 3        | 2022-12-01  | 2023-01-30  | NULL          | 1         |
+----------+-------------+-------------+---------------+-----------+

以下是昨天错误写法

执行SQL

代码语言:javascript
复制
select hall_id,
       start_date,
       end_date,
       last_end_date,
       if(start_date <= last_end_date, 0, 1) as is_merge --0:合并,1:不合并
from (select hall_id,
             start_date,
             end_date,
             lag(end_date) over (partition by hall_id order by start_date asc,end_date asc) as last_end_date
      from t_hall_event) t

执行结果

代码语言:javascript
复制
+----------+-------------+-------------+----------------+-----------+
| hall_id  | start_date  |  end_date   | last_end_date  | is_merge  |
+----------+-------------+-------------+----------------+-----------+
| 1        | 2023-01-13  | 2023-01-20  | NULL           | 1         |
| 1        | 2023-01-14  | 2023-01-16  | 2023-01-20     | 0         |
| 1        | 2023-01-14  | 2023-01-17  | 2023-01-16     | 0         |
| 1        | 2023-01-18  | 2023-01-25  | 2023-01-17     | 1         |
| 1        | 2023-01-20  | 2023-01-26  | 2023-01-25     | 0         |
| 2        | 2022-12-09  | 2022-12-23  | NULL           | 1         |
| 2        | 2022-12-13  | 2022-12-17  | 2022-12-23     | 0         |
| 2        | 2022-12-20  | 2022-12-24  | 2022-12-17     | 1         |
| 2        | 2022-12-25  | 2022-12-30  | 2022-12-24     | 1         |
| 3        | 2022-12-01  | 2023-01-30  | NULL           | 1         |
+----------+-------------+-------------+----------------+-----------+

分析:可以看到hall_id = 1 的第4行数据,开始时间为2023-01-18 结束时间为2023-01-25的活动与第一行的活动存在交叉,所以应该被合并,但是由于中间,其前一行的活动截止日期为2023-01-17,早于该行活动的开始日期而被判断为不应该被合并,导致错判。

3.连续问题,使用sum()over()进行分组;

执行SQL

代码语言:javascript
复制
select hall_id,
       start_date,
       end_date,
       max_end_date,
       is_merge,
       sum(is_merge) over (partition by hall_id order by start_date asc,end_date asc) as group_id
from (select hall_id,
             start_date,
             end_date,
             max_end_date,
             if(start_date <= max_end_date, 0, 1) as is_merge --0:合并,1:不合并
      from (select hall_id,
                   start_date,
                   end_date,
                   max(end_date)
                       over (partition by hall_id order by start_date asc,end_date asc rows between unbounded preceding and 1 preceding) as max_end_date
            from t_hall_event) t) tt

执行结果

代码语言:javascript
复制
+----------+-------------+-------------+---------------+-----------+-----------+
| hall_id  | start_date  |  end_date   | max_end_date  | is_merge  | group_id  |
+----------+-------------+-------------+---------------+-----------+-----------+
| 1        | 2023-01-13  | 2023-01-20  | NULL          | 1         | 1         |
| 1        | 2023-01-14  | 2023-01-16  | 2023-01-20    | 0         | 1         |
| 1        | 2023-01-14  | 2023-01-17  | 2023-01-20    | 0         | 1         |
| 1        | 2023-01-18  | 2023-01-25  | 2023-01-20    | 0         | 1         |
| 1        | 2023-01-20  | 2023-01-26  | 2023-01-25    | 0         | 1         |
| 2        | 2022-12-09  | 2022-12-23  | NULL          | 1         | 1         |
| 2        | 2022-12-13  | 2022-12-17  | 2022-12-23    | 0         | 1         |
| 2        | 2022-12-20  | 2022-12-24  | 2022-12-23    | 0         | 1         |
| 2        | 2022-12-25  | 2022-12-30  | 2022-12-24    | 1         | 2         |
| 3        | 2022-12-01  | 2023-01-30  | NULL          | 1         | 1         |
+----------+-------------+-------------+---------------+-----------+-----------+

4.完成合并,得到最终结果

取每个组内的start_day 的最小值作为活动开始日期,end_day的最大值作为活动结束日期,得到最终结果。注意分组条件为hall_id+group_id

执行SQL

代码语言:javascript
复制
select hall_id,
       min(start_date) as start_date,
       max(end_date)   as end_date
from (select hall_id,
             start_date,
             end_date,
             max_end_date,
             is_merge,
             sum(is_merge) over (partition by hall_id order by start_date asc,end_date asc) as group_id
      from (select hall_id,
                   start_date,
                   end_date,
                   max_end_date,
                   if(start_date <= max_end_date, 0, 1) as is_merge --0:合并,1:不合并
            from (select hall_id,
                         start_date,
                         end_date,
                         max(end_date)
                             over (partition by hall_id order by start_date asc,end_date asc rows between unbounded preceding and 1 preceding) as max_end_date
                  from t_hall_event) t) tt) ttt
group by hall_id, group_id --注意这里的分组,有group_id

执行结果

代码语言:javascript
复制
+----------+-------------+-------------+
| hall_id  | start_date  |  end_date   |
+----------+-------------+-------------+
| 1        | 2023-01-13  | 2023-01-26  |
| 2        | 2022-12-09  | 2022-12-24  |
| 2        | 2022-12-25  | 2022-12-30  |
| 3        | 2022-12-01  | 2023-01-30  |
+----------+-------------+-------------+

四、建表语句和数据插入

代码语言:javascript
复制
--建表语句
CREATE TABLE IF NOT EXISTS t_hall_event (
    hall_id STRING, --大厅ID
    start_date STRING, -- 营销活动开始日期
    end_date STRING -- 营销活动结束日期
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC;
--数据插入
insert into  t_hall_event(hall_id, start_date, end_date) values
('1','2023-01-13','2023-01-20'),
('1','2023-01-14','2023-01-17'),
('1','2023-01-14','2023-01-16'),
('1','2023-01-18','2023-01-25'),
('1','2023-01-20','2023-01-26'),
('2','2022-12-09','2022-12-23'),
('2','2022-12-13','2022-12-17'),
('2','2022-12-20','2022-12-24'),
('2','2022-12-25','2022-12-30'),
('3','2022-12-01','2023-01-30');
本文参与 腾讯云自媒体同步曝光计划,分享自微信公众号。
原始发表:2024-05-23,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 数据仓库技术 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 一、题目
  • 二、分析
  • 三、SQL
    • 1.使用max()函数开窗,获得截止到当前行之前的活动最后日期
      • 2.根据当前行的start_day与max_end_date进行比较,得出是否可以合并标记;
        • 3.连续问题,使用sum()over()进行分组;
          • 4.完成合并,得到最终结果
          • 四、建表语句和数据插入
          相关产品与服务
          大数据
          全栈大数据产品,面向海量数据场景,帮助您 “智理无数,心中有数”!
          领券
          问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档