前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >小红书大数据面试SQL-用户商品购买收藏行为特征加工

小红书大数据面试SQL-用户商品购买收藏行为特征加工

作者头像
数据仓库晨曦
发布2024-05-27 12:41:58
910
发布2024-05-27 12:41:58
举报
文章被收录于专栏:数据仓库技术

一、题目

已知有

购买记录表t_order,包含自增id:id,用户ID:user_id,商品ID:goods_id,订单时间:order_time,商品类别:goods_type;

用户收藏记录表t_collect_log,包含自增id,用户ID:user_id,商品ID:goods_id,收藏时间 collect_time

请用一句sql语句得出以下查询结果,得到所有用户的商品行为特征,其中用户行为分类为4种:是否已购买、购买未收藏、收藏未购买、收藏且购买。

购买记录表t_order

代码语言:javascript
复制
+-----+----------+-----------+-------------------+-------------+
| id  | user_id  | goods_id  |    order_time     | goods_type  |
+-----+----------+-----------+-------------------+-------------+
| 1   | 1        | 201       | 2020/11/14 10:00  | 1           |
| 2   | 2        | 203       | 2020/11/15 12:00  | 2           |
| 3   | 3        | 203       | 2020/11/16 10:00  | 1           |
| 4   | 4        | 203       | 2020/11/17 10:00  | 1           |
| 5   | 5        | 203       | 2020/11/18 10:00  | 1           |
| 6   | 6        | 203       | 2020/11/18 11:00  | 1           |
| 7   | 7        | 204       | 2020/11/18 12:00  | 1           |
| 8   | 8        | 205       | 2020/11/18 11:30  | 1           |
| 9   | 9        | 206       | 2020/12/1 10:00   | 1           |
| 10  | 4        | 207       | 2020/12/2 10:00   | 3           |
| 11  | 5        | 208       | 2020/12/3 10:00   | 1           |
| 12  | 6        | 209       | 2020/12/4 8:00    | 2           |
| 13  | 7        | 203       | 2020/12/5 10:00   | 2           |
| 14  | 8        | 203       | 2020/12/6 10:00   | 3           |
| 15  | 9        | 203       | 2020/12/7 15:00   | 4           |
| 16  | 1        | 204       | 2020/12/8 10:00   | 5           |
| 17  | 2        | 204       | 2020/12/9 10:00   | 5           |
| 18  | 3        | 206       | 2020/12/10 10:00  | 5           |
| 19  | 4        | 208       | 2020/12/11 10:00  | 5           |
| 20  | 5        | 209       | 2020/12/12 19:00  | 5           |
+-----+----------+-----------+-------------------+-------------+

收藏记录表t_collect_log

代码语言:javascript
复制
+-----+----------+-----------+-------------------+
| id  | user_id  | goods_id  |   collect_time    |
+-----+----------+-----------+-------------------+
| 1   | 1        | 203       | 2020/11/14 12:00  |
| 2   | 9        | 203       | 2020/11/15 10:00  |
| 3   | 4        | 203       | 2020/11/16 10:00  |
| 4   | 5        | 203       | 2020/11/17 10:00  |
| 5   | 6        | 203       | 2020/11/17 11:00  |
| 6   | 7        | 204       | 2020/11/17 12:00  |
| 7   | 8        | 205       | 2020/11/18 11:30  |
| 8   | 9        | 212       | 2020/12/1 10:00   |
| 9   | 4        | 207       | 2020/12/2 10:00   |
| 10  | 5        | 213       | 2020/12/3 10:00   |
| 11  | 6        | 209       | 2020/12/4 8:00    |
| 12  | 7        | 203       | 2020/12/5 10:00   |
| 13  | 8        | 203       | 2020/12/6 10:00   |
| 14  | 9        | 203       | 2020/12/7 15:00   |
| 15  | 1        | 203       | 2020/12/8 10:00   |
| 16  | 2        | 204       | 2020/12/9 10:00   |
| 17  | 3        | 205       | 2020/12/10 8:00   |
| 18  | 4        | 208       | 2020/12/11 10:00  |
| 19  | 5        | 209       | 2020/12/10 19:00  |
| 20  | 7        | 201       | 2020/12/11 19:00  |
+-----+----------+-----------+-------------------+

期望结果

代码语言:javascript
复制
+----------+-----------+---------+------------------+------------------+------------------+
| user_id  | goods_id  | is_buy  | buy_not_collect  | collect_not_buy  | buy_and_collect  |
+----------+-----------+---------+------------------+------------------+------------------+
| 1        | 201       | 1       | 1                | 0                | 0                |
| 1        | 203       | 0       | 0                | 1                | 0                |
| 1        | 204       | 1       | 1                | 0                | 0                |
| 2        | 203       | 1       | 1                | 0                | 0                |
| 2        | 204       | 1       | 0                | 0                | 1                |
| 3        | 203       | 1       | 1                | 0                | 0                |
| 3        | 205       | 0       | 0                | 1                | 0                |
| 3        | 206       | 1       | 1                | 0                | 0                |
| 4        | 203       | 1       | 0                | 0                | 1                |
| 4        | 207       | 1       | 0                | 0                | 1                |
| 4        | 208       | 1       | 0                | 0                | 1                |
| 5        | 203       | 1       | 0                | 0                | 1                |
| 5        | 208       | 1       | 1                | 0                | 0                |
| 5        | 209       | 1       | 0                | 0                | 1                |
| 5        | 213       | 0       | 0                | 1                | 0                |
| 6        | 203       | 1       | 0                | 0                | 1                |
| 6        | 209       | 1       | 0                | 0                | 1                |
| 7        | 201       | 0       | 0                | 1                | 0                |
| 7        | 203       | 1       | 0                | 0                | 1                |
| 7        | 204       | 1       | 0                | 0                | 1                |
| 8        | 203       | 1       | 0                | 0                | 1                |
| 8        | 205       | 1       | 0                | 0                | 1                |
| 9        | 203       | 1       | 0                | 0                | 1                |
| 9        | 206       | 1       | 1                | 0                | 0                |
| 9        | 212       | 0       | 0                | 1                | 0                |
+----------+-----------+---------+------------------+------------------+------------------+

二、分析

这个题目属于简单但繁琐的类型,日常数据开发中尤其在有支持算法的数据团队中比较常见,但是大家都不乐意做的脏活。如果面试中遇到类似这种问题,需要考虑这个团队日常主要干脏活,能不能接受。说回题目,因为数据需要join操作,并且存在较多的冗余,由此很容易出现数据倾斜的问题,如果我出这个题目,希望看到的是候选人日常sql的习惯中是否优先进行行列裁剪和去重以保证join时两个表的粒度统一

维度

评分

题目难度

⭐️⭐️⭐️

题目清晰度

⭐️⭐️⭐️⭐️⭐️

业务常见度

⭐️⭐️⭐️⭐️

三、SQL

1.行列裁剪

首先对订单表、收藏记录进行行列裁剪,我们只需要user_id、goods_id,去掉其他冗余列,且保证user_id+goods_id唯一

订单表

代码语言:javascript
复制
select
    user_id,
    goods_id
from t_order
group by
    user_id,
    goods_id

执行结果

代码语言:javascript
复制
+----------+-----------+
| user_id  | goods_id  |
+----------+-----------+
| 1        | 201       |
| 1        | 204       |
| 2        | 203       |
| 2        | 204       |
| 3        | 203       |
| 3        | 206       |
| 4        | 203       |
| 4        | 207       |
| 4        | 208       |
| 5        | 203       |
| 5        | 208       |
| 5        | 209       |
| 6        | 203       |
| 6        | 209       |
| 7        | 203       |
| 7        | 204       |
| 8        | 203       |
| 8        | 205       |
| 9        | 203       |
| 9        | 206       |
+----------+-----------+

收藏表

代码语言:javascript
复制
select
    user_id,
    goods_id
from t_collect_log
group by
    user_id,
    goods_id

执行结果

代码语言:javascript
复制
+----------+-----------+
| user_id  | goods_id  |
+----------+-----------+
| 1        | 203       |
| 2        | 204       |
| 3        | 205       |
| 4        | 203       |
| 4        | 207       |
| 4        | 208       |
| 5        | 203       |
| 5        | 209       |
| 5        | 213       |
| 6        | 203       |
| 6        | 209       |
| 7        | 201       |
| 7        | 203       |
| 7        | 204       |
| 8        | 203       |
| 8        | 205       |
| 9        | 203       |
| 9        | 212       |
+----------+-----------+

2.两个表进行全外联,获得全量的数据行

对两个结果表进行全外联,关联条件为user_id + goods_id, 因为已经完成去重, 所以得到的行就是全量的行,且不会有重复。我们先直接关联,不做任何加工。

执行SQL

代码语言:javascript
复制
select
    t_ord.user_id,
    t_ord.goods_id,
    t_collect.user_id,
    t_collect.goods_id
from
    (
    --订单表数据
    select
        user_id,
        goods_id
    from t_order
    group by
        user_id,
        goods_id
    ) t_ord
    full join
    (
    --收藏表数据
    select
        user_id,
        goods_id
    from t_collect_log
    group by
        user_id,
        goods_id
    ) t_collect
        on t_ord.user_id = t_collect.user_id
        and t_ord.goods_id = t_collect.goods_id

执行结果

代码语言:javascript
复制
+----------------+-----------------+--------------------+---------------------+
| t_ord.user_id  | t_ord.goods_id  | t_collect.user_id  | t_collect.goods_id  |
+----------------+-----------------+--------------------+---------------------+
| 1              | 201             | NULL               | NULL                |
| NULL           | NULL            | 1                  | 203                 |
| 1              | 204             | NULL               | NULL                |
| 2              | 203             | NULL               | NULL                |
| 2              | 204             | 2                  | 204                 |
| 3              | 203             | NULL               | NULL                |
| NULL           | NULL            | 3                  | 205                 |
| 3              | 206             | NULL               | NULL                |
| 4              | 203             | 4                  | 203                 |
| 4              | 207             | 4                  | 207                 |
| 4              | 208             | 4                  | 208                 |
| 5              | 203             | 5                  | 203                 |
| 5              | 208             | NULL               | NULL                |
| 5              | 209             | 5                  | 209                 |
| NULL           | NULL            | 5                  | 213                 |
| 6              | 203             | 6                  | 203                 |
| 6              | 209             | 6                  | 209                 |
| NULL           | NULL            | 7                  | 201                 |
| 7              | 203             | 7                  | 203                 |
| 7              | 204             | 7                  | 204                 |
| 8              | 203             | 8                  | 203                 |
| 8              | 205             | 8                  | 205                 |
| 9              | 203             | 9                  | 203                 |
| 9              | 206             | NULL               | NULL                |
| NULL           | NULL            | 9                  | 212                 |
+----------------+-----------------+--------------------+---------------------+

3.求取结果数据

我们先把所有的user_id 和 goods_id取出来,然后进行特征加工:

  • 是否购买: 根据 t_ord中的goods_id 是否为空判断是否购买,为空代表未购买,非空代表购买;
  • 购买未收藏: t_ord中goods_id不为空,t_collect中goods_id为空;
  • 收藏未购买: t_ord中goods_id为空,t_collect中的goods_id不为空;
  • 收藏且购买: t_ord中的goods_id不为空,t_collect中的goods_id不为空;

执行SQL

代码语言:javascript
复制
select
    coalesce(t_ord.user_id,t_collect.user_id) as user_id,
    coalesce(t_ord.goods_id,t_collect.goods_id) as goods_id,
    if(t_ord.goods_id is not null,1,0) as is_buy,
    if(t_ord.goods_id is not null and t_collect.goods_id is null,1,0) as buy_not_collect,
    if(t_ord.goods_id is null and t_collect.goods_id is not null,1,0) as collect_not_buy,
    if(t_ord.goods_id is not null and t_collect.goods_id is not null,1,0) as buy_and_collect
from
    (
    --订单表数据
    select
        user_id,
        goods_id
    from t_order
    group by
        user_id,
        goods_id
    ) t_ord
    full join
    (
    --收藏表数据
    select
        user_id,
        goods_id
    from t_collect_log
    group by
        user_id,
        goods_id
    ) t_collect
        on t_ord.user_id = t_collect.user_id
        and t_ord.goods_id = t_collect.goods_id

执行结果

代码语言:javascript
复制
+----------+-----------+---------+------------------+------------------+------------------+
| user_id  | goods_id  | is_buy  | buy_not_collect  | collect_not_buy  | buy_and_collect  |
+----------+-----------+---------+------------------+------------------+------------------+
| 1        | 201       | 1       | 1                | 0                | 0                |
| 1        | 203       | 0       | 0                | 1                | 0                |
| 1        | 204       | 1       | 1                | 0                | 0                |
| 2        | 203       | 1       | 1                | 0                | 0                |
| 2        | 204       | 1       | 0                | 0                | 1                |
| 3        | 203       | 1       | 1                | 0                | 0                |
| 3        | 205       | 0       | 0                | 1                | 0                |
| 3        | 206       | 1       | 1                | 0                | 0                |
| 4        | 203       | 1       | 0                | 0                | 1                |
| 4        | 207       | 1       | 0                | 0                | 1                |
| 4        | 208       | 1       | 0                | 0                | 1                |
| 5        | 203       | 1       | 0                | 0                | 1                |
| 5        | 208       | 1       | 1                | 0                | 0                |
| 5        | 209       | 1       | 0                | 0                | 1                |
| 5        | 213       | 0       | 0                | 1                | 0                |
| 6        | 203       | 1       | 0                | 0                | 1                |
| 6        | 209       | 1       | 0                | 0                | 1                |
| 7        | 201       | 0       | 0                | 1                | 0                |
| 7        | 203       | 1       | 0                | 0                | 1                |
| 7        | 204       | 1       | 0                | 0                | 1                |
| 8        | 203       | 1       | 0                | 0                | 1                |
| 8        | 205       | 1       | 0                | 0                | 1                |
| 9        | 203       | 1       | 0                | 0                | 1                |
| 9        | 206       | 1       | 1                | 0                | 0                |
| 9        | 212       | 0       | 0                | 1                | 0                |
+----------+-----------+---------+------------------+------------------+------------------+

四、建表语句和数据插入

代码语言:javascript
复制
--订单表创建语句
CREATE TABLE IF NOT EXISTS t_order (
    id string,
    user_id string,
    goods_id string,
    order_time string,
    goods_type string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC;
--订单数据插入
insert into t_order(id,user_id,goods_id,order_time,goods_type)values
('1','1','201','2020/11/14 10:00','1'),
('2','2','203','2020/11/15 12:00','2'),
('3','3','203','2020/11/16 10:00','1'),
('4','4','203','2020/11/17 10:00','1'),
('5','5','203','2020/11/18 10:00','1'),
('6','6','203','2020/11/18 11:00','1'),
('7','7','204','2020/11/18 12:00','1'),
('8','8','205','2020/11/18 11:30','1'),
('9','9','206','2020/12/1 10:00','1'),
('10','4','207','2020/12/2 10:00','3'),
('11','5','208','2020/12/3 10:00','1'),
('12','6','209','2020/12/4 8:00','2'),
('13','7','203','2020/12/5 10:00','2'),
('14','8','203','2020/12/6 10:00','3'),
('15','9','203','2020/12/7 15:00','4'),
('16','1','204','2020/12/8 10:00','5'),
('17','2','204','2020/12/9 10:00','5'),
('18','3','206','2020/12/10 10:00','5'),
('19','4','208','2020/12/11 10:00','5'),
('20','5','209','2020/12/12 19:00','5');

--收藏记录日志
CREATE TABLE IF NOT EXISTS t_collect_log (
    id string,
    user_id string,
    goods_id string,
    collect_time string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC;
--收藏记录数据插入
insert into t_collect_log(id,user_id,goods_id,collect_time)values
('1','1','203','2020/11/14 12:00'),
('2','9','203','2020/11/15 10:00'),
('3','4','203','2020/11/16 10:00'),
('4','5','203','2020/11/17 10:00'),
('5','6','203','2020/11/17 11:00'),
('6','7','204','2020/11/17 12:00'),
('7','8','205','2020/11/18 11:30'),
('8','9','212','2020/12/1 10:00'),
('9','4','207','2020/12/2 10:00'),
('10','5','213','2020/12/3 10:00'),
('11','6','209','2020/12/4 8:00'),
('12','7','203','2020/12/5 10:00'),
('13','8','203','2020/12/6 10:00'),
('14','9','203','2020/12/7 15:00'),
('15','1','203','2020/12/8 10:00'),
('16','2','204','2020/12/9 10:00'),
('17','3','205','2020/12/10 8:00'),
('18','4','208','2020/12/11 10:00'),
('19','5','209','2020/12/10 19:00'),
('20','7','201','2020/12/11 19:00');
本文参与 腾讯云自媒体同步曝光计划,分享自微信公众号。
原始发表:2024-05-21,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 数据仓库技术 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 一、题目
  • 二、分析
  • 三、SQL
    • 1.行列裁剪
      • 2.两个表进行全外联,获得全量的数据行
        • 3.求取结果数据
        • 四、建表语句和数据插入
        相关产品与服务
        大数据
        全栈大数据产品,面向海量数据场景,帮助您 “智理无数,心中有数”!
        领券
        问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档