从1到10 的高级 SQL 技巧，试试知道多少？

万能数据的小草

发布于 2024-07-23 15:21:21

440

发布于 2024-07-23 15:21:21

文章被收录于专栏：万能的小草

1.增量表和MERGE

以正确的方式有效更新表很重要。理想的情况是当您的事务是主键、唯一整数和自动增量时。这种情况下的表更新很简单：

insert target_table (transaction_id)
select transaction_id from source_table 
where
transaction_id > (select max(transaction_id) from target_table)
;

在数据仓库中使用非规范化星型模式数据集时，情况并非总是如此。可能需要使用 SQL 创建会话和/或仅使用部分数据增量更新数据集。transaction_id可能不存在，但您将不得不处理数据模型，其中唯一键取决于transaction_id已知的最新（或时间戳）。例如，数据user_id集中last_online取决于最新的已知连接时间戳。在这种情况下，您需要update现有用户和insert新用户。

合并和增量更新

您可以使用MERGE，也可以将操作拆分为两个操作。一种是用新记录更新现有记录，另一种是插入不存在的全新记录（LEFT JOIN 情况）。

MERGE是关系数据库中常用的语句。Google BigQuery MERGE 命令是数据操作语言 (DML) 语句之一。它通常用于在一条语句中自动执行三个主要功能。这些函数是 UPDATE、INSERT 和 DELETE。

当两个或多个数据匹配时，可以使用 UPDATE 或 DELETE 子句。
当两个或多个数据不同且不匹配时，可以使用 INSERT 子句。
当给定数据与源不匹配时，也可以使用 UPDATE 或 DELETE 子句。

这意味着 Google BigQuery MERGE 命令可让您通过更新、插入和删除 Google BigQuery 表中的数据来合并 Google BigQuery 数据。

考虑这个 SQL：

create temp table last_online as (   -- 最近在线表
select 1 as user_id
    , timestamp('2000-10-01 00:00:01') as last_online
)
;
create temp table connection_data  (  --分区
  user_id int64
  ,timestamp timestamp
)
PARTITION BY DATE(_PARTITIONTIME)
;
insert connection_data (user_id, timestamp)
select 2 as user_id
    , timestamp_sub(current_timestamp(),interval 28 hour) as timestamp
union all
select 1 as user_id
        , timestamp_sub(current_timestamp(),interval 28 hour) as timestamp
union all
select 1 as user_id
        , timestamp_sub(current_timestamp(),interval 20 hour) as timestamp
union all
select 1 as user_id
    , timestamp_sub(current_timestamp(),interval 1 hour) as timestamp
;

merge last_online t
using (
select
      user_id
    , last_online
from
    (
select
            user_id
        ,   max(timestamp) as last_online

from
            connection_data
where
date(_partitiontime) >= date_sub(current_date(), interval 1 day)
group by
            user_id

    ) y

) s
on t.user_id = s.user_id
when matched then
update set last_online = s.last_online, user_id = s.user_id
when not matched then
insert (last_online, user_id) values (last_online, user_id)
;
select * from last_online
;

在这里，我们仅在一次操作中完成了 UPDATE 和 INSERT。

2. 计算单词数 Counting words

执行 UNNEST() 并检查您需要的单词是否在您需要的列表中可能在许多情况下很有用，即情感分析：

with titles as (
select 'Title with word foo' as title union all
select 'Title with word bar'
)
, data as (
select
    title, 
split(title, ' ') as words 
from
    titles
)
select * from data, unnest(words) words
where
    words in ('bar')
;

3. 在 SELECT 语句之外使用 IF() 语句

这使我们有机会节省一些代码行并在代码方面更加雄辩。通常您希望将其放入子查询中，并在where子句中添加过滤器，但您可以这样做：

with daily_revenue as (
select
current_date() as dt
    , 100          as revenue
union all
select
date_sub(current_date(), interval 1 day) as dt
    , 100          as revenue
)
select
*
from daily_revenue
where
if(revenue >101,1,0) = 1
;

另一个例子是如何不将它与分区表一起使用。不要这样做。这是一个不好的例子，因为由于匹配的表后缀可能是动态确定的（基于表中的某些内容），因此您将需要为全表扫描付费。

SELECT *
FROM `firebase.events`
WHERE IF(condition,
         _TABLE_SUFFIX BETWEEN '20170101' AND '20170117',
         _TABLE_SUFFIX BETWEEN '20160101' AND '20160117')
;

4. 使用 GROUP BY ROLLUP

ROLLUP函数用于执行多个级别的聚合。当您必须使用维度图时，这非常有用。

以下查询返回在where子句中指定的交易类型 (is_gift) 每天的总信用支出，并且还显示每天的总支出以及所有可用日期的总支出。

ith data as (
select
current_timestamp() as ts           
    ,'stage'            as context_type 
    ,1                  as user_id      
    ,100                as credit_value 
    , true              as is_gift
union all
select
    timestamp_sub(current_timestamp(), interval 24 hour) as ts           
    ,'user'             as context_type 
    ,1                  as user_id      
    ,200                as credit_value 
    ,false              as is_gift
union all
select
    timestamp_sub(current_timestamp(), interval 24*2 hour) as ts           
    ,'user'             as context_type 
    ,3                  as user_id      
    ,300                as credit_value 
    ,true               as is_gift

)
, results as (
select
date(ts) as date
    ,context_type
    ,sum(credit_value)/100 as daily_credits_spend
from data

group by rollup(1, context_type)
order by 1
)

select
date
  ,if(context_type is null, 'total', context_type) as context_type
  ,daily_credits_spend
from results
order by date
;

5. 将表转换为 JSON

想象一下，您需要将表转换为 JSON 对象，其中每个记录都是嵌套数组的元素。这就是to_json_string()函数发挥作用的地方：

with mytable as (
select 1 as x, 'foo' as y, true as z union all
select 2, 'bar', false
)
select
concat("{", "\"MyTable\":", "[", string_agg(to_json_string(t), ","), "]", "}")
from mytable as t
;

然后您可以在任何地方使用它：日期、营销漏斗、指数、直方图等。

6. 使用 PARTITION BY函数

给定user_id、date和total_cost列。对于每个日期，如何在保留所有行的同时显示每个客户的总收入值？你可以这样实现：

select
date
    ,user_id
    ,total_cost
    ,sum(total_cost) over (partition by date,user_id) as revenue_per_day
from production.payment_transaction
;

7.移动平均线

通常，BI 开发人员的任务是向报告及其出色的仪表板添加移动平均值。这可能是 7、14、30 天/月甚至一年的 MA 线图。那么我们该怎么做呢？

with dates as (
select
    dt
from
unnest(generate_date_array(date_sub(current_date(), interval 90 day), current_date(), interval 1 day)) as dt
)

, data as (
select dt
        , CEIL(RAND()*1000) as revenue -- just some random data.
from
        dates
)
select
  dt
, revenue
, AVG(revenue) OVER(ORDER BY unix_date(dt) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) as seven_day_moving_average
from data
;

8. 日期数组Date arrays

当您处理用户保留或想要检查某些数据集是否缺少值（即日期）时，它变得非常方便。BigQuery 有一个名为的函数GENERATE_DATE_ARRAY：

select
 dt
from
unnest(generate_date_array('2019–12–04', '2020–09–17', interval 1 day)) as dt
;

9.排序Row_number()

这对于从数据中获取最新信息（即最新更新的记录等）甚至删除重复项很有用：

SELECT *
FROM table_a a
LEFT JOIN table_b b
ON a.id = b.id
WHERE b.date < a.date
QUALIFY ROW_NUMBER() OVER (PARTITION BY b.id ORDER BY b.date desc) = 1

10、排序Rank/dense_rank

它们也称为排序函数。倾向于使用DENSE_RANK 默认排名功能，因为它不会跳过下一个可用排名，而RANK会。它返回连续的排名值。您可以将其与分区一起使用，将结果划分为不同的存储桶。如果每个分区中的行具有相同的值，则它们将获得相同的排名。例子：

with top_spenders as (
select 1 as user_id, 100 as total_spend, 11   as reputation_level union all
select 2 as user_id, 250 as total_spend, 11   as reputation_level union all
select 3 as user_id, 250 as total_spend, 11   as reputation_level union all
select 4 as user_id, 300 as total_spend, 11   as reputation_level union all
select 11 as user_id, 1000 as total_spend, 22   as reputation_level union all
select 22 as user_id, 1500 as total_spend, 22   as reputation_level union all
select 33 as user_id, 1500 as total_spend, 22   as reputation_level union all
select 44 as user_id, 2500 as total_spend, 22   as reputation_level 

)

select
    user_id
    , rank() over(partition by reputation_level order by total_spend desc) as rank
    , dense_rank() over(partition by reputation_level order by total_spend desc) as dense_rank
from
    top_spenders
;

另一个产品价格的例子：

with products as (

select
2                    as product_id      
        , 'premium_account'  as product_type    
        , 100                as total_cost   
union all
select
1                    as product_id      
        , 'premium_group'    as product_type
        , 200                as total_cost
union all
select
111                  as product_id      
        , 'bots'             as product_type    
        , 300                as total_cost      
union all
select
112                  as product_id      
        , 'bots'             as product_type    
        , 400                as total_cost      
union all
select
113                  as product_id      
        , 'bots'             as product_type    
        , 500                as total_cost      
union all
select
213                  as product_id      
        , 'bots'             as product_type    
        , 300                as total_cost      

)
select * from (
select
      product_id
    , product_type
    , total_cost as product_price
    , dense_rank () over ( 
partition by product_type
order by total_cost desc
    ) price_rank 
from
    products
) t
where price_rank < 3
;

12.Pivot / unpivot 透视/取消透视

透视将行更改为列。这就是它的全部作用。Unpivot 则相反。最后它可以在 BigQuery 中使用：

elect * from
(
  -- #1 from_item
  select 
     extract(month from dt) as mo         
    ,product_type    
    ,revenue   
from (
    select
        date(current_date()) as dt              
        , 'premium_account'  as product_type    
        , 100                as revenue   
    union all
    select
        date_sub(current_date(), interval 1 month) as dt
        , 'premium_group'    as product_type
        , 200                as revenue
    union all
    select
        date_sub(current_date(), interval 2 month) as dt
        , 'bots'             as product_type
        , 300                as revenue
  )
)
pivot
(
  -- #2 aggregate
  avg(revenue) as avg_revenue_
  -- #3 pivot_column
for product_type in ('premium_account', 'premium_group')
)
;

13.First_value / last_value

这是另一个有用的函数，它有助于获取每行相对于该特定分区中的第一个/最后一个值的增量。

with top_spenders as (
select 1 as user_id, 100 as total_spend, 11   as reputation_level union all
select 2 as user_id, 150 as total_spend, 11   as reputation_level union all
select 3 as user_id, 250 as total_spend, 11   as reputation_level union all
select 11 as user_id, 1000 as total_spend, 22   as reputation_level union all
select 22 as user_id, 1500 as total_spend, 22   as reputation_level union all
select 33 as user_id, 2500 as total_spend, 22   as reputation_level 

)
, data as (
select
        user_id
        ,total_spend
        ,reputation_level
        ,first_value(total_spend)
over (partition by reputation_level order by total_spend desc
rows between unbounded preceding and unbounded following) as top_spend
from top_spenders
)

select
    user_id
    ,reputation_level
    ,total_spend
    ,top_spend          as top_spend_by_rep_level
    ,total_spend - top_spend as delta_in_usd
from data
;

14. 将表转换为结构数组并将它们传递给 UDF

当您需要将具有一些复杂逻辑的用户定义函数 (UDF) 应用于每行或表时，这非常有用。您始终可以将表视为 TYPE STRUCT 对象的数组，然后将其中每个对象传递给 UDF。这取决于你的逻辑。例如，我用它来计算购买过期时间：

select
     target_id
    ,product_id
    ,product_type_id
    ,production.purchase_summary_udf()(
        ARRAY_AGG(
STRUCT(
                target_id
                , user_id
                , product_type_id
                , product_id
                , item_count
                , days
                , expire_time_after_purchase
                , transaction_id 
                , purchase_created_at 
                , updated_at
            ) 
order by purchase_created_at
        )
    ) AS processed

from new_batch
;

以类似的方式，您可以创建表而无需使用UNION ALL。例如，我用它来模拟单元测试的一些测试数据。这样，您只需在编辑器中使用 ++ 即可非常快速地完成此操作Alt+Shift+Down

15. 使用 FOLLOWING 和 UNBOUNDED FOLLOWING 创建事件漏斗

营销渠道就是一个很好的例子。您的数据集可能包含相同类型的连续重复事件，但理想情况下您希望将每个事件与下一个不同类型的事件链接起来。当您需要获取某些内容（即事件、购买等）的列表以构建渠道数据集时，这可能很有用。使用 PARTITION BY 它使您有机会对所有以下事件进行分组，无论每个分区中存在多少个事件。

with d as (
select * from unnest([
struct('0003f' as user_pseudo_id, 12322175 as user_id, timestamp '2020-10-10 16:46:59.878 UTC' as event_timestamp, 'join_group' as event_name),
  ('0003',12,timestamp '2022-10-10 16:50:03.394 UTC','set_avatar'),
  ('0003',12,timestamp '2022-10-10 17:02:38.632 UTC','set_avatar'),
  ('0003',12,timestamp '2022-10-10 17:09:38.645 UTC','set_avatar'),
  ('0003',12,timestamp '2022-10-10 17:10:38.645 UTC','join_group'),
  ('0003',12,timestamp '2022-10-10 17:15:38.645 UTC','create_group'),
  ('0003',12,timestamp '2022-10-10 17:17:38.645 UTC','create_group'),
  ('0003',12,timestamp '2022-10-10 17:18:38.645 UTC','in_app_purchase'),
  ('0003',12,timestamp '2022-10-10 17:19:38.645 UTC','spend_virtual_currency'),
  ('0003',12,timestamp '2022-10-10 17:19:45.645 UTC','create_group'),
  ('0003',12,timestamp '2022-10-10 17:20:38.645 UTC','set_avatar')
  ]
  ) as t)

  , event_data as (
SELECT
    user_pseudo_id
  , user_id
  , event_timestamp
  , event_name
  , ARRAY_AGG(
STRUCT(
              event_name AS event_name
            , event_timestamp AS event_timestamp
        )
    ) 
OVER(PARTITION BY user_pseudo_id ORDER BY event_timestamp ROWS BETWEEN 1 FOLLOWING AND  UNBOUNDED FOLLOWING ) as next_events

FROM d
WHERE
DATE(event_timestamp) = "2022-10-10"

)
select
    user_pseudo_id
  , user_id
  , event_timestamp
  , event_name
  , (SELECT
        event_name FROM UNNEST(next_events) next_event
WHERE t.event_name != event_name
ORDER BY event_timestamp  LIMIT 1
-- change to ORDER BY event_timestamp desc if prev event needed
  ) next_event
  , (SELECT
        event_timestamp FROM UNNEST(next_events) next_event
WHERE t.event_name != event_name
ORDER BY event_timestamp  LIMIT 1
-- change to ORDER BY event_timestamp desc if prev event needed
  ) next_event_ts

from event_data t
;

16. 正则表达式

如果您需要从非结构化数据中提取某些内容（例如外汇汇率、自定义分组等），您会使用它。

使用正则表达式处理货币汇率

考虑以下汇率数据示例：


-- One or more digits (\d+), optional period (\.?), zero or more digits (\d*).
with object as
(select  '{"aed":3.6732,"afn":78.45934,"all":110.586428}' as rates)

, data as (
select "usd" as base_currency,
  regexp_extract_all(rates, r'"[^"]+":\d+\.?\d*') as pair
from object
)
, splits as (
select base_currency, pair, split(pair, ':') positions 
from data cross join unnest (pair) as pair
)
select base_currency, pair,  positions[offset(0)] as rate_currency,  positions[offset(1)] as rate
from splits  
;

使用正则表达式处理应用程序版本

有时您可能想要regexp获取应用程序的主要版本、发行版本或修改版本，并创建自定义报告：

with events as (
select  'open_chat' as event_name, '10.1.0' as app_display_version union all
select  'open_chat' as event_name, '10.1.9' as app_display_version union all
select  'open_chat' as event_name, '9.1.4' as app_display_version union all
select  'open_chat' as event_name, '9.0.0' as app_display_version
)
select
     app_display_version
    ,REGEXP_EXTRACT(app_display_version, '^[^.^]*') main_version
    ,safe_cast(REGEXP_EXTRACT(app_display_version, '[0-9]+.[0-9]+') as float64) release_version
    ,safe_cast(REGEXP_EXTRACT(app_display_version, r"^[a-zA-Z0-9_.+-]+.[a-zA-Z0-9-]+\.([a-zA-Z0-9-.]+$)") as int64) as mod_version
from events
;

结论

SQL 是一个强大的工具，有助于操作数据。希望这些来自数字营销的 SQL 用例对您有用。可以帮助您完成许多项目。 SQL 片段让我的工作变得轻松，几乎每天都在使用。此外，SQL 和现代数据仓库是数据科学的必备工具。其强大的方言功能允许轻松建模和可视化数据。由于 SQL 是数据仓库和商业智能专业人员使用的语言，因此如果您想与他们共享数据，它是一个很好的选择。

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2024-04-05，如有侵权请联系 cloudcommunity@tencent.com 删除

技巧