使用Amazon Redshift (SQL),我得到了一个时间戳表,当条目之间的时间超过某个阈值时,我想将其分成不同的阶段。
例如,对此输入使用60个单位的阈值:
id ts
1 a 1
2 a 4
3 a 12
4 a 90
5 a 94
6 a 101
7 a 404
8 a 412
9 a 413我想退还这个:
id ts dt phase
1 a 1 NA 1
2 a 4 3 1
3 a 12 8 1
4 a 90 78 2
5 a 94 4 2
6 a 101 7 2
7 a 404 303 3
8 a 412 8 3
9 a 413 1 3这在R(我最熟悉的)中很简单,使用一个简单的for循环和ifelse,如果dt >60,它会将前面的phase值递增1:
# sample data
df <- data.frame(id = rep("a", 9),
ts = c(1, 4, 12, 90, 94, 101, 404, 412, 413)) %>%
mutate(dt = c(NA, diff(ts)))
# add default minimum phase value, 1
df$phase<- 1
# for loop
for(i in 2:nrow(df)) {
df$phase[i] <- ifelse(df$dt[i] > 60, df$phase[i-1] + 1, df$phase[i-1])
}但是,我在SQL中使用lag函数和case / when的尝试并没有成功。
-- sample data
CREATE TABLE sampledata (
conversationid varchar(10), ts integer
);
INSERT INTO sampledata (conversationid, ts)
VALUES
('a', 1),
('a', 4),
('a', 12),
('a', 90),
('a', 94),
('a', 101),
('a', 404),
('a', 412),
('a', 413);
-- query
SELECT *,
CASE
WHEN dt > 60 THEN LAG(period) OVER (PARTITION BY conversationid ORDER BY ts) + 1
ELSE LAG(period) OVER (PARTITION BY conversationid ORDER BY ts)
END AS period
FROM (
SELECT *,
ts - LAG(ts) OVER (PARTITION BY conversationid ORDER BY ts) AS dt,
1 AS period
FROM sampledata
)
ORDER BY ts
;
-- output
id ts dt period period
a 1 1
a 4 3 1 1
a 12 8 1 1
a 90 78 1 2
a 94 4 1 1
a 101 7 1 1
a 404 303 1 2
a 412 8 1 1
a 413 1 1 1我可以在dt > 60的行上递增相位值,但不能在后续行中传播递增的phase值。
我猜这可能与lag函数一次跨所有行操作而不是逐行操作和/或无法动态更新原始phase值(而不是创建第二个列phase )有关。
发布于 2018-02-08 19:57:23
你已经很接近了。您想要一个基于滞后差的累积和:
SELECT sd.*,
SUM(CASE WHEN diff > 60 THEN 1 ELSE 0 END) OVER (PARTITION BY conversationid ORDER BY ts) as period
FROM (SELECT sd.*,
(ts - LAG(ts) OVER (PARTITION BY conversationid ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) ) AS diff
FROM sampledata sd
) sd
ORDER BY ts;顺便说一句,我希望你使用ORDER BY conversationid, ts,而不仅仅是time。
最后,上面的代码将在NULL中启动句点(它应该正确地识别它们,只是将它们命名得很别扭)。下面的调整将根据您的具体请求执行枚举:
SELECT sd.*,
(1 + SUM(CASE WHEN diff < 60 THEN 0 ELSE 1 END) OVER (PARTITION BY conversationid ORDER BY ts ROW BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) as periodhttps://stackoverflow.com/questions/48684887
复制相似问题