给出一个非常大的数据集(>100万次观测),并试图将我的逻辑向量化,但没有找到R化的方法来解决它。
问题是,每当我在变量中有一个“坏”的观察时,我需要检查前面的5个观察结果是否有“好的”指标。只要前面有5个“好”的观察,“坏”的观察就会被保留下来。如果在5个观测移动窗口中存在“坏”观测,那么该观测结果最终将被从分析中删除。
到目前为止,我已经尝试使用带有ifelse()逻辑的for循环。逻辑检验出来了,但是R的处理需要几个小时才能完成。我已经研究过用于滚动窗口的zoo包,但没有应用聚合函数,比如mean()或sum()。我还研究过apply()、lapply()等,但一直无法使它们工作。
这是我的代码,所以是for循环。让df$Observation作为好与坏的最初标记,让df$Result来决定我们是保持还是放弃观察。
编辑
set.seed(1)
df <- data.frame(Observation = sample(c("Good", "Bad"), 1000, T, c(0.9,0.1)))
for(i in 1:nrow(df)){
ifelse(
df$Observation[i] == "Good",
df$Result[i] <- "Keep",
ifelse(
df$Observation[i] == "Bad" &
df$Observation[i-1] == "Good" &
df$Observation[i-2] == "Good" &
df$Observation[i-3] == "Good" &
df$Observation[i-4] == "Good",
df$Result[i] <- "Keep",
df$Result[i] <- "Drop"
)
)
}期望结果示例:
df[385:393,]
Observation Result
385 Good Keep
386 Good Keep
387 Good Keep
388 Good Keep
389 Good Keep
390 Bad Keep
391 Good Keep
392 Good Keep
393 Bad Drop代码如预期的工作,但我需要一个更有效的方式来执行它在R。谢谢你的帮助!
发布于 2019-05-07 20:06:47
你可以这样做:
首先,我设置了种子,创建了一些样本数据,并打开了必要的包。
set.seed(1)
df <- data.frame(Observation = sample(c("Good", "Bad"), 1000, T, c(0.9,0.1)))
library(zoo)
library(dplyr)一开始我落后了一排。从这里开始,我计算了该滞后行和前四行的rollmax。然后,我将这个rollmax与1进行比较。如果计算结果为TRUE,当前行等于"Bad",则Result为"Drop",否则为"KEEP"。
df2 <- df %>%
mutate(Result = if_else(rollmax(lag(Observation) == "Bad", 5, fill = 0, align = "right") == 1 & Observation == "Bad", "Drop", "Keep")) 这样,它将与预期的输出相匹配:
df2[385:393,]
Observation Result
385 Good Keep
386 Good Keep
387 Good Keep
388 Good Keep
389 Good Keep
390 Bad Keep
391 Good Keep
392 Good Keep
393 Bad Drop发布于 2019-05-09 03:14:47
为此,我喜欢zoo。这一切似乎都匹配,除了第一次的坏(只有3个国家之前)。您可以使用fill = 4调整逻辑以保持该逻辑不变。
library(tidyverse)
library(zoo)
df_decision <-
df %>%
mutate(
good_ind = as.integer(Observation == "Good"),
good_count = rollsum(good_ind, 5, align = "right", fill = good_ind),
result =ifelse(good_ind == 1 | good_count >= 4, "keep", "drop")
)发布于 2019-05-08 20:51:03
如果用一些dplyr函数替换循环,事情就会加快。只是要小心前5排的处理。dplyr版本将在前5行中删除任何“不好”的观察结果,而循环将保留它们。如果需要,可以向case_when添加更多的逻辑。
library(tictoc)
library(dplyr)
set.seed(1)
df <- data.frame(Observation = sample(c("Good", "Bad"), 10000, TRUE, c(0.9,0.1)))
df2 <- df
tic("loop")
for(i in 1:nrow(df)){
ifelse(
df$Observation[i] == "Good",
df$Result[i] <- "Keep",
ifelse(
df$Observation[i] == "Bad" &
df$Observation[i-1] == "Good" &
df$Observation[i-2] == "Good" &
df$Observation[i-3] == "Good" &
df$Observation[i-4] == "Good",
df$Result[i] <- "Keep",
df$Result[i] <- "Drop"
)
)
}
toc() # 3.9s
tic("dplyr")
df2 <- df2 %>%
dplyr::mutate(
L1 = dplyr::lag(Observation, 1),
L2 = dplyr::lag(Observation, 2),
L3 = dplyr::lag(Observation, 3),
L4 = dplyr::lag(Observation, 4),
L5 = dplyr::lag(Observation, 5),
Result = dplyr::case_when(
Observation == "Good" ~ "Keep",
L1 == "Good" &
L2 == "Good" &
L3 == "Good" &
L4 == "Good" &
L5 == "Good" ~ "Keep",
TRUE ~ "Drop"
)
) %>%
dplyr::select(Observation, Result)
toc() # 0.08shttps://stackoverflow.com/questions/56029580
复制相似问题