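I have monthly data on people's employment status over one year. The data contain 4 variables: a person's ID, the country, the month, and the main activity in that month (employed, unemployed, inactive, other). Here is an example: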
ID <- rep(1:10, 3)
country <- rep(c("AT", "BE", "CH", "CZ", "HR", "SO", "SV", "RU", "GR", "GE"), 3)
month <- c("Jan", "Feb", "Mar", "Apr", "May", "Aug", "Dec", "Nov", "Sep", "Jan",
           "Jun", "Jul", "Oct", "Jan", "Feb", "Mar", "Apr", "May", "Aug", "Dec",
           "Nov", "Sep", "Jan", "Jun", "Jul", "Oct", "Nov", "Jan", "Jun", "Aug")
act <- c("Unemployed", "Employed", "Other", "Other", "Inactive", "Unemployed", "Employed",
         "Employed", "Employed", "Unemployed", "Other", "Unemployed", "Unemployed", "Unemployed",
         "Other", "Other", "Employed", "Other", "Other", "NA",
         # rows 21-30, taken from the printed output below
         "Unemployed", "Unemployed", "NA", "Unemployed", "Other", "Employed", "Employed",
         "NA", "Inactive", "Unemployed")
df <- data.frame(ID, country, month, act)
df[order(df$ID), ]
   ID country month        act
1   1      AT   Jan Unemployed
11  1      AT   Jun      Other
21  1      AT   Nov Unemployed
2   2      BE   Feb   Employed
12  2      BE   Jul Unemployed
22  2      BE   Sep Unemployed
3   3      CH   Mar      Other
13  3      CH   Oct Unemployed
23  3      CH   Jan         NA
4   4      CZ   Apr      Other
14  4      CZ   Jan Unemployed
24  4      CZ   Jun Unemployed
5   5      HR   May   Inactive
15  5      HR   Feb      Other
25  5      HR   Jul      Other
6   6      SO   Aug Unemployed
16  6      SO   Mar      Other
26  6      SO   Oct   Employed
7   7      SV   Dec   Employed
17  7      SV   Apr   Employed
27  7      SV   Nov   Employed
8   8      RU   Nov   Employed
18  8      RU   May      Other
28  8      RU   Jan         NA
9   9      GR   Sep   Employed
19  9      GR   Aug      Other
29  9      GR   Jun   Inactive
10 10      GE   Jan Unemployed
20 10      GE   Dec         NA
30 10      GE   Aug Unemployed
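My goal is to create a new data frame in which each row represents one employment spell, on the condition that the spell is both preceded and followed by a period of unemployment. That way I can include only those employment spells in which people move from unemployment into employment and back into unemployment, and calculate how long those spells last. Ideally, the result would have five variables: PersID, country, spell duration, starting month, and ending month. It should look like this: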
   ID country spell_duration starting ending
1   1      AT              5      Jan    May
11  1      AT              5      Jun    Oct
2   2      BE              7      Feb    Aug
12  2      BE              6      Jul    Dec
3   3      CH             10      Mar    Dec
13  3      CH              1      Oct    Oct
4   4      CZ              8      Apr    Nov
14  4      CZ              5      Jan    May
5   5      HR              5      May    Sep
15  5      HR              4      Feb    May
6   6      SO              2      Aug    Sep
16  6      SO              6      Mar    Aug
7   7      SV              1      Dec    Dec
17  7      SV              9      Apr    Dec
8   8      RU              8      Nov    Dec
18  8      RU              7      May    Nov
9   9      GR              3      Sep    Nov
19  9      GR              2      Aug    Sep
10 10      GE              8      Jan    Aug
20 10      GE              1      Dec    Dec
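I have already found the solution to Maria's question (How to calculate number and duration of categorical spells by ID in R), but her problem is different. I do not want the total time spent in employment, nor do I need the number of spells.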
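Posted on 2021-08-12 09:06:09
Without giving it too much thought, this is the first thing that came to mind. It is rather cumbersome, though. I am sure there are more elegant solutions, but this one does not need any additional packages.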
data <- df
Empl_spells <- data.frame(ID = c(), Start = c(), End = c())
for(user in unique(data$ID)){
  # subset per user and put the rows in calendar order
  # (assumes month holds English abbreviations as in month.abb)
  user_dat <- data[data$ID == user,]
  user_dat <- user_dat[order(match(user_dat$month, month.abb)),]
  if(nrow(user_dat) > 2){
    # initiate a list to store where changes occur and a counter for
    # entries to this list
    Changes_data <- list()
    entry <- 1
    # for every row, check if it switches from employed to unemployed
    # or the opposite. Mark with "Break" if some other entry interrupts
    for(i in 2:nrow(user_dat)){
      if(user_dat$act[i] == "Employed" &&
         user_dat$act[i-1] == "Unemployed"){
        Changes_data[[entry]] <- c("Start", i)
        entry <- entry + 1
      }else if(user_dat$act[i] == "Unemployed" &&
               user_dat$act[i-1] == "Employed"){
        Changes_data[[entry]] <- c("End", i)
        entry <- entry + 1
      }else if(user_dat$act[i] != "Employed" &&
               user_dat$act[i] != "Unemployed"){
        Changes_data[[entry]] <- c("Break", i)
        entry <- entry + 1
      }
    }
    # skip users without any recorded change (e.g. employed in every row);
    # indexing the empty data.frame would otherwise fail
    if(length(Changes_data) > 0){
      # see where an "End" follows a "Start" immediately in the new list
      Changes_df <- do.call(rbind.data.frame, Changes_data)
      EmplToUnempl <- which(Changes_df[-nrow(Changes_df), 1] == "Start" & Changes_df[-1, 1] == "End")
      if(length(EmplToUnempl) >= 1){
        append <- data.frame(ID = user,
                             Start = user_dat$month[as.numeric(Changes_df[EmplToUnempl, 2])],
                             End = user_dat$month[as.numeric(Changes_df[EmplToUnempl + 1, 2])-1])
        # append the data to the data.frame for all of the people
        Empl_spells <- rbind(Empl_spells, append)
      }
    }
  }
}
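Since I don't have your data, I haven't tested this. Is this what you want?
Edit (using lapply instead of the outer loop; this may make it faster):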
data <- df
users <- unique(data$ID)
calculate <- function(user){
  # subset per user and put the rows in calendar order
  # (assumes month holds English abbreviations as in month.abb)
  user_dat <- data[data$ID == user,]
  user_dat <- user_dat[order(match(user_dat$month, month.abb)),]
  if(nrow(user_dat) > 2){
    # initiate a list to store where changes occur and a counter for
    # entries to this list
    Changes_data <- list()
    entry <- 1
    # for every row, check if it switches from employed to unemployed
    # or the opposite. Mark with "Break" if some other entry interrupts
    for(i in 2:nrow(user_dat)){
      if(user_dat$act[i] == "Employed" &&
         user_dat$act[i-1] == "Unemployed"){
        Changes_data[[entry]] <- c("Start", i)
        entry <- entry + 1
      }else if(user_dat$act[i] == "Unemployed" &&
               user_dat$act[i-1] == "Employed"){
        Changes_data[[entry]] <- c("End", i)
        entry <- entry + 1
      }else if(user_dat$act[i] != "Employed" &&
               user_dat$act[i] != "Unemployed"){
        Changes_data[[entry]] <- c("Break", i)
        entry <- entry + 1
      }
    }
    # skip users without any recorded change (e.g. employed in every row);
    # indexing the empty data.frame would otherwise fail
    if(length(Changes_data) > 0){
      # see where an "End" follows a "Start" immediately in the new list
      Changes_df <- do.call(rbind.data.frame, Changes_data)
      EmplToUnempl <- which(Changes_df[-nrow(Changes_df), 1] == "Start" & Changes_df[-1, 1] == "End")
      if(length(EmplToUnempl) >= 1){
        append <- data.frame(ID = user,
                             Start = user_dat$month[as.numeric(Changes_df[EmplToUnempl, 2])],
                             End = user_dat$month[as.numeric(Changes_df[EmplToUnempl + 1, 2])-1])
        # return the spells found for this person
        return(append)
      }
    }
  }
  # no qualifying spell for this user
  return(NULL)
}
empl_spells <- lapply(users, FUN = calculate)
Empl_spells <- do.call(rbind.data.frame, empl_spells)
Edit #2 (calculating the duration):
# position of a month abbreviation in the calendar year (Jan = 1, ..., Dec = 12)
MonthToNumeric <- function(x){
  which(c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec") == x)
}
# number of months from Start to End, both inclusive, within one year
calcDuration <- function(Start, End){
  return(MonthToNumeric(End) - MonthToNumeric(Start) + 1)
}
Empl_spells$Duration <- mapply(FUN = calcDuration, Start = Empl_spells[, 2], End = Empl_spells[, 3])
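The Start/End/Break bookkeeping above can also be sketched with rle() in base R. This is only a minimal sketch under stated assumptions, not a tested replacement: it assumes one row per person and month (so the length of a run equals its duration in months) and English month abbreviations as in month.abb. The toy data and the helper name find_spells are made up for illustration.

```r
# Toy data (made up for illustration): one person observed in all twelve months
toy <- data.frame(
  ID      = rep(1, 12),
  country = rep("AT", 12),
  month   = month.abb,
  act     = c("Unemployed", "Employed", "Employed", "Unemployed",
              "Other", "Employed", "Unemployed", "Unemployed",
              "Employed", "Employed", "Employed", "Unemployed"),
  stringsAsFactors = FALSE
)

# find_spells() is a hypothetical helper, not part of any package.
# It expects the rows of a single person.
find_spells <- function(d) {
  d      <- d[order(match(d$month, month.abb)), ]  # calendar order
  runs   <- rle(d$act)                             # runs of identical activity
  n      <- length(runs$values)
  last   <- cumsum(runs$lengths)                   # last row of each run
  first  <- last - runs$lengths + 1                # first row of each run
  before <- c(NA, runs$values[-n])                 # activity of the previous run
  after  <- c(runs$values[-1], NA)                 # activity of the next run
  # employed runs that are directly preceded and followed by unemployment
  hit <- which(runs$values == "Employed" &
                 before == "Unemployed" & after == "Unemployed")
  data.frame(ID = d$ID[first[hit]],
             country = d$country[first[hit]],
             spell_duration = runs$lengths[hit],
             starting = d$month[first[hit]],
             ending = d$month[last[hit]],
             stringsAsFactors = FALSE)
}

# apply per person and stack the results
spells <- do.call(rbind, lapply(split(toy, toy$ID), find_spells))
print(spells)
```

On the toy data this keeps the Feb to Mar spell (2 months) and the Sep to Nov spell (3 months), and drops the June spell because it is preceded by "Other" rather than "Unemployed".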
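Posted on 2021-08-12 09:27:27
I used the data.table package to handle this; I guess a loop works as well.
Edit: I had left an extra "}" in the code and have removed it. I tried it and it works.
EDIT2: I also added "setDT(df)".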
library(data.table)
df <- fread(paste("ID country month act
1 AT Jan Unemployed
1 AT Jun Other
1 AT Nov Unemployed
2 BE Feb Employed
2 BE Jul Unemployed
2 BE Sep Unemployed
3 CH Mar Other
3 CH Oct Unemployed
3 CH Jan NA
4 CZ Apr Other
4 CZ Jan Unemployed
4 CZ Jun Unemployed
5 HR May Inactive
5 HR Feb Other
5 HR Jul Other
6 SO Aug Unemployed
6 SO Mar Other
6 SO Oct Employed
7 SV Dec Employed
7 SV Apr Employed
7 SV Nov Employed
8 RU Nov Employed
8 RU May Other
8 RU Jan NA
9 GR Sep Employed
9 GR Aug Other
9 GR Jun Inactive
10 GE Jan Unemployed
10 GE Dec NA
10 GE Aug Unemployed", collapse = '\n'))
setDT(df)
df[, monthInt := match(month, month.abb)]
df <- df[order(ID,monthInt)]
finalDt <- data.table()
for (i in unique(df[, ID])) {
  # rows of one person, already sorted by month
  tempT <- df[ID == i]
  # each row starts an interval that lasts until the month before the next row
  for (tim in 1:(nrow(tempT)-1)) {
    timT <- data.table(ID = tempT[tim,ID],
                       country = tempT[tim, country],
                       spell_duration = tempT[tim+1, monthInt] - tempT[tim, monthInt],
                       starting = month.abb[tempT[tim, monthInt]],
                       ending = month.abb[tempT[tim+1, monthInt]-1])
    finalDt <- rbind(finalDt,timT)
  }
}
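The interval computation in the loop above (each row runs until the month before the same person's next row) can also be written without growing a table row by row. The following is a minimal base R sketch of that idea, assuming English month abbreviations; the toy rows are the entries for IDs 1 and 2 from the example data, and next_month and intervals are names introduced here for illustration.

```r
# Toy rows: IDs 1 and 2 from the example data
df <- data.frame(
  ID      = c(1, 1, 1, 2, 2, 2),
  country = c("AT", "AT", "AT", "BE", "BE", "BE"),
  month   = c("Jan", "Jun", "Nov", "Feb", "Jul", "Sep"),
  stringsAsFactors = FALSE
)

# month number and calendar order within each person
df$monthInt <- match(df$month, month.abb)
df <- df[order(df$ID, df$monthInt), ]

# month number of the same person's next row (NA for the last row)
next_month <- ave(df$monthInt, df$ID, FUN = function(m) c(m[-1], NA))
keep <- !is.na(next_month)

# one interval per row that has a successor
intervals <- data.frame(
  ID             = df$ID[keep],
  country        = df$country[keep],
  spell_duration = next_month[keep] - df$monthInt[keep],
  starting       = df$month[keep],
  ending         = month.abb[next_month[keep] - 1],
  stringsAsFactors = FALSE
)
print(intervals)
```

As in the loop, the last row of each person has no successor and therefore produces no interval.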
https://stackoverflow.com/questions/68753941