I have a dataset representing patient visits to different doctors at one clinic over the course of a year.
Example:
doctor patient_no datetime
dr.kahn 1561 1/21/19 10:30:00
dr.gould 1397 2/06/19 12:30:00
dr.amoor 1596 2/11/19 9:00:00
dr.gould 995 10/07/19 12:30:00
dr.kahn 1561 10/14/19 9:30
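I am trying to create a new data frame in which each row is a unique doctor-patient pair, showing how many times that patient visited that doctor and the average time between consecutive appointments for that pair. For example, if patient A sees dr.kahn 4 times in a year, what is the average time between patient A's consecutive appointments with dr.kahn?
Example: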
doctor patient_no number_of_visits avg_time_passed_between_appointments
dr.gould 1054 7 2 months 1 days 2:00:00
dr.gould 1099 2 5 months 10 days 00:00:00
dr.kahn 875 12 0 months 26 days 0:30:00
Any help would be greatly appreciated. Thanks!
Posted on 2020-02-19 04:58:45
Here is a dplyr approach:
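library(tidyverse)
df %>%
  # parse the "m/d/yy h:mm" strings into date-times
  mutate(datetime = lubridate::mdy_hm(datetime)) %>%
  group_by(doctor, patient_no) %>%
  # average gap = total span from first to last visit / number of gaps (count - 1);
  # a pair with a single visit has no gap, so the result is NaN
  summarize(count = n(),
            avg_days_between = (max(datetime) - min(datetime)) / lubridate::ddays(count - 1)) %>%
  ungroup()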
## A tibble: 4 x 4
# doctor patient_no count avg_days_between
# <chr> <dbl> <int> <dbl>
#1 dr.amoor 1596 1 NaN
#2 dr.gould 995 1 NaN
#3 dr.gould 1397 1 NaN
#4 dr.kahn 1561 2 266.
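The span-divided-by-(count - 1) shortcut gives the same number as averaging the individual gaps, because the consecutive differences telescope to last minus first. A small base R check of that claim, assuming four hypothetical visit times that are not taken from the question's data:

```r
# Hypothetical visit times for one doctor-patient pair (illustration only)
times <- as.POSIXct(
  c("2019-01-07 09:00", "2019-02-04 09:30", "2019-03-01 10:00", "2019-04-15 09:00"),
  tz = "UTC"
)

# Gaps between consecutive visits, in days
gap_days <- as.numeric(diff(sort(times)), units = "days")
mean_of_gaps <- mean(gap_days)

# Total span from first to last visit, divided by the number of gaps
span_days <- as.numeric(difftime(max(times), min(times), units = "days"))
span_over_n <- span_days / (length(times) - 1)

# TRUE when the two calculations agree
all.equal(mean_of_gaps, span_over_n)
```

This only holds for the mean. A median or other summary still needs the individual gaps.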
Alternatively, you can compute each individual lag and use a different summary, such as the median, to characterize the typical gap.
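df %>%
  # parse first, then sort so lag() compares consecutive visits
  mutate(datetime = lubridate::mdy_hm(datetime)) %>%
  arrange(doctor, patient_no, datetime) %>%
  group_by(doctor, patient_no) %>%
  # days since the previous visit; NA for each pair's first visit
  mutate(days_since_last = as.numeric(difftime(datetime, lag(datetime), units = "days"))) %>%
  # na.rm drops the first-visit NA so it does not pull the median toward 0
  summarize(count = n(),
            median_time_between = median(days_since_last, na.rm = TRUE))
## A tibble: 4 x 4
# Groups: doctor [3]
# doctor patient_no count median_time_between
# <chr> <dbl> <int> <dbl>
#1 dr.amoor 1596 1 NA
#2 dr.gould 995 1 NA
#3 dr.gould 1397 1 NA
#4 dr.kahn 1561 2 266.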
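The question asks for the gap displayed as days plus a clock time rather than as a decimal number of days. One possible way to get close is `lubridate::seconds_to_period()`. The sketch below assumes a hypothetical average of 26 days and 30 minutes. Months are left out on purpose, since a number of days cannot be converted to calendar months without a reference date.

```r
library(lubridate)

# Hypothetical average gap: 26 days plus half an hour, expressed in days
avg_days <- 26 + 0.5 / 24

# Convert days to whole seconds, then to a Period with day/hour/minute parts
avg_period <- seconds_to_period(round(avg_days * 86400))

avg_period
```

In a summarized data frame the same conversion could be applied to the numeric column inside `mutate()`.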
Sample data
df <- tibble::tribble(
~doctor, ~patient_no, ~datetime,
"dr.kahn", 1561, "1/21/19 10:30",
"dr.gould", 1397, "2/6/19 12:30",
"dr.amoor", 1596, "2/11/19 9:00",
"dr.gould", 995, "10/7/19 12:30",
"dr.kahn", 1561, "10/14/19 9:30"
)
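The sample data above stores times as hour:minute only, while the rows in the question mix "10:30:00" and "9:30". If the real column mixes both forms, one option (an assumption about the data, not something the answer tested) is `mdy_hms()` with `truncated = 1`, which lets the seconds field be missing:

```r
library(lubridate)

# Two strings copied from the question: one with seconds, one without
raw <- c("1/21/19 10:30:00", "10/14/19 9:30")

# truncated = 1 allows the last component (seconds) to be absent
parsed <- mdy_hms(raw, truncated = 1)

parsed
```

Any value that still fails to parse comes back as `NA`, so checking `anyNA(parsed)` before summarizing is a cheap safeguard.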
https://stackoverflow.com/questions/60293012