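I have a data.frame of character variables that includes an extra metadata string (in a loosely key-value format), and I would like that metadata to be available as variables in the data.frame. The metadata variable is full of nuances and inconsistencies: some keys have multiple values of varying length (an array); not every observation has every key (so the missing ones need to be empty or NA); some keys are repeated; and sometimes "uncategorized" values appear before the more structured metadata (these can be ignored/dropped).
A more representative example, illustrating the inconsistencies in tags described above: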
dat <- data.frame(title = c("How To", "Why To", "When To"),
                  id = c("001", "005", "102"),
                  tags = c("Type: Article, Topics: solo, Length: 3.5, Topics: self help, DIY",
                           "case study, thinking, English, Type: Paper, Topics: philosophy",
                           "Language: EN, Type: Checklist, Topics: scheduling, time-management"))
The desired output would be a data.frame (or something similar, such as a tibble), for example:
#>   title   id    tags  Language Type      Length Topics
#>   <chr>   <chr> <chr> <chr>    <chr>      <dbl> <chr>
#> 1 How To  001   ...   NA       Article      3.5 solo, self help, DIY
#> 2 Why To  005   ...   NA       Paper         NA philosophy
#> 3 When To 102   ...   EN       Checklist     NA scheduling, time-management
Note: I use ... to stand for the original strings in dat. Before revising the question, I also used part of a provided solution to drop the "uncategorized" values:
gsub("(^.[^:]*, )(?=[[:alpha:]]+:)", "", tags, perl = TRUE)
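A tidyr approach is preferred, but since I have only gotten partway by piecing together solutions to similar questions, any solution would help.
Answered 2022-10-08 04:52:59
This seems to work for the example data, but there is probably a much shorter version in which a regex distinguishes the two uses of the comma (separating key-value pairs versus separating values within one key).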
library(tidyverse)
dat %>%
  separate_rows(tags, sep = ", ") %>%
  separate(tags, into = c("header", "values"), fill = "left", sep = ": ") %>%
  fill(header, .direction = "down") %>%
  group_by(title, id, header) %>%
  summarize(values = paste(values, collapse = ", "), .groups = "drop") %>%
  pivot_wider(names_from = header, values_from = values)
Result:
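# A tibble: 3 × 6
  title   id    Length Topics                      Type      Language
  <chr>   <chr> <chr>  <chr>                       <chr>     <chr>
1 How To  001   3.5    self help, DIY              Article   NA
2 When To 102   NA     scheduling, time-management Checklist EN
3 Why To  005   NA     philosophy                  Paper     NA
Edit -- with the updated data, here is a variation that treats Type as a special column. It is not clear to me how you want to handle Language and Topics values that share a title but belong to different types, but I hope this shows an approach that can be adapted.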
dat %>%
  separate_rows(tags, sep = ", ") %>%
  separate(tags, into = c("header", "values"), fill = "left", sep = ": ") %>%
  mutate(Type = if_else(header == "Type", values, NA_character_)) %>%
  fill(header, Type, .direction = "down") %>%
  filter(header != "Type") %>%
  group_by(title, id, Type, header) %>%
  summarize(values = paste(values, collapse = ", "), .groups = "drop") %>%
  pivot_wider(names_from = header, values_from = values)
# A tibble: 5 × 7
  title   id    Type      ` Topics` Length Topics                        Language
  <chr>   <chr> <chr>     <chr>     <chr>  <chr>                         <chr>
1 How To  001   Article   solo      3.5    self help, DIY                NA
2 When To 102   Checklist NA        NA     scheduling, time-management   NA
3 When To 102   Paper     NA        NA     NA                            EN
4 Why To  005   Article   NA        NA     case study, thinking, English NA
5 Why To  005   Paper     NA        NA     philosophy                    NA
Answered 2022-10-08 08:42:45
Expanding on the answer from Jon Spring, but using a regex to distinguish the two uses of the comma:
library(dplyr)
library(tidyr)
dat %>%
  separate_rows(tags, sep = "(, )(?=[[:alpha:]]+:)") %>%
  separate(tags, into = c("header", "value"), fill = "left", sep = ": ") %>%
  pivot_wider(names_from = header, values_from = value)
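#> # A tibble: 3 × 6
#>   title   id    Type      Length Topics                      Language
#>   <chr>   <chr> <chr>     <chr>  <chr>                       <chr>
#> 1 How To  001   Article   3.5    self help, DIY              <NA>
#> 2 Why To  005   Paper     <NA>   philosophy                  <NA>
#> 3 When To 102   Checklist <NA>   scheduling, time-management EN
The regex matches every ", " (a comma followed by a space) that is itself followed by one or more letters ([[:alpha:]]+, where + means "one or more") and then a :. Because the letters and the colon sit inside a lookahead, (?= ... ), they are checked but not consumed, so only the comma and space are removed by the split.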
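To see the lookahead in isolation, here is a small base R sketch on the first tags string from the question. The names x and pieces are just illustrative, and strsplit() with perl = TRUE stands in for separate_rows() here, so treat it as a demonstration of the pattern rather than of tidyr's internals.

```r
# One tags string from the question's example data
x <- "Type: Article, Topics: solo, Length: 3.5, Topics: self help, DIY"

# Split only on ", " that is followed by letters and a colon (a new key).
# The lookahead (?= ... ) checks for the key without consuming it.
pieces <- strsplit(x, ", (?=[[:alpha:]]+:)", perl = TRUE)[[1]]

print(pieces)
```

The comma inside "self help, DIY" is not followed by a key and a colon, so that value stays in one piece.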
If you want to keep the old tags, just add a line with mutate(old_tag = tags) %>% before the separate_rows line.
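Putting the pieces from this thread together, the following is one possible sketch for the example data as it appears in the question, not a definitive solution. It assumes that leading "uncategorized" values should be dropped (using a sub() pattern adapted from the question's gsub call), that repeated keys such as Topics should be pasted into one string, and that Length should end up numeric. The name out is illustrative.

```r
library(dplyr)
library(tidyr)

dat <- data.frame(
  title = c("How To", "Why To", "When To"),
  id = c("001", "005", "102"),
  tags = c("Type: Article, Topics: solo, Length: 3.5, Topics: self help, DIY",
           "case study, thinking, English, Type: Paper, Topics: philosophy",
           "Language: EN, Type: Checklist, Topics: scheduling, time-management"),
  stringsAsFactors = FALSE
)

out <- dat %>%
  # keep a copy of the original string
  mutate(old_tag = tags) %>%
  # assumption: values before the first "Key:" are uncategorized and dropped
  mutate(tags = sub("^[^:]*, (?=[[:alpha:]]+:)", "", tags, perl = TRUE)) %>%
  # split only at commas that introduce a new key
  separate_rows(tags, sep = ", (?=[[:alpha:]]+:)") %>%
  separate(tags, into = c("header", "value"), sep = ": ") %>%
  # assumption: repeated keys are merged into one comma-separated string
  group_by(title, id, old_tag, header) %>%
  summarize(value = paste(value, collapse = ", "), .groups = "drop") %>%
  pivot_wider(names_from = header, values_from = value) %>%
  mutate(Length = as.numeric(Length))

print(out)
```

Grouping by old_tag as well as title and id keeps the original string attached to each row through the summarize step.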
https://stackoverflow.com/questions/73994368