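I'd like to ask a follow-up to this question, because another problem has come up: I found that some subjects (e.g., cultural studies) belong to more than one category (Arts & Humanities and Social Sciences), i.e. there is overlap that has to be taken into account.
I have a long list of classifications, for example this machine-readable example: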
AB <- c("Science","Arts & Humanities","Arts & Humanities; Social Sciences","Science","Arts & Humanities; Arts & Humanities; Social Sciences","Science","Science; Social Sciences","Social Sciences; Science")
which looks like this:
> AB
[1] "Science" "Arts & Humanities"
[3] "Arts & Humanities; Social Sciences" "Science"
[5] "Arts & Humanities; Arts & Humanities; Social Sciences" "Science"
[7] "Science; Social Sciences" "Social Sciences; Science"
I would like to edit these terms and remove the duplicates, so as to get this result:
[1] "Science" "Arts & Humanities"
[3] "Arts & Humanities; Social Sciences" "Science"
[5] "Arts & Humanities; Social Sciences" "Science"
[7] "Science; Social Sciences" "Science; Social Sciences"
So I am looking for another loop that removes the duplication in #5.
> unique(strsplit(AB, "; *"))
[[1]]
[1] "Science"
[[2]]
[1] "Arts & Humanities"
[[3]]
[1] "Arts & Humanities" "Social Sciences"
[[4]]
[1] "Arts & Humanities" "Arts & Humanities" "Social Sciences"
[[5]]
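[1] "Social Sciences" "Science"
So I would like to ask again: how can I get the correct output shown above? Thank you very much for your consideration.
Posted on 2012-10-24 17:29:48
I think this has to do with leading or trailing whitespace. If you apply this to AB, it takes care of that for you: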
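fun <- function(text.var){
    # split one entry on ";" into its individual terms
    x <- unlist(strsplit(text.var, ";"))
    # strip leading and trailing whitespace from each term
    Trim <- function(x) gsub("^\\s+|\\s+$", "", x)
    # drop repeated terms, sort them, and rejoin with "; "
    paste(sort(unique(Trim(x))), collapse="; ")
}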
sapply(AB, fun, USE.NAMES = FALSE)
yields:
> sapply(AB, fun, USE.NAMES = FALSE)
[1] "Science" "Arts & Humanities"
[3] "Arts & Humanities; Social Sciences" "Science"
[5] "Arts & Humanities; Social Sciences" "Science"
[7] "Science; Social Sciences" "Science; Social Sciences"
https://stackoverflow.com/questions/13054308
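For context on why the attempt in the question falls short: `unique()` applied to a list compares whole list elements, so it drops entries that are entirely identical but never looks inside an element, and it treats `c("Social Sciences", "Science")` and `c("Science", "Social Sciences")` as different. The following is a minimal sketch of the same split, trim, de-duplicate, sort, and rejoin idea as the function above, written with `vapply()` so the result is guaranteed to be a character vector. The helper name `dedupe_terms` and its `sep` argument are hypothetical, not part of the original answer, and `trimws()` assumes R 3.2.0 or later.

```r
AB <- c("Science", "Arts & Humanities", "Arts & Humanities; Social Sciences",
        "Science", "Arts & Humanities; Arts & Humanities; Social Sciences",
        "Science", "Science; Social Sciences", "Social Sciences; Science")

# Hypothetical helper (name and `sep` argument are illustrative assumptions).
dedupe_terms <- function(x, sep = ";") {
  # split every entry on the literal separator; gives a list of term vectors
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts, function(p) {
    p <- trimws(p)        # remove leading/trailing whitespace (R >= 3.2.0)
    p <- p[nzchar(p)]     # discard empty terms, e.g. from a trailing ";"
    # de-duplicate within the entry, sort, and rejoin with "; "
    paste(sort(unique(p)), collapse = paste0(sep, " "))
  }, character(1))
}

cleaned <- dedupe_terms(AB)
print(cleaned)
```

Sorting the terms is what makes "Social Sciences; Science" and "Science; Social Sciences" come out as the same string; if the original term order matters, dropping `sort()` would keep it, at the cost of leaving entries 7 and 8 different.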