My input is a CSV file containing the table below, with Sentence and Class columns.
Sentence                             Class
Joe just joined Alice on the set.    B
Alexis buys green apples             C
Yesterday, two friends unite.        A
Combination between x and y!         A

Each class has a ranked list of keywords (not stored in the CSV):
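Class A keyword list    Class B keyword list    Class C keyword list
unite                   joined                  buy
combination             join                    buys
together                merge                   bought

My output needs to be a CSV with added columns holding the words that come before and after the highest-ranked keyword, taken from that class's keyword list, that appears in the sentence. (The original post showed the expected output as an image.)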

Note that some columns are blank because the corresponding word does not exist in that sentence.
How can I do this in R?
Posted on 2021-01-09 16:15:26
Assuming you have imported the file and converted it to the following format:
library(tidyverse)

df <- tribble(
  ~Sentence, ~Class,
  "Joe just joined Alice on the set.", "B",
  "Alexis buys green apples", "C",
  "Yesterday, two friends unite.", "A",
  "Combination between x and y!", "A"
)

# Ranked keyword list for each class (highest rank first)
kw_list <- list(
  A = c("unite", "combination", "together"),
  B = c("joined", "join", "merge"),
  C = c("buy", "buys", "bought")
)

You can then obtain the data frame specified in the image as follows:
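result <- df %>% mutate(res = map2(Sentence, Class, function(sentence, class) {
  # Strip , . ! (and parentheses, which this character class also matches),
  # split on spaces, and lowercase every word
  word_list <- sentence %>% str_replace_all("[(,)(\\.)(!)]", "") %>%
    str_split(" ") %>% .[[1]] %>% str_to_lower()
  # Words present in both the sentence and the class keyword list,
  # found as duplicates in the combined vector
  kws <- word_list %>% c(kw_list[[class]]) %>% .[duplicated(.)]
  if (length(kws) == 0) {
    return(NA)  # no keyword of this class in the sentence
  } else {
    kws %>% map(function(kw) {
      position <- str_which(word_list, str_c("^", kw, "$"))
      # Up to three words before the keyword, laid out as 3rd, 2nd, 1st
      left_kw <- if (position != 1) {
        word_list[1:(position - 1)] %>% rev() %>% .[1:3] %>%
          tibble(name = c("1st", "2nd", "3rd") %>% str_c(" word from left"), value = .) %>%
          arrange(desc(name)) %>% pivot_wider()
      } else {
        NULL
      }
      # Up to three words after the keyword, laid out as 1st, 2nd, 3rd
      right_kw <- if (position != length(word_list)) {
        word_list[(position + 1):length(word_list)] %>% .[1:3] %>%
          tibble(name = c("1st", "2nd", "3rd") %>% str_c(" word from right"), value = .) %>%
          pivot_wider()
      } else {
        NULL
      }
      bind_cols(left_kw, tibble(`key word` = kw), right_kw)
    }) %>% reduce(bind_rows) %>% return()
  }
})) %>% unnest(cols = res)

This handles both sentences that contain several keywords and sentences that contain none.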
Note that all letters are converted to lowercase, and that it will not work correctly if a sentence contains symbols other than , . and !
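If other punctuation is a concern, one possible workaround is to strip every punctuation character before splitting. The base R sketch below is only an illustration: `tokenize` is a hypothetical helper name, not part of the answer's code, and it assumes that dropping all punctuation (including apostrophes and hyphens) is acceptable for your data.

```r
# Hypothetical helper (not from the original answer): lowercase a sentence,
# remove every punctuation character, and split it on whitespace.
tokenize <- function(sentence) {
  cleaned <- gsub("[[:punct:]]", "", tolower(sentence))
  words <- strsplit(cleaned, "[[:space:]]+")[[1]]
  words[nzchar(words)]  # drop empty strings left by leading whitespace
}

tokenize("Combination between x and y!")
tokenize("Wait... did Joe (really) join?")
```

Because `[[:punct:]]` also removes apostrophes, a word such as "don't" becomes "dont", so keywords containing punctuation would need the same treatment before matching.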
Admittedly, this is rather long and not the best solution, but I hope it helps.
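For comparison, here is a shorter base R sketch that keeps only the highest-ranked keyword per sentence, which is what the question literally asks for, and then writes the result to a CSV. It is a sketch under assumptions rather than a definitive version: `keyword_context`, `pad`, the column names `left3` ... `right3`, and the temporary output path are all made up for illustration, and missing words are written as empty strings.

```r
# Sample data from the question
df <- data.frame(
  Sentence = c("Joe just joined Alice on the set.", "Alexis buys green apples",
               "Yesterday, two friends unite.", "Combination between x and y!"),
  Class = c("B", "C", "A", "A"),
  stringsAsFactors = FALSE
)
kw_list <- list(
  A = c("unite", "combination", "together"),
  B = c("joined", "join", "merge"),
  C = c("buy", "buys", "bought")
)

# Hypothetical helper: pad or truncate a character vector to exactly n entries
pad <- function(x, n) c(x, rep("", n))[seq_len(n)]

# Hypothetical helper: one-row data frame with up to n words on each side of
# the highest-ranked keyword present in the sentence (blank if none is found)
keyword_context <- function(sentence, keywords, n = 3) {
  words <- strsplit(gsub("[[:punct:]]", "", tolower(sentence)), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  hit <- keywords[keywords %in% words][1]  # keywords are ranked, so take the first match
  before <- character(0)
  after <- character(0)
  if (!is.na(hit)) {
    pos <- match(hit, words)               # first occurrence of the keyword
    before <- rev(words[seq_len(pos - 1)]) # nearest word first
    after <- words[-seq_len(pos)]
  }
  out <- c(rev(pad(before, n)), if (is.na(hit)) "" else hit, pad(after, n))
  names(out) <- c(paste0("left", n:1), "keyword", paste0("right", 1:n))
  as.data.frame(as.list(out), stringsAsFactors = FALSE)
}

rows <- lapply(seq_len(nrow(df)), function(i) {
  keyword_context(df$Sentence[i], kw_list[[df$Class[i]]])
})
result <- cbind(df, do.call(rbind, rows))

# Assumed output location; replace with a real path as needed
out_path <- tempfile(fileext = ".csv")
write.csv(result, out_path, row.names = FALSE)
```

Using `match` means that only the first occurrence of a repeated keyword is considered, which may or may not be what you want.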
https://stackoverflow.com/questions/65636910