Amid all the mess, here is the problem:
data = readLines("file.txt")
# data reads
[1] "JESSICA [Day 1, 9:00 A.M.]: When there is sun, there was darkness."
[2] " However, nobody knew it was happening."
[3] " SAM [Day 1, 9:01 A.M.]: I thought it was not true."
[4] " But it was."
[5] " I thought it was "present" but it wasn't."
What I am trying to do is: (1) merge the text by name (JESSICA, SAM).
I can identify the names in the data:
test = regexpr("^([A-Z]+ \\[)",data)
names = regmatches(data,test)
final.name = sub(" \\[","",names)  # drop the trailing " [" so only the name remains
[1] "JESSICA" "SAM"
I can also identify the date and time in the data:
test = regexpr("\\[(.*)\\]", data)
time = regmatches(data,test)
[1] "[Day 1, 9:00 A.M.]" "[Day 1, 9:01 A.M.]"
Where I am stuck is merging the separate lines that belong to each name. That is, instead of this:
[1] "JESSICA [Day 1, 9:00 A.M.]: When there is sun, there was darkness."
[2] " However, nobody knew it was happening."
I would like each line to be:
[1] "JESSICA [Day 1, 9:00 A.M.]: When there is sun, there was darkness. However, nobody knew it was happening."
[2] " SAM [Day 1, 9:01 A.M.]: I thought it was not true. But it was. I thought it was "present" but it wasn't."
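Answered on 2019-04-08 09:49:19
The logic is similar to the now-deleted answer by @Maurits. We can create groups based on where an entry of final.name occurs, and then summarise the text by pasting together all lines within each group. I have assumed data to be a single-column data frame, because a data frame is easier to work with here than a plain character vector.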
library(dplyr)
data %>%
group_by(group = cumsum(grepl(paste0(final.name, collapse = "|"), statement))) %>%
summarise(statement = paste0(statement, collapse = " ")) %>%
ungroup() %>%
select(-group)
#statement
# <chr>
#1 JESSICA [Day 1, 9:00 A.M.]: When there is sun, there was darkness. However, nobody knew it was happening.
#2 SAM [Day 1, 9:01 A.M.]: I thought it was not true. But it was. I thought it was present but it wasn't.
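To make the grouping step easier to follow, here is a small sketch of what the cumsum(grepl(...)) expression evaluates to on the sample data. The intermediate names pattern, starts and group are introduced only for illustration and are not part of the answer's code.

```r
# Sample lines, mirroring the data frame used in this answer
statement <- c(
  "JESSICA [Day 1, 9:00 A.M.]: When there is sun, there was darkness.",
  " However, nobody knew it was happening.",
  "SAM [Day 1, 9:01 A.M.]: I thought it was not true.",
  " But it was.",
  " I thought it was present but it wasn't.")
final.name <- c("JESSICA", "SAM")

# Illustrative intermediate variables (hypothetical names)
pattern <- paste0(final.name, collapse = "|")  # "JESSICA|SAM"
starts  <- grepl(pattern, statement)           # TRUE where a line mentions a name
group   <- cumsum(starts)                      # increases by one at each TRUE

starts
# [1]  TRUE FALSE  TRUE FALSE FALSE
group
# [1] 1 1 2 2 2
```

Every line therefore carries the index of the most recent speaker line, which is what group_by() needs. Note that the pattern is not anchored, so a continuation line that happens to contain "SAM" or "JESSICA" would also start a new group.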
Using a base R approach, we can use aggregate:
aggregate(statement~cumsum(grepl(paste0(final.name, collapse = "|"), statement)),
data, paste0, collapse = " ")[2]
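Since readLines() returns a character vector rather than a data frame, the same idea can probably be applied to that vector directly. The following is only a sketch under two assumptions that are not stated in the question: a speaker line starts with optional whitespace, an upper-case name, and then " [", and leading or trailing whitespace on each line can be trimmed. The names lines, is_start, group and merged are hypothetical.

```r
# Lines as readLines("file.txt") might return them (sample data only)
lines <- c(
  "JESSICA [Day 1, 9:00 A.M.]: When there is sun, there was darkness.",
  " However, nobody knew it was happening.",
  "SAM [Day 1, 9:01 A.M.]: I thought it was not true.",
  " But it was.",
  " I thought it was present but it wasn't.")

# Assumption: a new speaker line looks like "NAME [" at the start of the line
is_start <- grepl("^[[:space:]]*[A-Z]+ \\[", lines)
group    <- cumsum(is_start)

# Split the trimmed lines by group and paste each group into one string
merged <- unname(vapply(split(trimws(lines), group),
                        paste, character(1), collapse = " "))
merged
```

Anchoring the pattern at the start of the line avoids starting a new group when a name merely appears inside someone's sentence, at the cost of relying on the assumed line format.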
data
data <- data.frame(statement = c(
"JESSICA [Day 1, 9:00 A.M.]: When there is sun, there was darkness.",
" However, nobody knew it was happening.",
"SAM [Day 1, 9:01 A.M.]: I thought it was not true.",
" But it was.",
" I thought it was present but it wasn't."))
final.name <- c("JESSICA", "SAM")
https://stackoverflow.com/questions/55564963