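This question is very similar to "Using pmap to apply different regular expressions to different variables in a tibble?", but differs in that I've realized my example there was not enough to describe my problem.
I am trying to apply different regular expressions to different variables in a tibble. For example, I've created a tibble that lists 1) the name of the variable I want to modify, 2) the regular expression I want to match, and 3) the replacement string. I'd like to apply each regex/replacement to the variables of a different data frame. Note that the target tibble may contain variables I don't want to modify, and the row order of my "configuration" tibble may not correspond to the column/variable order of my "target" tibble.
So my "configuration" tibble might look like this: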
test_config <- dplyr::tibble(
  string_col = c("col1", "col2", "col4", "col3"),
  pattern = c("^\\.$", "^NA$", "^$", "^NULL$"),
  replacement = c("", "", "", "")
)
I'd like to apply this to a target tibble:
test_target <- dplyr::tibble(
  col1 = c("Foo", "bar", ".", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "NA", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", "NULL"),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)
So the goal is to replace a different string with an empty string in each user-specified column/variable of test_target.
The result should look like this:
result <- dplyr::tibble(
  col1 = c("Foo", "bar", "", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", ""),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)
I can do what I want with a for loop, like this:
for (i in seq(nrow(test_config))) {
  test_target <- dplyr::mutate_at(test_target,
    .vars = dplyr::vars(
      tidyselect::matches(test_config$string_col[[i]])),
    .funs = dplyr::funs(
      stringr::str_replace_all(
        ., test_config$pattern[[i]],
        test_config$replacement[[i]]))
  )
}
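Is there a tidier way to do this instead? So far I've assumed purrr::pmap is the tool for the job, so I've written a function that takes a data frame, a variable name, a regular expression and a replacement value, and returns the data frame with that single variable modified. It behaves as expected: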
testFun <- function(df, colName, regex, repVal){
  colName <- dplyr::enquo(colName)
  df <- dplyr::mutate_at(df,
    .vars = dplyr::vars(
      tidyselect::matches(!!colName)),
    .funs = dplyr::funs(
      stringr::str_replace_all(., regex, repVal))
  )
}
# try with example
out <- testFun(test_target,
               test_config$string_col[[1]],
               test_config$pattern[[1]],
               "")
However, I run into a couple of problems when I try to use this function in pmap: 1) Is there a better way than this to build the list for the pmap call?
purrr::pmap(
  list(test_target,
       test_config$string_col,
       test_config$pattern,
       test_config$replacement),
  testFun
)
2) When I call pmap, I get an error:
Error: Element 2 has length 4, not 1 or 5.
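So pmap is unhappy when I pass a tibble of length 5 as one element of a list whose other elements all have length 4 (I thought it would recycle the tibble).
Note also that earlier, when I called pmap with a 4-row tibble, I got a different error: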
Error in UseMethod("tbl_vars") :
no applicable method for 'tbl_vars' applied to an object of class "character"
Called from: tbl_vars(tbl)
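A small sketch of what seems to be behind both errors, assuming only that a tibble, like any data frame, is a list of its columns: pmap() iterates over the elements of each list entry, so a bare tibble is walked column by column instead of being passed whole.

```r
library(dplyr)

test_target <- tibble(
  col1 = c("Foo", "bar", ".", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "NA", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", "NULL"),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)

# A tibble is a list of columns, so its length is the number of columns
length(test_target)        # 5

# Wrapped in list(), it becomes a single element that pmap() can recycle
length(list(test_target))  # 1

# Each element of the bare tibble is a plain character vector, not a tibble
class(test_target[[1]])    # "character"
```

This would explain why the 5-column tibble clashes with the length-4 config vectors. When the lengths do happen to line up, the function receives a character column as df, which is consistent with the tbl_vars error.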
Can any of you suggest a way to use pmap to do what I want, or a different or better approach to the problem?
Thanks!
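Answered 2018-10-31 04:12:39
I have no experience with purrr and dplyr, but here is an approach using data.table. It could be ported to dplyr with a bit of googling :)
In terms of interpretability, the loop approach is arguably better, since it is simpler.
Edit: made some changes to the code; in the end it does not use purrr.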
# alternative with data.table
library(data.table)
library(dplyr)
# objects
test_config <- dplyr::tibble(
  string_col = c("col1", "col2", "col4", "col3"),
  pattern = c("^\\.$", "^NA$", "^$", "^NULL$"),
  replacement = c("", "", "", "")
)
test_target <- dplyr::tibble(
  col1 = c("Foo", "bar", ".", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "NA", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", "NULL"),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)
multiColStringReplace <- function(test_target, test_config){
  # data.table conversion
  test_target <- as.data.table(test_target)
  test_config <- as.data.table(test_config)
  # adding an id column, as I'm reshaping the data, helps for identification of rows
  # throughout the process
  test_target[, id := 1:.N]
  # wide to long format
  test_target2 <- melt(test_target, id.vars = "id")
  # pull in the configuration, can join up on one column now
  test_target2 <- merge(test_target2, test_config, by.x = "variable",
                        by.y = "string_col", all.x = TRUE)
  # start from the original values, then overwrite only the rows whose column
  # has a config entry (pattern is not NA). gsub() takes a single pattern,
  # so it is run per (id, variable) group, i.e. one cell at a time.
  # Would like to see a better option too!
  test_target2[, result := value]
  test_target2[!is.na(pattern), result := gsub(pattern, replacement, value),
               by = .(id, variable)]
  # cast from long to original format, and drop the id
  output <- dcast(test_target2, id ~ variable,
                  value.var = "result")
  output[, id := NULL]
  # back to tibble
  output <- as_tibble(output)
  return(output)
}
output <- multiColStringReplace(test_target, test_config)
output
result <- dplyr::tibble(
  col1 = c("Foo", "bar", "", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", ""),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)
output == result
# compare with old method
old <- test_target
for (i in seq(nrow(test_config))) {
  old <- dplyr::mutate_at(old,
    .vars = dplyr::vars(
      tidyselect::matches(test_config$string_col[[i]])),
    .funs = dplyr::funs(
      stringr::str_replace_all(
        ., test_config$pattern[[i]],
        test_config$replacement[[i]]))
  )
}
old == result
# speed improves, but complexity rises
microbenchmark::microbenchmark("old" = {
  old <- test_target
  for (i in seq(nrow(test_config))) {
    old <- dplyr::mutate_at(old,
      .vars = dplyr::vars(
        tidyselect::matches(test_config$string_col[[i]])),
      .funs = dplyr::funs(
        stringr::str_replace_all(
          ., test_config$pattern[[i]],
          test_config$replacement[[i]]))
    )
  }
},
"data.table" = {
  multiColStringReplace(test_target, test_config)
}, times = 20)
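For readers who would rather stay in dplyr, here is one possible sketch. It is not a port of the reshape-and-merge approach above but a different technique: it assumes dplyr 1.0.0 or later and uses across() with cur_column() to look up each column's pattern in a named vector. The helper names patterns and replacements are introduced here purely for illustration. Note that mutate_at() has since been superseded by across() and funs() is deprecated, and that all_of() selects columns by exact name, whereas matches() treats its argument as a regular expression.

```r
library(dplyr)
library(stringr)

test_config <- tibble(
  string_col  = c("col1", "col2", "col4", "col3"),
  pattern     = c("^\\.$", "^NA$", "^$", "^NULL$"),
  replacement = c("", "", "", "")
)
test_target <- tibble(
  col1 = c("Foo", "bar", ".", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "NA", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", "NULL"),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)

# Lookup vectors keyed by column name (hypothetical helper names)
patterns     <- setNames(test_config$pattern, test_config$string_col)
replacements <- setNames(test_config$replacement, test_config$string_col)

# Modify only the configured columns; cur_column() gives the name of the
# column currently being processed, which indexes the lookup vectors
out <- mutate(
  test_target,
  across(
    all_of(test_config$string_col),
    ~ str_replace_all(.x, patterns[[cur_column()]], replacements[[cur_column()]])
  )
)
out
```

Because the lookup is by name, the row order of test_config does not need to match the column order of test_target, and columns without a config entry (col5 here) are left untouched.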
Answered 2018-10-31 05:14:44
For posterity: I can also get this done by wrapping the tibble in a list and passing it to pmap_dfr (though this is not a good solution):
purrr::pmap_dfr(
  list(list(test_target),
       test_config$string_col,
       test_config$pattern,
       test_config$replacement),
  testFun
) %>% dplyr::distinct()
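Although it works, this is not a good solution: it recycles the single element of the list(test_target) list, effectively creating a copy of the tibble for every row of test_config as it iterates over the arguments, and then binds the rows of the 4 resulting copies of test_target together into one large final tibble (which I'm filtering with distinct()).
Perhaps there is some way to use a <<- style approach to avoid copying the target tibble, but that is even stranger and worse.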
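One way to avoid both the per-row copies and <<- might be a fold: purrr::reduce() can thread the tibble through the config rows, handing the output of one replacement to the next. The sketch below is only an illustration under that assumption; apply_rule is a hypothetical helper that is not part of the original post, and it indexes columns by exact name rather than using matches(). It also applies every rule to the same tibble, whereas each copy in the pmap_dfr version appears to carry only one rule.

```r
library(dplyr)
library(purrr)
library(stringr)

test_config <- tibble(
  string_col  = c("col1", "col2", "col4", "col3"),
  pattern     = c("^\\.$", "^NA$", "^$", "^NULL$"),
  replacement = c("", "", "", "")
)
test_target <- tibble(
  col1 = c("Foo", "bar", ".", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "NA", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", "NULL"),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)

# Hypothetical helper: apply config row i to df and return the updated df
apply_rule <- function(df, i) {
  col <- test_config$string_col[[i]]
  df[[col]] <- str_replace_all(
    df[[col]],
    test_config$pattern[[i]],
    test_config$replacement[[i]]
  )
  df
}

# Fold over the config row indices, starting from the untouched target
out <- reduce(seq_len(nrow(test_config)), apply_rule, .init = test_target)
out
```

The accumulator is the tibble itself, so the result keeps the original five rows and no distinct() call is needed.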
Answered 2018-10-31 06:49:11
FYI, benchmark results: the "awkward tidy" approach suggested by @camille is the winner on my hardware!
Unit: milliseconds
          expr       min        lq      mean    median        uq      max neval
          loop 14.808278 16.098818 17.937283 16.811716 20.438360 24.38021    20
 pmap_function  9.486146 10.157526 10.978879 10.628205 11.112485 15.39436    20
     nice_tidy  8.313868  8.633266  9.597485  8.986735  9.870532 14.32946    20
  awkward_tidy  1.535919  1.639706  1.772211  1.712177  1.783465  2.87615    20
    data.table  5.611538  5.652635  8.323122  5.784507  6.359332 51.63031    20
https://stackoverflow.com/questions/53071578