Q: Using pmap to apply different regular expressions to different variables in a tibble?

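Stack Overflow user

Asked 2018-10-31 03:28:35

Answers: 3 · Views: 106 · Followers: 0 · Votes: 1

This question is very similar to "Using pmap to apply different regular expressions to different variables in a tibble?", but differs in that I have realized my earlier example was not sufficient to describe my problem.

I am trying to apply different regular expressions to different variables in a tibble. For example, I have made a tibble listing 1) the name of the variable I want to modify, 2) the regular expression I want to match, and 3) the replacement string. I would like to apply each regex/replacement pair to the corresponding variable in a different data frame. Note that the target tibble may contain variables I do not want to modify, and the row order in my "config" tibble may not correspond to the column/variable order in my "target" tibble.

So my "config" tibble could look like this: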

test_config <-  dplyr::tibble(
  string_col = c("col1", "col2", "col4", "col3"),
  pattern = c("^\\.$", "^NA$", "^$", "^NULL$"),
  replacement = c("","","", "")
)

And I would like to apply this to a target tibble:

test_target <- dplyr::tibble(
  col1 = c("Foo", "bar", ".", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "NA", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", "NULL"),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)

So the goal is to replace a different string with an empty string in each user-specified column/variable of test_target.

The result should look like this:

result <- dplyr::tibble(
  col1 = c("Foo", "bar", "", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", ""),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)

I can do what I want with a for loop, like this:

for (i in seq(nrow(test_config))) {
  test_target <- dplyr::mutate_at(test_target,
                   .vars = dplyr::vars(
                     tidyselect::matches(test_config$string_col[[i]])),
                   .funs = dplyr::funs(
                     stringr::str_replace_all(
                       ., test_config$pattern[[i]], 
                       test_config$replacement[[i]]))
  )
}
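One detail of the loop above worth illustrating: tidyselect::matches() treats its argument as a regular expression over column names, so a config entry of "col1" would also select a column named "col10". The sketch below uses toy data that is not from the question, and assumes dplyr 1.0.0 or later for across() and all_of(), which select by exact name.

```r
library(dplyr)
library(stringr)

# Toy data (an assumption for illustration): "col1" is a prefix of "col10".
toy <- tibble(col1 = c(".", "a"), col10 = c(".", "b"))

# matches() does a regex match on names, so both columns are selected.
regex_selected <- names(select(toy, matches("col1")))

# all_of() selects by exact name, so only col1 is selected.
exact_selected <- names(select(toy, all_of("col1")))

# Apply the replacement to the exactly named column only.
exact <- mutate(toy, across(all_of("col1"), ~ str_replace_all(.x, "^\\.$", "")))
```

With exact selection, col10 keeps its "." value while col1 has it replaced.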

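Is there a tidier way to do this instead? So far, my thinking is that purrr::pmap is the tool for the job, so I have written a function that takes a data frame, a variable name, a regular expression, and a replacement value, and returns the data frame with that single variable modified. It behaves as expected: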

testFun <- function(df, colName, regex, repVal){
  colName <- dplyr::enquo(colName)
  df <- dplyr::mutate_at(df,
                         .vars = dplyr::vars(
                           tidyselect::matches(!!colName)),
                         .funs = dplyr::funs(
                           stringr::str_replace_all(., regex, repVal))
  )
}

# try with example
out <- testFun(test_target, 
               test_config$string_col[[1]], 
               test_config$pattern[[1]], 
               "")

However, I run into several problems when I try to use that function with pmap: 1) Is there a better way to build the list for the pmap call than this?

purrr::pmap(
    list(test_target, 
         test_config$string_col, 
         test_config$pattern, 
         test_config$replacement),
    testFun
)
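On the first point, a data frame is itself a list of equal-length columns, so pmap can take the config tibble directly and match its column names to the function's argument names. The following is only a sketch with a two-row config; the describing function is a hypothetical stand-in, not testFun.

```r
library(purrr)
library(tibble)

# A cut-down config tibble in the same shape as test_config.
config <- tibble(
  string_col  = c("col1", "col2"),
  pattern     = c("^\\.$", "^NA$"),
  replacement = c("", "")
)

# pmap_chr() iterates over rows; each column is passed by name,
# so the argument names must match the column names.
described <- pmap_chr(config, function(string_col, pattern, replacement) {
  paste0(string_col, ": '", pattern, "' -> '", replacement, "'")
})
```

Passing the tibble this way avoids spelling out each column with $, at the cost of tying the function's argument names to the config's column names.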

2) When I call pmap, I get an error:

Error: Element 2 has length 4, not 1 or 5.

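So pmap is unhappy when I try to pass a tibble of length 5 as one element of a list whose other elements have length 4 (I thought it would recycle the tibble).

Also note that earlier, when I called pmap with a 4-row tibble, I got a different error: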

Error in UseMethod("tbl_vars") : 
  no applicable method for 'tbl_vars' applied to an object of class "character"
Called from: tbl_vars(tbl)
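Both errors are consistent with how pmap measures its inputs: the length of a tibble is its number of columns, and pmap iterates over a tibble element column by column rather than passing it whole. A plausible reading of the second error is that the function then received a character column where it expected a data frame. The sketch below uses toy data that is not from the question; wrapping the tibble in list() gives a length-1 element that pmap recycles.

```r
library(purrr)
library(tibble)

# Toy target (an assumption for illustration): three character columns.
target <- tibble(a = c("x", "y"), b = c("x", "y"), c = c("x", "y"))

n_as_tibble <- length(target)        # number of columns, not rows
n_wrapped   <- length(list(target))  # the whole tibble as a single element

# Iterating over a tibble yields its columns, here plain character vectors.
col_classes <- map_chr(target, class)

# A length-1 element is recycled, so every call sees the whole tibble.
seen <- pmap_chr(
  list(list(target), c("a", "b")),
  function(df, col) class(df[[col]])
)
```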

Can any of you suggest a way to use pmap to do what I want, or a different or better approach to the problem?

Thanks!

purrr

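3 Answers

Stack Overflow user

Answered 2018-10-31 04:12:39

I have no experience with purrr and dplyr, but here is an approach using data.table. It could be ported to dplyr with a bit of googling :)

In terms of interpretability, the loop approach is arguably better, since it is simpler.

Edit: made some changes to the code, and in the end did not use purrr.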

# alternative with data.table
library(data.table)
library(dplyr)

# objects
test_config <-  dplyr::tibble(
  string_col = c("col1", "col2", "col4", "col3"),
  pattern = c("^\\.$", "^NA$", "^$", "^NULL$"),
  replacement = c("","","", "")
)
test_target <- dplyr::tibble(
  col1 = c("Foo", "bar", ".", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "NA", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", "NULL"),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)

multiColStringReplace <- function(test_target, test_config){

  # data.table conversion
  test_target <- as.data.table(test_target)
  test_config <- as.data.table(test_config)

  # adding an id column, as I'm reshaping the data, helps for identification of rows
  # throughout the process
  test_target[,id:=1:.N]

  # wide to long format
  test_target2 <- melt(test_target, id.vars="id")
  head(test_target2)

  # pull in the configuration, can join up on one column now
  test_target2 <- merge(test_target2, test_config, by.x="variable",
                        by.y="string_col", all.x=TRUE)

  # this bit still looks messy to me. Start from the original value, then for
  # rows that have a configured pattern, run gsub one (id, variable) group at
  # a time so that pattern and replacement are single strings. Rows with no
  # pattern keep their original value. Would like to see a better option too!
  test_target2[, result := value]
  test_target2[!is.na(pattern), result := gsub(pattern, replacement, value),
               by = .(id, variable)]

  # cast from long back to the original wide format, and drop the id
  output <- dcast(test_target2, id~variable,
                  value.var = "result")
  output[, id := NULL]

  # back to tibble
  output <- as_tibble(output)

  return(output)

}

output <- multiColStringReplace(test_target, test_config)
output

result <- dplyr::tibble(
  col1 = c("Foo", "bar", "", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", ""),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)
output == result

# compare with old method
old <- test_target
for (i in seq(nrow(test_config))) {
  old <- dplyr::mutate_at(old,
                          .vars = dplyr::vars(
                            tidyselect::matches(test_config$string_col[[i]])),
                          .funs = dplyr::funs(
                            stringr::str_replace_all(
                              ., test_config$pattern[[i]], 
                              test_config$replacement[[i]]))
  )
}
old == result

# speed improves, but complexity rises
microbenchmark::microbenchmark("old" = {
  old <- test_target
  for (i in seq(nrow(test_config))) {
    old <- dplyr::mutate_at(old,
                            .vars = dplyr::vars(
                              tidyselect::matches(test_config$string_col[[i]])),
                            .funs = dplyr::funs(
                              stringr::str_replace_all(
                                ., test_config$pattern[[i]], 
                                test_config$replacement[[i]]))
    )
  }
},
"data.table" = {
  multiColStringReplace(test_target, test_config)
}, times = 20)
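For comparison, a shorter data.table sketch that skips the melt/merge/dcast round trip and updates each configured column in place with set(). This is not the answer's method, only an illustration on cut-down toy data; note that set() modifies the table by reference, so copy() the input first if the original must be preserved.

```r
library(data.table)

# Cut-down config and target (assumptions for illustration).
config <- data.table(
  string_col  = c("col1", "col2"),
  pattern     = c("^\\.$", "^NA$"),
  replacement = c("", "")
)
target <- data.table(
  col1 = c("Foo", ".", "NA"),
  col2 = c("Foo", ".", "NA"),
  col3 = c("Foo", ".", "NA")
)

# Work on a copy so that the original target is left untouched.
updated <- copy(target)
for (i in seq_len(nrow(config))) {
  col <- config$string_col[[i]]
  # set() assigns the column by reference, selecting it by exact name.
  set(updated, j = col,
      value = gsub(config$pattern[[i]], config$replacement[[i]], updated[[col]]))
}
```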

Votes: 1

Stack Overflow user

Answered 2018-10-31 05:14:44

For posterity: I can also accomplish the task if I wrap the tibble in a list and pass it to pmap_dfr (though this is not a good solution):

purrr::pmap_dfr(
  list(list(test_target),
       test_config$string_col,
       test_config$pattern,
       test_config$replacement),
  testFun
) %>% dplyr::distinct()

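While it works, this is not a good solution: it recycles the single element of the test_target list, effectively creating a copy of the tibble for each row of test_config as it iterates over the arguments, and then row-binds the four resulting copies of test_target into one large output tibble (which I am filtering with distinct()).

Maybe there is some way to avoid copying the target tibble with an approach based on something like <<-, but that is weirder and worse.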
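One way to thread a single tibble through every config row without stacking copies or reaching for <<- is a fold: purrr::reduce() passes the tibble returned by one step into the next. The sketch below is an assumption-laden illustration, not the thread's accepted approach; apply_rule is a hypothetical helper that selects the column by exact name, and the data are the test_config and test_target from the question.

```r
library(purrr)
library(stringr)
library(tibble)

test_config <- tibble(
  string_col  = c("col1", "col2", "col4", "col3"),
  pattern     = c("^\\.$", "^NA$", "^$", "^NULL$"),
  replacement = c("", "", "", "")
)
test_target <- tibble(
  col1 = c("Foo", "bar", ".", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "NA", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", "NULL"),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)

# Hypothetical helper: apply config row i to df and return the modified df.
apply_rule <- function(df, i, config) {
  col <- config$string_col[[i]]
  df[[col]] <- str_replace_all(df[[col]], config$pattern[[i]],
                               config$replacement[[i]])
  df
}

# Fold over the config rows, starting from the unmodified target.
folded <- reduce(
  seq_len(nrow(test_config)),
  function(df, i) apply_rule(df, i, test_config),
  .init = test_target
)
```

Each step sees the changes made by the previous steps, so the final tibble carries all four replacements and still has five rows.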

Votes: 0

Stack Overflow user

Answered 2018-10-31 06:49:11

FYI, benchmark results: the "awkward tidy" approach suggested by @camille is the winner on my hardware!

Unit: milliseconds
          expr       min        lq      mean    median        uq      max neval
          loop 14.808278 16.098818 17.937283 16.811716 20.438360 24.38021    20
 pmap_function  9.486146 10.157526 10.978879 10.628205 11.112485 15.39436    20
     nice_tidy  8.313868  8.633266  9.597485  8.986735  9.870532 14.32946    20
  awkward_tidy  1.535919  1.639706  1.772211  1.712177  1.783465  2.87615    20
    data.table  5.611538  5.652635  8.323122  5.784507  6.359332 51.63031    20

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/53071578
