问组合到列表大数据集中的优化计算
EN

Stack Overflow用户

提问于 2018-06-08 06:52:24

回答 1查看 313关注 0票数 1

我想知道是否有人能想出一种更快的方法来计算向量中元素的组合。我的方法有效，但在向量中有大约600万个元素，速度很慢。

测试向量

test.vector <- c("335261 344015 537633","22404 132858","254654 355860 488288","219943 373817","331839 404477")

我的方法

lapply(strsplit(test.vector, " "), function(x) unique(apply(combn(x, 2), 2, function(y) paste0(y, collapse = ""))))

期望的输出

[[1]]
[1] "335261344015" "335261537633" "344015537633"

[[2]]
[1] "22404132858"

[[3]]
[1] "254654355860" "254654488288" "355860488288"

[[4]]
[1] "219943373817"

[[5]]
[1] "331839404477"

optimization

combinations

回答 1

Stack Overflow用户

回答已采纳

发布于 2018-06-08 09:33:01

这是一个在大型测试用例上比OP的解决方案更快的基于25x的解决方案。它不依赖于paste，而是我们利用了数字的属性和矢量化操作。我们还使用RcppAlgos包(我是作者)中的comboGeneral，它比链接答案中的combn和combnPrim在生成向量组合方面要快得多。首先，我们展示了comboGeneral相对于其他函数的效率提升：

## library(gRbase)
library(RcppAlgos)
library(microbenchmark)

microbenchmark(gRbase::combnPrim(300, 2), combn(300, 2), 
               comboGeneral(300, 2), unit = "relative")
Unit: relative
                     expr        min         lq      mean     median         uq       max neval
gRbase::combnPrim(300, 2)   5.145654   5.192439   4.83561   7.167839   4.320497   3.98992   100
            combn(300, 2) 204.866624 192.559119 143.75540 174.079339 102.733367 539.12325   100
     comboGeneral(300, 2)   1.000000   1.000000   1.00000   1.000000   1.000000   1.00000   100

现在，我们创建一个函数来创建一些随机可重现的数据，这些数据将传递给我们的测试函数：

makeTestSet <- function(vectorSize, elementSize, mySeed = 42, withRep = FALSE) {
    set.seed(mySeed)
    sapply(1:vectorSize, function(x) {
        paste(sample(10^6, s1 <- sample(2:elementSize, 1), replace = withRep), collapse = " ")
    })
}

makeTestSet(5, 3)
[1] "937076 286140 830446" "519096 736588 134667" "705065 457742 719111" 
[4] "255429 462293 940013" "117488 474997 560332"

看起来不错。现在，让我们看看设置fixed = TRUE是否会给我们带来任何好处(正如上面@MichaelChirico建议的那样)：

bigVec <- makeTestSet(10, 100000)

microbenchmark(standard = strsplit(bigVec, " "), 
               withFixed = strsplit(bigVec, " ", fixed = TRUE), 
               times = 15, unit = "relative")
Unit: relative
     expr      min       lq     mean   median       uq      max neval
 standard 4.447413 4.296662 4.133797 4.339537 4.084019 3.415639    15
withFixed 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    15

@MichaelChirico说得很对。把所有这些放在一起，我们得到：

combPairFast <- function(testVec) {
    lapply(strsplit(testVec, " ", fixed = TRUE), function(x) {
        combs <- RcppAlgos::comboGeneral(as.numeric(x), 2)
        unique(combs[,1] * (10)^(as.integer(log10(combs[,2])) + 1L) + combs[,2])
    })
}

## test.vector defined above by OP
combPairFast(test.vector)
[[1]]
[1] 335261344015 335261537633 344015537633

[[2]]
[1] 22404132858

[[3]]
[1] 254654355860 254654488288 355860488288

[[4]]
[1] 219943373817

[[5]]
[1] 331839404477

## OP original code
combPairOP <- function(testVec) {
    lapply(strsplit(testVec, " "), function(x) unique(apply(combn(x, 2), 2, function(y) paste0(y, collapse = ""))))
}

正如OP的注释中所述，最大数字小于一百万(准确地说是600000)，这意味着我们将其中一个数字乘以最多10^6，并将其与另一个6位数字相加(相当于简单地连接两个数字字符串)，我们可以保证在基数R的数值精度(即2^53 - 1)之内。这很好，因为数字上的算术运算比字符串运算效率高得多。

剩下的就是基准测试了：

test.vector <- makeTestSet(100, 50)

microbenchmark(combPairOP(test.vector), 
               combPairFast(test.vector),
               times = 20, unit = "relative")
Unit: relative
                     expr      min      lq     mean   median     uq      max neval
  combPairOP(test.vector) 22.33991 22.4264 21.67291 22.11017 21.729 25.23342    20
combPairFast(test.vector)  1.00000  1.0000  1.00000  1.00000  1.000  1.00000    20

在更大的向量上：

bigTest.vector <- makeTestSet(1000, 100, mySeed = 22, withRep = TRUE)

## Duplicate values exist
any(sapply(strsplit(bigTest.vector, " ", fixed = TRUE), function(x) {
    any(duplicated(x))
}))
[1] TRUE

system.time(t1 <- combPairFast(bigTest.vector))
 user  system elapsed 
0.303   0.011   0.314 

system.time(t2 <- combPairOP(bigTest.vector))
 user  system elapsed 
8.820   0.081   8.902    ### 8.902 / 0.314 ~= 28x faster

## results are the same
all.equal(t1, lapply(t2, as.numeric))
[1] TRUE

票数 3

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/50751179

复制

相似问题

问组合到列表大数据集中的优化计算
EN

回答 1

Stack Overflow用户

社区

活动

资源

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问组合到列表大数据集中的优化计算EN

回答 1

Stack Overflow用户

社区

活动

资源

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问组合到列表大数据集中的优化计算
EN