文章/答案/技术大牛

发布

社区首页 >问答首页 >如何在非常大的数据集中获得更有效的行数和uniqueN

问如何在非常大的数据集中获得更有效的行数和uniqueN
EN

Stack Overflow用户

提问于 2020-05-23 23:27:18

回答 1查看 39关注 0票数 1

我有一个非常大的数据集。它包括三个维度，即公司ID、银行代码、银行分支机构以及每个公司到每个银行分支机构的距离。

我需要获得以下两个统计信息：(i)银行数量，通过识别银行代码的uniqueN；(ii)分支机构数量，通过取银行分支机构数量之和

我对它进行了如下编码，但它很乏味，非常慢，而且与data.table的声誉相去甚远。

library(data.table)
set.seed(0L)
n <- 1e7 # a simplified sample
DTsample <- data.table(Dist = abs(rnorm(n)*15 + 30), 
                       BankCode = sample(1:200, n, replace = TRUE),
                       firmID = sample(1:5000, n, replace = TRUE)
                       )[, BankBranch := .GRP, by = BankCode]

# (i) calculate uniqueN of bank code
DTsample[, .(Dleq05 = uniqueN(BankCode[Dist <=  5]), # banks to a firm is no further than 5 km
             Dleq15 = uniqueN(BankCode[Dist <= 15]), # banks to a firm is no further than 15 km
             Dleq25 = uniqueN(BankCode[Dist <= 25]),
             Dleq35 = uniqueN(BankCode[Dist <= 35]),
             Dleq45 = uniqueN(BankCode[Dist <= 45])), by = firmID]

# (ii) calculate the sum of the number of bank branches
DTsample[, .(Nleq05 = sum(+(Dist <=  5)), # branches (in rows) to a firm is no further than 5 km
             Nleq15 = sum(+(Dist <= 15)), # branches (in rows) to a firm is no further than 15 km
             Nleq25 = sum(+(Dist <= 25)),
             Nleq35 = sum(+(Dist <= 35)),
             Nleq45 = sum(+(Dist <= 45))), by = firmID]

有没有可能跑得更快？我在我的真实数据中运行了类似的东西，这需要几天的时间。

data.table

回答 1

Stack Overflow用户

发布于 2020-05-24 08:01:50

我认为只有前1是慢的。对于相同的firmID、相同的银行代码和相同的分行代码，数据集也有多个距离，我已经随意修复了这些代码。

这里有一个选项，用于首先删除超出最大半径的行。然后，创建5英里、15英里、25英里、35英里和45英里的中心圆圈，并计算每个圆圈内唯一银行代码的数量。然后，填充空的外环，因为如果内环中存在银行代码，那么它也在空的外环中。

数据准备：

#remove bank codes that are too far
maxd <- 45
DT <- DTsample[Dist<=maxd]

#!!!!
#seems like sample dataset has diff dist for the same firmID, same bankcode and bankbranch
#hence use this to fix the issue
DT <- unique(DT, by=c("firmID", "BankCode", "BankBranch"))

实际代码：

system.time({
  #round the distances into desired circles
  intv <- seq(5, maxd, 10)  
  DT[, Dist := intv[findInterval(Dist, c(0, intv))]]

  #count unique bank code per firmID per cirle
  setkeyv(DT, c("firmID","Dist","BankCode"))
  DT[, uq := {
      ri <- rleid(firmID, Dist, BankCode)
      ix <- which(!duplicated(firmID))[-1L]
      x <- replace(c(0, rep(NA, .N-1L)), ix, ri[ix-1L])
      ri - nafill(x, "locf")
  }]
  muq <- DT[, .(nuq=max(uq)), keyby=.(firmID, Dist)]

  #fill in missing bankcodes because if they exists in smaller rings, they must also be in the larger rings
  dat <- muq[CJ(firmID=unique(firmID), Dist=intv), on=.NATURAL][,
      nuq := nafill(nafill(nuq, "locf"), fill=0L), by=.(firmID)]

  #get desired output in wide format
  a1 <- dcast(dat, firmID ~ Dist, value.var="nuq")
  setnames(a1, names(a1)[-1L], sprintf("Dleq%02d", intv))
})
#   user  system elapsed
#   0.36    0.11    0.35

与OP的原始代码进行比较和检查：

system.time({
  # (i) calculate uniqueN of bank code
  a1_0 <- DT[, .(Dleq05 = uniqueN(BankCode[Dist <=  5]), # banks to a firm is no further than 5 km
    Dleq15 = uniqueN(BankCode[Dist <= 15]), # banks to a firm is no further than 15 km
    Dleq25 = uniqueN(BankCode[Dist <= 25]),
    Dleq35 = uniqueN(BankCode[Dist <= 35]),
    Dleq45 = uniqueN(BankCode[Dist <= 45])), keyby = firmID]
})
#   user  system elapsed
#   1.52    1.81    2.69

fsetequal(a1, a1_0)
#[1] TRUE

数据：

library(data.table)
set.seed(0L)
n <- 1e7 # a simplified sample
DTsample <- data.table(Dist = abs(rnorm(n)*15 + 30), 
                       BankCode = sample(1:200, n, replace = TRUE),
                       firmID = sample(1:5000, n, replace = TRUE)
                       )[, BankBranch := .GRP, by = BankCode]

也许对于第二个问题，一些不那么重复的东西，比如：

a2 <- dcast(DT[, .N, .(firmID, Dist)][, N := cumsum(N), firmID],
    firmID ~ Dist, value.var="N")
setnames(a2, names(a2)[-1L], sprintf("Nleq%02d", intv))

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/61974368

复制

相似问题

问如何在非常大的数据集中获得更有效的行数和uniqueN
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何在非常大的数据集中获得更有效的行数和uniqueNEN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何在非常大的数据集中获得更有效的行数和uniqueN
EN