首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >如何在非常大的数据集中获得更有效的行数和uniqueN

如何在非常大的数据集中获得更有效的行数和uniqueN
EN

Stack Overflow用户
提问于 2020-05-23 23:27:18
回答 1查看 39关注 0票数 1

我有一个非常大的数据集。它包括三个维度,即公司ID、银行代码、银行分支机构以及每个公司到每个银行分支机构的距离。

我需要获得以下两个统计信息:(i)银行数量,通过识别银行代码的uniqueN;(ii)分支机构数量,通过取银行分支机构数量之和

我对它进行了如下编码,但它很乏味,非常慢,而且与data.table的声誉相去甚远。

代码语言:javascript
复制
library(data.table)
set.seed(0L)
n <- 1e7 # a simplified sample
DTsample <- data.table(Dist = abs(rnorm(n)*15 + 30), 
                       BankCode = sample(1:200, n, replace = TRUE),
                       firmID = sample(1:5000, n, replace = TRUE)
                       )[, BankBranch := .GRP, by = BankCode]

# (i) calculate uniqueN of bank code
DTsample[, .(Dleq05 = uniqueN(BankCode[Dist <=  5]), # banks to a firm is no further than 5 km
             Dleq15 = uniqueN(BankCode[Dist <= 15]), # banks to a firm is no further than 15 km
             Dleq25 = uniqueN(BankCode[Dist <= 25]),
             Dleq35 = uniqueN(BankCode[Dist <= 35]),
             Dleq45 = uniqueN(BankCode[Dist <= 45])), by = firmID]

# (ii) calculate the sum of the number of bank branches
DTsample[, .(Nleq05 = sum(+(Dist <=  5)), # branches (in rows) to a firm is no further than 5 km
             Nleq15 = sum(+(Dist <= 15)), # branches (in rows) to a firm is no further than 15 km
             Nleq25 = sum(+(Dist <= 25)),
             Nleq35 = sum(+(Dist <= 35)),
             Nleq45 = sum(+(Dist <= 45))), by = firmID]

有没有可能跑得更快?我在我的真实数据中运行了类似的东西,这需要几天的时间。

EN

回答 1

Stack Overflow用户

发布于 2020-05-24 08:01:50

我认为只有前1是慢的。对于相同的firmID、相同的银行代码和相同的分行代码,数据集也有多个距离,我已经随意修复了这些代码。

这里有一个选项,用于首先删除超出最大半径的行。然后,创建5英里、15英里、25英里、35英里和45英里的中心圆圈,并计算每个圆圈内唯一银行代码的数量。然后,填充空的外环,因为如果内环中存在银行代码,那么它也在空的外环中。

数据准备:

代码语言:javascript
复制
#remove bank codes that are too far
maxd <- 45
DT <- DTsample[Dist<=maxd]

#!!!!
#seems like sample dataset has diff dist for the same firmID, same bankcode and bankbranch
#hence use this to fix the issue
DT <- unique(DT, by=c("firmID", "BankCode", "BankBranch"))

实际代码:

代码语言:javascript
复制
system.time({
  #round the distances into desired circles
  intv <- seq(5, maxd, 10)  
  DT[, Dist := intv[findInterval(Dist, c(0, intv))]]

  #count unique bank code per firmID per cirle
  setkeyv(DT, c("firmID","Dist","BankCode"))
  DT[, uq := {
      ri <- rleid(firmID, Dist, BankCode)
      ix <- which(!duplicated(firmID))[-1L]
      x <- replace(c(0, rep(NA, .N-1L)), ix, ri[ix-1L])
      ri - nafill(x, "locf")
  }]
  muq <- DT[, .(nuq=max(uq)), keyby=.(firmID, Dist)]

  #fill in missing bankcodes because if they exists in smaller rings, they must also be in the larger rings
  dat <- muq[CJ(firmID=unique(firmID), Dist=intv), on=.NATURAL][,
      nuq := nafill(nafill(nuq, "locf"), fill=0L), by=.(firmID)]

  #get desired output in wide format
  a1 <- dcast(dat, firmID ~ Dist, value.var="nuq")
  setnames(a1, names(a1)[-1L], sprintf("Dleq%02d", intv))
})
#   user  system elapsed
#   0.36    0.11    0.35

与OP的原始代码进行比较和检查:

代码语言:javascript
复制
system.time({
  # (i) calculate uniqueN of bank code
  a1_0 <- DT[, .(Dleq05 = uniqueN(BankCode[Dist <=  5]), # banks to a firm is no further than 5 km
    Dleq15 = uniqueN(BankCode[Dist <= 15]), # banks to a firm is no further than 15 km
    Dleq25 = uniqueN(BankCode[Dist <= 25]),
    Dleq35 = uniqueN(BankCode[Dist <= 35]),
    Dleq45 = uniqueN(BankCode[Dist <= 45])), keyby = firmID]
})
#   user  system elapsed
#   1.52    1.81    2.69

fsetequal(a1, a1_0)
#[1] TRUE

数据:

代码语言:javascript
复制
library(data.table)
set.seed(0L)
n <- 1e7 # a simplified sample
DTsample <- data.table(Dist = abs(rnorm(n)*15 + 30), 
                       BankCode = sample(1:200, n, replace = TRUE),
                       firmID = sample(1:5000, n, replace = TRUE)
                       )[, BankBranch := .GRP, by = BankCode]

也许对于第二个问题,一些不那么重复的东西,比如:

代码语言:javascript
复制
a2 <- dcast(DT[, .N, .(firmID, Dist)][, N := cumsum(N), firmID],
    firmID ~ Dist, value.var="N")
setnames(a2, names(a2)[-1L], sprintf("Nleq%02d", intv))
票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/61974368

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档