我有一个非常大的数据集。它包括三个维度,即公司ID、银行代码、银行分支机构以及每个公司到每个银行分支机构的距离。
我需要获得以下两个统计信息:(i)银行数量,通过识别银行代码的uniqueN;(ii)分支机构数量,通过取银行分支机构数量之和
我对它进行了如下编码,但它很乏味,非常慢,而且与data.table的声誉相去甚远。
library(data.table)
set.seed(0L)
n <- 1e7 # a simplified sample
DTsample <- data.table(Dist = abs(rnorm(n)*15 + 30),
BankCode = sample(1:200, n, replace = TRUE),
firmID = sample(1:5000, n, replace = TRUE)
)[, BankBranch := .GRP, by = BankCode]
# (i) calculate uniqueN of bank code
DTsample[, .(Dleq05 = uniqueN(BankCode[Dist <= 5]), # banks to a firm is no further than 5 km
Dleq15 = uniqueN(BankCode[Dist <= 15]), # banks to a firm is no further than 15 km
Dleq25 = uniqueN(BankCode[Dist <= 25]),
Dleq35 = uniqueN(BankCode[Dist <= 35]),
Dleq45 = uniqueN(BankCode[Dist <= 45])), by = firmID]
# (ii) calculate the sum of the number of bank branches
DTsample[, .(Nleq05 = sum(+(Dist <= 5)), # branches (in rows) to a firm is no further than 5 km
Nleq15 = sum(+(Dist <= 15)), # branches (in rows) to a firm is no further than 15 km
Nleq25 = sum(+(Dist <= 25)),
Nleq35 = sum(+(Dist <= 35)),
Nleq45 = sum(+(Dist <= 45))), by = firmID]有没有可能跑得更快?我在我的真实数据中运行了类似的东西,这需要几天的时间。
发布于 2020-05-24 08:01:50
我认为只有前1是慢的。对于相同的firmID、相同的银行代码和相同的分行代码,数据集也有多个距离,我已经随意修复了这些代码。
这里有一个选项,用于首先删除超出最大半径的行。然后,创建5英里、15英里、25英里、35英里和45英里的中心圆圈,并计算每个圆圈内唯一银行代码的数量。然后,填充空的外环,因为如果内环中存在银行代码,那么它也在空的外环中。
数据准备:
#remove bank codes that are too far
maxd <- 45
DT <- DTsample[Dist<=maxd]
#!!!!
#seems like sample dataset has diff dist for the same firmID, same bankcode and bankbranch
#hence use this to fix the issue
DT <- unique(DT, by=c("firmID", "BankCode", "BankBranch"))实际代码:
system.time({
#round the distances into desired circles
intv <- seq(5, maxd, 10)
DT[, Dist := intv[findInterval(Dist, c(0, intv))]]
#count unique bank code per firmID per cirle
setkeyv(DT, c("firmID","Dist","BankCode"))
DT[, uq := {
ri <- rleid(firmID, Dist, BankCode)
ix <- which(!duplicated(firmID))[-1L]
x <- replace(c(0, rep(NA, .N-1L)), ix, ri[ix-1L])
ri - nafill(x, "locf")
}]
muq <- DT[, .(nuq=max(uq)), keyby=.(firmID, Dist)]
#fill in missing bankcodes because if they exists in smaller rings, they must also be in the larger rings
dat <- muq[CJ(firmID=unique(firmID), Dist=intv), on=.NATURAL][,
nuq := nafill(nafill(nuq, "locf"), fill=0L), by=.(firmID)]
#get desired output in wide format
a1 <- dcast(dat, firmID ~ Dist, value.var="nuq")
setnames(a1, names(a1)[-1L], sprintf("Dleq%02d", intv))
})
# user system elapsed
# 0.36 0.11 0.35与OP的原始代码进行比较和检查:
system.time({
# (i) calculate uniqueN of bank code
a1_0 <- DT[, .(Dleq05 = uniqueN(BankCode[Dist <= 5]), # banks to a firm is no further than 5 km
Dleq15 = uniqueN(BankCode[Dist <= 15]), # banks to a firm is no further than 15 km
Dleq25 = uniqueN(BankCode[Dist <= 25]),
Dleq35 = uniqueN(BankCode[Dist <= 35]),
Dleq45 = uniqueN(BankCode[Dist <= 45])), keyby = firmID]
})
# user system elapsed
# 1.52 1.81 2.69
fsetequal(a1, a1_0)
#[1] TRUE数据:
library(data.table)
set.seed(0L)
n <- 1e7 # a simplified sample
DTsample <- data.table(Dist = abs(rnorm(n)*15 + 30),
BankCode = sample(1:200, n, replace = TRUE),
firmID = sample(1:5000, n, replace = TRUE)
)[, BankBranch := .GRP, by = BankCode]也许对于第二个问题,一些不那么重复的东西,比如:
a2 <- dcast(DT[, .N, .(firmID, Dist)][, N := cumsum(N), firmID],
firmID ~ Dist, value.var="N")
setnames(a2, names(a2)[-1L], sprintf("Nleq%02d", intv))https://stackoverflow.com/questions/61974368
复制相似问题