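I am working with data from a meta-analysis in which not every study has data for every genetic variant. I am trying to solve the following problem: for each variant I know only the total N, and I want to work out which studies contributed to it, and from that the total numbers of cases and controls.
Here is a sample of the data: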
> head(studies,2)
  study cases controls     N
1     A  3747     8024 11771
2     B  5367     5780 11147
> head(rawresults,2)
          ID      N
1 rs58241367  65280
2 rs85436064 107624
Here is an example of what I would like to end up with (the contributing_studies column is optional; I can drop it if doing so allows a better solution):
> head(final,2)
          ID contributing_studies cases controls     N
1 rs10685984            B,C,D,F,G 26221    19987 46208
2 rs12123751            A,C,D,G,J 25631    23509 49140
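My best idea so far is brute force. Each of the ten studies has two possible states (contributing or not contributing), so there are 2^10 = 1024 possible sums. Some sums may not be unique (more than one combination of study Ns could produce the same total), and I plan to exclude those as ambiguous. I have posted my code and an example solution below as an answer.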
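The enumeration described above can also be sketched with a 0/1 membership matrix, so that every subset total comes from a single matrix product. This is only an illustration under stated assumptions: the `toy` data frame is made up for the example (rows A and B reuse the sample values shown earlier, row C is hypothetical), and `membership` and `subsets` are names introduced here, not part of the original code.

```r
# Toy data: A and B are the sample rows shown earlier; C is hypothetical
toy <- data.frame(study = c("A", "B", "C"),
                  cases = c(3747, 5367, 4000),
                  controls = c(8024, 5780, 5000),
                  stringsAsFactors = FALSE)
toy$N <- toy$cases + toy$controls

# One row per subset, one column per study: 1 means the study contributes
k <- nrow(toy)
membership <- as.matrix(expand.grid(rep(list(0:1), k)))
# Drop the empty subset (no contributing studies)
membership <- membership[rowSums(membership) > 0, , drop = FALSE]

# A matrix product gives the totals for every subset at once
subsets <- data.frame(
  contributing_studies = apply(membership, 1,
                               function(m) paste(toy$study[m == 1], collapse = ",")),
  cases = as.vector(membership %*% toy$cases),
  controls = as.vector(membership %*% toy$controls),
  N = as.vector(membership %*% toy$N),
  stringsAsFactors = FALSE
)

# A total is ambiguous when more than one subset produces it
subsets$ambiguous <- duplicated(subsets$N) | duplicated(subsets$N, fromLast = TRUE)
```

With k studies the matrix has 2^k - 1 rows, so this stays cheap for ten studies but grows quickly beyond roughly twenty.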
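My question is: is there a better way? Perhaps a library or function that exists for this kind of problem? Or is there anything else I can do to make this faster and more efficient?
Here is the code to simulate data for the scenario: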
set.seed(1)
# Make "studies"
studies <- data.frame(toupper(letters[1:10]),round(rnorm(10,5000,2000)),round(rnorm(10,5000,2000)),stringsAsFactors=F)
colnames(studies) <- c('study','cases','controls')
studies$N <- studies$cases + studies$controls
# Make "rawresults"
rawresults <- data.frame(character(length=50),numeric(length=50),stringsAsFactors=F)
colnames(rawresults) <- c('ID','N')
for(i in seq(1,50)) {
  numstudies <- sample(seq(5,10),1)
  rawresults[i,'N'] <- sum(sample(studies$N,numstudies))
  rawresults[i,'ID'] <- paste0('rs',sample(seq(1,99999999),1))
}
Edit: faster code to simulate the scenario data, so that millions of rows can be simulated. Inspired by Allan Cameron's solution below, it also uses combn.
set.seed(1)
# Make "studies"
studies <- data.frame(toupper(letters[1:10]),round(rnorm(10,5000,2000)),round(rnorm(10,5000,2000)),stringsAsFactors=F)
colnames(studies) <- c('study','cases','controls')
studies$N <- studies$cases + studies$controls
# Make "rawresults"
num_results <- 50 # Number of results to simulate
possible_ns <- unlist(sapply(1:10,combn,x=studies$N,sum))
rawresults <- data.frame(paste0('rs',sample(1:99999999,num_results)),sample(possible_ns,num_results,rep=T),stringsAsFactors=F)
colnames(rawresults) <- c('ID','N')
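The `possible_ns` line above relies on `combn` accepting a function that is applied to each combination. A tiny sketch of that behaviour, using a made-up vector `x` rather than the study Ns:

```r
x <- c(1, 2, 4)

# Sums of every 2-element combination: (1,2), (1,4), (2,4)
pair_sums <- combn(x, 2, sum)

# Looping over all subset sizes and flattening gives every non-empty subset sum
all_sums <- unlist(lapply(seq_along(x), function(m) combn(x, m, sum)))
```

Here `pair_sums` is 3, 5, 6, and `all_sums` has 2^3 - 1 = 7 entries, one per non-empty subset.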
Posted on 2020-06-29 22:29:06
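Here is the solution I came up with using the brute-force approach.
I am the person who asked the question, and I am hoping someone can find a better solution than mine, perhaps something in R meant for deconvolution (I think that is the right word?) of sums that could be applied to this problem.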
#### Brute-force generate all possible combinations of studies ####
sumstudies <- function(whichstudies,whichcolumn) {
  # Convert integer "whichstudies" to binary, then use the binary digits to decide which studies are included or excluded for this combination
  in_or_out <- as.logical(intToBits(whichstudies)[1:10])
  # Return appropriate combination of data from included studies (sum if numeric, paste otherwise)
  if(is.numeric(studies[,whichcolumn])) {
    return(sum(studies[in_or_out,whichcolumn]))
  } else {
    return(paste(studies[in_or_out,whichcolumn],collapse=','))
  }
}
# Create a data frame with all 1024 possible combinations of studies
allcombos <- data.frame(matrix(nrow=1024,ncol=4))
colnames(allcombos) <- c('contributing_studies','cases','controls','N')
allcombos$contributing_studies <- sapply(seq(1,1024),sumstudies,'study')
allcombos$N <- sapply(seq(1,1024),sumstudies,'N')
allcombos$cases <- sapply(seq(1,1024),sumstudies,'cases')
allcombos$controls <- sapply(seq(1,1024),sumstudies,'controls')
# Get rid of Ns that can be made by summing more than one different combination of studies, since we wouldn't know which solution was correct
duplicates <- duplicated(allcombos$N) | duplicated(allcombos$N,fromLast=T)
allcombos[duplicates,] <- NA # Set all affected rows to NA
#### Match the data about all possible combinations to the Ns in rawresults ####
final <- merge(rawresults,allcombos,by='N',all.x=T,all.y=F)
final <- final[,c('ID','contributing_studies','cases','controls','N')]
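Posted on 2020-06-30 01:19:55
For anyone who finds this later via Google, here is the code I wrote (based on Allan Cameron's solution), which adds all the fields you are likely to need when working with genetic meta-analysis data. It took only 16 seconds to process 2 million rows.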
allcomb <- function(studies,excl_ambig=TRUE) {
  n_studies <- nrow(studies)
  # Sum (or paste) each column over every non-empty combination of studies
  N <- unlist(sapply(1:n_studies,combn,x=studies$N,sum))
  cases <- unlist(sapply(1:n_studies,combn,x=studies$cases,sum))
  controls <- unlist(sapply(1:n_studies,combn,x=studies$controls,sum))
  contrib_studies <- unlist(sapply(1:n_studies,combn,x=studies$study,paste,collapse=','))
  combined <- data.frame(contrib_studies,cases,controls,N,stringsAsFactors=FALSE)
  # Flag ambiguous lines where more than one possible combination of studies exists to produce that sum
  combined$ambig <- duplicated(combined$N) | duplicated(combined$N,fromLast=TRUE)
  if(excl_ambig) {
    # The column is named "ambig"; "$ambiguous" would return NULL and drop every row
    combined <- combined[!combined$ambig,]
  }
  return(combined)
}
allcombos <- allcomb(studies)
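# Select the four columns explicitly so the extra "ambig" flag column is not part of the assignment
rawresults[,c('contrib_studies','cases','controls','N')] <- allcombos[match(rawresults$N, allcombos$N),c('contrib_studies','cases','controls','N')]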
I am accepting Allan's solution as the answer, since it was his idea and I just built on it!
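As a quick sanity check on a `match`-based lookup like the one above, the matched rows should satisfy cases + controls = N, and any total that cannot be attributed to a unique combination should come back as NA. A minimal sketch on toy data; the `lookup` and `results` objects and their values are hypothetical, not taken from the question's simulation:

```r
# Hypothetical lookup table of unambiguous study combinations
lookup <- data.frame(contrib_studies = c("A", "B", "A,B"),
                     cases = c(3747, 5367, 9114),
                     controls = c(8024, 5780, 13804),
                     N = c(11771, 11147, 22918),
                     stringsAsFactors = FALSE)

# Hypothetical per-variant totals; 12345 matches no combination
results <- data.frame(ID = c("rs1", "rs2", "rs3"),
                      N = c(22918, 12345, 11147),
                      stringsAsFactors = FALSE)

# match() returns NA for totals absent from the lookup, so those rows stay NA
idx <- match(results$N, lookup$N)
results[, c("contrib_studies", "cases", "controls")] <-
  lookup[idx, c("contrib_studies", "cases", "controls")]

# Every resolved row should have cases + controls equal to its N
unresolved <- is.na(idx)
stopifnot(all(results$cases[!unresolved] + results$controls[!unresolved] ==
              results$N[!unresolved]))
```

Counting `sum(unresolved)` on the real data gives a rough idea of how many variants were excluded as ambiguous or unmatched.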
https://stackoverflow.com/questions/62647470