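I have a huge data frame df that contains information about overlapping intervals (A) and (B) and the chromosome (chrom) they are located on. It also holds the value (gene expression level) observed on each interval (A).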
chrom value Astart Aend Bstart Bend
chr1 0 0 54519752 17408 17431
chr1 0 0 54519752 17368 17391
chr1 0 0 54519752 567761 567783
chr11 0 2 93466832 568111 568133
chr11 0 2 93466832 568149 568171
chr11 0 2 93466832 1880734 1880756
chr11 4 93466844 93466880 93466856 93466878
chr11 2 93466885 135006516 93466889 93466911
chr11 2 93466885 135006516 94199710 94199732
Note that the same interval may appear several times. For example, if an interval (B) overlaps two (A) intervals, it is reported twice:
Astart(1)=========================Aend1 Astart(2)========================Aend(2)
Bstart(1)=======================================Bend(1)
chrom value Astart Aend Bstart Bend
chr1 0 0 25 15 35 #A(1) and B(1) overlap
chr1 1 28 45 15 35 #A(2) and B(1) overlap
Likewise, if an interval (A) overlaps two or more (B) intervals, it is reported two or more times:
Astart(3)===================================================================Aend(3)
Bstart(2)=========Bend(2) Bstart(3)===========Bend(3) Bstart(4)===============Bend(4)
chrom value Astart Aend Bstart Bend
chr4 0 10 100 15 25 #A(3) and B(2) overlap
chr4 0 10 100 30 75 #A(3) and B(3) overlap
chr4 3 10 100 80 120 #A(3) and B(4) overlap
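My goal is to output every individual position within the (B) intervals together with the corresponding value from (A). I have a piece of code that nicely outputs all the relevant positions in (B):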
position <- unlist(mapply(seq, ans$Bstart, ans$Bend - 1))
> head(position)
[1] 17408 17409 17410 17411 17412 17413
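The problem is that the chromosome information cannot be recovered from this vector alone. When I list the positions, I need to keep track of both the chromosome and the position, because the same position integer can occur on several chromosomes. So I cannot later just run something like for position %in% range(Astart, Aend) output $chrom, $value (pseudocode).
How can I retrieve (chrom, position, value) at the same time?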
The expected result would look like this:
> head(expected_result)
chrom position value
chr1 17408 0
chr1 17409 0
chr1 17410 0
chr1 17411 0
chr1 17412 0
chr1 17413 0
#skipping some lines to show another part of the dataframe
chr11 93466856 4
chr11 93466857 4
Answered on 2014-02-09 10:01:10
A call to ddply might be more elegant, but the logic is the same:
dfA = read.table(textConnection("chrom value Astart Aend Bstart Bend
chr1 0 0 54519752 17408 17431
chr1 0 0 54519752 17368 17391
chr1 0 0 54519752 567761 567783
chr11 0 2 93466832 568111 568133
chr11 0 2 93466832 568149 568171
chr11 0 2 93466832 1880734 1880756
chr11 4 93466844 93466880 93466856 93466878
chr11 2 93466885 135006516 93466889 93466911
chr11 2 93466885 135006516 94199710 94199732"), header = TRUE)
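# Expand each row of dfA into one row per position in Bstart .. Bend - 1.
# apply() passes each row as a character vector, hence the as.numeric() calls.
dfB = as.data.frame(do.call(rbind,
    apply(dfA, MARGIN = 1, FUN = function(x) {
        cbind(mapply(seq,
                     as.numeric(x['Bstart']),
                     as.numeric(x['Bend']) - 1),
              x['chrom'], x['value'])
    })))
# Inspect the column types; the values were coerced to character inside
# apply(), so convert them as needed.
lapply(dfB, typeof)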
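The same expansion can also be written without a per-row apply(), which may matter for a very large data frame. What follows is only a sketch in base R, not the approach used in the answer above: it assumes Bend is never smaller than Bstart, and the two-row dfA, the helper names n and idx, and the output name result are made up for illustration (the rows are taken from the question's table).

```r
# Toy input: the first and seventh rows of the question's table (illustrative only).
dfA <- data.frame(
  chrom  = c("chr1", "chr11"),
  value  = c(0, 4),
  Astart = c(0, 93466844),
  Aend   = c(54519752, 93466880),
  Bstart = c(17408, 93466856),
  Bend   = c(17431, 93466878),
  stringsAsFactors = FALSE
)

# Number of positions each row contributes: Bstart .. Bend - 1, as in the question.
# Assumption: Bend >= Bstart on every row.
n <- dfA$Bend - dfA$Bstart

# Repeat each row index n times so every position keeps a link to its source row.
idx <- rep(seq_len(nrow(dfA)), times = n)

result <- data.frame(
  chrom    = dfA$chrom[idx],
  # sequence(n) counts 1..n[i] within each row; shift it so it starts at Bstart.
  position = dfA$Bstart[idx] + sequence(n) - 1,
  value    = dfA$value[idx],
  stringsAsFactors = FALSE
)

head(result)
```

Because the columns are indexed rather than pasted together, position and value stay numeric and chrom stays character, so no type conversion is needed afterwards.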
https://stackoverflow.com/questions/21657348