我正在尝试将基因组坐标的范围合并为连续的范围,并增加了一个跨越空白的合并选项。
例如,如果我有基因组范围[[0, 1000], [5, 1100]]
,我希望结果是[0, 1100]
。如果偏移量选项被设置为100
,并且输入是[[0, 1000], [1090, 1000]]
,我将再次希望结果是[0, 1100]
。
我已经实现了这样一种方法,它顺序地通过对齐,并尝试在前一个结束位置和下一个起始位置上合并,但是它失败了,因为实际的结果有不同的长度。例如,我的列表中的结果[[138, 821],[177, 1158], [224, 905], [401, 1169]]
按起始位置排序。答案应该是[138, 1169]
,但我得到的是[[138, 1158], [177, 905], [224, 1169]]
。显然,我需要考虑的不仅仅是前一个结束和下一个开始,但我还没有找到一个好的解决方案(最好不是一个庞大的if语句巢)。有人有什么建议吗?
def overlap_alignments(align, gene, overlap):
#make sure alignments are sorted first by chromosome then by start pos on chrom
align = sorted(align, key = lambda x: (x[0], x[1]))
merged = list()
for i in xrange(1, len(align)):
prv, nxt = align[i-1], align[i]
if prv[0] == nxt[0] and prv[2] + overlap >= nxt[1]:
start, end = prv[1], nxt[2]
chrom = prv[0]
merged.append([chrom, start, end, gene])
return merged
发布于 2014-06-19 15:00:39
那么,如何跟踪每个开始和结束以及每个职位所属的范围的数量?
def overlap_alignments(align, overlap):
# create a list of starts and ends
stends = [ (a[0], 1) for a in align ]
stends += [ (a[1] + overlap, -1) for a in align ]
stends.sort(key=lambda x: x[0])
# now we should have a list of starts and ends ordered by position,
# e.g. if the ranges are 5..10, 8..15, and 12..13, we have
# (5,1), (8,1), (10,-1), (12,1), (13,-1), (15,-1)
# next, we form a cumulative sum of this
s = 0
cs = []
for se in stends:
s += se[1]
cs.append((se[0], s))
# this is, with the numbers above, (5,1), (8,2), (10,1), (12,2), (13,1), (15,0)
# so, 5..8 belongs to one range, 8..10 belongs to two overlapping range,
# 10..12 belongs to one range, etc
# now we'll find all contiguous ranges
# when we traverse through the list of depths (number of overlapping ranges), a new
# range starts when the earlier number of overlapping ranges has been 0
# a range ends when the new number of overlapping ranges is zero
prevdepth = 0
start = 0
combined = []
for pos, depth in cs:
if prevdepth == 0:
start = pos
elif depth == 0
combined.append((start, pos-overlap))
prevdepth = depth
return combined
这要比解释容易得多。(是的,累积和可以写在较短的空格内,但我觉得这样写得更清楚。)
为了以图形方式解释这一点,让我们以输入(5,10,8,15,12,13,16,20)和overlap=1为例。
.....XXXXXo.............. (5-10)
........XXXXXXXo......... (8-15)
............Xo........... (12-13)
................XXXXo.... (16-20)
.....1112221221111111.... number of ranges at each position
.....----------------.... number of ranges > 0
.....---------------..... overlap corrected (5-20)
发布于 2014-06-19 16:39:23
Python来了包括电池
from itertools import chain
flatten = chain.from_iterable
LEFT, RIGHT = 1, -1
def join_ranges(data, offset=0):
data = sorted(flatten(((start, LEFT), (stop + offset, RIGHT))
for start, stop in data))
c = 0
for value, label in data:
if c == 0:
x = value
c += label
if c == 0:
yield x, value - offset
if __name__ == '__main__':
print list(join_ranges([[138, 821], [900, 910], [905, 915]]))
print list(join_ranges([[138, 821], [900, 910], [905, 915]], 80))
结果:
[(138, 821), (900, 915)]
[(138, 915)]
它是如何工作的:我们把每个起点和终点都贴上标签,然后排序,然后我们简单地计算每一个起点和每个终点。如果我们已经访问了相同数量的起点和终点,我们有一个封闭的(加入)范围。
https://stackoverflow.com/questions/24317211
复制