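I'm trying to find substrings that occur with a specific frequency within a list of substrings (i.e. windows) of a larger string in Python. The goal is to find which fixed-length substrings, if any, occur with a given frequency in at least one window: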

    strand = 'JJJKJKHHGHJKKLHHGJJJHHGJJJ'
    # Now I break the string into windows (substrings) and define the patterns to look for (sub-substrings):
    A = 20  # fixed length of each window (substring), sliding along the string one position at a time
    B = 3   # fixed length of the pattern (sub-substring)
    C = 3   # required frequency of the pattern (sub-substring)
    pattcount = {}
    for i in range(0, len(strand) - A + 1):
        win = strand[i:i + A]
        for n in range(0, len(win) - B + 1):
            patt = win[n:n + B]
            pattcount[patt] = pattcount[patt] + 1 if patt in pattcount else 1
    pattgroup = []
    for p, f in pattcount.items():
        if f != C:
            pattgroup = pattgroup
        elif f == C:
            pattgroup += [p]
    print(" ".join(pattgroup))

The result I get is JKJ, whereas the answer should be HHG (it occurs C=3 times within a window of length 20), and neither JKJ nor JJJ (the latter does occur C=3 times, but in the whole string, not within a window of length 20).
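What am I doing wrong? How can I find the patterns that occur with the required frequency in at least one window, without matches from other windows being added to the final count? Thanks in advance.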

Posted on 2013-11-15 02:22:26
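The problem is that the program runs through the windows (7 in this case) and counts the frequency of every pattern without resetting in between. So in the end it finds that HHG was encountered 18 times (as badc0re said). You asked for patterns repeated 3 times, so it gave you JKJ, because that pattern was found exactly 3 times across all windows combined.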
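
To make the accumulation concrete, here is a minimal sketch (written for Python 3, and using `collections.Counter`, which is not part of the original code) that keeps one shared counter across all windows, just like the question's loop does. The name `cumulative` is only illustrative.

```python
from collections import Counter

strand = 'JJJKJKHHGHJKKLHHGJJJHHGJJJ'
A, B = 20, 3  # window length and pattern length, as in the question

# One counter shared by every window: nothing is reset between windows.
cumulative = Counter()
for i in range(len(strand) - A + 1):
    win = strand[i:i + A]
    for n in range(len(win) - B + 1):
        cumulative[win[n:n + B]] += 1

print(cumulative['HHG'])  # 18: summed over all 7 windows
print(cumulative['JKJ'])  # 3: one occurrence in each of windows 0, 1 and 2
```

JKJ occurs only once in the string, but that single occurrence falls inside three overlapping windows, which is why its total happens to equal C.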
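So basically, you have to reset the pattcount variable every time a new window starts. You can store each window's pattcount in another variable, for example pattcounttotal. Here is a rough example: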

    strand = 'JJJKJKHHGHJKKLHHGJJJHHGJJJ'
    # Now I break the string into windows (substrings) and define the patterns to look for (sub-substrings):
    A = 20  # fixed length of each window (substring), sliding along the string one position at a time
    B = 3   # fixed length of the pattern (sub-substring)
    C = 3   # required frequency of the pattern (sub-substring)
    pattcounttotal = {}  # everything will be stored in this variable
    for i in range(0, len(strand) - A + 1):
        pattcount = {}  # moved inside the for loop so it is emptied for every window
        win = strand[i:i + A]
        for n in range(0, len(win) - B + 1):
            patt = win[n:n + B]
            # I've partly rewritten your code to make it more readable (for me)
            if patt in pattcount:
                pattcount[patt] = pattcount[patt] + 1
            else:
                pattcount[patt] = 1
        pattcounttotal[i] = pattcount  # this window's pattcount is stored in the total one

Now you have to go through pattcounttotal (rather than pattcount) to find the patterns that are repeated three times within a single window.
The output I get when I print the pattcounttotal variable is as follows (the key order within each dictionary may differ):
{0: {'HGH': 1, 'KJK': 1, 'KLH': 1, 'HJK': 1, 'LHH': 1, 'KHH': 1, 'KKL': 1, 'GJJ': 1, 'HGJ': 1, 'JJK': 1, 'JJJ': 2, 'JKJ': 1, 'JKK': 1, 'JKH': 1, 'HHG': 2, 'GHJ': 1},
1: {'JJH': 1, 'HGH': 1, 'KJK': 1, 'JKK': 1, 'KLH': 1, 'LHH': 1, 'KHH': 1, 'KKL': 1, 'GJJ': 1, 'HGJ': 1, 'JJK': 1, 'HJK': 1, 'JKJ': 1, 'JKH': 1, 'HHG': 2, 'GHJ': 1, 'JJJ': 1},
2: {'JHH': 1, 'HGH': 1, 'KJK': 1, 'JJH': 1, 'KLH': 1, 'LHH': 1, 'KHH': 1, 'KKL': 1, 'GJJ': 1, 'HGJ': 1, 'JKH': 1, 'HJK': 1, 'JKJ': 1, 'JKK': 1, 'HHG': 2, 'GHJ': 1, 'JJJ': 1},
3: {'JHH': 1, 'KJK': 1, 'JJH': 1, 'KLH': 1, 'LHH': 1, 'KHH': 1, 'KKL': 1, 'GJJ': 1, 'HGJ': 1, 'JKH': 1, 'HJK': 1, 'HGH': 1, 'JKK': 1, 'HHG': 3, 'GHJ': 1, 'JJJ': 1},
4: {'JHH': 1, 'JJH': 1, 'KLH': 1, 'LHH': 1, 'KHH': 1, 'KKL': 1, 'GJJ': 1, 'HGJ': 2, 'HHG': 3, 'GHJ': 1, 'HGH': 1, 'JKK': 1, 'JKH': 1, 'HJK': 1, 'JJJ': 1},
5: {'JHH': 1, 'JJH': 1, 'KLH': 1, 'LHH': 1, 'KHH': 1, 'KKL': 1, 'GJJ': 2, 'HHG': 3, 'GHJ': 1, 'HGH': 1, 'JKK': 1, 'HGJ': 2, 'HJK': 1, 'JJJ': 1},
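6: {'JHH': 1, 'JJH': 1, 'KLH': 1, 'LHH': 1, 'KKL': 1, 'GJJ': 2, 'HHG': 3, 'GHJ': 1, 'HGH': 1, 'JKK': 1, 'HGJ': 2, 'HJK': 1, 'JJJ': 2}}

So basically the windows are numbered 0 through 6, and the frequency of every pattern is shown for each one. A quick look shows that the HHG pattern is encountered 3 times in windows 3 through 6.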
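(If you want "automatic" output, you still have to write a piece of code that loops over every window, i.e. over the top level of the dictionary, and keeps track of the patterns encountered three times.)

Update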
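I thought this was an interesting problem, so I also made it produce the output "automatically". It is very similar to what you already had. Here is the code I used (I hope this isn't a spoiler :)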

    pattgroup = []
    for i in pattcounttotal.items():  # first iterate over the top level of pattcounttotal
        for p, f in i[1].items():  # i[0] is the window's number and i[1] is the dictionary holding the frequencies
            if f == C:
                if p not in pattgroup:  # make sure HHG is not already in pattgroup
                    pattgroup += [p]
    print(" ".join(pattgroup))

If I put the two pieces of code together and run the program, I get HHG.
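
For reference, the two steps (count per window, then collect the patterns that hit the required frequency) can be folded into one function. This is only a sketch under a few assumptions: it targets Python 3, it uses `collections.Counter` instead of the hand-rolled dictionaries, and the function name `patterns_with_frequency` and its parameter names are hypothetical, not from the original post.

```python
from collections import Counter

def patterns_with_frequency(strand, window_len, pattern_len, freq):
    """Return the patterns that occur exactly `freq` times in at least one window.

    Hypothetical helper: the name and signature are illustrative only.
    Overlapping occurrences are counted, as in the loops above.
    """
    found = set()
    for i in range(len(strand) - window_len + 1):
        win = strand[i:i + window_len]
        # A fresh Counter per window, so counts never leak between windows.
        counts = Counter(win[n:n + pattern_len]
                         for n in range(len(win) - pattern_len + 1))
        found.update(p for p, f in counts.items() if f == freq)
    return found

print(patterns_with_frequency('JJJKJKHHGHJKKLHHGJJJHHGJJJ', 20, 3, 3))  # {'HHG'}
```

Returning a set avoids the explicit "already in pattgroup" check, at the cost of losing the order in which patterns were found.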

Posted on 2013-11-15 04:53:47
You can simplify it to:

    from collections import defaultdict

    strand = 'JJJKJKHHGHJKKLHHGJJJHHGJJJ'
    print(strand)
    A = 20
    B = 3
    C = 3
    D = A - B + 1  # number of pattern start positions in one window
    pattcount = defaultdict(int)
    res = set()
    for i in range(len(strand) - A + 1):
        pattcount.clear()  # reset the counts for each window
        win = strand[i:i + A]
        for n in range(D):
            pattcount[win[n:n + B]] += 1
        res.update(k for k, v in pattcount.items() if v == C)
        print('i==%d res==%s' % (i, res))
    print(" ".join(res))

Edit
We can even avoid using win:

    from collections import defaultdict

    strand = 'JJJKJKHHGHJKKLHHGJJJHHGJJJ'
    print(strand)
    A = 20
    B = 3
    C = 3
    D = A - B + 1  # number of pattern start positions in one window
    pattcount = defaultdict(int)
    res = set()
    for i in range(len(strand) - A + 1):
        pattcount.clear()  # reset the counts for each window
        for n in range(i, i + D):  # index the full string directly instead of slicing out a window
            pattcount[strand[n:n + B]] += 1
        res.update(k for k, v in pattcount.items() if v == C)
    print(" ".join(res))

https://stackoverflow.com/questions/19983693