
Q: Python program for checking the match locations of substrings, with a large number of permutations of the substrings

Code Review user

Asked 2017-08-11 07:51:34

Answers: 1 · Views: 306 · Followers: 0 · Votes: 5

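This is my first code posted here, and also my first program of this size. I don't really know what is expected of a programmer who writes "good, readable" code. This is my first program that will be used in a real-world application. I am also very new to Python. So, when reviewing, please offer constructive criticism of this code and of how my code could be better, both in Python and in programming in general. I will try to explain the problem as well as I can in the next section. If my code, logic or problem needs any clarification, feel free to ask in the comments and I will do my best to clear up any doubts.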

The problem -

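Consider two files.
Each contains a list of strings.
One list contains strings made up of 'a', 't', 'g' and 'c'.
The other list contains strings made up of combinations of 'A', 'U', 'G' and 'C'.
I have to convert the strings from the uppercase list: a's to t's, c's to g's, g's to c's and u's to a's (a-t, c-g, g-c, u-a). As a special condition, in at most two instances a u may instead be converted to a g and/or a g to a t (u-g, g-t).
The conversion only needs to be done on four regions of each string, with indices starting at 1: indices 2-7 (6 characters), 2-8 (7 characters), 1-7 (7 characters) and 1-8 (8 characters).
After generating all the possible conversions, I have to check each of them against every string in the other list and find the locations where they match.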
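As a rough illustration of the four regions, the sketch below maps the 1-based, inclusive index ranges from the description onto Python's 0-based, end-exclusive slices. The helper name `seed_regions`, the `SEED_REGIONS` table and the example sequence are assumptions made for illustration. The region labels are borrowed from the program further down.

```python
# Hypothetical table: 1-based, inclusive (start, end) positions for each
# region, labelled with the names used later in the program.
SEED_REGIONS = {
    "6mer": (2, 7),
    "7mer": (2, 8),
    "7A1": (1, 7),
    "8mer": (1, 8),
}


def seed_regions(mirna):
    """Return {region label: substring} for one miRNA sequence."""
    # A 1-based inclusive range (start, end) is the slice [start - 1:end].
    return {
        label: mirna[start - 1:end]
        for label, (start, end) in SEED_REGIONS.items()
    }


# Arbitrary example sequence, used only to show the slice lengths.
regions = seed_regions("UGAGGUAGUAGG")
for label, seed in regions.items():
    print(label, seed, len(seed))
```

The 1-8 region is 8 characters long, so the matching slice is `[0:8]`.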

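If you are looking for some sample output, I cannot provide it yet, because the number of comparisons I need to make is roughly (38869 * 2588 * all possible combinations of each of the 2588) + time taken to generate all the permutations. My machine is not capable enough to do that.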
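To get a feel for the "all possible combinations" factor, here is a minimal sketch that counts the conversions of a single region. It assumes the rule stated above: at most two special substitutions, each of which can only happen at a 'g' or a 'u'. The function name `variant_count` is hypothetical.

```python
def variant_count(seed):
    """Count the distinct conversions of one region with 0, 1 or 2 special swaps."""
    # Only g and u have an alternative conversion (g-t and u-g).
    k = sum(1 for base in seed.lower() if base in "gu")
    # One plain conversion, k single swaps, and k-choose-2 double swaps.
    return 1 + k + k * (k - 1) // 2


print(variant_count("gaggua"))  # three g's and one u, so k = 4
print(variant_count("aaaaaa"))  # no g or u, so only the plain conversion
```

Every choice of positions gives a different string, so the count is exact and no deduplication step is needed afterwards.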

My program -

## Date : 2017-08-10
## Author : dadyodevil
## Contact : daddyodevil@gmail.com
##
## A python program to detect all indices of complimentary Micro-RNA(miRNA) target sites on Messenger-RNAs(mRNA)
##
## As an input, this program needs two lists - 
##  1. A list of mRNAs where each entry is represented in a two line format:
##      >hg19_refGene NM_032291 range=chr1:67208779-67210768...
##      Sequence of mRNA
##   2. A list of miRNAs where each entry is represented in a two line format:
##      >hsa-miR-576-3p MIMAT000...
##      Sequence of miRNA
##
##  Pre-requisites for the reader -  
##  1. Understanding of programming concepts
##  2. A moderate understanding of the Python programming language version 2.7
##  3. Knowledge of terms regarding miRNA-mRNA target detection


import re


def extractSeed(miRNA):

    ## There are 4 seed regions with indices from 2-7, 2-8, 1-7 and 1-8

    miRNAfor6mer.append(miRNA[1:7][::-1])
    miRNAfor7mer.append(miRNA[1:8][::-1])
    miRNAfor7a1.append(miRNA[:7][::-1])
    miRNAfor8mer.append(miRNA[0:9][::-1])


def createCompliment(allCompliments, miRNA, wobbleCount, compliment):

    ## For the compliment, the conversions include a:t, u:a, g:c, c:g and for Wobble-Pairs, u:g and g:u

    if wobbleCount == 2:        
        for letter in miRNA:
            if letter == 'a':
                compliment += 't'
            elif letter == 'c':
                compliment += 'g'
            elif letter == 'g':
                compliment += 'c'
            else:
                compliment += 'a'
        allCompliments.append(compliment)

    else:
        for index, letter in enumerate(miRNA):
            if letter == 'a':
                compliment += 't'
            elif letter == 'c':
                compliment += 'g'
            elif letter == 'g':
                createCompliment(allCompliments, miRNA[index+1:], wobbleCount + 1, compliment + "t")
                createCompliment(allCompliments, miRNA[index+1:], wobbleCount + 1, compliment + "c")
                compliment += 'c'
            elif letter == 'u':
                createCompliment(allCompliments, miRNA[index+1:], wobbleCount + 1, compliment + "g")
                createCompliment(allCompliments, miRNA[index+1:], wobbleCount + 1, compliment + "a")
                compliment += 'a'

    ## Now that all possibilities are generated, the duplicates need to be removed
    allCompliments = sorted(list(set(allCompliments)))


def checkForMatch(miRNACompliments, seedRegion, miRNAname):

    ## Each miRNA that is received by this function will be compared against the whole list of mRNAs and the matching indices will be saved

    ## Since the mRNA sequences are in alternate lines the sequences will be extracted as such and when matches are found, the name of the mRNA will be extracted from the index just before the current one

    for index in range(1, len(mRNA_List), 2):
        for entry in miRNACompliments:
            mRNA =  mRNA_List[index]
            matchesStart = [m.start() for m in re.finditer(entry, mRNA)]

            if (len(matchesStart) > 0):             
                mRNAname = mRNA_List[index-1][14:mRNA_List[index-1].find(" ",15)]
                matchesEnd = []             
                for index2 in range(0, len(matchesStart)):
                    matchesEnd.append(matchesStart[index2] + len(entry))
                allindices = zip(matchesStart, matchesEnd)
                complimentarySiteList.append([miRNAname, mRNAname, seedRegion, allindices])


def prepareForMatch(miRNA, miRNAname):

    global miRNAfor6mer, miRNAfor7mer, miRNAfor7a1, miRNAfor8mer
    miRNAfor6mer, miRNAfor7mer, miRNAfor7a1, miRNAfor8mer = [], [], [], []

    ## First the seed sites will be extracted and reversed
    extractSeed(miRNA)

    ## Empty lists will be generated to store all the compliments
    miRNAfor6mer.append([])
    miRNAfor7mer.append([])
    miRNAfor7a1.append([])
    miRNAfor8mer.append([])

    ## Then the compliments will be generated from the seed regions along with at most two Wobble-Pairs
    miRNAfor6mer.append(createCompliment(miRNAfor6mer[1], miRNAfor6mer[0], 0, ""))
    miRNAfor7mer.append(createCompliment(miRNAfor7mer[1], miRNAfor7mer[0], 0, ""))
    miRNAfor7a1.append(createCompliment(miRNAfor7a1[1], miRNAfor7a1[0], 0, ""))
    miRNAfor8mer.append(createCompliment(miRNAfor8mer[1], miRNAfor8mer[0], 0, ""))

    ## After generating all possible compliments, they will be checked for matching sites
    checkForMatch(miRNAfor6mer[1], "6mer", miRNAname)
    checkForMatch(miRNAfor7mer[1], "7mer", miRNAname)
    checkForMatch(miRNAfor7a1[1], "7A1", miRNAname)
    checkForMatch(miRNAfor8mer[1], "8mer", miRNAname)


def Main():

    global mRNA_List, miRNA_List, complimentarySiteList
    miRNA_List = open('miRNA_list.txt').read().splitlines()
    mRNA_List = open('mRNA_list.txt').read().splitlines()
    complimentarySiteList = []

    ## Since the sequences are in every alternate line, the 'index' needs to be incremented by 2 to access only the sequences

    ## The miRNA lengths are also checked whether they are at least 8 nucleotides long, if they are not, they will not be checked

    for index in range(1,len(miRNA_List),2):

        miRNAname = miRNA_List[index-1][5:miRNA_List[index-1].find(' ')]

        if (len(miRNA_List[index]) < 8):
            print "%s at %d has insufficient length." %(miRNAname, index)

        else:
            prepareForMatch(miRNA_List[index].lower(), miRNAname)

    for entry in complimentarySiteList:
        print entry


if __name__ == '__main__':
    Main()

Tags: recursion, python, beginner, strings

1 Answer

Code Review user

Accepted answer

Posted 2017-08-11 10:04:06

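Please don't abuse lists. Mutation is hard to reason about, so when a function can mutate its data, the function becomes harder to understand.
The function createCompliment doesn't return anything and instead mutates allCompliments. That mutation doesn't happen when you assign to the name, as on the last line, allCompliments = sorted(list(set(allCompliments))). That line does nothing, because you don't use the new value afterwards.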
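The difference between rebinding a parameter and mutating the list it refers to can be seen in a small sketch. The three function names and the sample data are hypothetical and are not part of the reviewed program.

```python
def dedupe_rebinding(items):
    # Rebinds the local name only; the caller's list is untouched.
    items = sorted(set(items))


def dedupe_in_place(items):
    # Slice assignment replaces the contents of the caller's list object.
    items[:] = sorted(set(items))


def dedupe_returning(items):
    # Usually the easiest to reason about: build a new list and return it.
    return sorted(set(items))


data = ["gtta", "gtca", "gtta"]

dedupe_rebinding(data)
after_rebinding = list(data)      # still contains the duplicate

result = dedupe_returning(data)   # new list, data itself unchanged

dedupe_in_place(data)             # now data has been changed

print(after_rebinding, result, data)
```

Returning a new value, as in `dedupe_returning`, keeps the data flow visible at the call site.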
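Rather than using two slow for loops in createCompliment, you can perform all the standard conversions first and then handle the special conversions by looping over all combinations of the special indexes. That is, if you first run the loop you currently run under if wobbleCount == 2:, you get the base translation, and you no longer have to care about the standard conversions while performing the special ones. So if the input is ccagaa, you can convert it to ggtctt without worrying about the g. The simplest way to do this is with str.translate.

After this you need to apply the special conversions, which are g -> t and u -> g. However, because we have already performed the base translation, they become c -> t and a -> g. To apply them, you need the indexes of the c's and a's, which you can get with a list comprehension: [i for i, c in enumerate(rna) if c in 'ac'].

Next you want to convert every combination of one or two occurrences of these characters. This means we can use itertools.combinations to loop over all the combinations we want to change. Finally we have to convert the values, so we copy the list with [:], which is a slice of the entire list. Then, looping over the chosen indexes, we convert the values in the copy and yield the string version of the list.

An example: we start with rna = 'cagu' and convert it to the base translation, gtca. After this we get the indexes of all the special characters, which are [2, 3]. We then loop over all combinations of these, [(2,), (3,), (2, 3)], and yield the converted words, which are gtta, gtcg and gttg.

It is easiest to think of yield as array.append. So the following functions are roughly the same:

def fn_1():
    yield 1
    yield 2

def fn_2():
    array = []
    array.append(1)
    array.append(2)
    return array

def fn_3():
    return [1, 2]

list(fn_1()) == fn_2() == fn_3()  # True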
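The 'cagu' walk-through can be reproduced with a short sketch. It is written for Python 3, where `str.maketrans` takes the place of Python 2.7's `string.maketrans`. The names `wobble_variants`, `BASE`, `WOBBLE` and `max_swaps` are illustrative assumptions, not identifiers from the reviewed program.

```python
import itertools

# Standard conversions: a-t, c-g, g-c, u-a.
BASE = str.maketrans("acgu", "tgca")
# Special conversions u-g and g-t, expressed on the already-translated string.
WOBBLE = {"a": "g", "c": "t"}


def wobble_variants(rna, max_swaps=2):
    """Yield the base conversion, then each variant with 1..max_swaps swaps."""
    base = rna.translate(BASE)
    yield base
    # Positions whose character has a special alternative.
    positions = [i for i, c in enumerate(base) if c in WOBBLE]
    for n in range(1, max_swaps + 1):
        for chosen in itertools.combinations(positions, n):
            chars = list(base)  # fresh copy for every combination
            for i in chosen:
                chars[i] = WOBBLE[chars[i]]
            yield "".join(chars)


variants = list(wobble_variants("cagu"))
print(variants)  # ['gtca', 'gtta', 'gtcg', 'gttg']
```

Because `itertools.combinations` never repeats a set of positions, each variant is produced once and no `set` is needed to remove duplicates.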
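To simplify check_for_match: because you are looping over a flattened list of 2-tuples, you want to unflatten that list. So [0, 1, 2, 3] becomes [(0, 1), (2, 3)], which allows the simpler for a, b in ... form of loop. For this you can use the grouper recipe:

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

This works because [item] * n doesn't copy item, and because of the way iterators work. The former means that [item] * 2 is the same as [item, item], not [item, copy(item)]. This matters because it ensures that both entries are the same iterator. Using a single iterator several times is important because zip basically builds [(next(it), next(it)), (next(it), next(it)), ...]. The real thing is a bit more complicated, since it handles any number of iterators and knows when they are exhausted, but that is essentially how it works.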
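A minimal Python 3 sketch of the same recipe shows why sharing one iterator matters (`zip_longest` is the Python 3 name for `izip_longest`). The header and sequence lines are made-up sample data.

```python
from itertools import zip_longest  # izip_longest in Python 2


def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks."""
    args = [iter(iterable)] * n  # n references to ONE shared iterator
    return zip_longest(*args, fillvalue=fillvalue)


# Made-up two-line records: a header line followed by a sequence line.
lines = [">mRNA_1 info", "acgu", ">mRNA_2 info", "ugca"]
pairs = list(grouper(lines, 2))
for name, sequence in pairs:
    print(name, sequence)

# Both list entries are the same iterator object, so zip advances it
# twice per output tuple.
shared = [iter(range(4))] * 2
same_object = shared[0] is shared[1]

# With two separate iterators each value is simply paired with itself.
separate = list(zip(iter(range(4)), iter(range(4))))
```

When the number of items is not a multiple of `n`, the last chunk is padded with `fillvalue`, so a truncated input file would show up as a `None` sequence rather than as an exception.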
You should follow PEP 8.

So I would change your code to:

import re
import string
import itertools

TRANS = string.maketrans('acgu', 'tgca')
CONVS = {'a': 'g', 'c': 't'}

SEEDS = [
    "6mer",
    "7mer",
    "7A1",
    "8mer"
]


def create_compliments(rna):
    rna = rna.translate(TRANS)
    yield rna
    all_indexes = [i for i, c in enumerate(rna) if c in CONVS]
    rna = list(rna)
    for n in (1, 2):
        for indexes in itertools.combinations(all_indexes, n):
            t = rna[:]
            for index in indexes:
                t[index] = CONVS[t[index]]
            yield ''.join(t)


def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return itertools.izip_longest(*args, fillvalue=fillvalue)


def check_for_match(mi_RNAs, seed, mi_RNA_name, m_RNA_list):
    mi_RNAs = list(mi_RNAs)
    for m_RNA_name, m_RNA in grouper(m_RNA_list, 2):
        m_RNA_name = m_RNA_name[14:m_RNA_name.find(" ", 15)]
        for entry in mi_RNAs:
            matches = [m.start() for m in re.finditer(entry, m_RNA)]
            if matches:
                all_indices = tuple(
                    (match, match + len(entry))
                    for match in matches
                )
                yield mi_RNA_name, m_RNA_name, seed, all_indices


def prepare_for_match(mi_RNA, mi_RNA_name, m_RNA_list):
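    mi_RNAs = [
        mi_RNA[1:7][::-1],  # 6mer: positions 2-7
        mi_RNA[1:8][::-1],  # 7mer: positions 2-8
        mi_RNA[:7][::-1],   # 7A1: positions 1-7
        mi_RNA[:8][::-1]    # 8mer: positions 1-8 (a [0:9] slice would take 9 characters)
    ]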

    for mi_RNA, seed in zip(mi_RNAs, SEEDS):
        for entry in check_for_match(create_compliments(mi_RNA), seed, mi_RNA_name, m_RNA_list):
            yield entry


def main():
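    # Context managers close the files as soon as they have been read.
    with open('miRNA_list.txt') as mi_RNA_file:
        mi_RNA_list = mi_RNA_file.read().splitlines()
    with open('mRNA_list.txt') as m_RNA_file:
        m_RNA_list = m_RNA_file.read().splitlines()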

    for index in range(1, len(mi_RNA_list), 2):
        mi_RNA_name = mi_RNA_list[index-1][5:mi_RNA_list[index-1].find(' ')]
        if (len(mi_RNA_list[index]) < 8):
            print "{} at {} has insufficient length.".format(mi_RNA_name, index)
        else:
            for entry in prepare_for_match(mi_RNA_list[index].lower(), mi_RNA_name, m_RNA_list):
                print tuple(entry)


if __name__ == '__main__':
    main()

Votes: 2

Original page content provided by Code Review.

Original link:

https://codereview.stackexchange.com/questions/172673
