Python Tutorial (Part 6)

Homework (3)

  1. Make every program from "Homework (2)" accept command-line arguments
    • Knowledge points used:
    • import sys
    • sys.argv
    • import optparse
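
The two ways of reading arguments named above can be sketched as follows. This is a minimal illustration, not the tutorial's own solution: the helper names and the options `-i/--input-file` and `-l/--length` are made up for the example. The scripts later in this tutorial use the Python 2 print statement; the sketch uses `print()` as a function so that it runs on both Python 2.6+ and Python 3.

```python
from __future__ import print_function  # same code runs on Python 2 and 3
import sys
from optparse import OptionParser


def filename_from_argv(argv):
    """Positional style: argv[0] is the script name, argv[1] the first argument."""
    if len(argv) < 2:
        print("Usage: python %s filename" % argv[0], file=sys.stderr)
        return None
    return argv[1]


def options_from_argv(argv):
    """Named-option style with optparse (the option names are hypothetical)."""
    parser = OptionParser(usage="%prog -i filename [-l length]")
    parser.add_option("-i", "--input-file", dest="filename",
                      help="FASTA file to read")
    parser.add_option("-l", "--length", dest="length", type="int", default=80,
                      help="bases per output line [default: %default]")
    options, args = parser.parse_args(argv[1:])
    return options


# Explicit lists stand in for sys.argv so the sketch can be run anywhere.
print(filename_from_argv(["cat.py", "data/test1.fa"]))
opts = options_from_argv(["formatFasta-2.py", "-i", "data/test2.fa", "-l", "60"])
print(opts.filename, opts.length)
```

In a real script both helpers would be called with `sys.argv`. The positional style is enough for one or two arguments; named options become easier to read once a script takes several.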

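2. Notes

  • The "knowledge points used" listed under each exercise are the ones that are new relative to the preceding exercises, so consider them together with the earlier ones. Depending on your approach, you may not need every listed point, and you may need some that are not listed. All of them, however, are covered in the earlier lecture notes.
  • Each of these programs is simple for anyone around you who can already program, so resist the urge to ask. Work out the answers on your own, read the error messages carefully, and compare the program's output with the expected result.
  • Practise "reading the program": walk through the input file by hand, simulating how each line is read and processed, to spot possible logic problems.
  • A program that runs without errors has not necessarily done what you need; always check whether the output is what you wanted.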

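3. About debugging

  • When you first start writing programs you will run into all kinds of errors. Common ones are inconsistent indentation, misspelled variable names, missing colons, and file names without quotes. Use the error message to find out what type of error it is and which line it was raised on. Note that the reported line is not always the faulty one; the mistake may be on a line before or after it.
  • When the result is not what you expected, use print to check that each step does the right thing. For example, after reading data into a dictionary, print the dictionary to see whether it holds what you intended and whether it contains stray characters. You can also print a marker at each conditional branch or function call to trace the path the program takes.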
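
The print-based check described above might look like the sketch below. The function name `read_table`, its `debug` switch, and the two input lines are invented for illustration. Printing with `%r` shows the raw string, so characters such as a trailing newline become visible.

```python
from __future__ import print_function  # same code runs on Python 2 and 3
import sys


def read_table(lines, strip_newline, debug=False):
    """Build {name: value} from tab-separated lines; optionally trace each step."""
    aDict = {}
    for line in lines:
        if debug:
            # %r shows the raw string, so '\t' and '\n' become visible
            print("DEBUG line = %r" % line, file=sys.stderr)
        if strip_newline:
            line = line.rstrip("\n")
        name, value = line.split("\t")
        aDict[name] = value
    if debug:
        print("DEBUG dict = %r" % aDict, file=sys.stderr)
    return aDict


lines = ["geneA\t2.5\n", "geneB\t0.1\n"]   # made-up input lines
buggy = read_table(lines, strip_newline=False, debug=True)
fixed = read_table(lines, strip_newline=True)
print(buggy["geneA"] == "2.5", fixed["geneA"] == "2.5")
```

The first comparison is False because the stored value still ends in a newline, which the debug output on standard error makes obvious; the second is True.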

%%writefile script/cat.py
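#Homework 1 and 2, cat.py
#The original notes mention three ways of handling the newline; this version uses print with a trailing comma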
import sys

def cat(file):
    for line in open(file):
        print line,
#------END of cat---------

def main():
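    if len(sys.argv) < 2:  #one argument is required: the file name
        print >>sys.stderr, "Usage: python %s filename" % sys.argv[0] #usage messages normally go to standard error
        sys.exit(1) #a non-zero exit status signals incorrect usage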

    file = sys.argv[1]
    cat(file)
#-------END main---------------
if __name__ == '__main__':
    main()
Overwriting script/cat.py
%run script/cat data/test1.fa
>NM_001011874 gene=Xkr4 CDS=151-2091
gcggcggcgggcgagcgggcgctggagtaggagctggggagcggcgcggccggggaaggaagccagggcg
>NM_001195662 gene=Rp1 CDS=55-909
AGGTCTCACCCAAAATGAGTGACACACCTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCA
>NM_011283 gene=Rp1 CDS=128-6412
AATAAATCCAAAGACATTTGTTTACGTGAAACAAGCAGGTTGCATATCCAGTGACGTTTATACAGACCAC
>NM_0112835 gene=Rp15 CDS=128-6412
AATAAATCCAAAGACATTTGTTTACGTGAAACAAGCAGGTTGCATATCCAGTGACGTTTATACAGACCAC
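
The header comment of cat.py refers to three ways of handling the newline at the end of each line, while the script itself shows only the trailing comma. The sketch below gives three common options; which three the author had in mind is an assumption, and the function names are invented. It is written with `print()` as a function so that it runs on Python 3 as well as Python 2.6+.

```python
from __future__ import print_function  # same code runs on Python 2 and 3
import sys


def cat_keep_newline(lines, out=None):
    """1. Keep the line's own newline and tell print not to add another
    (the counterpart of Python 2's "print line,")."""
    out = out or sys.stdout
    for line in lines:
        print(line, end="", file=out)


def cat_strip_newline(lines, out=None):
    """2. Remove the trailing newline first, then let print add one back."""
    out = out or sys.stdout
    for line in lines:
        print(line.rstrip("\n"), file=out)


def cat_write(lines, out=None):
    """3. Write the line unchanged; write() never adds a newline."""
    out = out or sys.stdout
    for line in lines:
        out.write(line)


lines = [">seq1\n", "ACGT\n"]   # stand-in for the lines of a file
cat_keep_newline(lines)
cat_strip_newline(lines)
cat_write(lines)
```

All three print the two lines without blank lines in between. A plain `print(line)` would instead output every line followed by an empty one.
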
%%writefile script/splitName.py
#Homework 3
import sys

def splitName(filename):
    for line in open(filename):
        if line[0] == '>':
            lineL = line.split(' ')
            print lineL[0]
        else:
            print line,
#------END splitName---------------

def main():
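    if len(sys.argv) < 2:  #one argument is required: the file name
        print >>sys.stderr, "Usage: python %s filename" % sys.argv[0] #usage messages normally go to standard error
        sys.exit(1) #a non-zero exit status signals incorrect usage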

    filename = sys.argv[1]
    splitName(filename)
#-------END main------------------

if __name__ == '__main__':
    main()
Overwriting script/splitName.py
%run script/splitName data/test2.fa
>NM_001011874
gcggcggcgggcgagcgggcgctggagtaggagctggggagcggcgcggccggggaaggaagccagggcg
aggcgaggaggtggcgggaggaggagacagcagggacaggTGTCAGATAAAGGAGTGCTCTCCTCCGCTG
CCGAGGCATCATGGCCGCTAAGTCAGACGGGAGGCTGAAGATGAAGAAGAGCAGCGACGTGGCGTTCACC
CCGCTGCAGAACTCGGACAATTCGGGCTCTGTGCAAGGACTGGCTCCAGGCTTGCCGTCGGGGTCCGGAG
>NM_001195662
AAGCTCAGCCTTTGCTCAGATTCTCCTCTTGATGAAACAAAGGGATTTCTGCACATGCTTGAGAAATTGC
AGGTCTCACCCAAAATGAGTGACACACCTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCA
AGTTCCTTCCCCTCGCCATTCAAATATCACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGT
GGAGACCCACAGTTTGGCGGCGTTCGGGTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGC
TGGACAGTTTATCCAGGAAGGTACCCCTGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACA
CAGCATCACCAGGCTGGAGGAGCTAGAGGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTG
>NM_011283
AATAAATCCAAAGACATTTGTTTACGTGAAACAAGCAGGTTGCATATCCAGTGACGTTTATACAGACCAC
ACAAACTATTTACTCTTTTCTTCGTAAGGAAAGGTTCAACTTCTGGTCTCACCCAAAATGAGTGACACAC
CTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATTCAAATAT
CACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGG
GTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGCTGGACAGTTTATCCAGGAAGGTACCCC
TGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACACAGCATCACCAGGCTGGAGGAGCTAGA
GGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTGCCAGTTGACCTGGACAAGGCCCGCAGG
CGCCCTCGGCCCTGGCTGAGTAGTCGCTCCATAAGCACGCATGTGCAGCTCTGTCCTGCAACTGCCAATA
TGTCCACCATGGCACCTGGCATGCTCCGTGCCCCAAGGAGGCTCGTGGTCTTCCGGAATGGTGACCCGAA
>NM_0112835
AATAAATCCAAAGACATTTGTTTACGTGAAACAAGCAGGTTGCATATCCAGTGACGTTTATACAGACCAC
ACAAACTATTTACTCTTTTCTTCGTAAGGAAAGGTTCAACTTCTGGTCTCACCCAAAATGAGTGACACAC
CTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATTCAAATAT
CACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGG
GTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGCTGGACAGTTTATCCAGGAAGGTACCCC
TGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACACAGCATCACCAGGCTGGAGGAGCTAGA
GGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTGCCAGTTGACCTGGACAAGGCCCGCAGG
CGCCCTCGGCCCTGGCTGAGTAGTCGCTCCATAAGCACGCATGTGCAGCTCTGTCCTGCAACTGCCAATA
TGTCCACCATGGCACCTGGCATGCTCCGTGCCCCAAGGAGGCTCGTGGTCTTCCGGAATGGTGACCCGAA
%%writefile script/formatFasta.py
#Homework 4

import sys

def formatFasta(filename):
    aList = []
    for line in open(filename):
        if line[0] == '>':
            lineL = line.split(' ')
            if aList:
                print ''.join(aList)
                aList = []
            name = lineL[0]
            print name
        else:
            aList.append(line.strip())
    #Don't forget the last sequence
    print ''.join(aList)
#-----End formatFasta----------------
def main():
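    if len(sys.argv) < 2:  #one argument is required: the file name
        print >>sys.stderr, "Usage: python %s filename" % sys.argv[0] #usage messages normally go to standard error
        sys.exit(1) #a non-zero exit status signals incorrect usage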

    filename = sys.argv[1]
    formatFasta(filename)
#-------END main------------------

if __name__ == '__main__':
    main()
Overwriting script/formatFasta.py
%run script/formatFasta data/test2.fa
>NM_001011874
gcggcggcgggcgagcgggcgctggagtaggagctggggagcggcgcggccggggaaggaagccagggcgaggcgaggaggtggcgggaggaggagacagcagggacaggTGTCAGATAAAGGAGTGCTCTCCTCCGCTGCCGAGGCATCATGGCCGCTAAGTCAGACGGGAGGCTGAAGATGAAGAAGAGCAGCGACGTGGCGTTCACCCCGCTGCAGAACTCGGACAATTCGGGCTCTGTGCAAGGACTGGCTCCAGGCTTGCCGTCGGGGTCCGGAG
>NM_001195662
AAGCTCAGCCTTTGCTCAGATTCTCCTCTTGATGAAACAAAGGGATTTCTGCACATGCTTGAGAAATTGCAGGTCTCACCCAAAATGAGTGACACACCTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATTCAAATATCACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGGGTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGCTGGACAGTTTATCCAGGAAGGTACCCCTGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACACAGCATCACCAGGCTGGAGGAGCTAGAGGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTG
>NM_011283
AATAAATCCAAAGACATTTGTTTACGTGAAACAAGCAGGTTGCATATCCAGTGACGTTTATACAGACCACACAAACTATTTACTCTTTTCTTCGTAAGGAAAGGTTCAACTTCTGGTCTCACCCAAAATGAGTGACACACCTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATTCAAATATCACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGGGTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGCTGGACAGTTTATCCAGGAAGGTACCCCTGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACACAGCATCACCAGGCTGGAGGAGCTAGAGGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTGCCAGTTGACCTGGACAAGGCCCGCAGGCGCCCTCGGCCCTGGCTGAGTAGTCGCTCCATAAGCACGCATGTGCAGCTCTGTCCTGCAACTGCCAATATGTCCACCATGGCACCTGGCATGCTCCGTGCCCCAAGGAGGCTCGTGGTCTTCCGGAATGGTGACCCGAA
>NM_0112835
AATAAATCCAAAGACATTTGTTTACGTGAAACAAGCAGGTTGCATATCCAGTGACGTTTATACAGACCACACAAACTATTTACTCTTTTCTTCGTAAGGAAAGGTTCAACTTCTGGTCTCACCCAAAATGAGTGACACACCTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATTCAAATATCACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGGGTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGCTGGACAGTTTATCCAGGAAGGTACCCCTGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACACAGCATCACCAGGCTGGAGGAGCTAGAGGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTGCCAGTTGACCTGGACAAGGCCCGCAGGCGCCCTCGGCCCTGGCTGAGTAGTCGCTCCATAAGCACGCATGTGCAGCTCTGTCCTGCAACTGCCAATATGTCCACCATGGCACCTGGCATGCTCCGTGCCCCAAGGAGGCTCGTGGTCTTCCGGAATGGTGACCCGAA
%%writefile script/formatFasta-2.py
#Homework 5

import sys

def formatSeq(aList, length):
    seq = ''.join(aList)
    len_seq = len(seq)
    for i in range(0,len_seq,length):
        print seq[i:i+length] 
#----END formatSeq-------------------

def formatFasta(filename,length):
    aList = []
    for line in open(filename):
        if line[0] == '>':
            lineL = line.split(' ')
            if aList:
                formatSeq(aList, length)
                aList = []
            name = lineL[0]
            print name
        else:
            aList.append(line.strip())
    #Don't forget the last sequence
    formatSeq(aList, length)
#------------END of formatFasta------------
def main():
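    if len(sys.argv) < 3:  #note the changed check: this script takes two arguments
        print >>sys.stderr, "Usage: python %s filename length_of_seq_in_each_line" % sys.argv[0] #usage messages normally go to standard error
        sys.exit(1) #a non-zero exit status signals incorrect usage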

    filename = sys.argv[1]
    length = int(sys.argv[2])
    formatFasta(filename, length)
#-------END main------------------

if __name__ == '__main__':
    main()
Overwriting script/formatFasta-2.py
%run script/formatFasta-2 data/test2.fa 80
>NM_001011874
gcggcggcgggcgagcgggcgctggagtaggagctggggagcggcgcggccggggaaggaagccagggcgaggcgaggag
gtggcgggaggaggagacagcagggacaggTGTCAGATAAAGGAGTGCTCTCCTCCGCTGCCGAGGCATCATGGCCGCTA
AGTCAGACGGGAGGCTGAAGATGAAGAAGAGCAGCGACGTGGCGTTCACCCCGCTGCAGAACTCGGACAATTCGGGCTCT
GTGCAAGGACTGGCTCCAGGCTTGCCGTCGGGGTCCGGAG
>NM_001195662
AAGCTCAGCCTTTGCTCAGATTCTCCTCTTGATGAAACAAAGGGATTTCTGCACATGCTTGAGAAATTGCAGGTCTCACC
CAAAATGAGTGACACACCTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATT
CAAATATCACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGGGTG
GTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGCTGGACAGTTTATCCAGGAAGGTACCCCTGCCCTTTGGGGT
AAGGAACATCAGCACGCCCCGTGGACGACACAGCATCACCAGGCTGGAGGAGCTAGAGGACGGCAAGTCTTATGTGTGCT
CCCACAATAAGAAGGTGCTG
>NM_011283
AATAAATCCAAAGACATTTGTTTACGTGAAACAAGCAGGTTGCATATCCAGTGACGTTTATACAGACCACACAAACTATT
TACTCTTTTCTTCGTAAGGAAAGGTTCAACTTCTGGTCTCACCCAAAATGAGTGACACACCTTCTACTAGTTTCTCCATG
ATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATTCAAATATCACTCATCCTGTAGTGGCTAAACGCATCAG
TTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGGGTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTC
TGCTGGACAGTTTATCCAGGAAGGTACCCCTGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACACAGCATC
ACCAGGCTGGAGGAGCTAGAGGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTGCCAGTTGACCTGGACAA
GGCCCGCAGGCGCCCTCGGCCCTGGCTGAGTAGTCGCTCCATAAGCACGCATGTGCAGCTCTGTCCTGCAACTGCCAATA
TGTCCACCATGGCACCTGGCATGCTCCGTGCCCCAAGGAGGCTCGTGGTCTTCCGGAATGGTGACCCGAA
>NM_0112835
AATAAATCCAAAGACATTTGTTTACGTGAAACAAGCAGGTTGCATATCCAGTGACGTTTATACAGACCACACAAACTATT
TACTCTTTTCTTCGTAAGGAAAGGTTCAACTTCTGGTCTCACCCAAAATGAGTGACACACCTTCTACTAGTTTCTCCATG
ATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATTCAAATATCACTCATCCTGTAGTGGCTAAACGCATCAG
TTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGGGTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTC
TGCTGGACAGTTTATCCAGGAAGGTACCCCTGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACACAGCATC
ACCAGGCTGGAGGAGCTAGAGGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTGCCAGTTGACCTGGACAA
GGCCCGCAGGCGCCCTCGGCCCTGGCTGAGTAGTCGCTCCATAAGCACGCATGTGCAGCTCTGTCCTGCAACTGCCAATA
TGTCCACCATGGCACCTGGCATGCTCCGTGCCCCAAGGAGGCTCGTGGTCTTCCGGAATGGTGACCCGAA
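
The slicing step inside formatSeq can also be written as a function that returns the pieces instead of printing them, which makes it easy to compare the result with what you expect. The name `wrap_seq` is invented for this sketch.

```python
from __future__ import print_function  # same code runs on Python 2 and 3


def wrap_seq(seq, length):
    """Cut seq into consecutive pieces of at most `length` characters."""
    # range(0, len(seq), length) yields the start index of every piece
    return [seq[i:i + length] for i in range(0, len(seq), length)]


pieces = wrap_seq("ACGTACGTAC", 4)
print("\n".join(pieces))   # the last piece is shorter than the others
```

`length` has to be a positive integer; a step of 0 makes `range` raise a ValueError.
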
%%writefile script/sortFasta.py
#Homework 6

import sys

def readFasta(filename):
    aDict = {}
    for line in open(filename):
        if line[0] == '>':
            key = line.split()[0]
            aDict[key] = [] #each dictionary value is a list; the key on the left of = can be thought of as the name of the list on the right
        else:
            aDict[key].append(line.strip()) 
    return aDict
#-----END readFasta-------------------
def sortFasta(aDict):
    keyL = aDict.keys()
    keyL.sort()

    for key in keyL:
        print "%s\n%s" % (key, ''.join(aDict[key]))
#--------END sortFasta--------------------

def main():
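    if len(sys.argv) < 2:  #only one argument is needed here
        print >>sys.stderr, "Usage: python %s filename" % sys.argv[0] #usage messages normally go to standard error
        sys.exit(1) #a non-zero exit status signals incorrect usage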

    filename = sys.argv[1]
    aDict = readFasta(filename)
    sortFasta(aDict)

#-------END main------------------

if __name__ == '__main__':
    main()
Overwriting sortFasta.py
%run script/sortFasta data/test2.fa
>NM_001011874
gcggcggcgggcgagcgggcgctggagtaggagctggggagcggcgcggccggggaaggaagccagggcgaggcgaggaggtggcgggaggaggagacagcagggacaggTGTCAGATAAAGGAGTGCTCTCCTCCGCTGCCGAGGCATCATGGCCGCTAAGTCAGACGGGAGGCTGAAGATGAAGAAGAGCAGCGACGTGGCGTTCACCCCGCTGCAGAACTCGGACAATTCGGGCTCTGTGCAAGGACTGGCTCCAGGCTTGCCGTCGGGGTCCGGAG
>NM_001195662
AAGCTCAGCCTTTGCTCAGATTCTCCTCTTGATGAAACAAAGGGATTTCTGCACATGCTTGAGAAATTGCAGGTCTCACCCAAAATGAGTGACACACCTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATTCAAATATCACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGGGTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGCTGGACAGTTTATCCAGGAAGGTACCCCTGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACACAGCATCACCAGGCTGGAGGAGCTAGAGGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTG
>NM_011283
AATAAATCCAAAGACATTTGTTTACGTGAAACAAGCAGGTTGCATATCCAGTGACGTTTATACAGACCACACAAACTATTTACTCTTTTCTTCGTAAGGAAAGGTTCAACTTCTGGTCTCACCCAAAATGAGTGACACACCTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATTCAAATATCACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGGGTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGCTGGACAGTTTATCCAGGAAGGTACCCCTGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACACAGCATCACCAGGCTGGAGGAGCTAGAGGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTGCCAGTTGACCTGGACAAGGCCCGCAGGCGCCCTCGGCCCTGGCTGAGTAGTCGCTCCATAAGCACGCATGTGCAGCTCTGTCCTGCAACTGCCAATATGTCCACCATGGCACCTGGCATGCTCCGTGCCCCAAGGAGGCTCGTGGTCTTCCGGAATGGTGACCCGAA
>NM_0112835
AATAAATCCAAAGACATTTGTTTACGTGAAACAAGCAGGTTGCATATCCAGTGACGTTTATACAGACCACACAAACTATTTACTCTTTTCTTCGTAAGGAAAGGTTCAACTTCTGGTCTCACCCAAAATGAGTGACACACCTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATTCAAATATCACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGGGTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGCTGGACAGTTTATCCAGGAAGGTACCCCTGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACACAGCATCACCAGGCTGGAGGAGCTAGAGGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTGCCAGTTGACCTGGACAAGGCCCGCAGGCGCCCTCGGCCCTGGCTGAGTAGTCGCTCCATAAGCACGCATGTGCAGCTCTGTCCTGCAACTGCCAATATGTCCACCATGGCACCTGGCATGCTCCGTGCCCCAAGGAGGCTCGTGGTCTTCCGGAATGGTGACCCGAA
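
sortFasta calls `keys()` and then `sort()` on the result, which works in Python 2 because `keys()` returns a list there. In Python 3 `keys()` returns a view without a `sort()` method. A sketch of the same ordering with the built-in `sorted()`, which behaves the same way in both versions, is shown below; the function name and the toy dictionary are invented.

```python
from __future__ import print_function  # same code runs on Python 2 and 3


def sorted_records(aDict):
    """Yield (name, sequence) pairs ordered by name."""
    for key in sorted(aDict):              # sorted() accepts any iterable of keys
        yield key, "".join(aDict[key])


aDict = {">NM_011283": ["AATA", "AATC"], ">NM_001011874": ["gcgg"]}  # toy data
for name, seq in sorted_records(aDict):
    print("%s\n%s" % (name, seq))
```
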
%%writefile script/grepFasta.py
#Homework 7.1

import sys

def readFasta(filename):
    aDict = {}
    for line in open(filename):
        if line[0] == '>':
            key = line.split()[0][1:] #drop the leading '>' and keep only the sequence name
            aDict[key] = []  #each dictionary value is a list
        else:
            aDict[key].append(line.strip())
    return aDict
#--------END readFasta--------------------
def main():
    if len(sys.argv) < 3:  #two arguments are needed here
        print >>sys.stderr, "Usage: python %s seqFile nameFile" % sys.argv[0] #usage messages normally go to standard error
        sys.exit(1) #a non-zero exit status signals incorrect usage

    seqFile = sys.argv[1]
    nameFile = sys.argv[2]
    aDict = readFasta(seqFile)
    for line in open(nameFile):
        name = line.strip()
        print ">%s\n%s" % (name, ''.join(aDict[name]))
#-------END main------------------

if __name__ == '__main__':
    main()
Overwriting grepFasta.py
%run script/grepFasta.py data/test2.fa data/fasta.name
>NM_001011874
gcggcggcgggcgagcgggcgctggagtaggagctggggagcggcgcggccggggaaggaagccagggcgaggcgaggaggtggcgggaggaggagacagcagggacaggTGTCAGATAAAGGAGTGCTCTCCTCCGCTGCCGAGGCATCATGGCCGCTAAGTCAGACGGGAGGCTGAAGATGAAGAAGAGCAGCGACGTGGCGTTCACCCCGCTGCAGAACTCGGACAATTCGGGCTCTGTGCAAGGACTGGCTCCAGGCTTGCCGTCGGGGTCCGGAG
>NM_011283
AATAAATCCAAAGACATTTGTTTACGTGAAACAAGCAGGTTGCATATCCAGTGACGTTTATACAGACCACACAAACTATTTACTCTTTTCTTCGTAAGGAAAGGTTCAACTTCTGGTCTCACCCAAAATGAGTGACACACCTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATTCAAATATCACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGGGTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGCTGGACAGTTTATCCAGGAAGGTACCCCTGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACACAGCATCACCAGGCTGGAGGAGCTAGAGGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTGCCAGTTGACCTGGACAAGGCCCGCAGGCGCCCTCGGCCCTGGCTGAGTAGTCGCTCCATAAGCACGCATGTGCAGCTCTGTCCTGCAACTGCCAATATGTCCACCATGGCACCTGGCATGCTCCGTGCCCCAAGGAGGCTCGTGGTCTTCCGGAATGGTGACCCGAA
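
In grepFasta, `aDict[name]` raises a KeyError as soon as the name file contains a name that is not in the sequence file, and this includes an empty line. One possible way to guard against that is sketched below; the function name, the toy data, and the choice to collect missing names rather than stop are assumptions made for illustration.

```python
from __future__ import print_function  # same code runs on Python 2 and 3
import sys


def grep_records(aDict, names):
    """Return FASTA text for the requested names, plus the names not found."""
    found, missing = [], []
    for raw in names:
        name = raw.strip()
        if not name:                 # skip blank lines in the name list
            continue
        if name in aDict:            # test membership before indexing
            found.append(">%s\n%s" % (name, "".join(aDict[name])))
        else:
            missing.append(name)
    return found, missing


aDict = {"NM_001011874": ["gcgg", "cggc"], "NM_011283": ["AATA"]}   # toy data
found, missing = grep_records(aDict, ["NM_011283\n", "\n", "NM_999\n"])
print("\n".join(found))
print("Not found: %s" % ", ".join(missing), file=sys.stderr)
```
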
%%writefile script/grepFastq.py
#Homework 7.2
import sys

def readFastq(seqFile):
    aDict = {}
    count = 0
    for line in open(seqFile):
        count += 1
        if count % 4 == 1:
            name = line.strip()[1:] #drop the leading @
            aDict[name] = [line] #store the name line
        else:
            aDict[name].append(line)
    #----END reading--------------
    return aDict
#-------END readFastq----------this line is only a comment and is optional------

def main():
    if len(sys.argv) < 3:  #two arguments are needed here
        print >>sys.stderr, "Usage: python %s seqFile nameFile" % sys.argv[0] #usage messages normally go to standard error
        sys.exit(1) #a non-zero exit status signals incorrect usage

    seqFile = sys.argv[1]
    nameFile = sys.argv[2]
    aDict = readFastq(seqFile)
    for line in open(nameFile):
        name = line.strip()
        print ''.join(aDict[name]),
#-------END main------------------

if __name__ == '__main__':
    main()
Overwriting grepFastq.py
%run script/grepFastq data/test1.fq data/fastq.name
@HWI-ST1223:80:D1FMTACXX:2:1101:1243:2213 1:N:0:AGTCAA
TCTGTGTAGCCNTGGCTGTCCTGGAACTCACTTTGTAGACCAGGCTGGCATGCACCACCACNNNCGGCTCATTTGTCTTTNNTTTTTGTTTTGTTCTGTA
+
BCCFFFFFFHH#4AFHIJJJJJJJJJJJJJJJJJIJIJJJJJGHIJJJJJJJJJJJJJIIJ###--5ABECFFDDEEEEE##,5=@B8?CDD<AD>C:@>
@HWI-ST1223:80:D1FMTACXX:2:1101:1375:2060 1:N:0:AGTCAA
NTGCTGAGCCACGACAAGGATCCCAGAGGGCCNAGCCCTGCATCTTGTATGGACCAGTTACNCATCAAAAGAGACTACTGTAGGCACCATCAATCAGATC
+
#1:DDDD;?CFFHDFEEIGIIIIIIG;DHFGG#)0?BFBDHBFF<FCFEFD;@DD@A=7?E#,,,;=(>3;=;;C>ACCC@CCCCCBBBCCAACCCCCCC
@HWI-ST1223:80:D1FMTACXX:2:1101:1383:2091 1:N:0:AGTCAA
NGTTCGTGTGGAACCTGGCGCTAAACCATTCGTAGACGACCTGCTTCTGGGTCGGGGTTTCGTACGTAGCAGAGCAGCTCCCTCGCTGCGATCTATTGAA
+
#1=DDFDFHHHHHJGJJJJJJJJJJJJJJJIJIGDHIHIGIJJJJJJJIIIGHHFDD3>BDDBDDDDDDDDDDBDCCBDDDDDDDDDDDBBDDDDEEACD
@HWI-ST1223:80:D1FMTACXX:2:1101:1452:2138 1:N:0:AGTCAA
NTCTAGGAGGTCTAGAAAGCCCAGGCCACCGGTACAAACATCAAGGGTGTTACGGATGTGCCGCTCTGAACCTCCAGGACGACTTTGATTTCAACTACAA
+
#4=DFFEFHHHHHJJJJJIJJJJHIIJGJJJJ@GIIJJJJJJIJJJJFGHIIIJJHHHDFFFFDDDDDDDDDDDDCDDDDDDDDDDDCCCEDEDDDDDDD
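
readFastq counts lines instead of testing whether a line starts with '@', and this matters because a quality line may itself begin with '@'. The same idea can be expressed by taking four lines at a time from one iterator, as in the sketch below. The function name and the toy reads are invented, and the sketch assumes every record is exactly four lines long.

```python
from __future__ import print_function  # same code runs on Python 2 and 3


def fastq_records(lines):
    """Yield (name, record_lines) for each block of four lines."""
    it = iter(lines)
    # zip pulls four successive items from the same iterator per record;
    # an incomplete final block is silently dropped.
    for header, seq, plus, qual in zip(it, it, it, it):
        name = header.strip()[1:]          # drop the leading '@'
        yield name, [header, seq, plus, qual]


lines = ["@read1 1:N:0\n", "ACGT\n", "+\n", "@III\n",
         "@read2 1:N:0\n", "TTGA\n", "+\n", "#III\n"]   # toy data
aDict = dict(fastq_records(lines))
print("".join(aDict["read2 1:N:0"]), end="")
```

The quality line of the first toy read starts with '@' on purpose; because records are formed by position, it is not mistaken for a header.
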
%%writefile script/screenResult.py
#Homework 8
import sys

def screenResult(filename, fc_value, adjP_value):
    head = 1 #This is used to indicate there is one head line needing to skip
    for line in open(filename):
        if head:
            head -= 1
            continue
        #-----Begin data lines------
        lineL = line.split()

        foldChange = float(lineL[9])
        adjP = float(lineL[12])
        if foldChange > fc_value and adjP < adjP_value:
            print lineL[0]
#-------END screenResult-------------------
def main():
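    if len(sys.argv) < 4:  #three arguments are needed here
        print >>sys.stderr, "Usage: python %s fileName fold_change adjP" % sys.argv[0] #usage messages normally go to standard error
        sys.exit(1) #a non-zero exit status signals incorrect usage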

    filename = sys.argv[1]
    fc = float(sys.argv[2])
    adjP = float(sys.argv[3])

    screenResult(filename, fc, adjP)
#-------END main------------------

if __name__ == '__main__':
    main()
Overwriting screenResult.py
%run script/screenResult data/test.expr 2 0.05
Novel00011
Novel00043
Novel00047
Novel00077
Novel00079
Novel00080
Novel00084
Novel00085
Novel00086
Novel00087
Novel00090
Novel00124
Novel00148
Novel00156
Novel00162
Novel00166
%%writefile script/transferMultipleColumToMatrix.py
#Homework 9

import sys

def readMultipleColFile(filename):
    head = 1
    tissueSet = set()
    aDict = {}
    for line in open(filename):
        if head: #skip header line
            head -= 1
            continue
        #-------------------------------
        lineL = line.split()
        gene = lineL[0]
        tissue = lineL[1]
        expr = lineL[2]
        if gene not in aDict:
            aDict[gene] = {}
        assert tissue not in aDict[gene], "Duplicate tissues"
        aDict[gene][tissue] = expr
        tissueSet.add(tissue)
    #---END reading----------------------------
    tissueL = list(tissueSet)
    tissueL.sort()
    return tissueL, aDict
#----END readMultipleColFile--------------------------

def main():
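    if len(sys.argv) < 2:  #only one argument is needed here
        print >>sys.stderr, "Usage: python %s fileName" % sys.argv[0] #usage messages normally go to standard error
        sys.exit(1) #a non-zero exit status signals incorrect usage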

    filename = sys.argv[1]
    tissueL, aDict = readMultipleColFile(filename)

    print "Gene\t%s" % '\t'.join(tissueL)

    for gene, tissueD in aDict.items():
        exprL = [gene]
        for tissue in tissueL:
            exprL.append(tissueD[tissue])
        print '\t'.join(exprL)
#-------END main------------------
if __name__ == '__main__':
    main()
Overwriting transferMultipleColumToMatrix.py
%run script/transferMultipleColumToMatrix data/multipleColExpr.txt
Gene    A-431    A-549    AN3-CA    BEWO    CACO-2
ENSG00000000460    25.2    14.2    10.6    24.4    14.2
ENSG00000000938    0.0    0.0    0.0    0.0    0.0
ENSG00000000457    2.8    3.4    3.8    5.8    2.9
ENSG00000000419    73.8    38.6    33.9    53.7    155.5
ENSG00000000003    21.3    32.5    38.2    31.4    63.9
ENSG00000000005    0.0    0.0    0.0    0.0    0.0
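
The script above prints genes in whatever order the dictionary returns them, and `tissueD[tissue]` raises a KeyError if a gene has no value for one of the tissues. The sketch below shows one way to get a fixed row order and to fill gaps. The function name, the toy rows, and the "NA" placeholder are assumptions, and unlike the script above it does not check for duplicate gene and tissue pairs.

```python
from __future__ import print_function  # same code runs on Python 2 and 3


def to_matrix(rows, missing="NA"):
    """rows: iterable of (gene, tissue, expr) strings -> list of tab-joined lines."""
    aDict = {}
    tissueSet = set()
    for gene, tissue, expr in rows:
        # setdefault creates the inner dictionary the first time a gene is seen
        aDict.setdefault(gene, {})[tissue] = expr
        tissueSet.add(tissue)
    tissueL = sorted(tissueSet)
    out = ["Gene\t" + "\t".join(tissueL)]
    for gene in sorted(aDict):                       # fixed row order
        exprL = [aDict[gene].get(t, missing) for t in tissueL]
        out.append("\t".join([gene] + exprL))
    return out


rows = [("ENSG2", "A-431", "25.2"), ("ENSG1", "A-431", "2.8"),
        ("ENSG1", "A-549", "3.4")]                   # toy data; ENSG2 lacks A-549
print("\n".join(to_matrix(rows)))
```
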
%%writefile script/reverseComplementary.py
#Homework 10
import sys

def reverseComplementary(string):
    ATCG_dict = {'A':'T','T':'A','C':'G','G':'C', 'a':'t','t':'a','c':'g','g':'c'}

    tmpL = []
    for i in string:
        tmpL.append(ATCG_dict[i])
    tmpL.reverse()
    return ''.join(tmpL)
#------END reverseComplementary-------
def main():
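    if len(sys.argv) < 2:  #only one argument is needed here
        print >>sys.stderr, "Usage: python %s DNAsequence" % sys.argv[0] #usage messages normally go to standard error
        sys.exit(1) #a non-zero exit status signals incorrect usage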

    DNA = sys.argv[1]
    print reverseComplementary(DNA)    
#-------END main------------------

if __name__ == '__main__':
    main()
Overwriting reverseComplementary.py
%run script/reverseComplementary ACGTACGTACGTCACGTCAGCTAGAC
GTCTAGCTGACGTGACGTACGTACGT
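
The lookup `ATCG_dict[i]` raises a KeyError for any character outside the eight listed bases, for example the N that appears in the FASTQ reads shown earlier. A tolerant variant is sketched below. Mapping unknown characters to 'N' is a choice made for this example, not something the tutorial specifies.

```python
from __future__ import print_function  # same code runs on Python 2 and 3

COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C',
              'a': 't', 't': 'a', 'c': 'g', 'g': 'c'}


def reverse_complement(seq, unknown='N'):
    """Reverse-complement seq; characters outside ACGT/acgt become `unknown`."""
    # reversed() walks the sequence back to front, dict.get supplies the fallback
    return ''.join(COMPLEMENT.get(base, unknown) for base in reversed(seq))


print(reverse_complement("ACGTACGTACGTCACGTCAGCTAGAC"))
print(reverse_complement("AANGT"))
```

For the sequence used in the run above this gives the same result, GTCTAGCTGACGTGACGTACGTACGT, and the second call returns ACNTT instead of failing.
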
%%writefile script/collapsemiRNAreads.py
#Homework 11

import sys

def collapsemiRNAreads(smRNA_file, sample, head = 1):
    lineno = 0
    for line in open(smRNA_file):
        if head:
            head -= 1
            continue
        #----end skip header line---
        seq, value = line.split()
        lineno += 1
        print ">%s_%d_x%s\n%s" % (sample, lineno, value, seq)
#-------end collapsemiRNAreads------------
def main():
    if len(sys.argv) < 3:  #two arguments are needed here
        print >>sys.stderr, \
            "Usage: python %s smRNA_file sampleLabel (Only three letters like ESB)" % sys.argv[0] #usage messages normally go to standard error
        sys.exit(1) #a non-zero exit status signals incorrect usage

    smRNA_file = sys.argv[1]
    sample = sys.argv[2]
    collapsemiRNAreads(smRNA_file, sample)    
#-------END main------------------

if __name__ == '__main__':
    main()
Overwriting collapsemiRNAreads.py
%run script/collapsemiRNAreads data/mir.collapse ESB
>ESB_1_x2
ACTGCCCTAAGTGCTCCTTCTGGC
>ESB_2_x2
ATAAGGTGCATCTAGTGCAGATA
>ESB_3_x1
TGAGGTAGTAGTTTGTGCTGTTT
>ESB_4_x1
TCCTACGAGTTGCATGGATTC
>ESB_5_x1
ACCGGGTGGAGCCGCCGCA
>ESB_6_x1
ACTGCCCTAAGTGCTCCTTCTGGT
>ESB_7_x1
TACAGGGCTGGGGATGG
>ESB_8_x1
CCCTGGATGCTGTAGGATG
>ESB_9_x1
AAAATGCTACTACTTTTGAGTC
>ESB_10_x1
TCCCTGGTGGTCTAGTGGCTAGGAT
>ESB_11_x4
CTCTTAGATCGATGTGGTGCTC
>ESB_12_x1
TTGGTGGTTCAGTGGTA
>ESB_13_x1
TCACAATTCCCCTATACCATGGGCCAT
>ESB_14_x1
CAAATGCTAGGGTTGGTGG
>ESB_15_x4
AGGCAGCTTCCACAGCA
>ESB_16_x3
GCACTGAGATGGAGTGGTGTAA
>ESB_17_x3
CTCCATTTGTTTGATGATGGA
>ESB_18_x1
GCACTGAGATGGAGTGGTGTAT
>ESB_19_x2
AGTGCCCAGAGTTTGTAGTGT
>ESB_20_x1
ATAAGGTGCATCTAGTGCTGTTAGA
>ESB_21_x9
AAAATGCTTCCACTTTGTGTG
>ESB_22_x2
GGTAATTCTAGAGCTAATACATGCCG
>ESB_23_x6
CCGGGCGGAAACACCA
>ESB_24_x1
TGAAAGGCTGACCCGCCGGGC
>ESB_25_x1
TTCGTACAGTGGTCGTAAGTTCGTGC
>ESB_26_x1
GGTTAGCACTCTGGACTC
%%writefile script/map.py
#Homework 12
import sys

def readRef(ref):
    refD = {}
    for line in open(ref):
        if line[0] == '>':
            name = line[1:-1]
            assert name not in refD
            refD[name] = []
        else:
            refD[name].append(line.strip())
    #---------END reading----------------
    for key, valueL in refD.items():
        refD[key] = ''.join(valueL)
    return refD
#---------END readRef----------------
def mapSeq(seq, refD):
    len_seq = len(seq)
    for key, value in refD.items():
        pos = value.find(seq)
        while pos != -1:
            print "%s\t%d\t%d\t%s" % (key, pos, pos+len_seq, seq)
            current = pos + 1
            newpos = value[current:].find(seq)
            if newpos == -1:
                break
            pos = current + newpos
#----------END mapSeq------------
def main():
    if len(sys.argv) < 3:  #two arguments are needed here
        print >>sys.stderr, \
            "Usage: python %s Ref_genome.fa reads.fa" % sys.argv[0] #usage messages normally go to standard error
        sys.exit(1) #a non-zero exit status signals incorrect usage

    ref = sys.argv[1]
    read = sys.argv[2]
    refD = readRef(ref)
    #------------------------------------
    for line in open(read):
        if line[0] == '>':
            pass
        else:
            seq = line.strip()
            mapSeq(seq, refD)   
#-------END main------------------

if __name__ == '__main__':
    main()
Overwriting map.py
%run script/map data/ref.fa data/short.fa
chr1    199    208    TGGCGTTCA
chr1    207    216    ACCCCGCTG
chr2    63    70    AAATTGC
chr3    0    7    AATAAAT
chr3    66    74    CCACACAA
chr3    206    214    ATATCACT
chr2    163    171    ATATCACT
chr3    417    423    AGAGGA
chr2    374    380    AGAGGA
chrX    137    143    AGAGGA
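
mapSeq restarts the search by slicing the reference (`value[current:]`) and adding the offset back. `str.find` also accepts a start position as its second argument, which avoids both the slice and the offset arithmetic. A sketch with an invented function name and toy strings:

```python
from __future__ import print_function  # same code runs on Python 2 and 3


def find_all(text, pattern):
    """Return the 0-based start of every match, overlapping ones included."""
    positions = []
    pos = text.find(pattern)
    while pos != -1:
        positions.append(pos)
        pos = text.find(pattern, pos + 1)   # resume one past the last hit
    return positions


print(find_all("AGAGAGGA", "AGA"))
```

Here the two matches overlap (positions 0 and 2), which is why the search resumes at `pos + 1` rather than after the end of the previous match.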

Originally published on the WeChat public account 生信宝典 (Bio_data)

Original publication date: 2017-05-18
