Python学习教程 (四)

Python 教程

欢迎来到Python的世界,本教程将带你遨游Python,领悟Python的魅力。本教程专注于帮助初学者,尤其是生物信息分析人员快速学会Python的常用功能和使用方式,因此只精选了部分Python的功能,请额外参考Python经典教程A byte of python和它的中文版 来更好的理解Python. 本文档的概念和文字描述参考了A byte of python(中文版),特此感谢。

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 Generic License.

实战练习(一)

背景知识

1. FASTA文件格式

>seq_name_1 sequence1 >seq_name_2 sequence2

2. FASTQ文件格式

@HWI-ST1223:80:D1FMTACXX:2:1101:1243:2213 1:N:0:AGTCAA TCTGTGTAGCCNTGGCTGTCCTGGAACTCACTTTGTAGACCAGGCTGGCATGCA + BCCFFFFFFHH#4AFHIJJJJJJJJJJJJJJJJJIJIJJJJJGHIJJJJJJJJJ

作业 (一)

  1. 给定FASTA格式的文件(test1.fa 和 test2.fa),写一个程序 cat.py 读入文件,并输出到屏幕
    • open(file)
    • for .. in loop
    • print
    • the amazng , or strip() function
    • 用到的知识点

2.给定FASTQ格式的文件(test1.fq), 写一个程序 cat.py 读入文件,并输出到屏幕

  • 同上
  • 用到的知识点

3.写程序 splitName.py, 读入test2.fa, 并取原始序列名字第一个空格前的名字为处理后的序列名字,输出到屏幕

  • split
  • 字符串的索引
  • 用到的知识点
  • 输出格式为: >NM_001011874 gcggcggcgggcgagcgggcgctggagtaggagctg.......

4.写程序 formatFasta.py, 读入test2.fa,把每条FASTA序列连成一行然后输出

  • join
  • strip
  • 用到的知识点
  • 输出格式为: >NM_001011874 gcggcggcgggc......TCCGCTG......GCGTTCACC......CGGGGTCCGGAG

5.写程序 formatFasta-2.py, 读入test2.fa,把每条FASTA序列分割成80个字母一行的序列

  • 字符串切片操作
  • range
  • 用到的知识点
  • 输出格式为 >NM_001011874 gcggcggcgc.(60个字母).TCCGCTGACG #(每行80个字母) acgtgctacg.(60个字母).GCGTTCACCC ACGTACGATG(最后一行可不足80个字母)

6.写程序 sortFasta.py, 读入test2.fa, 并取原始序列名字第一个空格前的名字为处理后的序列名字,排序后输出

  • sort
  • dict
  • aDict[key] = []
  • aDict[key].append(value)
  • 用到的知识点

7.提取给定名字的序列

  • 用到的知识点
  • print >>fh, or fh.write()
  • 取模运算,4 % 2 == 0
  • 写程序 grepFasta.py, 提取fasta.name中名字对应的test2.fa的序列,并输出到屏幕。
  • 写程序 grepFastq.py, 提取fastq.name中名字对应的test1.fq的序列,并输出到文件。

8.写程序 screenResult.py, 筛选test.expr中foldChange大于2的基因并且padj小于0.05的基因

  • 逻辑与操作符 and
  • 文件中读取的内容都为字符串,需要用int转换为整数,float转换为浮点数
  • 用到的知识点

9.写程序 transferMultipleColumToMatrix.py 将文件(multipleColExpr.txt)中基因在多个组织中的表达数据转换为矩阵形式

  • aDict[‘key’] = {}
  • aDict[‘key’][‘key2’] = value
  • if key not in aDict
  • aDict = {‘ENSG00000000003’: {“A-431”: 21.3, “A-549”, 32.5,…},”ENSG00000000003”:{},}
  • 用到的知识点
  • 输入格式(只需要前3列就可以) Gene Sample Value Unit Abundance ENSG00000000003 A-431 21.3 FPKM Medium ENSG00000000003 A-549 32.5 FPKM Medium ENSG00000000003 AN3-CA 38.2 FPKM Medium ENSG00000000003 BEWO 31.4 FPKM Medium ENSG00000000003 CACO-2 63.9 FPKM High ENSG00000000005 A-431 0.0 FPKM Not detected ENSG00000000005 A-549 0.0 FPKM Not detected ENSG00000000005 AN3-CA 0.0 FPKM Not detected ENSG00000000005 BEWO 0.0 FPKM Not detected ENSG00000000005 CACO-2 0.0 FPKM Not detected
  • 输出格式 Name A-431 A-549 AN3-CA BEWO CACO-2 ENSG00000000460 25.2 14.2 10.6 24.4 14.2 ENSG00000000938 0.0 0.0 0.0 0.0 0.0 ENSG00000001084 19.1 155.1 24.4 12.6 23.5 ENSG00000000457 2.8 3.4 3.8 5.8 2.9

10.写程序 reverseComplementary.py计算序列 ACGTACGTACGTCACGTCAGCTAGAC的反向互补序列

  • reverse
  • list(seq)
  • 用到的知识点

11.写程序 collapsemiRNAreads.py转换smRNA-Seq的测序数据

  • 输入文件格式(mir.collapse, tab-分割的两列文件,第一列为序列,第二列为序列被测到的次数) ID_REF VALUE ACTGCCCTAAGTGCTCCTTCTGGC 2 ATAAGGTGCATCTAGTGCAGATA 25 TGAGGTAGTAGTTTGTGCTGTTT 100 TCCTACGAGTTGCATGGATTC 4
  • 输出文件格式 (mir.collapse.fa, 名字的前3个字母为样品的特异标示,中间的数字表示第几条序列,是序列名字的唯一标示,第三部分是x加每个reads被测到的次数。三部分用下划线连起来作为fasta序列的名字。) >ESB_1_x2 ACTGCCCTAAGTGCTCCTTCTGGC >ESB_2_x25 ATAAGGTGCATCTAGTGCAGATA >ESB_3_x100 TGAGGTAGTAGTTTGTGCTGTTT >ESB_4_x4 TCCTACGAGTTGCATGGATTC

12.简化的短序列匹配程序 (map.py) 把short.fa中的序列比对到ref.fa, 输出短序列匹配到ref.fa文件中哪些序列的哪些位置

  • find
  • 用到的知识点
  • 输出格式 (输出格式为bed格式,第一列为匹配到的染色体,第二列和第三列为匹配到染色体序列的起始终止位置(位置标记以0为起始,代表第一个位置;终止位置不包含在内,第一个例子中所示序列的位置是(199,208](前闭后开,实际是chr1染色体第199-206的序列,0起始). 第4列为短序列自身的序列.)。
  • 附加要求:可以只匹配到给定的模板链,也可以考虑匹配到模板链的互补链。这时第5列可以为短序列的名字,第六列为链的信息,匹配到模板链为’+’,匹配到互补链为’-‘。注意匹配到互补链时起始位置也是从模板链的5’端算起的。 chr1 199 208 TGGCGTTCA chr1 207 216 ACCCCGCTG chr2 63 70 AAATTGC chr3 0 7 AATAAAT

13.备注:

  • 每个提到提到的“用到的知识点”为相对于前面的题目新增的知识点,请综合考虑。此外,对于不同的思路并不是所有提到的知识点都会用着,而且也可能会用到未提到的知识点。但是所有知识点都在前面的讲义部分有介绍。
  • 每个程序对于你身边会写的人来说都很简单,因此你一定要克制住,独立去把答案做出,多看错误提示,多比对程序输出结果和预期结果的差异。
  • 学习锻炼“读程序”,即对着文件模拟整个的读入、处理过程来发现可能的逻辑问题。
  • 程序运行没有错误不代表你写的程序完成了你的需求,你要去查验输出结果是不是你想要的。

14.关于程序调试

  • 在初写程序时,可能会出现各种各样的错误,常见的有缩进不一致,变量名字拼写错误,丢失冒号,文件名未加引号等,这时要根据错误提示查看错误类型是什么,出错的是哪一行来定位错误。当然,有的时候报错的行自身不一定有错,可能是其前面或后面的行出现了错误。
  • 当结果不符合预期时,要学会使用print来查看每步的操作是否正确,比如我读入了字典,我就打印下字典,看看读入的是不是我想要的,是否含有不该存在的字符;或者在每个判断句、函数调入的情况下打印个字符,来跟踪程序的运行轨迹

%%writefile script/cat.py
#作业1和2, cat.py
#包含了三中处理换行符的写法
for line in open("data/test1.fa"):
    print line.strip()

for line in open("data/test2.fa"):
    print line,

for line in open("data/test1.fq"):
    line = line.strip()
    print line
Overwriting script/cat.py
%run script/cat
>NM_001011874 gene=Xkr4 CDS=151-2091
gcggcggcgggcgagcgggcgctggagtaggagctggggagcggcgcggccggggaaggaagccagggcg
>NM_001195662 gene=Rp1 CDS=55-909
AGGTCTCACCCAAAATGAGTGACACACCTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCA
>NM_011283 gene=Rp1 CDS=128-6412
AATAAATCCAAAGACATTTGTTTACGTGAAACAAGCAGGTTGCATATCCAGTGACGTTTATACAGACCAC
>NM_0112835 gene=Rp15 CDS=128-6412
AATAAATCCAAAGACATTTGTTTACGTGAAACAAGCAGGTTGCATATCCAGTGACGTTTATACAGACCAC
>NM_001011874 gene=Xkr4 CDS=151-2091
gcggcggcgggcgagcgggcgctggagtaggagctggggagcggcgcggccggggaaggaagccagggcg
aggcgaggaggtggcgggaggaggagacagcagggacaggTGTCAGATAAAGGAGTGCTCTCCTCCGCTG
CCGAGGCATCATGGCCGCTAAGTCAGACGGGAGGCTGAAGATGAAGAAGAGCAGCGACGTGGCGTTCACC
CCGCTGCAGAACTCGGACAATTCGGGCTCTGTGCAAGGACTGGCTCCAGGCTTGCCGTCGGGGTCCGGAG
>NM_001195662 gene=Rp1 CDS=55-909
AAGCTCAGCCTTTGCTCAGATTCTCCTCTTGATGAAACAAAGGGATTTCTGCACATGCTTGAGAAATTGC
AGGTCTCACCCAAAATGAGTGACACACCTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCA
AGTTCCTTCCCCTCGCCATTCAAATATCACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGT
GGAGACCCACAGTTTGGCGGCGTTCGGGTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGC
TGGACAGTTTATCCAGGAAGGTACCCCTGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACA
CAGCATCACCAGGCTGGAGGAGCTAGAGGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTG
>NM_011283 gene=Rp1 CDS=128-6412
AATAAATCCAAAGACATTTGTTTACGTGAAACAAGCAGGTTGCATATCCAGTGACGTTTATACAGACCAC
ACAAACTATTTACTCTTTTCTTCGTAAGGAAAGGTTCAACTTCTGGTCTCACCCAAAATGAGTGACACAC
CTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATTCAAATAT
CACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGG
GTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGCTGGACAGTTTATCCAGGAAGGTACCCC
TGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACACAGCATCACCAGGCTGGAGGAGCTAGA
GGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTGCCAGTTGACCTGGACAAGGCCCGCAGG
CGCCCTCGGCCCTGGCTGAGTAGTCGCTCCATAAGCACGCATGTGCAGCTCTGTCCTGCAACTGCCAATA
TGTCCACCATGGCACCTGGCATGCTCCGTGCCCCAAGGAGGCTCGTGGTCTTCCGGAATGGTGACCCGAA
>NM_0112835 gene=Rp1 CDS=128-6412
AATAAATCCAAAGACATTTGTTTACGTGAAACAAGCAGGTTGCATATCCAGTGACGTTTATACAGACCAC
ACAAACTATTTACTCTTTTCTTCGTAAGGAAAGGTTCAACTTCTGGTCTCACCCAAAATGAGTGACACAC
CTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATTCAAATAT
CACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGG
GTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGCTGGACAGTTTATCCAGGAAGGTACCCC
TGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACACAGCATCACCAGGCTGGAGGAGCTAGA
GGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTGCCAGTTGACCTGGACAAGGCCCGCAGG
CGCCCTCGGCCCTGGCTGAGTAGTCGCTCCATAAGCACGCATGTGCAGCTCTGTCCTGCAACTGCCAATA
TGTCCACCATGGCACCTGGCATGCTCCGTGCCCCAAGGAGGCTCGTGGTCTTCCGGAATGGTGACCCGAA
@HWI-ST1223:80:D1FMTACXX:2:1101:1243:2213 1:N:0:AGTCAA
TCTGTGTAGCCNTGGCTGTCCTGGAACTCACTTTGTAGACCAGGCTGGCATGCACCACCACNNNCGGCTCATTTGTCTTTNNTTTTTGTTTTGTTCTGTA
+
BCCFFFFFFHH#4AFHIJJJJJJJJJJJJJJJJJIJIJJJJJGHIJJJJJJJJJJJJJIIJ###--5ABECFFDDEEEEE##,5=@B8?CDD<AD>C:@>
@HWI-ST1223:80:D1FMTACXX:2:1101:1375:2060 1:N:0:AGTCAA
NTGCTGAGCCACGACAAGGATCCCAGAGGGCCNAGCCCTGCATCTTGTATGGACCAGTTACNCATCAAAAGAGACTACTGTAGGCACCATCAATCAGATC
+
#1:DDDD;?CFFHDFEEIGIIIIIIG;DHFGG#)0?BFBDHBFF<FCFEFD;@DD@A=7?E#,,,;=(>3;=;;C>ACCC@CCCCCBBBCCAACCCCCCC
@HWI-ST1223:80:D1FMTACXX:2:1101:1383:2091 1:N:0:AGTCAA
NGTTCGTGTGGAACCTGGCGCTAAACCATTCGTAGACGACCTGCTTCTGGGTCGGGGTTTCGTACGTAGCAGAGCAGCTCCCTCGCTGCGATCTATTGAA
+
#1=DDFDFHHHHHJGJJJJJJJJJJJJJJJIJIGDHIHIGIJJJJJJJIIIGHHFDD3>BDDBDDDDDDDDDDBDCCBDDDDDDDDDDDBBDDDDEEACD
@HWI-ST1223:80:D1FMTACXX:2:1101:1452:2138 1:N:0:AGTCAA
NTCTAGGAGGTCTAGAAAGCCCAGGCCACCGGTACAAACATCAAGGGTGTTACGGATGTGCCGCTCTGAACCTCCAGGACGACTTTGATTTCAACTACAA
+
#4=DFFEFHHHHHJJJJJIJJJJHIIJGJJJJ@GIIJJJJJJIJJJJFGHIIIJJHHHDFFFFDDDDDDDDDDDDCDDDDDDDDDDDCCCEDEDDDDDDD
%%writefile script/splitName.py
#作业3
filename = "data/test2.fa"
for line in open(filename):
    if line[0] == '>':
        lineL = line.split(' ')
        print lineL[0]
    else:
        print line,
Overwriting script/splitName.py
%run script/splitName
>NM_001011874
gcggcggcgggcgagcgggcgctggagtaggagctggggagcggcgcggccggggaaggaagccagggcg
aggcgaggaggtggcgggaggaggagacagcagggacaggTGTCAGATAAAGGAGTGCTCTCCTCCGCTG
CCGAGGCATCATGGCCGCTAAGTCAGACGGGAGGCTGAAGATGAAGAAGAGCAGCGACGTGGCGTTCACC
CCGCTGCAGAACTCGGACAATTCGGGCTCTGTGCAAGGACTGGCTCCAGGCTTGCCGTCGGGGTCCGGAG
>NM_001195662
AAGCTCAGCCTTTGCTCAGATTCTCCTCTTGATGAAACAAAGGGATTTCTGCACATGCTTGAGAAATTGC
AGGTCTCACCCAAAATGAGTGACACACCTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCA
AGTTCCTTCCCCTCGCCATTCAAATATCACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGT
GGAGACCCACAGTTTGGCGGCGTTCGGGTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGC
TGGACAGTTTATCCAGGAAGGTACCCCTGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACA
CAGCATCACCAGGCTGGAGGAGCTAGAGGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTG
>NM_011283
AATAAATCCAAAGACATTTGTTTACGTGAAACAAGCAGGTTGCATATCCAGTGACGTTTATACAGACCAC
ACAAACTATTTACTCTTTTCTTCGTAAGGAAAGGTTCAACTTCTGGTCTCACCCAAAATGAGTGACACAC
CTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATTCAAATAT
CACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGG
GTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGCTGGACAGTTTATCCAGGAAGGTACCCC
TGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACACAGCATCACCAGGCTGGAGGAGCTAGA
GGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTGCCAGTTGACCTGGACAAGGCCCGCAGG
CGCCCTCGGCCCTGGCTGAGTAGTCGCTCCATAAGCACGCATGTGCAGCTCTGTCCTGCAACTGCCAATA
TGTCCACCATGGCACCTGGCATGCTCCGTGCCCCAAGGAGGCTCGTGGTCTTCCGGAATGGTGACCCGAA
>NM_0112835
AATAAATCCAAAGACATTTGTTTACGTGAAACAAGCAGGTTGCATATCCAGTGACGTTTATACAGACCAC
ACAAACTATTTACTCTTTTCTTCGTAAGGAAAGGTTCAACTTCTGGTCTCACCCAAAATGAGTGACACAC
CTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATTCAAATAT
CACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGG
GTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGCTGGACAGTTTATCCAGGAAGGTACCCC
TGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACACAGCATCACCAGGCTGGAGGAGCTAGA
GGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTGCCAGTTGACCTGGACAAGGCCCGCAGG
CGCCCTCGGCCCTGGCTGAGTAGTCGCTCCATAAGCACGCATGTGCAGCTCTGTCCTGCAACTGCCAATA
TGTCCACCATGGCACCTGGCATGCTCCGTGCCCCAAGGAGGCTCGTGGTCTTCCGGAATGGTGACCCGAA
%%writefile script/formatFasta.py
#作业4
filename = "data/test2.fa"
aList = []
for line in open(filename):
    if line[0] == '>':
        lineL = line.split(' ')
        if aList:
            print ''.join(aList)
            aList = []
        name = lineL[0]
        print name
    else:
        aList.append(line.strip())
#不要忘了最后一个序列
print ''.join(aList)
Overwriting script/formatFasta.py
%run script/formatFasta
>NM_001011874
gcggcggcgggcgagcgggcgctggagtaggagctggggagcggcgcggccggggaaggaagccagggcgaggcgaggaggtggcgggaggaggagacagcagggacaggTGTCAGATAAAGGAGTGCTCTCCTCCGCTGCCGAGGCATCATGGCCGCTAAGTCAGACGGGAGGCTGAAGATGAAGAAGAGCAGCGACGTGGCGTTCACCCCGCTGCAGAACTCGGACAATTCGGGCTCTGTGCAAGGACTGGCTCCAGGCTTGCCGTCGGGGTCCGGAG
>NM_001195662
AAGCTCAGCCTTTGCTCAGATTCTCCTCTTGATGAAACAAAGGGATTTCTGCACATGCTTGAGAAATTGCAGGTCTCACCCAAAATGAGTGACACACCTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATTCAAATATCACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGGGTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGCTGGACAGTTTATCCAGGAAGGTACCCCTGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACACAGCATCACCAGGCTGGAGGAGCTAGAGGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTG
>NM_011283
AATAAATCCAAAGACATTTGTTTACGTGAAACAAGCAGGTTGCATATCCAGTGACGTTTATACAGACCACACAAACTATTTACTCTTTTCTTCGTAAGGAAAGGTTCAACTTCTGGTCTCACCCAAAATGAGTGACACACCTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATTCAAATATCACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGGGTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGCTGGACAGTTTATCCAGGAAGGTACCCCTGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACACAGCATCACCAGGCTGGAGGAGCTAGAGGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTGCCAGTTGACCTGGACAAGGCCCGCAGGCGCCCTCGGCCCTGGCTGAGTAGTCGCTCCATAAGCACGCATGTGCAGCTCTGTCCTGCAACTGCCAATATGTCCACCATGGCACCTGGCATGCTCCGTGCCCCAAGGAGGCTCGTGGTCTTCCGGAATGGTGACCCGAA
>NM_0112835
AATAAATCCAAAGACATTTGTTTACGTGAAACAAGCAGGTTGCATATCCAGTGACGTTTATACAGACCACACAAACTATTTACTCTTTTCTTCGTAAGGAAAGGTTCAACTTCTGGTCTCACCCAAAATGAGTGACACACCTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATTCAAATATCACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGGGTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGCTGGACAGTTTATCCAGGAAGGTACCCCTGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACACAGCATCACCAGGCTGGAGGAGCTAGAGGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTGCCAGTTGACCTGGACAAGGCCCGCAGGCGCCCTCGGCCCTGGCTGAGTAGTCGCTCCATAAGCACGCATGTGCAGCTCTGTCCTGCAACTGCCAATATGTCCACCATGGCACCTGGCATGCTCCGTGCCCCAAGGAGGCTCGTGGTCTTCCGGAATGGTGACCCGAA
%%writefile script/formatFasta-2.py
#作业5 
filename = "data/test2.fa"
length = 80
aList = []
for line in open(filename):
    if line[0] == '>':
        lineL = line.split(' ')
        if aList:
            seq = ''.join(aList)
            len_seq = len(seq)
            for i in range(0,len_seq,length):
                print seq[i:i+length]
            aList = []
        name = lineL[0]
        print name
    else:
        aList.append(line.strip())
#不要忘了最后一个序列
seq = ''.join(aList)
len_seq = len(seq)
for i in range(0,len_seq,length):
    print seq[i:i+length]
Overwriting script/formatFasta-2.py
%run script/formatFasta-2
>NM_001011874
gcggcggcgggcgagcgggcgctggagtaggagctggggagcggcgcggccggggaaggaagccagggcgaggcgaggag
gtggcgggaggaggagacagcagggacaggTGTCAGATAAAGGAGTGCTCTCCTCCGCTGCCGAGGCATCATGGCCGCTA
AGTCAGACGGGAGGCTGAAGATGAAGAAGAGCAGCGACGTGGCGTTCACCCCGCTGCAGAACTCGGACAATTCGGGCTCT
GTGCAAGGACTGGCTCCAGGCTTGCCGTCGGGGTCCGGAG
>NM_001195662
AAGCTCAGCCTTTGCTCAGATTCTCCTCTTGATGAAACAAAGGGATTTCTGCACATGCTTGAGAAATTGCAGGTCTCACC
CAAAATGAGTGACACACCTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATT
CAAATATCACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGGGTG
GTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGCTGGACAGTTTATCCAGGAAGGTACCCCTGCCCTTTGGGGT
AAGGAACATCAGCACGCCCCGTGGACGACACAGCATCACCAGGCTGGAGGAGCTAGAGGACGGCAAGTCTTATGTGTGCT
CCCACAATAAGAAGGTGCTG
>NM_011283
AATAAATCCAAAGACATTTGTTTACGTGAAACAAGCAGGTTGCATATCCAGTGACGTTTATACAGACCACACAAACTATT
TACTCTTTTCTTCGTAAGGAAAGGTTCAACTTCTGGTCTCACCCAAAATGAGTGACACACCTTCTACTAGTTTCTCCATG
ATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATTCAAATATCACTCATCCTGTAGTGGCTAAACGCATCAG
TTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGGGTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTC
TGCTGGACAGTTTATCCAGGAAGGTACCCCTGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACACAGCATC
ACCAGGCTGGAGGAGCTAGAGGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTGCCAGTTGACCTGGACAA
GGCCCGCAGGCGCCCTCGGCCCTGGCTGAGTAGTCGCTCCATAAGCACGCATGTGCAGCTCTGTCCTGCAACTGCCAATA
TGTCCACCATGGCACCTGGCATGCTCCGTGCCCCAAGGAGGCTCGTGGTCTTCCGGAATGGTGACCCGAA
>NM_0112835
AATAAATCCAAAGACATTTGTTTACGTGAAACAAGCAGGTTGCATATCCAGTGACGTTTATACAGACCACACAAACTATT
TACTCTTTTCTTCGTAAGGAAAGGTTCAACTTCTGGTCTCACCCAAAATGAGTGACACACCTTCTACTAGTTTCTCCATG
ATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATTCAAATATCACTCATCCTGTAGTGGCTAAACGCATCAG
TTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGGGTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTC
TGCTGGACAGTTTATCCAGGAAGGTACCCCTGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACACAGCATC
ACCAGGCTGGAGGAGCTAGAGGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTGCCAGTTGACCTGGACAA
GGCCCGCAGGCGCCCTCGGCCCTGGCTGAGTAGTCGCTCCATAAGCACGCATGTGCAGCTCTGTCCTGCAACTGCCAATA
TGTCCACCATGGCACCTGGCATGCTCCGTGCCCCAAGGAGGCTCGTGGTCTTCCGGAATGGTGACCCGAA
%%writefile script/sortFasta.py
#作业6
aDict = {}

filename = "data/test2.fa"

for line in open(filename):
    if line[0] == '>':
        key = line.split()[0]
        aDict[key] = []
    else:
        aDict[key].append(line.strip())
#--------END reading--------------------
keyL = aDict.keys()
keyL.sort()

for key in keyL:
    print "%s\n%s" % (key, ''.join(aDict[key]))
Overwriting sortFasta.py
%run script/sortFasta
>NM_001011874
gcggcggcgggcgagcgggcgctggagtaggagctggggagcggcgcggccggggaaggaagccagggcgaggcgaggaggtggcgggaggaggagacagcagggacaggTGTCAGATAAAGGAGTGCTCTCCTCCGCTGCCGAGGCATCATGGCCGCTAAGTCAGACGGGAGGCTGAAGATGAAGAAGAGCAGCGACGTGGCGTTCACCCCGCTGCAGAACTCGGACAATTCGGGCTCTGTGCAAGGACTGGCTCCAGGCTTGCCGTCGGGGTCCGGAG
>NM_001195662
AAGCTCAGCCTTTGCTCAGATTCTCCTCTTGATGAAACAAAGGGATTTCTGCACATGCTTGAGAAATTGCAGGTCTCACCCAAAATGAGTGACACACCTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATTCAAATATCACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGGGTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGCTGGACAGTTTATCCAGGAAGGTACCCCTGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACACAGCATCACCAGGCTGGAGGAGCTAGAGGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTG
>NM_011283
AATAAATCCAAAGACATTTGTTTACGTGAAACAAGCAGGTTGCATATCCAGTGACGTTTATACAGACCACACAAACTATTTACTCTTTTCTTCGTAAGGAAAGGTTCAACTTCTGGTCTCACCCAAAATGAGTGACACACCTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATTCAAATATCACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGGGTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGCTGGACAGTTTATCCAGGAAGGTACCCCTGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACACAGCATCACCAGGCTGGAGGAGCTAGAGGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTGCCAGTTGACCTGGACAAGGCCCGCAGGCGCCCTCGGCCCTGGCTGAGTAGTCGCTCCATAAGCACGCATGTGCAGCTCTGTCCTGCAACTGCCAATATGTCCACCATGGCACCTGGCATGCTCCGTGCCCCAAGGAGGCTCGTGGTCTTCCGGAATGGTGACCCGAA
>NM_0112835
AATAAATCCAAAGACATTTGTTTACGTGAAACAAGCAGGTTGCATATCCAGTGACGTTTATACAGACCACACAAACTATTTACTCTTTTCTTCGTAAGGAAAGGTTCAACTTCTGGTCTCACCCAAAATGAGTGACACACCTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATTCAAATATCACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGGGTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGCTGGACAGTTTATCCAGGAAGGTACCCCTGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACACAGCATCACCAGGCTGGAGGAGCTAGAGGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTGCCAGTTGACCTGGACAAGGCCCGCAGGCGCCCTCGGCCCTGGCTGAGTAGTCGCTCCATAAGCACGCATGTGCAGCTCTGTCCTGCAACTGCCAATATGTCCACCATGGCACCTGGCATGCTCCGTGCCCCAAGGAGGCTCGTGGTCTTCCGGAATGGTGACCCGAA
%%writefile script/grepFasta.py
#作业7.1
aDict = {}

seqFile = "data/test2.fa"
nameFile = "data/fasta.name"

for line in open(seqFile):
    if line[0] == '>':
        key = line.split()[0][1:] #注意去掉开头的'>',只取序列的名字
        aDict[key] = []  #字典的值是一个列表
    else:
        aDict[key].append(line.strip())
#--------END reading--------------------
for line in open(nameFile):
    name = line.strip()
    print ">%s\n%s" % (name, ''.join(aDict[name]))
Overwriting grepFasta.py
%run script/grepFasta.py
>NM_001011874
gcggcggcgggcgagcgggcgctggagtaggagctggggagcggcgcggccggggaaggaagccagggcgaggcgaggaggtggcgggaggaggagacagcagggacaggTGTCAGATAAAGGAGTGCTCTCCTCCGCTGCCGAGGCATCATGGCCGCTAAGTCAGACGGGAGGCTGAAGATGAAGAAGAGCAGCGACGTGGCGTTCACCCCGCTGCAGAACTCGGACAATTCGGGCTCTGTGCAAGGACTGGCTCCAGGCTTGCCGTCGGGGTCCGGAG
>NM_011283
AATAAATCCAAAGACATTTGTTTACGTGAAACAAGCAGGTTGCATATCCAGTGACGTTTATACAGACCACACAAACTATTTACTCTTTTCTTCGTAAGGAAAGGTTCAACTTCTGGTCTCACCCAAAATGAGTGACACACCTTCTACTAGTTTCTCCATGATTCATCTGACTTCTGAAGGTCAAGTTCCTTCCCCTCGCCATTCAAATATCACTCATCCTGTAGTGGCTAAACGCATCAGTTTCTATAAGAGTGGAGACCCACAGTTTGGCGGCGTTCGGGTGGTGGTCAACCCTCGTTCCTTTAAGACTTTTGACGCTCTGCTGGACAGTTTATCCAGGAAGGTACCCCTGCCCTTTGGGGTAAGGAACATCAGCACGCCCCGTGGACGACACAGCATCACCAGGCTGGAGGAGCTAGAGGACGGCAAGTCTTATGTGTGCTCCCACAATAAGAAGGTGCTGCCAGTTGACCTGGACAAGGCCCGCAGGCGCCCTCGGCCCTGGCTGAGTAGTCGCTCCATAAGCACGCATGTGCAGCTCTGTCCTGCAACTGCCAATATGTCCACCATGGCACCTGGCATGCTCCGTGCCCCAAGGAGGCTCGTGGTCTTCCGGAATGGTGACCCGAA
%%writefile script/grepFastq.py
#作业7.2
seqFile = "data/test1.fq"
nameFile = "data/fastq.name"

aDict = {}
count = 0
for line in open(seqFile):
    count += 1
    if count % 4 == 1:
        name = line.strip()[1:] #去掉开头的@
        aDict[name] = [line] #存储名字行
    else:
        aDict[name].append(line)
#-------END reading----------这行知识注释,可有可无------
for line in open(nameFile):
    name = line.strip()
    print ''.join(aDict[name]),
Overwriting grepFastq.py
%run script/grepFastq
@HWI-ST1223:80:D1FMTACXX:2:1101:1243:2213 1:N:0:AGTCAA
TCTGTGTAGCCNTGGCTGTCCTGGAACTCACTTTGTAGACCAGGCTGGCATGCACCACCACNNNCGGCTCATTTGTCTTTNNTTTTTGTTTTGTTCTGTA
+
BCCFFFFFFHH#4AFHIJJJJJJJJJJJJJJJJJIJIJJJJJGHIJJJJJJJJJJJJJIIJ###--5ABECFFDDEEEEE##,5=@B8?CDD<AD>C:@>
@HWI-ST1223:80:D1FMTACXX:2:1101:1375:2060 1:N:0:AGTCAA
NTGCTGAGCCACGACAAGGATCCCAGAGGGCCNAGCCCTGCATCTTGTATGGACCAGTTACNCATCAAAAGAGACTACTGTAGGCACCATCAATCAGATC
+
#1:DDDD;?CFFHDFEEIGIIIIIIG;DHFGG#)0?BFBDHBFF<FCFEFD;@DD@A=7?E#,,,;=(>3;=;;C>ACCC@CCCCCBBBCCAACCCCCCC
@HWI-ST1223:80:D1FMTACXX:2:1101:1383:2091 1:N:0:AGTCAA
NGTTCGTGTGGAACCTGGCGCTAAACCATTCGTAGACGACCTGCTTCTGGGTCGGGGTTTCGTACGTAGCAGAGCAGCTCCCTCGCTGCGATCTATTGAA
+
#1=DDFDFHHHHHJGJJJJJJJJJJJJJJJIJIGDHIHIGIJJJJJJJIIIGHHFDD3>BDDBDDDDDDDDDDBDCCBDDDDDDDDDDDBBDDDDEEACD
@HWI-ST1223:80:D1FMTACXX:2:1101:1452:2138 1:N:0:AGTCAA
NTCTAGGAGGTCTAGAAAGCCCAGGCCACCGGTACAAACATCAAGGGTGTTACGGATGTGCCGCTCTGAACCTCCAGGACGACTTTGATTTCAACTACAA
+
#4=DFFEFHHHHHJJJJJIJJJJHIIJGJJJJ@GIIJJJJJJIJJJJFGHIIIJJHHHDFFFFDDDDDDDDDDDDCDDDDDDDDDDDCCCEDEDDDDDDD
%%writefile script/screenResult.py
#作业8
file = "data/test.expr"

head = 1 #This is used to indicate there is one head line needing to skip
for line in open(file):
    if head:
        head -= 1
        continue
    #-----Begin data lines------
    lineL = line.split()

    foldChange = float(lineL[9])
    adjP = float(lineL[12])
    if foldChange > 2 and adjP < 0.05:
        print lineL[0]
Overwriting screenResult.py
%run script/screenResult
Novel00011
Novel00043
Novel00047
Novel00077
Novel00079
Novel00080
Novel00084
Novel00085
Novel00086
Novel00087
Novel00090
Novel00124
Novel00148
Novel00156
Novel00162
Novel00166
%%writefile script/transferMultipleColumToMatrix.py
#作业9
file = "data/multipleColExpr.txt"
head = 1
tissueSet = set()
aDict = {}
for line in open(file):
    if head: #skip header line
        head -= 1
        continue
    #-------------------------------
    lineL = line.split()
    gene = lineL[0]
    tissue = lineL[1]
    expr = lineL[2]
    if gene not in aDict:
        aDict[gene] = {}
    assert tissue not in aDict[gene], "Duplicate tissues"
    aDict[gene][tissue] = expr
    tissueSet.add(tissue)
#---END reading----------------------------
tissueL = list(tissueSet)
tissueL.sort()
print "Gene\t%s" % '\t'.join(tissueL)

for gene, tissueD in aDict.items():
    exprL = [gene]
    for tissue in tissueL:
        exprL.append(tissueD[tissue])
    print '\t'.join(exprL)
Overwriting transferMultipleColumToMatrix.py
%run script/transferMultipleColumToMatrix
Gene    A-431    A-549    AN3-CA    BEWO    CACO-2
ENSG00000000460    25.2    14.2    10.6    24.4    14.2
ENSG00000000938    0.0    0.0    0.0    0.0    0.0
ENSG00000000457    2.8    3.4    3.8    5.8    2.9
ENSG00000000419    73.8    38.6    33.9    53.7    155.5
ENSG00000000003    21.3    32.5    38.2    31.4    63.9
ENSG00000000005    0.0    0.0    0.0    0.0    0.0
%%writefile script/reverseComplementary.py
#作业10
str1 = "ACGTACGTACGTCACGTCAGCTAGAC"

ATCG_dict = {'A':'T','T':'A','C':'G','G':'C', 'a':'t','t':'a','c':'g','g':'c'}

tmpL = []
for i in str1:
    tmpL.append(ATCG_dict[i])
tmpL.reverse()
print ''.join(tmpL)
Overwriting reverseComplementary.py
%run script/reverseComplementary
GTCTAGCTGACGTGACGTACGTACGT
%%writefile script/collapsemiRNAreads.py
#作业11
smRNA_file = "data/mir.collapse"

head = 1
sample = 'ESB'
lineno = 0
for line in open(smRNA_file):
    if head:
        head -= 1
        continue
    #----end skip header line---
    seq, value = line.split()
    lineno += 1
    print ">%s_%d_x%s\n%s" % (sample, lineno, value, seq)
Overwriting collapsemiRNAreads.py
%run script/collapsemiRNAreads
>ESB_1_x2
ACTGCCCTAAGTGCTCCTTCTGGC
>ESB_2_x2
ATAAGGTGCATCTAGTGCAGATA
>ESB_3_x1
TGAGGTAGTAGTTTGTGCTGTTT
>ESB_4_x1
TCCTACGAGTTGCATGGATTC
>ESB_5_x1
ACCGGGTGGAGCCGCCGCA
>ESB_6_x1
ACTGCCCTAAGTGCTCCTTCTGGT
>ESB_7_x1
TACAGGGCTGGGGATGG
>ESB_8_x1
CCCTGGATGCTGTAGGATG
>ESB_9_x1
AAAATGCTACTACTTTTGAGTC
>ESB_10_x1
TCCCTGGTGGTCTAGTGGCTAGGAT
>ESB_11_x4
CTCTTAGATCGATGTGGTGCTC
>ESB_12_x1
TTGGTGGTTCAGTGGTA
>ESB_13_x1
TCACAATTCCCCTATACCATGGGCCAT
>ESB_14_x1
CAAATGCTAGGGTTGGTGG
>ESB_15_x4
AGGCAGCTTCCACAGCA
>ESB_16_x3
GCACTGAGATGGAGTGGTGTAA
>ESB_17_x3
CTCCATTTGTTTGATGATGGA
>ESB_18_x1
GCACTGAGATGGAGTGGTGTAT
>ESB_19_x2
AGTGCCCAGAGTTTGTAGTGT
>ESB_20_x1
ATAAGGTGCATCTAGTGCTGTTAGA
>ESB_21_x9
AAAATGCTTCCACTTTGTGTG
>ESB_22_x2
GGTAATTCTAGAGCTAATACATGCCG
>ESB_23_x6
CCGGGCGGAAACACCA
>ESB_24_x1
TGAAAGGCTGACCCGCCGGGC
>ESB_25_x1
TTCGTACAGTGGTCGTAAGTTCGTGC
>ESB_26_x1
GGTTAGCACTCTGGACTC
%%writefile script/map.py
#作业12
ref = "data/ref.fa"
short = "data/short.fa"

refD = {}
for line in open(ref):
    if line[0] == '>':
        name = line[1:-1]
        assert name not in refD
        refD[name] = []
    else:
        refD[name].append(line.strip())
#---------END reading----------------
for key, valueL in refD.items():
    refD[key] = ''.join(valueL)
#------------------------------------
for line in open(short):
    if line[0] == '>':
        pass
    else:
        seq = line.strip()
        len_seq = len(seq)
        for key, value in refD.items():
            pos = value.find(seq)
            while pos != -1:
                print "%s\t%d\t%d\t%s" % (key, pos, pos+len_seq, seq)
                current = pos + 1
                newpos = value[current:].find(seq)
                if newpos == -1:
                    break
                pos = current + newpos
Overwriting map.py
%run script/map
chr1    199    208    TGGCGTTCA
chr1    207    216    ACCCCGCTG
chr2    63    70    AAATTGC
chr3    0    7    AATAAAT
chr3    66    74    CCACACAA
chr3    206    214    ATATCACT
chr2    163    171    ATATCACT
chr3    417    423    AGAGGA
chr2    374    380    AGAGGA
chrX    137    143    AGAGGA

原文发布于微信公众号 - 生信宝典(Bio_data)

原文发表时间:2017-05-18

本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。

发表于

我来说两句

0 条评论
登录 后参与评论

相关文章

来自专栏jeremy的技术点滴

py3_cookbook_notes_02

38612
来自专栏文武兼修ing——机器学习与IC设计

开放地址法散列开放地址法代码实现

开放地址法 开放地址法是另一种(相对于分离链接法)解决散列冲突的方法。适用于装填因子(散列表中元素个数和散列表长度比)较小(小于0.5)的散列表。 开放地址法中...

47112
来自专栏生信小驿站

数据处理第3部分:选择行的基本和高级的方法

原文地址:https://suzan.rbind.io/2018/02/dplyr-tutorial-3/ 作者:Suzan Baert 这是系列dplyr...

981
来自专栏北京马哥教育

Python 中被忽略的 else

1444
来自专栏DOTNET

.Net多线程编程—Parallel LINQ、线程池

Parallel LINQ 1 System.Linq.ParallelEnumerable 重要方法概览: 1)public static ParallelQ...

3017
来自专栏Petrichor的专栏

tensorflow: variable初始化

tf.global_variables_initializer() 将在其创建时查看全局图并自动将依赖关系添加到图中的每个 tf.initializer。

2702
来自专栏逍遥剑客的游戏开发

MPQ文件系统优化(续)

1775
来自专栏Java进阶架构师

一文读懂JDK7,8,JD9的hashmap,hashtable,concurrenthashmap及他们的区别

图中,紫色部分即代表哈希表,也称为哈希数组(默认数组大小是16,每对key-value键值对其实是存在map的内部类entry里的),数组的每个元素...

953
来自专栏数据之美

BloomFilter 简介及在 Hadoop reduce side join 中的应用

1、BloomFilter能解决什么问题?      以少量的内存空间判断一个元素是否属于这个集合, 代价是有一定的错误率 2、工作原理 ...

2517
来自专栏程序员宝库

移除注释的完善思路:真的可以用正则实现?

网上有很多自称能实现移除JS注释的正则表达式,实际上存在种种缺陷。这使人多少有些愕然,也不禁疑惑到:真的可以用正则实现吗?而本篇文章以使用正则移除JS注释为目标...

1363

扫码关注云+社区

领取腾讯云代金券