本章将继续Python语言的基础知识。到本章结尾,你将了解如何:
搜索DNA或蛋白质中的模体
用键盘与用户交互
将数据写入文件
使用循环
使用字符串内置函数find
根据条件测试的结果执行不同的操作
通过操作字符串和列表来详细检查序列数据
在本章中,你将学习编写一个在序列数据中查找模体的程序,并且提供编写生物信息学程序所需的技能。
1. 流程控制
流程控制是执行程序语句的顺序,一般程序按顺序从顶部第一个语句到底部最后一个语句执行,有两种方法可以改变程序执行顺序:条件语句和循环。条件语句只有条件成功时才执行一组语句,否则只是跳过一组语句。循环重复一组语句直到关联的测试失败。
1.1 条件语句
让我们回想open语句,如果你尝试打开的文件不存在,则会收到错误消息。因此,在尝试打开文件之前,你可以先判断文件是否存在。实际上,这些判断是计算机语言最强大的功能之一。if、if-else是python中存在的判断语句。
这些结构的特征是计算True/False值,如果条件为True,则执行其下面的语句;为假,则跳过语句(反之亦然)。
然而,什么是真?不同的编程语言可能会稍微不同。
本节演示了一些Python条件的示例。每个例子中的真假条件是两个数字之间的相等,数字的相等性由两个等号“==”表示,因为单个等号=已用于赋值给变量。以下示例演示条件测试是否计算为True/False,通常这些简单地测试没有太多用处,大多数条件测试是判断变量的值或函数调用返回的结果——你事先不知道的事情。
条件测试为True:
if 1 == 1: print("1 equals 1\n\n")
输出结果:
1 equals 1
测试条件是“1 == 1”,条件计算值为True,则执行与if语句关联的语句,打印出消息。
你也可以这样判断:
if 1: print("1 evaluates to true\n\n")
输出结果:
1 evaluates to true
条件测试为False:
if 1 == 0 : print("1 equals 0\n\n")
没有输出结果。测试条件是“1 == 0”,条件计算结果为False,因此不执行与if关联的语句,也不打印任何消息。
你也可以这样判断:
if 0: print("0 evaluates to true\n\n")
也没有输出结果,因为0的计算结果为False,因此不执行与if关联的语句,也不打印任何消息。
现在我们再看看计算结果为True的if-else语句:
if 1 == 1: print("1 equals 1\n\n")else: print("1 does not equal 1\n\n")
输出结果为:
1 equals 1
if-else在测试条件为True时执行一项操作,如果为False则执行另一项操作。 如下是计算结果为False的if-else语句:
if 1 == 0: print("1 equals 0\n\n")else: print("1 does not equal 0\n\n")
输出结果为:
1 does not equal 01.1.1 条件判断和缩进
关于条件判断有两点需要额外说明。首先,有几个运算符可以在条件判断部分使用,除了前面示例中的“==”之外,还有不等于“!=”、大于“>”、小于“
再看例子5-1中if-elif-else语句,条件判断首先是if,然后是elif,依次评估,一旦结果为True,就会执行其块,并忽略其余条件。如果没有条件的计算结果为True,则执行else块。
例子5-1 if-elif-else
#!/usr/bin/env python# if-elif-elseword = 'MNIDDKL'# if-elif-else conditionalsif word == 'QSTVSGE': print("QSTVSGE\n")elif word == 'MRQQDMISHDEL': print("MRQQDMISHDEL\n")elif word == 'MNIDDKL': print("MNIDDKL--the magic word!\n")else: print('Is \"%s\" a peptide? This program is not sure.\n' % word)exit()
注意else块print语句中的\",它允许你打印双引号"。反斜杠字符告诉Python将"视为符号本身而不是将其解释为字符串结尾的标记。
例子5-1输出结果:
MNIDDKL--the magic word!1.2 循环
循环允许你重复执行块内的语句块,在Python中有几种循环方式:while循环、for循环等等。例子5-2(来自第四章)显示了while循环以及从文件中读取蛋白质序列数据时如何使用它。
例子5-2 从文件中读取蛋白质序列数据, 4
#!/usr/bin/env pythonimport os# Reading protein sequence data from a file, take 4# The filename of the file containing the protein sequence dataproteinfilename = 'NM_021964fragment.pep'# First we have to "open" the file, and in case the# open fails, print an error message and exit the program.if os.path.exists(proteinfilename): PROTEINFILE = open(proteinfilename)else: print("Could not open file %s!\n" % proteinfilename) exit() # Read the protein sequence data from the file in a "while" loop,# printing each line as it is read.protein = PROTEINFILE .readline()while protein: print(" ###### Here is the next line of the file:\n") print(protein) protein = PROTEINFILE .readline()# Close the file.PROTEINFILE.close()exit()
例子5-2输出如下:
###### Here is the next line of the file:MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD ###### Here is the next line of the file:SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ ###### Here is the next line of the file:GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR
首先,程序读取文件的第一行赋值给变量protein,然后在while循环中,循环每次给变量protein赋值文件下一行的内容,并判断读取的行是否为空来判断条件是否为False,是否退出循环,跳过两个print语句块。条件为True,新行储存在变量protein中,并执行带有两个print语句块。
1.2.1 open函数和os模块
open函数调用时系统调用,因为要打开文件,Python必须从操作系统中请求该文件。操作系统可以是Unix/Linux、Microsoft Windows、Apple Macintosh等等,文件由操作系统管理,只能由它访问。检查系统调用的成功或失败是一个好习惯,特别是在打开文件时。如果系统调用失败,并且没有检查它,程序将继续读取或写入无法打开的文件。你应始终检查故障,并在无法打开文件时立即通知用户或退出程序。
总而言之,Python中学习条件和循环不难,而且条件和循环是编程语言最强大的功能之一。条件允许你为程序制定多个备选方案,并根据获得的输入类型做出决策。循环利用计算机的速度,通过几行代码就可以处理大量输入或不断迭代和优化计算。
2. 搜索模体
在生物信息学中最常见的事情之一是寻找特别感兴趣的模体、DNA或蛋白质短片段,它们可能是DNA的调节原件或短链蛋白质,在许多物种中都是保守的。(PROSITE网站提供了有关蛋白质模体的广泛信息。)
在生物序列中寻找的模体通常不是一个特定序列,它们可能具有多种变体,例如,存在碱基或氨基酸无关紧要的位置,可能有不同的长度。这种情况可以使用正则表达式,在第九章和本专辑其它地方你将看到更多例子。
Python有一套便于查找字符串的功能,例子5-3介绍了这种字符串搜索功能,类似的程序一直在生物学研究中使用。它执行以下操作:
从文件中读取蛋白质序列数据
将所有序列数据放入一个字符串中以便于搜索
查找用户输入的模体
例子5-3 寻找模体
#!/usr/bin/env pythonimport os# Searching for motifs# Ask the user for the filename of the file containing# the protein sequence data, and collect it from the keyboardprint "Please type the filename of the protein sequence data: ";proteinfilename = input()# open the file, or exitif os.path.exists(proteinfilename):
PROTEINFILE = open(proteinfilename)
else:
print("Could not open file %s!\n" % proteinfilename)
exit()# Read the protein sequence data from the file, and store it# into the array variable proteinsproteins = PROTEINFILE.readlines()# Close the file - we've read all the data into @protein now.PROTEINFILE.close()# Put the protein sequence data into a single string, as it's easier# to search for a motif in a string than in an array of# lines (what if the motif occurs over a line break?)protein = ''.join(proteins)# Remove whitespaceprotein = protein.replace('\n', '')# In a loop, ask the user for a motif, search for the motif,# and report if it was found.# Exit if no motif is entered.while True: print("Enter a motif to search for: ") motif = input() # exit on an empty user input if not motif: break # Look for the motif if protein.find(motif) != -1: print("I found it!\n\n") else: print("I couldn\'t find it.\n\n")# exit the programexit()
例子5-3输出结果:
Please type the filename of the protein sequence data: NM_021964fragment.pepEnter a motif to search for: SVLQI found it!Enter a motif to search for: jklI couldn't find it.Enter a motif to search for: QDSVI found it!Enter a motif to search for: HERLPQGLQI found it!Enter a motif to search for: I couldn't find it.
从输出中可以看出,该程序找到用户输入的模体序列。美中不足的是如果这个程序不仅报告它找到模体序列,而且在什么位置,那就太好了。你将在第九章中看到如何实现这一点,该章中练习要求你修改此程序以便报告模体的位置。
以下内容将检查讨论例子5-3中的内容;
从键盘获取用户输入
将文件的行连接成单个字符串
find函数和strip函数
not
2.1 从键盘获取用户输入
Python使用内置函数来获取用户在键盘上键入的输入。在例子5-3中,一个名为input的函数接受用户输入数据,返回为 string 类型。当用户键入文件名并通过Enter键发送输入时,文件名会保存到变量proteinfilename。
2.2 使用join函数将列表转成字符串
通常蛋白质序列数据是分成80个残基的短片段,因为当数据打印在纸上或显示在屏幕上时,需要将其分解为合适的行个数。但是,如果你正在搜索的模体序列在文件中由换行符分分割了,程序就找不到这个模体序列。在例子5-3中,搜索的一些模体序列是由换行符分开的,在Python中,你可以使用join函数将列表中所有数据组合成一个储存在新变量protein中的单个字符串。
protein = ''.join(proteins)
你可以指定列表元素连接时使用的字符串,本示例中,指定要放置在输入文件行之间为空字符串,空字符串用一对单引号''(或者双引号“”)表示。
回想例子4-2中,我们介绍了两个DNA片段的连接方法,与join函数的使用非常相似。例子4-2中的以下语句,它是连接两个字符串的方法之一:
DNA3 = DNA1 + DNA2
完全相同的另一种连接方法是使用join函数:
DNA3 = ''.join([DNA1, DNA2])
上面的方法中,指定了一个字符串元素列表,而不是一个列表的名称:
[DNA1, DNA2]2.3 python中实现do-until类似循环
例子5-3中,使用while和if判断实现了类似do-until功能的循环。它首先执行一个块,然后进行if判断,如果测试成功,就用break跳出循环。例子5-3首先打印用户提示信息,获取用户输入,调用find函数搜索模体并报告结果(是否为-1,-1意味着没有找到)。在执行查找操作之前,用if语句判断用户是否输入的是空行,空行意味着用户用户没有更多的模体序列要查找,退出循环。
2.4 字符串函数find和索引
Python可以轻松操作各种字符串,例如DNA和蛋白质序列数据。字符串内含函数find用来搜索子字符串(模体序列)是否出现在字符串(蛋白质序列)中,如果找到则返回子字符串的起始索引,否则返回-1。
3. 统计核苷酸个数
关于给定的DNA序列,你可能想知道是编码还是非编码?是否包含调节因子?与其它DNA序列是否有关?DNA序列中四种核苷酸的个数?事实上,在某些物种中,编码区具有特定的核苷酸偏差,因此最后一个问题对于寻找基因非常重要。此外,不同物种具有不同的核算比例,因此计算核苷酸比例是十分有用的。
在下面的两个程序,例子5-4和5-6中,它们计算了DNA中每种核苷酸的含量。使用Python部分新功能:
将字符串转换成列表
迭代列表
要获得给定DNA中每种核苷酸的个数,你必须遍历每个碱基,看看是什么碱基,然后统计每个核苷酸的个数,这里使用for循环来统计。
首先,来看看伪代码,之后再写跟详细地伪代码来编写Python程序。
以下伪代码描述了所需的内容:
for each base in the DNA if base is A count_of_A = count_of_A + 1 if base is C count_of_C = count_of_C + 1 if base is G count_of_G = count_of_G + 1 if base is T count_of_T = count_of_T + 1doneprint count_of_A, count_of_C, count_of_G, count_of_T
如你所见,思路非常简单,现在让我们来看看如何用Python编写该程序。
4. 将字符串转换成列表
首先将字符串转换成列表,指将DNA字符串中的每个碱基是分开的,并且每个字母在列表中称为单独的元素。然后你可以逐个查看列表元素(每个元素都是一个单字母),遍历列表统计每个核苷酸的数量。这与2.2节中的join函数将字符串列表中的元素连接成一个字符串的功能相反。
下面详细版的伪代码中添加了从文件中获取DNA并操作该文件数据的指令。首先,连接从原始文件中读取的行列表得到字符串序列,替换字符串序列中的换行符和空格,然后将字符串序列转成单个字母的列表。
read in the DNA from a filejoin the lines of the file into a single string $DNA# make an array out of the bases of $DNA@DNA = explode $DNA# initialize the countscount_of_A = 0count_of_C = 0count_of_G = 0count_of_T = 0for each base in @DNA if base is A count_of_A = count_of_A + 1 if base is C count_of_C = count_of_C + 1 if base is G count_of_G = count_of_G + 1 if base is T count_of_T = count_of_T + 1doneprint count_of_A, count_of_C, count_of_G, count_of_T
上述伪代码详细地介绍了一种方法,通过将DNA序列转换成单个字母元素的裂变来查看每个碱基,还将每个核苷酸的计数初始化为0,例子5-4是实际可行的程序。
例子5-4 计算核苷酸的比例
#!/usr/bin/env python
import os# Determining frequency of nucleotides# Get the name of the file with the DNA sequence dataprint("Please type the filename of the DNA sequence data: ")dna_filename = input()# open the file, or exitif os.path.exists(dna_filename):
DNAFILE = open(dna_filename)
else:
print("Could not open file %s!\n" % dna_filename)
exit()
# Read the DNA sequence data from the file, and store it# into the array variable DNAsDNAs = DNAFILE.readlines()# Close the fileDNAFILE.close()# From the lines of the DNA file,# put the DNA sequence data into a single string.DNA = ''.join(DNAs)# Remove whitespaceDNA = DNA.replace('\n', '')# Now explode the DNA into an array where each letter of the# original string is now an element in the array.# This will make it easy to look at each position.# Notice that we're reusing the variable DNA for this purpose.DNA = list(DNA)# Initialize the counts.# Notice that we can use scalar variables to hold numbers.count_of_A = 0count_of_C = 0count_of_G = 0count_of_T = 0errors = 0# In a loop, look at each base in turn, determine which of the# four types of nucleotides it is, and increment the# appropriate count.for base in DNA: if base == 'A': ++count_of_A elif base == 'C': ++count_of_C elif base == 'G': ++count_of_G elif base == 'T': ++count_of_T else: print("!!!!!!!! Error - I don\'t recognize this base: %s\n" % base) ++errors# print the resultsprint("A = %s\n" % count_of_A)print("C = %s\n" % count_of_C)print("G = %s\n" % count_of_G)print("T = %s\n" % count_of_T)print("errors = %s\n" % errors)# exit the programexit()
为了演示例子5-4,我创建了以下DNA文件并命名为small.dna:
AAAAAAAAAAAAAAGGGGGGGTTTTCCCCCCCCCCCCCGTCGTAGTAAAGTATGCAGTAGCVGCCCCCCCCCCGGGGGGGGAAAAAAAAAAAAAAATTTTTTATAAACG
注意文件中有一个字母V,如下是例子5-4输出:
Please type the filename of the DNA sequence data: small.dna!!!!!!!! Error - I don't recognize this base: VA = 40C = 27G = 24T = 17
在这个程序中,我们使用了list这个新的函数,用法如下:
DNA = list(DNA)
这个函数将DNA字符串分割成了单个字母组成的列表。在交互式环境输入help(list)或使用文档查看list函数使用方法,list函数接受一个可迭代的对象,将其转换为列表。
在Python中,字符串就是一个可迭代对象,并且可迭代对象是可以使用for循环。因此,上述程序可去掉“DNA = list(DNA)”,写成如下方式:
#!/usr/bin/env pythonimport os# Determining frequency of nucleotides# Get the name of the file with the DNA sequence dataprint("Please type the filename of the DNA sequence data: ")dna_filename = input()# open the file, or exitif os.path.exists(dna_filename): DNAFILE = open(dna_filename)else: print("Could not open file %s!\n" % dna_filename) exit()# Read the DNA sequence data from the file, and store it# into the array variable DNAsDNAs = DNAFILE.readlines()# Close the fileDNAFILE.close()# From the lines of the DNA file,# put the DNA sequence data into a single string.DNA = ''.join(DNAs)# Remove whitespaceDNA = DNA.replace('\n', '')# Initialize the counts.# Notice that we can use scalar variables to hold numbers.count_of_A = 0count_of_C = 0count_of_G = 0count_of_T = 0errors = 0# In a loop, look at each base in turn, determine which of the# four types of nucleotides it is, and increment the# appropriate count.for base in DNA: if base == 'A': ++count_of_A elif base == 'C': ++count_of_C elif base == 'G': ++count_of_G elif base == 'T': ++count_of_T else: print("!!!!!!!! Error - I don\'t recognize this base: %s\n" % base) ++errors# print the resultsprint("A = %s\n" % count_of_A)print("C = %s\n" % count_of_C)print("G = %s\n" % count_of_G)print("T = %s\n" % count_of_T)print("errors = %s\n" % errors)# exit the programexit()
接着,有五个变量初始化为0,python中变量没有类型的分别,如果没有初始化变量,使用该变量时程序会报错终止。
5. 使用索引
如下是使用索引方式的伪代码:
read in the DNA from a filejoin the lines of the file into a single string of $DNA# initialize the countscount_of_A = 0count_of_C = 0count_of_G = 0count_of_T = 0for each base at each position in $DNA if base is A count_of_A = count_of_A + 1 if base is C count_of_C = count_of_C + 1 if base is G count_of_G = count_of_G + 1 if base is T count_of_T = count_of_T + 1doneprint count_of_A, count_of_C, count_of_G, count_of_T例子5-5 计算核苷酸评率 2
#!/usr/bin/env pythonimport os# Determining frequency of nucleotides# Get the name of the file with the DNA sequence dataprint("Please type the filename of the DNA sequence data: ")dna_filename = input()# open the file, or exitif os.path.exists(dna_filename): DNAFILE = open(dna_filename)else: print("Could not open file %s!\n" % dna_filename) exit()# Read the DNA sequence data from the file, and store it# into the array variable DNAsDNAs = DNAFILE.readlines()# Close the fileDNAFILE.close()# From the lines of the DNA file,# put the DNA sequence data into a single string.DNA = ''.join(DNAs)# Remove whitespaceDNA = DNA.replace('\n', '')# Initialize the counts.# Notice that we can use scalar variables to hold numbers.count_of_A = 0count_of_C = 0count_of_G = 0count_of_T = 0errors = 0# In a loop, look at each base in turn, determine which of the# four types of nucleotides it is, and increment the# appropriate count.for position in range(len(DNAs)): base = DNAs[position] if base == 'A': ++count_of_A elif base == 'C': ++count_of_C elif base == 'G': ++count_of_G elif base == 'T': ++count_of_T else: print("!!!!!!!! Error - I don\'t recognize this base: %s\n" % base) ++errors# print the resultsprint("A = %s\n" % count_of_A)print("C = %s\n" % count_of_C)print("G = %s\n" % count_of_G)print("T = %s\n" % count_of_T)print("errors = %s\n" % errors)# exit the programexit()
例子5-5输出如下:
Please type the filename of the DNA sequence data: small.dna!!!!!!!! Error - I don't recognize this vase: VA = 40C = 27G = 24T = 17errors = 1
上述for循环等价下面的while循环。
#!/usr/bin/env pythonimport os# Determining frequency of nucleotides# Get the name of the file with the DNA sequence dataprint("Please type the filename of the DNA sequence data: ")dna_filename = input()# open the file, or exitif os.path.exists(dna_filename): DNAFILE = open(dna_filename)else: print("Could not open file %s!\n" % dna_filename) exit()# Read the DNA sequence data from the file, and store it# into the array variable DNAsDNAs = DNAFILE.readlines()# Close the fileDNAFILE.close()# From the lines of the DNA file,# put the DNA sequence data into a single string.DNA = ''.join(DNAs)# Remove whitespaceDNA = DNA.replace('\n', '')# Initialize the counts.# Notice that we can use scalar variables to hold numbers.count_of_A = 0count_of_C = 0count_of_G = 0count_of_T = 0errors = 0# In a loop, look at each base in turn, determine which of the# four types of nucleotides it is, and increment the# appropriate count.position = 0while position
比较这两个循环,不难发现for循环只是while循环的简写,循环判断条件是position是否小于字符串DNA的长度,这里使用了字符串索引方式。默认情况下,Python假定字符串从0索引开始,其最后一个字符编号为字符串长度减一。
在上述程序中,我们还使用了python内置函数range,这是一个生成器函数(生成迭代器或列表),例如,range(5)等效生成[0, 1, 2, 3, 4]五个元素的列表。
6. 输出到文件
例子5-6介绍了另一种计算DNA字符串中核苷酸的方法,它使用了python字符串内置函数count。关于Python学习,你会体会到可能有一个相对简洁的方法来实现一个功能。
例子5-6 计算核苷酸的比例 3
#!/usr/bin/env pythonimport os# Determining frequency of nucleotides# Get the name of the file with the DNA sequence dataprint("Please type the filename of the DNA sequence data: ")dna_filename = input()# open the file, or exitif os.path.exists(dna_filename): DNAFILE = open(dna_filename)else: print("Could not open file %s!\n" % dna_filename) exit()# Read the DNA sequence data from the file, and store it# into the array variable DNAsDNAs = DNAFILE.readlines()# Close the fileDNAFILE.close()# From the lines of the DNA file,# put the DNA sequence data into a single string.DNA = ''.join(DNAs)# Remove whitespaceDNA = DNA.replace('\n', '')# In a loop, look at each base in turn, determine which of the# four types of nucleotides it is, and increment the# appropriate count.count_of_A = DNA.count('A') + DNA.count('a')count_of_C = DNA.count('C') + DNA.count('c')count_of_G = DNA.count('G') + DNA.count('g')count_of_T = DNA.count('T') + DNA.count('t')errors = len(DNA) - count_of_A -count_of_C - count_of_G - count_of_T# print the resultsprint("A=%d C=%d G=%d T=%d errors=%d\n" % (count_of_A, count_of_C, count_of_G, count_of_T, errors))# Also write the results to a file called "countbase"outputfile = "countbase"COUNTBASE = open(outputfile, 'w')COUNTBASE.write("A=%d C=%d G=%d T=%d errors=%d\n" % (count_of_A, count_of_C, count_of_G, count_of_T, errors))COUNTBASE.close()
# exit the programexit()
例子5-6输出如下:
Please type the filename of the DNA sequence data: small.dnaA=40 C=27 G=24 T=17 errors=1
例子5-7输出文件countbase内容如下:
A=40 C=27 G=24 T=17 errors=17.练习
5.1 编写一个无限循环的程序,循环每次判断条件为真。
5.2 用户输入两(短)DNA串,使用“+”运算符将第二个字符串连接到第一个字符串末尾。将连接的字符串打印,然后在连接的位置开始打印第二个字符串。例如,输入“AAAA”和“TTTT”,则打印:
AAAATTTT
TTTT
5.3 编写一个程序,打印从1到100的所有数字。
5.4 编写一个程序来获取DNA链的反向互补链。
5.5 编写一个程序来报告蛋白质序列中疏水性氨基酸的百分比。((要查找哪些氨基酸是疏水性的,请参阅有关蛋白质,分子生物学或细胞生物学的任何介绍性文章。)
5.6 编写一个程序,检查作为参数输入的两个字符串是否彼此反向互补。
5.7 编写一个程序来报告DNA序列GC的比例。
5.8 编写一个程序,可以替换DNA中指定位置的碱基。
参考资料
Begining Perl for Bioinformatics
领取专属 10元无门槛券
私享最新 技术干货