文章/答案/技术大牛

发布

第五章模体和循环

文章来源：企鹅号 - 生物信息学起步

本章将继续Python语言的基础知识。到本章结尾，你将了解如何：

搜索DNA或蛋白质中的模体

用键盘与用户交互

将数据写入文件

使用循环

使用字符串内置函数find

根据条件测试的结果执行不同的操作

通过操作字符串和列表来详细检查序列数据

在本章中，你将学习编写一个在序列数据中查找模体的程序，并且提供编写生物信息学程序所需的技能。

1. 流程控制

流程控制是执行程序语句的顺序，一般程序按顺序从顶部第一个语句到底部最后一个语句执行，有两种方法可以改变程序执行顺序：条件语句和循环。条件语句只有条件成功时才执行一组语句，否则只是跳过一组语句。循环重复一组语句直到关联的测试失败。

　　1.1 条件语句

让我们回想open语句，如果你尝试打开的文件不存在，则会收到错误消息。因此，在尝试打开文件之前，你可以先判断文件是否存在。实际上，这些判断是计算机语言最强大的功能之一。if、if-else是python中存在的判断语句。

这些结构的特征是计算True/False值，如果条件为True，则执行其下面的语句；为假，则跳过语句（反之亦然）。

然而，什么是真？不同的编程语言可能会稍微不同。

本节演示了一些Python条件的示例。每个例子中的真假条件是两个数字之间的相等，数字的相等性由两个等号“==”表示，因为单个等号=已用于赋值给变量。以下示例演示条件测试是否计算为True/False，通常这些简单地测试没有太多用处，大多数条件测试是判断变量的值或函数调用返回的结果——你事先不知道的事情。

条件测试为True：

if 1 == 1：　　print（"1 equals 1\n\n"）

输出结果：

1 equals 1

测试条件是“1 == 1”，条件计算值为True，则执行与if语句关联的语句，打印出消息。

你也可以这样判断：

if 1: print("1 evaluates to true\n\n")

输出结果：

1 evaluates to true

条件测试为False：

if 1 == 0 : print("1 equals 0\n\n")

没有输出结果。测试条件是“1 == 0”，条件计算结果为False，因此不执行与if关联的语句，也不打印任何消息。

你也可以这样判断：

if 0: print("0 evaluates to true\n\n")

也没有输出结果，因为0的计算结果为False，因此不执行与if关联的语句，也不打印任何消息。

现在我们再看看计算结果为True的if-else语句：

if 1 == 1: print("1 equals 1\n\n")else: print("1 does not equal 1\n\n")

输出结果为：

1 equals 1

if-else在测试条件为True时执行一项操作，如果为False则执行另一项操作。如下是计算结果为False的if-else语句：

if 1 == 0: print("1 equals 0\n\n")else: print("1 does not equal 0\n\n")

输出结果为：

1 does not equal 01.1.1 条件判断和缩进

关于条件判断有两点需要额外说明。首先，有几个运算符可以在条件判断部分使用，除了前面示例中的“==”之外，还有不等于“!=”、大于“>”、小于“

再看例子5-1中if-elif-else语句，条件判断首先是if，然后是elif，依次评估，一旦结果为True，就会执行其块，并忽略其余条件。如果没有条件的计算结果为True，则执行else块。

例子5-1 if-elif-else

#!/usr/bin/env python# if-elif-elseword = 'MNIDDKL'# if-elif-else conditionalsif word == 'QSTVSGE': print("QSTVSGE\n")elif word == 'MRQQDMISHDEL': print("MRQQDMISHDEL\n")elif word == 'MNIDDKL': print("MNIDDKL--the magic word!\n")else: print('Is \"%s\" a peptide? This program is not sure.\n' % word)exit()

注意else块print语句中的\"，它允许你打印双引号"。反斜杠字符告诉Python将"视为符号本身而不是将其解释为字符串结尾的标记。

例子5-1输出结果：

MNIDDKL--the magic word!1.2 循环

循环允许你重复执行块内的语句块，在Python中有几种循环方式：while循环、for循环等等。例子5-2（来自第四章）显示了while循环以及从文件中读取蛋白质序列数据时如何使用它。

例子5-2 从文件中读取蛋白质序列数据， 4

#!/usr/bin/env pythonimport os# Reading protein sequence data from a file, take 4# The filename of the file containing the protein sequence dataproteinfilename = 'NM_021964fragment.pep'# First we have to "open" the file, and in case the# open fails, print an error message and exit the program.if os.path.exists(proteinfilename): PROTEINFILE = open(proteinfilename)else: print("Could not open file %s!\n" % proteinfilename) exit() # Read the protein sequence data from the file in a "while" loop,# printing each line as it is read.protein = PROTEINFILE .readline()while protein: print(" ###### Here is the next line of the file:\n") print(protein) protein = PROTEINFILE .readline()# Close the file.PROTEINFILE.close()exit()

例子5-2输出如下：

###### Here is the next line of the file:MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD ###### Here is the next line of the file:SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ ###### Here is the next line of the file:GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR

首先，程序读取文件的第一行赋值给变量protein，然后在while循环中，循环每次给变量protein赋值文件下一行的内容，并判断读取的行是否为空来判断条件是否为False，是否退出循环，跳过两个print语句块。条件为True，新行储存在变量protein中，并执行带有两个print语句块。

1.2.1 open函数和os模块

open函数调用时系统调用，因为要打开文件，Python必须从操作系统中请求该文件。操作系统可以是Unix/Linux、Microsoft Windows、Apple Macintosh等等，文件由操作系统管理，只能由它访问。检查系统调用的成功或失败是一个好习惯，特别是在打开文件时。如果系统调用失败，并且没有检查它，程序将继续读取或写入无法打开的文件。你应始终检查故障，并在无法打开文件时立即通知用户或退出程序。

总而言之，Python中学习条件和循环不难，而且条件和循环是编程语言最强大的功能之一。条件允许你为程序制定多个备选方案，并根据获得的输入类型做出决策。循环利用计算机的速度，通过几行代码就可以处理大量输入或不断迭代和优化计算。

2. 搜索模体

在生物信息学中最常见的事情之一是寻找特别感兴趣的模体、DNA或蛋白质短片段，它们可能是DNA的调节原件或短链蛋白质，在许多物种中都是保守的。（PROSITE网站提供了有关蛋白质模体的广泛信息。）

在生物序列中寻找的模体通常不是一个特定序列，它们可能具有多种变体，例如，存在碱基或氨基酸无关紧要的位置，可能有不同的长度。这种情况可以使用正则表达式，在第九章和本专辑其它地方你将看到更多例子。

Python有一套便于查找字符串的功能，例子5-3介绍了这种字符串搜索功能，类似的程序一直在生物学研究中使用。它执行以下操作：

从文件中读取蛋白质序列数据

将所有序列数据放入一个字符串中以便于搜索

查找用户输入的模体

例子5-3 寻找模体

#!/usr/bin/env pythonimport os# Searching for motifs# Ask the user for the filename of the file containing# the protein sequence data, and collect it from the keyboardprint "Please type the filename of the protein sequence data: ";proteinfilename = input()# open the file, or exitif os.path.exists(proteinfilename):

PROTEINFILE = open(proteinfilename)

else:

print("Could not open file %s!\n" % proteinfilename)

exit()# Read the protein sequence data from the file, and store it# into the array variable proteinsproteins = PROTEINFILE.readlines()# Close the file - we've read all the data into @protein now.PROTEINFILE.close()# Put the protein sequence data into a single string, as it's easier# to search for a motif in a string than in an array of# lines (what if the motif occurs over a line break?)protein = ''.join(proteins)# Remove whitespaceprotein = protein.replace('\n', '')# In a loop, ask the user for a motif, search for the motif,# and report if it was found.# Exit if no motif is entered.while True: print("Enter a motif to search for: ") motif = input() # exit on an empty user input if not motif: break # Look for the motif if protein.find(motif) != -1: print("I found it!\n\n") else: print("I couldn\'t find it.\n\n")# exit the programexit()

例子5-3输出结果：

Please type the filename of the protein sequence data: NM_021964fragment.pepEnter a motif to search for: SVLQI found it!Enter a motif to search for: jklI couldn't find it.Enter a motif to search for: QDSVI found it!Enter a motif to search for: HERLPQGLQI found it!Enter a motif to search for: I couldn't find it.

从输出中可以看出，该程序找到用户输入的模体序列。美中不足的是如果这个程序不仅报告它找到模体序列，而且在什么位置，那就太好了。你将在第九章中看到如何实现这一点，该章中练习要求你修改此程序以便报告模体的位置。

以下内容将检查讨论例子5-3中的内容;

从键盘获取用户输入

将文件的行连接成单个字符串

find函数和strip函数

not

　　2.1 从键盘获取用户输入

Python使用内置函数来获取用户在键盘上键入的输入。在例子5-3中，一个名为input的函数接受用户输入数据，返回为 string 类型。当用户键入文件名并通过Enter键发送输入时，文件名会保存到变量proteinfilename。

　　2.2 使用join函数将列表转成字符串

通常蛋白质序列数据是分成80个残基的短片段，因为当数据打印在纸上或显示在屏幕上时，需要将其分解为合适的行个数。但是，如果你正在搜索的模体序列在文件中由换行符分分割了，程序就找不到这个模体序列。在例子5-3中，搜索的一些模体序列是由换行符分开的，在Python中，你可以使用join函数将列表中所有数据组合成一个储存在新变量protein中的单个字符串。

protein = ''.join(proteins)

你可以指定列表元素连接时使用的字符串，本示例中，指定要放置在输入文件行之间为空字符串，空字符串用一对单引号''（或者双引号“”）表示。

回想例子4-2中，我们介绍了两个DNA片段的连接方法，与join函数的使用非常相似。例子4-2中的以下语句，它是连接两个字符串的方法之一：

DNA3 = DNA1 + DNA2

完全相同的另一种连接方法是使用join函数：

DNA3 = ''.join([DNA1, DNA2])

上面的方法中，指定了一个字符串元素列表，而不是一个列表的名称：

[DNA1, DNA2]2.3 python中实现do-until类似循环

例子5-3中，使用while和if判断实现了类似do-until功能的循环。它首先执行一个块，然后进行if判断，如果测试成功，就用break跳出循环。例子5-3首先打印用户提示信息，获取用户输入，调用find函数搜索模体并报告结果（是否为-1，-1意味着没有找到）。在执行查找操作之前，用if语句判断用户是否输入的是空行，空行意味着用户用户没有更多的模体序列要查找，退出循环。

　　2.4 字符串函数find和索引

Python可以轻松操作各种字符串，例如DNA和蛋白质序列数据。字符串内含函数find用来搜索子字符串（模体序列）是否出现在字符串（蛋白质序列）中，如果找到则返回子字符串的起始索引，否则返回-1。

3. 统计核苷酸个数

关于给定的DNA序列，你可能想知道是编码还是非编码？是否包含调节因子？与其它DNA序列是否有关？DNA序列中四种核苷酸的个数？事实上，在某些物种中，编码区具有特定的核苷酸偏差，因此最后一个问题对于寻找基因非常重要。此外，不同物种具有不同的核算比例，因此计算核苷酸比例是十分有用的。

在下面的两个程序，例子5-4和5-6中，它们计算了DNA中每种核苷酸的含量。使用Python部分新功能：

将字符串转换成列表

迭代列表

要获得给定DNA中每种核苷酸的个数，你必须遍历每个碱基，看看是什么碱基，然后统计每个核苷酸的个数，这里使用for循环来统计。

首先，来看看伪代码，之后再写跟详细地伪代码来编写Python程序。

以下伪代码描述了所需的内容：

for each base in the DNA if base is A count_of_A = count_of_A + 1 if base is C count_of_C = count_of_C + 1 if base is G count_of_G = count_of_G + 1 if base is T count_of_T = count_of_T + 1doneprint count_of_A, count_of_C, count_of_G, count_of_T

如你所见，思路非常简单，现在让我们来看看如何用Python编写该程序。

4. 将字符串转换成列表

首先将字符串转换成列表，指将DNA字符串中的每个碱基是分开的，并且每个字母在列表中称为单独的元素。然后你可以逐个查看列表元素（每个元素都是一个单字母），遍历列表统计每个核苷酸的数量。这与2.2节中的join函数将字符串列表中的元素连接成一个字符串的功能相反。

下面详细版的伪代码中添加了从文件中获取DNA并操作该文件数据的指令。首先，连接从原始文件中读取的行列表得到字符串序列，替换字符串序列中的换行符和空格，然后将字符串序列转成单个字母的列表。

read in the DNA from a filejoin the lines of the file into a single string $DNA# make an array out of the bases of $DNA@DNA = explode $DNA# initialize the countscount_of_A = 0count_of_C = 0count_of_G = 0count_of_T = 0for each base in @DNA if base is A count_of_A = count_of_A + 1 if base is C count_of_C = count_of_C + 1 if base is G count_of_G = count_of_G + 1 if base is T count_of_T = count_of_T + 1doneprint count_of_A, count_of_C, count_of_G, count_of_T

上述伪代码详细地介绍了一种方法，通过将DNA序列转换成单个字母元素的裂变来查看每个碱基，还将每个核苷酸的计数初始化为0，例子5-4是实际可行的程序。

例子5-4 计算核苷酸的比例

#!/usr/bin/env python

import os# Determining frequency of nucleotides# Get the name of the file with the DNA sequence dataprint("Please type the filename of the DNA sequence data: ")dna_filename = input()# open the file, or exitif os.path.exists(dna_filename):

DNAFILE = open(dna_filename)

else:

print("Could not open file %s!\n" % dna_filename)

exit()

# Read the DNA sequence data from the file, and store it# into the array variable DNAsDNAs = DNAFILE.readlines()# Close the fileDNAFILE.close()# From the lines of the DNA file,# put the DNA sequence data into a single string.DNA = ''.join(DNAs)# Remove whitespaceDNA = DNA.replace('\n', '')# Now explode the DNA into an array where each letter of the# original string is now an element in the array.# This will make it easy to look at each position.# Notice that we're reusing the variable DNA for this purpose.DNA = list(DNA)# Initialize the counts.# Notice that we can use scalar variables to hold numbers.count_of_A = 0count_of_C = 0count_of_G = 0count_of_T = 0errors = 0# In a loop, look at each base in turn, determine which of the# four types of nucleotides it is, and increment the# appropriate count.for base in DNA: if base == 'A': ++count_of_A elif base == 'C': ++count_of_C elif base == 'G': ++count_of_G elif base == 'T': ++count_of_T else: print("!!!!!!!! Error - I don\'t recognize this base: %s\n" % base) ++errors# print the resultsprint("A = %s\n" % count_of_A)print("C = %s\n" % count_of_C)print("G = %s\n" % count_of_G)print("T = %s\n" % count_of_T)print("errors = %s\n" % errors)# exit the programexit()

为了演示例子5-4，我创建了以下DNA文件并命名为small.dna：

AAAAAAAAAAAAAAGGGGGGGTTTTCCCCCCCCCCCCCGTCGTAGTAAAGTATGCAGTAGCVGCCCCCCCCCCGGGGGGGGAAAAAAAAAAAAAAATTTTTTATAAACG

注意文件中有一个字母V，如下是例子5-4输出：

Please type the filename of the DNA sequence data: small.dna!!!!!!!! Error - I don't recognize this base: VA = 40C = 27G = 24T = 17

在这个程序中，我们使用了list这个新的函数，用法如下:

DNA = list(DNA)

这个函数将DNA字符串分割成了单个字母组成的列表。在交互式环境输入help（list）或使用文档查看list函数使用方法，list函数接受一个可迭代的对象，将其转换为列表。

在Python中，字符串就是一个可迭代对象，并且可迭代对象是可以使用for循环。因此，上述程序可去掉“DNA = list(DNA)”，写成如下方式：

#!/usr/bin/env pythonimport os# Determining frequency of nucleotides# Get the name of the file with the DNA sequence dataprint("Please type the filename of the DNA sequence data: ")dna_filename = input()# open the file, or exitif os.path.exists(dna_filename):　　DNAFILE = open(dna_filename)else:　　print("Could not open file %s!\n" % dna_filename)　　exit()# Read the DNA sequence data from the file, and store it# into the array variable DNAsDNAs = DNAFILE.readlines()# Close the fileDNAFILE.close()# From the lines of the DNA file,# put the DNA sequence data into a single string.DNA = ''.join(DNAs)# Remove whitespaceDNA = DNA.replace('\n', '')# Initialize the counts.# Notice that we can use scalar variables to hold numbers.count_of_A = 0count_of_C = 0count_of_G = 0count_of_T = 0errors = 0# In a loop, look at each base in turn, determine which of the# four types of nucleotides it is, and increment the# appropriate count.for base in DNA: if base == 'A': ++count_of_A elif base == 'C': ++count_of_C elif base == 'G': ++count_of_G elif base == 'T': ++count_of_T else: print("!!!!!!!! Error - I don\'t recognize this base: %s\n" % base) ++errors# print the resultsprint("A = %s\n" % count_of_A)print("C = %s\n" % count_of_C)print("G = %s\n" % count_of_G)print("T = %s\n" % count_of_T)print("errors = %s\n" % errors)# exit the programexit()

接着，有五个变量初始化为0，python中变量没有类型的分别，如果没有初始化变量，使用该变量时程序会报错终止。

5. 使用索引

如下是使用索引方式的伪代码：

read in the DNA from a filejoin the lines of the file into a single string of $DNA# initialize the countscount_of_A = 0count_of_C = 0count_of_G = 0count_of_T = 0for each base at each position in $DNA if base is A count_of_A = count_of_A + 1 if base is C count_of_C = count_of_C + 1 if base is G count_of_G = count_of_G + 1 if base is T count_of_T = count_of_T + 1doneprint count_of_A, count_of_C, count_of_G, count_of_T例子5-5 计算核苷酸评率 2

#!/usr/bin/env pythonimport os# Determining frequency of nucleotides# Get the name of the file with the DNA sequence dataprint("Please type the filename of the DNA sequence data: ")dna_filename = input()# open the file, or exitif os.path.exists(dna_filename):　　DNAFILE = open(dna_filename)else:　　print("Could not open file %s!\n" % dna_filename)　　exit()# Read the DNA sequence data from the file, and store it# into the array variable DNAsDNAs = DNAFILE.readlines()# Close the fileDNAFILE.close()# From the lines of the DNA file,# put the DNA sequence data into a single string.DNA = ''.join(DNAs)# Remove whitespaceDNA = DNA.replace('\n', '')# Initialize the counts.# Notice that we can use scalar variables to hold numbers.count_of_A = 0count_of_C = 0count_of_G = 0count_of_T = 0errors = 0# In a loop, look at each base in turn, determine which of the# four types of nucleotides it is, and increment the# appropriate count.for position in range(len(DNAs)): base = DNAs[position] if base == 'A': ++count_of_A elif base == 'C': ++count_of_C elif base == 'G': ++count_of_G elif base == 'T': ++count_of_T else: print("!!!!!!!! Error - I don\'t recognize this base: %s\n" % base) ++errors# print the resultsprint("A = %s\n" % count_of_A)print("C = %s\n" % count_of_C)print("G = %s\n" % count_of_G)print("T = %s\n" % count_of_T)print("errors = %s\n" % errors)# exit the programexit()

例子5-5输出如下：

Please type the filename of the DNA sequence data: small.dna!!!!!!!! Error - I don't recognize this vase: VA = 40C = 27G = 24T = 17errors = 1

上述for循环等价下面的while循环。

比较这两个循环，不难发现for循环只是while循环的简写，循环判断条件是position是否小于字符串DNA的长度，这里使用了字符串索引方式。默认情况下，Python假定字符串从0索引开始，其最后一个字符编号为字符串长度减一。

在上述程序中，我们还使用了python内置函数range，这是一个生成器函数（生成迭代器或列表），例如，range（5）等效生成[0, 1, 2, 3, 4]五个元素的列表。

6. 输出到文件

例子5-6介绍了另一种计算DNA字符串中核苷酸的方法，它使用了python字符串内置函数count。关于Python学习，你会体会到可能有一个相对简洁的方法来实现一个功能。

例子5-6 计算核苷酸的比例 3

#!/usr/bin/env pythonimport os# Determining frequency of nucleotides# Get the name of the file with the DNA sequence dataprint("Please type the filename of the DNA sequence data: ")dna_filename = input()# open the file, or exitif os.path.exists(dna_filename):　　DNAFILE = open(dna_filename)else:　　print("Could not open file %s!\n" % dna_filename)　　exit()# Read the DNA sequence data from the file, and store it# into the array variable DNAsDNAs = DNAFILE.readlines()# Close the fileDNAFILE.close()# From the lines of the DNA file,# put the DNA sequence data into a single string.DNA = ''.join(DNAs)# Remove whitespaceDNA = DNA.replace('\n', '')# In a loop, look at each base in turn, determine which of the# four types of nucleotides it is, and increment the# appropriate count.count_of_A = DNA.count('A') + DNA.count('a')count_of_C = DNA.count('C') + DNA.count('c')count_of_G = DNA.count('G') + DNA.count('g')count_of_T = DNA.count('T') + DNA.count('t')errors = len(DNA) - count_of_A -count_of_C - count_of_G - count_of_T# print the resultsprint("A=%d C=%d G=%d T=%d errors=%d\n" % (count_of_A, count_of_C, count_of_G, count_of_T, errors))# Also write the results to a file called "countbase"outputfile = "countbase"COUNTBASE = open(outputfile, 'w')COUNTBASE.write("A=%d C=%d G=%d T=%d errors=%d\n" % (count_of_A, count_of_C, count_of_G, count_of_T, errors))COUNTBASE.close()

# exit the programexit()

例子5-6输出如下：

Please type the filename of the DNA sequence data: small.dnaA=40 C=27 G=24 T=17 errors=1

例子5-7输出文件countbase内容如下：

A=40 C=27 G=24 T=17 errors=17.练习

5.1 编写一个无限循环的程序，循环每次判断条件为真。

5.2 用户输入两（短）DNA串，使用“+”运算符将第二个字符串连接到第一个字符串末尾。将连接的字符串打印，然后在连接的位置开始打印第二个字符串。例如，输入“AAAA”和“TTTT”，则打印：

AAAATTTT

TTTT

5.3 编写一个程序，打印从1到100的所有数字。

5.4 编写一个程序来获取DNA链的反向互补链。

5.5 编写一个程序来报告蛋白质序列中疏水性氨基酸的百分比。（（要查找哪些氨基酸是疏水性的，请参阅有关蛋白质，分子生物学或细胞生物学的任何介绍性文章。）

5.6 编写一个程序，检查作为参数输入的两个字符串是否彼此反向互补。

5.7 编写一个程序来报告DNA序列GC的比例。

5.8 编写一个程序，可以替换DNA中指定位置的碱基。

参考资料

Begining Perl for Bioinformatics

发表于: 2018-08-122018-08-12 20:42:57
原文链接：https://kuaibao.qq.com/s/20180812G18SOB00?refer=cp_1026
腾讯「腾讯云开发者社区」是腾讯内容开放平台帐号（企鹅号）传播渠道之一，根据《腾讯内容开放平台服务协议》转载发布内容。
如有侵权，请联系 cloudcommunity@tencent.com 删除。

第五章模体和循环

相关快讯

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

第五章 模体和循环

相关快讯

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

第五章模体和循环