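In my article "Bioinformatics: A Panorama" (生物信息学:全景), I described how broad the applications of bioinformatics are. One point is key, though: all life activity inside the cell follows the central dogma, and much of bioinformatics is built on top of it.
The first problem to face is how to describe biological macromolecules in a computer language, and how they convert into one another.
The central dogma involves three kinds of biological sequences (DNA, RNA, and protein), and in a computer each is represented as a string.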
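To make the string representation concrete, here is a minimal sketch. The sequence is a toy example invented for illustration, and the `transcribe` helper is a hypothetical name; it assumes the DNA string is the coding strand, so that transcription reduces to replacing T with U.

```python
# Alphabets of the two nucleic-acid string types.
DNA_ALPHABET = set("ACGT")
RNA_ALPHABET = set("ACGU")

def transcribe(dna):
    """Return the mRNA string for a coding-strand DNA string (T becomes U)."""
    if not set(dna) <= DNA_ALPHABET:
        raise ValueError("not a DNA string")
    return dna.replace("T", "U")

dna = "ATGGCCTGA"        # toy coding-strand DNA, made up for this example
rna = transcribe(dna)
print(rna)               # AUGGCCUGA
```

Protein strings work the same way, only over a 20-letter alphabet of one-letter amino acid codes.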
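The genetic code is read in triplets, so there are 4 × 4 × 4 = 64 possible codons, yet they encode only 20 amino acids. This means that some codons do not encode an amino acid at all (the stop codons), while in other cases several codons specify the same amino acid. Genetic translation is the process of mapping each triplet codon to the amino acid it stands for.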
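The counts above can be checked with a short sketch. It assumes the standard genetic code, written here as a 64-character string in U, C, A, G base order with `*` marking a stop codon; the names `BASES`, `AMINO_ACIDS`, and `codon_table` are chosen for illustration only.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Assumption: standard genetic code, one letter per codon in UCAG order, '*' = stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# product() varies the last base fastest, which matches the order of AMINO_ACIDS.
codon_table = {
    "".join(bases): aa
    for bases, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

degeneracy = Counter(codon_table.values())   # how many codons map to each symbol
stop_count = degeneracy.pop("*")             # separate stop codons from amino acids

print(len(codon_table))    # 64 codons in total
print(stop_count)          # 3 stop codons
print(len(degeneracy))     # 20 amino acids
print(degeneracy["L"])     # 6 codons for leucine
print(degeneracy["M"])     # 1 codon for methionine
```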
Given: a single-stranded mRNA sequence (at most 10 kb long).
Return: the protein sequence it encodes.
AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA
MAMAPRTEINSTRING
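As a quick illustration of how the sample input lines up with the sample output, the sketch below only splits the mRNA into consecutive triplets, starting at the first base; `split_codons` is a hypothetical helper name, and any incomplete trailing codon is dropped.

```python
def split_codons(rna):
    """Split an RNA string into consecutive 3-base codons, dropping any remainder."""
    usable = len(rna) - len(rna) % 3
    return [rna[i:i + 3] for i in range(0, usable, 3)]

sample = "AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"
codons = split_codons(sample)
print(len(codons))    # 17
print(codons[0])      # AUG
print(codons[-1])     # UGA
```

The 51 bases give 17 codons: 16 that each map to one letter of the protein string, followed by the stop codon UGA.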
Translating_RNA_into_Protein.py
import sys

# Standard RNA codon table: each codon maps to a one-letter amino acid code,
# or to 'Stop' for the three stop codons.
table = {
    'UUU':'F','CUU':'L','AUU':'I','GUU':'V',
    'UUC':'F','CUC':'L','AUC':'I','GUC':'V',
    'UUA':'L','CUA':'L','AUA':'I','GUA':'V',
    'UUG':'L','CUG':'L','AUG':'M','GUG':'V',
    'UCU':'S','CCU':'P','ACU':'T','GCU':'A',
    'UCC':'S','CCC':'P','ACC':'T','GCC':'A',
    'UCA':'S','CCA':'P','ACA':'T','GCA':'A',
    'UCG':'S','CCG':'P','ACG':'T','GCG':'A',
    'UAU':'Y','CAU':'H','AAU':'N','GAU':'D',
    'UAC':'Y','CAC':'H','AAC':'N','GAC':'D',
    'UAA':'Stop','CAA':'Q','AAA':'K','GAA':'E',
    'UAG':'Stop','CAG':'Q','AAG':'K','GAG':'E',
    'UGU':'C','CGU':'R','AGU':'S','GGU':'G',
    'UGC':'C','CGC':'R','AGC':'S','GGC':'G',
    'UGA':'Stop','CGA':'R','AGA':'R','GGA':'G',
    'UGG':'W','CGG':'R','AGG':'R','GGG':'G'
}


def translate(rna, table):
    """Translate an mRNA string into a protein string, codon by codon."""
    n = len(rna)
    amino_acids = ''
    # Read from the first base in steps of three.
    for i in range(0, n, 3):
        codon = rna[i:i+3]
        # Stop at a stop codon, or at anything that is not a valid codon
        # (for example an incomplete triplet at the end of the sequence).
        if codon not in table or table[codon] == "Stop":
            break
        else:
            amino_acids += table[codon]
    return amino_acids


def test():
    """Check translate() against the sample data."""
    rna = 'AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA'
    return translate(rna, table) == 'MAMAPRTEINSTRING'


if __name__ == '__main__':
    if not test():
        print("translate: Failed")
        sys.exit(1)

    with open('rosalind_prot.txt') as fh:
        # Strip surrounding whitespace such as the trailing newline.
        rna = fh.read().strip()
        amino_acids = translate(rna, table)
        print(amino_acids)
The codon tables collected in BioPython are fairly comprehensive and make a good reference.
The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet (all letters except for B, J, O, U, X, and Z). Protein strings are constructed from these 20 symbols. Henceforth, the term genetic string will incorporate protein strings along with DNA strings and RNA strings.
The RNA codon table dictates the details regarding the encoding of specific codons into the amino acid alphabet.
Given: An RNA string s corresponding to a strand of mRNA (of length at most 10 kbp).
Return: The protein string encoded by s.
AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA
MAMAPRTEINSTRING