叶绿体基因组类的文章通常会有一幅图来展示叶绿体基因组的相似性(Sequence identity plot),出图的工具是mVISTA:mVISTA分为本地版和在线版两种。本文简要介绍使用在线版mVISTA获得Sequence identity plot的步骤。
本文使用到的序列数据为5个苹果属植物的叶绿体基因组序列
Malus florentina | NC_035625
Malus micromalus | NC_036368
Malus prunifolia | NC_031163
Malus trilobata | NC_035671
Malus tschonoskii | NC_035672
第一步:下载序列
下载每个叶绿体基因组的fasta格式;下载作为参考基因组的genbank格式文件。(序列不是很多可以手动下载;如果序列条数非常多自己写个一个简单的python脚本批量下载叶绿体基因组序列)
import os
import sys
import argparse
parser = argparse.ArgumentParser(description='This script was used to download gb or fasta file of cp genome from NCBI nucleotides database')
parser.add_argument('-f','--format',help='Please input file format you want to download',required=True)
parser.add_argument('-a','--accession',help='file name contain accession number of cp genome you want to download',required=True)
args = parser.parse_args()
fr = open(args.accession,'r')
acc = []
for line in fr:
acc.append(line.strip())
print("The number of sequence will be downloaded is: ",len(acc))
from Bio import Entrez
from Bio import SeqIO
Entrez.email = "" #在这里添加邮箱
fold_name = "downladed_seq_"+args.format
os.mkdir(fold_name)
os.chdir(fold_name)
for line in acc:
print(line + " will be downloaded!")
fw = open(line+"."+args.format,'w')
cp = Entrez.efetch(db='nucleotide',id=[line],rettype=args.format)
seq = SeqIO.read(cp,args.format)
SeqIO.write(seq,fw,args.format)
fw.close()
print(line + " is downloaded!")
print(len(os.listdir()),"sequence have been downloaded")
使用方式
#下载fasta格式
python download_gb_or_fa_from_NCBI_cp_genome_database_1.py -f fasta -a accession_numbers.txt
# 下载genbank格式
python download_gb_or_fa_from_NCBI_cp_genome_database_1.py -f gb -a accession_numbers.txt
# 序列号放到文本文件中每行一个
http://genome.lbl.gov/vista/mvista/submit.shtml
image.png 本例中 total number of seuqences is 5
image.png
image.png
image.png 可以通过处理genbank格式文件得到
import argparse
from Bio import SeqIO
parser = argparse.ArgumentParser(description = 'This script was used to get mVISTA annotation file used to cp genome comparative analysis')
parser.add_argument('-i','--genbank',help="Please input genBank format file", required = True)
args = parser.parse_args()
fw = open(args.genbank + "_mVISTA_annotation","w")
for rec in SeqIO.parse(args.genbank,"gb"):
for feature in rec.features:
if feature.type == "gene":
for part in feature.location.parts:
if part.strand == -1:
start_location = str(part.start)
end_location = str(part.end)
gene_name = feature.qualifiers["gene"][0]
fw.write("<\t%s\t%s\t%s\n"%(start_location,end_location,gene_name))
else:
start_location = str(part.start)
end_location = str(part.end)
gene_name = feature.qualifiers["gene"][0]
fw.write("<\t%s\t%s\t%s\n"%(start_location,end_location,gene_name))
elif feature.type == "CDS":
for part in feature.location.parts:
if part.strand == -1:
start_location = str(part.start)
end_location = str(part.end)
fw.write("%s\t%s\t%s\n"%(start_location,end_location,"exon"))
else:
start_location = str(part.start)
end_location = str(part.end)
fw.write("%s\t%s\t%s\n"%(start_location,end_location,"exon"))
elif feature.type == "tRNA" or feature.type == "rRNA":
for part in feature.location.parts:
start_location = str(feature.location.start)
end_location = str(feature.location.end)
fw.write("%s\t%s\t%s\n"%(start_location,end_location,"utr"))
else:
print("%s %s"%(feature.type,"is not needed"))
print("The process is done!")
fw.close()
# 使用方法
python get_mVISTA_annotation_file_from_genbank_1.py -i NC_031163.gb
mage.png 这里只上传参考序列的注释文件和全部上传注释文件是否有区别自己还没有搞清楚,这里暂时只选择上传参考序列的注释文件。其他参数选择默认,具体参数是什么功能自己还需要仔细查看帮助文档。
image.png 结果以邮件的形式发送邮箱
选择输出的长度
点击此处下载结果
image.png 输出结果为pdf文件自己调节细节