在使用EVM或者maker进行基因注释后,通常的下一个需求就是对注释的gff的ID进行重命名,一般我们会按照物种的名称,按照基因在染色体的位置进行命名。这个该如何实现呢?这里借助近期看到的一些笔记,和大家分享其中的方法。
gff全称为general feature format,gff文件是一种用来描述基因组特征的文件,现在我们所使用的大部分都是第三版(gff3)。
gff文件除gff1以外均由9列数据组成,前8列在gff的3个版本中信息都是相同的,只是名称不同:
第9列attributes的内容存在很大的版本特异性。这9列信息(以gff3为例)分别是:
seqid source type start end score strand strand attributes
另外,在基因结构注释gff文件中中,基因包含mRNA,mRNA包含exon, CDS, UTR等信息,同时在注释文件中除基因行外,其他行在第9列会通过Parent指明该行从属的上一级ID,也就是一个基因的gene行、mRNA行、CDS行、exon行都会通过Parent层层关联在一起。具体例子
这里采用的是简书一位小伙伴写的脚本,他调用了python
中gffutils
的包。gffutils
能以极其简便的方式分层的方式处理GFF文件。
使用的脚本rename_gff.py
如下:
####rename_gff.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import sys
import argparse
import logging
import gffutils
from collections import defaultdict
def rename(args):
seqid2name = dict()
for line in open(args.change, 'r'):
tem = line.strip().split()
seqid2name[tem[0]] = tem[1]
db = gffutils.create_db(args.gff,':memory:',force=True,keep_order=True,merge_strategy="create_unique", sort_attribute_values = True)
mRNA_children=("five_prime_UTR","three_prime_UTR","CDS","exon")
idmap={
"CDS" : "cds",
"exon" : "exon",
"five_prime_UTR":"5utrp",
"three_prime_UTR":"3utrp"
}
f_out = open('%s.rename.gff3' % args.prefix, 'w')
seqid = None
for gene in db.features_of_type("gene",order_by=('seqid','start','end')):
if gene.seqid != seqid:
genenum = 0
genenum += args.addnum
seqid = gene.seqid
genename = '{0}G{1:06}'.format(seqid2name[seqid],genenum)
f_out.write('{seqid}\t{source}\t{featuretype}\t{start}\t{end}\t{score}\t{strand}\t{frame}\tID={geneid}\n'.format(seqid=gene.seqid, source=gene.source, featuretype=gene.featuretype, start=gene.start, end=gene.end, score=gene.score, strand=gene.strand, frame=gene.frame, geneid=genename))
for t,mRNA in enumerate(db.children(gene,featuretype="mRNA",order_by=('seqid','start','end'))):
mrna_num = t + 1
mrnaid = '{genename}.mRNA{num}'.format(genename=genename,num=mrna_num)
f_out.write('{seqid}\t{source}\t{featuretype}\t{start}\t{end}\t{score}\t{strand}\t{frame}\tID={mrnaid};Parent={geneid}\n'.format(seqid=mRNA.seqid, source=mRNA.source, featuretype=mRNA.featuretype, start=mRNA.start, end=mRNA.end, score=mRNA.score, strand=mRNA.strand, frame=mRNA.frame, mrnaid=mrnaid, geneid=genename))
numdict = defaultdict(int)
for child in db.children(mRNA,featuretype=mRNA_children,order_by=("start",'end')):
numdict[child.featuretype] += 1
child_id = '{genename}.{childid}{num}'.format(genename=genename,childid=idmap[child.featuretype],num=numdict[child.featuretype])
f_out.write('{seqid}\t{source}\t{featuretype}\t{start}\t{end}\t{score}\t{strand}\t{frame}\tID={child_id};Parent={mrnaid}\n'.format(seqid=child.seqid, source=child.source, featuretype=child.featuretype, start=child.start, end=child.end, score=child.score, strand=child.strand, frame=child.frame, child_id=child_id, mrnaid=mrnaid))
f_out.close()
def main():
logging.basicConfig(
stream=sys.stderr,
level=logging.INFO,
format="[%(levelname)s] %(message)s"
)
parser = argparse.ArgumentParser(
formatter_class=argparse.RawDescriptionHelpFormatter,
description="rename gff3 file")
parser.add_argument('-g', '--gff', required=True, help='gff3 file')
parser.add_argument('-c', '--change', required=True, help='a file, correspondence between sequence name and gene name prefix')
parser.add_argument('-a', '--addnum', type=int,default=1, help='diff in gene number, such as if addnum = 10, xx1G000010, xx1G000020')
parser.add_argument('-p', '--prefix', default='result', help='prefix of output')
args = parser.parse_args()
rename(args)
if __name__ == "__main__":
main()
脚本的帮助文档:
python rename_gff.py -h
usage: rename_gff.py [-h] -g GFF -c CHANGE [-a ADDNUM] [-p PREFIX]
rename gff3 file
optional arguments:
-h, --help show this help message and exit
-g GFF, --gff GFF gff3 file #输入注释文件
-c CHANGE, --change CHANGE #序列id和更换前缀之间的对应关系文件
a file, correspondence between sequence name and gene
name prefix
-a ADDNUM, --addnum ADDNUM #相邻基因编号差值
diff in gene number, such as if addnum = 10,
xx1G000010, xx1G000020
-p PREFIX, --prefix PREFIX #输出文件前缀
prefix of output
这个是我原始的gff3文件:
chr1H EVM CDS 400601 400684 . + 0 ID=Hordeum_vulgare_S_Scaffold1097_G00001.t1.cds1;Parent=Hordeum_vulgare_S_Scaffold1097_G00001.t1
chr1H EVM exon 400601 400684 . + . ID=Hordeum_vulgare_S_Scaffold1097_G00001.t1.exon1;Parent=Hordeum_vulgare_S_Scaffold1097_G00001.t1
chr1H EVM gene 400601 401782 . + . ID=Hordeum_vulgare_S_Scaffold1097_G00001
chr1H EVM mRNA 400601 401782 . + . ID=Hordeum_vulgare_S_Scaffold1097_G00001.t1;Parent=Hordeum_vulgare_S_Scaffold1097_G00001
chr1H EVM CDS 401648 401782 . + 0 ID=Hordeum_vulgare_S_Scaffold1097_G00001.t1.cds2;Parent=Hordeum_vulgare_S_Scaffold1097_G00001.t1
chr1H EVM exon 401648 401782 . + . ID=Hordeum_vulgare_S_Scaffold1097_G00001.t1.exon2;Parent=Hordeum_vulgare_S_Scaffold1097_G00001.t1
chr1H EVM CDS 402205 402518 . + 0 ID=Hordeum_vulgare_S_Scaffold1097_G00002.t1.cds1;Parent=Hordeum_vulgare_S_Scaffold1097_G00002.t1
chr1H EVM exon 402205 402518 . + . ID=Hordeum_vulgare_S_Scaffold1097_G00002.t1.exon1;Parent=Hordeum_vulgare_S_Scaffold1097_G00002.t1
chr1H EVM gene 402205 403658 . + . ID=Hordeum_vulgare_S_Scaffold1097_G00002
chr1H EVM mRNA 402205 403658 . + . ID=Hordeum_vulgare_S_Scaffold1097_G00002.t1;Parent=Hordeum_vulgare_S_Scaffold1097_G00002
这里需要提供一个key文件,将你想转换的信息写成一个文件,我的key.txt文件如下:
##Key.txt
Chr1H Hordeum_vulgare_S_1H
Chr2H Hordeum_vulgare_S_2H
Chr3H Hordeum_vulgare_S_3H
Chr4H Hordeum_vulgare_S_4H
Chr5H Hordeum_vulgare_S_5H
Chr6H Hordeum_vulgare_S_6H
Chr7H Hordeum_vulgare_S_7H
使用该脚本进行gff基因ID的替换:
python rename.py -g test.gff -c key.txt
默认会生成一个result.rename.gff3
文件,让我们查看一下效果如何:
head result.rename.gff3
chr1H EVM gene 400601 401782 . + . ID=Hordeum_vulgare_S_1HG000001
chr1H EVM mRNA 400601 401782 . + . ID=Hordeum_vulgare_S_1HG000001.mRNA1;Parent=Hordeum_vulgare_S_1HG000001
chr1H EVM CDS 400601 400684 . + 0 ID=Hordeum_vulgare_S_1HG000001.cds1;Parent=Hordeum_vulgare_S_1HG000001.mRNA1
chr1H EVM exon 400601 400684 . + . ID=Hordeum_vulgare_S_1HG000001.exon1;Parent=Hordeum_vulgare_S_1HG000001.mRNA1
chr1H EVM CDS 401648 401782 . + 0 ID=Hordeum_vulgare_S_1HG000001.cds2;Parent=Hordeum_vulgare_S_1HG000001.mRNA1
chr1H EVM exon 401648 401782 . + . ID=Hordeum_vulgare_S_1HG000001.exon2;Parent=Hordeum_vulgare_S_1HG000001.mRNA1
chr1H EVM gene 402205 403658 . + . ID=Hordeum_vulgare_S_1HG000002
chr1H EVM mRNA 402205 403658 . + . ID=Hordeum_vulgare_S_1HG000002.mRNA1;Parent=Hordeum_vulgare_S_1HG000002
chr1H EVM CDS 402205 402518 . + 0 ID=Hordeum_vulgare_S_1HG000002.cds1;Parent=Hordeum_vulgare_S_1HG000002.mRNA1
chr1H EVM exon 402205 402518 . + . ID=Hordeum_vulgare_S_1HG000002.exon1;Parent=Hordeum_vulgare_S_1HG000002.mRNA1
完美达成任务有没有,希望今天的推文对你也有帮忙。
参考资料:
python package gffutils: http://daler.github.io/gffutils/
简书:https://www.jianshu.com/p/e2f13e2f42ba