共 4 篇文章

清单首页ANNOVAR region-based annotation、gene-based annotation、region-based annotation文章详情

清单「ANNOVAR region-based annotation、gene-based annotation、region-based annotation」 04/04

ANNOVAR region-based annotation-下篇

在UCSC网站上，对于不同的参考基因组，提供了许多数据，这些数据中只要是提供了基因组区域的，理论上都可以用于annovar region-based annotation。在选择时，只需要根据目的选择对应的数据库即可。annovar 官方的文档中，给出了以下几种用法。

1. ENCODE annnotated regions

ENCODE项目提供了许多基因组特殊区域的注释，比如转录区，H3K4Me1，H3K4Me3，H3K27Ac 等各种修饰后的组蛋白结合区，DNaseI 超敏感区域，转录因子结合区域等。这些区域都可以用于annovar的注释。

对于不同的组蛋白修饰，常规的认知如下

active promoter: H3K4me3, H3K9Ac
active enhancer: H3K4me1, H3K27Ac
active elongation: H3K36me3, H3K79me2
repressed promoters and broad regions: H3K27me3, H3K9me3

H3K4me3和H3K9Ac这两种组蛋白修饰通常富集在激活的启动子区域上； H3K4me1和H3K27Ac这两种组蛋白修饰通常富集在基因的增强子区域；H3K36me3和 H3K79me2这两种组蛋白修饰通常富集在RNA聚合酶延伸区域；H3K27me3和H3K9me3这两种组蛋白修饰用于抑制基因表达。

目前，ENCODE的相关注释只支持hg18版本的数据库。以H3K4Me1组蛋白修饰结合区域为例，说明如下

第一步：下载wgEncodeBroadChipSeqPeaksGm12878H3k4me1 数据库，命令如下

annotate_variation.pl -downdb wgEncodeBroadChipSeqPeaksGm12878H3k4me1 humandb/

NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/wgEncodeBroadChipSeqPeaksGm12878H3k4me1.txt.gz ... Done
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg18 build version, with files saved at the 'humandb' directory

数据库文件内容如下

585     chr1    0       625     .       1000    .       18.57   14.6993 -1
585     chr1    14800   18500   .       1000    .       7.21    12.9486 -1
585     chr1    19250   20600   .       1000    .       8.21    13.6344 -1
586     chr1    224700  228000  .       1000    .       12.17   13.2353 -1
587     chr1    367150  369075  .       1000    .       7.42    13.5232 -1

第二步，进行注释，命令如下

annotate_variation.pl -regionanno -buildver hg18 -dbtype wgEncodeBroadChipSeqPeaksGm12878H3k4me1 -scorecolumn 5 ex1.avinput humandb/

NOTICE: Output file is written to ex1.avinput.hg18_wgEncodeBroadChipSeqPeaksGm12878H3k4me1
NOTICE: Reading annotation database humandb/hg18_wgEncodeBroadChipSeqPeaksGm12878H3k4me1.txt ... Done with 66675 regions
NOTICE: Finished region-based annotation on 23 genetic variants

输出文件的后缀为hg18_wgEncodeBroadChipSeqPeaksGm12878H3k4me1, 在输入文件的前面新增了两列，内容如下

wgEncodeBroadChipSeqPeaksGm12878H3k4me1    Score=1000;Name=.
wgEncodeBroadChipSeqPeaksGm12878H3k4me1    Score=1000;Name=.
wgEncodeBroadChipSeqPeaksGm12878H3k4me1    Score=1000;Name=.

第一列为数据库的名称，第二列为名称和打分值，-scorecolumn 5参数指定数据库文件中的第五列作为score值。

2. dbsnp 数据库

第一步：下载snp138数据库，命令如下

annotate_variation.pl -buildver hg19 -downdb snp138 humandb/

第二步，进行注释, 命令如下

annotate_variation.pl -buildver hg19 -regionanno -dbtype snp138 ex1.avinput humandb/

NOTICE: Output file is written to ex1.avinput.hg19_snp138
NOTICE: Reading annotation database humandb/hg19_snp138.txt ... Done with 1000 regions
NOTICE: Finished region-based annotation on 23 genetic variants

由于SNP数据库非常的大，我这里选取了前1000行用于测试，输出结果后缀为hg19_snp138, 在输入文件的前面新增了两列，内容如下

snp138    Name=rs373328635
snp138    Name=rs368469931

第一列为数据库的名称，第二列为overlap区对应的rs号。

3. 对非编码区的变异位点进行分类

ENCODE项目针对不同的细胞系，对非编码区的变异位点根据位置分成了增强子，启动子等类别。细胞系列表如下：

在注释时，选择对应的细胞系即可。以Gm12878细胞系进行说明

第一步：下载wgEncodeBroadHmmGm12878HMM 数据库，命令如下

annotate_variation.pl -downdb wgEncodeBroadHmmGm12878HMM humandb/

NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/wgEncodeBroadHmmGm12878HMM.txt.gz ... Done
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg18 build version, with files saved at the 'humandb' directory

数据库文件列数较多，截取了前5列，内容如下

585     chr1    0       600     15 Repetitive/CNV
585     chr1    1000    1600    8 Insulator
1347    chr1    100000000       100000200       6 Weak Enhancer
1347    chr1    100000200       100001000       2 Weak Promoter
1347    chr1    100001000       100001400       6 Weak Enhancer
1347    chr1    100001400       100001600       2 Weak Promoter
1347    chr1    100001600       100002200       6 Weak Enhancer
1347    chr1    100002200       100002400       2 Weak Promoter
1347    chr1    100002400       100003000       1 Active Promoter

第二步，进行注释，命令如下

annotate_variation.pl -regionanno -buildver hg18 -dbtype wgEncodeBroadHmmGm12878HMM ex1.avinput humandb/

NOTICE: Output file is written to ex1.avinput.hg18_wgEncodeBroadHmmGm12878HMM
NOTICE: Reading annotation database humandb/hg18_wgEncodeBroadHmmGm12878HMM.txt ... Done with 570580 regions
NOTICE: Finished region-based annotation on 23 genetic variants

输出文件的后缀为hg18_wgEncodeBroadHmmGm12878HMM, 在输入文件的前面新增了两列，内容如下

wgEncodeBroadHmmGm12878HMM    Name=11 Weak Txn
wgEncodeBroadHmmGm12878HMM    Name=10 Txn Elongation
wgEncodeBroadHmmGm12878HMM    Name=11 Weak Txn
wgEncodeBroadHmmGm12878HMM    Name=13 Heterochrom/lo
wgEncodeBroadHmmGm12878HMM    Name=13 Heterochrom/lo

第一列为数据库的名称，第二列为突变位点的分类。在定义分类时，不同数字代表的含义如下

State 1 - Active Promoter
State 2 - Weak Promoter
State 3 - Inactive/poised Promoter
State 4 - Strong enhancer
State 5 - Strong enhancer
State 6 - Weak/poised enhancer
State 7 - Weak/poised enhancer
State 8 - Insulator
State 9 - Transcriptional transition
State 10 - Transcriptional elongation
State 11 - Weak transcribed
State 12 - Polycomb-repressed
State 13 - Heterochromatin; low signal
State 14 - Repetitive/Copy Number Variation
State 15 - Repetitive/Copy Number Variation

除了采用别人定义好的现成数据库，也可以采用自定义的数据库。自定义的数据库支持以下两种格式

1. 自定义的GFF3文件

关于GFF3格式的详细定义可以参考如下链接

http://www.sequenceontology.org/gff3.shtml

第一步，准备数据库，文件名称为hg19_tfbs_gff3.txt, 内容如下

gff-version 3
chr1    tfloc   TF_binding_site 83024   83039   849     -       .       ID=pos83031;Name=V$FREAC3_01
chr1    tfloc   TF_binding_site 229352  229367  849     -       .       ID=pos229359;Name=V$FREAC3_01
chr1    tfloc   TF_binding_site 405674  405689  849     +       .       ID=pos405681;Name=V$FREAC3_01
chr1    tfloc   TF_binding_site 564503  564518  849     -       .       ID=pos564510;Name=V$FREAC3_01

文件名称自己定义，第一行必须为##gff-version 3, 正文部分为\t分隔的9列内容，数据库文件必须存放在humandb 目录下。

第二步，进行注释，命令如下

annotate_variation.pl -regionanno -buildver hg19 -dbtype gff3 -gff3dbfilehg19_tfbs_gff3.txt ex1.avinput humandb/

NOTICE: Output file is written to ex1.avinput.hg19_gff3
NOTICE: Reading annotation database humandb/hg19_tfbs_gff3.txt ... Done with 4 regions from 4 GFF3 records
NOTICE: Finished region-based annotation on 23 genetic variants

输出文件的后缀为hg19_gff3, 在输入文件的前面新增了两列，内容如下

gff3    Score=849;Name=pos83031
gff3    Score=849;Name=pos564510

第一列为数据库的名称，第二列为overlap区间的名称和打分。

2. 自定义BED文件

annovar 支持直接使用bed文件作为数据库，需要注意的一点是，必须把bed文件拷贝到对应的数据库目录下，因为annovar 只会在数据库目录下去查找对应的文件。

注释的命令如下

annotate_variation.pl -regionanno -buildver hg19 -dbtype bed -colsWanted 5 -bedfile human_exon.bed ex1.avinput humandb/

NOTICE: Output file is written to ex1.avinput.hg19_bed
NOTICE: Reading annotation database humandb/human_exons.bed ... Done with 10 regions
NOTICE: Finished region-based annotation on 23 genetic variants

输出结果如下

bed     Name=NR_024540_8        chr1    18174   18174   0       0

第一列表示数据库，第二列为bed文件中某一列的内容，默认为NA, 可以通过colsWanted参数指定bed文件的某一列作为输出。

ANNOVAR region-based annotation、gene-based annotation、region-based annotation

ANNOVAR region-based annotation-下篇

1. ENCODE annnotated regions

2. dbsnp 数据库

3. 对非编码区的变异位点进行分类

1. 自定义的GFF3文件

2. 自定义BED文件

社区

活动

资源

关于

腾讯云开发者

热门产品

热门推荐

更多推荐