ANNOVAR region-based annotation, Part 1

生信修炼手册

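Gene-based annotation tells us how a variant relates to genes. Beyond that, it is often more interesting to know whether a variant falls in particular feature regions of the genome, such as transcription factor binding regions, promoters, or enhancers. Region-based annotation provides this.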

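Region-based annotation relies on databases, and each type of feature region has its own database. ANNOVAR supports many of them, including the following.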

1. Conserved regions across species

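The genome sequences of 46 vertebrate species, including human, mouse, and rat, are combined in a multiple sequence alignment, and the phastCons program is then used to identify genomic regions that are conserved across species. phastCons assigns a score to each conserved region it identifies.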

Step 1: download the phastConsElements46way database with the following command

annotate_variation.pl -build hg19 -downdb phastConsElements46way humandb/

NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/phastConsElements46way.txt.gz ... Done
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the 'humandb' directory

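The database file looks like this. Columns 2 to 4 give the position of the conserved region on the genome, column 5 is the name of the conserved region, and column 6 is its score.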

585     chr1    12002   12085   lod=33  343
585     chr1    12170   12232   lod=123 483
585     chr1    12594   12702   lod=219 545
585     chr1    12994   13054   lod=101 462
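To make the column layout concrete, here is a minimal Python sketch that splits one row of the file into named fields. The record type and function names are hypothetical, and treating the first column as the UCSC bin index is an assumption, since the text above does not describe that column.

```python
from typing import NamedTuple


class ConservedElement(NamedTuple):
    bin: int      # first column; assumed to be the UCSC bin index
    chrom: str    # column 2
    start: int    # column 3
    end: int      # column 4
    name: str     # column 5, e.g. "lod=33"
    score: int    # column 6


def parse_element(line: str) -> ConservedElement:
    """Split one whitespace-delimited database row into typed fields."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}")
    bin_, chrom, start, end, name, score = fields
    return ConservedElement(int(bin_), chrom, int(start), int(end), name, int(score))


row = parse_element("585     chr1    12002   12085   lod=33  343")
print(row.chrom, row.start, row.end, row.name, row.score)
# prints: chr1 12002 12085 lod=33 343
```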

Step 2: run the annotation with the following command

annotate_variation.pl -regionanno -build hg19 -out ex1 -dbtype phastConsElements46way ex1.avinput humandb/

NOTICE: Output file is written to ex1.hg19_phastConsElements46way
NOTICE: Reading annotation database humandb/hg19_phastConsElements46way.txt ... Done with 5163775 regions
NOTICE: Finished region-based annotation on 21 genetic variants

The output file has the suffix hg19_phastConsElements46way. Two columns are added in front of the original input columns; they look like this

phastConsElements46way    Score=300;Name=lod=22
phastConsElements46way    Score=387;Name=lod=50
phastConsElements46way    Score=420;Name=lod=68
phastConsElements46way    Score=385;Name=lod=49
phastConsElements46way    Score=395;Name=lod=54
phastConsElements46way    Score=545;Name=lod=218

The first column is the name of the database; the second column gives the score and name of the conserved region.
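The second column packs several key=value pairs into one field, and the value itself can contain an equals sign (as in Name=lod=22). The sketch below shows one way to unpack it in Python. It assumes the output file is tab-delimited; the function names and the trailing input columns in the usage line are hypothetical placeholders, not real ex1.avinput content.

```python
def parse_annotation(value: str) -> dict:
    """Parse 'Score=300;Name=lod=22' into {'Score': '300', 'Name': 'lod=22'}."""
    result = {}
    for item in value.split(";"):
        # Split on the first '=' only, because the value may itself contain '='.
        key, sep, val = item.partition("=")
        if sep:
            result[key] = val
    return result


def read_output_line(line: str):
    """Return (database name, parsed annotation, remaining input columns).

    Assumes tab-delimited output with the two annotation columns first.
    """
    db, annotation, *rest = line.rstrip("\n").split("\t")
    return db, parse_annotation(annotation), rest


print(parse_annotation("Score=300;Name=lod=22"))
# prints: {'Score': '300', 'Name': 'lod=22'}

# Hypothetical line: the columns after the annotation are placeholders.
print(read_output_line("phastConsElements46way\tScore=545;Name=lod=218\tchr1\t100\t100\tA\tG\n"))
```

The same helper works for the other databases in this article whose second column uses the Score=...;Name=... form.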

2. TFBS

TFBS stands for transcription factor binding site. The UCSC website provides a database of transcription factor binding sites.

Step 1: download the tfbsConsSites database with the following command

annotate_variation.pl -build hg19 -downdb tfbsConsSites humandb/

NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/tfbsConsSites.txt.gz ... Done
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the 'humandb' directory

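The database file looks like this. Columns 2 to 4 give the position of the transcription factor binding site on the genome, and column 5 is the name of the transcription factor.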

591     chr1    894640  894654  V$P300_01       842     -       1.68
591     chr1    894641  894657  V$ELK1_01       898     -       2.7
591     chr1    894644  894654  V$CETS1P54_01   971     -       2.22

Step 2: run the annotation with the following command

annotate_variation.pl -regionanno -build hg19 -dbtype tfbsConsSites ex1.avinput humandb/

NOTICE: Output file is written to ex1.avinput.hg19_tfbsConsSites
NOTICE: Reading annotation database humandb/hg19_tfbsConsSites.txt ... Done with 5797266 regions
NOTICE: Finished region-based annotation on 21 genetic variants

The output file has the suffix hg19_tfbsConsSites. Two columns are added in front of the original input columns; they look like this

tfbsConsSites   Score=767;Name=V$PAX5_02
tfbsConsSites   Score=880;Name=V$CEBPA_01
tfbsConsSites   Score=878;Name=V$FREAC3_01

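The first column is the name of the database; the second column gives the score of the binding region and the name of the corresponding transcription factor.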

3. cytoband

UCSC provides a cytoband (cytogenetic band) database.

Step 1: download the cytoBand database with the following command

annotate_variation.pl -build hg19 -downdb cytoBand humandb/

NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz ... Done
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the 'humandb' directory

The database file looks like this

chr1    0       2300000 p36.33  gneg
chr1    2300000 5400000 p36.32  gpos25
chr1    5400000 7200000 p36.31  gneg
chr1    7200000 9200000 p36.23  gpos25

Step 2: run the annotation with the following command

annotate_variation.pl -regionanno -build hg19 -dbtype cytoBand ex1.avinput humandb/

NOTICE: Output file is written to ex1.avinput.hg19_cytoBand
NOTICE: Reading annotation database humandb/hg19_cytoBand.txt ... Done with 862 regions
NOTICE: Finished region-based annotation on 21 genetic variants

The output file has the suffix hg19_cytoBand. Two columns are added in front of the original input columns; they look like this

cytoBand    1p36.33
cytoBand    1p36.33
cytoBand    1p36.31
cytoBand    1q23.3
cytoBand    1p31.1

The first column is the name of the database; the second column is the name of the cytoband the variant falls in.
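The lookup behind this annotation can be pictured as a point-in-interval search. The following Python sketch uses only the four sample rows shown above and is not ANNOVAR's actual implementation. It assumes the database intervals are 0-based and half-open (the usual UCSC table convention) and that variant positions are 1-based; the function names are hypothetical.

```python
CYTOBAND_ROWS = """\
chr1    0       2300000 p36.33  gneg
chr1    2300000 5400000 p36.32  gpos25
chr1    5400000 7200000 p36.31  gneg
chr1    7200000 9200000 p36.23  gpos25
"""


def load_bands(text):
    """Parse cytoBand rows into (chrom, start, end, band name) tuples."""
    bands = []
    for line in text.splitlines():
        if not line.strip():
            continue
        chrom, start, end, name, _stain = line.split()
        bands.append((chrom, int(start), int(end), name))
    return bands


def band_for(bands, chrom, pos):
    """Return a label such as '1p36.33' for a 1-based position, or None."""
    for b_chrom, start, end, name in bands:
        # 0-based half-open interval [start, end) covers 1-based start+1 .. end
        if b_chrom == chrom and start < pos <= end:
            prefix = chrom[3:] if chrom.startswith("chr") else chrom
            return prefix + name
    return None


bands = load_bands(CYTOBAND_ROWS)
print(band_for(bands, "chr1", 1000000))
# prints: 1p36.33
```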

4. microRNA and snoRNA

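UCSC provides the genomic positions of microRNAs and snoRNAs in a table called wgRna. With this database you can check whether a variant lies in a genomic region that corresponds to a microRNA or snoRNA.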

Step 1: download the database with the following command

annotate_variation.pl -build hg19 -downdb wgRna humandb/

NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/wgRna.txt.gz ... Done
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the 'humandb' directory

The database file looks like this:

585     chr1    30365   30503   hsa-mir-1302-2  0       +       0       0       miRNA
593     chr1    1102483 1102578 hsa-mir-200b    0       +       0       0       miRNA
799     chr1    28160911        28161077        ACA35   0       +       0       0       scaRna
804     chr1    28833876        28834083        U17a    0       +       0       0       HAcaBox
804     chr1    28835069        28835274        U17b    0       +       0       0       HAcaBox
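A quick way to see what kinds of RNA the table contains is to tally the last column. This Python sketch runs on the five sample rows above; reading the last column as the RNA type is an assumption based on the sample values (miRNA, scaRna, HAcaBox), and the function name is hypothetical.

```python
from collections import Counter

WGRNA_ROWS = """\
585     chr1    30365   30503   hsa-mir-1302-2  0       +       0       0       miRNA
593     chr1    1102483 1102578 hsa-mir-200b    0       +       0       0       miRNA
799     chr1    28160911        28161077        ACA35   0       +       0       0       scaRna
804     chr1    28833876        28834083        U17a    0       +       0       0       HAcaBox
804     chr1    28835069        28835274        U17b    0       +       0       0       HAcaBox
"""


def count_types(text):
    """Count rows per value of the last column (assumed to be the RNA type)."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if fields:
            counts[fields[-1]] += 1
    return counts


for rna_type, n in sorted(count_types(WGRNA_ROWS).items()):
    print(rna_type, n)
```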

Step 2: run the annotation with the following command

annotate_variation.pl -regionanno -build hg19 -dbtype wgRna ex1.avinput humandb/

NOTICE: Output file is written to ex1.avinput.hg19_wgRna
NOTICE: Reading annotation database humandb/hg19_wgRna.txt ... Done with 1341 regions
NOTICE: Finished region-based annotation on 21 genetic variants

The output file has the suffix hg19_wgRna. Two columns are added in front of the original input columns; they look like this

wgRna   Name=hsa-mir-1302-2
wgRna   Name=hsa-mir-1290
wgRna   Name=HBII-420

The first column is the name of the database; the second column is the name of the microRNA or snoRNA.

5. microRNA binding sites

UCSC provides the microRNA binding sites predicted by the TargetScanHuman website.

Step 1: download the targetScanS database with the following command

annotate_variation.pl -build hg19 -downdb targetScanS humandb/

NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/targetScanS.txt.gz ... Done
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the 'humandb' directory

The database file looks like this:

591     chr1    879822  879830  SAMD11:miR-504  90      +
591     chr1    900599  900606  KLHL17:miR-299/299-3p   26      +
591     chr1    900605  900612  KLHL17:miR-124/506      7       +
591     chr1    900933  900941  KLHL17:miR-19   82      +
591     chr1    901054  901061  KLHL17:miR-137  14      +

Step 2: run the annotation with the following command

annotate_variation.pl -regionanno -build hg19 -dbtype targetScanS ex1.avinput humandb/

NOTICE: Output file is written to ex1.avinput.hg19_targetScanS
NOTICE: Reading annotation database humandb/hg19_targetScanS.txt ... Done with 54199 regions
NOTICE: Finished region-based annotation on 21 genetic variants

The output file has the suffix hg19_targetScanS. Two columns are added in front of the original input columns; they look like this

targetScanS     Score=90;Name=SAMD11:miR-504
targetScanS     Score=82;Name=KLHL17:miR-19

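The first column is the name of the database; the second column gives the score of the binding region and the names of the corresponding gene and microRNA.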
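Because each name joins a gene and a microRNA with a colon, the predicted sites are easy to group by gene. Here is a minimal Python sketch over the five sample database rows shown earlier in this section. It assumes the name is always in column 5 and always has the form GENE:miRNA; the function name is hypothetical.

```python
from collections import defaultdict

TARGETSCAN_ROWS = """\
591     chr1    879822  879830  SAMD11:miR-504  90      +
591     chr1    900599  900606  KLHL17:miR-299/299-3p   26      +
591     chr1    900605  900612  KLHL17:miR-124/506      7       +
591     chr1    900933  900941  KLHL17:miR-19   82      +
591     chr1    901054  901061  KLHL17:miR-137  14      +
"""


def sites_by_gene(text):
    """Map each gene to the microRNA labels of its predicted binding sites."""
    genes = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        # Column 5 looks like "KLHL17:miR-19"; split at the first colon.
        gene, sep, mirna = fields[4].partition(":")
        if sep:
            genes[gene].append(mirna)
    return dict(genes)


for gene, mirnas in sites_by_gene(TARGETSCAN_ROWS).items():
    print(gene, mirnas)
```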

6. segmental duplications

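Segmental duplications are duplicated sequence regions of the genome. Because the copies are highly similar to each other, reads from these regions may be aligned incorrectly.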

Step 1: download the genomicSuperDups database with the following command

annotate_variation.pl -build hg19 -downdb genomicSuperDups humandb/

NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/genomicSuperDups.txt.gz ... Done
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the 'humandb' directory

The database file has many columns; only the first five are shown here:

585     chr1    10000   87112   chr15:102446355
585     chr1    10000   20818   chr12:84886
585     chr1    10000   19844   chrY:59352887
585     chr1    10000   19844   chrX:155249881
585     chr1    10464   40733   chr2:114330297

Step 2: run the annotation with the following command

annotate_variation.pl -regionanno -build hg19 -dbtype genomicSuperDups ex1.avinput humandb/

NOTICE: Output file is written to ex1.avinput.hg19_genomicSuperDups
NOTICE: Reading annotation database humandb/hg19_genomicSuperDups.txt ... Done with 51599 regions
NOTICE: Finished region-based annotation on 21 genetic variants

The output file has the suffix hg19_genomicSuperDups. Two columns are added in front of the original input columns; they look like this

genomicSuperDups    Score=0.905283;Name=chr1:1439902
genomicSuperDups    Score=0.99612;Name=chr1:13142561
genomicSuperDups    Score=0.991956;Name=chr15:102446355

The first column is the name of the database; the second column gives the score and name of the duplicated region.

7. structural variants

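The DGV database stores information on genomic structural variants. ANNOVAR uses it to check whether a variant falls within a previously published structural variant interval.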

Step 1: download the dgvMerged database with the following command

annotate_variation.pl -build hg19 -downdb dgvMerged humandb/

NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/dgvMerged.txt.gz ... Done
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the 'humandb' directory

The database file has many columns; only the first five are shown here:

9       chr1    0       2300000 nsv482937
585     chr1    10000   127330  nsv7879
585     chr1    10000   22118   dgv1n82
585     chr1    10190   10281   nsv958854
73      chr1    10376   1018704 esv2758911

Step 2: run the annotation with the following command

annotate_variation.pl -regionanno -build hg19 -dbtype dgvMerged ex1.avinput humandb/

NOTICE: Output file is written to ex1.avinput.hg19_dgvMerged
NOTICE: Reading annotation database humandb/hg19_dgvMerged.txt ... Done with 392583 regions
NOTICE: Finished region-based annotation on 21 genetic variants

The output file has the suffix hg19_dgvMerged. Two columns are added in front of the original input columns; they look like this

dgvMerged    Name=nsv832536,nsv545407
dgvMerged    Name=nsv830937,dgv235n100
dgvMerged    Name=nsv1243
dgvMerged    Name=nsv584699
dgvMerged    Name=esv3638608

The first column is the name of the database; the second column lists the IDs of the structural variants in the DGV database.
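A variant can overlap more than one structural variant, in which case the IDs appear as a comma-separated list after Name=. A small Python sketch for splitting them, using annotation values from the sample output above; the function name is hypothetical.

```python
def dgv_ids(annotation: str) -> list:
    """Return the structural variant IDs from a value like 'Name=nsv1,nsv2'."""
    prefix = "Name="
    if not annotation.startswith(prefix):
        return []
    return annotation[len(prefix):].split(",")


for value in ["Name=nsv832536,nsv545407", "Name=nsv1243"]:
    ids = dgv_ids(value)
    print(len(ids), ids)
# prints:
# 2 ['nsv832536', 'nsv545407']
# 1 ['nsv1243']
```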

8. GWAS

This annotation checks whether a variant has been reported in previous GWAS studies.

Step 1: download the gwasCatalog database with the following command

annotate_variation.pl -build hg19 -downdb gwasCatalog humandb/

NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gwasCatalog.txt.gz ... Done
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the 'humandb' directory

The database file has many columns; only the first five are shown here:

590     chr1    780396  780397  rs141175086
591     chr1    894572  894573  rs13303010
592     chr1    1005805 1005806 rs3934834
593     chr1    1079197 1079198 rs11260603
593     chr1    1173610 1173611 rs6697886

Step 2: run the annotation with the following command

annotate_variation.pl -regionanno -build hg19 -dbtype gwasCatalog ex1.avinput humandb/

NOTICE: Output file is written to ex1.avinput.hg19_gwasCatalog
NOTICE: Reading annotation database humandb/hg19_gwasCatalog.txt ... Done with 75593 regions
NOTICE: Finished region-based annotation on 21 genetic variants

The output file has the suffix hg19_gwasCatalog. Two columns are added in front of the original input columns; they look like this

gwasCatalog    Name=Crohn's disease
gwasCatalog    Name=Chronic inflammatory diseases

The first column is the name of the database; the second column is the name of the disease or trait associated with the variant.

Region-based annotation supports many databases. This article covers only the ones above; the remaining databases are introduced in a follow-up article.
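All eight databases in this article follow the same two-step pattern, so the commands can be generated in a loop. The Python sketch below builds the same download and annotation commands used above, with the same defaults (hg19, ex1.avinput, humandb/). The wrapper functions are hypothetical, and the script defaults to a dry run that only prints the commands; actually running them assumes annotate_variation.pl is on your PATH.

```python
import subprocess

DBTYPES = [
    "phastConsElements46way", "tfbsConsSites", "cytoBand", "wgRna",
    "targetScanS", "genomicSuperDups", "dgvMerged", "gwasCatalog",
]


def build_commands(dbtype, avinput="ex1.avinput", dbdir="humandb/", build="hg19"):
    """Return the (download, annotate) argument lists for one database."""
    download = ["annotate_variation.pl", "-build", build,
                "-downdb", dbtype, dbdir]
    annotate = ["annotate_variation.pl", "-regionanno", "-build", build,
                "-dbtype", dbtype, avinput, dbdir]
    return download, annotate


def run_all(dry_run=True):
    """Print every command, or execute it when dry_run is False."""
    for dbtype in DBTYPES:
        for cmd in build_commands(dbtype):
            if dry_run:
                print(" ".join(cmd))
            else:
                # check=True stops at the first command that fails
                subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Without the -out option, each annotation writes to a file named after the input, for example ex1.avinput.hg19_cytoBand, as the logs above show.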
