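So that the other related software runs smoothly, we follow the tutorial to set up the default install directory and environment variables for Ensembl's VEP. If you don't have VEP installed, then follow this gist.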
export VEP_PATH=$HOME/vep
export VEP_DATA=$HOME/.vep
mkdir $VEP_PATH $VEP_DATA; cd $VEP_PATH
export PERL5LIB=$VEP_PATH:$PERL5LIB
export PATH=$VEP_PATH/htslib:$PATH
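## This block only creates directories and downloads data, so in principle nothing should go wrong; how long it takes depends on your network speed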
perl -e '{print join"\n",@INC}'
## Adding the Perl module path temporarily like this is inconvenient and needs to be changed
source ~/.bashrc
curl -LO https://github.com/Ensembl/ensembl-tools/archive/release/86.tar.gz
tar -zxf 86.tar.gz --starting-file variant_effect_predictor --transform='s|.*/|./|g'
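The comment above points out that exporting the Perl module path only for the current shell is inconvenient. One possible way to make the settings persistent is to append them to `~/.bashrc` exactly once. The helper name `append_once` is made up for this sketch and is not part of VEP or the gist.

```bash
# append_once FILE LINE: add LINE to FILE unless an identical line is already there.
# (Hypothetical helper for illustration.)
append_once() {
    file=$1
    line=$2
    # -q quiet, -x whole-line match, -F fixed string; -e protects lines starting with '-'
    if ! grep -qxF -e "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Example (not run here): persist the variables used in this post.
# append_once ~/.bashrc 'export VEP_PATH=$HOME/vep'
# append_once ~/.bashrc 'export VEP_DATA=$HOME/.vep'
# append_once ~/.bashrc 'export PERL5LIB=$VEP_PATH:$PERL5LIB'
# append_once ~/.bashrc 'export PATH=$VEP_PATH/htslib:$PATH'
```

Because the helper checks for an identical line first, running the setup twice does not keep growing `~/.bashrc`.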
Download and unpack VEP's offline cache for GRCh37, GRCh38, and GRCm38:
cd $VEP_DATA
#rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-86/variation/VEP/homo_sapiens_vep_86_GRCh37.tar.gz $VEP_DATA
rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-86/variation/VEP/homo_sapiens_vep_86_GRCh38.tar.gz $VEP_DATA
#rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-86/variation/VEP/mus_musculus_vep_86_GRCm38.tar.gz $VEP_DATA
cat $VEP_DATA/*_vep_86_GRC{h37,h38,m38}.tar.gz | tar -izxf - -C $VEP_DATA
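## Unpack the downloaded databases into the target directory
# 4.9G Apr 23 19:40 homo_sapiens_vep_86_GRCh38.tar.gz
## The file downloaded in this step is fairly large, so it may take a while; the default directory is usually left unchanged.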
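Only the GRCh38 archive is downloaded above (the other two `rsync` lines are commented out), so the brace pattern in the `cat` command also names files that do not exist and `cat` will complain about them. A minimal sketch that unpacks only the archives actually present could look like this; `list_cache_archives` is a hypothetical helper name.

```bash
# list_cache_archives DIR: print the release-86 cache tarballs that exist in DIR.
# (Hypothetical helper for illustration; file names are the ones used in this post.)
list_cache_archives() {
    dir=$1
    for name in homo_sapiens_vep_86_GRCh37 homo_sapiens_vep_86_GRCh38 mus_musculus_vep_86_GRCm38; do
        f="$dir/$name.tar.gz"
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Example (not run here): unpack each archive that is present, one at a time.
# list_cache_archives "$VEP_DATA" | while read -r f; do
#     tar -zxf "$f" -C "$VEP_DATA"
# done
```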
Install the Ensembl API and the reference FASTAs for GRCh37/GRCh38/GRCm38:
cd $VEP_PATH
#perl INSTALL.pl --AUTO af --SPECIES homo_sapiens --ASSEMBLY GRCh37 --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
perl INSTALL.pl --AUTO af --SPECIES homo_sapiens --ASSEMBLY GRCh38 --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
#perl INSTALL.pl --AUTO af --SPECIES mus_musculus --ASSEMBLY GRCm38 --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
## BioPerl gets installed along the way
If it succeeds, you will see messages like the following:
- downloading Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
- converting sequence data to bgzip format
Going to run:
/home/jianmingzeng/vep/biodbhts/scripts/convert_gz_2_bgz.sh /home/jianmingzeng/.vep/homo_sapiens/86_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz /home/jianmingzeng/vep/htslib/bgzip
This may take some time and will be removed when files are provided in bgzip format
Converted FASTA gzip file to bgzip successfully
[fai_load] build FASTA index.
- indexing OK
The FASTA file should be automatically detected by the VEP when using --cache or --offline. If it is not, use "--fasta /home/jianmingzeng/.vep/homo_sapiens/86_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
All done
Because Perl modules are used, some extra setup is needed if your server environment is not already configured:
perl -e 'use LWP::Simple'
wget -O- http://cpanmin.us | perl - -l ~/perl5 App::cpanminus local::lib
eval `perl -I ~/perl5/lib/perl5 -Mlocal::lib`
echo 'eval `perl -I ~/perl5/lib/perl5 -Mlocal::lib`' >> ~/.profile
echo 'export MANPATH=$HOME/perl5/man:$MANPATH' >> ~/.profile
source ~/.profile
cpanm -v --notest -l ~/perl5 Archive::Extract;
cpanm -v --notest -l ~/perl5 Archive::Zip;
cpanm -v --notest -l ~/perl5 HTML::Entities;
cpanm -v --notest -l ~/perl5 LWP::Simple;
cpanm -v --notest -l ~/perl5 Compress::Zlib;
perl -e 'use Archive::Extract'
perl -e 'use HTML::Entities'
perl -e 'use HTML::HeadParser'
perl -e 'use LWP::Simple'
perl -e 'use Archive::Zip'
perl -e 'use Compress::Zlib'
cpanm -v --notest -l ~/perl5 DBD::mysql;
perl -e 'use DBD::mysql'
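The individual `perl -e 'use ...'` checks above can be folded into one loop that reports which modules still fail to load. This is only a sketch; the function names are invented here, and the module list is simply the one used in this post.

```bash
# has_perl_module NAME: succeed if perl can load the module NAME.
# (Hypothetical helper for illustration.)
has_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# report_perl_modules NAME...: print one "ok" or "MISSING" line per module.
report_perl_modules() {
    for mod in "$@"; do
        if has_perl_module "$mod"; then
            printf 'ok       %s\n' "$mod"
        else
            printf 'MISSING  %s\n' "$mod"
        fi
    done
}

# Example (not run here):
# report_perl_modules Archive::Extract Archive::Zip HTML::Entities \
#     HTML::HeadParser LWP::Simple Compress::Zlib DBD::mysql
```

Any module reported as MISSING can then be installed with the matching `cpanm` line.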
Convert the offline cache for use with tabix, which significantly speeds up the lookup of known variants:
#perl convert_cache.pl --species homo_sapiens --version 86_GRCh37 --dir $VEP_DATA
perl convert_cache.pl --species homo_sapiens --version 86_GRCh38 --dir $VEP_DATA
#perl convert_cache.pl --species mus_musculus --version 86_GRCm38 --dir $VEP_DATA
## This step is especially time-consuming
For more details, see the notes I shared earlier on the 生信菜鸟团 blog: http://www.bio-info-trainee.com/1600.html
The run looks like this:
2018-04-27 13:42:12 - Processing homo_sapiens
2018-04-27 13:42:12 - Processing version 86_GRCh38
2018-04-27 13:42:12 - Processing _var cache type
[===========================================================] [ 100% ]
2018-04-27 14:59:39 - All done!
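To put a number on "time-consuming", the first and last timestamps of the log above can be subtracted. The sketch below does this in plain shell arithmetic and assumes both timestamps fall on the same day; the function names are made up for illustration.

```bash
# hms_to_seconds HH:MM:SS: convert a clock time to seconds since midnight.
# (Hypothetical helper for illustration.)
hms_to_seconds() {
    old_ifs=$IFS
    IFS=:
    set -- $1            # split on ':' into $1 $2 $3
    IFS=$old_ifs
    # Strip one leading zero so values such as "08" are not read as octal.
    echo $(( ${1#0} * 3600 + ${2#0} * 60 + ${3#0} ))
}

# elapsed_between START END: print END - START as minutes and seconds.
elapsed_between() {
    total=$(( $(hms_to_seconds "$2") - $(hms_to_seconds "$1") ))
    printf '%dm%02ds\n' $(( total / 60 )) $(( total % 60 ))
}

elapsed_between 13:42:12 14:59:39   # prints 77m27s
```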
Download and build samtools and bcftools, which we'll need for steps below, and when running vcf2maf/maf2maf:
mkdir $VEP_PATH/samtools && cd $VEP_PATH/samtools
curl -LOOO https://github.com/samtools/{samtools/releases/download/1.3.1/samtools-1.3.1,bcftools/releases/download/1.3.1/bcftools-1.3.1,htslib/releases/download/1.3.2/htslib-1.3.2}.tar.bz2
cat *tar.bz2 | tar -ijxf -
cd htslib-1.3.2 && make && make prefix=$VEP_PATH/samtools install && cd ..
cd samtools-1.3.1 && make && make prefix=$VEP_PATH/samtools install && cd ..
cd bcftools-1.3.1 && make && make prefix=$VEP_PATH/samtools install && cd ..
cd ..
Download the liftOver binary into the same path, and make it executable:
curl -L http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver > $VEP_PATH/samtools/bin/liftOver
chmod a+x $VEP_PATH/samtools/bin/liftOver
Set $PATH to find all those tools, and also add this line to your ~/.bashrc to make it persistent. Be sure to edit the path below, if you didn't do this in your $HOME:
export PATH=$HOME/vep/samtools/bin:$PATH
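After updating `$PATH`, it can be worth confirming that the shell really finds the freshly built tools. A small sketch follows; the function name `missing_tools` is hypothetical.

```bash
# missing_tools NAME...: print every command that cannot be found on $PATH.
# (Hypothetical helper for illustration.)
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Example (not run here): an empty result means everything was found.
# missing_tools samtools bcftools tabix bgzip liftOver
```

If a name is printed, the corresponding build or download step above is the first place to look.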
It is usually worth reading the help text first:
perl ~/vep/variant_effect_predictor.pl --help
#----------------------------------#
# ENSEMBL VARIANT EFFECT PREDICTOR #
#----------------------------------#
version 86
by Will McLaren (wm2@ebi.ac.uk)
Help: dev@ensembl.org , helpdesk@ensembl.org
Twitter: @ensembl , @EnsemblWill
http://www.ensembl.org/info/docs/tools/vep/script/index.html
Usage:
perl variant_effect_predictor.pl [--cache|--offline|--database] [arguments]
Basic options
=============
--help Display this message and quit
-i | --input_file Input file
-o | --output_file Output file
--force_overwrite Force overwriting of output file
--species [species] Species to use [default: "human"]
--everything Shortcut switch to turn on commonly used options. See web
documentation for details [default: off]
--fork [num_forks] Use forking to improve script runtime
For full option documentation see:
http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html
The input data is usually in VCF format: http://samtools.github.io/hts-specs/VCFv4.2.pdf
Mine is not that standard, though; here is what I supplied:
chr1 12861477 . T C . . 32:1:3.03%:T:23:8:25.81%
chr1 16588939 . T C . . 22:0:0%:T:8:3:27.27%
chr1 16703018 . C G . . 28:0:0%:C:21:6:22.22%
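The rows above have no header, and the eighth column carries a custom string, but the first five columns (chromosome, position, ID, REF, ALT) are in VCF order. A quick sanity check along these lines can catch malformed rows before annotation. It assumes the first five columns are the ones that matter here, and `check_vcf_columns` is a made-up name.

```bash
# check_vcf_columns FILE: report data lines that have fewer than five
# whitespace-separated fields or a non-numeric position in column 2.
# Lines starting with '#' are skipped. Exit status is non-zero if any line fails.
# (Hypothetical helper for illustration.)
check_vcf_columns() {
    awk '
        /^#/ { next }
        NF < 5 || $2 !~ /^[0-9]+$/ { printf "line %d: %s\n", NR, $0; bad = 1 }
        END { exit bad }
    ' "$1"
}

# Example (not run here):
# check_vcf_columns tmp.vcf && echo "all rows look usable"
```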
VEP handles this input without any trouble:
perl ~/vep/variant_effect_predictor.pl -i tmp.vcf -o test.results \
--cache --force_overwrite --assembly GRCh38 --vcf
The results are really not much different from snpEFF's; it is just a tool, so use whichever you find handy.
It supports several input formats:
Any other files can be easily converted to be compatible with the VEP; the easiest format to produce is a BED-like file containing coordinates and an (optional) identifier:
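The key point is simply to give the coordinates of your variant: which chromosome and what position!
Note, however, that when I tested BED format, it did not seem to work.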
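If a coordinate-only format is wanted, one option is VEP's whitespace-separated default input format (chromosome, start, end, alleles as REF/ALT, strand), as described in the Ensembl documentation. The sketch below converts single-base substitutions from VCF-like rows into that layout. It deliberately skips indels, whose coordinates follow different rules, and the function name is hypothetical. Please check the format description for your VEP version before relying on it.

```bash
# vcf_snvs_to_ensembl FILE: print single-base substitutions as
# "chrom <TAB> start <TAB> end <TAB> REF/ALT <TAB> +".
# Header lines and anything that is not a one-base REF and ALT are skipped.
# (Hypothetical helper for illustration; assumes VEP's default input layout.)
vcf_snvs_to_ensembl() {
    awk '
        /^#/ { next }
        $4 ~ /^[ACGT]$/ && $5 ~ /^[ACGT]$/ {
            # For an SNV, start and end are both the VCF position.
            printf "%s\t%s\t%s\t%s/%s\t+\n", $1, $2, $2, $4, $5
        }
    ' "$1"
}

# Example (not run here):
# vcf_snvs_to_ensembl tmp.vcf > tmp.ensembl.txt
```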
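I suggest printing out the documentation and working through it slowly until you know it well.
I printed out the documentation for snpEFF's output file myself.
It is very important.
Of course, you might prefer snpEFF: "Installing snpEFF and annotating VCF files [Live] My Genome 85"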