DeepVariant ("A universal SNP and small-indel variant caller using deep neural networks", Nature Biotechnology 36, 983–987 (2018)) is an open-source, machine-learning-based variant caller from Google. A paper published in Scientific Reports early this year (https://www.nature.com/articles/s41598-022-05833-4) compares GATK and DeepVariant in detail; interested readers can read it for themselves.
Its conclusion: "Compared to GATK, DeepVariant had a shorter execution time and higher accuracy for clinical samples."
The DeepVariant production models are trained on six core channels (read base, base quality, mapping quality, strand of alignment, read supports variant, base differs from ref) as their basic input (https://google.github.io/deepvariant/posts/2022-06-09-adding-custom-channels/). Version 1.4 introduced an insert_size channel, which improved accuracy further. From the release notes: "For Illumina WGS and WES, we add an additional feature of read insert size (insert_size). This reduces errors by 4-10% for Illumina WGS and WES model." (https://github.com/google/deepvariant/releases)
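Interested readers can also run their own benchmark on the NA12878 gold-standard data with tools such as rtgtools and hap.py. My own impression is that DeepVariant produces markedly fewer false positives than GATK, and fewer false negatives as well. Two examples follow: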
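As a rough illustration of how such a benchmark is scored: tools like hap.py report true-positive, false-positive and false-negative counts against the truth set, and precision, recall and F1 follow from those. The sketch below assumes you already have the three counts. The function name vcf_metrics is hypothetical, and the counts in the usage line are made up for illustration; they are not results from any real run.

```shell
# vcf_metrics TP FP FN
# Prints precision, recall and F1 (4 decimal places) from benchmark counts.
# Hypothetical helper; not part of hap.py or rtgtools.
vcf_metrics() {
  awk -v tpos="$1" -v fpos="$2" -v fneg="$3" 'BEGIN {
    if (tpos + fpos == 0 || tpos + fneg == 0) { print "undefined"; exit 1 }
    p = tpos / (tpos + fpos)    # precision: fraction of calls that are true
    r = tpos / (tpos + fneg)    # recall: fraction of truth variants found
    f1 = (p + r > 0) ? 2 * p * r / (p + r) : 0
    printf "precision=%.4f recall=%.4f f1=%.4f\n", p, r, f1
  }'
}

# Illustrative counts only (not real data):
vcf_metrics 90 10 30
# prints: precision=0.9000 recall=0.7500 f1=0.8182
```

Running the same function on the counts for the GATK and DeepVariant call sets gives a like-for-like comparison, provided both were scored against the same truth set and confident regions.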
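The first is a variant at the edge of a non-uniquely-mappable region. GATK HaplotypeCaller failed to call the variant in the proband (a GATK false negative) and called it only in the mother, whereas DeepVariant called it correctly in both.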
The second lies next to a homopolymer (poly-A) run in the reference genome. GATK reported a low-VAF indel here, but DeepVariant classified the site as RefCall, i.e. not a variant.
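DeepVariant writes such sites to its output VCF with RefCall in the FILTER column, alongside PASS for accepted calls, so the two can be tallied directly. A minimal sketch, assuming a gzip- or bgzip-compressed VCF; the function name filter_counts is hypothetical:

```shell
# filter_counts FILE.vcf.gz
# Tallies records per FILTER value (column 7), e.g. PASS vs RefCall.
# Hypothetical helper; uses only gzip, awk and sort.
filter_counts() {
  gzip -dc "$1" |
    awk -F'\t' '!/^#/ { n[$7]++ } END { for (f in n) printf "%s\t%d\n", f, n[f] }' |
    sort
}
```

Calling filter_counts on a DeepVariant VCF prints one line per FILTER value with its record count. Since bgzip output is gzip-compatible, no extra tools are needed for a quick look.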
DeepVariant is best installed and run via Docker. A demo command line:
docker run --privileged --rm --user `id -u`:`id -g` \
  -v "/sg2/8.xuxiong/WES_Clinical/workstation_V6.2.0_WES_20220916A_T7/b.cram":"/input" \
  -v "/bi/8.xuxiong":"/output" \
  -v "/sg2/8.xuxiong/TargetSeqV6/genome":"/reference" \
  -v "/bi/8.xuxiong/database":"/database" \
  google/deepvariant:"latest" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WES \
  --ref=/reference/ucsc.hg19.fasta \
  --reads=/input/PES22090081-HE.deduped.cram \
  --regions chr1:215913883-215915883 \
  --output_vcf=/output/PES22090081.dv.vcf.gz \
  --output_gvcf=/output/PES22090081.raw.g.vcf.gz \
  --intermediate_results_dir /output/PES22090081_tmp_dir \
  --num_shards=8
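Before launching the container it can save time to confirm that the index files are in place on the host: DeepVariant expects an indexed reference (a .fai next to the FASTA) and an indexed alignment file (.crai for CRAM, .bai for BAM). The helper below is a hypothetical pre-flight check, not part of DeepVariant. It only tests that the files exist, and it assumes the index sits next to the data file with the suffix appended.

```shell
# check_dv_inputs REF.fasta READS.cram|READS.bam
# Returns 0 if the reference, reads and both index files exist, 1 otherwise.
# Hypothetical helper; assumes indexes are named <file>.fai / <file>.crai / <file>.bai.
check_dv_inputs() {
  ref=$1
  reads=$2
  ok=0
  [ -f "$ref" ]     || { echo "missing reference: $ref" >&2; ok=1; }
  [ -f "$ref.fai" ] || { echo "missing FASTA index: $ref.fai" >&2; ok=1; }
  [ -f "$reads" ]   || { echo "missing reads: $reads" >&2; ok=1; }
  case $reads in
    *.cram) idx="$reads.crai" ;;
    *)      idx="$reads.bai" ;;
  esac
  [ -f "$idx" ]     || { echo "missing alignment index: $idx" >&2; ok=1; }
  return $ok
}
```

For the command above, the host-side arguments would be /sg2/8.xuxiong/TargetSeqV6/genome/ucsc.hg19.fasta and /sg2/8.xuxiong/WES_Clinical/workstation_V6.2.0_WES_20220916A_T7/b.cram/PES22090081-HE.deduped.cram. Missing indexes can be created with samtools faidx and samtools index respectively.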