What You Think May Not Be What You Think

Author: 生信技能树
Published 2018-07-27 14:35:27

Recently XZG, one of the young admins of 生信技能树, was showing off to me the GVCF mode he ran on his plant reduced-representation sequencing project, two hundred sequencing datasets in total. I had never used this gVCF feature myself, simply because I never needed it: in cancer research the focus is mainly on somatic mutations.

I had also always assumed that gVCF reports every single site, whether or not there is a variant there, as long as it is covered by sequencing, so the output would have an absurd number of lines. Since germline variants in a normal human genome occur at roughly one per thousand bases, the resulting gVCF would supposedly balloon to about 1000 times as many lines as a regular VCF.

But after a quick test on a single genomic interval, I found that things may not be as simple as I had imagined.

$GATK  HaplotypeCaller  -ERC GVCF  -L $bed -R $GENOME -I $bam  --dbsnp $DBSNP -O tmp.vcf

The GATK version I used is as follows:

Using GATK jar /home/jianmingzeng/biosoft/GATK/gatk-4.0.3.0/gatk-package-4.0.3.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false 
    -Dsamjdk.use_async_io_write_samtools=true 
    -Dsamjdk.use_async_io_write_tribble=false 
    -Dsamjdk.compression_level=2 
    -Xmx20G -Djava.io.tmpdir=./ 
    -jar /home/jianmingzeng/biosoft/GATK/gatk-4.0.3.0/gatk-package-4.0.3.0-local.jar HaplotypeCaller
USAGE: HaplotypeCaller [arguments]
Call germline SNPs and indels via local re-assembly of haplotypes
Version:4.0.3.0

The interval I tested is 1217 bp long, which can also be seen in the GATK run log:

IntervalArgumentCollection - Processing 1217 bp from intervals
2249 read(s) filtered by: MappingQualityReadFilter

So, in principle, the output VCF should contain 1217 records. The result:

 grep -v "^#" tmp.vcf|wc
    119    1190    9875

Why were so many positions missing, then? My first thought was that most of the interval I gave was simply not covered by this BAM file, in other words not sequenced at all.

So I checked with the following command:

samtools mpileup  -r chr1:68940-70157 $bam |wc
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
   1218    7308  581882

It turned out the whole interval I gave is covered. My next thought was that many reads might have been filtered out for low mapping quality, so I took a look at the quality:

samtools view $bam chr1:68940-70157|wc
   2896   57920 1475460
samtools view $bam -q 20  chr1:68940-70157|wc
    651   13020  334253
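
As a rough cross-check against the MappingQualityReadFilter line in the GATK log above (my own sketch, not part of the original test; the counts will not match GATK exactly, because HaplotypeCaller applies several other read filters as well), the reads below MQ 20 can be counted directly with samtools view -c:

# total reads overlapping the interval
samtools view -c $bam chr1:68940-70157
# reads with mapping quality >= 20
samtools view -c -q 20 $bam chr1:68940-70157

Here the difference is 2896 - 651 = 2245 reads, which is close to the 2249 reads GATK reports as filtered by MappingQualityReadFilter.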

It seems that not many reads have high mapping quality here, owing to the peculiarity of the interval I chose. Next I looked at the first five columns of the gVCF file:

grep -v "^#" tmp.vcf|cut -f 1-5|head
chr1    68941   .   A   <NON_REF>
chr1    69072   .   G   <NON_REF>
chr1    69073   .   A   <NON_REF>
chr1    69083   .   A   <NON_REF>
chr1    69090   .   T   <NON_REF>
chr1    69101   .   A   <NON_REF>
chr1    69130   .   C   <NON_REF>
chr1    69159   .   G   <NON_REF>
chr1    69184   .   G   <NON_REF>
chr1    69185   .   T   <NON_REF>

Some coordinates are skipped while others are consecutive. Matching this up against mpileup:

samtools mpileup  -r chr1:68940-70157 $bam |cut -f 1,2,4,5|head
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
chr1    68940   35  TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT^!T
chr1    68941   36  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA^!A
chr1    68942   36  TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
chr1    68943   36  A$AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
chr1    68944   36  GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG^!G
chr1    68945   37  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA^!A
chr1    68946   38  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA^!A
chr1    68947   38  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
chr1    68948   40  GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG^!G^!G
chr1    68949   41  TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT^!T

Very strange: those skipped positions clearly do have sequencing depth!

Moreover, although read filters such as -q 20 -Q 20 do discard a large fraction of the reads, they do not actually throw away the coordinate information:

[jianmingzeng@jade germline]$ samtools mpileup -q 20 -Q 20  -r chr1:68940-70157 $bam |head
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
chr1    69072   N   1   ^:G ?
chr1    69073   N   2   A^:A    C?
chr1    69074   N   2   GG  >>
chr1    69075   N   2   TT  AA
chr1    69076   N   2   GG  ??
chr1    69077   N   2   AA  AC
chr1    69078   N   2   AA  BB
chr1    69079   N   2   AA  CB
chr1    69080   N   2   CC  AA
chr1    69081   N   2   GG  99
[jianmingzeng@jade germline]$ samtools mpileup  -r chr1:68940-70157 $bam |head |cut -f 1-5
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
chr1    68940   N   35  TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT^!T
chr1    68941   N   36  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA^!A
chr1    68942   N   36  TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
chr1    68943   N   36  A$AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
chr1    68944   N   36  GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG^!G
chr1    68945   N   37  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA^!A
chr1    68946   N   38  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA^!A
chr1    68947   N   38  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
chr1    68948   N   40  GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG^!G^!G
chr1    68949   N   41  TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT^!T

But before I could dig into the reason myself, XZG simply threw a link at me, which states quite clearly:

With GVCF, you get a gVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band.
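
The GQ bands mentioned in that quote can be inspected directly in the gVCF header. A quick check (my sketch, assuming GATK 4 writes its usual ##GVCFBlock header lines, one per GQ band):

# list the GQ band definitions used for the non-variant block records
grep "^##GVCFBlock" tmp.vcf | head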

Blocks are actually a good thing: they save space. If the file really recorded every position at bp resolution, it would be frighteningly large. On the other hand, since the non-variant positions are collapsed, it looked as though the per-site sequencing depth at those positions had been lost.
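
Incidentally (my own note, not from the original post): HaplotypeCaller does offer exactly that per-base mode. Passing -ERC BP_RESOLUTION instead of -ERC GVCF emits one record per covered position, the kind of per-site output I had originally imagined, at the cost of a much larger file (the output file name below is arbitrary):

# per-base reference confidence instead of banded blocks; expect a much bigger file
$GATK HaplotypeCaller -ERC BP_RESOLUTION -L $bed -R $GENOME -I $bam --dbsnp $DBSNP -O tmp.bp.vcf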

Thinking about it more carefully, though, the GATK team would surely not have overlooked this and let the depth information simply vanish; presumably only neighbouring sites with the same sequencing depth get collapsed into one block. The real problem was that I had only looked at the first five columns, when I should have looked at all of them:

grep -v "^#" tmp.vcf|cut -f 1-8|head
chr1    68941   .   A   <NON_REF>   .   .   END=69071
chr1    69072   .   G   <NON_REF>   .   .   END=69072
chr1    69073   .   A   <NON_REF>   .   .   END=69082
chr1    69083   .   A   <NON_REF>   .   .   END=69089
chr1    69090   .   T   <NON_REF>   .   .   END=69100
chr1    69101   .   A   <NON_REF>   .   .   END=69129
chr1    69130   .   C   <NON_REF>   .   .   END=69158
chr1    69159   .   G   <NON_REF>   .   .   END=69183
chr1    69184   .   G   <NON_REF>   .   .   END=69184
chr1    69185   .   T   <NON_REF>   .   .   END=69198

The collapsing is done very cleverly: consecutive non-variant sites are merged into blocks, with the END tag in the INFO column marking where each block stops.
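
Two quick sanity checks one could add here (my sketch, not in the original post): sum each record's span from its END tag to check whether the 119 records tile the whole 1217 bp interval, and peek at the genotype columns, where each block should still carry DP, GQ and MIN_DP, so the depth is summarized per block rather than lost:

# total bases covered: END - POS + 1 per record (records without an END tag count as 1 bp)
grep -v "^#" tmp.vcf | awk -F'\t' '{ if (match($8,/END=[0-9]+/)) e=substr($8,RSTART+4,RLENGTH-4); else e=$2; n+=e-$2+1 } END { print NR" records, "n" bp" }'
# the FORMAT/sample columns keep per-block depth and quality
grep -v "^#" tmp.vcf | cut -f 9,10 | head -3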

For details on GATK's gVCF, see: https://software.broadinstitute.org/gatk/documentation/article.php?id=3893

https://gatkforums.broadinstitute.org/gatk/discussion/4017/what-is-a-gvcf-and-how-is-it-different-from-a-regular-vcf

Originally published 2018-04-18 on the 生信技能树 WeChat public account.