前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >小鼠全基因组数据分析

小鼠全基因组数据分析

作者头像
生信技能树
发布2018-08-16 17:20:47
2.3K0
发布2018-08-16 17:20:47
举报
文章被收录于专栏:生信技能树生信技能树

文章是:2017 Unexpected mutations after CRISPR–Cas9 editing in vivo

We performed WGS on a CRISPR–Cas9-edited mouse to identify all off-target mutations and found an unexpectedly high number of SNVs compared with the widely accepted assumption that CRISPR causes mostly indels at regions homologous to the sgRNA.

发表于 Nature Methods volume14, pages547–548 (2017) 作者已经撤稿,但是数据仍然在!

  • This article was retracted on 27 April 2018
  • Updated online 14 June 2017
  • Corrected online 25 July 2017
  • Retracted online 30 March 2018

DNA was isolated from two CRISPR-repaired mice (F03 and F05) and one uncorrected control

数据上传了:Bioproject (accession PRJNA382177). Sequencing data available at SRA:

  • FVB mouse (accession SRR5450998)
  • F03 mouse (accession SRR5450997),
  • F05 mouse (accession SRR5450996).

有点类似于肿瘤外显子的数据分析流程:

As additional controls, each of the variants was compared with the FVB/NJ genome in the mouse dbSNP database (v138), and each of the SNVs was also compared with all 36 strains in the Mouse Genome Project (v3).

首先下载数据

根据作者给出的ID号

下载:

代码语言:javascript
复制
cat srr.list |while read id;do (nohup ~/biosoft/sratoolkit/sratoolkit.2.8.2-1-centos_linux64/bin/prefetch $id -X 100G  & );done

下载得到的sra文件如下:

代码语言:javascript
复制
 67G Jun 29 18:58 SRR5450996.sra
 69G Jun 29 18:51 SRR5450997.sra
 39G Jun 29 18:15 SRR5450998.sra

sra转换为fastq后如下:

代码语言:javascript
复制
45G Jun 30 13:29 control_1.fastq.gz
47G Jun 30 13:29 control_2.fastq.gz
46G Jun 30 13:26 F03_1.fastq.gz
48G Jun 30 13:26 F03_2.fastq.gz
27G Jun 30 09:24 F05_1.fastq.gz
28G Jun 30 09:24 F05_2.fastq.gz

简单的使用trim_galore进行过滤后如下;

代码语言:javascript
复制
43G Jul  1 10:22 control_1_val_1.fq.gz
44G Jul  1 10:22 control_2_val_2.fq.gz
43G Jul  1 10:29 F03_1_val_1.fq.gz
45G Jul  1 10:29 F03_2_val_2.fq.gz
26G Jul  1 03:18 F05_1_val_1.fq.gz
27G Jul  1 03:18 F05_2_val_2.fq.gz

看起来没有啥太多数据被过滤,说明作者此次测序数据质量还不错!

小鼠WGS数据分析准备工作

一般来说,可以选择最新版小鼠参考基因组(mm10)了,如果你实在有其它需求,也可以自行选择其它版本。

遗憾的是GATK官网不提供小鼠相关资源:ftp://ftp.broadinstitute.org/bundle

所以只能去illumina官网下载咯: https://support.illumina.com/sequencing/sequencing_software/igenome.html

代码语言:javascript
复制
mkdir -p ~/biosoft/GATK/resources/bundle/mm10 
cd ~/biosoft/GATK/resources/bundle/mm10 
wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Mus_musculus/UCSC/mm10/Mus_musculus_UCSC_mm10.tar.gz 
nohup tar zxvf Mus_musculus_UCSC_mm10.tar.gz &

下载全部成功后,应该如下:

这个 Mus_musculus_UCSC_mm10.tar.gz &压缩包里面的数据文件实在是太多,我没办法列出来秀给大家了。

如果有要做 RealignerTargetCreator 需要下载dbsnp文件,参考: https://www.biostars.org/p/182917/

这个教程里面提到的文件非常之多,可以ftp登录看看,用户名就用guest即可。

代码语言:javascript
复制
cd ~/biosoft/GATK/resources/bundle/mm10 
mkdir dbsnp
cd dbsnp 
ftp ftp-mouse.sanger.ac.uk
cd current_indels/strain_specific_vcfs/
ls -lh  
lcd ./
prompt off
mget **

可以下载得到个小鼠品系的vcf文件如下:

代码语言:javascript
复制
 225.0M May  1  2015 129P2_OlaHsd.mgp.v5.snps.dbSNP142.vcf.gz
 201.7M May  1  2015 129S1_SvImJ.mgp.v5.snps.dbSNP142.vcf.gz
 206.7M May  1  2015 129S5SvEvBrd.mgp.v5.snps.dbSNP142.vcf.gz
 190.8M May  1  2015 A_J.mgp.v5.snps.dbSNP142.vcf.gz
 193.9M May  1  2015 AKR_J.mgp.v5.snps.dbSNP142.vcf.gz
 177.7M May  1  2015 BALB_cJ.mgp.v5.snps.dbSNP142.vcf.gz
 165.8M May  1  2015 BTBR_T__Itpr3tf_J.mgp.v5.snps.dbSNP142.vcf.gz
 199.1M May  1  2015 BUB_BnJ.mgp.v5.snps.dbSNP142.vcf.gz
 167.2M May  1  2015 C3H_HeH.mgp.v5.snps.dbSNP142.vcf.gz
 194.8M May  1  2015 C3H_HeJ.mgp.v5.snps.dbSNP142.vcf.gz
  16.5M May  1  2015 C57BL_10J.mgp.v5.snps.dbSNP142.vcf.gz
   6.8M May  1  2015 C57BL_6NJ.mgp.v5.snps.dbSNP142.vcf.gz
 101.6M May  1  2015 C57BR_cdJ.mgp.v5.snps.dbSNP142.vcf.gz
 100.2M May  1  2015 C57L_J.mgp.v5.snps.dbSNP142.vcf.gz
 108.4M May  1  2015 C58_J.mgp.v5.snps.dbSNP142.vcf.gz
 749.4M May  1  2015 CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz
 198.0M May  1  2015 CBA_J.mgp.v5.snps.dbSNP142.vcf.gz
 192.9M May  1  2015 DBA_1J.mgp.v5.snps.dbSNP142.vcf.gz
 197.9M May  1  2015 DBA_2J.mgp.v5.snps.dbSNP142.vcf.gz
 196.8M May  1  2015 FVB_NJ.mgp.v5.snps.dbSNP142.vcf.gz
 206.4M May  1  2015 I_LnJ.mgp.v5.snps.dbSNP142.vcf.gz
 216.4M May  1  2015 KK_HiJ.mgp.v5.snps.dbSNP142.vcf.gz
 234.6M May  1  2015 LEWES_EiJ.mgp.v5.snps.dbSNP142.vcf.gz
 206.0M May  1  2015 LP_J.mgp.v5.snps.dbSNP142.vcf.gz
 698.2M May  1  2015 MOLF_EiJ.mgp.v5.snps.dbSNP142.vcf.gz
 204.6M May  1  2015 NOD_ShiLtJ.mgp.v5.snps.dbSNP142.vcf.gz
 201.3M May  1  2015 NZB_B1NJ.mgp.v5.snps.dbSNP142.vcf.gz
 202.8M May  1  2015 NZO_HlLtJ.mgp.v5.snps.dbSNP142.vcf.gz
 212.4M May  1  2015 NZW_LacJ.mgp.v5.snps.dbSNP142.vcf.gz
 728.8M May  1  2015 PWK_PhJ.mgp.v5.snps.dbSNP142.vcf.gz
 203.2M May  1  2015 RF_J.mgp.v5.snps.dbSNP142.vcf.gz
 181.4M May  1  2015 SEA_GnJ.mgp.v5.snps.dbSNP142.vcf.gz
   1.5G May  1  2015 SPRET_EiJ.mgp.v5.snps.dbSNP142.vcf.gz
 197.1M May  1  2015 ST_bJ.mgp.v5.snps.dbSNP142.vcf.gz
 271.7M May  1  2015 WSB_EiJ.mgp.v5.snps.dbSNP142.vcf.gz
 267.7M May  1  2015 ZALENDE_EiJ.mgp.v5.snps.dbSNP142.vcf.gz

这些vcf文件的理解,需要对小鼠这个实验动物背景有一点了解,实际上这个时候我们需要的vcf文件应该是来自于dbSNP数据库的,需要需要的是dbsnp的rs ID号。 ftp://ftp.ncbi.nih.gov/snp/

代码语言:javascript
复制
cd ~/biosoft/GATK/resources/bundle/mm10/dbsnp
wget ftp://ftp.ncbi.nih.gov/snp/organisms/archive/mouse_10090/VCF/00-All.vcf.gz 
wget ftp://ftp.ncbi.nih.gov/snp/organisms/archive/mouse_10090/VCF/00-All.vcf.gz.tbi

下载的vcf文件也需要仔细理解哦:

代码语言:javascript
复制
##fileformat=VCFv4.0
##source=dbSNP
##dbSNP_BUILD_ID=150
##reference=GCF_000001635.24
##variationPropertyDocumentationUrl=ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf
##INFO=<ID=RSPOS,Number=1,Type=Integer,Description="Chr position reported in dbSNP">
##INFO=<ID=RV,Number=0,Type=Flag,Description="RS orientation is reversed">
##INFO=<ID=VP,Number=1,Type=String,Description="Variation Property.  Documentation is at ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf">
##INFO=<ID=GENEINFO,Number=.,Type=String,Description="Pairs each of gene symbol:gene id.  The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)">
##INFO=<ID=dbSNPBuildID,Number=1,Type=Integer,Description="First dbSNP Build for RS">
##INFO=<ID=SAO,Number=1,Type=Integer,Description="Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both">
##INFO=<ID=VC,Number=1,Type=String,Description="Variation Class">

GATK流程

gatk只是上游流程,

代码语言:javascript
复制
touch config.sra config.raw config.clean config.bam readme.txt 
mkdir -p  {sra,raw_fastq,clean_fastq,alignment,somatic,germline,cnv,logs,qc,ccf,qualimap}
mkdir -p somatic/{mutect2,varscan,muse}
mkdir -p cnv/{cnvkit,gatk4,sequenza,gistic2}
mkdir -p ccf/{ABSOLUTE,THetA2,PureCN,PyClone,BubbleTree} 
mkdir -p qc/{align,clean,counts,raw,recal,trim,somatic,germline,cnv,qualimap}

比对以及GATK中间流程的数据太耗费空间,就不保留了,而且非常好内存。

首先看比对成sam后的bam文件,大小是:

代码语言:javascript
复制
 138G Jul  6 23:55 control.bam
 139G Jul  7 00:12 F03.bam
 83G Jul  6 04:47 F05.bam

接下来 25G的内存已经是不够用了,报错如下:

代码语言:javascript
复制
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Runtime.totalMemory()=19129171968
[Sat Jul 07 05:45:29 CST 2018] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 85.82 minutes.
slurmstepd: *** JOB 164819 ON compute-2-5 CANCELLED AT 2018-07-07T05:44:49 ***
slurmstepd: Exceeded job memory limit
slurmstepd: Job 164819 exceeded memory limit (26735988 > 26214400), being killed

调整到35G的内存,终于可以成功运行,得到的bam文件如下:

代码语言:javascript
复制
123G Jul  9 20:00 control_marked_fixed.bam
125G Jul  9 21:20 F03_marked_fixed.bam
80G Jul  9 16:44 F05_marked_fixed.bam

接下来的重头戏是 GATK的BaseRecalibratorHaplotypeCaller,可以参考:https://www.jianshu.com/p/49d035b121b8

代码语言:javascript
复制
for sample in {control,F03,F05};
do 
echo $sample
ls -lh ${sample}_marked_fixed.bam
$GATK --java-options "-Xmx20G -Djava.io.tmpdir=./"  BaseRecalibrator \
    -R $ref  \
    -I ${sample}_marked_fixed.bam  \
    --known-sites $snp \
    --known-sites $indel \
    -O ${sample}_recal.table \
    1>${sample}_log.recal 

$GATK --java-options "-Xmx20G -Djava.io.tmpdir=./"   ApplyBQSR \
    -R $GENOME  \
    -I ${sample}_marked_fixed.bam  \
    -bqsr ${sample}_recal.table \
    -O ${sample}_bqsr.bam \
    1>${sample}_log.ApplyBQSR 

## 使用GATK的HaplotypeCaller命令
$GATK --java-options "-Xmx20G -Djava.io.tmpdir=./" HaplotypeCaller \
     -R $ref  \
     -I ${sample}_bqsr.bam \
      --dbsnp $snp \
      -O ${sample}_raw.vcf \
      1>${sample}_log.HC 

done  

其实这样的shell脚本是很烂的, 因为这个小鼠全基因组数据太大,如果这样一个步骤接着一个步骤,一个样本接着一个样本的运行,大半个月就过去了。

而且上面的数据都是非常大的,可以选择小批量数据进行测试脚本:

代码语言:javascript
复制
mkdir test 
samtools view -h control_marked_fixed.bam |head -100000 |samtools sort -o test/control_marked_fixed.bam -
cd test 
module load java/1.8.0_91
GENOME=/home/jianmingzeng/biosoft/GATK/resources/bundle/mm10/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa
ref=$GENOME
INDEX=/home/jianmingzeng/biosoft/GATK/resources/bundle/mm10/Mus_musculus/UCSC/mm10/Sequence/BWAIndex/genome.fa
GATK=/home/jianmingzeng/biosoft/GATK/gatk-4.0.2.1/gatk
dbsnpPath=/home/jianmingzeng/biosoft/GATK/resources/bundle/mm10/dbsnp
snp1=${dbsnpPath}/129S5SvEvBrd.mgp.v5.snps.dbSNP142.vcf.gz
snp2=${dbsnpPath}/C57BL_6NJ.mgp.v5.snps.dbSNP142.vcf.gz
snp3=${dbsnpPath}/I_LnJ.mgp.v5.snps.dbSNP142.vcf.gz
indel1=${dbsnpPath}/129S5SvEvBrd.mgp.v5.indels.dbSNP142.normed.vcf.gz
indel2=${dbsnpPath}/C57BL_6NJ.mgp.v5.indels.dbSNP142.normed.vcf.gz
indel3=${dbsnpPath}/I_LnJ.mgp.v5.indels.dbSNP142.normed.vcf.gz
sample=control
$GATK --java-options "-Xmx20G -Djava.io.tmpdir=./"  BaseRecalibrator \
    -R $ref  \
    -I ${sample}_marked_fixed.bam  \
    --known-sites $snp1 --known-sites $snp2 --known-sites $snp3 \
    --known-sites $indel1 --known-sites $indel2 --known-sites $indel3 \
    -O ${sample}_recal.table \
    1>${sample}_log.recal 

尤其值得注意的是染色体标记错误:

代码语言:javascript
复制
A USER ERROR has occurred: Input files reference and features have incompatible contigs: No overlapping contigs found.
  reference contigs = [chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY]
  features contigs = [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 3, 4, 5, 6, 7, 8, 9, MT, X, Y]

也就是说我们给的vcf文件里面的染色体是没有chr这个前缀,可是我们给的参考基因组里面却有这个前缀。

GATK的gvcf模式

这里有3只小鼠需要进行基因型比较,所以gvcf模式比较合适,使用 GATK的 Joint Calling , 过滤参考:https://mp.weixin.qq.com/s/W8Vfv1WmW6M7U0tIcPtlng

因为学校服务器最近总是出问题,所以这个流程只能是半成品,待续。

本文参与 腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2018-08-03,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 生信技能树 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 首先下载数据
  • 小鼠WGS数据分析准备工作
  • GATK流程
  • GATK的gvcf模式
相关产品与服务
云服务器
云服务器(Cloud Virtual Machine,CVM)提供安全可靠的弹性计算服务。 您可以实时扩展或缩减计算资源,适应变化的业务需求,并只需按实际使用的资源计费。使用 CVM 可以极大降低您的软硬件采购成本,简化 IT 运维工作。
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档