(第二弹代码实现)iSA:DNA甲基化新的分析方法

接着上一篇 iSA:DNA甲基化新的分析方法(part 1) 下载并且安装好了软件工具+数据后,进行下一步分析了。


3.2 数据下载并且merge到fq中

$ fastq-dump --split-3 SRR1286778 SRR1286779 SRR1286780 SRR1286781 SRR1286782
(PS :先下载我们所需要的数据
 --split-3参数可以将PE的sra文件解压后的fastq文件拆分成_1.fastq和_2.fastq)
cat SRR1286778_1.fastq SRR1286779_1.fastq SRR1286780_1.fastq SRR1286781_1.fastq SRR1286782_1.fastq > m2C_1.fq
cat SRR1286778_2.fastq SRR1286779_2.fastq SRR1286780_2.fastq SRR1286781_2.fastq SRR1286782_2.fastq > m2C_2.fq
(PS : cat 命令是把得到的fq文件全部merge到一起)

下载结果(中间断过一次网,数据重新下T_T)

  可以看到,merge的fq文件非常大,接下来对数据进行质控!

3.3 用的是Trimmomatic这个软件(基于Java)

$ java -jar /Trimmomatic-0.38/trimmomatic-0.38.jar PE -threads 8 \
m2C_1.fq m2C_2.fq  \
m2C_1_trim.fq s1 m2C_2_trim.fq s2 \
ILLUMINACLIP:/Trimmomatic-0.38/adapters/TruSeq3-PE-2.fa:0:0:2 TRAILING:20  MINLEN:20

$ fastqc -t 2 --nogroup m2C_1_trim.fq m2C_2_trim.fq

(ps: java -jar 后面跟的是trimmomatic这个软件的路径
PE 是pair-end,如果是SE, 则是single-end
m2C_1.fq m2C_2.fq是上一步merge的fq文件
m2C_1_trim.fq s1 m2C_2_trim.fq s2 :m2C_1_trim.fq是经过质控后的fq,s1是被过滤的fq 后面同理 
fastqc是是看看质控后还有没有低质量的碱基数,如果存在,还需要重新质控)

(重新质控的代码:
java -jar /Trimmomatic-0.38/trimmomatic-0.38.jar PE -threads 8m2C_1.fq m2C_2.fq m2C_1_trim.fq s1 m2C_2_trim.fq s2ILLUMINACLIP:/Trimmomatic-0.38/adapters/TruSeq3-PE-2.fa:0:0:2TRAILING:20 SLIDINGWINDOW:3:20 MINLEN:20)

  这里只上了一个质控的结果,主要关注一下质控后碱基质量分布和接头有没有去干净。如果去的效果不好,需要重新设置参数,跑质控的代码。

质控后的结果

3.4准备BS转化的基因组数据

3.4.1下载基因组数据:http://hgdownload.cse.ucsc.edu/goldenPath/mm10/chromosomes/
cd ~/reference
mkdir -p  genome/mm10  && cd genome/mm10 
nohup wget http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz  &
tar zvfx chromFa.tar.gz
cat *.fa > mm10.fa
rm chr*.fa
3.4.2进行BS转化。用的是bismark,这里用的是默认的bowtie1进行转化的
 bismark_genome_preparation --verbose mm10_bismark

(PS :这个mm10_bismark是一个装着.fa文件的文件夹
--verbose 输出log信息)
3.4.3 下载lambda的数据并且进行转化
wget  https://github.com/chxu02/iSA/blob/master/lambda.fa
(PS:把它存在一个类似于lambda.bismark的文件夹中)
 bismark_genome_preparation --verbose lambda_bismark

3.5 进行mapping

bismark  --multicore  3  --score_min  L,0,-0.4  mm10_bismark  -1 m2C_1_trim.fq  -2  m2C_2_trim.fq
(ps: --multicore 设置的是线程数,
--score_min  L,0,-0.4:是设置一个函数来控制对齐被认为是“有效”所需的最小对齐分数。 
  这是读长度的函数。 例如,指定L,0,-0.4将最小分数函数f设置为f(x)= 0  -0.4 * x
 mm10_bismark 是指输入的BS转化路径
-1 输入的第一个fq文件
-2输入的第二个fq文件)

  接下来看一下mapping的结果(ps:这一步mapping的过程花费的时间非常久,~7d) 可以看一下mapping的结果

重点是生成了这两个文件

对这个生成的bam文件进行检查:

3.6 去重复

在去重复之前check一下结果

check bam文件

deduplicate_bismark -p --bam m2C_1_trim_bismark_bt2_pe.bam

就会得到生成的文件:m2C_1_trim_bismark_bt2_pe.deduplicated.bam


接下来就可以跑文章给的脚本了. 脚本见 https://github.com/chxu02/iSA

[ $# != 3 ] && { echo "Usage: iSA.sh <name> <t1> <t2>"
                 echo "<name> name of the bam file from Bismark (only the string before '.bam')"
                 echo "<t1> number of bases uniformly trimmed off from 5' end of mate 1 reads (0 if no trimming)"
                 echo "<t2> number of bases uniformly trimmed off from 5' end of mate 2 reads (0 if no trimming)"
                 exit 1
               }

echo Checking library type...
pbat=`samtools view -H $1.bam | grep ' --pbat' | wc -l | cut -f 1 -d ' '`
if [ $pbat -gt 0 ]
then echo Library appears to be made with PBAT method and not compatible with iSA.
     echo Exiting...
     exit
fi

echo Writing bam file for Watson strand...
samtools view -h $1.bam | awk '$3!="chrM" && $3!="ChrM" && $16!="XG:Z:GA"' | samtools view -bh -@ 8 - > $1.Wat.bam;
echo Writing bam file for Crick strand...
samtools view -h $1.bam | awk '$3!="chrM" && $3!="ChrM" && $16!="XG:Z:CT"' | samtools view -bh -@ 8 - > $1.Cri.bam;
echo Writing bed file for Watson strand...
bamToBed -bedpe -i $1.Wat.bam | awk -v t1=$2 -v t2=$3 '{if($2<t1){$9=0} else{$9=$2-t1}; print $1,$9,$6+t2,NR,$7}' OFS='\t' | sort -k1,1 -k2,2n -k3,3n -u -S 64G > $1.Wat.bed;
echo Writing bed file for Crick strand...
bamToBed -bedpe -i $1.Cri.bam | awk -v t1=$2 -v t2=$3 '{if($2<t2){$9=0} else{$9=$2-t2}; print $1,$9,$6+t1,NR,$7}' OFS='\t' | sort -k1,1 -k2,2n -k3,3n -u -S 64G > $1.Cri.bed;
echo Heating...
echo Cooling down... Strands annealing...$'\n'
intersectBed -a $1.Wat.bed -b $1.Cri.bed -wa -wb -f 1 -F 1 -sorted | cut -f 1-5,9,10 > $1.iSA.bed;
num=`wc -l $1.iSA.bed | cut -f 1 -d ' '`
echo You found $num pairs of alignments between Watson and Crick. | tee $1.iSA.report.txt
wa=`wc -l $1.Wat.bed | cut -f 1 -d ' '`
cr=`wc -l $1.Cri.bed | cut -f 1 -d ' '`
wat=$(echo "scale=2; $num*100/$wa" | bc)
cri=$(echo "scale=2; $num*100/$cr" | bc)
echo Pairing efficiency: $wat% for Watson, $cri% for Crick.$'\n' | tee -a $1.iSA.report.txt
echo Calculating fold enrichment over random pairing...
N30=`awk '{if($2>29) print $1,$2-30,$3-30}' OFS='\t' $1.Wat.bed | intersectBed -a - -b $1.Cri.bed -wa -wb -f 1 -F 1 -sorted | wc -l`
echo $N30 pairs of alignments found with -30 bp shift, | tee -a $1.iSA.report.txt
N20=`awk '{if($2>19) print $1,$2-20,$3-20}' OFS='\t' $1.Wat.bed | intersectBed -a - -b $1.Cri.bed -wa -wb -f 1 -F 1 -sorted | wc -l`
echo $N20 pairs of alignments found with -20 bp shift, | tee -a $1.iSA.report.txt
N10=`awk '{if($2>9) print $1,$2-10,$3-10}' OFS='\t' $1.Wat.bed | intersectBed -a - -b $1.Cri.bed -wa -wb -f 1 -F 1 -sorted | wc -l`
echo $N10 pairs of alignments found with -10 bp shift, | tee -a $1.iSA.report.txt
P10=`awk '{print $1,$2+10,$3+10}' OFS='\t' $1.Wat.bed | intersectBed -a - -b $1.Cri.bed -wa -wb -f 1 -F 1 -sorted | wc -l`
echo $P10 pairs of alignments found with +10 bp shift, | tee -a $1.iSA.report.txt
P20=`awk '{print $1,$2+20,$3+20}' OFS='\t' $1.Wat.bed | intersectBed -a - -b $1.Cri.bed -wa -wb -f 1 -F 1 -sorted | wc -l`
echo $P20 pairs of alignments found with +20 bp shift, | tee -a $1.iSA.report.txt
P30=`awk '{print $1,$2+30,$3+30}' OFS='\t' $1.Wat.bed | intersectBed -a - -b $1.Cri.bed -wa -wb -f 1 -F 1 -sorted | wc -l`
echo $P30 pairs of alignments found with +30 bp shift, | tee -a $1.iSA.report.txt
fold=$(echo "scale=1; $num*6/($N30+$N20+$N10+$P10+$P20+$P30)" | bc)
echo You got $fold-fold enrichment over random pairing.$'\n' | tee -a $1.iSA.report.txt

read -p "Do you want to proceed (Y/N)?" answer
case $answer in
Y|y) echo Continue on... Extracting paired alignments...
     cut -f 4 $1.iSA.bed | awk '{print $1*2-1"\n"$1*2}' > $1.Wat.LN;
     cut -f 6 $1.iSA.bed | awk '{print $1*2-1"\n"$1*2}' > $1.Cri.LN;
     samtools view -H $1.Wat.bam > header;
     samtools view $1.Wat.bam | awk 'FNR==NR {h[$1];next} (FNR in h)' $1.Wat.LN - | cat header - | samtools view -bh -@ 8 - > $1.Wat.iSA.bam;
     samtools view $1.Cri.bam | awk 'FNR==NR {h[$1];next} (FNR in h)' $1.Cri.LN - | cat header - | samtools view -bh -@ 8 - > $1.Cri.iSA.bam;
     echo Extracting DNA methylation calls...
     bismark_methylation_extractor -p --multicore 8 --no_header --gzip $1.Wat.iSA.bam
     bismark_methylation_extractor -p --multicore 8 --no_header --gzip $1.Cri.iSA.bam
     echo Pairing DNA methylation calls between Watson and Crick...
     awk '{print $5"\t"NR}' $1.iSA.bed | sort -k1,1 > $1.Wat.ID;
     awk '{print $7"\t"NR}' $1.iSA.bed | sort -k1,1 > $1.Cri.ID;
     gzip -cd CpG_OT_$1.Wat.iSA.txt.gz | sort -k1,1 | join -j 1 - $1.Wat.ID | sed 's/ /\t/g' | awk '{print $6"_"$3"_"$4,$3,$4-1,$4+1,$5}' OFS='\t' | sort -k1,1 > $1.Wat.CG.me;
     gzip -cd CpG_OB_$1.Cri.iSA.txt.gz | sort -k1,1 | join -j 1 - $1.Cri.ID | sed 's/ /\t/g' | awk '{print $6"_"$3"_"$4-1,$3,$4-2,$4,$5}' OFS='\t' | sort -k1,1 > $1.Cri.CG.me;
     gzip -cd CHG_OT_$1.Wat.iSA.txt.gz | sort -k1,1 | join -j 1 - $1.Wat.ID | sed 's/ /\t/g' | awk '{print $6"_"$3"_"$4,$3,$4-1,$4+2,$5}' OFS='\t' | sort -k1,1 > $1.Wat.CHG.me;
     gzip -cd CHG_OB_$1.Cri.iSA.txt.gz | sort -k1,1 | join -j 1 - $1.Cri.ID | sed 's/ /\t/g' | awk '{print $6"_"$3"_"$4-2,$3,$4-3,$4,$5}' OFS='\t' | sort -k1,1 > $1.Cri.CHG.me;
     join -j 1 $1.Wat.CG.me $1.Cri.CG.me | sed 's/ /\t/g' | awk '{if($5=="z" && $9=="z"){print $2,$3,$4,1,0,0,0} else if($5=="z" && $9=="Z"){print $2,$3,$4,0,1,0,0} else if($5=="Z" && $9=="z"){print $2,$3,$4,0,0,1,0} else{print $2,$3,$4,0,0,0,1}}' OFS='\t' | sort -k1,1 -k2,2n | groupBy -g 1-3 -c 4,5,6,7 -o sum > $1.intraCpG.bed;
     join -j 1 $1.Wat.CHG.me $1.Cri.CHG.me | sed 's/ /\t/g' | awk '{if($5=="x" && $9=="x"){print $2,$3,$4,1,0,0,0} else if($5=="x" && $9=="X"){print $2,$3,$4,0,1,0,0} else if($5=="X" && $9=="x"){print $2,$3,$4,0,0,1,0} else{print $2,$3,$4,0,0,0,1}}' OFS='\t' | sort -k1,1 -k2,2n | groupBy -g 1-3 -c 4,5,6,7 -o sum > $1.intraCWG.bed;
     awk '{To+=$4+$5+$6+$7; Un+=$4; HC+=$5; HW+=$6; Me+=$7} END{print "You found "To" intraCpGs, of which:\n"Un" are unmethylated,\n"HW" are hemi-Watson,\n"HC" are hemi-Crick,\n"Me" are methylated.\n"}' $1.intraCpG.bed | tee -a $1.iSA.report.txt
     awk '{To+=$4+$5+$6+$7; Un+=$4; HC+=$5; HW+=$6; Me+=$7} END{print "You found "To" intraCWGs, of which:\n"Un" are unmethylated,\n"HW" are hemi-Watson,\n"HC" are hemi-Crick,\n"Me" are methylated.\n"}' $1.intraCWG.bed | tee -a $1.iSA.report.txt;;
N|n) echo Cleaning up...
     rm $1.Wat.bam $1.Cri.bam $1.Wat.bed $1.Cri.bed
     exit;;
esac

read -p "Remove temp files (Y/N)?" clean
case $clean in
Y|y) echo Cleaning up...
     rm $1.Wat.bam $1.Cri.bam $1.Wat.bed $1.Cri.bed $1.Wat.LN $1.Cri.LN header $1.Wat.iSA.M-bias_R?.png $1.Cri.iSA.M-bias_R?.png $1.Wat.ID $1.Cri.ID $1.Wat.CG.me $1.Cri.CG.me $1.Wat.CHG.me $1.Cri.CHG.me;;
N|n) exit;;
esac

echo The end of iSA.

PS: 1 :在跑代码的时候有一块需要注意的是 ,这里的'\t' 我用的是双引号"\t",单引号的在结果当中会出现\t在文中的字符里面。而不是分隔符的形式。

遇到的问题

2:上个图片中t1和t2默认的是0,它指是是5'端的位置cut的adapter的碱基数,一般进行质控是不会从5'端去掉碱基的。如果去了,需要进行相应的补偿对齐。


后记:   这是一篇难产的笔记,本来早就写好了,但是各种原因非常的不容易才整理出来。第一次学习半甲基化的分析真的是各种的踩坑,有点不容易呀。继续加油,欢迎一起讨论~ Ref: 1:Xu, C. & Corces, V. G. Nascent DNA methylome mapping reveals inheritance of hemimethylation at CTCF/cohesin sites. Science 359, 1166-1170, doi:10.1126/science.aan5480 (2018). 2:Xu, C. & Corces, V. G. Resolution of the DNA methylation state of single CpG dyads using in silico Strand Annealing and WGBS data. Nat Protoc 14, 202-216, doi:10.1038/s41596-018-0090-x (2019).

本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。

发表于

我来说两句

0 条评论
登录 后参与评论

扫码关注云+社区

领取腾讯云代金券