前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >甲基化测序数据分析之methylKit

甲基化测序数据分析之methylKit

作者头像
生信技能树
发布2022-01-10 08:45:00
2.7K0
发布2022-01-10 08:45:00
举报
文章被收录于专栏:生信技能树
我前面的甲基化教程主要是针对450k这样的芯片,所以champ流程就绰绰有余,很多小伙伴在咱们后台咨询甲基化测序数据分析,恰好最近实习生投稿:

下面是去年实习生的分享

  • methylKit是一个用于分析甲基化测序数据的R包,不仅支持WGBS,RRBS和目的区域甲基化测序,还支持oxBS-sq,TAB-seq等分析5hmc的数据。其核心功能是差异甲基化分析和差异甲基化位点和区域的注释。
  • 主要步骤包括数据描述性分析,聚类、样品质量可视化、差异甲基化分析和注释特征等功能。
  • 分析流程图如下:
  • 参考资料:
  • https://bioconductor.org/packages/release/bioc/vignettes/methylKit/inst/doc/methylKit.html#37_Correcting_for_overdispersion
  • https://static-content.springer.com/esm/art%3A10.1186%2Fgb-2012-13-10-r87/MediaObjects/13059_2012_2930_MOESM3_ESM.pdf

一、读取数据

  • 需要注意的是不同的文件形式使用不一样的函数

1.1 纯文本数据(methRead)

代码语言:javascript
复制
# BiocManager::install('methylKit')
# library(methylKit)
file.list=list( system.file("extdata", 
                            "test1.myCpG.txt", package = "methylKit"),
                system.file("extdata",
                            "test2.myCpG.txt", package = "methylKit"),
                system.file("extdata", 
                            "control1.myCpG.txt", package = "methylKit"),
                system.file("extdata", 
                            "control2.myCpG.txt", package = "methylKit") )


# read the files to a methylRawList object: myobj
myobj=methRead(file.list,
               sample.id=list("test1","test2","ctrl1","ctrl2"),
               assembly="hg18", # the genome 
               # 0:control,1:test samples
               treatment=c(1,1,0,0), 
               context="CpG",
               mincov = 10 # minimum read coverage,defaults to 10
)
class(myobj)
代码语言:javascript
复制
## [1] "methylRawList"
## attr(,"package")
## [1] "methylKit"
代码语言:javascript
复制
head(myobj[[1]])
代码语言:javascript
复制
##     chr   start     end strand coverage numCs numTs
## 1 chr21 9764539 9764539      -       12     3     9
## 2 chr21 9764513 9764513      -       12     0    12
## 3 chr21 9820622 9820622      +       13     0    13
## 4 chr21 9837545 9837545      +       11     0    11
## 5 chr21 9849022 9849022      +      124    90    34
## 6 chr21 9853326 9853326      +       17    12     5

每一行是一个甲基化位点,coverage代表覆盖这个位点的reads数,numCs代表甲基化C的比例,numTs代表非甲基化C的比例。文件大小相比bam文件小很多,读取的速度更快。

1.2 bam文件(processBismarkAln)

代码语言:javascript
复制
######### bam文件读取 #########
# reading multiple files
file.list2=list(system.file("extdata", "test.fastq_bismark.sorted.min.sam", 
                            package = "methylKit"),
                system.file("extdata", "test.fastq_bismark.sorted.min.sam", 
                            package = "methylKit"),
                system.file("extdata", "test.fastq_bismark.sorted.min.sam", 
                            package = "methylKit"),
                system.file("extdata", "test.fastq_bismark.sorted.min.sam",
                            package = "methylKit"))

objs=processBismarkAln(location=file.list2,
                       sample.id=list("test1","test2","ctrl1","ctrl1"),assembly="hg18",
                       save.folder=NULL,save.context=NULL,read.context="CpG",
                       nolap=FALSE,mincov=10,minqual=20,phred64=FALSE,
                       treatment=c(1,1,0,0))
代码语言:javascript
复制
## Conversion Statistics:
## 
## total otherC considered (>95% C+T): 10
## average conversion rate = 100
## total otherC considered (Forward) (>95% C+T): 10
## average conversion rate (Forward) = 100
## total otherC considered (Reverse) (>95% C+T): 0
## average conversion rate (Reverse) = 0
## 
## Reading methylation percentage per base for sample: test1 
## 
## Conversion Statistics:
## 
## total otherC considered (>95% C+T): 10
## average conversion rate = 100
## total otherC considered (Forward) (>95% C+T): 10
## average conversion rate (Forward) = 100
## total otherC considered (Reverse) (>95% C+T): 0
## average conversion rate (Reverse) = 0
## 
## Reading methylation percentage per base for sample: test2 
## 
## Conversion Statistics:
## 
## total otherC considered (>95% C+T): 10
## average conversion rate = 100
## total otherC considered (Forward) (>95% C+T): 10
## average conversion rate (Forward) = 100
## total otherC considered (Reverse) (>95% C+T): 0
## average conversion rate (Reverse) = 0
## 
## Reading methylation percentage per base for sample: ctrl1 
## 
## Conversion Statistics:
## 
## total otherC considered (>95% C+T): 10
## average conversion rate = 100
## total otherC considered (Forward) (>95% C+T): 10
## average conversion rate (Forward) = 100
## total otherC considered (Reverse) (>95% C+T): 0
## average conversion rate (Reverse) = 0
## 
## Reading methylation percentage per base for sample: ctrl1
代码语言:javascript
复制
# nolap 
# TRUE:SAM/BAM file has paired-end reads双端测序, the one read of the overlapping paired-end read pair will be ignored for methylation calling.
# mincov:minimum read coverage 
# minqual:minimum phred quality score

2.描述性统计量

  • 读取数据后,可以检查下覆盖率和甲基化百分比
代码语言:javascript
复制
getMethylationStats(myobj[[2]],plot=FALSE,both.strands=FALSE)
代码语言:javascript
复制
## methylation statistics per base
## summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   20.00   82.79   63.17   94.74  100.00 
## percentiles:
##        0%       10%       20%       30%       40%       50%       60%       70% 
##   0.00000   0.00000   0.00000  48.38710  70.00000  82.78556  90.00000  93.33333 
##       80%       90%       95%       99%     99.5%     99.9%      100% 
##  96.42857 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000
代码语言:javascript
复制
getMethylationStats(myobj[[2]],plot=TRUE,both.strands=FALSE)
代码语言:javascript
复制
getMethylationStats(myobj[[2]],plot=TRUE,both.strands=TRUE)
代码语言:javascript
复制
getCoverageStats(myobj[[2]],plot=TRUE,both.strands=FALSE)

3.过滤

代码语言:javascript
复制
filtered.myobj=filterByCoverage(myobj,lo.count=10,lo.perc=NULL,
                                hi.count=NULL,hi.perc=99.9)
getMethylationStats(filtered.myobj[[2]],plot=FALSE,both.strands=FALSE)
代码语言:javascript
复制
## methylation statistics per base
## summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   20.00   82.76   63.13   94.74  100.00 
## percentiles:
##        0%       10%       20%       30%       40%       50%       60%       70% 
##   0.00000   0.00000   0.00000  48.23226  70.00000  82.75862  90.00000  93.33333 
##       80%       90%       95%       99%     99.5%     99.9%      100% 
##  96.42857 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000

二、合并样本

  • 将所有样本的甲基化情况合并,得到所有样本的甲基化表达谱
代码语言:javascript
复制
meth=unite(myobj, destrand=FALSE)
代码语言:javascript
复制
## uniting...
代码语言:javascript
复制
head(meth)
代码语言:javascript
复制
##     chr   start     end strand coverage1 numCs1 numTs1 coverage2 numCs2 numTs2
## 1 chr21 9853296 9853296      +        17     10      7       333    268     65
## 2 chr21 9853326 9853326      +        17     12      5       329    249     79
## 3 chr21 9860126 9860126      +        39     38      1        83     78      5
## 4 chr21 9906604 9906604      +        68     42     26       111     97     14
## 5 chr21 9906616 9906616      +        68     52     16       111    104      7
## 6 chr21 9906619 9906619      +        68     59      9       111    109      2
##   coverage3 numCs3 numTs3 coverage4 numCs4 numTs4
## 1        18     16      2       395    341     54
## 2        16     14      2       379    284     95
## 3        83     83      0        41     40      1
## 4        23     18      5        37     33      4
## 5        23     14      9        37     27     10
## 6        22     18      4        37     29      8
  • 在合并的过程中,默认情况下,只有所有的样本都包含该位点时,才会保留,本质就是取的所有样本的交集,如果你想要取并集,可以修改min.per.group参数的值,该参数的值代表每组中至少有多少个样本覆盖该位点时才保留,如果设置为1,就是取并集。
代码语言:javascript
复制
# creates a methylBase object, 
# where only CpGs covered with at least 1 sample per group will be returned

# there were two groups defined by the treatment vector, 
# given during the creation of myobj: treatment=c(1,1,0,0)
meth.min=unite(myobj,min.per.group=1L)
代码语言:javascript
复制
## uniting...

1.检查相关性

代码语言:javascript
复制
getCorrelation(meth,plot=TRUE)
代码语言:javascript
复制
##           test1     test2     ctrl1     ctrl2
## test1 1.0000000 0.9252530 0.8767865 0.8737509
## test2 0.9252530 1.0000000 0.8791864 0.8801669
## ctrl1 0.8767865 0.8791864 1.0000000 0.9465369
## ctrl2 0.8737509 0.8801669 0.9465369 1.0000000

2.检查聚类情况

代码语言:javascript
复制
clusterSamples(meth, dist="correlation", method="ward", plot=TRUE)
代码语言:javascript
复制
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"

{width="768"}

代码语言:javascript
复制
## 
## Call:
## hclust(d = d, method = HCLUST.METHODS[hclust.method])
## 
## Cluster method   : ward.D 
## Distance         : pearson 
## Number of objects: 4
代码语言:javascript
复制
PCASamples(meth, screeplot=TRUE)
代码语言:javascript
复制
PCASamples(meth)

3.去批次化

代码语言:javascript
复制
# make some batch data frame
# this is a bogus data frame
# we don't have batch information
# for the example data
sampleAnnotation=data.frame(batch_id=c("a","a","b","b"),
                            age=c(19,34,23,40))
sampleAnnotation
代码语言:javascript
复制
##   batch_id age
## 1        a  19
## 2        a  34
## 3        b  23
## 4        b  40
代码语言:javascript
复制
as=assocComp(mBase=meth,sampleAnnotation)
as
代码语言:javascript
复制
## $pcs
##              PC1        PC2         PC3         PC4
## test1 -0.4978699 -0.5220504  0.68923849 -0.06737363
## test2 -0.4990924 -0.4805506 -0.71827964  0.06365693
## ctrl1 -0.5016543  0.4938800  0.08068700  0.70563101
## ctrl2 -0.5013734  0.5026102 -0.05014261 -0.70249091
## 
## $vars
## [1] 92.271885  4.525328  1.870950  1.331837
## 
## $association
##                PC1       PC2       PC3       PC4
## batch_id 0.3333333 0.3333333 1.0000000 1.0000000
## age      0.5864358 0.6794346 0.3140251 0.3467957
  • PCA主成分会依据方差的大小进行排序,通常来说,我们仅考察贡献度前2或者前3的主成分。如果某成分贡献度高的话,可考虑将其剔除
代码语言:javascript
复制
# construct a new object by removing the first pricipal component
# from percent methylation value matrix
newObj=removeComp(meth,comp=1)
  • 其他方法矫正批次效应
代码语言:javascript
复制
# 甲基化百分比
mat=percMethylation(meth)
head(mat)
代码语言:javascript
复制
##      test1    test2     ctrl1    ctrl2
## 1 58.82353 80.48048  88.88889 86.32911
## 2 70.58824 75.91463  87.50000 74.93404
## 3 97.43590 93.97590 100.00000 97.56098
## 4 61.76471 87.38739  78.26087 89.18919
## 5 76.47059 93.69369  60.86957 72.97297
## 6 86.76471 98.19820  81.81818 78.37838
代码语言:javascript
复制
# do some changes in the matrix
# this is just a toy example
# 手动进行修改百分比(究竟如何判断还不太清楚???)
mat[mat==100]=80
 
#将修改后的百分比和原来的meth形成新的矩阵
newobj=reconstruct(mat,meth)

三、差异分析

代码语言:javascript
复制
myDiff=calculateDiffMeth(meth)
代码语言:javascript
复制
## two groups detected:
##  will calculate methylation difference as the difference of
## treatment (group: 1) - control (group: 0)
  • 选择q值<0.01和甲基化百分比difference大于25%的差异位点
  • difference函数表明差异的阈值,只有差异大于该阈值时,才会保留,其实就是meth.diff的值,注意是绝对值大于difference的值。
代码语言:javascript
复制
# get hyper methylated bases
myDiff25p.hyper=getMethylDiff(myDiff,difference=25,qvalue=0.01,type="hyper")
#
# get hypo methylated bases
myDiff25p.hypo=getMethylDiff(myDiff,difference=25,qvalue=0.01,type="hypo")
#
#
# get all differentially methylated bases
myDiff25p=getMethylDiff(myDiff,difference=25,qvalue=0.01)
head(myDiff25p)
代码语言:javascript
复制
##      chr   start     end strand       pvalue       qvalue meth.diff
## 11 chr21 9906681 9906681      + 2.324256e-08 2.172846e-07  44.26808
## 12 chr21 9906694 9906694      + 2.983628e-04 1.006776e-03  28.85090
## 13 chr21 9906700 9906700      + 8.405880e-06 4.097059e-05  40.28016
## 18 chr21 9906873 9906873      + 4.976364e-06 2.603518e-05  47.62876
## 23 chr21 9927527 9927527      + 1.120126e-07 9.257475e-07 -46.10984
## 24 chr21 9944505 9944505      + 7.907909e-20 7.515975e-18 -51.01794

1.火山图

代码语言:javascript
复制
myDiffall=getMethylDiff(myDiff,difference=1,qvalue=1)
df=myDiffall
df$type=ifelse(df$pvalue>0.05,'none',
            ifelse( df$meth.diff >25,'hyperDMP', 
                    ifelse( df$meth.diff< -25,'hypoDMP','none') )
)
head(df)
代码语言:javascript
复制
##     chr   start     end strand      pvalue      qvalue meth.diff type
## 1 chr21 9853296 9853296      + 0.009908142 0.021565813 -7.012107 none
## 3 chr21 9860126 9860126      + 0.041604489 0.069780839 -4.111581 none
## 4 chr21 9906604 9906604      + 0.210651680 0.259453090 -7.346369 none
## 5 chr21 9906616 9906616      + 0.001576498 0.004322201 18.817505 none
## 6 chr21 9906619 9906619      + 0.002824594 0.007321638 14.193732 none
## 7 chr21 9906634 9906634      + 0.034326524 0.059268755 13.888889 none
代码语言:javascript
复制
table(df$type)
代码语言:javascript
复制
## 
## hyperDMP  hypoDMP     none 
##       78       60      518
代码语言:javascript
复制
this_tile <- paste0('Cutoff for meth.diff is ',25,
                    '\nThe number of hyperDMP is ',nrow(df[df$type == 'hyperDMP',]) ,
                    '\nThe number of hypoDMP is ',nrow(df[df$type == 'hypoDMP',])
)
#火山图
library(ggplot2)
p5 <- ggplot(data = df, 
             aes(x = meth.diff, 
                 y = -log10(pvalue))) +
  geom_point(alpha=0.6, size=1.5, 
             aes(color=type)) +
  ylab("-log10(Pvalue)")+
  scale_color_manual(values=c("#34bfb5", "#828586","#ff6633"))+
  geom_vline(xintercept= c(-25,25),lty=4,col="grey",lwd=0.8) +
  geom_hline(yintercept = -log10(0.05),lty=4,col="grey",lwd=0.8)+
  #xlim(-0.5, 0.5)+
  theme_classic()+
  ggtitle(this_tile )+
  theme(plot.title = element_text(size=12,hjust = 0.5),
        legend.title = element_blank()
  )
p5

染色体

可视化每条染色体的低/高甲基化碱基/区域的分布(本例只包含一条)

代码语言:javascript
复制
diffMethPerChr(myDiff,plot=T,qvalue.cutoff=0.01, meth.cutoff=25)
本文参与 腾讯云自媒体同步曝光计划,分享自微信公众号。
原始发表:2022-01-04,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 生信技能树 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 一、读取数据
    • 1.1 纯文本数据(methRead)
      • 1.2 bam文件(processBismarkAln)
        • 2.描述性统计量
        • 二、合并样本
          • 1.检查相关性
            • 2.检查聚类情况
            • 三、差异分析
              • 1.火山图
              领券
              问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档