前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >数据分析-cuttag分析流程分享2-R代码可视化流程处理

数据分析-cuttag分析流程分享2-R代码可视化流程处理

原创
作者头像
小胡子刺猬的生信学习123
发布2022-04-04 12:01:50
2.3K0
发布2022-04-04 12:01:50
举报

在进行R语言的可视化的时候,建议也是把该用的包都提前安装上,这样可以省去后面报错的心累。

在前面的linux流程的时候,主要做了参考基因组的比对、数据的质控与标准化、文件格式的转换和callpeak,现在主要是选用R语言对相关的结果进行可视化。由于我们测的数据还没有发表。所以图片输出结果还是选用的官网上的图片,来供各位友友们参考。

测序数据质量结果可视化

首先查看比对到参考基因组上的数据比对结果。

代码语言:javascript
复制
// ###在分析之前建议用biocmanager装包 省时省力
library(dplyr)
library(stringr)
library(ggplot2)
library(viridis)
library(GenomicRanges)
library(chromVAR) ## For FRiP analysis and differential analysis
library(DESeq2) ## For differential analysis section
library(ggpubr) ## For customizing figures
library(corrplot) ## For correlation plot
## Path to the project and histone list
projPath = "/CUTTag"
sampleList = c("D_rep1","D_rep2", "D_rep3","IgG_rep1")
histList = c("D","IgG")
## Collect the alignment results from the bowtie2 alignment summary files
alignResult = c()
for(hist in sampleList){
  alignRes = read.table(paste0(projPath, "/2.cleandata/", hist, "_bowtie2.txt"), header = FALSE, fill = TRUE)
  alignRate = substr(alignRes$V1[6], 1, nchar(as.character(alignRes$V1[6]))-1)
  histInfo = strsplit(hist, "_")[[1]]
  alignResult = data.frame(Histone = histInfo[1], Replicate = histInfo[2], 
                           SequencingDepth = alignRes$V1[1] %>% as.character %>% as.numeric, 
                           MappedFragNum = alignRes$V1[4] %>% as.character %>% as.numeric + alignRes$V1[5] %>% as.character %>% as.numeric, 
                           AlignmentRate = alignRate %>% as.numeric)  %>% rbind(alignResult, .)
}
alignResult$Histone = factor(alignResult$Histone, levels = histList)
alignResult %>% mutate(AlignmentRate = paste0(AlignmentRate, "%"))
图片.png
图片.png

在这里需要注意的是文件的路径,如果路径没有写全的话,会报错,读不到文件,建议是可以提前设置工作路径。

查看比对到参考基因组和大肠杆菌基因组上的比对结果。

代码语言:javascript
复制
// spikeAlign = c()
for(hist in sampleList){
  spikeRes = read.table(paste0(projPath, "/2.cleandata/", hist, "_bowtie2_spikeIn.txt"), header = FALSE, fill = TRUE)
  alignRate = substr(spikeRes$V1[6], 1, nchar(as.character(spikeRes$V1[6]))-1)
  histInfo = strsplit(hist, "_")[[1]]
  spikeAlign = data.frame(Histone = histInfo[1], Replicate = histInfo[2], 
                          SequencingDepth = spikeRes$V1[1] %>% as.character %>% as.numeric, 
                          MappedFragNum_spikeIn = spikeRes$V1[4] %>% as.character %>% as.numeric + spikeRes$V1[5] %>% as.character %>% as.numeric, 
                          AlignmentRate_spikeIn = alignRate %>% as.numeric)  %>% rbind(spikeAlign, .)
}
spikeAlign$Histone = factor(spikeAlign$Histone, levels = histList)
spikeAlign %>% mutate(AlignmentRate_spikeIn = paste0(AlignmentRate_spikeIn, "%"))

#########Summarize the alignment to TAIR10 and E.coli########
alignSummary = left_join(alignResult, spikeAlign, by = c("Histone", "Replicate", "SequencingDepth")) %>%
  mutate(AlignmentRate = paste0(AlignmentRate, "%"), 
         AlignmentRate_spikeIn = paste0(AlignmentRate_spikeIn, "%"))
alignSummary
图片.png
图片.png

现在是对相关的数字结果进行图片整理,由于我是在服务器上调用的R,所以选用了ggplot2的包来保存pdf图片的结果,如果是在windows的R下面操作,可以直接看图片进行保存。

代码语言:javascript
复制
// 可以发现在这里我们主要需要改的就是几个figure的title,选择符合自己研究的题目来进行更改,我用的是拟南芥的基因组,直接改成了TAIR10。
fig3A = alignResult %>% ggplot(aes(x = Histone, y = SequencingDepth/1000000, fill = Histone)) +
    geom_boxplot() +
    geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
    scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
    scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
    theme_bw(base_size = 18) +
    ylab("Sequencing Depth per Million") +
    xlab("") + 
    ggtitle("A. Sequencing Depth")
fig3B = alignResult %>% ggplot(aes(x = Histone, y = MappedFragNum/1000000, fill = Histone)) +
    geom_boxplot() +
    geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
    scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
    scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
    theme_bw(base_size = 18) +
    ylab("Mapped Fragments per Million") +
    xlab("") +
    ggtitle("B. Alignable Fragment (TAIR10)")
fig3C = alignResult %>% ggplot(aes(x = Histone, y = AlignmentRate, fill = Histone)) +
    geom_boxplot() +
    geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
    scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
    scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
    theme_bw(base_size = 18) +
    ylab("% of Mapped Fragments") +
    xlab("") +
    ggtitle("C. Alignment Rate (TAIR10)")

fig3D = spikeAlign %>% ggplot(aes(x = Histone, y = AlignmentRate_spikeIn, fill = Histone)) +
    geom_boxplot() +
    geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
    scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
    scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
    theme_bw(base_size = 18) +
    ylab("Spike-in Alignment Rate") +
    xlab("") +
    ggtitle("D. Alignment Rate (E.coli)")

plot1 = ggarrange(fig3A, fig3B, fig3C, fig3D, ncol = 2, nrow=2, common.legend = TRUE, legend="bottom")
ggsave("dataquality.pdf", plot = plot1 , height = 10, width = 14)
图片.png
图片.png

核小体的片段大小及分布可视化

其实发现哪个callpeak的分析,最重要的是要看核小体的分布状况是不是符合要求,这个时候就需要来可视化对核小体的分布情况,来明确建库的时候是否要求。

代码语言:javascript
复制
// ##这个时候在前面已经有样本的名称,但是为了后续读文件的时候读错,可以在调用一次,如果是写的一个Rstript的一个脚本,只要在前面读文件的时候是对的,后面就不用考虑相关的问题。
sampleList = c("D_rep1", "D_rep2", "D_rep3", "D_rep4", "IgG_rep1")
histList = c("D", "IgG")
fragLen = c()
for(hist in sampleList){
  
  histInfo = strsplit(hist, "_")[[1]]
  fragLen = read.table(paste0(projPath, "/2.cleandata/", hist, "_fragmentLen.txt"), header = FALSE) %>% mutate(fragLen = V1 %>% as.numeric, fragCount = V2 %>% as.numeric, Weight = as.numeric(V2)/sum(as.numeric(V2)), Histone = histInfo[1], Replicate = histInfo[2], sampleInfo = hist) %>% rbind(fragLen, .) 
}
fragLen$sampleInfo = factor(fragLen$sampleInfo, levels = sampleList)
fragLen$Histone = factor(fragLen$Histone, levels = histList)

## Generate the fragment size density plot (violin plot)
fig5A = fragLen %>% ggplot(aes(x = sampleInfo, y = fragLen, weight = Weight, fill = Histone)) +
    geom_violin(bw = 5) +
    scale_y_continuous(breaks = seq(0, 800, 50)) +
    scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
    scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
    theme_bw(base_size = 20) +
    ggpubr::rotate_x_text(angle = 20) +
    ylab("Fragment Length") +
    xlab("")

fig5B = fragLen %>% ggplot(aes(x = fragLen, y = fragCount, color = Histone, group = sampleInfo, linetype = Replicate)) +
  geom_line(size = 1) +
  scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma") +
  theme_bw(base_size = 20) +
  xlab("Fragment Length") +
  ylab("Count") +
  coord_cartesian(xlim = c(0, 500))
plot2 = ggarrange(fig5A, fig5B, ncol = 2)
ggsave("fragmentsize.pdf", plot = plot2 , height = 10, width = 14)
图片.png
图片.png

一般要求的是右图有几个峰,表明核小体是发生了切割,也是体现了建库后的这个数据是可以用的,如果没有大峰,小峰也是可以的。

评估样本之间的重复性

在前面的linux分析中,对样本分成了500个bin来进行样本之间的相关性进行评估,主要是为了保证样本之间是可重复的,表明我们的数据是可以用的,我一般是按照大于0.7这个阈值,如果低于这个阈值,这个样本的数据有可能就相对悬了,有可能需要补样本了。

代码语言:javascript
复制
// 
projPath = "/CUTTAG"
sampleList = c("D_rep1", "D_rep2", "D_rep3", "D_rep4", "IgG_rep1")
histList = c("D1", "IgG")
reprod = c()
fragCount = NULL
for(hist in sampleList){
  
  if(is.null(fragCount)){
    
    fragCount = read.table(paste0(projPath, "/2.cleandata/", hist, "_bowtie2.fragmentsCount.bin500.bed"), header = FALSE) 
    colnames(fragCount) = c("chrom", "bin", hist)
  
  }else{
    
    fragCountTmp = read.table(paste0(projPath, "/2.cleandata/", hist, "_bowtie2.fragmentsCount.bin500.bed"), header = FALSE)
    colnames(fragCountTmp) = c("chrom", "bin", hist)
    fragCount = full_join(fragCount, fragCountTmp, by = c("chrom", "bin"))
    
  }
}
M = cor(fragCount %>% select(-c("chrom", "bin")) %>% log2(), use = "complete.obs") 
corrplot(M, method = "color", outline = T,
    addgrid.col = "darkgray", order="hclust", 
    addrect = 3, rect.col = "black", rect.lwd = 3,
    cl.pos = "b", tl.col = "indianred4", 
    tl.cex = 1, cl.cex = 1, addCoef.col = "black", 
    number.digits = 2, number.cex = 1, 
    col = colorRampPalette(c("midnightblue","white","darkred"))(100)
    )

dev.off()
图片.png
图片.png

对spike-in校准的结果进行可视化

自我感觉这一步没啥用,但是官网都有流程了,那我们还是老老实实的做个乖孩子走一遍吧。

代码语言:javascript
复制
// ##spikein校准
scaleFactor = c()
multiplier = 10000
for(hist in sampleList){
  spikeDepth = read.table(paste0(projPath, "/2.cleandata/", hist, "_bowtie2_spikeIn.seqDepth"), header = FALSE, fill = TRUE)$V1[1]
  
  histInfo = strsplit(hist, "_")[[1]]
  scaleFactor = data.frame(scaleFactor = multiplier/spikeDepth, Histone = histInfo[1], Replicate = histInfo[2])  %>% rbind(scaleFactor, .)
}
scaleFactor$Histone = factor(scaleFactor$Histone, levels = histList)
left_join(alignDupSummary, scaleFactor, by = c("Histone", "Replicate"))

## Generate sequencing depth boxplot
fig6A = scaleFactor %>% ggplot(aes(x = Histone, y = scaleFactor, fill = Histone)) +
    geom_boxplot() +
    geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
    scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
    scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
    theme_bw(base_size = 20) +
    ylab("Spike-in Scalling Factor") +
    xlab("")

normDepth = inner_join(scaleFactor, alignResult, by = c("Histone", "Replicate")) %>% mutate(normDepth = MappedFragNum * scaleFactor)
fig6B = normDepth %>% ggplot(aes(x = Histone, y = normDepth, fill = Histone)) +
    geom_boxplot() +
    geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
    scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
    scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
    theme_bw(base_size = 20) +
    ylab("Normalization Fragment Count") +
    xlab("") + 
    coord_cartesian(ylim = c(1000000, 130000000))
ggarrange(fig6A, fig6B, ncol = 2, common.legend = TRUE, legend="bottom")
图片.png
图片.png
图片.png
图片.png

peakcalling结果可视化

在前面已经有了两个不同的callpeak的结果,这一步需要对相关的结果进行出图啦。

代码语言:javascript
复制
// ##=== R command ===## 
projPath = "/CUTTAG"
sampleList = c("D1_rep1", "D1_rep2", "D1_rep3", "D1_rep4", "IgG_rep1")
histList = c("D1", "IgG")
peakN = c()
peakWidth = c()
peakType = c("control", "top0.01")
for(hist in sampleList){
  histInfo = strsplit(hist, "_")[[1]]
  if(histInfo[1] != "IgG"){
    for(type in peakType){
      peakInfo = read.table(paste0(projPath, "/2.cleandata/", hist, "_seacr_", type, ".peaks.stringent.bed"), header = FALSE, fill = TRUE)  %>% mutate(width = abs(V3-V2))
      peakN = data.frame(peakN = nrow(peakInfo), peakType = type, Histone = histInfo[1], Replicate = histInfo[2]) %>% rbind(peakN, .)
      peakWidth = data.frame(width = peakInfo$width, peakType = type, Histone = histInfo[1], Replicate = histInfo[2])  %>% rbind(peakWidth, .)
    }
  }
}
peakN %>% select(Histone, Replicate, peakType, peakN)
图片.png
图片.png

查看每个重复中peak的交集

代码语言:javascript
复制
// 
histList = c("D1")
repL = paste0("rep",1:4)
peakType = c("control", "top0.01")
peakOverlap = c()
for(type in peakType){
  for(hist in histL){
    overlap.gr = GRanges()
    for(rep in repL){
      peakInfo = read.table(paste0(projPath, "/2.cleandata/", hist, "_", rep, "_seacr_", type, ".peaks.stringent.bed"), header = FALSE, fill = TRUE)
      peakInfo.gr = GRanges(peakInfo$V1, IRanges(start = peakInfo$V2, end = peakInfo$V3), strand = "*")
      if(length(overlap.gr) >0){
        overlap.gr = overlap.gr[findOverlaps(overlap.gr, peakInfo.gr)@from]
      }else{
        overlap.gr = peakInfo.gr
        
      }
    }
    peakOverlap = data.frame(peakReprod = length(overlap.gr), Histone = hist, peakType = type) %>% rbind(peakOverlap, .)
  }
}

peakReprod = left_join(peakN, peakOverlap, by = c("Histone", "peakType")) %>% mutate(peakReprodRate = peakReprod/peakN * 100)
peakReprod %>% select(Histone, Replicate, peakType, peakN, peakReprodNum = peakReprod, peakReprodRate)
图片.png
图片.png

统计峰值的片段大小并计算frip值

代码语言:javascript
复制
//
sampleList = c("D_rep1", "D_rep2",  "D_rep3", "D_rep4", "IgG_rep1")
histL = c("D1")
repL = paste0("rep",1:4)
bamDir = paste0(projPath, "/2.cleandata")
inPeakData = c()
## overlap with bam file to get count
for(hist in histL){
  for(rep in repL){
    peakRes = read.table(paste0(projPath, "/2.cleandata/", hist, "_", rep, "_seacr_control.peaks.stringent.bed"), header = FALSE, fill = TRUE)
	peak.gr = GRanges(seqnames = peakRes$V1, IRanges(start = peakRes$V2, end = peakRes$V3), strand = "*")
    bamFile = paste0(bamDir, "/", hist, "_", rep, "_bowtie2.mapped.bam")
	fragment_counts <- getCounts(bamFile, peak.gr, paired = TRUE, by_rg = FALSE, format = "bam")
    inPeakN = counts(fragment_counts)[,1] %>% sum
    inPeakData = rbind(inPeakData, data.frame(inPeakN = inPeakN, Histone = hist, Replicate = rep))
  }
}

frip = left_join(inPeakData, alignResult, by = c("Histone", "Replicate")) %>% mutate(frip = inPeakN/MappedFragNum * 100)
frip %>% select(Histone, Replicate, SequencingDepth, MappedFragNum, AlignmentRate, FragInPeakNum = inPeakN, FRiPs = frip)
##峰数、峰宽、峰重现性和 FRiP 的可视化
fig7A = peakN %>% ggplot(aes(x = Histone, y = peakN, fill = Histone)) +
    geom_boxplot() +
    geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
    facet_grid(~peakType) +
    scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.55, option = "magma", alpha = 0.8) +
    scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
    theme_bw(base_size = 18) +
    ylab("Number of Peaks") +
    xlab("")

fig7B = peakWidth %>% ggplot(aes(x = Histone, y = width, fill = Histone)) +
    geom_violin() +
    facet_grid(Replicate~peakType) +
    scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.55, option = "magma", alpha = 0.8) +
    scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
    scale_y_continuous(trans = "log", breaks = c(400, 3000, 22000)) +
    theme_bw(base_size = 18) +
    ylab("Width of Peaks") +
    xlab("")

fig7C = peakReprod %>% ggplot(aes(x = Histone, y = peakReprodRate, fill = Histone, label = round(peakReprodRate, 2))) +
    geom_bar(stat = "identity") +
    geom_text(vjust = 0.1) +
    facet_grid(Replicate~peakType) +
    scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.55, option = "magma", alpha = 0.8) +
    scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
    theme_bw(base_size = 18) +
    ylab("% of Peaks Reproduced") +
    xlab("")

fig7D = frip %>% ggplot(aes(x = Histone, y = frip, fill = Histone, label = round(frip, 2))) +
    geom_boxplot() +
    geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
    scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.55, option = "magma", alpha = 0.8) +
    scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
    theme_bw(base_size = 18) +
    ylab("% of Fragments in Peaks") +
    xlab("")

plot3 = ggarrange(fig7A, fig7B, fig7C, fig7D, ncol = 2, nrow=2, common.legend = TRUE, legend="bottom")
ggsave("peakcalling.pdf", plot = plot3, height = 10, width = 14)
图片.png
图片.png

总结

通过官网的整体流程分析发现,这些代码也是可以批量在后台运行的,只要是R包装的比较完整,然后文件路径是对的,基本上在出图的时候前期对大小及文件的分辨率调试好了,后面是没有什么大问题的。

代码语言:javascript
复制
##如果是R后台运行,需要首行加入#! /usr/bin/env Rscript
nohup Rscript R.r &>R.out 2>&1 &

同时发现在读文件的时候,里面是循环套循环,有的时候几个样本没有相同的重复,就比较惨了,需要把rep的这个循环去掉,加上全路径,这样才能一次性读进去,要不就一直报错,文件找不到。

我觉得这些文件中最重要的几个文件就是前面的mapping率、测序深度、peak数目、重复之间peak的重复数、frips值(可参考),因为我是做植物的,大肠杆菌基因组的比对率相对很低,如果是做动物的,有可能还是需要相对重视一下这些内容。

其实在后面的差异peak分析(官网推荐的方法)时,有的是不用IgG去除一定背景的,所以有的公司的人说其实是可以不用做IgG对照的,但是这个感觉还是看主要的研究内容是什么,来进行相关的考虑的。

下一篇主要是对callpeak后面的个性化分析进行相关的整理内容(数据分析-cuttag分析流程分享3-个性化分析内容),主要是来看峰的富集区域、峰的注释、富集分析和motif分析的相关内容。

原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。

如有侵权,请联系 cloudcommunity@tencent.com 删除。

原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。

如有侵权,请联系 cloudcommunity@tencent.com 删除。

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 测序数据质量结果可视化
  • 核小体的片段大小及分布可视化
  • 评估样本之间的重复性
  • 对spike-in校准的结果进行可视化
  • peakcalling结果可视化
    • 查看每个重复中peak的交集
      • 统计峰值的片段大小并计算frip值
      • 总结
      相关产品与服务
      命令行工具
      腾讯云命令行工具 TCCLI 是管理腾讯云资源的统一工具。使用腾讯云命令行工具,您可以快速调用腾讯云 API 来管理您的腾讯云资源。此外,您还可以基于腾讯云的命令行工具来做自动化和脚本处理,以更多样的方式进行组合和重用。
      领券
      问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档