前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >影响差异分析后的火山图的对称性的因素有哪些?

影响差异分析后的火山图的对称性的因素有哪些?

作者头像
生信技能树
发布2022-07-26 10:40:59
1.2K0
发布2022-07-26 10:40:59
举报
文章被收录于专栏:生信技能树生信技能树

看到了一个感染与否的转录组数据的差异分析的文献,里面的火山图有点丑,让我想起来了在一些交流群总是会有人问到为什么他跟着我们的转录组测序数据分析流程处理他自己的数据,得到的火山图并不是很对称。

这个有点丑的火山图对应的文章是:《In vivo transcriptional analysis of mice infected with Leishmania major unveils cellular heterogeneity and altered transcriptomic profiling at single-cell resolution》,如下所示 :

火山图有点丑

一般来说,火山图的对称性会比较好,而且变化倍数比较大的基因一般来说统计学也会比较显著。但是上面的火山图里面居然有大量的基因变化倍数非常平庸但是统计学非常显著。这个文章的常规转录组和单细胞转录组数据集据分别是 GSE181720, GSE185253,可以看到其常规转录组是平平无奇的2分组找差异:

代码语言:javascript
复制
GSM5608822 naïve uninfected control 01
GSM5608823 naïve uninfected control 02
GSM5608824 naïve uninfected control 03
GSM5608825 naïve uninfected control 04
GSM5608826 L. major promastigote parasite infected 01
GSM5608827 L. major promastigote parasite infected 02
GSM5608828 L. major promastigote parasite infected 03
GSM5608829 L. major promastigote parasite infected 04
GSM5608830 L. major promastigote parasite infected 05
GSM5608831 L. major promastigote parasite infected 06

数据集链接是:https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE185253 ,而且作者对每个样品都定量并且给出来了表达量矩阵:

代码语言:javascript
复制
GSM5608822_CON_01_counts.txt.gz 158.5 Kb
GSM5608823_CON_02_counts.txt.gz 149.1 Kb
GSM5608824_CON_03_counts.txt.gz 151.1 Kb
GSM5608825_CON_04_counts.txt.gz 147.4 Kb
GSM5608826_INF_01_counts.txt.gz 151.2 Kb
GSM5608827_INF_02_counts.txt.gz 149.2 Kb
GSM5608828_INF_03_counts.txt.gz 160.5 Kb
GSM5608829_INF_04_counts.txt.gz 160.8 Kb
GSM5608830_INF_05_counts.txt.gz 162.6 Kb
GSM5608831_INF_06_counts.txt.gz 154.1 Kb

看了看作者的数据分析描述:

  • FastQC,MultiQC,Lexogen’s QuantSeq data analysis pipeline,STAR, HTSeq (htseq-counts)
  • Genes with unique Entrez IDs and a minimum of ~2 counts-per-million (CPM) in 4 or more samples were selected for statistical testing.
  • Differential expression between naive and infected ears was evaluated using limma voomWithQualityWeights
  • adjusted p-values < 0.05 and absolute fold-changes > 1.5 were considered significant.
  • Transcripts that were either increased or decreased in RNA abundance with infection (logCPM><1) were included in the volcano plot
  • a fold change >2 and p<0.05 were considered statistically significant.
  • 211 transcripts were increased in RNA abundance and 34 transcripts were decreased in RNA abundance with infection
  • 10,014 genes did not show any significant differences
  • The complete list of genes detected in our RNA-Seq analysis is listed in S1 Table.

有点麻烦,它的正文和method的描述居然在阈值这方面出现了冲突,搞不清楚到底是absolute fold-changes > 1.5 还是a fold change >2。我怀疑作者应该是不是很懂转录组数据分析,并且这个数据大概率并不是他们自己分析的,而是外包给了数据分析公司。它的差异分析结果里面主要是统计学显著的上调基因,仅仅是少量下调基因,所以关心的是上调基因列表主要是富集到了:

  • antigen processing and presentation pathway (Cd4, H2-Q6, H2-M3, H2-Q4, Ifng, Cd8b1, H2-T22, Rfx5, Tap1, H2-Q7),
  • chemokine signaling (Cxcl9, Ccl5, Ccr7, Cxcl10, Cxcl5, Cxcl16, Cxcl1, Fgr, Pik3cd),
  • cell adhesion molecules (Cd4, Itgam, Itgal, Ctla4, Icos Itga6, Cd274, Cd28, Cd86, Selplg, Vcam1)

现在,我们下载压缩包文件:https://ftp.ncbi.nlm.nih.gov/geo/series/GSE185nnn/GSE185253/suppl/GSE185253_RAW.tar 进行简单的处理看看:

代码语言:javascript
复制
fs=list.files('GSE185253_RAW/', full.names = T)
fs
library(data.table)
gid=fread(fs[1],data.table = F)[,1]
head(gid)
rawcount = do.call(cbind,
        lapply(fs, function(x){
          fread(x,data.table = F)[,2]
        }))
rawcount[1:4,1:4]
keep <- rowSums(rawcount>0) >= floor(0.75*ncol(rawcount))
table(keep)
rawcount=rawcount[keep,]
gid=gid[keep] 

library(AnnoProbe)
ids=annoGene( gid ,
            ID_type = 'ENSEMBL',species = 'mouse'
)
head(ids)
ids=ids[!duplicated(ids$SYMBOL),]
rawcount=rawcount[match(ids$ENSEMBL , gid),]
rownames(rawcount) = ids$SYMBOL
rawcount[1:4,1:4]
rawcount['Gapdh',]
cg = log2(edgeR::cpm(rawcount)+1)['Gapdh',]
library(stringr)
gp = str_split(fs,'_',simplify = T)[,3]
t.test(cg ~ gp)


exprSet = rawcount
group_list = gp
# 加载包
library(DESeq2) 
# 第一步,构建DESeq2的DESeq对象
colData <- data.frame(row.names=colnames(exprSet),group_list=group_list)
dds <- DESeqDataSetFromMatrix(countData = exprSet,colData = colData,design = ~ group_list)

# 第二步,进行差异表达分析
dds2 <- DESeq(dds)

# 提取差异分析结果,trt组对untrt组的差异分析结果
table(group_list)
tmp <- results(dds2,contrast=c("group_list","INF","CON"))
DEG_DESeq2 <- as.data.frame(tmp[order(tmp$padj),])
head(DEG_DESeq2)

# 去除差异分析结果中包含NA值的行
DEG_DESeq2 = na.omit(DEG_DESeq2)
DEG_DESeq2['Gapdh',]

library(EnhancedVolcano)
res=DEG_DESeq2
head(res)
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'pvalue')
ggsave(filename = 'EnhancedVolcano_for_DEG_DEseq2.pdf',
       height = 12,width = 8)

ensembl_matrix=exprSet
colnames(ensembl_matrix)=paste0(1:length(group_list),group_list)
colnames(ensembl_matrix)
save(ensembl_matrix,group_list,DEG_DESeq2,file = 'ensembl_matrix.Rdata') 

如下所示:

我们的火山图

这个时候仍然是大同小异啦,上调基因偏多。亲爱的读者朋友们,你们知道为什么会这样吗,影响差异分析后的火山图的对称性的因素有哪些?

我们的转录组实战系列教程目录如下所示:

本文参与 腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2022-07-12,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 生信技能树 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档