前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >如果传统bulk转录组数据队列足够大也可以使用单细胞流程

如果传统bulk转录组数据队列足够大也可以使用单细胞流程

作者头像
生信技能树
发布2019-10-09 15:25:10
1.5K0
发布2019-10-09 15:25:10
举报
文章被收录于专栏:生信技能树生信技能树

昨天我在生信菜鸟团分享的学徒数据挖掘任务: 不一定正确的多分组差异分析结果热图展现 提到了可以使用单细胞转录组数据分析流程来处理文献的数据集。

还给出了一些简单代码,就是看看样本聚类情况,然后留成作业给另外一个学徒,看单细胞R包Seurat的FindAllMarkers函数对7个亚型找到的marker基因,根据传统的bulk转录组差异分析策略的差异。

先看看单细胞转录组代码

这里我们的单细胞转录组数据分析方法,基本上遵循我的全网第一个单细胞课程(基础)满一千份销量就停止发售 内容,就是一些R包的认知,包括 scater,monocle,Seurat,scran,M3Drop 需要熟练掌握它们的对象,:一些单细胞转录组R包的对象 ,分析流程也大同小异:

  • step1: 创建对象
  • step2: 质量控制
  • step3: 表达量的标准化和归一化
  • step4: 去除干扰因素(多个样本整合)
  • step5: 判断重要的基因
  • step6: 多种降维算法
  • step7: 可视化降维结果
  • step8: 多种聚类算法
  • step9: 聚类后找每个细胞亚群的标志基因
  • step10: 继续分类

不到50行代码如下:

library(Seurat)
# https://satijalab.org/seurat/mca.html
sce <- CreateSeuratObject(counts = sarc_choose, 
                          meta.data =clinic,
                          min.cells = 5, min.features = 1000, 
                          project = "sce")
head(sce@meta.data)  
# 步骤 ScaleData 的耗时取决于电脑系统配置(保守估计大于一分钟)
sce <- ScaleData(object = sce, 
                 vars.to.regress = c('nCount_RNA'), 
                 model.use = 'linear', 
                 use.umi = FALSE)
sce <- FindVariableFeatures(object = sce)
length(VariableFeatures(sce)) 
sce <- RunPCA(object = sce, pc.genes = VariableFeatures(sce))
# 下面只是展现不同降维算法而已,并不要求都使用一次。
sce <- RunICA(sce )
sce <- RunTSNE(sce )
#sce <- RunUMAP(sce,dims = 1:10)
#VizPCA( sce, pcs.use = 1:2)
DimPlot(object = sce, reduction = "pca") 
DimPlot(object = sce, reduction = "ica")
DimPlot(object = sce, reduction = "tsne",group.by  =  'short.histo')

# 针对PCA降维后的表达矩阵进行聚类 FindNeighbors+FindClusters 两个步骤。
sce <- FindNeighbors(object = sce, dims = 1:20, verbose = FALSE) 
sce <- FindClusters(object = sce, resolution = 0.8,verbose = FALSE)
# 继续tSNE可视化
DimPlot(object = sce, reduction = "tsne",
        group.by = 'RNA_snn_res.0.8',label = T)
table(sce@meta.data$short.histo,sce@meta.data$RNA_snn_res.0.8)

Idents(sce)=sce@meta.data$short.histo

sce.markers <- FindAllMarkers(object = sce, only.pos = TRUE, min.pct = 0.25, 
                              thresh.use = 0.25)
DT::datatable(sce.markers)
library(dplyr)
sce.markers=sce.markers[!is.infinite(sce.markers$avg_logFC),]
sce.markers %>% group_by(cluster) %>% top_n(2, avg_logFC)
topN <- sce.markers %>% group_by(cluster) %>% top_n(50, avg_logFC)
DoHeatmap(sce,topN$gene)

热图如下:

可以看到R包Seurat的FindAllMarkers函数对7个亚型找到的marker基因基本上都是上调基因。

检查单细胞转录组和bulk差异分析结果重合情况

首先bulk差异分析策略见: 不一定正确的多分组差异分析结果热图展现 ,其实就是我们以前在生信技能树分享的一个策略:如果你的分组比较多,差异分析策略有哪些? 把其中一个亚型跟所有的其它样本进行常规转录组差异分析而已。差异分析结果如下:

我们的单细胞转录组分析结果的,topN 这个变量,就存储着其差异分析结果:

现在要比较这两个表格!

单细胞FindAllMarkers并不是简单取差异最大的基因

通常,我们对传统bulk转录组差异分析结果,可以选取top的上下调基因进行热图可视化,如下:

但是,我们上面单细胞流程的R包Seurat的FindAllMarkers函数对ULMS亚型找到的marker基因,却并不满足这个传统bulk转录组差异分析统计学显著指标,比如logFC大于2,并且校正后的P值小于0.05

不过,也是可以看到这样的规律,就是挑选的基因可以显著的把绝大部分ULMS 亚型病人跟其他病人区分开来。

但是,这些基因,在limma的voom算法的差异分析结果里面,只有非常少的几个基因的阈值是达标的:

绝大部分基因的logFC机会没有啥意义。

其实FindAllMarkers找基因还有很多参数可以选择

有趣的是,就是没有limma的voom算法!

"wilcox" : Identifies differentially expressed genes between two groups of cells using a Wilcoxon Rank Sum test (default)
"bimod" : Likelihood-ratio test for single cell gene expression, (McDavid et al., Bioinformatics, 2013)
"roc" : Identifies 'markers' of gene expression using ROC analysis. For each gene, evaluates (using AUC) a classifier built on that gene alone, to classify between two groups of cells. An AUC value of 1 means that expression values for this gene alone can perfectly classify the two groupings (i.e. Each of the cells in cells.1 exhibit a higher level than each of the cells in cells.2). An AUC value of 0 also means there is perfect classification, but in the other direction. A value of 0.5 implies that the gene has no predictive power to classify the two groups. Returns a 'predictive power' (abs(AUC-0.5) * 2) ranked matrix of putative differentially expressed genes.
"t" : Identify differentially expressed genes between two groups of cells using the Student's t-test.
"negbinom" : Identifies differentially expressed genes between two groups of cells using a negative binomial generalized linear model. Use only for UMI-based datasets
"poisson" : Identifies differentially expressed genes between two groups of cells using a poisson generalized linear model. Use only for UMI-based datasets
"LR" : Uses a logistic regression framework to determine differentially expressed genes. Constructs a logistic regression model predicting group membership based on each feature individually and compares this to a null model with a likelihood ratio test.
"MAST" : Identifies differentially expressed genes between two groups of cells using a hurdle model tailored to scRNA-seq data. Utilizes the MAST package to run the DE testing.
"DESeq2" : Identifies differentially expressed genes between two groups of cells based on a model using DESeq2 which uses a negative binomial distribution (Love et al, Genome Biology, 2014).This test does not support pre-filtering of genes based on average difference (or percent detection rate) between cell groups. However, genes may be pre-filtered based on their minimum detection rate (min.pct) across both cell groups. To use this method, please install DESeq2, using the instructions at https://bioconductor.org/packages/release/bioc/html/DESeq2.html

本文参与 腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2019-10-08,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 生信技能树 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 先看看单细胞转录组代码
  • 检查单细胞转录组和bulk差异分析结果重合情况
  • 单细胞FindAllMarkers并不是简单取差异最大的基因
  • 其实FindAllMarkers找基因还有很多参数可以选择
相关产品与服务
灰盒安全测试
腾讯知识图谱(Tencent Knowledge Graph,TKG)是一个集成图数据库、图计算引擎和图可视化分析的一站式平台。支持抽取和融合异构数据,支持千亿级节点关系的存储和计算,支持规则匹配、机器学习、图嵌入等图数据挖掘算法,拥有丰富的图数据渲染和展现的可视化方案。
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档