虽然转录因子分析作为单细胞转录组数据分析的3大高级分析之一名满天下,但是因为它太耗费计算资源导致绝大部分人敬而远之,我们其实也多次分享过细节教程:
如果你没有足够的计算资源,我们其实也推荐过一个简单版本的转录因子分析,就是 dorothea 这个R包,本质上是计算 Viper 得分,代码如下所示:
## We compute Viper Scores
# 0.安装R包 ----
# devtools::install_github('caleblareau/BuenColors')
# utils::install.packages(pkgs = "ggstatsplot")
# InstallData("pbmc3k")
# 1.加载R包和测试数据 ----
rm(list = ls())
library(SeuratData) #加载seurat数据集
getOption('timeout')
options(timeout = 10000)
data("pbmc3k")
sce <- pbmc3k.final
library(Seurat)
# 一个seurat对象
library(Seurat)
library(dorothea)
library(tidyverse)
# 获取包自带数据库
## We read Dorothea Regulons for Human:
dorothea_regulon_human <- get(data("dorothea_hs", package = "dorothea"))
## We obtain the regulons based on interactions with confidence level A, B and C
regulon <- dorothea_regulon_human %>%
dplyr::filter(confidence %in% c("A","B","C"))
sce <- run_viper(sce, regulon,
options = list(method = "scale", minsize = 4,
eset.filter = FALSE, cores = 1,
verbose = FALSE))
关于这个R包,DoRothEA on http://bioconductor.org/ and https://github.com/saezlab/dorothea. 有详细的文档介绍如何选择TF activity per cell population,根据 previously computed VIPER scores on DoRothEA’s regulons.
这个函数 run_viper 其实修改了我们的seurat对象,使它增加了一个 assay属性。查看如下所示:
> Assays(sce)
[1] "RNA" "dorothea"
> sce@assays$dorothea@data[1:4,1:4]
AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC AAACCGTGCTTCCG
AHR 0.7001660 -1.439884 0.7897333 0.919263
AR -2.0755431 -2.865013 -1.5363144 -1.177092
ARID2 -3.4504242 -4.014060 -3.2745958 -3.097172
ARID3A -0.7943249 -3.643424 -1.4438534 -3.931423
> dim(sce@assays$dorothea@data)
[1] 266 2638
可以看到,这个pbmc3k数据集里面的约 三千个细胞都有自己的266个转录因子的活性得分。
代码如下所示:
DefaultAssay(object = sce) <- "dorothea"
table(Idents(sce))
library(future)
# check the current active plan
plan()
plan("multiprocess", workers = 4)
plan()
sce.markers <- FindAllMarkers(object = sce,
only.pos = TRUE,
min.pct = 0.25,
thresh.use = 0.25)
DT::datatable(sce.markers)
pro='dorothea-markers-for-pbmc3k'
write.csv(sce.markers,file=paste0(pro,'_sce.markers.csv'))
save(sce.markers,file = paste0(pro, '_sce.markers.Rdata'))
虽然这个pbmc3k数据集里面只有约 三千个细胞,我们为了加快速度,还是使用了4个线程,大概40秒就可以出结果,
然后尝试热图可视化:
library(dplyr)
sce.markers$fc = sce.markers$pct.1 - sce.markers$pct.2
top10 <- sce.markers %>% group_by(cluster) %>% top_n(10, fc)
sce@assays$dorothea@data[1:4,1:4]
top10 =top10[top10$fc > 0.5,]
sce <- ScaleData(sce )
sce@assays$dorothea@scale.data[1:4,1:4]
DoHeatmap(sce,top10$gene,size=3,slot = 'scale.data')
看起来还不错,确实绝大部分的转录因子都是在不同单细胞亚群里面的特异性比较好:
特异性比较好
这个R包,DoRothEA on http://bioconductor.org/ and https://github.com/saezlab/dorothea. 有详细的文档介绍如何选择TF activity per cell population,根据 previously computed VIPER scores on DoRothEA’s regulons.
首先也是获取Viper得分矩阵 ,根据不同细胞亚群进行归纳汇总 :
## We transform Viper scores, scaled by seurat, into a data frame to better
## handling the results
viper_scores_df <- GetAssayData(sce, slot = "scale.data",
assay = "dorothea") %>%
data.frame(check.names = F) %>%
t()
viper_scores_df[1:4,1:4]
## We create a data frame containing the cells and their clusters
CellsClusters <- data.frame(cell = names(Idents(sce)),
cell_type = as.character(Idents(sce)),
check.names = F)
head(CellsClusters)
## We create a data frame with the Viper score per cell and its clusters
viper_scores_clusters <- viper_scores_df %>%
data.frame() %>%
rownames_to_column("cell") %>%
gather(tf, activity, -cell) %>%
inner_join(CellsClusters)
## We summarize the Viper scores by cellpopulation
summarized_viper_scores <- viper_scores_clusters %>%
group_by(tf, cell_type) %>%
summarise(avg = mean(activity),
std = sd(activity))
# For visualization purposes, we select the 20 most variable TFs across clusters according to our scores.
head(summarized_viper_scores)
## We select the 20 most variable TFs. (20*9 populations = 180)
highly_variable_tfs <- summarized_viper_scores %>%
group_by(tf) %>%
mutate(var = var(avg)) %>%
ungroup() %>%
top_n(180, var) %>%
distinct(tf)
highly_variable_tfs
## We prepare the data for the plot
summarized_viper_scores_df <- summarized_viper_scores %>%
semi_join(highly_variable_tfs, by = "tf") %>%
dplyr::select(-std) %>%
spread(tf, avg) %>%
data.frame(row.names = 1, check.names = FALSE)
palette_length = 100
my_color = colorRampPalette(c("Darkblue", "white","red"))(palette_length)
colnames(summarized_viper_scores_df)
整理好的数据如下所示:
> summarized_viper_scores_df[1:4,1:4]
EPAS1 FOSL1 FOSL2 GATA1
B 0.08468826 -0.08464098 -0.1715292 -0.06634766
CD14+ Mono 0.15636053 0.65938400 0.7189514 0.43845099
CD8 T 0.12689359 -0.19252488 -0.2669776 -0.20250346
DC -0.44678908 -0.12504099 -0.2844289 -0.18479202
是精心挑选了20个转录因子,它们在9个不同的单细胞亚群里面的变化最大,这样就可以挑选得到各个单细胞亚群特异性的转录因子并且进行后续可视化。
有了数据,就很容易可视化:
my_breaks <- c(seq(min(summarized_viper_scores_df), 0,
length.out=ceiling(palette_length/2) + 1),
seq(max(summarized_viper_scores_df)/palette_length,
max(summarized_viper_scores_df),
length.out=floor(palette_length/2)))
library(pheatmap)
viper_hmap <- pheatmap(t(summarized_viper_scores_df),fontsize=14,
fontsize_row = 10,
color=my_color, breaks = my_breaks,
main = "DoRothEA (ABC)", angle_col = 45,
treeheight_col = 0, border_color = NA)
可以看到, 官方文档就是选择了最朴实无华的 pheatmap ,效果如下所示:
各个单细胞亚群特异性的转录因子热图
如果你确实觉得我的教程对你的科研课题有帮助,让你茅塞顿开,或者说你的课题大量使用我的技能,烦请日后在发表自己的成果的时候,加上一个简短的致谢,如下所示:
We thank Dr.Jianming Zeng(University of Macau), and all the members of his bioinformatics team, biotrainee, for generously sharing their experience and codes.
十年后我环游世界各地的高校以及科研院所(当然包括中国大陆)的时候,如果有这样的情谊,我会优先见你。