热图一直是一种数据矩阵可视化使用率较高的展示形式,常见包含:
heatmap():用于绘制简单热图的函数;
heatmap.2():绘制增强热图的函数;
d3heatmap:用于绘制交互式热图的R包;
pheatmap是热图中使用频率比较高一个R包,ComplexHeatmap:用于绘制、注释和排列复杂热图。现在ComplexHeatmap
迎来新版本升级,支持pheatmap
参数转换。
新增的 ComplexHeatmap::pheatmap()
该功能实际上将中的所有参数映射pheatmap::pheatmap()
到中的适当参数ComplexHeatmap::Heatmap()
,这意味着可以直接将它转换为一个复杂的热图,例如如下比较
更多的比较信息及参数对照表可点击原文查看,顺着主页信息,学习了复杂热图内容。
library(devtools)install_github("jokergoo/ComplexHeatmap")
碰到以下报错一般是由于网络因素导致的,重试几次,或者直接强制。
install_github("jokergoo/ComplexHeatmap", force = TRUE)
单个热图由热图主体和热图组件组成。热图主体可以按行和列进行拆分,支持水平和垂直排列。热图组件是标题,树状图,矩阵名称和热图注释,它们放置在heamap主体的四个侧面上,并支持热图主体进行重新排序或拆分。同时,热图和注释(列注释)也可以垂直排列。
所述ComplexHeatmap包是在面向对象的方式实现。为了描述热图列表,有以下几类:
Heatmap
类:单个热图,其中包含热图主体,行/列名称,标题,树形图和行/列注释。HeatmapList
class:热图和热图注释的列表。HeatmapAnnotation
class:定义行注释和列注释的列表。热图注释可以是热图的组成部分,也可以独立于热图。还有一些内部类:
SingleAnnotation
class:定义单个行注释或列注释。该HeatmapAnnotation
对象包含对象列表 SingleAnnotation
。ColorMapping
类:从值到颜色的映射。主矩阵和注释的颜色映射由ColorMapping
类控制。AnnotationFunction
class:构造用户定义的注释。这是创建用户定义的注释图形的基础。ComplexHeatmap是在网格系统下实现的,因此用户需要了解基本的网格功能才能充分利用该软件包。
library(ComplexHeatmap)
library(circlize)
# 测试数据,基因表达谱(带基因长度信息)
expr = readRDS(system.file(package = "ComplexHeatmap", "extdata", "gene_expression.rds"))
mat = as.matrix(expr[, grep("cell", colnames(expr))])
base_mean = rowMeans(mat)
mat_scaled = t(apply(mat, 1, scale))
# 获得列名用于type标注
type = gsub("s\\d+_", "", colnames(mat))
ha = HeatmapAnnotation(type = type, annotation_name_side = "left")
# 主图 + base_mean + 基因长度散点图 + gene_type热图(右)
ht_list = Heatmap(mat_scaled, name = "expression", row_km = 5,
col = colorRamp2(c(-2, 0, 2), c("green", "white", "red")),
top_annotation = ha,
show_column_names = FALSE, row_title = NULL, show_row_dend = FALSE) +
Heatmap(base_mean, name = "base mean",
top_annotation = HeatmapAnnotation(summary = anno_summary(gp = gpar(fill = 2:6),
height = unit(2, "cm"))),
width = unit(15, "mm")) +
rowAnnotation(length = anno_points(expr$length, pch = 16, size = unit(1, "mm"),
axis_param = list(at = c(0, 2e5, 4e5, 6e5),
labels = c("0kb", "200kb", "400kb", "600kb")),
width = unit(2, "cm"))) +
Heatmap(expr$type, name = "gene type",
top_annotation = HeatmapAnnotation(summary = anno_summary(height = unit(2, "cm"))),
width = unit(15, "mm"))
# anno_block()以从k均值聚类中识别出五个聚类
ht_list = rowAnnotation(block = anno_block(gp = gpar(fill = 2:6, col = NA)),
width = unit(2, "mm")) + ht_list
# 绘图
draw(ht_list, ht_gap = unit(5, "mm"))
数据:
绘图:
右侧的信息例如基因类型(即蛋白质编码或lincRNA)等也作为热图注释或热图放置。
(以下为中文翻译搬运内容)
在此示例中,小鼠T细胞的单细胞RNA-Seq数据可视化以显示细胞的异质性。数据(mouse_scRNAseq_corrected.txt
)来自Buettner等人,2015,补充数据1,表格“ 细胞周期校正的基因表达谱 ”。作者在此对数据进行预处理:
# 删除了重复的基因
expr = read.table("data/mouse_scRNAseq_corrected.txt", sep = "\t", header = TRUE)
expr = expr[!duplicated(expr[[1]]), ]
rownames(expr) = expr[[1]]
expr = expr[-1]
expr = as.matrix(expr)
# 超过一半的细胞中未表达的基因被滤除。
expr = expr[apply(expr, 1, function(x) sum(x > 0)/length(x) > 0.5), , drop = FALSE]
该get_correlated_variable_rows()
功能在此处定义。它提取在细胞之间可变表达并与其他基因相关的特征基因。
get_correlated_variable_genes = function(mat, n = nrow(mat), cor_cutoff = 0, n_cutoff = 0) {
ind = order(apply(mat, 1, function(x) {
q = quantile(x, c(0.1, 0.9))
x = x[x < q[1] & x > q[2]]
var(x)/mean(x)
}), decreasing = TRUE)[1:n]
mat2 = mat[ind, , drop = FALSE]
dt = cor(t(mat2), method = "spearman")
diag(dt) = 0
dt[abs(dt) < cor_cutoff] = 0
dt[dt < 0] = -1
dt[dt > 0] = 1
i = colSums(abs(dt)) > n_cutoff
mat3 = mat2[i, ,drop = FALSE]
return(mat3)
}
Signature genes 被定义为一系列基因,其中每个基因与20多个基因相关,绝对相关性大于0.5。
mat2
包含每个基因表达情况,
Since single cell RNASeq data is highly variable and outliers are frequent, gene expression is only scaled within the 10th and 90th quantiles.
mat = get_correlated_variable_genes(expr, cor_cutoff = 0.5, n_cutoff = 20)
mat2 = t(apply(mat, 1, function(x) {
q10 = quantile(x, 0.1)
q90 = quantile(x, 0.9)
x[x < q10] = q10
x[x > q90] = q90
scale(x)
}))
colnames(mat2) = colnames(mat)
添加细胞周期基因和核糖核蛋白基因信息。细胞周期基因列表来自Buettner等人,2015,补充表1,表“ Union of Cyclebase and GO genes ”。核糖核蛋白基因来自 GO:0030529。基因列表存储在 mouse_cell_cycle_gene.rds
和中mouse_ribonucleoprotein.rds
。
cc = readRDS("data/mouse_cell_cycle_gene.rds")
ccl = rownames(mat) %in% cc
cc_gene = rownames(mat)[ccl]
rp = readRDS("data/mouse_ribonucleoprotein.rds")
rpl = rownames(mat) %in% rp
由于按比例缩放每个基因的表达值,一个基因相对于其他基因的表达水平已经失真,因此我们计算一个基因在所有样本中的平均表达,用于比较基因之间的表达水平。
base_mean = rowMeans(mat)
将以下信息进行可视化:
mat2
,base_mean
,rpl
,ccl
,cc_gene
,对于具有相对高表达水平(大于所有基因的25%的分位数)的细胞周期基因,基因名称显示为文本标签。在第一个热图中,基于树状聚类的两个主要组,在两个树状图的基础上对列树状图进行了铺底,以突出显示这两个亚群。
library(GetoptLong)
ht_list = Heatmap(mat2, col = colorRamp2(c(-1.5, 0, 1.5), c("blue", "white", "red")),
name = "scaled_expr", column_title = qq("relative expression for @{nrow(mat)} genes"),
show_column_names = FALSE, width = unit(8, "cm"),
heatmap_legend_param = list(title = "Scaled expr")) +
Heatmap(base_mean, name = "base_expr", width = unit(5, "mm"),
heatmap_legend_param = list(title = "Base expr")) +
Heatmap(rpl + 0, name = "ribonucleoprotein", col = c("0" = "white", "1" = "purple"),
show_heatmap_legend = FALSE, width = unit(5, "mm")) +
Heatmap(ccl + 0, name = "cell_cycle", col = c("0" = "white", "1" = "red"),
show_heatmap_legend = FALSE, width = unit(5, "mm")) +
rowAnnotation(link = anno_mark(at = which(ccl & base_mean > quantile(base_mean, 0.25)),
labels = rownames(mat)[ccl & base_mean > quantile(base_mean, 0.25)],
labels_gp = gpar(fontsize = 10), padding = unit(1, "mm"))) +
Heatmap(cor(t(mat2)), name = "cor",
col = colorRamp2(c(-1, 0, 1), c("green", "white", "red")),
show_row_names = FALSE, show_column_names = FALSE, row_dend_side = "right",
show_column_dend = FALSE, column_title = "pairwise correlation between genes",
heatmap_legend_param = list(title = "Correlation"))
ht_list = draw(ht_list, main_heatmap = "cor")
decorate_column_dend("scaled_expr", {
tree = column_dend(ht_list)$scaled_expr
ind = cutree(as.hclust(tree), k = 2)[order.dendrogram(tree)]
first_index = function(l) which(l)[1]
last_index = function(l) { x = which(l); x[length(x)] }
x1 = c(first_index(ind == 1), first_index(ind == 2)) - 1
x2 = c(last_index(ind == 1), last_index(ind == 2))
grid.rect(x = x1/length(ind), width = (x2 - x1)/length(ind), just = "left",
default.units = "npc", gp = gpar(fill = c("#FF000040", "#00FF0040"), col = NA))
})
可视化结果
以上两个例子只是指南中一小部分,不仅仅是绘图还有更多的数据处理细节,详细内容可以查看参考资料1。
主要参考资料