跟着小鱼头学单细胞测序-单细胞注释实践指南

作图丫

发布于 2022-03-29 10:49:38

3.9K0

文章被收录于专栏：作图丫作图丫

导语

GUIDE ╲

如何识别细胞类型和状态并最终创建带注释的细胞图谱是单细胞研究中的一个重难点，之前给大家介绍过相关的思路分析，这次给大家打来详细的代码解析，希望对大家的实操有所助益。

背景介绍

单细胞转录组学可以在一次实验中分析数千个细胞，并在各种组织和生物体中识别新的细胞类型、状态和动态，从而创建多细胞系统中细胞异质性的全面图谱。现在有很多创建单细胞转录组图谱的分析流程已经被开发出来，然而如何识别细胞类型和状态并最终创建带注释的细胞图谱还是一个重难点。之前我们介绍了最新发表在natureprotocols上一篇注释指南。

这里小编带大家一起学习对应的基于R语言的实践操作。

由于每个单细胞图谱注释情况都会有所不同，并且可能不需要使用所有这些工具。该教程利用公开可用的数据，涵盖的内容包括：

基于参考和标记的自动注释和手动注释
如何构建一致的集群注释集

教程内容

代码教程包括4个部分：

基于参考数据集的自动注释：使用已标记的参考数据集对查询数据集进行注释。使用工具有 scmap和 SingleR。并进一步探索如何使用 Harmony 将合并数据集作为一种注释形式。
在基于参考数据集进行注释之后，对于出现的冲突标签情况，采取“大多数”原则构建一致的集群注释集。
基于特征基因的自动注释：导入与特定细胞类型相关的标记基因列表。使用工具有SCINA。
手动注释：从查询数据集中提取标记基因和相关通路，将得到的差异表达的基因和通路与相关文献结果进行比较。使用工具有Seurat 和 cerebroApp。

该教程代码基于 R 4.0.3，以下列出的是需要安装的工具包（其中“*”表示需要通过BiocManager::install("package")安装；"**"表示需要通过devtools从GitHub中获取源代码安装）：

SingleCellExperiment_1.12.0*Seurat_3.2.2scater_1.18.3*SCINA_1.2.0devtools_2.3.2dplyr_1.0.4*scmap_1.12.0*celldex_1.0.0*SingleR_1.4.0*ggplot2_3.3.3Harmony_1.0**cerebroApp_1.3.0**msigdb_0.2.0**

代码解析

基于参考数据集的自动注释

第一步是构建参考数据集：本例子使用 SingleR 的作者创建的参考数据之一，并展示如何将其与其他工具（例如 scmap）一起使用。其他参考数据集可以在 GEO 或文献中查找。

# 设置随机数确保结果的可重复性
set.seed(9742)
# 下载 singleR 原文中的免疫细胞数据作为参考
ref <- celldex::DatabaseImmuneCellExpressionData()
# 将参考数据转化为scmap的格式对象 SingleCellExperiment；默认情况下基因名称位于名为“feature_symbol”的列中，而细胞类型标签位于名为“cell_type1”的列中'。此外，scmap 要求对参考数据进行归一化和对数转换；由于参考数据已经在SingleR中进行这些操作了，因为这里略过。
colData(ref)$cell_type1 <- colData(ref)$label.fine
rowData(ref)$feature_symbol <- rownames(ref)
ref_sce <- SingleCellExperiment::SingleCellExperiment(assays=list(logcounts=Matrix::Matrix(assays(ref)$logcounts)), 
                                                      colData=colData(ref), rowData=rowData(ref))

# 映射细胞标签
# 首先选择高变基因，这些基因将是 scmap 认为信息量最大的基因。这些基因通常具有高表达值和低dropout rate。
ref_sce <- scmap::selectFeatures(ref_sce, suppress_plot=FALSE)
# 选择前50个高变基因
rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)][1:50]
length(rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)])

# 查看scmap 选择使用的基因后，如果缺少关键标记基因，可以手动添加
my_key_markers = c("TRAC", "TRBC1", "TRBC2", "TRDC", "TRGC1", "TRGC2", "IGKC")
rowData(ref_sce)$scmap_features[rownames(ref_sce) %in% my_key_markers] <- TRUE
length(rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)])

# 同样的，我们也是手动删除可能是artefacts的基因，例如线粒体基因
mt_genes <- rownames(ref_sce)[grep("^MT-", rownames(ref_sce))]
rowData(ref_sce)$scmap_features[rownames(ref_sce) %in% mt_genes] <- FALSE
length(rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)])
scmap_feature_genes <- rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)]
length(scmap_feature_genes)

# 通过对群簇构建scmap-cluster中需要的参考数据集
ref_sce <- scmap::indexCluster(ref_sce)
#通过热图查看感兴趣的基因
cormat <- reshape2::melt(as.matrix(metadata(ref_sce)$scmap_cluster_index))
ggplot2::ggplot(cormat, ggplot2::aes(x = Var2, y = Var1, fill = value)) +
  ggplot2::geom_tile() +
  ggplot2::scale_fill_gradient2(low = "blue", high = "darkred",
                                name = "Expression value") +
  ggplot2::theme_minimal() +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, vjust = 1,
                                                     size = 18, hjust = 1),
                 axis.text.y = ggplot2::element_text(size = 15),
                 axis.title.x = ggplot2::element_blank(),
                 axis.title.y = ggplot2::element_blank())
# 储存参考数据集的类别标签， 用于之后的自动注释
scmap_cluster_reference <- metadata(ref_sce)$scmap_cluster_index

# 除了群簇的标签映射之外，也可以映射单个细胞的标签
# 构建 scmap-cell 的参考标签集
ref_sce <- scmap::indexCell(ref_sce)
scmap_cell_reference <- metadata(ref_sce)$scmap_cell_index
scmap_cell_metadata <- colData(ref_sce)

# 接下来就是注释了
# 查询数据集来自 10X genomics官网
download.file("https://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz",
              "pbmc3k_filtered_gene_bc_matrices.tar.gz")
untar("pbmc3k_filtered_gene_bc_matrices.tar.gz")

# 导入查询数据集，构建对象并进行常规标准化
data <- Seurat::Read10X("filtered_gene_bc_matrices/hg19/")
query_sce <- SingleCellExperiment::SingleCellExperiment(assays=list(counts=data))
query_seur <- Seurat::CreateSeuratObject(data)
query_sce <- Seurat::as.SingleCellExperiment(query_seur)
query_sce <- scater::logNormCounts(query_sce)
rowData(query_sce)$feature_symbol <- rownames(query_sce)

# 使用scmap-cluster 参考群簇标签进行预测
scmap_cluster_res <- scmap::scmapCluster(projection=query_sce, 
                                         index_list=list(immune1 = scmap_cluster_reference), 
                                         threshold=0.1)
# 查看预测结果
par(mar=c(13, 4, 1, 0))
barplot(table(scmap_cluster_res$combined_labs), las=2)
# 将注释信息保存到query数据集对象中
colData(query_sce)$scmap_cluster <- scmap_cluster_res$combined_labs
query_sce <- scater::runUMAP(query_sce)
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="scmap_cluster")

# 除了参考群簇标签， 也可利用单个细胞的参考标签进行预测
nearest_neighbours <- scmap::scmapCell(projection=query_sce, 
                                       index_list = list(immune1 = scmap_cell_reference), 
                                       w=10)
mode_label <- function(neighbours, metadata=scmap_cell_metadata$cell_type1) {
  freq <- table(metadata[neighbours])
  label <- names(freq)[which(freq == max(freq))]
  if (length(label) > 1) {return("ambiguous")}
  return(label)
}
scmap_cell_labs <- apply(nearest_neighbours$immune1$cells, 2, mode_label)
colData(query_sce)$scmap_cell <- scmap_cell_labs
#查看预测结果
par(mar=c(10, 4, 0, 0))
barplot(table(scmap_cell_labs), las=2)
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="scmap_cell")

下面这个示例是演示如何使用 SingleR 进行注释。

predictions <- SingleR::SingleR(test=query_sce, ref=ref, labels=ref$label.fine)
# 预测结果中显示不是所有细胞的类型都被注释了
# 查看没有被注释的细胞数目
sum(is.na(predictions$pruned.labels))
# 赋予那些没有被成功注释的细胞 “ambiguous” 标签
predictions$pruned.labels[which(is.na(predictions$pruned.labels))] <- "ambiguous"
colData(query_sce)$singleR <- predictions$pruned.labels
# 查看预测结果
par(mar=c(13, 4, 2, 0))
barplot(table(predictions$pruned.labels), las=2)
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="singleR")

另一种标签映射方法是基于参考数据和查询数据做聚类，然后参考标签映射到相邻的查询数据细胞上。下面示例演示如何通过Harmony来操作。

set.seed(2891)
# 将参考数据和查询数据转化成seurat对象
assays(ref_sce)[["counts"]] <- round(2^assays(ref_sce)[["logcounts"]]) -1
colnames(ref_sce) <- paste("cell", 1:ncol(ref_sce))
# 选取两个数据集中共有的基因集
ref_seur <- Seurat::as.Seurat(ref_sce[rownames(ref_sce) %in% rownames(query_sce),])
ref_seur@active.ident <- factor(rep("reference", ncol(ref_seur)))
query_seur <- Seurat::as.Seurat(query_sce[rownames(query_seur) %in% rownames(ref_sce),])
query_seur@active.ident <- factor(rep("query", ncol(query_seur)))
# 对参考数据做downsample操作，使得两个数据集的total umi数据相似
totalUMI <- median(query_seur@meta.data$nCount_RNA)
ref_seur@assays$RNA@counts <- Seurat::SampleUMI(ref_seur@assays$RNA@counts,
                                                max.umi=totalUMI, upsample=FALSE)
# 将两个数据集合并
merged_seur <- merge(ref_seur, query_seur)
merged_seur@meta.data$source <- merged_seur@active.ident
# 对合并后的数据标准化，选取高变基因
merged_seur <- Seurat::NormalizeData(merged_seur)
Seurat::VariableFeatures(merged_seur) <- scmap_feature_genes
# 归一化、降维和可视化
merged_seur <- Seurat::ScaleData(merged_seur)
merged_seur <- Seurat::RunPCA(merged_seur)
merged_seur <- Seurat::RunUMAP(merged_seur, dims=1:15)
Seurat::DimPlot(merged_seur, reduction="umap") + ggplot2::ggtitle("Before Integration")
# 通过Harmony消除批次效应
merged_seur <- harmony::RunHarmony(merged_seur, "source", dims.use=1:15)
merged_seur <- Seurat::RunUMAP(merged_seur, dims=1:15, reduction="harmony")
# 查看合并结果
Seurat::DimPlot(merged_seur, reduction="umap") + ggplot2::ggtitle("After Integration")
# 对合并数据聚类
merged_seur <- Seurat::FindNeighbors(merged_seur, reduction="harmony", dims=1:15)
merged_seur <- Seurat::FindClusters(merged_seur, resolution=0.5)
# 查看预测结果
Seurat::DimPlot(merged_seur, reduction="umap") + ggplot2::ggtitle("After Integration")
table(merged_seur@meta.data$label.fine, 
      merged_seur@active.ident)
# 通过上面的table对比参考数据和查询数据的标签情况后，手动给每个类别修改标签名
cluster_labs <- c("0"="ambiguous", 
                  "1"="Monocytes, CD14+", 
                  "2"="B cells, naive", 
                  "3"="T cells, CD4+, naive TREG",
                  "4"="T cells, CD4+, Th1_17",
                  "5"="NK cells",
                  "6"="T cells, CD8+, naive",
                  "7"="Monocytes, CD16+",
                  "8"="T cells, CD4+, memory TREG",
                  "9"="T cells, CD4+, naive, stimulated",
                  "10" = "T cells, CD8+, naive, stimulated")
merged_seur@meta.data$annotation <- cluster_labs[merged_seur@meta.data$RNA_snn_res.0.5]
query_sce$Harmony_lab <- merged_seur@meta.data$annotation[merged_seur@meta.data$source =="query"]
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="Harmony_lab")

构建一致的集群注释集

基于“大多数”原则对于冲突标签构建一致的集群注释集。

annotation_columns <- c("scmap_cluster", "scmap_cell", "singleR", "Harmony_lab")
get_consensus_label <- function(labels){
  labels <- labels[labels != "ambiguous"]
  if (length(labels) == 0) {return("ambiguous")}
  freq <- table(labels)
  label <- names(freq)[which(freq == max(freq))]
  if (length(label) > 1) {return("ambiguous")}
  return(label)
}
colData(query_sce)$consensus_lab <- apply(colData(query_sce)[,annotation_columns], 1, get_consensus_label)
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="consensus_lab")

基于特征基因的自动注释

SCINA是一种半监督注释工具，它基于特征基因和表达矩阵，并根据细胞类型特征基因的先验知识预测潜在标签。标记列表通常是 gmt 格式。第一步需要下载特征基因集，这里用到的 PBMC 特征基因来自于（Diaz-Mejia JJ et al.， https://zenodo.org/record/3369934#.X2PWty2z1QI)。

download.file("https://zenodo.org/record/3369934/files/pbmc_22_10x.tar.bz2",
              "pbmc_22_10x.tar.bz2")
untar("pbmc_22_10x.tar.bz2")
# 导入特征基因集
markers <- msigdb::read.gmt('pbmc_22_10x_cell_type_signature_gene_sets.gmt')
# 将查询数据集从seurat对象模式转化为 matrix 格式
exprMatrix <- as.matrix(Seurat::GetAssayData(query_seur))
# 通过 SCINA 进行预测，其中 rm_overlap = FALSE 允许使用相同的基因注释不同细胞类型，这将有利于相似细胞类型的注释
# allow_unknown = TRUE 允许无法注释的细胞存在，避免强行赋予一些 low-confident 的标签
predictions.scina = SCINA::SCINA(exp = exprMatrix, signatures = markers$genesets,
                                 rm_overlap = FALSE, allow_unknown = TRUE)
colData(query_sce)$SCINA <- predictions.scina$cell_labels
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="SCINA") +
  ggplot2::theme(legend.position = "bottom",
                 legend.text = ggplot2::element_text(size = 4))

手动注释

手动注释时第一步需要检索标记基因。如果我们没有每个细胞类型的标签或高质量的参考数据集，那么从查询数据的每个集群中提取top体征基因会很有用。可以通过Seurat来完成。

# 对查询数据做saseurat的标准流程处理，得到每个群簇的特征基因
query_seur <- Seurat::NormalizeData(query_seur) # Normalize the data
query_seur <- Seurat::FindVariableFeatures(query_seur) # Determine the variable features of the dataset
query_seur <- Seurat::ScaleData(query_seur) # Scale the data based on the variable features
query_seur <- Seurat::RunPCA(query_seur)
query_seur <- Seurat::RunTSNE(query_seur)
query_seur <- Seurat::FindClusters(query_seur, resolution = 0.5)
Seurat::DimPlot(query_seur, reduction = "UMAP")
markers_seur <- Seurat::FindAllMarkers(query_seur, only.pos = TRUE)
markers_seur
require(dplyr)
top5 <- markers_seur %>% group_by(cluster) %>%
  dplyr::slice_max(get(grep("^avg_log", colnames(markers_seur), value = TRUE)),
                   n = 5)
Seurat::DotPlot(query_seur, features = unique(top5$gene)) +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, vjust = 1,
                                                     size = 8, hjust = 1)) +
  Seurat::NoLegend()
Seurat::DoHeatmap(query_seur, features = unique(top5$gene)) +
  Seurat::NoLegend() +
  ggplot2::theme(axis.text.y = ggplot2::element_text(size = 8))

# 除了基于群簇的top特征基因， 也可以通过通过富集分析来得到基因集功能，来进一步手动注释
query_seur <- cerebroApp::getMarkerGenes(query_seur,
                                         groups = c('seurat_clusters'),
                                         assay = "RNA",
                                         organism = "hg")
# 通过cerebro得到富集通路
query_seur <- cerebroApp::getEnrichedPathways(query_seur,
                                              databases = c("GO_Biological_Process_2018",
                                                            "GO_Cellular_Component_2018",
                                                            "GO_Molecular_Function_2018",
                                                            "KEGG_2016",
                                                            "WikiPathways_2016",
                                                            "Reactome_2016",
                                                            "Panther_2016",
                                                            "Human_Gene_Atlas",
                                                            "Mouse_Gene_Atlas"),
                                              adj_p_cutoff = 0.05,
                                              max_terms = 100,
                                              URL_API = "http://amp.pharm.mssm.edu/Enrichr/enrich")
# 查看富集通路
query_seur@misc$enriched_pathways
#end

小编总结

结合注释指南的思路指导，该教程提供了具体的实践代码演示如何使用工具和资源，实用性很高，值得收藏。

Reference

Clarke, Z.A., Andrews, T.S., Atif, J. et al. Tutorial: guidelines for annotating single-cell transcriptomic maps using automated and manual methods. Nat Protoc 16, 2749–2764 (2021). https://doi.org/10.1038/s41596-021-00534-0

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2021-07-10，如有侵权请联系 cloudcommunity@tencent.com 删除

https