单细胞转录组 | 拟时序分析

天意生信云

发布于 2025-03-13 21:16:37

66600

代码可运行

文章被收录于专栏：天意生信俱乐部天意生信俱乐部

运行总次数：0

代码可运行

拟时序分析概述

拟时序分析（Pseudotime analysis）是一种在单细胞RNA测序（scRNA-seq）数据中重建细胞发育轨迹的计算方法。它基于一个关键假设：在样本中捕获的细胞代表了不同发育阶段的快照，通过计算这些细胞之间的相似性，可以推断出细胞状态变化的顺序，构建出一条或多条发育轨迹。

为什么需要拟时序分析？

传统的scRNA-seq分析通常将细胞分为离散的类型，但这忽略了细胞状态是连续变化的事实。拟时序分析有以下优势：

揭示细胞分化路径和发育轨迹
识别驱动细胞状态转变的关键基因
在没有时间序列实验的情况下推断细胞发育动态
发现新的细胞亚群和中间过渡状态

怎么做拟时序分析？

拟时序分析的基本流程包括：

从预处理好的数据中创建拟时序对象
降维和聚类
学习细胞轨迹
排序细胞并计算伪时间
分析与伪时间相关的基因表达变化

目前常用的拟时序分析工具包括Monocle3、Slingshot、PAGA和Velocity等。本文将基于Monocle3进行演示，因其与Seurat的良好兼容性以及强大的轨迹推断能力。

实际操作：基于Seurat对象的拟时序分析流程

以下是从Seurat对象开始进行拟时序分析的完整流程。前提是已经有一个处理好的Seurat对象，包含标准化数据、降维结果和细胞类型注释。

步骤1：准备环境和数据

# 加载必要的包
library(Seurat)
library(monocle3)
library(SeuratWrappers)
library(ggplot2)
library(patchwork)
library(dplyr)

# 读取预处理好的Seurat对象
seurat_obj <- readRDS("your_seurat_object.rds")

# 查看对象信息
seurat_obj

步骤2：将Seurat对象转换为Monocle3对象

# 使用SeuratWrappers将Seurat对象转换为cell_data_set对象
cds <- as.cell_data_set(seurat_obj)

# 将Seurat聚类和UMAP坐标转移到cell_data_set对象
cds <- cluster_cells(cds, reduction_method = "UMAP")

这一步使用SeuratWrappers包中的as.cell_data_set函数将Seurat对象转换为Monocle3的cell_data_set对象。然后，通过cluster_cells函数将Seurat中的UMAP降维结果用于Monocle3的聚类。

tips：这里会有有一个警告信息，Monocle 3 需要细胞的分群信息（cluster partitions）来构建细胞轨迹（trajectories），所以下一步将Seurat聚类和UMAP坐标转移到cell_data_set对象。

步骤3：学习细胞轨迹

cds <- learn_graph(cds)

# Plot the trajectory with cell annotations
p <- plot_cells(cds,
                color_cells_by = "cell_type",
                label_groups_by_cluster = FALSE,
                label_leaves = FALSE,
                label_branch_points = FALSE,
                graph_label_size = 4,
                cell_size = 1)

# Customize the plot
p <- p + theme(
  text = element_text(size = 10),
  legend.text = element_text(size = 10),
  legend.title = element_text(size = 10),
  legend.position = "right"
) +
  guides(color = guide_legend(override.aes = list(size = 5)))

print(p)
ggsave("trajectory_by_celltypes.pdf", p, width = 10, height = 8)

learn_graph函数会根据细胞之间的相似性构建一个网络图，代表细胞状态之间的转换。这个网络是拟时序分析的核心，它确定了细胞如何在不同状态之间转变。可视化轨迹时，我们可以查看分支点和叶节点，这些代表细胞命运决定的关键点。

步骤4：指定起始细胞并计算伪时间

myselect <- function(cds, select.classify, my_select) {
  # Verify the column exists
  if(!select.classify %in% colnames(colData(cds))) {
    stop(paste("Column", select.classify, "not found in cell metadata"))
  }
  
  # Get cell ids matching the selected cell type
  cell_ids <- which(colData(cds)[, select.classify] == my_select)
  
  if(length(cell_ids) == 0) {
    stop(paste("No cells found with", select.classify, "=", my_select))
  }
  
  # Find closest vertices for these cells
  closest_vertex <- cds@principal_graph_aux[["UMAP"]]$pr_graph_cell_proj_closest_vertex
  closest_vertex <- as.matrix(closest_vertex[colnames(cds), ])
  
  # Find the most common vertex
  root_vertex <- names(which.max(table(closest_vertex[cell_ids, ])))
  root_pr_nodes <- igraph::V(principal_graph(cds)[["UMAP"]])$name[as.numeric(root_vertex)]
  
  return(root_pr_nodes)
}

# Order cells along the trajectory with B cells as the root
# First verify that "B" exists in the seurat_annotations
print(table(cds$seurat_annotations))

# Order cells using B cells as the starting point
root_node <- myselect(cds, select.classify = 'seurat_annotations', my_select = "Naive CD4 T")
cds <- order_cells(cds, root_pr_nodes = root_node)

# Visualize pseudotime
pseudotime_plot <- plot_cells(cds,
                              color_cells_by = "pseudotime",
                              label_groups_by_cluster = FALSE,
                              label_leaves = FALSE,
                              label_branch_points = FALSE,
                              cell_size = 1)

print(pseudotime_plot)
ggsave("pseudotime_trajectory.pdf", pseudotime_plot, width = 10, height = 8)

这一步至关重要，需要指定轨迹的起始细胞。理想情况下，应选择处于发育早期或最原始状态的细胞群体。可以基于先验知识选择特定的细胞类型作为起点，或通过交互式界面在UMAP图上选择。一旦选定起点，order_cells函数会计算每个细胞沿轨迹的伪时间值。本篇的测试数据我尝试了几个基因作为起始细胞都不能很好的连接所有细胞类型，所以这里作为演示流程，结果部分不具参考性。

tips：在这里，作者踩了一个小坑，在 Seurat 中，细胞注释信息通常存储在 Idents 中，而 Monocle 3 的 cell_data_set 对象需要将这些信息存储在 colData 中。因此，即使你已经通过 RenameIdents 对细胞进行了注释，Monocle 3 仍然无法直接识别 cell_type，因为它没有在 colData 中找到对应的列。

解决方法就是：在将 Seurat 对象转换为 Monocle 3 的 cell_data_set 对象之前，将 Idents 信息存储到 meta.data 中。这一步要往前放，但是是在这遇到的报错，我就放这里！

seurat_obj@meta.data$cell_type <- Idents(seurat_obj)

步骤5：分析与伪时间相关的基因


# Extract the genes that significantly change with pseudotime
time_diff_genes <- top_markers(cds)

# Manually select the top 50 genes
top_50_genes <- time_diff_genes[1:50, ]


# 提取前10个基因的名称
genes_to_plot <- top_50_genes$gene_short_name[1:10]

# 使用 plot_genes_in_pseudotime 绘制基因表达动态
dynamic_plot <- plot_genes_in_pseudotime(cds[genes_to_plot, ],
                                         ncol = 2,
                                         cell_size = 0.5)
print(dynamic_plot)
ggsave("top_dynamic_genes.pdf", dynamic_plot, width = 10, height = 12)

这一步旨在找出哪一些基因在这个轨迹中变化最大。

步骤6：轨迹分支分析

# 查看分支信息
plot_cells(cds, color_cells_by = "partition", label_groups_by_cluster = FALSE)

# 提取分支特异性的基因
branch_genes <- graph_test(cds, neighbor_graph = "principal_graph", cores = 4)
branch_genes <- branch_genes %>% filter(q_value < 0.05) %>% arrange(desc(morans_I))

# 可视化分支特异性的基因
plot_cells(cds, genes = branch_genes$gene_short_name[1:6], show_trajectory_graph = FALSE)

如果轨迹存在分支，这一步可以识别驱动不同细胞命运决定的基因。