单细胞专题 | 10.细胞周期分析

DoubleHelix

发布于 2022-11-24 10:40:25

1.6K0

发布于 2022-11-24 10:40:25

文章被收录于专栏：生物信息云

A cell cycle is a series of events that takes place in a cell as it grows and divides.即描述细胞生长、分裂整个过程中细胞变化过程。最重要的两个特点就是DNA复制、分裂成两个一样的子细胞。如下图，一般分成4个阶段

G1(gap1)：Cell increases in size(Cellular contents duplicated)
S(synthesis) ：DNA replication, each of the 46 chromosomes (23 pairs) is replicated by the cell
G2(gap2)：Cell grows more，organelles and proteins develop in preparation for cell division，为分裂做准备
M(mitosis)：'Old' cell partitions the two copies of the genetic material into the two daughter cells. And the cell cycle can begin again.

在分析单细胞数据时，同一类型的细胞往往来自于不同的细胞周期阶段，这可能对下游聚类分析，细胞类型注释产生混淆；由于细胞周期也是通过cell cycle related protein 调控，即每个阶段有显著的marker基因；通过分析细胞周期有关基因的表达情况，可以对细胞所处周期阶段进行注释；在单细胞周期分析时，通常只考虑三个阶段：G1、S、G2M。(即把G2和M当做一个phase) 主要学习两种来自scran与Seurat包鉴定细胞周期的方法介绍与演示。

(1).Seurat包

Seurat包CellCycleScoring较scran包cyclone函数最主要的区别是直接根据每个cycle，一组marker基因表达值判断。

library(Seurat)
str(cc.genes)

如上是Seurat包提供的人的细胞中分别与S期、G2M期直接相关的marker基因。CellCycleScoring即根据此，对每个细胞的S期、G2M期可能性进行打分；具体如何计算的，暂时在Seurat官方文档中没有提及。在satijalab的github(https://github.com/satijalab/seurat/issues/728))中，作者这样回复类似的提问：As we say in the vignette, the scores are computed using an algorithm developed by Itay Tirosh, when he was a postdoc in the Regev Lab (Science et al., 2016). The gene sets are also taken from his work. You can read the methods section of https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4944528/ to see how the scoring works, but essentially the method scores cells based on the expression of each gene in a signature set - after controlling for the expected expression of genes with similar abundance. 结合一些中文教程（https://www.jianshu.com/p/e4a5b5c67de1)的介绍，认为就是根据每个细胞的S期(或者G2/M期)基因集是否显著高表达，对应的score就是表示在该细胞中，S期(或者G2/M期)基因集高表达的程度(如果是负数，就认为不属于该phase)。区别于scran包的另外重要的一点就是Seurat包仅提供了人类细胞有关的cell cycle related gene，没有小鼠的。对此，作者这样回复：Thanks. We don't provide different capitalizations because this gene list was developed on a human dataset, and we don't want to create ambiguity by suggesting its created from a mouse reference dataset. **In practice however, we've found it works quite well for mouse also, and recommend the solution above.** 简言之，作者认为可以将对应人的cc.gene转换为鼠对应的基因名，当做后者的cell cycle related gene（因为鼠和人类基因的高度相似性)。提到的solution就是采用biomaRt包转换一下。这在我之前的教程中有介绍。

【生信基础 | 人-小鼠基因之间的比较】

【biomaRt包实现不同物种之间同源基因转换】

convertHumanGeneList <- function(x){
  require("biomaRt")
  human = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
  mouse = useMart("ensembl", dataset = "mmusculus_gene_ensembl")
  genesV2 = getLDS(attributes = c("hgnc_symbol"), filters = "hgnc_symbol", values = x , mart = human, attributesL = c("mgi_symbol"), martL = mouse, uniqueRows=T)
  humanx <- unique(genesV2[, 2])
  # Print the first 6 genes found to the screen
  print(head(humanx))
  return(humanx)
}
m.s.genes <- convertHumanGeneList(cc.genes$s.genes)
m.g2m.genes <- convertHumanGeneList(cc.genes$g2m.genes)

但是biomaRt包网络不是很稳定，有人也直接提供了转换后的结果（https://github.com/satijalab/seurat/issues/462），直接下载导入到R里即可，下面演示的代码就是用的该结果。

mouse.cc.gene=readRDS("H:/MedBioInfoCloud/analysis/base_files/GeneSet/mouse_cell_cycle_genes/mouse_cell_cycle_genes/mouse_cell_cycle_genes.rds")

我们继续使用前面的数据进行分析。下面文章中的：sce3

单细胞专题 | 9.如何人工注释单细胞类群？

sce3 = CellCycleScoring(object = sce3, 
                              g2m.features = mouse.cc.gene$g2m.genes, 
                              s.features = mouse.cc.gene$s.genes)
p4=VlnPlot(sce3, features = c("S.Score", "G2M.Score"), group.by = "orig.ident", 
           ncol = 2, pt.size = 0.1)

sce3@meta.data  %>% ggplot(aes(S.Score,G2M.Score))+geom_point(aes(color=Phase))+
  theme_minimal()

(2).scran包

http://bioconductor.org/packages/release/bioc/manuals/scran/man/scran.pdf scran包cyclone函数是利用‘marker基因对’表达来对细胞所在周期阶段进行预测的方法Scialdone (2015) “maker基因对”由作者根据训练集细胞（已注释了cell cycle）的基因表达特征产生，我们可以直接使用。对于每一细胞周期阶段（人/鼠）都有一组“maker基因对”集合。

# BiocManager::install("scran")
library(scran)
mm.pairs <- readRDS(system.file("exdata", "mouse_cycle_markers.rds", 
                                package="scran"))

str(mm.pairs)
head(mm.pairs$G1)

如上图，具体来说，即比较某个细胞的ENSMUSG00000000001基因表达值是否大于ENSMUSG00000001785基因表达值。如果对于所有G1期marker基因对，某个细胞的“first”列基因表达量大于对应的“second”基因的情况越多，则越有把握认为该细胞就是处于G1期（因为越符合训练集特征)。而cyclone则是通过计算score，即对于某个细胞，符合上述比较关系的marker基因对数占全部marker基因对数的比值。这里默认提供marker基因对是ensemble格式，如果表达数据提供的是其它类类型的基因ID，比如：SYMBOL，那么我们需要转化一下ID。具体参考文章【单细胞数据分析中scran包进行细胞周期分析时细胞周期marker基因的转换】

###基因转换
library(clusterProfiler)
library(org.Hs.eg.db)
# x <- names(mm.pairs)[1]
htrs <- lapply(names(hs.pairs), function(x){
  df <- hs.pairs[[x]]
  first <- bitr(hs.pairs[[x]][,1],
                fromType= "ENSEMBL", toType="SYMBOL",
                OrgDb="org.Hs.eg.db")

  second <- bitr(hs.pairs[[x]][,2],
                 fromType= "ENSEMBL", toType="SYMBOL",
                 OrgDb="org.Hs.eg.db")
  df <- df[first$ENSEMBL %in% df$first,]
  df <- df[second$ENSEMBL %in% df$second,]

  df$first <-  lapply(df$first, function(x){
    first[first$ENSEMBL == x,2][1] 
  })  %>% unlist()

  df$second <-  lapply(df$second, function(x){
    second[second$ENSEMBL == x,2][1] 
  })  %>% unlist()

  return(df)
})
names(htrs) <- names(hs.pairs)

下面是计算三个周期阶段的scores：

library(scran)
assigned <- cyclone(sce3@assays[["RNA"]]@data, pairs=htrs)
head(assigned$scores)
save(sce3,file = "data/sce3.Rdata")

MedBioInfoCloud: head(assigned$scores)
     G1     S   G2M
1 0.972 0.958 0.000
2 0.999 0.370 0.003
3 0.986 0.427 0.000
4 1.000 0.161 0.000
5 0.909 0.000 0.855
6 1.000 0.000 0.872

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2022-11-05，如有侵权请联系 cloudcommunity@tencent.com 删除

https