A cell cycle is a series of events that takes place in a cell as it grows and divides.即描述细胞生长、分裂整个过程中细胞变化过程。最重要的两个特点就是DNA复制、分裂成两个一样的子细胞。如下图,一般分成4个阶段
在分析单细胞数据时,同一类型的细胞往往来自于不同的细胞周期阶段,这可能对下游聚类分析,细胞类型注释产生混淆;由于细胞周期也是通过cell cycle related protein 调控,即每个阶段有显著的marker基因;通过分析细胞周期有关基因的表达情况,可以对细胞所处周期阶段进行注释;在单细胞周期分析时,通常只考虑三个阶段:G1、S、G2M。(即把G2和M当做一个phase) 主要学习两种来自scran与Seurat包鉴定细胞周期的方法介绍与演示。
(1).Seurat包
Seurat包CellCycleScoring较scran包cyclone函数最主要的区别是直接根据每个cycle,一组marker基因表达值判断。
library(Seurat)
str(cc.genes)
如上是Seurat包提供的人的细胞中分别与S期、G2M期直接相关的marker基因。CellCycleScoring即根据此,对每个细胞的S期、G2M期可能性进行打分;具体如何计算的,暂时在Seurat官方文档中没有提及。在satijalab的github(https://github.com/satijalab/seurat/issues/728))中,作者这样回复类似的提问:As we say in the vignette, the scores are computed using an algorithm developed by Itay Tirosh, when he was a postdoc in the Regev Lab (Science et al., 2016). The gene sets are also taken from his work. You can read the methods section of https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4944528/ to see how the scoring works, but essentially the method scores cells based on the expression of each gene in a signature set - after controlling for the expected expression of genes with similar abundance. 结合一些中文教程(https://www.jianshu.com/p/e4a5b5c67de1)的介绍,认为就是根据每个细胞的S期(或者G2/M期)基因集是否显著高表达,对应的score就是表示在该细胞中,S期(或者G2/M期)基因集高表达的程度(如果是负数,就认为不属于该phase)。 区别于scran包的另外重要的一点就是Seurat包仅提供了人类细胞有关的cell cycle related gene,没有小鼠的。对此,作者这样回复:Thanks. We don't provide different capitalizations because this gene list was developed on a human dataset, and we don't want to create ambiguity by suggesting its created from a mouse reference dataset. **In practice however, we've found it works quite well for mouse also, and recommend the solution above.** 简言之,作者认为可以将对应人的cc.gene转换为鼠对应的基因名,当做后者的cell cycle related gene(因为鼠和人类基因的高度相似性)。提到的solution就是采用biomaRt包转换一下。这在我之前的教程中有介绍。
convertHumanGeneList <- function(x){
require("biomaRt")
human = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
mouse = useMart("ensembl", dataset = "mmusculus_gene_ensembl")
genesV2 = getLDS(attributes = c("hgnc_symbol"), filters = "hgnc_symbol", values = x , mart = human, attributesL = c("mgi_symbol"), martL = mouse, uniqueRows=T)
humanx <- unique(genesV2[, 2])
# Print the first 6 genes found to the screen
print(head(humanx))
return(humanx)
}
m.s.genes <- convertHumanGeneList(cc.genes$s.genes)
m.g2m.genes <- convertHumanGeneList(cc.genes$g2m.genes)
但是biomaRt包网络不是很稳定,有人也直接提供了转换后的结果(https://github.com/satijalab/seurat/issues/462),直接下载导入到R里即可,下面演示的代码就是用的该结果。
mouse.cc.gene=readRDS("H:/MedBioInfoCloud/analysis/base_files/GeneSet/mouse_cell_cycle_genes/mouse_cell_cycle_genes/mouse_cell_cycle_genes.rds")
我们继续使用前面的数据进行分析。下面文章中的:sce3
sce3 = CellCycleScoring(object = sce3,
g2m.features = mouse.cc.gene$g2m.genes,
s.features = mouse.cc.gene$s.genes)
p4=VlnPlot(sce3, features = c("S.Score", "G2M.Score"), group.by = "orig.ident",
ncol = 2, pt.size = 0.1)
sce3@meta.data %>% ggplot(aes(S.Score,G2M.Score))+geom_point(aes(color=Phase))+
theme_minimal()
http://bioconductor.org/packages/release/bioc/manuals/scran/man/scran.pdf scran包cyclone函数是利用‘marker基因对’表达来对细胞所在周期阶段进行预测的方法Scialdone (2015) “maker基因对”由作者根据训练集细胞(已注释了cell cycle)的基因表达特征产生,我们可以直接使用。对于每一细胞周期阶段(人/鼠)都有一组“maker基因对”集合。
# BiocManager::install("scran")
library(scran)
mm.pairs <- readRDS(system.file("exdata", "mouse_cycle_markers.rds",
package="scran"))
str(mm.pairs)
head(mm.pairs$G1)
如上图,具体来说,即比较某个细胞的ENSMUSG00000000001基因表达值是否大于ENSMUSG00000001785基因表达值。如果对于所有G1期marker基因对,某个细胞的“first”列基因表达量大于对应的“second”基因的情况越多,则越有把握认为该细胞就是处于G1期(因为越符合训练集特征)。而cyclone则是通过计算score,即对于某个细胞,符合上述比较关系的marker基因对数占全部marker基因对数的比值。这里默认提供marker基因对是ensemble格式,如果表达数据提供的是其它类类型的基因ID,比如:SYMBOL,那么我们需要转化一下ID。具体参考文章【单细胞数据分析中scran包进行细胞周期分析时细胞周期marker基因的转换】
###基因转换
library(clusterProfiler)
library(org.Hs.eg.db)
# x <- names(mm.pairs)[1]
htrs <- lapply(names(hs.pairs), function(x){
df <- hs.pairs[[x]]
first <- bitr(hs.pairs[[x]][,1],
fromType= "ENSEMBL", toType="SYMBOL",
OrgDb="org.Hs.eg.db")
second <- bitr(hs.pairs[[x]][,2],
fromType= "ENSEMBL", toType="SYMBOL",
OrgDb="org.Hs.eg.db")
df <- df[first$ENSEMBL %in% df$first,]
df <- df[second$ENSEMBL %in% df$second,]
df$first <- lapply(df$first, function(x){
first[first$ENSEMBL == x,2][1]
}) %>% unlist()
df$second <- lapply(df$second, function(x){
second[second$ENSEMBL == x,2][1]
}) %>% unlist()
return(df)
})
names(htrs) <- names(hs.pairs)
下面是计算三个周期阶段的scores:
library(scran)
assigned <- cyclone(sce3@assays[["RNA"]]@data, pairs=htrs)
head(assigned$scores)
save(sce3,file = "data/sce3.Rdata")
MedBioInfoCloud: head(assigned$scores)
G1 S G2M
1 0.972 0.958 0.000
2 0.999 0.370 0.003
3 0.986 0.427 0.000
4 1.000 0.161 0.000
5 0.909 0.000 0.855
6 1.000 0.000 0.872
本文分享自 MedBioInfoCloud 微信公众号,前往查看
如有侵权,请联系 cloudcommunity@tencent.com 删除。
本文参与 腾讯云自媒体同步曝光计划 ,欢迎热爱写作的你一起参与!