在[[02-用splatter模拟单细胞数据]]中,我们提过SplatParams 参数对象中的参数:
这一节来详细介绍一下。
Splat 模型过程如下:
不同的参数数值对应不同的模型估计。
另外在官方文档中:
The simulation involves the following steps:
Set up simulation object
Simulate library sizes
Simulate gene means
Simulate groups/paths
Simulate BCV adjusted cell means
Simulate true counts
Simulate dropout
Create final dataset
不过按照我的个人理解,就是首先生成各个基因的分布,接下来选择一些基因作为异常值。接着按照分组,批次等,对某些细胞的特定基因增加或减少一定的数值。
使用就行,暂时把它们当成黑箱好了。
比如splat默认值创建的信息:
params <- newSplatParams()
params
#> A Params object of class SplatParams
#> Parameters can be (estimable) or [not estimable], 'Default' or 'NOT DEFAULT'
#> Secondary parameters are usually set during simulation
#>
#> Global:
#> (Genes) (Cells) [SEED]
#> 10000 100 704180
#>
#> 29 additional parameters
#>
#> Batches:
#> [Batches] [Batch Cells] [Location] [Scale] [Remove]
#> 1 100 0.1 0.1 FALSE
#>
#> Mean:
#> (Rate) (Shape)
#> 0.3 0.6
#>
#> Library size:
#> (Location) (Scale) (Norm)
#> 11 0.2 FALSE
#>
#> Exprs outliers:
#> (Probability) (Location) (Scale)
#> 0.05 4 0.5
#>
#> Groups:
#> [Groups] [Group Probs]
#> 1 1
#>
#> Diff expr:
#> [Probability] [Down Prob] [Location] [Scale]
#> 0.1 0.5 0.1 0.4
#>
#> BCV:
#> (Common Disp) (DoF)
#> 0.1 60
#>
#> Dropout:
#> [Type] (Midpoint) (Shape)
#> none 0 -1
#>
#> Paths:
#> [From] [Steps] [Skew] [Non-linear] [Sigma Factor]
#> 0 100 0.5 0.1 0.8
nGenes
数据集包含的基因数;nCells
细胞数;seed
随机种子;batchCells
,对应各个batch 中的细胞数目;batch.rmEffect
布尔值,移除batch 效应;batch.facLoc
与batch.facScale
用来调控batch 大小,越大batch 效应越强,文档解释为:set.seed(1)
sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)
params <- splatEstimate(sce)
sim1 <- splatSimulate(params, batchCells = c(100, 100),
batch.facLoc = 0.05, batch.facScale = 0.05,
verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
sim2 <- splatSimulate(params, batchCells = c(100, 100),
batch.facLoc = 0.5, batch.facScale = 0.5,
verbose = FALSE)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)
p1 <- plotPCA(sim1, colour_by = "Batch") + ggtitle("Small batch effects")
p2 <- plotPCA(sim2, colour_by = "Batch") + ggtitle("Big batch effects")
cowplot::plot_grid(p1,p2)
mean.shape
与mean.rate
,作者这里说明这个参数最好使用真实数据中估计的结果,而非自定设定;lib.loc
will lead to more counts per cell and increasinglib.scale
will result in more variability in the counts per cell.lib.norm
布尔值,T 将用正态分布替代log 正态分布;out.prob
异常表达值的基因,控制比例;# Lots of outliers
sim2 <- splatSimulate(out.prob = 0.2, verbose = FALSE)
ggplot(as.data.frame(rowData(sim2)),
aes(x = log10(GeneMean), fill = OutlierFactor != 1)) +
geom_histogram(bins = 100) +
ggtitle("Lots of outliers")
group.prob
每个组别细胞的比例;de.prob
控制差异基因的比例:需要注意的是,如果需要控制de.prob
及不同组别的基因差异比例,需要使用函数splatSimulateGroups
:set.seed(1)
sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)
params <- splatEstimate(sce)
sim1 <- splatSimulateGroups(params, batchCells = 200,
group.prob = c(0.5, 0.5),
de.prob = 0.01,
verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
sim2 <- splatSimulateGroups(params, batchCells = 200,
group.prob = c(0.5, 0.5),
de.prob = 0.4,
verbose = FALSE)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)
p1 <- plotPCA(sim1, colour_by = "Group") + ggtitle("Small DE")
p2 <- plotPCA(sim2, colour_by = "Group") + ggtitle("Big DE")
cowplot::plot_grid(p1,p2)
在差异基因比例较小时,不同的group 并不能很好地分开。
de.downProb
差异基因中,下调的基因比例;de.facLoc
差异基因的强度,推测de.facScale
也对应更大的差异基因间变化的程度(文档中并没有解释);我们还可以设定多个group,以及对应每个group 独立的参数,比如多个差异基因比例,多个差异基因变化强度:
# Combination of everything
sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)
params <- splatEstimate(sce)
sim3 <- splatSimulateGroups(params,
group.prob = c(0.05, 0.2, 0.2, 0.2, 0.35),
de.prob = c(0.3, 0.1, 0.2, 0.01, 0.1),
de.downProb = c(0.1, 0.4, 0.9, 0.6, 0.5),
de.facLoc = c(0.6, 0.1, 0.1, 0.01, 0.2),
de.facScale = c(0.1, 0.4, 0.2, 0.5, 0.4),
verbose = FALSE)
sim3 <- logNormCounts(sim3)
sim3 <- runPCA(sim3)
plotPCA(sim3, colour_by = "Group") +
labs(title = "Different DE factors",
caption = paste(
"Group 1 is small with many very up-regulated DE genes,",
"Group 2 has the default DE parameters,\n",
"Group 3 has many down-regulated DE genes,",
"Group 4 has very few DE genes,",
"Group 5 is large with moderate DE factors")
)
bcv.common
数值越大,混杂变量影响就越大;bcv.df
BCV inverse chi-squared distribution 使用的自由度大小,会影响基因变化的程度受到表达均值的影响;sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)
params <- splatEstimate(sce)
sim1 <- splatSimulateGroups(params, batchCells = 200,
group.prob = c(0.5, 0.5),
de.prob = 0.4,
verbose = FALSE, bcv.common = 0.1,
bcv.df = 1)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
sim2 <- splatSimulateGroups(params, batchCells = 200,
group.prob = c(0.5, 0.5),
de.prob = 0.4,
verbose = FALSE, bcv.common = 0.1,
bcv.df = 20)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)
p1 <- plotPCA(sim1, colour_by = "Group") + ggtitle("Small bcv.df")
p2 <- plotPCA(sim2, colour_by = "Group") + ggtitle("Big bcv.df")
cowplot::plot_grid(p1,p2)
从试验看,degree of freedom 越大,混杂变量的影响越小:
dropout :
dropouts, where the data only captures a small fraction of the transcriptome of each cell.
ps:个人理解是有些基因feature 可能在某些细胞类型中完全检测不到,或检测水平非常低。有篇文献,详细的讨论了相关的内容:PMC7054558 (Embracing the dropouts in single-cell RNA-seq analysis)
dropout.mid
与dropout.shape
控制某个细胞中zero 计数的比例:dropout.type
设置dropout 情节:dropout.mid
与dropout.shape
才可以被设置。参考:单细胞系列课程-10 Trajectory inference analysis of scRNA-seq data - 简书[2]
有时候,轨迹分析可以帮助我们发现细胞不同状态变化而产生的微小差异,比如细胞在分化过程中,某些基因可能被激活,而某些基因可能被沉默。这时候,这些细胞可能很难描述它们的差异,它们并不像pbmc 中的细胞那样有成熟的分化,和明显的基因与生物上的差异,因此轨迹分析可以帮助解决这个问题。
ps:对于某些外部微小的扰动(perturbation),不也是一样吗?
我们可能无法直接得到所有的group 都可以像右图那样:
再加上可能这个perturbation 是连续性的。
我们可以直接使用splatSimulate
中的method 参数进行简单设置:
sim.paths <- splatSimulate(de.prob = 0.2, nGenes = 1000, method = "paths",
verbose = FALSE, batchCells = rep(200,2))
而如果是进一步细节设置,同样对应于分组特别的函数splatSimulateGroups
,轨迹数据生成也需要使用特别函数splatSimulatePaths
,path.from 参数可以创建一个线性(linear)或树状(branch)的模型:
params <- newSplatParams()
sim1 <- splatSimulatePaths(params, batchCells = 200,
group.prob = c(0.25, 0.25, 0.25, 0.25),
de.prob = 0.5, de.facLoc = 0.2,
path.from = c(0, 1, 2, 3),
seed = 1,
verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
p1 <- plotPCA(sim1, colour_by = "Group") + ggtitle("Linear paths")
# Branching path
sim2 <- splatSimulatePaths(params, batchCells = 200,
group.prob = c(0.25, 0.25, 0.25, 0.25),
de.prob = 0.5, de.facLoc = 0.2,
path.from = c(0, 1, 1, 3),
seed = 1,
verbose = FALSE)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)
p2 <- plotPCA(sim2, colour_by = "Group") + ggtitle("Branching path")
cowplot::plot_grid(p1,p2)
比如右图中的 2、3就分叉了,因为参数中设定了相同的数字。
如果这种group 是连续变化的,在splat 中定义为step:
Interpolation is then used to create a series of steps between the start and end points. This parameter controls the number of steps along a path and therefore how discrete or smooth it is.
并不是discrete 而是smooth 的变化。
数目越大,数据集连续变化的程度也就越大:
# Few steps
sim1 <- splatSimulatePaths(params.groups, path.nSteps = 3,
de.prob = 0.5, de.facLoc = 0.2, verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
plotPCA(sim1, colour_by = "Step") + ggtitle("Few steps")
# Lots of steps
sim2 <- splatSimulatePaths(params.groups, path.nSteps = 1000,
de.prob = 0.5, de.facLoc = 0.2, verbose = FALSE)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)
plotPCA(sim2, colour_by = "Step") + ggtitle("Lots of steps")
通常来说,不同step 之间的细胞数目是均匀的,我们也可以人为改变这种分布情况。比如我们想要模拟一个情况:存在少量的干细胞以及丰富的完全分化的细胞。
Setting path.skew to 0 will mean that all cells come from the end point while higher values up to 1 will skew them towards the start point.
数值越大靠近起点,越小靠近终点。
path.nonlinearProb
控制基因表达为非线性情况的概率,因为可能某个基因在不同的step 下,并非是线性的。path.sigmaFac
:
Non-linear changes along a path are achieved by building a Brownian bridge between the two end points. A Brownian bridge is Brownian motion controlled in such a way that the end points are fixed. The path.sigmaFac parameter controls how extreme each step in the Brownian motion is and therefore how much the interpolation differs from a linear path. ps:个人觉得path.sigmaFac 是用来设置非线性变化的程度的参数。
[1]
Splat simulation parameters (bioconductor.org): https://bioconductor.org/packages/devel/bioc/vignettes/splatter/inst/doc/splat_params.html
[2]
单细胞系列课程-10 Trajectory inference analysis of scRNA-seq data - 简书: https://www.jianshu.com/p/83c532234dc0