Bioinfo03-模拟单细胞数据SplatParams详细参数介绍

北野茶缸子

发布于 2022-05-19 11:25:30

6030

发布于 2022-05-19 11:25:30

文章被收录于专栏：北野茶缸子的专栏

参考：
- Splat simulation parameters (bioconductor.org)[1]

前言

在[[02-用splatter模拟单细胞数据]]中，我们提过SplatParams 参数对象中的参数：

均值：Mean parameters are estimated by fitting a gamma distribution to the mean expression levels.
Library size：Library size parameters are estimated by fitting a log-normal distribution to the library sizes.（个人理解为counts 计数的文库大小）
异常表达值：Expression outlier parameters are estimated by determining the number of outliers and fitting a log-normal distribution to their difference from the median.
BCV parameters are estimated using the estimateDisp function from the edgeR package. （混杂变量）
Dropout parameters are estimated by checking if dropout is present and fitting a logistic function to the relationship between mean expression and proportion of zeros.

这一节来详细介绍一下。

简要说明

Splat 模型过程如下：

不同的参数数值对应不同的模型估计。

The first stage generates a mean expression level for each gene. These are originally chosen from a Gamma distribution. For some genes that are selected to be outliers with high expression a factor is generated from a log-normal distribution. These factors are then multiplied by the median gene mean to create new means for those genes.
The next stage incorporates variation in the counts per cell. An expected library size (total counts) is chosen for each cell from a log-normal distribution. The library sizes are then used to scale the gene means for each cell, resulting in a range a counts per cell in the simulated dataset. The gene means are then further adjusted to enforce a relationship between the mean expression level and the variability.
The final cell by gene matrix of gene means is then used to generate a count matrix using a Poisson distribution. The result is a synthetic dataset consisting of counts from a Gamma-Poisson (or negative-binomial) distribution. An additional optional step can be used to replicate a “dropout” effect. A probability of dropout is generated using a logistic function based on the underlying mean expression level. A Bernoulli distribution is then used to create a dropout matrix which sets some of the generated counts to zero.

另外在官方文档中：

The simulation involves the following steps:
Set up simulation object
Simulate library sizes
Simulate gene means
Simulate groups/paths
Simulate BCV adjusted cell means
Simulate true counts
Simulate dropout
Create final dataset

不过按照我的个人理解，就是首先生成各个基因的分布，接下来选择一些基因作为异常值。接着按照分组，批次等，对某些细胞的特定基因增加或减少一定的数值。

使用就行，暂时把它们当成黑箱好了。

参数设置

比如splat默认值创建的信息：

params <- newSplatParams()
params
#> A Params object of class SplatParams 
#> Parameters can be (estimable) or [not estimable], 'Default' or  'NOT DEFAULT' 
#> Secondary parameters are usually set during simulation
#> 
#> Global: 
#> (Genes)  (Cells)   [SEED] 
#>   10000      100   704180 
#> 
#> 29 additional parameters 
#> 
#> Batches: 
#>     [Batches]  [Batch Cells]     [Location]        [Scale]       [Remove] 
#>             1            100            0.1            0.1          FALSE 
#> 
#> Mean: 
#>  (Rate)  (Shape) 
#>     0.3      0.6 
#> 
#> Library size: 
#> (Location)     (Scale)      (Norm) 
#>         11         0.2       FALSE 
#> 
#> Exprs outliers: 
#> (Probability)     (Location)        (Scale) 
#>          0.05              4            0.5 
#> 
#> Groups: 
#>      [Groups]  [Group Probs] 
#>             1              1 
#> 
#> Diff expr: 
#> [Probability]    [Down Prob]     [Location]        [Scale] 
#>           0.1            0.5            0.1            0.4 
#> 
#> BCV: 
#> (Common Disp)          (DoF) 
#>           0.1             60 
#> 
#> Dropout: 
#>     [Type]  (Midpoint)     (Shape) 
#>       none           0          -1 
#> 
#> Paths: 
#>         [From]         [Steps]          [Skew]    [Non-linear]  [Sigma Factor] 
#>              0             100             0.5             0.1             0.8

参数介绍

基本参数

nGenes 数据集包含的基因数；
nCells 细胞数;
seed 随机种子；

批次参数

batchCells，对应各个batch 中的细胞数目；
batch.rmEffect 布尔值，移除batch 效应；
batch.facLoc 与batch.facScale 用来调控batch 大小，越大batch 效应越强，文档解释为：
- fac.loc: Location parameter for the log-normal distribution.
- fac.scale: Scale factor for the log-normal distribution.

set.seed(1)
sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)
params <- splatEstimate(sce)
sim1 <- splatSimulate(params, batchCells = c(100, 100),
                      batch.facLoc = 0.05, batch.facScale = 0.05,
                      verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
sim2 <- splatSimulate(params, batchCells = c(100, 100),
                      batch.facLoc = 0.5, batch.facScale = 0.5,
                      verbose = FALSE)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)

p1 <- plotPCA(sim1, colour_by = "Batch") + ggtitle("Small batch effects")
p2 <- plotPCA(sim2, colour_by = "Batch") + ggtitle("Big batch effects")
cowplot::plot_grid(p1,p2)

细胞与基因相关参数

mean.shape 与mean.rate，作者这里说明这个参数最好使用真实数据中估计的结果，而非自定设定；
类似batch 中，library 也有对应的：
- lib.loc will lead to more counts per cell and increasing
- lib.scale will result in more variability in the counts per cell.
lib.norm 布尔值，T 将用正态分布替代log 正态分布；
out.prob 异常表达值的基因，控制比例；

# Lots of outliers
sim2 <- splatSimulate(out.prob = 0.2, verbose = FALSE)
ggplot(as.data.frame(rowData(sim2)),
       aes(x = log10(GeneMean), fill = OutlierFactor != 1)) +
    geom_histogram(bins = 100) +
    ggtitle("Lots of outliers")

分组相关参数

group.prob 每个组别细胞的比例；
de.prob 控制差异基因的比例：需要注意的是，如果需要控制de.prob 及不同组别的基因差异比例，需要使用函数splatSimulateGroups：

set.seed(1)
sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)
params <- splatEstimate(sce)
sim1 <- splatSimulateGroups(params, batchCells = 200,
                      group.prob = c(0.5, 0.5),
                      de.prob = 0.01,
                      verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
sim2 <- splatSimulateGroups(params, batchCells = 200,
                      group.prob = c(0.5, 0.5),
                      de.prob = 0.4,
                      verbose = FALSE)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)

p1 <- plotPCA(sim1, colour_by = "Group") + ggtitle("Small DE")
p2 <- plotPCA(sim2, colour_by = "Group") + ggtitle("Big DE")
cowplot::plot_grid(p1,p2)

在差异基因比例较小时，不同的group 并不能很好地分开。

de.downProb 差异基因中，下调的基因比例；
de.facLoc 差异基因的强度，推测de.facScale 也对应更大的差异基因间变化的程度（文档中并没有解释）；

我们还可以设定多个group，以及对应每个group 独立的参数，比如多个差异基因比例，多个差异基因变化强度：

# Combination of everything
sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)
params <- splatEstimate(sce)
sim3 <- splatSimulateGroups(params,
                            group.prob = c(0.05, 0.2, 0.2, 0.2, 0.35),
                            de.prob = c(0.3, 0.1, 0.2, 0.01, 0.1),
                            de.downProb = c(0.1, 0.4, 0.9, 0.6, 0.5),
                            de.facLoc = c(0.6, 0.1, 0.1, 0.01, 0.2),
                            de.facScale = c(0.1, 0.4, 0.2, 0.5, 0.4),
                            verbose = FALSE)
sim3 <- logNormCounts(sim3)
sim3 <- runPCA(sim3)
plotPCA(sim3, colour_by = "Group") +
  labs(title = "Different DE factors",
       caption = paste(
         "Group 1 is small with many very up-regulated DE genes,",
         "Group 2 has the default DE parameters,\n",
         "Group 3 has many down-regulated DE genes,",
         "Group 4 has very few DE genes,",
         "Group 5 is large with moderate DE factors")
  )

混杂变量与dropout参数

bcv.common 数值越大，混杂变量影响就越大；
bcv.df BCV inverse chi-squared distribution 使用的自由度大小，会影响基因变化的程度受到表达均值的影响；

sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)
params <- splatEstimate(sce)
sim1 <- splatSimulateGroups(params, batchCells = 200,
                      group.prob = c(0.5, 0.5),
                      de.prob = 0.4,
                      verbose = FALSE, bcv.common = 0.1,
                      bcv.df = 1)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
sim2 <- splatSimulateGroups(params, batchCells = 200,
                      group.prob = c(0.5, 0.5),
                      de.prob = 0.4,
                      verbose = FALSE, bcv.common = 0.1,
                      bcv.df = 20)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)

p1 <- plotPCA(sim1, colour_by = "Group") + ggtitle("Small bcv.df")
p2 <- plotPCA(sim2, colour_by = "Group") + ggtitle("Big bcv.df")
cowplot::plot_grid(p1,p2)

从试验看，degree of freedom 越大，混杂变量的影响越小：

dropout ：

dropouts, where the data only captures a small fraction of the transcriptome of each cell.

ps：个人理解是有些基因feature 可能在某些细胞类型中完全检测不到，或检测水平非常低。有篇文献，详细的讨论了相关的内容：PMC7054558 （Embracing the dropouts in single-cell RNA-seq analysis）

dropout.mid 与dropout.shape控制某个细胞中zero 计数的比例：
- The dropout.mid parameter control the point at which the probability is equal to 0.5
- and the dropout.shape controls how it changes with increasing expression. 这两个参数需要和下面的参数相关，比如group 模式，则mid 与shape 的向量长度就是group 的数目。
dropout.type 设置dropout 情节：
- Setting it to "none" means no dropout,
- “experiment” is global dropout using the same set of parameters for every cell,
- "batch" uses the same parameters for every cell in the same batch,
- "group" uses the same parameters for every cell in the same group and
- "cell" uses a different set of parameters for every cell. 这个参数必须首先设置dropout.mid 与dropout.shape 才可以被设置。

轨迹数据

参考：单细胞系列课程-10 Trajectory inference analysis of scRNA-seq data - 简书[2]

有时候，轨迹分析可以帮助我们发现细胞不同状态变化而产生的微小差异，比如细胞在分化过程中，某些基因可能被激活，而某些基因可能被沉默。这时候，这些细胞可能很难描述它们的差异，它们并不像pbmc 中的细胞那样有成熟的分化，和明显的基因与生物上的差异，因此轨迹分析可以帮助解决这个问题。

ps：对于某些外部微小的扰动（perturbation），不也是一样吗？

我们可能无法直接得到所有的group 都可以像右图那样：

再加上可能这个perturbation 是连续性的。

path

我们可以直接使用splatSimulate中的method 参数进行简单设置：

sim.paths <- splatSimulate(de.prob = 0.2, nGenes = 1000, method = "paths",
                           verbose = FALSE, batchCells = rep(200,2))

而如果是进一步细节设置，同样对应于分组特别的函数splatSimulateGroups，轨迹数据生成也需要使用特别函数splatSimulatePaths，path.from 参数可以创建一个线性（linear）或树状（branch）的模型：

params <- newSplatParams()
sim1 <- splatSimulatePaths(params, batchCells = 200,
                           group.prob = c(0.25, 0.25, 0.25, 0.25),
                           de.prob = 0.5, de.facLoc = 0.2,
                           path.from = c(0, 1, 2, 3),
                           seed = 1,
                           verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
p1 <- plotPCA(sim1, colour_by = "Group") + ggtitle("Linear paths")
# Branching path
sim2 <- splatSimulatePaths(params, batchCells = 200,
                           group.prob = c(0.25, 0.25, 0.25, 0.25),
                           de.prob = 0.5, de.facLoc = 0.2,
                           path.from = c(0, 1, 1, 3),
                           seed = 1,
                           verbose = FALSE)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)
p2 <- plotPCA(sim2, colour_by = "Group") + ggtitle("Branching path")
cowplot::plot_grid(p1,p2)

比如右图中的 2、3就分叉了，因为参数中设定了相同的数字。

step

如果这种group 是连续变化的，在splat 中定义为step：

Interpolation is then used to create a series of steps between the start and end points. This parameter controls the number of steps along a path and therefore how discrete or smooth it is.

并不是discrete 而是smooth 的变化。

数目越大，数据集连续变化的程度也就越大：

# Few steps
sim1 <- splatSimulatePaths(params.groups, path.nSteps = 3,
                           de.prob = 0.5, de.facLoc = 0.2, verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
plotPCA(sim1, colour_by = "Step") + ggtitle("Few steps")

# Lots of steps
sim2 <- splatSimulatePaths(params.groups, path.nSteps = 1000,
                           de.prob = 0.5, de.facLoc = 0.2, verbose = FALSE)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)
plotPCA(sim2, colour_by = "Step") + ggtitle("Lots of steps")

通常来说，不同step 之间的细胞数目是均匀的，我们也可以人为改变这种分布情况。比如我们想要模拟一个情况：存在少量的干细胞以及丰富的完全分化的细胞。

Setting path.skew to 0 will mean that all cells come from the end point while higher values up to 1 will skew them towards the start point.

数值越大靠近起点，越小靠近终点。

path.nonlinearProb 控制基因表达为非线性情况的概率，因为可能某个基因在不同的step 下，并非是线性的。path.sigmaFac：

Non-linear changes along a path are achieved by building a Brownian bridge between the two end points. A Brownian bridge is Brownian motion controlled in such a way that the end points are fixed. The path.sigmaFac parameter controls how extreme each step in the Brownian motion is and therefore how much the interpolation differs from a linear path. ps：个人觉得path.sigmaFac 是用来设置非线性变化的程度的参数。

参考资料

[1]

Splat simulation parameters (bioconductor.org): https://bioconductor.org/packages/devel/bioc/vignettes/splatter/inst/doc/splat_params.html

[2]

单细胞系列课程-10 Trajectory inference analysis of scRNA-seq data - 简书: https://www.jianshu.com/p/83c532234dc0

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2022-04-13，如有侵权请联系 cloudcommunity@tencent.com 删除

编程算法

批量计算

linux

本文分享自北野茶缸子微信公众号，前往查看

如有侵权，请联系 cloudcommunity@tencent.com 删除。

本文参与腾讯云自媒体同步曝光计划，欢迎热爱写作的你一起参与！

编程算法

批量计算

linux

登录后参与评论

0 条评论

热度