最近看单细胞算法文章看到了一个包,可以很方便的根据需求生成各种模拟数据,制作ground truth 岂不是很方便?
试试就试试。
直接安装即可,很容易:
BiocManager::install("splatter")
p_load(splatter, scuttle, scater)
y1s1,这个包的logo 还挺好看的:
Splatter (splat)是如何进行单细胞数据估计的呢?
splat 模型的核心在于,利用gamma-Poisson 借助cell counts 矩阵来生成基因表达数据。
The core of the Splat model is a gamma-Poisson distribution used to generate a gene by cell matrix of counts. Mean expression levels for each gene are simulated from a gamma distribution[4] and the Biological Coefficient of Variation is used to enforce a mean-variance trend before counts are simulated from a Poisson distribution[5]. Splat also allows you to simulate expression outlier genes (genes with mean expression outside the gamma distribution) and dropout (random knock out of counts based on mean expression). Each cell is given an expected library size (simulated from a log-normal distribution) that makes it easier to match to a given dataset.
所有splat 模拟相关的参数都存储在SplatParams
对象中。
params <- newSimpleParams()
params <- newSimpleParams(nGenes = 200, nCells = 10)
> params
A Params object of class SimpleParams
Parameters can be (estimable) or [not estimable], 'Default' or 'NOT DEFAULT'
Secondary parameters are usually set during simulation
Global:
(GENES) (CELLS) [Seed]
200 10 977126
3 additional parameters
Mean:
(Rate) (Shape)
0.3 0.4
Counts:
[Dispersion]
0.1
我们姑且可以将SplatParams
对象理解为splat 模型用于创建模拟单细胞数据的一个参数对象,其包含了这个单细胞模型的全部参数信息,除了基因数、细胞数这些基本信息外,还包括如均值、批次、混杂因素、异常值等等信息。详细的内容可以参考[[SplatParams详细参数介绍]]
访问:
getParam(params, "nGenes")
#> [1] 10000
修改:
# Set multiple parameters at once (using a list)
params <- setParams(params, update = list(nGenes = 8000, mean.rate = 0.5))
# Extract multiple parameters as a list
getParams(params, c("nGenes", "mean.rate", "mean.shape"))
#> $nGenes
#> [1] 8000
#>
#> $mean.rate
#> [1] 0.5
#>
#> $mean.shape
#> [1] 0.6
# Set multiple parameters at once (using additional arguments)
params <- setParams(params, mean.shape = 0.5, de.prob = 0.2)
params
#> A Params object of class SplatParams
#> Parameters can be (estimable) or [not estimable], 'Default' or 'NOT DEFAULT'
#> Secondary parameters are usually set during simulation
#>
#> Global:
#> (GENES) (Cells) [SEED]
#> 8000 100 81261
#>
#> 29 additional parameters
#>
#> Batches:
#> [Batches] [Batch Cells] [Location] [Scale] [Remove]
#> 1 100 0.1 0.1 FALSE
#>
#> Mean:
#> (RATE) (SHAPE)
#> 0.5 0.5
#>
#> Library size:
#> (Location) (Scale) (Norm)
#> 11 0.2 FALSE
#>
#> Exprs outliers:
#> (Probability) (Location) (Scale)
#> 0.05 4 0.5
#>
#> Groups:
#> [Groups] [Group Probs]
#> 1 1
#>
#> Diff expr:
#> [PROBABILITY] [Down Prob] [Location] [Scale]
#> 0.2 0.5 0.1 0.4
#>
#> BCV:
#> (Common Disp) (DoF)
#> 0.1 60
#>
#> Dropout:
#> [Type] (Midpoint) (Shape)
#> none 0 -1
#>
#> Paths:
#> [From] [Steps] [Skew] [Non-linear] [Sigma Factor]
#> 0 100 0.5 0.1 0.8
splat 还可以让我们直接从真实的单细胞数据,singlecellexperiment(sce) 对象中进行参数估计。
创建仿真数据:
set.seed(1)
sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)
> sce
class: SingleCellExperiment
dim: 2000 200
metadata(0):
assays(1): counts
rownames(2000): Gene_0001 Gene_0002 ... Gene_1999 Gene_2000
rowData names(0):
colnames(200): Cell_001 Cell_002 ... Cell_199 Cell_200
colData names(3): Mutation_Status Cell_Cycle Treatment
reducedDimNames(0):
altExpNames(1): Spikes
splat 估计:
> params <- splatEstimate(sce)
NOTE: Library sizes have been found to be normally distributed instead of log-normal. You may want to check this is correct.
> params
A Params object of class SplatParams
Parameters can be (estimable) or [not estimable], 'Default' or 'NOT DEFAULT'
Secondary parameters are usually set during simulation
Global:
(GENES) (CELLS) [Seed]
2000 200 977126
29 additional parameters
Batches:
[BATCHES] [BATCH CELLS] [Location] [Scale]
1 200 0.1 0.1
[Remove]
FALSE
Mean:
(RATE) (SHAPE)
0.002962686167343 0.496997730070513
Library size:
(LOCATION) (SCALE) (NORM)
357331.235 11607.2332293176 TRUE
Exprs outliers:
(PROBABILITY) (Location) (Scale)
0 4 0.5
Groups:
[Groups] [Group Probs]
1 1
Diff expr:
[Probability] [Down Prob] [Location] [Scale]
0.1 0.5 0.1 0.4
BCV:
(COMMON DISP) (DOF)
0.752043426792845 11211.8933424157
Dropout:
[Type] (MIDPOINT) (SHAPE)
none 2.71153535179343 -1.37209356733765
Paths:
[From] [Steps] [Skew] [Non-linear]
0 100 0.5 0.1
[Sigma Factor]
0.8
splat 从真实(仿真)数据中参数估计过程包括以下步骤:
详细信息可以参见:Splat simulation parameters (bioconductor.org)[6]
配置好了SplatParams对象(设定完毕用于仿真的参数结果),就可以利用这个参数对象进行模拟了,也就是函数splatSimulate
。
> sim <- splatSimulate(params, nGenes = 1000,
batchCells = rep(100,10))
Getting parameters...
Creating simulation object...
Simulating library sizes...
Simulating gene means...
Simulating BCV...
Simulating counts...
Simulating dropout (if needed)...
Sparsifying assays...
Automatically converting to sparse matrices, threshold = 0.95
Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
Skipping 'BCV': estimated sparse size 1.5 * dense matrix
Skipping 'CellMeans': estimated sparse size 1.5 * dense matrix
Skipping 'TrueCounts': estimated sparse size 2.79 * dense matrix
Skipping 'counts': estimated sparse size 2.79 * dense matrix
Done!
> sim
class: SingleCellExperiment
dim: 1000 200
metadata(1): Params
assays(6): BatchCellMeans BaseCellMeans ... TrueCounts
counts
rownames(1000): Gene1 Gene2 ... Gene999 Gene1000
rowData names(4): Gene BaseGeneMean OutlierFactor GeneMean
colnames(200): Cell1 Cell2 ... Cell199 Cell200
colData names(3): Cell Batch ExpLibSize
reducedDimNames(0):
altExpNames(0):
可以看到,我们创建了一个200x1000 的单细胞对象。
> head(rowData(sim))
DataFrame with 6 rows and 4 columns
Gene BaseGeneMean OutlierFactor GeneMean
<character> <numeric> <numeric> <numeric>
Gene1 Gene1 367.75129 1 367.75129
Gene2 Gene2 183.59043 1 183.59043
Gene3 Gene3 460.64653 1 460.64653
Gene4 Gene4 5.16081 1 5.16081
Gene5 Gene5 90.97209 1 90.97209
Gene6 Gene6 81.68033 1 81.68033
> head(colData(sim))
DataFrame with 6 rows and 3 columns
Cell Batch ExpLibSize
<character> <character> <numeric>
Cell1 Cell1 Batch1 364054
Cell2 Cell2 Batch1 337135
Cell3 Cell3 Batch1 353075
Cell4 Cell4 Batch1 354925
Cell5 Cell5 Batch1 357749
Cell6 Cell6 Batch1 356384
我们还可以可视化一下,这里参考包CiteFuse中的visualiseDim
函数:
sim <- logNormCounts(sim)
# Plot PCA
sim <- runPCA(sim)
visualiseDim(sim,
dimNames = "PCA",
colour_by = "Batch")
可以看到,splatSimulate 函数输出的也就是一个sce 对象,我们可以非常方便的对其进行后续的单细胞分析。
该函数会输出以下信息:
其实,也就对应了上面提到的SplatParams对象中设计的相关参数。
除了上述单类型的细胞外(single),splatSimulate
函数还可以用于生成多个单细胞分群,或是轨迹数据:
which simulation method to use. Options are "single" which produces a single population, "groups" which produces distinct groups (eg. cell types), or "paths" which selects cells from continuous trajectories (eg. differentiation processes).
类似batchCells
参数(并不能直接指定batch数目),可以用来控制batch 的数目及各个batch 的大小。group.prob
也可以用来控制各个group 的比例及group 数目:
# group
sim.groups <- splatSimulate(group.prob = c(0.5, 0.5), method = "groups",
verbose = FALSE, batchCells = rep(1000,2))
sim.groups <- logNormCounts(sim.groups)
sim.groups <- runPCA(sim.groups)
visualiseSce(sim.groups,
dimNames = "PCA",
colour_by = "Group", shape_by = "Batch")
对应method 参数,修改为paths
模式:
# path
sim.paths <- splatSimulate(de.prob = 0.2, nGenes = 1000, method = "paths",
verbose = FALSE, batchCells = rep(200,2))
sim.paths <- logNormCounts(sim.paths)
sim.paths <- runPCA(sim.paths)
tmp.list$sim.paths.df <- colData(sim.paths)
tmp.list$sim.paths.df <- cbind(tmp.list$sim.paths.df,
reducedDim(sim.paths, "PCA")[,1:2])
tmp.list$sim.paths.df <- as.data.frame(tmp.list$sim.paths.df)
ggplot(tmp.list$sim.paths.df) +
geom_point(
aes(PC1, PC2, color = Step, shape = Batch)
) + viridis::scale_colour_viridis()
其实我上面的数据中,已经通过batchCells
指定批次了。
可以看到,pca 的方法,成功的捕获了模拟数据中分组的差异(PC1),以及由于批次存在的差异(PC2),而且很明显地分组的差异是要大于批次产生的差异的,这也和大部分的真实情况相符:
其实splat 方法也是一个套件:
Each of the Splatter simulation methods has it’s own convenience function. To simulate a single population use splatSimulateSingle() (equivalent to splatSimulate(method = "single")), to simulate groups use splatSimulateGroups() (equivalent to splatSimulate(method = "groups")) or to simulate paths use splatSimulatePaths() (equivalent to splatSimulate(method = "paths")).
足足15 个套装方法:
listSims()
#> Splatter currently contains 15 simulations
#>
#> Splat (splat)
#> DOI: 10.1186/s13059-017-1305-0 GitHub: Oshlack/splatter Dependencies:
#> The Splat simulation generates means from a gamma distribution, adjusts them for BCV and generates counts from a gamma-poisson. Dropout and batch effects can be optionally added.
#>
#> Splat Single (splatSingle)
#> DOI: 10.1186/s13059-017-1305-0 GitHub: Oshlack/splatter Dependencies:
#> The Splat simulation with a single population.
#>
#> Splat Groups (splatGroups)
#> DOI: 10.1186/s13059-017-1305-0 GitHub: Oshlack/splatter Dependencies:
#> The Splat simulation with multiple groups. Each group can have it's own differential expression probability and fold change distribution.
#>
#> Splat Paths (splatPaths)
#> DOI: 10.1186/s13059-017-1305-0 GitHub: Oshlack/splatter Dependencies:
#> The Splat simulation with differentiation paths. Each path can have it's own length, skew and probability. Genes can change in non-linear ways.
#>
#> Kersplat (kersplat)
#> DOI: GitHub: Oshlack/splatter Dependencies: scuttle, igraph
#> The Kersplat simulation extends the Splat model by adding a gene network, more complex cell structure, doublets and empty cells (Experimental).
#>
#> splatPop (splatPop)
#> DOI: 10.1186/s13059-021-02546-1 GitHub: Oshlack/splatter Dependencies: VariantAnnotation, preprocessCore
#> The splatPop simulation enables splat simulations to be generated for multiple individuals in a population, accounting for correlation structure by simulating expression quantitative trait loci (eQTL).
#>
#> Simple (simple)
#> DOI: 10.1186/s13059-017-1305-0 GitHub: Oshlack/splatter Dependencies:
#> A simple simulation with gamma means and negative binomial counts.
#>
#> Lun (lun)
#> DOI: 10.1186/s13059-016-0947-7 GitHub: MarioniLab/Deconvolution2016 Dependencies:
#> Gamma distributed means and negative binomial counts. Cells are given a size factor and differential expression can be simulated with fixed fold changes.
#>
#> Lun 2 (lun2)
#> DOI: 10.1093/biostatistics/kxw055 GitHub: MarioniLab/PlateEffects2016 Dependencies: scran, scuttle, lme4, pscl, limSolve
#> Negative binomial counts where the means and dispersions have been sampled from a real dataset. The core feature of the Lun 2 simulation is the addition of plate effects. Differential expression can be added between two groups of plates and optionally a zero-inflated negative-binomial can be used.
#>
#> scDD (scDD)
#> DOI: 10.1186/s13059-016-1077-y GitHub: kdkorthauer/scDD Dependencies: scDD
#> The scDD simulation samples a given dataset and can simulate differentially expressed and differentially distributed genes between two conditions.
#>
#> BASiCS (BASiCS)
#> DOI: 10.1371/journal.pcbi.1004333 GitHub: catavallejos/BASiCS Dependencies: BASiCS
#> The BASiCS simulation is based on a bayesian model used to deconvolve biological and technical variation and includes spike-ins and batch effects.
#>
#> mfa (mfa)
#> DOI: 10.12688/wellcomeopenres.11087.1 GitHub: kieranrcampbell/mfa Dependencies: mfa
#> The mfa simulation produces a bifurcating pseudotime trajectory. This can optionally include genes with transient changes in expression and added dropout.
#>
#> PhenoPath (pheno)
#> DOI: 10.1038/s41467-018-04696-6 GitHub: kieranrcampbell/phenopath Dependencies: phenopath
#> The PhenoPath simulation produces a pseudotime trajectory with different types of genes.
#>
#> ZINB-WaVE (zinb)
#> DOI: 10.1038/s41467-017-02554-5 GitHub: drisso/zinbwave Dependencies: zinbwave
#> The ZINB-WaVE simulation simulates counts from a sophisticated zero-inflated negative-binomial distribution including cell and gene-level covariates.
#>
#> SparseDC (sparseDC)
#> DOI: 10.1093/nar/gkx1113 GitHub: cran/SparseDC Dependencies: SparseDC
#> The SparseDC simulation simulates a set of clusters across two conditions, where some clusters may be present in only one condition.
比如这个scDD
就考虑了不同的conditions
的细胞:
The scDD simulation samples a given dataset and can simulate differentially expressed and differentially distributed genes between two conditions.
splat 提供了观察各个单细胞数据集的方法。
compareSCEs
函数接受列表对象:
set.seed(1)
sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)
params <- splatEstimate(sce)
sim1 <- splatSimulate(params, nGenes = 2000)
sim2 <-splatSimulate(nGenes = 2000)
sim3 <- simpleSimulate(nGenes = 2000, verbose = FALSE)
comparison <- compareSCEs(list(sce = sce, sim1 = sim1,
sim2 = sim2, sim3 = sim3))
比如我这里比较:
> head(comparison$ColData)
Dataset sum detected total PctZero
Cell_001 sce 392907 1502 402665 24.90
Cell_002 sce 398904 1509 405090 24.55
Cell_003 sce 358855 1503 365409 24.85
Cell_004 sce 378909 1527 386163 23.65
Cell_005 sce 384063 1532 389848 23.40
Cell_006 sce 370596 1522 378267 23.90
> head(comparison$RowData)
Dataset mean detected MeanCounts VarCounts CVCounts
Gene_0001 sce 11.460 48.5 11.460 950.1994 2.689817
Gene_0002 sce 78.560 92.0 78.560 9041.7049 1.210385
Gene_0003 sce 21.505 55.5 21.505 2641.8090 2.390074
Gene_0004 sce 20.780 57.0 20.780 1827.4890 2.057225
Gene_0005 sce 18.290 50.0 18.290 1979.6140 2.432633
Gene_0006 sce 191.455 99.5 191.455 46423.7367 1.125391
MedCounts MADCounts MeanCPM VarCPM CVCPM
Gene_0001 0.0 0.0000 29.69738 6420.309 2.698111
Gene_0002 46.0 63.7518 205.45104 62237.453 1.214276
Gene_0003 1.0 1.4826 56.01672 17949.158 2.391687
Gene_0004 2.0 2.9652 54.13480 12419.975 2.058656
Gene_0005 0.5 0.7413 47.92973 13753.695 2.446835
Gene_0006 119.0 126.7623 500.52564 321714.428 1.133206
MedCPM MADCPM MeanLogCPM VarLogCPM CVLogCPM
Gene_0001 0.000000 0.000000 2.183822 6.987662 1.2104552
Gene_0002 119.251041 163.890597 6.125446 7.454117 0.4457182
Gene_0003 2.607953 3.866551 2.792780 9.128253 1.0818252
Gene_0004 5.085582 7.539884 2.973293 9.256661 1.0232681
Gene_0005 1.231218 1.825403 2.613443 9.025811 1.1495559
Gene_0006 308.946214 330.608840 8.044234 3.727668 0.2400125
MedLogCPM MADLogCPM PctZero
Gene_0001 0.0000000 0.000000 51.5
Gene_0002 6.9098774 2.455602 8.0
Gene_0003 1.8511790 2.744558 44.5
Gene_0004 2.6051412 3.862382 43.0
Gene_0005 0.8958936 1.328252 50.0
Gene_0006 8.2758656 1.782199 0.5
> table(comparison$RowData$Dataset)
sce sim1 sim2 sim3
2000 2000 2000 2000
> table(comparison$ColData$Dataset)
sce sim1 sim2 sim3
200 200 100 100
详细统计每个datasets 基因与细胞的信息。以及多种绘图结果:
> names(comparison$Plots)
[1] "Means" "Variances" "MeanVar" "LibrarySizes"
[5] "ZerosGene" "ZerosCell" "MeanZeros" "VarGeneCor"
非常好看有木有,非常讲究的使用notched box plot:
使用函数diffSCEs。这里要求数据集的维度需要相同,因此重新配置。
并要求这个ref 的长度是一,所以是多比一啦。
Error in diffSCEs(list(sce = sce, sim1 = sim1, sim2 = sim2, sim3 = sim3), : Assertion on 'ref' failed: Must have length 1.
# compare some to ref
set.seed(1)
sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)
params <- splatEstimate(sce)
sim1 <- splatSimulate(params)
sim2 <-splatSimulate(nGenes = 2000, batchCells = 200)
sim3 <- simpleSimulate(nGenes = 2000, nCells = 200, verbose = FALSE)
difference <- diffSCEs(list(sce = sce, sim1 = sim1,
sim2 = sim2, sim3 = sim3), ref = "sce")
我们就可以和ref 进行比较了:
比如加上tpm 与fpkm 数据,对于sce 对象,直接通过scater 包的方法即可:
sim <- simpleSimulate(verbose = FALSE)
sim <- addGeneLengths(sim)
head(rowData(sim))
#> DataFrame with 6 rows and 3 columns
#> Gene GeneMean Length
#> <character> <numeric> <numeric>
#> Gene1 Gene1 0.5641399 917
#> Gene2 Gene2 0.0764411 765
#> Gene3 Gene3 2.6791742 5972
#> Gene4 Gene4 1.3782005 3491
#> Gene5 Gene5 4.0117653 15311
#> Gene6 Gene6 0.3536760 1190
tpm(sim) <- calculateTPM(sim, rowData(sim)$Length)
tpm(sim)[1:5, 1:5]
#> 5 x 5 sparse Matrix of class "dgCMatrix"
#> Cell1 Cell2 Cell3 Cell4 Cell5
#> Gene1 342.21897 . . 169.73637 170.06608
#> Gene2 . . . . .
#> Gene3 131.36922 . 187.68277 182.44101 78.34089
#> Gene4 89.89252 . 183.46630 89.17115 .
#> Gene5 30.74405 20.50798 83.66284 81.32623 40.74211
如果觉得splat 创建的sce 对象有的内容并不需要使用,可以删除一些metadata 或assay,以对对象大小进行压缩:
sim <- splatSimulate()
#> Getting parameters...
#> Creating simulation object...
#> Simulating library sizes...
#> Simulating gene means...
#> Simulating BCV...
#> Simulating counts...
#> Simulating dropout (if needed)...
#> Sparsifying assays...
#> Automatically converting to sparse matrices, threshold = 0.95
#> Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
#> Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
#> Skipping 'BCV': estimated sparse size 1.5 * dense matrix
#> Skipping 'CellMeans': estimated sparse size 1.49 * dense matrix
#> Skipping 'TrueCounts': estimated sparse size 1.65 * dense matrix
#> Skipping 'counts': estimated sparse size 1.65 * dense matrix
#> Done!
minimiseSCE(sim)
#> Minimising SingleCellExperiment...
#> Original size: 43.9 Mb
#> Removing all rowData columns
#> Removing all colData columns
#> Removing all metadata items
#> Keeping 1 assays: counts
#> Removing 5 assays: BatchCellMeans, BaseCellMeans, BCV, CellMeans, TrueCounts
#> Sparsifying assays...
#> Automatically converting to sparse matrices, threshold = 0.95
#> Skipping 'counts': estimated sparse size 1.65 * dense matrix
#> Minimised size: 5.3 Mb (12% of original)
#> class: SingleCellExperiment
#> dim: 10000 100
#> metadata(0):
#> assays(1): counts
#> rownames(10000): Gene1 Gene2 ... Gene9999 Gene10000
#> rowData names(0):
#> colnames(100): Cell1 Cell2 ... Cell99 Cell100
#> colData names(0):
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
minimiseSCE(sim, rowData.keep = "Gene", colData.keep = c("Cell", "Batch"),
metadata.keep = TRUE)
#> Minimising SingleCellExperiment...
#> Original size: 43.9 Mb
#> Keeping 1 rowData columns: Gene
#> Removing 3 rowData columns: BaseGeneMean, OutlierFactor, GeneMean
#> Keeping 2 colData columns: Cell, Batch
#> Removing 1 colData columns: ExpLibSize
#> Keeping 1 assays: counts
#> Removing 5 assays: BatchCellMeans, BaseCellMeans, BCV, CellMeans, TrueCounts
#> Sparsifying assays...
#> Automatically converting to sparse matrices, threshold = 0.95
#> Skipping 'counts': estimated sparse size 1.65 * dense matrix
#> Minimised size: 5.9 Mb (14% of original)
#> class: SingleCellExperiment
#> dim: 10000 100
#> metadata(1): Params
#> assays(1): counts
#> rownames(10000): Gene1 Gene2 ... Gene9999 Gene10000
#> rowData names(1): Gene
#> colnames(100): Cell1 Cell2 ... Cell99 Cell100
#> colData names(2): Cell Batch
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
默认会删除rowData(sce), colData(sce) and metadata(sce) 。
splat 还提供了组装比较结果中绘图内容的函数:
p1 <- makeCompPanel(comparison)
y1s1,拼的挺丑的,不展示了。
ps:不过对于多比一的图,似乎存在bug:
makeCompPanel(difference):Error in makeCompPanel(difference) : Assertion on 'comp' failed: Must have length 3, but has length 5.
个人感觉splat 套件还挺值得把玩的,后面看看可不可以个性化的调整各个sce 的batch magnitude 亦或是group 间的var 大小。
[1]
Introduction to Splatter (bioconductor.org): https://bioconductor.org/packages/devel/bioc/vignettes/splatter/inst/doc/splatter.html
[2]
Oshlack/splatter: Simple simulation of single-cell RNA sequencing data (github.com): https://github.com/Oshlack/splatter
[3]
splatPop: simulating single-cell data for populations: http://www.bioconductor.org/packages/devel/bioc/vignettes/splatter/inst/doc/splatPop.html#2_Quick_start
[4]
gamma distribution: https://en.wikipedia.org/wiki/Gamma_distribution
[5]
Poisson distribution: https://en.wikipedia.org/wiki/Poisson_distribution
[6]
Splat simulation parameters (bioconductor.org): https://bioconductor.org/packages/devel/bioc/vignettes/splatter/inst/doc/splat_params.html#27_Dropout_parameters