单细胞转录组3大R包之scater

生信技能树

发布于 2018-03-09 10:43:28

5.5K0

文章被收录于专栏：生信技能树生信技能树

scater 这个R包很强大，是McCarthy et al. 2017 发表的，包含的功能有：

Automated computation of QC metrics
Transcript quantification from read data with pseudo-alignment
Data format standardisation
Rich visualizations for exploratory analysis
Seamless integration into the Bioconductor universe
Simple normalisation methods

R包工作流程图

S4对象

主要是基于 SCESet 对象来进行下游分析，跟ExpressionSet对象类似，也是常见的3个组成：

exprs, a numeric matrix of expression values, where rows are features, and columns are cells
phenoData, an AnnotatedDataFrame object, where rows are cells, and columns are cell attributes (such as cell type, culture condition, day captured, etc.)
featureData, an AnnotatedDataFrame object, where rows are features (e.g. genes), and columns are feature attributes, such as biotype, gc content, etc.

主要就是读取scRNA上游分析处理得到的表达矩阵，加上每个样本的描述信息，形成矩阵之后。对样本进行过滤，然后对基因进行过滤。针对过滤后的表达矩阵进行各种分类的可视化。

HTML	R Script	An introduction to the scater package
HTML	R Script	Data visualisation methods in scater
HTML	R Script	Expression quantification and import
HTML	R Script	Quality control with scater
HTML	R Script	Transition from SCESet to SingleCellExperiment
PDF

测试数据

suppressPackageStartupMessages(library(scater))
data("sc_example_counts")
data("sc_example_cell_info") 

example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts), 
    colData = sc_example_cell_info
)

exprs(example_sce) <- log2(
    calculateCPM(example_sce, use.size.factors = FALSE) + 1
)

keep_feature <- rowSums(exprs(example_sce) > 0) > 0
example_sce <- example_sce[keep_feature,]

example_sce <- calculateQCMetrics(example_sce, 
                                  feature_controls = list(eg = 1:40))

#scater_gui(example_sce)

但是真的非常好用，所有的可视化都集中在了 scater_gui 这个函数产生的shiny网页里面：

plotScater: a plot method exists for SingleCellExperiment objects, which gives an overview of expression across cells.
plotQC: various methods are available for producing QC diagnostic plots.
plotPCA: produce a principal components plot for the cells.
plotTSNE: produce a t-distributed stochastic neighbour embedding (reduced dimension) plot for the cells.
plotDiffusionMap: produce a diffusion map (reduced dimension) plot for the cells.
plotMDS: produce a multi-dimensional scaling plot for the cells.
plotReducedDim: plot a reduced-dimension representation of the cells.
plotExpression: plot expression levels for a defined set of features.
plotPlatePosition: plot cells in their position on a plate, coloured by cell metadata and QC metrics or feature expression level.
plotColData: plot cell metadata and QC metrics.
plotRowData: plot feature metadata and QC metrics.

可以充分的探索自己的数据，随便看一个可视化函数的结果：

## ----plot-expression, eval=TRUE--------------------------------------------
plotExpression(example_sce, rownames(example_sce)[1:6],
               x = "Mutation_Status", exprs_values = "exprs", 
               colour = "Treatment")

详细的QC

做QC要结合上面的可视化步骤，所有没办法自动化，只能先可视化，肉眼分辨一下哪些样本或者基因数据是需要舍弃的。

library(knitr)
opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, dev = 'png')
library(ggplot2)
theme_set(theme_bw(12))

## ----quickstart-load-data, message=FALSE, warning=FALSE--------------------
suppressPackageStartupMessages(library(scater))
data("sc_example_counts")
data("sc_example_cell_info")

## ----quickstart-make-sce, results='hide'-----------------------------------
gene_df <- DataFrame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
example_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts), 
                                    colData = sc_example_cell_info, 
                                    rowData = gene_df)

example_sce <- normalise(example_sce)

## Warning in .local(object, ...): using library sizes as size factors

## ----quickstart-add-exprs, results='hide'----------------------------------
exprs(example_sce) <- log2(
    calculateCPM(example_sce, use.size.factors = FALSE) + 1)

## ----filter-no-exprs-------------------------------------------------------
keep_feature <- rowSums(exprs(example_sce) > 0) > 0
example_sce <- example_sce[keep_feature,]

example_sceset <- calculateQCMetrics(example_sce, feature_controls = list(eg = 1:40)) 


colnames(colData(example_sceset))

##  [1] "Cell"                                  
##  [2] "Mutation_Status"                       
##  [3] "Cell_Cycle"                            
##  [4] "Treatment"                             
##  [5] "total_features"                        
##  [6] "log10_total_features"                  
##  [7] "total_counts"                          
##  [8] "log10_total_counts"                    
##  [9] "pct_counts_top_50_features"            
## [10] "pct_counts_top_100_features"           
## [11] "pct_counts_top_200_features"           
## [12] "pct_counts_top_500_features"           
## [13] "total_features_endogenous"             
## [14] "log10_total_features_endogenous"       
## [15] "total_counts_endogenous"               
## [16] "log10_total_counts_endogenous"         
## [17] "pct_counts_endogenous"                 
## [18] "pct_counts_top_50_features_endogenous" 
## [19] "pct_counts_top_100_features_endogenous"
## [20] "pct_counts_top_200_features_endogenous"
## [21] "pct_counts_top_500_features_endogenous"
## [22] "total_features_feature_control"        
## [23] "log10_total_features_feature_control"  
## [24] "total_counts_feature_control"          
## [25] "log10_total_counts_feature_control"    
## [26] "pct_counts_feature_control"            
## [27] "total_features_eg"                     
## [28] "log10_total_features_eg"               
## [29] "total_counts_eg"                       
## [30] "log10_total_counts_eg"                 
## [31] "pct_counts_eg"                         
## [32] "is_cell_control"

colnames(rowData(example_sceset))

##  [1] "Gene"                  "is_feature_control"   
##  [3] "is_feature_control_eg" "mean_counts"          
##  [5] "log10_mean_counts"     "rank_counts"          
##  [7] "n_cells_counts"        "pct_dropout_counts"   
##  [9] "total_counts"          "log10_total_counts"

首先是基于样本的过滤，用 colData(object) 可以查看各个样本统计情况

total_counts: total number of counts for the cell (aka ‘library size’)
log10_total_counts: total_counts on the log10-scale
total_features: the number of features for the cell that have expression above the detection limit (default detection limit is zero)
filter_on_total_counts: would this cell be filtered out based on its log10-total_counts being (by default) more than 5 median absolute deviations from the median log10-total_counts for the dataset?
filter_on_total_features: would this cell be filtered out based on its total_features being (by default) more than 5 median absolute deviations from the median total_features for the dataset?
counts_feature_controls: total number of counts for the cell that come from (a set of user-defined) control features. Defaults to zero if no control features are indicated.
counts_endogenous_features: total number of counts for the cell that come from endogenous features (i.e. not control features). Defaults to total_counts if no control features are indicated.
log10_counts_feature_controls: total number of counts from control features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated.
log10_counts_endogenous_features: total number of counts from endogenous features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated.
n_detected_feature_controls: number of defined feature controls that have expression greater than the threshold defined in the object. *pct_counts_feature_controls: percentage of all counts that come from the defined control features. Defaults to zero if no control features are defined.

然后是基于基因的过滤，用 rowData(object) 可以查看各个基因统计情况

mean_exprs: the mean expression level of the gene/feature.
exprs_rank: the rank of the feature’s expression level in the cell.
total_feature_counts: the total number of counts mapped to that feature across all cells.
log10_total_feature_counts: total feature counts on the log10-scale.
pct_total_counts: the percentage of all counts that are accounted for by the counts mapping to the feature.
is_feature_control: is the feature a control feature? Default is FALSE unless control features are defined by the user.
n_cells_exprs: the number of cells for which the expression level of the feature is above the detection limit (default detection limit is zero).

scater一站式过滤低质量样本

scater包自己提供了一个基于PCA的QC标准，不需要自己根据文库大小，覆盖的基因数量，外源的ERCC spike-ins 含量以及线粒体DNA含量来进行人工过滤。

默认的筛选条件如下：

pct_counts_top100features
total_features
pct_counts_feature_controls
n_detected_feature_controls
log10_counts_endogenous_features
log10_counts_feature_controls

一站式QC函数如下：

dat_pca <- scater::plotPCA(dat_qc,
                  size_by = "total_features", 
                  shape_by = "use",
                  pca_data_input = "pdata",
                  detect_outliers = TRUE,
                  return_SCESet = TRUE)

还有更详细的教程，需要看

https://www.bioconductor.org/help/workflows/simpleSingleCell/
http://hemberg-lab.github.io/scRNA.seq.course/index.html

sessionInfo()

过滤只是它最基本的工具，它作为单细胞转录组3大R包，功能肯定是非常全面的，比如前面我们讲解的normalization，DEG, features selection，cluster，它都手到擒来，只不过是包装的是其它R包的函数。

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2018-01-23，如有侵权请联系 cloudcommunity@tencent.com 删除

r 语言

本文分享自生信技能树微信公众号，前往查看

如有侵权，请联系 cloudcommunity@tencent.com 删除。

本文参与腾讯云自媒体同步曝光计划，欢迎热爱写作的你一起参与！

r 语言

登录后参与评论

0 条评论

热度