单细胞转录组3大R包之scater

scater 这个R包很强大,是McCarthy et al. 2017 发表的,包含的功能有:

  • Automated computation of QC metrics
  • Transcript quantification from read data with pseudo-alignment
  • Data format standardisation
  • Rich visualizations for exploratory analysis
  • Seamless integration into the Bioconductor universe
  • Simple normalisation methods

R包工作流程图

S4对象

主要是基于 SCESet 对象来进行下游分析,跟ExpressionSet对象类似,也是常见的3个组成:

  • exprs, a numeric matrix of expression values, where rows are features, and columns are cells
  • phenoData, an AnnotatedDataFrame object, where rows are cells, and columns are cell attributes (such as cell type, culture condition, day captured, etc.)
  • featureData, an AnnotatedDataFrame object, where rows are features (e.g. genes), and columns are feature attributes, such as biotype, gc content, etc.

主要就是读取scRNA上游分析处理得到的表达矩阵,加上每个样本的描述信息,形成矩阵之后。对样本进行过滤,然后对基因进行过滤。针对过滤后的表达矩阵进行各种分类的可视化。

最新的文档如下:

HTML

R Script

An introduction to the scater package

HTML

R Script

Data visualisation methods in scater

HTML

R Script

Expression quantification and import

HTML

R Script

Quality control with scater

HTML

R Script

Transition from SCESet to SingleCellExperiment

PDF

其GitHub上面专门开设了介绍它用法的课程:http://hemberg-lab.github.io/scRNA.seq.course/

测试数据

suppressPackageStartupMessages(library(scater))
data("sc_example_counts")
data("sc_example_cell_info") 

example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts), 
    colData = sc_example_cell_info
)

exprs(example_sce) <- log2(
    calculateCPM(example_sce, use.size.factors = FALSE) + 1
)

keep_feature <- rowSums(exprs(example_sce) > 0) > 0
example_sce <- example_sce[keep_feature,]

example_sce <- calculateQCMetrics(example_sce, 
                                  feature_controls = list(eg = 1:40))

#scater_gui(example_sce)

但是真的非常好用,所有的可视化都集中在了 scater_gui 这个函数产生的shiny网页里面:

  • plotScater: a plot method exists for SingleCellExperiment objects, which gives an overview of expression across cells.
  • plotQC: various methods are available for producing QC diagnostic plots.
  • plotPCA: produce a principal components plot for the cells.
  • plotTSNE: produce a t-distributed stochastic neighbour embedding (reduced dimension) plot for the cells.
  • plotDiffusionMap: produce a diffusion map (reduced dimension) plot for the cells.
  • plotMDS: produce a multi-dimensional scaling plot for the cells.
  • plotReducedDim: plot a reduced-dimension representation of the cells.
  • plotExpression: plot expression levels for a defined set of features.
  • plotPlatePosition: plot cells in their position on a plate, coloured by cell metadata and QC metrics or feature expression level.
  • plotColData: plot cell metadata and QC metrics.
  • plotRowData: plot feature metadata and QC metrics.

可以充分的探索自己的数据,随便看一个可视化函数的结果:

## ----plot-expression, eval=TRUE--------------------------------------------
plotExpression(example_sce, rownames(example_sce)[1:6],
               x = "Mutation_Status", exprs_values = "exprs", 
               colour = "Treatment")

详细的QC

做QC要结合上面的可视化步骤,所有没办法自动化,只能先可视化,肉眼分辨一下哪些样本或者基因数据是需要舍弃的。

library(knitr)
opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, dev = 'png')
library(ggplot2)
theme_set(theme_bw(12))

## ----quickstart-load-data, message=FALSE, warning=FALSE--------------------
suppressPackageStartupMessages(library(scater))
data("sc_example_counts")
data("sc_example_cell_info")

## ----quickstart-make-sce, results='hide'-----------------------------------
gene_df <- DataFrame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
example_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts), 
                                    colData = sc_example_cell_info, 
                                    rowData = gene_df)

example_sce <- normalise(example_sce)
## Warning in .local(object, ...): using library sizes as size factors
## ----quickstart-add-exprs, results='hide'----------------------------------
exprs(example_sce) <- log2(
    calculateCPM(example_sce, use.size.factors = FALSE) + 1)

## ----filter-no-exprs-------------------------------------------------------
keep_feature <- rowSums(exprs(example_sce) > 0) > 0
example_sce <- example_sce[keep_feature,]

example_sceset <- calculateQCMetrics(example_sce, feature_controls = list(eg = 1:40)) 


colnames(colData(example_sceset))
##  [1] "Cell"                                  
##  [2] "Mutation_Status"                       
##  [3] "Cell_Cycle"                            
##  [4] "Treatment"                             
##  [5] "total_features"                        
##  [6] "log10_total_features"                  
##  [7] "total_counts"                          
##  [8] "log10_total_counts"                    
##  [9] "pct_counts_top_50_features"            
## [10] "pct_counts_top_100_features"           
## [11] "pct_counts_top_200_features"           
## [12] "pct_counts_top_500_features"           
## [13] "total_features_endogenous"             
## [14] "log10_total_features_endogenous"       
## [15] "total_counts_endogenous"               
## [16] "log10_total_counts_endogenous"         
## [17] "pct_counts_endogenous"                 
## [18] "pct_counts_top_50_features_endogenous" 
## [19] "pct_counts_top_100_features_endogenous"
## [20] "pct_counts_top_200_features_endogenous"
## [21] "pct_counts_top_500_features_endogenous"
## [22] "total_features_feature_control"        
## [23] "log10_total_features_feature_control"  
## [24] "total_counts_feature_control"          
## [25] "log10_total_counts_feature_control"    
## [26] "pct_counts_feature_control"            
## [27] "total_features_eg"                     
## [28] "log10_total_features_eg"               
## [29] "total_counts_eg"                       
## [30] "log10_total_counts_eg"                 
## [31] "pct_counts_eg"                         
## [32] "is_cell_control"
colnames(rowData(example_sceset))
##  [1] "Gene"                  "is_feature_control"   
##  [3] "is_feature_control_eg" "mean_counts"          
##  [5] "log10_mean_counts"     "rank_counts"          
##  [7] "n_cells_counts"        "pct_dropout_counts"   
##  [9] "total_counts"          "log10_total_counts"

首先是基于样本的过滤,用 colData(object) 可以查看各个样本统计情况

  • total_counts: total number of counts for the cell (aka ‘library size’)
  • log10_total_counts: total_counts on the log10-scale
  • total_features: the number of features for the cell that have expression above the detection limit (default detection limit is zero)
  • filter_on_total_counts: would this cell be filtered out based on its log10-total_counts being (by default) more than 5 median absolute deviations from the median log10-total_counts for the dataset?
  • filter_on_total_features: would this cell be filtered out based on its total_features being (by default) more than 5 median absolute deviations from the median total_features for the dataset?
  • counts_feature_controls: total number of counts for the cell that come from (a set of user-defined) control features. Defaults to zero if no control features are indicated.
  • counts_endogenous_features: total number of counts for the cell that come from endogenous features (i.e. not control features). Defaults to total_counts if no control features are indicated.
  • log10_counts_feature_controls: total number of counts from control features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated.
  • log10_counts_endogenous_features: total number of counts from endogenous features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated.
  • n_detected_feature_controls: number of defined feature controls that have expression greater than the threshold defined in the object. *pct_counts_feature_controls: percentage of all counts that come from the defined control features. Defaults to zero if no control features are defined.

然后是基于基因的过滤,用 rowData(object) 可以查看各个基因统计情况

  • mean_exprs: the mean expression level of the gene/feature.
  • exprs_rank: the rank of the feature’s expression level in the cell.
  • total_feature_counts: the total number of counts mapped to that feature across all cells.
  • log10_total_feature_counts: total feature counts on the log10-scale.
  • pct_total_counts: the percentage of all counts that are accounted for by the counts mapping to the feature.
  • is_feature_control: is the feature a control feature? Default is FALSE unless control features are defined by the user.
  • n_cells_exprs: the number of cells for which the expression level of the feature is above the detection limit (default detection limit is zero).

scater一站式过滤低质量样本

scater包自己提供了一个基于PCA的QC标准,不需要自己根据文库大小,覆盖的基因数量,外源的ERCC spike-ins 含量以及线粒体DNA含量来进行人工过滤。

默认的筛选条件如下:

  • pct_counts_top100features
  • total_features
  • pct_counts_feature_controls
  • n_detected_feature_controls
  • log10_counts_endogenous_features
  • log10_counts_feature_controls

一站式QC函数如下:

dat_pca <- scater::plotPCA(dat_qc,
                  size_by = "total_features", 
                  shape_by = "use",
                  pca_data_input = "pdata",
                  detect_outliers = TRUE,
                  return_SCESet = TRUE)

还有更详细的教程,需要看

  • https://www.bioconductor.org/help/workflows/simpleSingleCell/
  • http://hemberg-lab.github.io/scRNA.seq.course/index.html
sessionInfo()

过滤只是它最基本的工具,它作为单细胞转录组3大R包,功能肯定是非常全面的,比如前面我们讲解的normalization,DEG, features selection,cluster,它都手到擒来,只不过是包装的是其它R包的函数。

原文发布于微信公众号 - 生信技能树(biotrainee)

原文发表时间:2018-01-23

本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。

发表于

我来说两句

0 条评论
登录 后参与评论

相关文章

来自专栏数据小魔方

创意玫瑰图(Rose chart)

今天跟大家分享的图表是创意玫瑰图! ▽▼▽ 这种图表形似玫瑰,故而得名,其效果与我们常用的饼图,圆环图及雷达图类似。 ? 可以反映数据结构的比例、大小,但因其形...

35810
来自专栏生信技能树

单细胞转录组3大R包之monocle2

主要是针对单细胞转录组测序数据开发的,用来找不同细胞类型或者不同细胞状态的差异表达基因。分析起始是表达矩阵,作者推荐用比较老旧的Tophat+Cufflinks...

1.6K9
来自专栏流媒体

音频编码(一)——FFmpeg编码

这里为啥讲到了声波,讲到了我们的中学物理上的知识,因为我想大家能从根本理解后面音频编码的各种参数以及原因。当然这些知识网上都能搜到,我只是整合一下。

2163
来自专栏帮你学MatLab

R2015b 版本

R2015b 版本 MATLAB 产品系列更新: MATLAB: 新增更快运行 MATLAB® 代码的执行引擎;用于创建、分析图形和网络并实现可视化的图形...

2627
来自专栏数据小魔方

带负值的图表标签处理方法

今天跟大家分享带负值的图表标签处理方法! ▽▼▽ 在遇到某些特殊图表时,特别是一个数据系列中既有正值又有负值的情况,数据标签以及纵轴轴标签总是会相互遮挡,做出来...

3066
来自专栏菩提树下的杨过

excel中的不同类型图表叠加

上午QQ上的某好友问我:如何在excel中插入一张同时带柱状图+折线图的图表?(类似下面这样) ? 打开excel2007看了下,默认情况下插入图表时,只允许选...

2796
来自专栏专知

【干货】TensorFlow协同过滤推荐实战

【导读】本文利用TensorFlow构建了一个用于产品推荐的WALS协同过滤模型。作者从抓取数据开始对模型进行了详细的解读,并且分析了几种推荐中可能隐藏的情况及...

51911
来自专栏Soul Joy Hub

tensorflow架构

TensorFlow 又是好久没有写博客了,上班以来,感觉时间过得飞快,每天时间很紧,过得有点累,不知道自己的博客能坚持到何时,且行且珍惜。 本片博文是参...

3319
来自专栏AI研习社

代码+实战:TensorFlow Estimator of Deep CTR —— DeepFM/NFM/AFM/FNN/PNN

深度学习在 ctr 预估领域的应用越来越多,新的模型不断冒出。从 ctr 预估问题看看 f(x) 设计—DNN 篇(https://zhuanlan.zhihu...

1K8
来自专栏刘望舒

Android 人脸识别之人脸注册

2892

扫码关注云+社区