单细胞转录组3大R包之scater

scater 这个R包很强大,是McCarthy et al. 2017 发表的,包含的功能有:

  • Automated computation of QC metrics
  • Transcript quantification from read data with pseudo-alignment
  • Data format standardisation
  • Rich visualizations for exploratory analysis
  • Seamless integration into the Bioconductor universe
  • Simple normalisation methods

R包工作流程图

S4对象

主要是基于 SCESet 对象来进行下游分析,跟ExpressionSet对象类似,也是常见的3个组成:

  • exprs, a numeric matrix of expression values, where rows are features, and columns are cells
  • phenoData, an AnnotatedDataFrame object, where rows are cells, and columns are cell attributes (such as cell type, culture condition, day captured, etc.)
  • featureData, an AnnotatedDataFrame object, where rows are features (e.g. genes), and columns are feature attributes, such as biotype, gc content, etc.

主要就是读取scRNA上游分析处理得到的表达矩阵,加上每个样本的描述信息,形成矩阵之后。对样本进行过滤,然后对基因进行过滤。针对过滤后的表达矩阵进行各种分类的可视化。

最新的文档如下:

HTML

R Script

An introduction to the scater package

HTML

R Script

Data visualisation methods in scater

HTML

R Script

Expression quantification and import

HTML

R Script

Quality control with scater

HTML

R Script

Transition from SCESet to SingleCellExperiment

PDF

其GitHub上面专门开设了介绍它用法的课程:http://hemberg-lab.github.io/scRNA.seq.course/

测试数据

suppressPackageStartupMessages(library(scater))
data("sc_example_counts")
data("sc_example_cell_info") 

example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts), 
    colData = sc_example_cell_info
)

exprs(example_sce) <- log2(
    calculateCPM(example_sce, use.size.factors = FALSE) + 1
)

keep_feature <- rowSums(exprs(example_sce) > 0) > 0
example_sce <- example_sce[keep_feature,]

example_sce <- calculateQCMetrics(example_sce, 
                                  feature_controls = list(eg = 1:40))

#scater_gui(example_sce)

但是真的非常好用,所有的可视化都集中在了 scater_gui 这个函数产生的shiny网页里面:

  • plotScater: a plot method exists for SingleCellExperiment objects, which gives an overview of expression across cells.
  • plotQC: various methods are available for producing QC diagnostic plots.
  • plotPCA: produce a principal components plot for the cells.
  • plotTSNE: produce a t-distributed stochastic neighbour embedding (reduced dimension) plot for the cells.
  • plotDiffusionMap: produce a diffusion map (reduced dimension) plot for the cells.
  • plotMDS: produce a multi-dimensional scaling plot for the cells.
  • plotReducedDim: plot a reduced-dimension representation of the cells.
  • plotExpression: plot expression levels for a defined set of features.
  • plotPlatePosition: plot cells in their position on a plate, coloured by cell metadata and QC metrics or feature expression level.
  • plotColData: plot cell metadata and QC metrics.
  • plotRowData: plot feature metadata and QC metrics.

可以充分的探索自己的数据,随便看一个可视化函数的结果:

## ----plot-expression, eval=TRUE--------------------------------------------
plotExpression(example_sce, rownames(example_sce)[1:6],
               x = "Mutation_Status", exprs_values = "exprs", 
               colour = "Treatment")

详细的QC

做QC要结合上面的可视化步骤,所有没办法自动化,只能先可视化,肉眼分辨一下哪些样本或者基因数据是需要舍弃的。

library(knitr)
opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, dev = 'png')
library(ggplot2)
theme_set(theme_bw(12))

## ----quickstart-load-data, message=FALSE, warning=FALSE--------------------
suppressPackageStartupMessages(library(scater))
data("sc_example_counts")
data("sc_example_cell_info")

## ----quickstart-make-sce, results='hide'-----------------------------------
gene_df <- DataFrame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
example_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts), 
                                    colData = sc_example_cell_info, 
                                    rowData = gene_df)

example_sce <- normalise(example_sce)
## Warning in .local(object, ...): using library sizes as size factors
## ----quickstart-add-exprs, results='hide'----------------------------------
exprs(example_sce) <- log2(
    calculateCPM(example_sce, use.size.factors = FALSE) + 1)

## ----filter-no-exprs-------------------------------------------------------
keep_feature <- rowSums(exprs(example_sce) > 0) > 0
example_sce <- example_sce[keep_feature,]

example_sceset <- calculateQCMetrics(example_sce, feature_controls = list(eg = 1:40)) 


colnames(colData(example_sceset))
##  [1] "Cell"                                  
##  [2] "Mutation_Status"                       
##  [3] "Cell_Cycle"                            
##  [4] "Treatment"                             
##  [5] "total_features"                        
##  [6] "log10_total_features"                  
##  [7] "total_counts"                          
##  [8] "log10_total_counts"                    
##  [9] "pct_counts_top_50_features"            
## [10] "pct_counts_top_100_features"           
## [11] "pct_counts_top_200_features"           
## [12] "pct_counts_top_500_features"           
## [13] "total_features_endogenous"             
## [14] "log10_total_features_endogenous"       
## [15] "total_counts_endogenous"               
## [16] "log10_total_counts_endogenous"         
## [17] "pct_counts_endogenous"                 
## [18] "pct_counts_top_50_features_endogenous" 
## [19] "pct_counts_top_100_features_endogenous"
## [20] "pct_counts_top_200_features_endogenous"
## [21] "pct_counts_top_500_features_endogenous"
## [22] "total_features_feature_control"        
## [23] "log10_total_features_feature_control"  
## [24] "total_counts_feature_control"          
## [25] "log10_total_counts_feature_control"    
## [26] "pct_counts_feature_control"            
## [27] "total_features_eg"                     
## [28] "log10_total_features_eg"               
## [29] "total_counts_eg"                       
## [30] "log10_total_counts_eg"                 
## [31] "pct_counts_eg"                         
## [32] "is_cell_control"
colnames(rowData(example_sceset))
##  [1] "Gene"                  "is_feature_control"   
##  [3] "is_feature_control_eg" "mean_counts"          
##  [5] "log10_mean_counts"     "rank_counts"          
##  [7] "n_cells_counts"        "pct_dropout_counts"   
##  [9] "total_counts"          "log10_total_counts"

首先是基于样本的过滤,用 colData(object) 可以查看各个样本统计情况

  • total_counts: total number of counts for the cell (aka ‘library size’)
  • log10_total_counts: total_counts on the log10-scale
  • total_features: the number of features for the cell that have expression above the detection limit (default detection limit is zero)
  • filter_on_total_counts: would this cell be filtered out based on its log10-total_counts being (by default) more than 5 median absolute deviations from the median log10-total_counts for the dataset?
  • filter_on_total_features: would this cell be filtered out based on its total_features being (by default) more than 5 median absolute deviations from the median total_features for the dataset?
  • counts_feature_controls: total number of counts for the cell that come from (a set of user-defined) control features. Defaults to zero if no control features are indicated.
  • counts_endogenous_features: total number of counts for the cell that come from endogenous features (i.e. not control features). Defaults to total_counts if no control features are indicated.
  • log10_counts_feature_controls: total number of counts from control features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated.
  • log10_counts_endogenous_features: total number of counts from endogenous features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated.
  • n_detected_feature_controls: number of defined feature controls that have expression greater than the threshold defined in the object. *pct_counts_feature_controls: percentage of all counts that come from the defined control features. Defaults to zero if no control features are defined.

然后是基于基因的过滤,用 rowData(object) 可以查看各个基因统计情况

  • mean_exprs: the mean expression level of the gene/feature.
  • exprs_rank: the rank of the feature’s expression level in the cell.
  • total_feature_counts: the total number of counts mapped to that feature across all cells.
  • log10_total_feature_counts: total feature counts on the log10-scale.
  • pct_total_counts: the percentage of all counts that are accounted for by the counts mapping to the feature.
  • is_feature_control: is the feature a control feature? Default is FALSE unless control features are defined by the user.
  • n_cells_exprs: the number of cells for which the expression level of the feature is above the detection limit (default detection limit is zero).

scater一站式过滤低质量样本

scater包自己提供了一个基于PCA的QC标准,不需要自己根据文库大小,覆盖的基因数量,外源的ERCC spike-ins 含量以及线粒体DNA含量来进行人工过滤。

默认的筛选条件如下:

  • pct_counts_top100features
  • total_features
  • pct_counts_feature_controls
  • n_detected_feature_controls
  • log10_counts_endogenous_features
  • log10_counts_feature_controls

一站式QC函数如下:

dat_pca <- scater::plotPCA(dat_qc,
                  size_by = "total_features", 
                  shape_by = "use",
                  pca_data_input = "pdata",
                  detect_outliers = TRUE,
                  return_SCESet = TRUE)

还有更详细的教程,需要看

  • https://www.bioconductor.org/help/workflows/simpleSingleCell/
  • http://hemberg-lab.github.io/scRNA.seq.course/index.html
sessionInfo()

过滤只是它最基本的工具,它作为单细胞转录组3大R包,功能肯定是非常全面的,比如前面我们讲解的normalization,DEG, features selection,cluster,它都手到擒来,只不过是包装的是其它R包的函数。

原文发布于微信公众号 - 生信技能树(biotrainee)

原文发表时间:2018-01-23

本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。

发表于

我来说两句

0 条评论
登录 后参与评论

相关文章

来自专栏ATYUN订阅号

用香蕉也能玩电脑游戏—Tensorflow对象检测接口的简单应用

Tensorflow最近发布了用于对象检测的对象检测接口(Object Detection API),能够定位和识别图像中的对象。它能够快速检测图像允许从视频帧...

3224
来自专栏漫漫深度学习路

pytorch学习笔记(八):PytTorch可视化工具 visdom

Visdom PyTorch可视化工具 本文翻译的时候把 略去了 Torch部分。 项目地址 ? 一个灵活的可视化工具,可用来对于 实时,富数据的 创建,组织和...

4875
来自专栏小詹同学

人脸检测(一)——基于单文档的应用台程序

Opencv自带训练好的人脸模型(人脸的人眼、口等器官类似),此文基于vs2013建立应用台单文档程序,具体建立过程不予详细叙述,主要记录利用的Opencv自带...

3685
来自专栏深度学习

利用RNN和LSTM生成小说题记

一、选取素材 本文选取的小说素材来自17k小说网的一篇小说《两只橙与遠太郎》,手工复制小说中的题记。 小说网址:http://www.17k.com/list/...

4657
来自专栏领域驱动设计DDD实战进阶

18-TypeScript模板方法模式

在有些情况下,一个功能在基础功能上是不会变的,算法的基本骨架也是确定的,但是在某些场景下算法的具体实现有些差异。应对这种问题,可以采用模板方法模式: abstr...

3054
来自专栏42度空间

基于规则评分的密码强度检测算法分析及实现(JavaScript)

用正则表达式做用户密码强度的通过性判定,过于简单粗暴,不但用户体验差,而且用户帐号安全性也差。那么如何准确评价用户密码的强度,保护用户帐号安全呢?本文分析介绍了...

4456
来自专栏落影的专栏

iOS开发-OpenGLES进阶教程3

教程 OpenGLES入门教程1-Tutorial01-GLKit OpenGLES入门教程2-Tutorial02-shader入门 OpenGLES入门...

3317
来自专栏YoungGy

ISLR_ANOVA

概述 核心思想 检定统计量F 结论 适用情况 Multi comparison ANOVA不同于之前的z检定,t检定,这里的零假设包含了很多个变量,具体是μ1=...

1898
来自专栏生信技能树

一篇文章学会ChIP-seq分析(下)

写在前面:《一篇文章学会ChIP-seq分析(上)》《一篇文章学会ChIP-seq分析(下)》为生信菜鸟团博客相关文章合集,共九讲内容。带领你从相关文献解读、资...

4617
来自专栏杂七杂八

Matplotlib 绘2D图

Matplotlib 是一个非常简单而又完善的开源绘图库。那么它到底有多简单呢? 基本知识 首先官方文档奉上 下面,我们通过 3 行代码绘制一张简单的折线图...

4065

扫描关注云+社区