单细胞转录组3大R包之scater

scater 这个R包很强大,是McCarthy et al. 2017 发表的,包含的功能有:

  • Automated computation of QC metrics
  • Transcript quantification from read data with pseudo-alignment
  • Data format standardisation
  • Rich visualizations for exploratory analysis
  • Seamless integration into the Bioconductor universe
  • Simple normalisation methods

R包工作流程图

S4对象

主要是基于 SCESet 对象来进行下游分析,跟ExpressionSet对象类似,也是常见的3个组成:

  • exprs, a numeric matrix of expression values, where rows are features, and columns are cells
  • phenoData, an AnnotatedDataFrame object, where rows are cells, and columns are cell attributes (such as cell type, culture condition, day captured, etc.)
  • featureData, an AnnotatedDataFrame object, where rows are features (e.g. genes), and columns are feature attributes, such as biotype, gc content, etc.

主要就是读取scRNA上游分析处理得到的表达矩阵,加上每个样本的描述信息,形成矩阵之后。对样本进行过滤,然后对基因进行过滤。针对过滤后的表达矩阵进行各种分类的可视化。

最新的文档如下:

HTML

R Script

An introduction to the scater package

HTML

R Script

Data visualisation methods in scater

HTML

R Script

Expression quantification and import

HTML

R Script

Quality control with scater

HTML

R Script

Transition from SCESet to SingleCellExperiment

PDF

其GitHub上面专门开设了介绍它用法的课程:http://hemberg-lab.github.io/scRNA.seq.course/

测试数据

suppressPackageStartupMessages(library(scater))
data("sc_example_counts")
data("sc_example_cell_info") 

example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts), 
    colData = sc_example_cell_info
)

exprs(example_sce) <- log2(
    calculateCPM(example_sce, use.size.factors = FALSE) + 1
)

keep_feature <- rowSums(exprs(example_sce) > 0) > 0
example_sce <- example_sce[keep_feature,]

example_sce <- calculateQCMetrics(example_sce, 
                                  feature_controls = list(eg = 1:40))

#scater_gui(example_sce)

但是真的非常好用,所有的可视化都集中在了 scater_gui 这个函数产生的shiny网页里面:

  • plotScater: a plot method exists for SingleCellExperiment objects, which gives an overview of expression across cells.
  • plotQC: various methods are available for producing QC diagnostic plots.
  • plotPCA: produce a principal components plot for the cells.
  • plotTSNE: produce a t-distributed stochastic neighbour embedding (reduced dimension) plot for the cells.
  • plotDiffusionMap: produce a diffusion map (reduced dimension) plot for the cells.
  • plotMDS: produce a multi-dimensional scaling plot for the cells.
  • plotReducedDim: plot a reduced-dimension representation of the cells.
  • plotExpression: plot expression levels for a defined set of features.
  • plotPlatePosition: plot cells in their position on a plate, coloured by cell metadata and QC metrics or feature expression level.
  • plotColData: plot cell metadata and QC metrics.
  • plotRowData: plot feature metadata and QC metrics.

可以充分的探索自己的数据,随便看一个可视化函数的结果:

## ----plot-expression, eval=TRUE--------------------------------------------
plotExpression(example_sce, rownames(example_sce)[1:6],
               x = "Mutation_Status", exprs_values = "exprs", 
               colour = "Treatment")

详细的QC

做QC要结合上面的可视化步骤,所有没办法自动化,只能先可视化,肉眼分辨一下哪些样本或者基因数据是需要舍弃的。

library(knitr)
opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, dev = 'png')
library(ggplot2)
theme_set(theme_bw(12))

## ----quickstart-load-data, message=FALSE, warning=FALSE--------------------
suppressPackageStartupMessages(library(scater))
data("sc_example_counts")
data("sc_example_cell_info")

## ----quickstart-make-sce, results='hide'-----------------------------------
gene_df <- DataFrame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
example_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts), 
                                    colData = sc_example_cell_info, 
                                    rowData = gene_df)

example_sce <- normalise(example_sce)
## Warning in .local(object, ...): using library sizes as size factors
## ----quickstart-add-exprs, results='hide'----------------------------------
exprs(example_sce) <- log2(
    calculateCPM(example_sce, use.size.factors = FALSE) + 1)

## ----filter-no-exprs-------------------------------------------------------
keep_feature <- rowSums(exprs(example_sce) > 0) > 0
example_sce <- example_sce[keep_feature,]

example_sceset <- calculateQCMetrics(example_sce, feature_controls = list(eg = 1:40)) 


colnames(colData(example_sceset))
##  [1] "Cell"                                  
##  [2] "Mutation_Status"                       
##  [3] "Cell_Cycle"                            
##  [4] "Treatment"                             
##  [5] "total_features"                        
##  [6] "log10_total_features"                  
##  [7] "total_counts"                          
##  [8] "log10_total_counts"                    
##  [9] "pct_counts_top_50_features"            
## [10] "pct_counts_top_100_features"           
## [11] "pct_counts_top_200_features"           
## [12] "pct_counts_top_500_features"           
## [13] "total_features_endogenous"             
## [14] "log10_total_features_endogenous"       
## [15] "total_counts_endogenous"               
## [16] "log10_total_counts_endogenous"         
## [17] "pct_counts_endogenous"                 
## [18] "pct_counts_top_50_features_endogenous" 
## [19] "pct_counts_top_100_features_endogenous"
## [20] "pct_counts_top_200_features_endogenous"
## [21] "pct_counts_top_500_features_endogenous"
## [22] "total_features_feature_control"        
## [23] "log10_total_features_feature_control"  
## [24] "total_counts_feature_control"          
## [25] "log10_total_counts_feature_control"    
## [26] "pct_counts_feature_control"            
## [27] "total_features_eg"                     
## [28] "log10_total_features_eg"               
## [29] "total_counts_eg"                       
## [30] "log10_total_counts_eg"                 
## [31] "pct_counts_eg"                         
## [32] "is_cell_control"
colnames(rowData(example_sceset))
##  [1] "Gene"                  "is_feature_control"   
##  [3] "is_feature_control_eg" "mean_counts"          
##  [5] "log10_mean_counts"     "rank_counts"          
##  [7] "n_cells_counts"        "pct_dropout_counts"   
##  [9] "total_counts"          "log10_total_counts"

首先是基于样本的过滤,用 colData(object) 可以查看各个样本统计情况

  • total_counts: total number of counts for the cell (aka ‘library size’)
  • log10_total_counts: total_counts on the log10-scale
  • total_features: the number of features for the cell that have expression above the detection limit (default detection limit is zero)
  • filter_on_total_counts: would this cell be filtered out based on its log10-total_counts being (by default) more than 5 median absolute deviations from the median log10-total_counts for the dataset?
  • filter_on_total_features: would this cell be filtered out based on its total_features being (by default) more than 5 median absolute deviations from the median total_features for the dataset?
  • counts_feature_controls: total number of counts for the cell that come from (a set of user-defined) control features. Defaults to zero if no control features are indicated.
  • counts_endogenous_features: total number of counts for the cell that come from endogenous features (i.e. not control features). Defaults to total_counts if no control features are indicated.
  • log10_counts_feature_controls: total number of counts from control features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated.
  • log10_counts_endogenous_features: total number of counts from endogenous features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated.
  • n_detected_feature_controls: number of defined feature controls that have expression greater than the threshold defined in the object. *pct_counts_feature_controls: percentage of all counts that come from the defined control features. Defaults to zero if no control features are defined.

然后是基于基因的过滤,用 rowData(object) 可以查看各个基因统计情况

  • mean_exprs: the mean expression level of the gene/feature.
  • exprs_rank: the rank of the feature’s expression level in the cell.
  • total_feature_counts: the total number of counts mapped to that feature across all cells.
  • log10_total_feature_counts: total feature counts on the log10-scale.
  • pct_total_counts: the percentage of all counts that are accounted for by the counts mapping to the feature.
  • is_feature_control: is the feature a control feature? Default is FALSE unless control features are defined by the user.
  • n_cells_exprs: the number of cells for which the expression level of the feature is above the detection limit (default detection limit is zero).

scater一站式过滤低质量样本

scater包自己提供了一个基于PCA的QC标准,不需要自己根据文库大小,覆盖的基因数量,外源的ERCC spike-ins 含量以及线粒体DNA含量来进行人工过滤。

默认的筛选条件如下:

  • pct_counts_top100features
  • total_features
  • pct_counts_feature_controls
  • n_detected_feature_controls
  • log10_counts_endogenous_features
  • log10_counts_feature_controls

一站式QC函数如下:

dat_pca <- scater::plotPCA(dat_qc,
                  size_by = "total_features", 
                  shape_by = "use",
                  pca_data_input = "pdata",
                  detect_outliers = TRUE,
                  return_SCESet = TRUE)

还有更详细的教程,需要看

  • https://www.bioconductor.org/help/workflows/simpleSingleCell/
  • http://hemberg-lab.github.io/scRNA.seq.course/index.html
sessionInfo()

过滤只是它最基本的工具,它作为单细胞转录组3大R包,功能肯定是非常全面的,比如前面我们讲解的normalization,DEG, features selection,cluster,它都手到擒来,只不过是包装的是其它R包的函数。

原文发布于微信公众号 - 生信技能树(biotrainee)

原文发表时间:2018-01-23

本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。

发表于

我来说两句

0 条评论
登录 后参与评论

相关文章

来自专栏CVer

ECCV 2018 776篇论文一键下载

昨天推送了ECCV 2018 收录论文名单全公布,看到很多CVer纷纷转发至朋友圈,Amusi真的很开心!

1652
来自专栏Spark学习技巧

第2篇:数据库关系建模

第二篇:数据库关系建模 前言 ER建模环节完成后,需求就被描述成了ER图。之后,便可根据这个ER图设计相应的关系表了。 但从ER图到具体关系表的建立还需要经过两...

3116
来自专栏PPV课数据科学社区

《Kaggle项目实战》 泰坦尼克:从R开始数据挖掘(一)

摘要: 你是否为研究数据挖掘预测问题而感到兴奋?那么如何开始呢,本案例选自Kaggle上的数据竞赛的一个数据竞赛项目《泰坦尼克:灾难中的机器学习》,案例涉及一...

3126
来自专栏用户2442861的专栏

使用ffmpeg转换文件格式,及ffmpeg参数说明(转)

转换文件test.avi到test.flv ffmpeg -i test.avi -ab 56 -ar 22050 -b 500 -r 29.97 -s 3...

2701
来自专栏数据库

速来围观!——三种NCBI常见数据库

在微生物测序分析中,常常需要对未知的核酸或蛋白序列进行物种,功能或类别注释。注释方法种类较多,其中最常用的是与一些标准数据库进行相似性搜索,也就是序列比对。因此...

20010
来自专栏数据小魔方

parklines迷你图系列1——Scales

按照之前的计划,今天开始按照sparklines插件的图表分类标准开始跟大家分享详细的做法。 按照该插件在excel菜单中的顺序,先来看测量尺度(Scales)...

2776
来自专栏新智元

【TensorFlow1.2.0版发布】14大新功能,增加Intel MKL集成

【新智元导读】TensorFlow 今天发布最新版 1.2.0,公布了14大最新功能。新智元带来最新介绍,包括 API 的重要变化、contrib API的变化...

3329
来自专栏深度学习入门与实践

用Tensorflow让神经网络自动创造音乐

本文禁止转载,禁止用于各类讲座及ppt中,违者必究   前几天看到一个有意思的分享,大意是讲如何用Tensorflow教神经网络自动创造音乐。听起来好好玩有木有...

2879
来自专栏灯塔大数据

每周学点大数据 | No.47 BSP 模型下的单源最短路径

No.47期 BSP 模型下的单源最短路径 我们先来举个例子吧。单源最短路径也是一种很典型的图论问题,前面我们提到过,就是求解从一个源点到各个节点的最短距离,...

3245
来自专栏Java帮帮-微信公众号-技术文章全总结

Dubbo入门学习--负载均衡策略(4)

Dubbo入门学习--负载均衡策略 负载均衡 ? Random LoadBalance 随机,按权重设置随机概率。 在一个截面上碰撞的概率高,但调用量越大分布越...

3264

扫码关注云+社区