单细胞转录组3大R包之Seurat

牛津大学的Rahul Satija等开发的Seurat,最早公布在Nature biotechnology, 2015,文章是; Spatial reconstruction of single-cell gene expression data , 在2017年进行了非常大的改动,所以重新在biorxiv发表了文章在 Integrated analysis of single cell transcriptomic data across conditions, technologies, and species 。 功能涵盖了scRNA-seq的QC、过滤、标准化、批次效应、PCA、tNSE亚群聚类分析、差异基因分析、亚群特异性标志物鉴定等等等。

其GitHub地址是:http://satijalab.org/seurat/

给初学者提供了一个2,700 PBMC scRNA-seq dataset from 10X genomics的数据实战指导,非常容易学会: http://satijalab.org/seurat/pbmc3k_tutorial.html 数据在: https://personal.broadinstitute.org/rahuls/seurat/seurat_files_nbt.zip

同时还提供两个公共数据的实战演练教程:

  • https://www.dropbox.com/s/4d00eyd84qscyd2/IntegratedAnalysis_Examples.zip?dl=1
  • http://bit.ly/IAexpmat

下载后如下所示:

IntegratedAnalysis_Examples
├── [ 211]  INSTALL
├── [1.4K]  README.md
├── [ 170]  data
│   ├── [ 83K]  Supplementary_Table_MarrowCellData.tsv
│   ├── [547K]  Supplementary_Table_PancreasCellData.tsv
│   └── [ 561]  regev_lab_cell_cycle_genes.txt
├── [ 170]  examples
│   ├── [2.9K]  marrow_commandList.R
│   └── [2.5K]  pancreas_commandList.R
└── [ 170]  tutorial
    ├── [8.2K]  Seurat_AlignmentTutorial.Rmd
    └── [7.6M]  Seurat_AlignmentTutorial.pdf
IntegratedAnalysis_ExpressionMatrices
├── [102M]  marrow_mars.expressionMatrix.txt
├── [ 66M]  marrow_ss2.expressionMatrix.txt
├── [330M]  pancreas_human.expressionMatrix.txt
├── [ 54M]  pancreas_mouse.expressionMatrix.txt
├── [165M]  pbmc_10X.expressionMatrix.txt
└── [101M]  pbmc_SeqWell.expressionMatrix.txt

seurat的用法

这里的测试数据是经由Illumina NextSeq 500测到的2,700 single cells 表达矩阵,下载地址;

根据表达矩阵构建seurat对象

需要准备好3个输入文件

library(Seurat)
library(dplyr)
library(Matrix)
## https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz 
## 下载整个压缩包解压即可重现整个流程
# Load the PBMC dataset
list.files("~/Downloads/filtered_gene_bc_matrices/hg19/")
## [1] "barcodes.tsv" "genes.tsv"    "matrix.mtx"
pbmc.data <- Read10X(data.dir = "~/Downloads/filtered_gene_bc_matrices/hg19/")

# Examine the memory savings between regular and sparse matrices
dense.size <- object.size(x = as.matrix(x = pbmc.data))
dense.size
## 709264728 bytes
sparse.size <- object.size(x = pbmc.data)
sparse.size
## 38715120 bytes
dense.size / sparse.size
## 18.3 bytes
# Initialize the Seurat object with the raw (non-normalized data).  Keep all
# genes expressed in >= 3 cells (~0.1% of the data). Keep all cells with at
# least 200 detected genes
pbmc <- CreateSeuratObject(raw.data = pbmc.data, min.cells = 3, min.genes = 200, 
    project = "10X_PBMC")
pbmc
## An object of class seurat in project 10X_PBMC 
##  13714 genes across 2700 samples.

进行一系列的QC步骤

mito.genes <- grep(pattern = "^MT-", x = rownames(x = pbmc@data), value = TRUE)
percent.mito <- Matrix::colSums(pbmc@raw.data[mito.genes, ]) / Matrix::colSums(pbmc@raw.data)

# AddMetaData adds columns to object@meta.data, and is a great place to stash QC stats
pbmc <- AddMetaData(object = pbmc, metadata = percent.mito, col.name = "percent.mito")
VlnPlot(object = pbmc, features.plot = c("nGene", "nUMI", "percent.mito"), nCol = 3)
# GenePlot is typically used to visualize gene-gene relationships, but can be used for anything 
# calculated by the object, i.e. columns in object@meta.data, PC scores etc.
# Since there is a rare subset of cells with an outlier level of high mitochondrial percentage
# and also low UMI content, we filter these as well
par(mfrow = c(1, 2))
GenePlot(object = pbmc, gene1 = "nUMI", gene2 = "percent.mito")
GenePlot(object = pbmc, gene1 = "nUMI", gene2 = "nGene")
# We filter out cells that have unique gene counts over 2,500 or less than 200
# Note that low.thresholds and high.thresholds are used to define a 'gate'
# -Inf and Inf should be used if you don't want a lower or upper threshold.
pbmc <- FilterCells(object = pbmc, subset.names = c("nGene", "percent.mito"), low.thresholds = c(200, -Inf), high.thresholds = c(2500, 0.05))

可以看到这里选择的QC标准是 200~2500基因范围内,以及线粒体基因表达占比小于5%的才保留。

normalization

这里默认根据细胞测序文库大小进行normalization,简单的做一个log转换即可。

 summary(pbmc@raw.data[,1])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1764  0.0000 76.0000
pbmc <- NormalizeData(object = pbmc, normalization.method = "LogNormalize", 
    scale.factor = 10000)

summary(pbmc@data[,1])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1171  0.0000  5.7531

rDetection of variable genes across the single cells

pbmc <- FindVariableGenes(object = pbmc, mean.function = ExpMean, dispersion.function = LogVMR, x.low.cutoff = 0.0125, x.high.cutoff = 3, y.cutoff = 0.5)
length(x = pbmc@var.genes)
## [1] 1838

Scaling the data and removing unwanted sources of variation

需要去除那些technical noise,batch effects, or even biological sources of variation (cell cycle stage)

pbmc <- ScaleData(object = pbmc, vars.to.regress = c("nUMI", "percent.mito"))
summary(pbmc@scale.data[,1])

PCA分析

pbmc <- RunPCA(object = pbmc, pc.genes = pbmc@var.genes, do.print = TRUE, pcs.print = 1:5, genes.print = 5)
## [1] "PC1"
## [1] "CST3"   "TYROBP" "FCN1"   "LST1"   "AIF1"  
## [1] ""
## [1] "PTPRCAP" "IL32"    "LTB"     "CD2"     "CTSW"   
## [1] ""
## [1] ""
## [1] "PC2"
## [1] "NKG7" "GZMB" "PRF1" "CST7" "GZMA"
## [1] ""
## [1] "CD79A"    "MS4A1"    "HLA-DQA1" "TCL1A"    "HLA-DQB1"
## [1] ""
## [1] ""
## [1] "PC3"
## [1] "PF4"   "PPBP"  "SDPR"  "SPARC" "GNG11"
## [1] ""
## [1] "CYBA"     "HLA-DPA1" "HLA-DPB1" "HLA-DRB1" "CD37"    
## [1] ""
## [1] ""
## [1] "PC4"
## [1] "IL32"   "GIMAP7" "AQP3"   "FYB"    "MAL"   
## [1] ""
## [1] "CD79A"    "HLA-DQA1" "CD79B"    "MS4A1"    "HLA-DQB1"
## [1] ""
## [1] ""
## [1] "PC5"
## [1] "FCER1A"  "LGALS2"  "MS4A6A"  "S100A8"  "CLEC10A"
## [1] ""
## [1] "FCGR3A"        "CTD-2006K23.1" "IFITM3"        "ABI3"         
## [5] "CEBPB"        
## [1] ""
## [1] ""

对PCA分析结果可以进行一系列的可视化: PrintPCA, VizPCA, PCAPlot, and PCHeatmap

# Examine and visualize PCA results a few different ways
PrintPCA(object = pbmc, pcs.print = 1:5, genes.print = 5, use.full = FALSE)
## [1] "PC1"
## [1] "CST3"   "TYROBP" "FCN1"   "LST1"   "AIF1"  
## [1] ""
## [1] "PTPRCAP" "IL32"    "LTB"     "CD2"     "CTSW"   
## [1] ""
## [1] ""
## [1] "PC2"
## [1] "NKG7" "GZMB" "PRF1" "CST7" "GZMA"
## [1] ""
## [1] "CD79A"    "MS4A1"    "HLA-DQA1" "TCL1A"    "HLA-DQB1"
## [1] ""
## [1] ""
## [1] "PC3"
## [1] "PF4"   "PPBP"  "SDPR"  "SPARC" "GNG11"
## [1] ""
## [1] "CYBA"     "HLA-DPA1" "HLA-DPB1" "HLA-DRB1" "CD37"    
## [1] ""
## [1] ""
## [1] "PC4"
## [1] "IL32"   "GIMAP7" "AQP3"   "FYB"    "MAL"   
## [1] ""
## [1] "CD79A"    "HLA-DQA1" "CD79B"    "MS4A1"    "HLA-DQB1"
## [1] ""
## [1] ""
## [1] "PC5"
## [1] "FCER1A"  "LGALS2"  "MS4A6A"  "S100A8"  "CLEC10A"
## [1] ""
## [1] "FCGR3A"        "CTD-2006K23.1" "IFITM3"        "ABI3"         
## [5] "CEBPB"        
## [1] ""
## [1] ""
VizPCA(object = pbmc, pcs.use = 1:2)
PCAPlot(object = pbmc, dim.1 = 1, dim.2 = 2)
# ProjectPCA scores each gene in the dataset (including genes not included in the PCA) based on their correlation 
# with the calculated components. Though we don't use this further here, it can be used to identify markers that 
# are strongly correlated with cellular heterogeneity, but may not have passed through variable gene selection. 
# The results of the projected PCA can be explored by setting use.full=T in the functions above
pbmc <- ProjectPCA(object = pbmc, do.print = FALSE)

最重要的就是 PCHeatmap 函数了

PCHeatmap(object = pbmc, pc.use = 1, cells.use = 500, do.balanced = TRUE, label.columns = FALSE)
PCHeatmap(object = pbmc, pc.use = 1:12, cells.use = 500, do.balanced = TRUE, label.columns = FALSE, use.full = FALSE)

找到有统计学显著性的主成分

主成分分析结束后需要确定哪些主成分所代表的基因可以进入下游分析,这里可以使用JackStraw做重抽样分析。可以用JackStrawPlot可视化看看哪些主成分可以进行下游分析。

pbmc <- JackStraw(object = pbmc, num.replicate = 100, do.print = FALSE) 
JackStrawPlot(object = pbmc, PCs = 1:12)

当然,也可以用最经典的碎石图来确定主成分。

PCElbowPlot(object = pbmc)

这个确定主成分是非常有挑战性的: - The first is more supervised, exploring PCs to determine relevant sources of heterogeneity, and could be used in conjunction with GSEA for example. - The second implements a statistical test based on a random null model, but is time-consuming for large datasets, and may not return a clear PC cutoff. - The third is a heuristic that is commonly used, and can be calculated instantly.

在本例子里面,3种方法结果差异不大,可以在PC7~10直接挑选。

Cluster the cells

# save.SNN = T saves the SNN so that the clustering algorithm can be rerun using the same graph
# but with a different resolution value (see docs for full details)
pbmc <- FindClusters(object = pbmc, reduction.type = "pca", dims.use = 1:10, resolution = 0.6, print.output = 0, save.SNN = TRUE)

A useful feature in Seurat v2.0 is the ability to recall the parameters that were used in the latest function calls for commonly used functions. For FindClusters, we provide the function PrintFindClustersParams to print a nicely formatted formatted summary of the parameters that were chosen.

PrintFindClustersParams(object = pbmc)
## Parameters used in latest FindClusters calculation run on: 2018-01-22 07:43:31
## =============================================================================
## Resolution: 0.6
## -----------------------------------------------------------------------------
## Modularity Function    Algorithm         n.start         n.iter
##      1                   1                 100             10
## -----------------------------------------------------------------------------
## Reduction used          k.param          k.scale          prune.SNN
##      pca                 30                25              0.0667
## -----------------------------------------------------------------------------
## Dims used in calculation
## =============================================================================
## 1 2 3 4 5 6 7 8 9 10
# While we do provide function-specific printing functions, the more general function to 
# print calculation parameters is PrintCalcParams(). 

Run Non-linear dimensional reduction (tSNE)

同样也是一个函数,这个结果也可以像PCA分析一下挑选合适的PC进行下游分析。

pbmc <- RunTSNE(object = pbmc, dims.use = 1:10, do.fast = TRUE)
# note that you can set do.label=T to help label individual clusters
TSNEPlot(object = pbmc)

这一步很耗时,可以保存该对象,便于重复,以及分享交流

save(pbmc, file = "pbmc3k.rData")

Finding differentially expressed genes (cluster biomarkers)

差异分析在seurat包里面被封装成了函数:FindMarkers,有一系列参数可以选择,然后又4种找差异基因的算法:

  • ROC test (“roc”)
  • t-test (“t”)
  • LRT test based on zero-inflated data (“bimod”, default)
  • LRT test based on tobit-censoring models (“tobit”)
# find all markers of cluster 1
cluster1.markers <- FindMarkers(object = pbmc, ident.1 = 1, min.pct = 0.25)
print(x = head(x = cluster1.markers, n = 5))
##                p_val avg_logFC pct.1 pct.2    p_val_adj
## S100A9  0.000000e+00  3.827593 0.996 0.216  0.00000e+00
## S100A8  0.000000e+00  3.786535 0.973 0.123  0.00000e+00
## LGALS2  0.000000e+00  2.634722 0.908 0.060  0.00000e+00
## FCN1    0.000000e+00  2.369524 0.956 0.150  0.00000e+00
## CD14   8.129864e-290  1.949317 0.663 0.029 1.11493e-285
# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(object = pbmc, ident.1 = 5, ident.2 = c(0,3), min.pct = 0.25)
print(x = head(x = cluster5.markers, n = 5))
##                p_val avg_logFC pct.1 pct.2     p_val_adj
## GZMB   3.854665e-190  3.195021 0.955 0.084 5.286288e-186
## IGFBP7 2.967797e-155  2.175917 0.542 0.010 4.070037e-151
## GNLY   7.492111e-155  3.514718 0.961 0.143 1.027468e-150
## FGFBP2 2.334109e-150  2.559484 0.852 0.085 3.200998e-146
## FCER1G 4.819154e-141  2.280724 0.839 0.100 6.608987e-137
# find markers for every cluster compared to all remaining cells, report only the positive ones
pbmc.markers <- FindAllMarkers(object = pbmc, only.pos = TRUE, min.pct = 0.25, thresh.use = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(2, avg_logFC)
## # A tibble: 16 x 7
## # Groups:   cluster [8]
##            p_val avg_logFC pct.1 pct.2     p_val_adj cluster     gene
##            <dbl>     <dbl> <dbl> <dbl>         <dbl>  <fctr>    <chr>
##  1 1.315805e-234  1.149058 0.924 0.483 1.804495e-230       0     LDHB
##  2 3.311687e-129  1.068122 0.662 0.202 4.541648e-125       0     IL7R
##  3  0.000000e+00  3.827593 0.996 0.216  0.000000e+00       1   S100A9
##  4  0.000000e+00  3.786535 0.973 0.123  0.000000e+00       1   S100A8
##  5  0.000000e+00  2.977399 0.936 0.042  0.000000e+00       2    CD79A
##  6 1.038405e-271  2.492236 0.624 0.022 1.424068e-267       2    TCL1A
##  7 8.029765e-207  2.158812 0.974 0.230 1.101202e-202       3     CCL5
##  8 1.118949e-181  2.113428 0.588 0.050 1.534527e-177       3     GZMK
##  9 1.066599e-173  2.275509 0.962 0.137 1.462733e-169       4   FCGR3A
## 10 1.996623e-123  2.151881 1.000 0.316 2.738169e-119       4     LST1
## 11 9.120707e-265  3.334634 0.955 0.068 1.250814e-260       5     GZMB
## 12 6.251673e-192  3.763928 0.961 0.131 8.573544e-188       5     GNLY
## 13 2.510362e-238  2.729243 0.844 0.011 3.442711e-234       6   FCER1A
## 14  7.037034e-21  1.965168 1.000 0.513  9.650588e-17       6 HLA-DPB1
## 15 2.592342e-186  4.952160 0.933 0.010 3.555138e-182       7      PF4
## 16 7.813553e-118  5.889503 1.000 0.023 1.071551e-113       7     PPBP

值得注意的是: The ROC test returns the ‘classification power’ for any individual marker (ranging from 0 - random, to 1 - perfect).

cluster1.markers <- FindMarkers(object = pbmc, ident.1 = 0, thresh.use = 0.25, test.use = "roc", only.pos = TRUE)

同时,该包提供了一系列可视化方法来检查差异分析的结果的可靠性:

  • VlnPlot (shows expression probability distributions across clusters)
  • FeaturePlot (visualizes gene expression on a tSNE or PCA plot) are our most commonly used visualizations
  • JoyPlot, CellPlot, and DotPlot
VlnPlot(object = pbmc, features.plot = c("MS4A1", "CD79A"))
# you can plot raw UMI counts as well
VlnPlot(object = pbmc, features.plot = c("NKG7", "PF4"), use.raw = TRUE, y.log = TRUE)
FeaturePlot(object = pbmc, features.plot = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"), cols.use = c("grey", "blue"), reduction.use = "tsne")

DoHeatmap generates an expression heatmap for given cells and genes. In this case, we are plotting the top 20 markers (or all markers if less than 20) for each cluster.

pbmc.markers %>% group_by(cluster) %>% top_n(10, avg_logFC) -> top10
# setting slim.col.label to TRUE will print just the cluster IDS instead of every cell name
DoHeatmap(object = pbmc, genes.use = top10$gene, slim.col.label = TRUE, remove.key = TRUE)

Assigning cell type identity to clusters

这个主要取决于生物学背景知识:

Cluster ID

Markers

Cell Type

0

IL7R

CD4 T cells

1

CD14, LYZ

CD14+ Monocytes

2

MS4A1

B cells

3

CD8A

CD8 T cells

4

FCGR3A, MS4A7

FCGR3A+ Monocytes

5

GNLY, NKG7

NK cells

6

FCER1A, CST3

Dendritic Cells

7

PPBP

Megakaryocytes

current.cluster.ids <- c(0, 1, 2, 3, 4, 5, 6, 7)
new.cluster.ids <- c("CD4 T cells", "CD14+ Monocytes", "B cells", "CD8 T cells", "FCGR3A+ Monocytes", "NK cells", "Dendritic cells", "Megakaryocytes")
pbmc@ident <- plyr::mapvalues(x = pbmc@ident, from = current.cluster.ids, to = new.cluster.ids)
TSNEPlot(object = pbmc, do.label = TRUE, pt.size = 0.5)

Further subdivisions within cell types

# First lets stash our identities for later
pbmc <- StashIdent(object = pbmc, save.name = "ClusterNames_0.6")

# Note that if you set save.snn=T above, you don't need to recalculate the SNN, and can simply put: 
# pbmc <- FindClusters(pbmc,resolution = 0.8)
pbmc <- FindClusters(object = pbmc, reduction.type = "pca", dims.use = 1:10, resolution = 0.8, print.output = FALSE)

# Demonstration of how to plot two tSNE plots side by side, and how to color points based on different criteria
plot1 <- TSNEPlot(object = pbmc, do.return = TRUE, no.legend = TRUE, do.label = TRUE)
plot2 <- TSNEPlot(object = pbmc, do.return = TRUE, group.by = "ClusterNames_0.6", no.legend = TRUE, do.label = TRUE)
plot_grid(plot1, plot2)
# Find discriminating markers
tcell.markers <- FindMarkers(object = pbmc, ident.1 = 0, ident.2 = 1)

# Most of the markers tend to be expressed in C1 (i.e. S100A4). However, we can see that CCR7 is upregulated in 
# C0, strongly indicating that we can differentiate memory from naive CD4 cells.
# cols.use demarcates the color palette from low to high expression
FeaturePlot(object = pbmc, features.plot = c("S100A4", "CCR7"), cols.use = c("green", "blue"))

The memory/naive split is bit weak, and we would probably benefit from looking at more cells to see if this becomes more convincing. In the meantime, we can restore our old cluster identities for downstream processing.

还有一个非常给力的用法,限于篇幅,就不介绍了,大家可以自行探索。

后面还有一个10X的单细胞实战,用的就是这个包,敬请期待。

原文发布于微信公众号 - 生信技能树(biotrainee)

原文发表时间:2018-01-24

本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。

发表于

我来说两句

0 条评论
登录 后参与评论

相关文章

来自专栏专注研发

【2018】笔试题笔记

3.对关键字{10,20,8,25,35,6,18,30,5,15,28}序列进行希尔排序,取增量d =5时,排序结果为( {6,18,8,5,15,10,...

21340
来自专栏犀利豆的技术空间

Java 渲染 docx 文件,并生成 pdf 加水印

一顿google以后发现了 StackOverflow 上的这个回答:Converting docx into pdf in java 使用如下的 jar 包:

54410
来自专栏知识分享

1-关于单片机通信数据传输(中断发送,大小端,IEEE754浮点型格式,共用体,空闲中断,环形队列)

写这篇文章的目的呢,如题目所言,我承认自己是一个程序猿.....应该说很多很多学单片机的对于...先不说别的了,,无论是学51的还是32的,,,先问一下大家用串...

30950
来自专栏大数据

数控加工中心编程小技巧

对于数控加工来说,编程至关重要,直接影响到加工的质量与效率,相信大家也是对编程又爱又恨吧。那么如何迅速掌握数控加工中心的编程技巧呢?下面与老路一起学习一下吧! ...

23570
来自专栏流媒体

Android平台下使用FFmpeg进行RTMP推流(摄像头推流)

前面讲到了在Android平台下使用FFmpeg进行RTMP推流(视频文件推流),里面主要是介绍如何解析视频文件并进行推流,今天要给大家介绍如何在Android...

66940
来自专栏happyJared

Elasticsearch地理坐标类型(Geo-point)在Spring Data ES中的常见使用问题整理解答

  下文整理的几个问答,本人在实际应用中亲身经历或解决过的,主要涉及Elasticsearch地理坐标类型(Geo-point)在Java应用中的一些特殊使用场...

50110
来自专栏牛客网

字节跳动面经

只能自己给自己鼓励,不能总是按别人说的去做。有时,你甚至连为什么要这样做都分不清楚,任何的责任都可以让人振奋。但是,荣誉,那才是让你决定做还是不做一件事的原因。...

57410
来自专栏小樱的经验随笔

CTF---隐写术入门第二题 小苹果

小苹果分值:10 来源: hanyuhang 难度:易 参与人数:2159人 Get Flag:862人 答题人数:996人 解题通过率:87% flag格...

33050
来自专栏芋道源码1024

重磅:JDK 11 正式发布!东半球第二全特性解读!

千呼万唤,JDK11于2018-09-25正式发布!你是不是和笔者一样还在使用JDK8呢?甚至有些开发者还在使用JDK7!没关系,让我们先一睹JDK11的风采。

15620
来自专栏技术/开源

Enum引发的血案,反思

前几天公司产品更新版本,更新完后不少用户反应原先保存的report的一些表在新版本打开后设置突然变了,本来选的第六个,现在打开变成第四个了。领导要求赶紧查出原因...

20950

扫码关注云+社区

领取腾讯云代金券