Why CD4-positive T cells are not the cells that express the most CD4

生信技能树

Published 2021-10-12 12:02:27

This article is included in the column: 生信技能树

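You have run the official pbmc3k example with us many times. All you need to do is download pbmc3k_filtered_gene_bc_matrices.tar.gz from the link shown below, extract it, and then read the extracted folder with the Read10X function from the Seurat package.

The standard code is as follows: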

library(Seurat)
# https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
## Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")

## Initialize the Seurat object with the raw (non-normalized) data.
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", 
                           min.cells = 3, min.features = 200)
## Identification of mitochondrial genes
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

## Filtering cells following standard QC criteria.
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & 
                 percent.mt < 5)

## Normalizing the data (LogNormalize with scale.factor = 10000 is also the default)
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", 
                      scale.factor = 10000)

## Identify the 2000 most highly variable genes
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

## In addition we scale the data
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)

pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc), 
               verbose = FALSE)
pbmc <- FindNeighbors(pbmc, dims = 1:10, verbose = FALSE)
pbmc <- FindClusters(pbmc, resolution = 0.5, verbose = FALSE)
pbmc <- RunUMAP(pbmc, dims = 1:10, umap.method = "uwot", metric = "cosine")
table(pbmc$seurat_clusters)
# pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25,  logfc.threshold = 0.25, verbose = FALSE)
DimPlot(pbmc, reduction = "umap", group.by = 'seurat_clusters',
        label = TRUE, pt.size = 0.5) 
DotPlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", 
                               "CD14", "FCER1A", "FCGR3A", 
                               "LYZ", "PPBP", "CD8A"),
        group.by = 'seurat_clusters')
## Assigning cell type identity to clusters
new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T",
                     "FCGR3A+ Mono", "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
 
pbmc$cluster_by_counts=Idents(pbmc)
table(pbmc$cluster_by_counts)

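Although this is only a preliminary biological annotation, it looks reasonable:

Figure: preliminary biological annotation

But when I checked the expression of the CD4 gene, I found something interesting:

Figure: every cell subpopulation expresses the CD4 gene

As you can see, every cell subpopulation expresses the CD4 gene. Although we named two clusters Naive CD4 T and Memory CD4 T, they do not show specifically high CD4 expression!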

The code for the visualization above is as follows:

sce=pbmc
sce$celltype=Idents(sce)

p1=FeaturePlot(sce,'CD4')
p2=DimPlot(sce, reduction = "umap", 
        label = TRUE, repel = T,pt.size = 0.5) + NoLegend()
p3=VlnPlot(sce,'CD4',group.by = 'celltype')
library(patchwork)
p1+p2
p1+p3
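
To put numbers on an observation like this rather than judging it by eye, one option is to compute, per cluster, the fraction of cells with at least one CD4 read. The block below is only a minimal sketch on simulated data: the vectors `cd4_counts` and `cluster` are made up here, and for a real Seurat object they would have to be pulled from the counts matrix and from `Idents()` (that step is an assumption and is not shown).

```r
# Toy data standing in for a real dataset: raw CD4 counts per cell and
# each cell's cluster label (both simulated, not taken from pbmc3k).
set.seed(1)
cluster <- factor(rep(c("Naive CD4 T", "CD14+ Mono", "B"), times = c(50, 30, 20)))
cd4_counts <- rpois(length(cluster), lambda = 0.3)

# Fraction of cells with at least one CD4 read, per cluster.
frac_expressing <- tapply(cd4_counts > 0, cluster, mean)

# Mean raw count per cluster, for comparison.
mean_count <- tapply(cd4_counts, cluster, mean)

summary_table <- data.frame(
  cluster = names(frac_expressing),
  frac_expressing = as.numeric(frac_expressing),
  mean_count = as.numeric(mean_count)
)
print(summary_table)
```

A table like this makes it easy to compare the clusters labelled as CD4 T cells against, for example, the monocyte clusters.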

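At this point a reader asked whether the CD4 expression information from the second plot (the FeaturePlot) could be drawn on top of the first plot (the UMAP). The figure that prompted the question comes from the paper "IL-11 is a crucial determinant of cardiovascular fibrosis".

As shown below, the authors simply wanted to show that the IL-11 gene is expressed at a relatively high level in one of the fibroblast subpopulations!

Figure: one of the fibroblast subpopulations expresses the IL-11 gene

I checked, and as far as I could find the Seurat package has no function for this. However, every plot drawn by Seurat is a ggplot object, so it is fairly easy to customize.

The figure above is really just FeaturePlot information overlaid on a UMAP. My code is as follows: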

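## Base plot: UMAP coloured and labelled by cell type
p2=DimPlot(sce, reduction = "umap", 
        label = TRUE, repel = TRUE, pt.size = 0.5) + NoLegend()
## UMAP coordinates, one row per cell (slot access as in Seurat v3/v4 objects)
pos=sce@reductions$umap@cell.embeddings
## Keep only the cells whose raw CD4 count is greater than 1
pos=pos[sce@assays$RNA@counts['CD4',]>1,]
head(pos)
library(ggplot2)
## Overlay those cells; UMAP_1/UMAP_2 must match colnames(pos)
p2+geom_point(aes(x=UMAP_1,y=UMAP_2), 
              shape = 21, colour = "black",
              fill = "blue", size = 0.5,  
              data = as.data.frame(pos))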

The result looks like this:

Figure: FeaturePlot information overlaid on the UMAP
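
The same overlay idea can be tried without any single-cell data at all. The following is a sketch using only ggplot2 on a simulated two-dimensional embedding; the data frame `emb`, its columns, and the threshold are all made up for illustration and merely stand in for the UMAP coordinates and the CD4 counts.

```r
library(ggplot2)

# Simulated 2-D embedding with two clusters (a stand-in for UMAP coordinates).
set.seed(42)
n <- 200
emb <- data.frame(
  dim1 = c(rnorm(n / 2, mean = -2), rnorm(n / 2, mean = 2)),
  dim2 = rnorm(n),
  cluster = rep(c("A", "B"), each = n / 2)
)
# Simulated raw counts for a hypothetical gene of interest.
emb$gene_count <- rpois(n, lambda = 0.5)

# Base layer: every cell, coloured by cluster.
base_plot <- ggplot(emb, aes(x = dim1, y = dim2, colour = cluster)) +
  geom_point(size = 0.5)

# Overlay layer: only cells above the count threshold, drawn from a subset
# of the data passed through the layer's own `data` argument.
threshold <- 1
expressing <- emb[emb$gene_count > threshold, ]
overlay_plot <- base_plot +
  geom_point(data = expressing, aes(x = dim1, y = dim2),
             shape = 21, colour = "black", fill = "blue", size = 1.5,
             inherit.aes = FALSE)
print(overlay_plot)
```

The key point is that each ggplot2 layer can take its own `data`, so a second `geom_point()` can draw a subset of cells on top of the full scatter plot.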

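First, you need to understand the Seurat object

There is not much code, but you first need to understand the Seurat object! Single-cell data appear to come in many varieties, with fastq data from CEL-seq, MARS-seq, Drop-seq, Chromium 10x, and SMART-seq. In the end, however, they all yield an expression matrix. People typically work with five R packages (scater, monocle, Seurat, scran, and M3Drop), and you need to be fluent with their objects; see the post "一些单细胞转录组R包的对象" (objects of some single-cell transcriptome R packages). The analysis workflows are also broadly similar: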

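step1: Create the object
step2: Quality control
step3: Normalize and scale the expression values
step4: Remove confounding factors (integrate multiple samples)
step5: Identify the important genes
step6: Apply one or more dimensionality reduction algorithms
step7: Visualize the dimensionality reduction results
step8: Apply one or more clustering algorithms
step9: After clustering, find the marker genes of each cell subpopulation
step10: Continue with further sub-clustering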
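
As a small illustration of what "understanding the object" can mean in practice, the sketch below inspects the tiny `pbmc_small` example object that, to my knowledge, ships with the SeuratObject package. It assumes SeuratObject is installed and that `pbmc_small` contains a "tsne" reduction; it uses accessor functions rather than direct slot access.

```r
library(SeuratObject)

# pbmc_small is a tiny example object bundled with SeuratObject.
data("pbmc_small")

# Which assays and dimensional reductions does the object hold?
print(Assays(pbmc_small))
print(Reductions(pbmc_small))

# Per-cell metadata (one row per cell).
meta <- pbmc_small[[]]
print(head(meta, 3))

# Current cluster identities and their sizes.
print(table(Idents(pbmc_small)))

# Low-dimensional coordinates: a plain matrix with cells in rows.
emb <- Embeddings(pbmc_small, reduction = "tsne")
print(head(emb, 3))
```

Once the coordinates and identities are available as ordinary matrices and factors, they can be handed to any plotting code, which is exactly what the overlay trick relies on.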

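If you are also interested in 10x single-cell transcriptomics, see the single-cell content in our "明码标价" column.

This is the result of the standard dimensionality reduction, clustering, and biological annotation of single-cell transcriptome data. You can refer to the earlier example, "人人都能学会的单细胞聚类分群注释", in which we demonstrated the first level of clustering.

If you do not yet have a basic understanding of single-cell data analysis, see the ten introductory lessons:

Second, you need to know the ggplot grammar

A statistical graphic is a mapping from data to the aesthetic attributes (colour, shape, size, and so on) of geometric objects (points, lines, bars, and so on).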

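✦ Data: the most basic parts are the data to be visualized and a set of aesthetic mappings, which describe how variables in the data are mapped to visible aesthetic attributes.
✦ Geometric objects (geoms) represent what you actually see in the plot: points, lines, polygons, and so on.
✦ Statistical transformations (stats) summarize the data in some way, for example binning the data to create a histogram, or describing a two-dimensional relationship with a linear model.
✦ Scales map values in the data space to values in the aesthetic space, for example using colour, size, or shape to represent different values. Scales are usually shown by drawing legends and axes.
✦ Coordinate system (coord) describes how the data are mapped to the plane of the graphic, and also provides the axes and gridlines needed to read the plot.
✦ Faceting describes how to break the data into subsets and how to plot and display those subsets.
✦ Theme controls the finer points of display, such as the font size and the background colour.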
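
The components listed above can be lined up one-to-one with ggplot2 code. The following sketch uses the built-in `mtcars` data set; the particular variables, palette, and axis limits are arbitrary choices made for illustration.

```r
library(ggplot2)

# Data and aesthetic mappings: wt -> x, mpg -> y, cyl -> colour.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  # Geometric object: points.
  geom_point(size = 2) +
  # Statistical transformation: a linear model fitted per colour group.
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  # Scale: how cyl values map to colours and how the legend is titled.
  scale_colour_brewer(palette = "Dark2", name = "Cylinders") +
  # Coordinate system: Cartesian, zoomed to a fixed x range.
  coord_cartesian(xlim = c(1, 6)) +
  # Faceting: one panel per transmission type (0 = automatic, 1 = manual).
  facet_wrap(~ am) +
  # Theme: non-data details such as font size and background.
  theme_bw(base_size = 11)
print(p)
```

Removing or swapping any single line changes only that component, which is what makes Seurat's ggplot-based plots easy to extend.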

The book written by the author of ggplot2 himself

Link: https://ggplot2-book.org/facet.html

Title: ggplot2: Elegant Graphics for Data Analysis. Author: Hadley Wickham

This is the online version of work-in-progress 3rd edition of “ggplot2: elegant graphics for data analysis”

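Although a Chinese translation of this book exists, it lags behind the original, so I recommend reading this continuously updated online version directly.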

I Getting started

1 Introduction
2 Getting started with ggplot2
3 Frequently asked questions

II Toolbox

Introduction

4 Individual geoms
5 Collective geoms
6 Statistical summaries
7 Maps
8 Annotations
9 Arranging plots

III The Grammar

10 Mastering the grammar
11 Build a plot layer by layer
12 Scales, axes and legends
13 Coordinate systems
14 Facetting
15 Themes

IV Extending ggplot2

16 Programming with ggplot2
17 ggplot2 internals
18 Writing ggplot2 extensions
19 Extension Case Study: Springs, Part 1
References

After reading it you will certainly feel it was worth the effort! Plan on at least ten days.

Reference card (cheat sheet)

Link: https://ggplot2.tidyverse.org/reference/

Its contents are:

Plot basics
Layer: geoms
Layer: stats
Layer: position adjustment
Layer: annotations
Aesthetics
Scales
Guides: axes and legends
Facetting
Facetting: labels
Coordinate systems
Themes
Programming with ggplot2
Extending ggplot2
Vector helpers
Data
Autoplot and fortify

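Reading through this reference card is a good way to test how well you remember the ggplot2 grammar.

Core ggplot chart examples on the sthda website

Link: http://www.sthda.com/english/wiki/ggplot2-essentials

The book itself is sold for 17 euros, but all of its content is available online, so browsing the web pages is free!

Contents: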

qplot(): Quick plot with ggplot2
- Scatter plots
- Bar plot
- Box plot, violin plot and dot plot
- Histogram and density plots
Box plots
- Change box plot line colors
- Change box plot fill colors
- Basic box plots
- Box plot with dots
- Change box plot colors by groups
- Change the legend position
- Change the order of items in the legend
- Box plot with multiple groups
- Functions: geom_boxplot(), stat_boxplot(), stat_summary()

··· 25 chapters in between omitted

Rotate a plot: flip and reverse

Horizontal plot : coord_flip()
Reverse y axis
Functions: coord_flip(), scale_x_reverse(), scale_y_reverse()

Faceting: split a plot into a matrix of panels

Facet with one variable
Facet with two variables
Facet scales
Facet labels
facet_wrap
Functions: facet_grid(), facet_wrap(), label_both(), label_bquote(), label_parsed()

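The material is rich enough that following all of it takes about five days at the very least.

It also covers the following extension packages: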

factoextra - Extract and Visualize the outputs of a multivariate analysis: PCA (Principal Component Analysis), CA (Correspondence Analysis), MCA (Multiple Correspondence Analysis) and clustering analyses.
easyggplot2: Perform and customize easily a plot with ggplot2: box plot, dot plot, strip chart, violin plot, histogram, density plot, scatter plot, bar plot, line plot, etc, …
ggplot2 - Easy way to mix multiple graphs on the same page
ggplot2: Correlation matrix heatmap. Functions: geom_raster() and geom_tile()
ggfortify: Allow ggplot2 to handle some popular R packages. These include plotting 1) Matrix; 2) Linear Model and Generalized Linear Model; 3) Time Series; 4) PCA/Clustering; 5) Survival Curve; 6) Probability distribution
GGally: GGally extends ggplot2 for visualizing correlation matrix, scatterplot plot matrix, survival plot and more.
ggRandomForests: Graphical analysis of random forests with the randomForestSRC and ggplot2 packages.
ggdendro: Create dendrograms and tree diagrams using ggplot2
ggmcmc: Tools for Analyzing MCMC Simulations from Bayesian Inference
ggthemes: Package with additional ggplot2 themes and scales
Theme used to create journal ready figures easily