自己做的单细胞注释与文献对比：注释信息完全对不上？（数据注释篇）

生信技能树

发布于 2025-06-09 11:17:57

8300

代码可运行

文章被收录于专栏：生信技能树生信技能树

运行总次数：0

代码可运行

接上一篇帖子：自己做的单细胞注释与文献对比：注释信息完全对不上？（数据下载篇）

学习的是曾老板发布在群里的第11个写作任务：

【写作任务11】，注释信息完全对不上？数据：2022-OEP003500-胃癌-Cachexia

简单回顾数据背景

这个数据对应的文献于2022年11月15号发表在Cell Discov. 杂志上，标题为《Single-cell sequencing unveils key contributions of immune cell populations in cancer-associated adipose wasting》。下面这个图是文献中的单细胞注释结果：

这个是曾老板目前做的结果与文献中的结果对比：竖着的列是老板的注释结果，横着的行是文献中的注释结果，看起来就是完全没对上啊！

样本组织类型为：「皮下脂肪组织（Subcutaneous Adipose Tissue，SAT）」和 「内脏脂肪组织（Visceral Adipose Tissue，VAT）」

「皮下脂肪组织（Subcutaneous Adipose Tissue，SAT）」

细胞组成：主要由成熟的脂肪细胞（adipocytes）构成。这些脂肪细胞体积较大，呈球形或多边形，细胞内充满了脂肪滴。除了脂肪细胞外，还包含一些成纤维细胞、血管内皮细胞、免疫细胞（如巨噬细胞）等。成纤维细胞可以合成和分泌细胞外基质成分，维持组织的结构完整性；血管内皮细胞构成组织内的血管网络，为脂肪细胞提供营养和氧气；免疫细胞则参与组织的免疫反应和炎症调节。

「内脏脂肪组织（Visceral Adipose Tissue，VAT）」

内脏脂肪组织也主要由脂肪细胞构成，不过它的脂肪细胞与皮下脂肪组织的脂肪细胞在功能和代谢特性上可能存在差异。它同样包含血管内皮细胞、免疫细胞等，而且由于其代谢活跃，免疫细胞的含量和活性可能相对更高，这使得它在炎症反应和代谢调节方面的作用更加复杂。

数据下载见前面一篇：自己做的单细胞注释与文献对比：注释信息完全对不上？（数据下载篇）

数据读取

这里作者提供的数据如下：

OED00759126_total_cells.zip # 原始表达矩阵
OED00771634_annotation.tsv # 注释文件1
OED00771635_processed_total.zip # 过滤后的表达矩阵
OED00797031_annotation.tsv # 注释文件2

有两个注释文件，很奇怪，先读取进来看看吧：

###
### Create: Jianming Zeng
### Date:  2023-12-31  
### Email: jmzeng1314@163.com
### Blog: http://www.bio-info-trainee.com/
### Forum:  http://www.biotrainee.com/thread-1376-1-1.html
### CAFS/SUSTC/Eli Lilly/University of Macau
### Update Log: 2023-12-31   First version 
### Update Log: 2024-12-09   by juan zhang (492482942@qq.com)
### 
rm(list=ls())
options(stringsAsFactors = F)
library(ggsci)
library(dplyr) 
library(future)
library(Seurat)
library(clustree)
library(cowplot)
library(data.table)
library(ggplot2)
library(patchwork)
library(stringr)
library(qs)
library(Matrix)

# 创建目录
getwd()
gse <- "OEP003500"
dir.create(gse)

###### step1: 导入数据 ######   
data <- Read10X("OEP003500/processed/")
data[1:5, 1:5]
dim(data)

phe1 <- data.table::fread('OEP003500/OED00771634_annotation.tsv',data.table = F)
phe2 <- data.table::fread('OEP003500/OED00797031_annotation.tsv',data.table = F)
dim(phe1)
dim(phe2)
head(phe1)
table(phe1$orig.ident)
table(phe1$depot)
table(phe1$cachexia.status)
table(phe1$group)
table(phe1$mannual.annotation)
identical(phe1$cell,phe2$cell)
identical(phe1$mannual.annotation, phe2$mannual.annotation)
table(phe1$mannual.annotation, phe2$mannual.annotation)
identical(phe1$cachexia.status, phe2$cachexia.status)
identical(phe1$group, phe2$group)
identical(phe1$orig.ident, phe2$orig.ident)
identical(phe1$depot, phe2$depot)

meta <- data.frame(phe1,mannual.annotation2=phe2$mannual.annotation)
head(meta)
rownames(meta) <- meta$cell
head(meta)
table(meta$mannual.annotation)
table(meta$mannual.annotation2)
meta <- meta[colnames(data),]

# 创建对象
sce.all <- CreateSeuratObject(counts = data, meta.data = meta, min.cells = 3)
sce.all

# 查看特征
as.data.frame(sce.all@assays$RNA$counts[1:10, 1:2])
head(sce.all@meta.data, 10)
table(sce.all$orig.ident)

读取进来的细胞数：31170个细胞

image-20250608233007665

不管是过滤前的细胞数还是过滤后的细胞数，跟文献中的数字有点对不上：

image-20250608232941343

然后是qc以及去批次

rm(list=ls())
library(Seurat)
library(qs)
library(ggplot2)

# 读取数据
sce.all <- qread("OEP003500/sce.all.qs")
sce.all


###### step2: QC质控 ######
## 如果过滤的太狠，就需要去修改这个过滤代码
dir.create("./1-QC")
sp <- 'human'

# 1.计算线粒体基因比例
mito_genes <- grep("^MT-", rownames(sce.all),ignore.case = T, value = T) 
# 可能是13个线粒体基因
print(mito_genes)
# sce.all=PercentageFeatureSet(sce.all, "^MT-", col.name = "percent_mito")
sce.all <- PercentageFeatureSet(sce.all, features = mito_genes, col.name = "percent_mito")
fivenum(sce.all@meta.data$percent_mito)

# 2.计算核糖体基因比例
ribo_genes <- grep("^Rp[sl]", rownames(sce.all),ignore.case = T, value = T)
print(ribo_genes)
sce.all <- PercentageFeatureSet(sce.all,  features = ribo_genes, col.name = "percent_ribo")
fivenum(sce.all@meta.data$percent_ribo)

# 3.计算红血细胞基因比例
Hb_genes <- grep("^Hb[^(p)]", rownames(sce.all),ignore.case = T,value = T)
print(Hb_genes)
sce.all <- PercentageFeatureSet(sce.all, features=Hb_genes, col.name="percent_hb")
fivenum(sce.all@meta.data$percent_hb)

# 可视化细胞的上述比例情况
# pic1
p1 <- VlnPlot(sce.all, group.by = "orig.ident", features = c("nFeature_RNA", "nCount_RNA","percent_mito", "percent_ribo", "percent_hb"), 
              pt.size = 0.02, ncol = 5) + NoLegend()
p1 
w <- length(unique(sce.all$orig.ident))/1.5+10;w
ggsave(filename="1-QC/Vlnplot1.pdf",plot=p1,width = w,height = 5)

# pic2
p2 <- VlnPlot(sce.all, group.by = "orig.ident", features = c("nFeature_RNA", "nCount_RNA","percent_mito", "percent_ribo", "percent_hb"), 
              pt.size = 0, ncol = 5) + NoLegend()
p2 
w <- length(unique(sce.all$orig.ident))/1.5+10;w
ggsave(filename="1-QC/Vlnplot2.pdf",plot=p2,width = w,height = 5)

# pic3
p3 <- FeatureScatter(sce.all, "nCount_RNA", "nFeature_RNA", group.by = "orig.ident", pt.size = 0.5) + 
  NoLegend()
ggsave(filename="1-QC/Scatterplot.pdf",plot=p3, width = 6, height = 6)


library(qs)
sce.all.filt <- sce.all
qsave(sce.all.filt, file="1-QC/sce.all.filt.qs")

QC小提琴图如下：数据是过滤后的了，文献中过滤条件比较不一般，感兴趣的可以去看看

image-20250608233320989

去批次：

rm(list=ls())
library(harmony)
library(qs)
library(Seurat)
library(clustree)

###### step3: harmony整合多个单细胞样品 ######
dir.create("2-harmony")
getwd()

# 读取数据
sce.all.filt <- qread("1-QC/sce.all.filt.qs")

# 默认 ScaleData 没有添加"nCount_RNA", "nFeature_RNA"
sce.all.filt
print(dim(sce.all.filt))
head(sce.all.filt@meta.data)

# 走标准流程
sce.all.filt <- NormalizeData(sce.all.filt, normalization.method = "LogNormalize",scale.factor = 1e4) 
sce.all.filt <- FindVariableFeatures(sce.all.filt, nfeatures = 2000)
sce.all.filt <- ScaleData(sce.all.filt)
sce.all.filt <- RunPCA(sce.all.filt, features = VariableFeatures(object = sce.all.filt))

# 降维聚类，UMAP降维
sce.all.filt <- FindNeighbors(sce.all.filt, dims = 1:20)
sce.all.filt <- FindClusters(sce.all.filt, resolution = 0.5)
head(Idents(sce.all.filt), 5)
str(sce.all.filt@meta.data)
head(sce.all.filt$RNA_snn_res.0.5)
head(sce.all.filt$seurat_clusters)

sce.all.filt <- RunUMAP(sce.all.filt,  dims = 1:20, reduction = "pca")
p <- DimPlot(sce.all.filt, reduction = "umap",label=T,group.by = "orig.ident") 
p
ggsave(filename='2-harmony/umap-by-orig.ident-before-harmony.png',plot = p, width = 6,height = 5.5)

head(sce.all.filt@meta.data)
p1 <- DimPlot(sce.all.filt, reduction = "umap",label=T,group.by = "mannual.annotation") + ggtitle("mannual.annotation")
p2 <- DimPlot(sce.all.filt, reduction = "umap",label=T,group.by = "mannual.annotation2") + ggtitle("mannual.annotation2") 
p <- p1 + p2 & NoLegend()
ggsave(filename='2-harmony/umap-by-orig.ident-before-harmony-anno.png',plot = p, width = 12,height = 5.5)
save(sce.all.filt, file = "2-harmony/sce.all.filt.Rdata")

# 运行harmony 
sce.all.filt <- RunHarmony(sce.all.filt, "orig.ident")
names(sce.all.filt@reductions)
head(sce.all.filt@meta.data)


# UMAP & TSNE 降维
sce.all.filt <- RunUMAP(sce.all.filt,  dims = 1:20, reduction = "harmony")
p <- DimPlot(sce.all.filt, reduction = "umap",label=T,group.by = "orig.ident")
p
ggsave(filename='2-harmony/umap-by-orig.ident-after-harmony.png',plot = p, width = 6,height = 5.5)



# 聚类
# 设置不同的分辨率，观察分群效果(选择哪一个？)
sce.all.filt <- FindNeighbors(sce.all.filt, reduction = "harmony", dims = 1:20) 
for (res in c(0.01, 0.05, 0.1, 0.2, 0.3, 0.5,0.8,1)) {
# graph.name = "CCA_snn",
# sce.all.filt <- FindClusters(sce.all.filt, resolution = res, algorithm = 1) 
  sce.all.filt <- FindClusters(sce.all.filt, resolution = res) 
}
colnames(sce.all.filt@meta.data)
apply(sce.all.filt@meta.data[,grep("RNA_snn",colnames(sce.all.filt@meta.data))],2,table)

# save
p_tree <- clustree(sce.all.filt@meta.data, prefix = "RNA_snn_res.")
ggsave(plot=p_tree, filename="2-harmony/Tree_diff_resolution.pdf", width = 7, height = 10)

p <- DimPlot(sce.all.filt, reduction = "umap", group.by = "RNA_snn_res.0.3",label = T) + 
  ggtitle("louvain_0.3")
ggsave(plot=p, filename="2-harmony/Dimplot_resolution_0.3.pdf",width = 6, height = 6)

p <- DimPlot(sce.all.filt, reduction = "umap", group.by = "RNA_snn_res.0.1",label = T) + 
  ggtitle("louvain_0.1")
ggsave(plot=p, filename="2-harmony/Dimplot_resolution_0.1.pdf",width = 6, height = 6)

table(sce.all.filt@active.ident) 

library(qs)
qsave(sce.all.filt, file="2-harmony/sce.all_int.qs")

作者的两个注释可以在这里直接看了：

p1 <- DimPlot(sce.all.filt, reduction = "umap",label=T,group.by = "mannual.annotation") + ggtitle("mannual.annotation")
p2 <- DimPlot(sce.all.filt, reduction = "umap",label=T,group.by = "mannual.annotation2") + ggtitle("mannual.annotation2") 
p <- p1 + p2 & NoLegend()
ggsave(filename='2-harmony/umap-after-harmony-anno.png',plot = p, width = 10,height = 5.5)

是不是很奇怪，给的注释结果，看起来mannual.annotation2稍微正常一点：

image-20250608233552025

细胞注释

我们自己来注释一下吧，样本组织类型可以看帖子上面的描述，umap图：

rm(list=ls())
library(Seurat)
library(ggplot2)
library(SCP) # https://zhanghao-njmu.github.io/SCP/index.html
library(scCustomize)
library(qs)

###### step4:  看标记基因库 ######
# 原则上分辨率是需要自己肉眼判断，取决于个人经验
sce.all.int <- qread("2-harmony/sce.all_int.qs")
sce.all.int
table(Idents(sce.all.int))
table(sce.all.int$seurat_clusters)
table(sce.all.int$RNA_snn_res.0.1) 
table(sce.all.int$RNA_snn_res.0.3) 

getwd()
dir.create('3-check-by-0.3')
select_idet <- "RNA_snn_res.0.3"
sce.all.int$RNA_snn_res.0.3
#sce.all.int$RNA_snn_res.0.3 <- factor(sce.all.int$RNA_snn_res.0.3,levels = 0:14)
sce.all.int <- SetIdent(sce.all.int, value = select_idet)
table(sce.all.int@active.ident) 

# 美化版
p <- CellDimPlot(sce.all.int, group.by = select_idet, reduction = "UMAP", label = T,label.size = 4, label_repel = T, label_insitu = T,label_point_size = 1, label_point_color =NA ,label_segment_color = NA)
p
ggsave(plot=p, filename="3-check-by-0.3/Dimplot_resolution_0.3.pdf",width = 6, height = 6)

然后结合其他的marker进行注释：

############################## 查看marker基因
# 一般基因
markers <- list(
"Immu" = "PTPRC",
"Mast cells"=c("TPSAB1"),
"ILC" = c("KIT","KLRF1"),
"Cell Cycle" = c("TOP2A","MKI67"),
"Monocyte" = c("CD68","CD163","MARCO"),
"B cells" = c("CD79A","CD79B","CD19","MS4A1"),
"Plasma cells" = c("JCHAIN","IGKC"),
"platelets" = c("PF4", "PPBP","GP9"),
"T cells"=c("CD3D","CD3E","CD3G"),
"CD4+ T cells"=c("CD4"),
"CD8+ T cells"=c("CD8A","CD8B"),
"NK" = c("NKG7","KLRG1","KLRC1"),
"DC"=c("CD1C","LAMP3") 
)
markers

p <- DotPlot(sce.all.int, features = markers, assay='RNA',group.by = select_idet,cols = c("grey", "red") ) + 
  ggtitle(select_idet) + 
  xlab("") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # 更改x轴标签角度
p[["theme"]][["strip.text"]]$angle <- 60
p
ggsave(filename = "3-check-by-0.3/Markers_dotplot.pdf", plot=p, width=12, height = 6,bg="white")


#####################################################################
##### 具有配对的肿瘤和正常样本的胃癌marker基因 GSE206785
# 创建一个空的list对象
cell_markers <- list()
# 添加每种细胞类型的marker基因
cell_markers$B_cells <- c("CD19", "MS4A1")
cell_markers$plasma_cells <- c("IGHG1", "CD79A")
cell_markers$CD4_T_cells <- c("CD3D", "CD4")
cell_markers$CD8_T_cells <- c("CD8A")
cell_markers$NK_cells <- c("NCR1", "FGFBP2")
cell_markers$myeloid_cells <- c("CD14", "CD68")
cell_markers$mast_cells <- c("TPSAB1", "TPSB2")
cell_markers$endothelial_cells <- c("RAMP2", "PECAM1")
cell_markers$fibroblasts <- c("DCN", "LUM")
cell_markers$mural_cells <- c("PDGFRB", "ACTA2")
cell_markers$glial_cells <- c("PLP1", "SOX10")
cell_markers$epithelial_cells <- c("PGC", "PGA3")
# 打印list对象
print(cell_markers)

p <- DotPlot(sce.all.int, features = cell_markers, assay='RNA',group.by = select_idet,cols = c("grey", "red") ) + 
  ggtitle(paste0(select_idet, ": GSE206785")) + 
  xlab("") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # 更改x轴标签角度
p[["theme"]][["strip.text"]]$angle <- 60
p
ggsave(filename = "3-check-by-0.3/Markers_paperGSE206785_dotplot.pdf", plot=p, width=12, height = 6,bg="white")



#############################################################
# 特发性肺纤维化病：https://pmc.ncbi.nlm.nih.gov/articles/PMC8025672/#SD3
# GSE128033: 8个特发性纤维化（IPF）样本和10个正常样本，共66500个细胞
# 创建一个列表，包含不同细胞类型的marker基因
cell_markers <- list(
"Club cells" = "SCGB1A1",
"Alveolar type I cells (AT1)" = "AGER",
"Alveolar type II cells (AT2)" = "SFTPC",
"Ciliated cells" = "FOXJ1",
"Basal airway cells" = "KRT5",
"Goblet cells" = "MUC5B",
"Fibroblasts" = c("COL1A1"),
"Smooth muscle cells" = "DES",
"Endothelial cells" = "VWF",
"Lymphatic endothelial cells" = "LYVE1",
"Pericytes" = "RGS5",
"Macrophages" = c("AIF1", "CD163"),
"Dendritic cells" = "CD1C",
"Mast cells" = "TPSAB1",
"T lymphocytes" = c("CD3D","CD3E","CD3G", "CD8A"),
"B lymphocytes" = c("MS4A1", "IGKC", "MZB1"),
"NK cells" = "GNLY"
)
# 打印列表查看结果
print(cell_markers)

p <- DotPlot(sce.all.int, features = cell_markers, assay='RNA',group.by = select_idet,cols = c("grey", "red") ) + 
  ggtitle(paste0(select_idet, ": GSE128033")) + 
  xlab("") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # 更改x轴标签角度
p[["theme"]][["strip.text"]]$angle <- 90
p[["theme"]][["strip.text"]]$hjust <- 0
p
ggsave(filename = "3-check-by-0.3/Markers_paperGSE128033_dotplot.pdf", plot=p, width=12, height = 8,bg="white")


## 上面是正常肺，下面是两种都有
cell_markers <- list(
"Club cells" = c("SCGB1A1", "SCGB3A2"),
"Alveolar type I (AT1) cells" = "AGER",
"Goblet cells" = "MUC5B",
"Alveolar type II (AT2) cells" = "SFTPC",
"Ciliated cells" = "FOXJ1",
"Basal airway cells" = "KRT5",
"Fibroblasts" = c("COL1A1", "COL1A2", "PDGFRA"),
"Smooth muscle cells" = c("DES", "ACTG2"),
"Endothelial cells" = "VWF",
"Pericytes" = "RGS5",
"Lymphatic endothelial cells" = "LYVE1",
"Macrophages" = c("AIF1", "CD163"),
"Dendritic cells" = "CD1C",
"T lymphocytes" = c("CD3D", "CD8A"),
"B lymphocytes" = c("CD79A", "MS4A1", "IGKC", "IGHA1", "IGHG3", "MZB1"),
"NK cells" = c("GNLY", "NKG7", "GZMB", "PRF1", "CST7"),
"ILC1/NK cells" = c("CCL3", "CCL4", "CCL5"),
"Mast cells" = c("TPSAB1", "CPA3", "MS4A2")
)
# 打印列表查看结果
print(cell_markers)
p <- DotPlot(sce.all.int, features = cell_markers, assay='RNA',group.by = select_idet,cols = c("grey", "red") ) + 
  ggtitle(paste0(select_idet, ": GSE128033")) + 
  xlab("") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # 更改x轴标签角度
p[["theme"]][["strip.text"]]$angle <- 90
p[["theme"]][["strip.text"]]$hjust <- 0
p
ggsave(filename = "3-check-by-0.3/Markers_paperGSE128033_dotplot-1.pdf", plot=p, width=15, height = 8,bg="white")

###########################################
## 中心粒细胞
myeloids = list(
  Mac=c("C1QA","C1QB","C1QC","SELENOP","RNASE1","DAB2","LGMN","PLTP","MAF","SLCO2B1"),
  mono=c("VCAN","FCN1","CD300E","S100A12","EREG","APOBEC3A","STXBP2","ASGR1","CCR2","NRG1"),
  neutrophils =  c("FCGR3B","CXCR2","SLC25A37","G0S2","CXCR1","ADGRG3","PROK2","STEAP4","CMTM2" ),
  pDC = c("GZMB","SCT","CLIC3","LRRC26","LILRA4","PACSIN1","CLEC4C","MAP1A","PTCRA","C12orf75"),
  DC1 = c("CLEC9A","XCR1","CLNK","CADM1","ENPP1","SNX22","NCALD","DBN1","HLA-DOB","PPY"),
  DC2=c( "CD1C","FCER1A","CD1E","AL138899.1","CD2","GPAT3","CCND2","ENHO","PKIB","CD1B"),
  DC3 =  c("HMSD","ANKRD33B","LAD1","CCR7","LAMP3","CCL19","CCL22","INSM1","TNNT2","TUBB2B")
)
#myeloids <- lapply(myeloids, function(x) str_to_title(x))
print(myeloids)
p <- DotPlot(sce.all.int, features = myeloids, assay='RNA',group.by = select_idet,cols = c("grey", "red") ) + 
  ggtitle(paste0(select_idet, ": myeloids")) + 
  xlab("") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # 更改x轴标签角度
p[["theme"]][["strip.text"]]$angle <- 90
p[["theme"]][["strip.text"]]$hjust <- 0
p
ggsave(filename = "3-check-by-0.3/Markers_myeloids_dotplot.pdf", plot=p, width=15, height = 8,bg="white")

文献中的marker如下

image-20250608233907758

哎，CD68还写成了C68...

#####################################################################
################################ 本数据marker：OEP003500
cell_types <- list(
  Progenitor_cells = c("CD34", "PDGFRA", "THY1"),
  T_cells = c("CD3D", "CD3E", "CD3G"),
  Macrophages_Monocytes = c("CD163", "CD68", "AIF1", "CYBB"),
  B_cells = c("CD19", "CD79A", "MS4A1"),
  Neutrophils = c("FCGR3B", "MNDA"),
  Natural_killer_cells = c("GNLY", "NCAM1"),
  Plasmacytoid_dendritic_cells = c("IL3RA", "CLEC4C"),
  Endothelial_cells = c("PECAM1", "CDH5"),
  Smooth_muscle_cells = c("PLN", "CASQ2", "ACTA2")
)
cell_types <- lapply(cell_types, function(x) str_to_upper(x))
print(cell_types)

p <- DotPlot(sce.all.int, features = cell_types, assay='RNA',group.by = select_idet,cols = c("grey", "red") ) + 
  ggtitle(paste0(select_idet, ": OEP003500")) + 
  xlab("") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # 更改x轴标签角度
p[["theme"]][["strip.text"]]$angle <- 90
p[["theme"]][["strip.text"]]$hjust <- 0
p
ggsave(filename = "3-check-by-0.3/Markers_OEP003500_dotplot.pdf", plot=p, width=15, height = 8,bg="white")

image-20250608234029479

最终注释结果如下：

image-20250608235510905

与文献的比较：

# 查看每个样本中的细胞分类数据
gplots::balloonplot(table(sce.all.int$celltype,sce.all.int$mannual.annotation2))

image-20250608235707309

今天分享到这~

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2025-06-08，如有侵权请联系 cloudcommunity@tencent.com 删除

数据

cell

data

int