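This tutorial is a translation of: https://colab.research.google.com/github/Teichlab/celltypist/blob/main/notebook/celltypist_tutorial.ipynb#scrollTo=regular-tourism
The accompanying paper is still a preprint, but it has already picked up some citations. I will run through its examples and report the results.
The main text follows.
This notebook demonstrates cell type classification of scRNA-seq query data by retrieving the most likely cell type labels from either the built-in CellTypist models or user-trained custom models.
Only the main steps and key parameters are covered here. For more information, see the detailed usage: https://github.com/Teichlab/celltypist#usage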
!pip install celltypist
import scanpy as sc
import celltypist
from celltypist import models
adata_2000 = sc.read('celltypist_demo_folder/demo_2000_cells.h5ad', backup_url = 'https://celltypist.cog.sanger.ac.uk/Notebook_demo_data/demo_2000_cells.h5ad')
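The expression matrix (adata_2000.X) has been pre-processed as (and is required to be) log1p-normalized expression with 10,000 counts per cell (this matrix can alternatively be stored in .raw.X).
If your own data are not normalized yet, follow the scanpy tutorial and normalize them with the two lines below. Skip them for the demo data, which is already normalized; running them again would transform the matrix a second time.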
sc.pp.normalize_total(adata_2000, target_sum=1e4)
sc.pp.log1p(adata_2000)
adata_2000.X.expm1().sum(axis = 1)
matrix([[10000. ],
[10000.002],
[10000. ],
...,
[10000. ],
[10000. ],
[10000. ]], dtype=float32)
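To make the check above concrete, here is a minimal sketch of the same round trip in plain NumPy. The count values are made up for illustration and do not come from the demo data; the sketch only mirrors what sc.pp.normalize_total with target_sum=1e4 followed by sc.pp.log1p is expected to do.

```python
import numpy as np

# Toy raw counts: 3 cells x 4 genes (made-up values, not from the demo data).
counts = np.array([
    [10, 0, 5, 85],
    [1, 1, 1, 1],
    [0, 300, 0, 100],
], dtype=np.float64)

target_sum = 1e4

# Scale each cell (row) so that its counts sum to target_sum, then apply log1p.
per_cell_total = counts.sum(axis=1, keepdims=True)
normalized = counts / per_cell_total * target_sum
log_normalized = np.log1p(normalized)

# Undo the log with expm1; every row should sum back to roughly 10,000.
recovered_totals = np.expm1(log_normalized).sum(axis=1)
print(recovered_totals)
```

If the row sums of expm1(X) are far from 10,000, the matrix is probably raw counts or was normalized to a different target.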
Some pre-assigned cell type labels are also in the data; they will be compared with the labels predicted by CellTypist later.
adata_2000.obs
cell_type
cell1 Plasma cells
cell2 Plasma cells
cell3 Plasma cells
cell4 Plasma cells
cell5 Plasma cells
... ...
cell1996 Neutrophil-myeloid progenitor
cell1997 Neutrophil-myeloid progenitor
cell1998 Neutrophil-myeloid progenitor
cell1999 Neutrophil-myeloid progenitor
cell2000 Neutrophil-myeloid progenitor
[2000 rows x 1 columns]
In this section, we show how to transfer cell type labels from the built-in models to the query dataset.
Download the latest CellTypist models.
# Enabling `force_update = True` will overwrite existing (old) models.
models.download_models(force_update = True)
📜 Retrieving model list from server https://celltypist.cog.sanger.ac.uk/models/models.json
📚 Total models in list: 7
📂 Storing models in /Users/cystone/.celltypist/data/models
💾 Downloading model [1/7]: Immune_All_Low.pkl
💾 Downloading model [2/7]: Immune_All_High.pkl
💾 Downloading model [3/7]: Immune_All_PIP.pkl
💾 Downloading model [4/7]: Immune_All_AddPIP.pkl
💾 Downloading model [5/7]: Cells_Intestinal_Tract.pkl
💾 Downloading model [6/7]: Cells_Lung_Airway.pkl
💾 Downloading model [7/7]: Nuclei_Lung_Airway.pkl
All models are stored in models.models_path.
models.models_path
'/Users/******/.celltypist/data/models'
Get an overview of the models and what they represent.
models.models_description()
👉 Detailed model information can be found at `https://www.celltypist.org/models`
model \
0 Immune_All_Low.pkl
1 Immune_All_High.pkl
2 Immune_All_PIP.pkl
3 Immune_All_AddPIP.pkl
4 Cells_Intestinal_Tract.pkl
5 Cells_Lung_Airway.pkl
6 Nuclei_Lung_Airway.pkl
description
0 immune sub-populations combined from 20 tissue...
1 immune populations combined from 20 tissues of...
2 immune cell types combined from 16 adult human...
3 immune cell types combined from >20 human tiss...
4 intestinal cells from fetal, pediatric and adu...
5 cell populations from scRNA-seq of five locati...
6 cell populations from snRNA-seq of five locati...
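Choose the model you want to use, for example the model with all tissues combined, which contains low-hierarchy (high-resolution) cell types and subtypes.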
# Indeed, the `model` argument defaults to `Immune_All_Low.pkl`.
model = models.Model.load(model = 'Immune_All_Low.pkl')
This model contains 91 cell states.
model.cell_types
array(['B cells', 'CD16+ NK cells', 'CD16- NK cells', 'CD8a/a',
'CD8a/b(entry)', 'CMP', 'Classical monocytes', 'Cycling B cells',
'Cycling DCs', 'Cycling NK cells', 'Cycling T cells',
'Cycling gamma-delta T cells', 'Cycling monocytes',
'Cytotoxic T cells', 'DC', 'DC precursor', 'DC1', 'DC2', 'DC3',
'Double-negative thymocytes', 'Double-positive thymocytes', 'ELP',
'ETP', 'Early MK', 'Early erythroid', 'Early lymphoid/T lymphoid',
'Endothelial cells', 'Epithelial cells', 'Erythrocytes',
'Fibroblasts', 'Follicular B cells', 'Follicular helper T cells',
'GMP', 'Germinal center B cells', 'Granulocytes', 'HSC/MPP',
'Helper T cells', 'Hofbauer cells', 'ILC', 'ILC precursor', 'ILC1',
'ILC2', 'ILC3', 'Immature B cells', 'Kidney-resident macrophages',
'Kupffer cells', 'Late erythroid', 'MAIT cells', 'MEMP', 'MNP',
'Macrophages', 'Mast cells', 'Megakaryocyte precursor',
'Megakaryocyte-erythroid-mast cell progenitor',
'Megakaryocytes/platelets', 'Memory B cells',
'Memory CD4+ cytotoxic T cells', 'Mid erythroid', 'Migratory DCs',
'Mono-mac', 'Monocyte precursor', 'Monocytes', 'Myelocytes',
'NK cells', 'NKT cells', 'Naive B cells',
'Neutrophil-myeloid progenitor', 'Neutrophils',
'Non-classical monocytes', 'Plasma cells', 'Pre-B cells',
'Pre-pro-B cells', 'Pro-B cells', 'Promyelocytes',
'Regulatory T cells', 'T cells', 'T(agonist)',
'Tcm/Naive cytotoxic T cells', 'Tcm/Naive helper T cells',
'Tem/Effector cytotoxic T cells', 'Tem/Effector helper T cells',
'Tem/Effector helper T cells PD1+', 'Transitional B cells',
'Transitional DC', 'Transitional NK', 'Treg(diff)',
'Type 1 helper T cells', 'Type 17 helper T cells',
'gamma-delta T cells', 'pDC', 'pDC precursor'], dtype=object)
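We can see that it also contains a few non-immune types such as Epithelial cells, Endothelial cells and Fibroblasts. The immune cell classification is very detailed.
Some meta-information about the model.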
model.description
{'date': '2021-10-27 15:20:55.163288',
'details': 'immune sub-populations combined from 20 tissues of 19 studies',
'url': 'https://celltypist.cog.sanger.ac.uk/models/Pan_Immune_CellTypist/v1/Immune_All_Low.pkl',
'source': 'https://doi.org/10.1101/2021.04.28.441762',
'version': 'v1',
'number_celltypes': 91}
Transfer cell type labels from this model to the query dataset.
# Not run; predict cell identities using this loaded model.
#predictions = celltypist.annotate(adata_2000, model = model, majority_voting = True)
# Alternatively, just specify the model name (recommended as this ensures the model is intact every time it is loaded).
predictions = celltypist.annotate(adata_2000, model = 'Immune_All_Low.pkl', majority_voting = True)
🔬 Input data has 2000 cells and 18950 genes
🔗 Matching reference genes in the model
🧬 3278 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Can not detect a neighborhood graph, construct one before the over-clustering
OMP: Info #271: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
⛓️ Over-clustering input data with resolution set to 5
🗳️ Majority voting the predictions
✅ Majority voting done!
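By default (majority_voting = False), CellTypist infers the identity of each query cell independently. This produces the raw predicted cell type labels and usually finishes within seconds or minutes, depending on the size of the query data. You can also turn on the majority-voting classifier (majority_voting = True), which refines cell identities within local subclusters after an over-clustering step, at the cost of increased runtime.
The results include the predicted cell type labels (predicted_labels), the over-clustering result (over_clustering), and the predicted labels after majority voting in local subclusters (majority_voting). Note that in predicted_labels, each query cell gets its inferred label by choosing the most probable cell type among all possible cell types in the given model.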
predictions.predicted_labels
predicted_labels over_clustering \
cell1 Plasma cells 42
cell2 Plasma cells 12
cell3 Plasma cells 38
cell4 Plasma cells 2
cell5 Plasma cells 2
... ... ...
cell1996 Neutrophil-myeloid progenitor 37
cell1997 Neutrophil-myeloid progenitor 27
cell1998 Neutrophil-myeloid progenitor 29
cell1999 Neutrophil-myeloid progenitor 27
cell2000 Neutrophil-myeloid progenitor 10
majority_voting
cell1 Plasma cells
cell2 Plasma cells
cell3 gamma-delta T cells
cell4 Plasma cells
cell5 Plasma cells
... ...
cell1996 Neutrophil-myeloid progenitor
cell1997 Neutrophil-myeloid progenitor
cell1998 Neutrophil-myeloid progenitor
cell1999 Neutrophil-myeloid progenitor
cell2000 Neutrophil-myeloid progenitor
[2000 rows x 3 columns]
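The idea behind the majority_voting column can be sketched with pandas on toy data. This is a simplified illustration under the assumption that each subcluster simply takes its most frequent raw label; the labels and cluster ids below are hypothetical, and the real CellTypist implementation may apply further rules.

```python
import pandas as pd

# Toy per-cell predictions and over-clustering assignments (hypothetical values).
cells = pd.DataFrame(
    {
        "predicted_labels": ["Plasma cells", "Plasma cells", "pDC", "pDC", "pDC", "Plasma cells"],
        "over_clustering": ["0", "0", "0", "1", "1", "1"],
    },
    index=[f"cell{i}" for i in range(1, 7)],
)

# For each subcluster, find the most frequent predicted label ...
dominant = cells.groupby("over_clustering")["predicted_labels"].agg(
    lambda labels: labels.value_counts().idxmax()
)
# ... and assign that label to every cell in the subcluster.
cells["majority_voting"] = cells["over_clustering"].map(dominant)
print(cells)
```

This is why a cell such as cell3 in the output above can carry a majority_voting label (gamma-delta T cells) that differs from its own predicted_labels (Plasma cells): the label is inherited from the subcluster it falls into.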
Convert the prediction result into an AnnData.
# Get an `AnnData` with predicted labels embedded into the cell metadata columns.
adata = predictions.to_adata()
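Compared with adata_2000, the new adata has additional prediction information in adata.obs (predicted_labels, over_clustering, majority_voting and conf_score). Notably, all of these columns can be given a specific string prefix by setting prefix in to_adata.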
adata.obs
cell_type predicted_labels \
cell1 Plasma cells Plasma cells
cell2 Plasma cells Plasma cells
cell3 Plasma cells Plasma cells
cell4 Plasma cells Plasma cells
cell5 Plasma cells Plasma cells
... ... ...
cell1996 Neutrophil-myeloid progenitor Neutrophil-myeloid progenitor
cell1997 Neutrophil-myeloid progenitor Neutrophil-myeloid progenitor
cell1998 Neutrophil-myeloid progenitor Neutrophil-myeloid progenitor
cell1999 Neutrophil-myeloid progenitor Neutrophil-myeloid progenitor
cell2000 Neutrophil-myeloid progenitor Neutrophil-myeloid progenitor
over_clustering majority_voting conf_score
cell1 42 Plasma cells 0.996814
cell2 12 Plasma cells 0.995119
cell3 38 gamma-delta T cells 0.991911
cell4 2 Plasma cells 0.995159
cell5 2 Plasma cells 0.995717
... ... ... ...
cell1996 37 Neutrophil-myeloid progenitor 0.724281
cell1997 27 Neutrophil-myeloid progenitor 0.977658
cell1998 29 Neutrophil-myeloid progenitor 0.843348
cell1999 27 Neutrophil-myeloid progenitor 0.955746
cell2000 10 Neutrophil-myeloid progenitor 0.989907
[2000 rows x 5 columns]
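In addition to this meta-information, the neighborhood graph constructed during over-clustering is also stored in adata (if a pre-computed neighborhood graph already exists in the AnnData, this graph construction step is skipped).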
# If the UMAP or any cell embeddings are already available in the `AnnData`, skip this command.
sc.tl.umap(adata)
Visualize the prediction results.
sc.pl.umap(adata, color = ['cell_type', 'predicted_labels', 'majority_voting'], legend_loc = 'on data')
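Here is the plot from my own run (painfully ugly!):
The plot from the original web page:
Actually, you may not need to explicitly convert the predictions output by celltypist.annotate into an AnnData as above. A more useful approach is the visualization function celltypist.dotplot, which quantitatively compares the CellTypist prediction result (here, majority voting) with the cell types predefined in the AnnData (here, cell_type). You can also change the value of use_as_prediction to predicted_labels to compare the raw prediction result with the predefined cell types.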
celltypist.dotplot(predictions, use_as_reference = 'cell_type', use_as_prediction = 'majority_voting')
For each predefined cell type (each column of the dot plot), the plot shows how it is "decomposed" into the different cell types (rows) predicted by CellTypist.
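The same kind of column-wise breakdown can be tabulated with pandas. This is only a sketch on hypothetical labels: it reproduces the fraction-style comparison, not the plot itself, and does not claim to be how celltypist.dotplot computes its values.

```python
import pandas as pd

# Toy labels (hypothetical): a curated annotation and a prediction per cell.
obs = pd.DataFrame(
    {
        "cell_type": ["Plasma cells", "Plasma cells", "Plasma cells", "pDC", "pDC"],
        "majority_voting": ["Plasma cells", "Plasma cells", "gamma-delta T cells", "pDC", "pDC"],
    }
)

# Rows: predicted labels; columns: predefined cell types.
# normalize="columns" makes each column sum to 1, i.e. the fraction of each
# predefined cell type that was assigned to each predicted label.
fractions = pd.crosstab(obs["majority_voting"], obs["cell_type"], normalize="columns")
print(fractions.round(2))
```

With real results, the two columns would come from the obs table returned by predictions.to_adata(), shown earlier.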
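I will not reproduce the custom-model part of the tutorial here. Note that the model loaded below (model_from_immune2000.pkl) appears to be the custom model trained in that part of the original notebook, so the file has to exist before the following code will run.
Each model can be examined in terms of the driving genes for each cell type. Note that these genes depend only on the model, that is, on the training dataset.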
model = models.Model.load(model = 'celltypist_demo_folder/model_from_immune2000.pkl')
model.cell_types
array(['DC1', 'Endothelial cells', 'Follicular B cells', 'Kupffer cells',
'Macrophages', 'Mast cells', 'Neutrophil-myeloid progenitor',
'Plasma cells', 'gamma-delta T cells', 'pDC'], dtype=object)
Extract the gene weight matrix across cell types.
weights = model.classifier.coef_
weights.shape
(10, 16201)
The top three driving genes of Mast cells.
mast_cell_weights = weights[model.cell_types == 'Mast cells']
top_3_genes = model.features[mast_cell_weights.argpartition(-3, axis = None)[-3:]]
top_3_genes
array(['CPA3', 'TPSAB1', 'TPSB2'], dtype=object)
# Check expression of the three genes in the training set.
sc.pl.violin(adata_2000, top_3_genes, groupby = 'cell_type', rotation = 90)
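The top-gene lookup above can be tried on a toy weight matrix. The weights below are made up and the GENE_* names are placeholders; the sketch also adds a sorting step, because argpartition selects the k largest entries without ordering them. The helper top_genes is hypothetical and not part of CellTypist.

```python
import numpy as np

# Toy weight matrix: 2 cell types x 6 genes (values are made up).
cell_types = np.array(["Mast cells", "Plasma cells"])
features = np.array(["CPA3", "GENE_A", "TPSAB1", "GENE_B", "TPSB2", "GENE_C"])
weights = np.array([
    [2.5, -0.3, 1.9, 0.1, 2.1, -1.0],   # Mast cells
    [-0.5, 1.2, -0.2, 0.9, -0.1, 1.5],  # Plasma cells
])

def top_genes(cell_type, k=3):
    """Return the k genes with the largest weights, highest first."""
    row = weights[cell_types == cell_type].ravel()
    # argpartition finds the k largest entries but leaves them unordered,
    top_idx = np.argpartition(row, -k)[-k:]
    # so sort those k indices by weight, in descending order.
    top_idx = top_idx[np.argsort(row[top_idx])[::-1]]
    return features[top_idx]

print(top_genes("Mast cells"))
```

Wrapping the lookup in a function also avoids reusing one variable name for several cell types.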
# Same lookup for Plasma cells.
plasma_cell_weights = weights[model.cell_types == 'Plasma cells']
top_3_genes = model.features[plasma_cell_weights.argpartition(-3, axis = None)[-3:]]
top_3_genes
array(['SSR4', 'TNFRSF17', 'FKBP11'], dtype=object)
sc.pl.violin(adata_2000, top_3_genes, groupby = 'cell_type', rotation = 90)
While I was at it, I also took a look at pDC.
pdc_weights = weights[model.cell_types == 'pDC']
top_3_genes = model.features[pdc_weights.argpartition(-3, axis = None)[-3:]]
top_3_genes
array(['CCDC50', 'LILRA4', 'PLD4'], dtype=object)
sc.pl.violin(adata_2000, top_3_genes, groupby = 'cell_type', rotation = 90)
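It looks pretty good, and may be somewhat more accurate than SingleR in R. It is worth considering when annotating immune cells; annotating immune cells with SingleR has been a frustrating experience, to put it mildly.
CellTypist is a Python package, while R is what we are more familiar with.
To hook it up to the Seurat objects we know from R, I wrote some R code with the reticulate package so that the two can be bridged. What's not to like!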
install.packages("reticulate")  # only needed once
library(reticulate)
library(Seurat)
scanpy <- import("scanpy")
celltypist <- import("celltypist")
pandas <- import("pandas")
numpy <- import("numpy")
# On the first run, the pre-trained models need to be downloaded.
celltypist$models$download_models(force_update = FALSE)
# You should already have a Seurat object prepared; mine is called sce.
# This assumes a Seurat v3/v4-style assay that exposes a @counts slot.
counts <- sce[['RNA']]@counts
adata <- scanpy$AnnData(X = numpy$array(as.matrix(t(counts))),
                        obs = pandas$DataFrame(sce@meta.data),
                        var = pandas$DataFrame(data.frame(gene = rownames(counts),
                                                          row.names = rownames(counts)))
)
model <- celltypist$models$Model$load(model = 'Immune_All_Low.pkl')
model$cell_types
# Raw counts were passed in above, so normalize to 10,000 counts per cell and log1p-transform.
scanpy$pp$normalize_total(adata, target_sum = 1e4)
scanpy$pp$log1p(adata)
predictions <- celltypist$annotate(adata, model = 'Immune_All_Low.pkl', majority_voting = TRUE)
head(predictions$predicted_labels)
predicted_labels over_clustering majority_voting
P1T-I-AAACCTGAGACAAAGG Tem/Effector cytotoxic T cells 199 Tem/Effector cytotoxic T cells
P1T-I-AAACCTGGTACCAGTT Tem/Effector cytotoxic T cells 176 Tem/Effector cytotoxic T cells
P1T-I-AAAGATGAGTGGAGAA Regulatory T cells 121 Regulatory T cells
P1T-I-AAAGATGGTATTCGTG Tem/Effector cytotoxic T cells 176 Tem/Effector cytotoxic T cells
P1T-I-AAATGCCAGTGCGATG Regulatory T cells 106 Regulatory T cells
P1T-I-AACACGTCATCGTCGG Regulatory T cells 248 Regulatory T cells
Add this information to the Seurat object:
sce = AddMetaData(sce, predictions$predicted_labels)
Feature-gene plot for each cell population:
A look at the markers of gamma-delta T cells: