This post continues the part of SATURN: cross-species single-cell dataset integration that was left unfinished last time.
A small piece of the protein-embedding step is still outstanding:
The earlier extract.py script saved the embedding vector of each protein sequence as a separate .pt file under embedding/Homo_sapiens.GRCh38.pep.all_clean.fa_esm1b and embedding/Mus_musculus.GRCm39.pep.all_clean.fa_esm1b. The next step is to build the mapping between gene symbols and their corresponding protein IDs:
python ../../SATURN-main/protein_embeddings/map_gene_symbol_to_protein_ids.py \
--fasta_path proteome/Homo_sapiens.GRCh38.pep.all.fa \
--save_path proteome/Homo_sapiens.GRCh38.gene_symbol_to_protein_ID.json
python ../../SATURN-main/protein_embeddings/map_gene_symbol_to_protein_ids.py \
--fasta_path proteome/Mus_musculus.GRCm39.pep.all.fa \
--save_path proteome/Mus_musculus.GRCm39.gene_symbol_to_protein_ID.json
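To sanity-check the mapping, the JSON can be loaded directly. A minimal peek, assuming the structure is one gene symbol mapping to a list of protein IDs (which is what the script name suggests and what the next step consumes):
import json
# Load the gene symbol -> protein ID mapping written above
with open("proteome/Homo_sapiens.GRCh38.gene_symbol_to_protein_ID.json") as f:
    gene_to_proteins = json.load(f)
# Number of gene symbols, plus one example entry; a gene with several isoforms
# should map to several protein IDs
print(len(gene_to_proteins))
print(next(iter(gene_to_proteins.items())))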
For each gene, aggregate (average) the embeddings of all its proteins, which handles the case where one gene has multiple protein isoforms, and write the final gene-symbol embedding files:
#!/bin/bash
#SBATCH --job-name=convert_human
#SBATCH --output=logs/convert_human_%j.log
#SBATCH --error=logs/convert_human_%j.err
#SBATCH --mem=30G
#SBATCH --cpus-per-task=16
python ../../SATURN-main/protein_embeddings/convert_protein_embeddings_to_gene_embeddings.py \
--embedding_dir embedding/Homo_sapiens.GRCh38.pep.all_clean.fa_esm1b \
--gene_symbol_to_protein_ids_path proteome/Homo_sapiens.GRCh38.gene_symbol_to_protein_ID.json \
--embedding_model ESM1b \
--save_path embedding/Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM1b.pt
#!/bin/bash
#SBATCH --job-name=convert_mouse
#SBATCH --output=logs/convert_mouse_%j.log
#SBATCH --error=logs/convert_mouse_%j.err
#SBATCH --mem=30G
#SBATCH --cpus-per-task=16
python ../../SATURN-main/protein_embeddings/convert_protein_embeddings_to_gene_embeddings.py \
--embedding_dir embedding/Mus_musculus.GRCm39.pep.all_clean.fa_esm1b \
--gene_symbol_to_protein_ids_path proteome/Mus_musculus.GRCm39.gene_symbol_to_protein_ID.json \
--embedding_model ESM1b \
--save_path embedding/Mus_musculus.GRCm39.gene_symbol_to_embedding_ESM1b.pt
The results are stored in embedding/Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM1b.pt and embedding/Mus_musculus.GRCm39.gene_symbol_to_embedding_ESM1b.pt.
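Conceptually, the conversion script just averages the per-isoform embeddings of each gene. The sketch below only illustrates that idea; it is not the actual SATURN code, and the .pt layout (one <protein_id>.pt file containing a "mean_representations" dict keyed by layer, 33 for ESM-1b) is an assumption based on how extract.py saves its output.
import json
import torch

embedding_dir = "embedding/Homo_sapiens.GRCh38.pep.all_clean.fa_esm1b"
with open("proteome/Homo_sapiens.GRCh38.gene_symbol_to_protein_ID.json") as f:
    gene_to_proteins = json.load(f)

gene_embeddings = {}
for gene, protein_ids in gene_to_proteins.items():
    # Average the per-sequence mean embeddings over all isoforms of this gene
    embs = [torch.load(f"{embedding_dir}/{pid}.pt")["mean_representations"][33]
            for pid in protein_ids]
    gene_embeddings[gene] = torch.stack(embs).mean(dim=0)

torch.save(gene_embeddings, "embedding/gene_embeddings_sketch.pt")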
Human and mouse lung single-cell datasets were selected from the cellxgene database:
wget https://datasets.cellxgene.cziscience.com/5c27e357-4786-46d4-9635-cb7b37cce452.h5ad -O human_lung.h5ad
wget https://datasets.cellxgene.cziscience.com/5f566cc5-2818-4d5e-bb70-1d8c55f0570b.h5ad -O mouse_lung.h5ad
Take a quick look at the data:
import scanpy as sc
import pandas as pd
human = sc.read("human_lung.h5ad")
human
AnnData object with n_obs × n_vars = 5499 × 29371
obs: 'assay_ontology_term_id', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'is_primary_data', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'author_age', 'author_GA_at_birth', 'author_weight', 'author_weight_percentile', 'author_cause_of_death', 'author_type_of_death', 'author_health_status', 'Sample', 'author_cell_type', 'author_cluster', 'donor_id', 'suspension_type', 'tissue_type', 'cell_type', 'assay', 'disease', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type'
uns: 'citation', 'donor_id_colors', 'organism', 'organism_ontology_term_id', 'schema_reference', 'schema_version', 'title'
obsm: 'X_tsne'
mouse = sc.read("mouse_lung.h5ad")
mouse
AnnData object with n_obs × n_vars = 5218 × 21025
obs: 'FACS.selection', 'age', 'cell', 'free_annotation', 'method', 'donor_id', 'subtissue', 'n_genes', 'n_counts', 'louvain', 'leiden', 'assay_ontology_term_id', 'disease_ontology_term_id', 'cell_type_ontology_term_id', 'tissue_ontology_term_id', 'development_stage_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'sex_ontology_term_id', 'is_primary_data', 'suspension_type', 'tissue_type', 'cell_type', 'assay', 'disease', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
var: 'n_cells', 'means', 'dispersions', 'dispersions_norm', 'highly_variable', 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type'
uns: 'age_colors', 'citation', 'leiden', 'louvain', 'neighbors', 'organism', 'organism_ontology_term_id', 'pca', 'schema_reference', 'schema_version', 'title'
obsm: 'X_pca', 'X_tsne', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
Because the protein-embedding step above produced embeddings keyed by gene symbol, the .var_names of the h5ad data must be gene symbols, so a little processing is needed here:
human.var["ensembl_id"] = human.var_names
human.var_names = human.var["feature_name"].astype(str)
human.var.index.name = "gene_symbol"
mouse.var["ensembl_id"] = mouse.var_names
mouse.var_names = mouse.var["feature_name"].astype(str)
mouse.var.index.name = "gene_symbol"
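One caveat: different Ensembl IDs can share the same gene symbol, so the new var_names may contain duplicates. A quick check (and AnnData's var_names_make_unique if needed) avoids surprises later:
# How many gene symbols are duplicated after switching the index?
print(human.var_names.duplicated().sum(), mouse.var_names.duplicated().sum())
# If there are duplicates, append suffixes to make them unique
human.var_names_make_unique()
mouse.var_names_make_unique()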
The granularity of the cell-type annotations differs between the two datasets:
set(human.obs["cell_type"])
{'B cell',
'T cell',
'endothelial cell',
'fibroblast of lung',
'macrophage',
'myofibroblast cell',
'pericyte',
'pulmonary alveolar type 1 cell',
'pulmonary alveolar type 2 cell',
'smooth muscle cell',
'stromal cell'}
set(mouse.obs["cell_type"])
{'B cell',
'CD4-positive, alpha-beta T cell',
'CD8-positive, alpha-beta T cell',
'T cell',
'adventitial cell',
'bronchial smooth muscle cell',
'classical monocyte',
'club cell',
'dendritic cell',
'endothelial cell of lymphatic vessel',
'fibroblast of lung',
'intermediate monocyte',
'leukocyte',
'lung macrophage',
'lymphocyte',
'mature NK T cell',
'multiciliated columnar cell of tracheobronchial tree',
'myeloid dendritic cell',
'natural killer cell',
'neutrophil',
'non-classical monocyte',
'pericyte',
'plasma cell',
'plasmacytoid dendritic cell',
'pulmonary alveolar type 1 cell',
'pulmonary alveolar type 2 cell',
'pulmonary interstitial fibroblast',
'pulmonary neuroendocrine cell',
'regulatory T cell',
'respiratory basal cell',
'smooth muscle cell of the pulmonary artery',
'vein endothelial cell'}
So the annotations need to be aligned manually; for the integration it is enough to work at the level of broad cell-type categories:
unified_cell_types = {
'Endothelial cell': ['endothelial cell', 'vein endothelial cell', 'endothelial cell of lymphatic vessel'],
'Pericyte': ['pericyte'],
'Fibroblast': ['myofibroblast cell', 'fibroblast of lung', 'pulmonary interstitial fibroblast'],
'Smooth muscle cell': ['smooth muscle cell', 'bronchial smooth muscle cell', 'smooth muscle cell of the pulmonary artery'],
'Stromal cell': ['stromal cell', 'adventitial cell'],
'Macrophage': ['macrophage', 'lung macrophage'],
'T cell': ['T cell', 'CD8-positive, alpha-beta T cell', 'CD4-positive, alpha-beta T cell', 'regulatory T cell', 'mature NK T cell'],
'Pulmonary alveolar type 1 cell': ['pulmonary alveolar type 1 cell'],
'Pulmonary alveolar type 2 cell': ['pulmonary alveolar type 2 cell'],
'B cell': ['B cell', 'plasma cell'],
'Dendritic cell': ['myeloid dendritic cell', 'dendritic cell', 'plasmacytoid dendritic cell'],
'Monocyte': ['classical monocyte', 'non-classical monocyte', 'intermediate monocyte'],
'Neutrophil': ['neutrophil'],
'NK cell': ['natural killer cell'],
'Epithelial': ['multiciliated columnar cell of tracheobronchial tree', 'respiratory basal cell', 'club cell', 'pulmonary neuroendocrine cell'],
'Other immune cell': ['leukocyte', 'lymphocyte']
}
mapping_dict = {}
for unified_type, original_types in unified_cell_types.items():
    for original_type in original_types:
        mapping_dict[original_type] = unified_type
Save the original .obs["cell_type"] into another column. This is optional; you could also just specify the cell-type annotation column name later when running train-saturn.py.
human.obs["original_cell_type"] = human.obs["cell_type"].copy()
mouse.obs["original_cell_type"] = mouse.obs["cell_type"].copy()
mapped_cell_types = []
unmapped_cells = 0
for cell_type in human.obs['cell_type']:
    if cell_type in mapping_dict:
        mapped_cell_types.append(mapping_dict[cell_type])
    else:
        mapped_cell_types.append('Unclassified')
        unmapped_cells += 1
human.obs['cell_type'] = mapped_cell_types
unmapped_cells
0
mapped_cell_types = []
unmapped_cells = 0
for cell_type in mouse.obs['cell_type']:
    if cell_type in mapping_dict:
        mapped_cell_types.append(mapping_dict[cell_type])
    else:
        mapped_cell_types.append('Unclassified')
        unmapped_cells += 1
mouse.obs['cell_type'] = mapped_cell_types
unmapped_cells
0
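The two loops above can also be written more compactly with pandas' map; this is equivalent under the same mapping_dict, with unmapped types becoming 'Unclassified':
# Equivalent to the loops above, using the saved original labels
human.obs["cell_type"] = human.obs["original_cell_type"].astype(str).map(mapping_dict).fillna("Unclassified")
mouse.obs["cell_type"] = mouse.obs["original_cell_type"].astype(str).map(mapping_dict).fillna("Unclassified")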
Save the data:
human.write("data/human.h5ad")
mouse.write("data/mouse.h5ad")
SATURN needs two CSV files as input. The first is the cell-type alignment file:
human_cell_type = list(human.obs["cell_type"].value_counts().keys())
mouse_cell_type = list(mouse.obs["cell_type"].value_counts().keys())
all_cell_types = set(human_cell_type + mouse_cell_type)
records = []
records.append(["mouse_cell_type", "human_cell_type"])
for cell_type in all_cell_types:
    human_value = cell_type if cell_type in human_cell_type else ''
    mouse_value = cell_type if cell_type in mouse_cell_type else ''
    records.append([mouse_value, human_value])
df = pd.DataFrame(records[1:], columns=records[0])
df.to_csv("human_mouse_cell_type_map.csv")

The second is a run file containing the dataset paths, the paths to the gene-symbol embeddings, and the species information:
import os
current_dir = os.getcwd()
embedding_dir = "06-SATURN/02-SATURN_human_mouse/02-protein_embedding/embedding"
csv_df = pd.DataFrame(columns=["path", "species", "embedding_path"])
csv_df["species"] = ["human", "mouse"]
csv_df["path"] = [os.path.join(current_dir, "data/human.h5ad"), os.path.join(current_dir, "data/mouse.h5ad")]
human_embedding_path = os.path.join(embedding_dir, "Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM1b.pt")
mouse_embedding_path = os.path.join(embedding_dir, "Mus_musculus.GRCm39.gene_symbol_to_embedding_ESM1b.pt")
csv_df["embedding_path"] = [human_embedding_path, mouse_embedding_path]
csv_df.to_csv("human_mouse_run.csv", index=False)
#!/bin/bash
#SBATCH --job-name=train_saturn
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:1
#SBATCH --mem=200G
#SBATCH --output=logs/train_saturn_%j.log
#SBATCH --error=logs/train_saturn_%j.err
export PYTHONUNBUFFERED=1
python ../../SATURN-main/train-saturn.py \
--in_data=human_mouse_run.csv \
--in_label_col=cell_type \
--ref_label_col=cell_type \
--num_macrogenes=2000 \
--hv_genes=8000 \
--centroids_init_path=saturn_results/human_mouse_centroids.pkl \
--score_adata \
--ct_map_path=human_mouse_cell_type_map.csv \
--work_dir=.
Here, --in_label_col and --ref_label_col give the column names in the h5ad datasets that store the cell-type labels. --num_macrogenes is the number of macrogenes (default 2000), and --hv_genes is the number of highly variable genes to select (default 8000). --centroids_init_path points to the initial macrogene centroids: on the first run they are computed and saved to saturn_results/human_mouse_centroids.pkl (make sure the saturn_results folder exists), and on later runs this computation can be skipped. With --work_dir=. the results are stored in ./saturn_results.
--score_adata controls whether to train on species A (the first column of human_mouse_cell_type_map.csv) and then predict the cell-type labels of species B (the second column). If --score_adata is set to False this step is skipped, and the human_mouse_cell_type_map.csv file is then not needed at all.
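One practical note: since --centroids_init_path points into saturn_results/ and the SLURM logs go to logs/, both directories should exist before the first submission, e.g.:
import os
# train-saturn.py saves the centroids pickle and all results under saturn_results/
os.makedirs("saturn_results", exist_ok=True)
os.makedirs("logs", exist_ok=True)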
The final results in ./saturn_results include:
human_mouse_centroids.pkl
test256_data_human_mouse_org_saturn_seed_0_celltype_id.pkl
test256_data_human_mouse_org_saturn_seed_0_ep_25.h5ad
test256_data_human_mouse_org_saturn_seed_0_ep_50.h5ad
test256_data_human_mouse_org_saturn_seed_0_epoch_scores.csv
test256_data_human_mouse_org_saturn_seed_0_genes_to_macrogenes.pkl
test256_data_human_mouse_org_saturn_seed_0.h5ad
test256_data_human_mouse_org_saturn_seed_0_pretrain.h5ad
test256_data_human_mouse_org_saturn_seed_0_triplets.csv
The main output files are:
test256_data_human_mouse_org_saturn_seed_0.h5ad: the output AnnData object, containing the final cell embeddings. test256_data_human_mouse_org_saturn_seed_0_genes_to_macrogenes.pkl: the mapping from each gene to its weight on each macrogene.
There are also some additional files:
ep_25.h5ad and ep_50.h5ad are intermediate results from epochs 25 and 50 of the metric-learning stage (the fine-tuning step for cross-species cell-type alignment described in the previous post on the method). Since --epochs was not set it defaults to 50, so ep_50.h5ad is identical to the final result. epoch_scores.csv holds the scores used to decide whether training should stop early at a given epoch. pretrain.h5ad is the cell-embedding result after only the pretraining stage.
triplets.csv records, for each epoch, the counts of mined triplets (anchor, positive, negative) of cells.
celltype_id.pkl is the dictionary mapping cell-type names to numeric IDs, used to keep cell-type labels consistent across species.
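For example, the celltype_id mapping can be loaded like any other pickle to see which numeric ID each cell-type label was assigned:
import pickle
# Cell-type name -> numeric ID mapping used during training
with open("saturn_results/test256_data_human_mouse_org_saturn_seed_0_celltype_id.pkl", "rb") as f:
    celltype_id = pickle.load(f)
celltype_id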
Read in the results:
import scanpy as sc
import pandas as pd
import pickle
adata = sc.read("saturn_results/test256_data_human_mouse_org_saturn_seed_0.h5ad")
adata
AnnData object with n_obs × n_vars = 10717 × 256
obs: 'labels', 'labels2', 'ref_labels', 'species'
obsm: 'macrogenes'
labels has the form species_celltype, and labels2 is labels with the species prefix removed. Here labels2 and ref_labels are identical; in other setups ref_labels may hold a more conservative, coarser cell class. species indicates the species (human/mouse).
adata.obsm["macrogenes"] is a matrix with one row per cell and one column per macrogene, representing each cell's expression level over the macrogenes.
sc.pp.pca(adata)
sc.pl.pca(adata, color="species", title="Species")
sc.pl.pca(adata, color="labels2", title="Cell Type")


sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color="species", title="Species")
sc.pl.umap(adata, color="labels2", title="Cell Type")


Differential expression can also be performed at the macrogene level:
with open("saturn_results/test256_data_human_mouse_org_saturn_seed_0_genes_to_macrogenes.pkl", "rb") as f:
    macrogene_weights = pickle.load(f)
macrogene_weights
{'human_SAMD11': array([6.0659357e-07, 1.6696581e-04, 1.2915584e-06, ..., 7.9972560e-06,
4.0127437e-05, 5.9080173e-07], dtype=float32),
'human_KLHL17': array([1.0999464e-06, 4.6917098e-06, 5.1430852e-06, ..., 1.3106070e-06,
1.3870905e-06, 5.2875851e-07], dtype=float32),
'human_PERM1': array([5.3866364e-07, 2.3613821e-04, 1.2086666e-06, ..., 1.5812535e-05,
2.0359473e-04, 2.6407813e-06], dtype=float32),
...}
macrogene_weights is a dictionary whose keys are genes and whose values are the gene's weights on the 2000 macrogenes.
Create a new AnnData object named macrogene_adata:
macrogene_adata = sc.AnnData(adata.obsm["macrogenes"])
macrogene_adata.obs = adata.obs
Taking monocytes as an example, find the macrogenes that are differentially expressed in monocytes compared with the other cell types:
sc.tl.rank_genes_groups(macrogene_adata, groupby="labels2", groups=["Monocyte"], method="wilcoxon")
sc.pl.rank_genes_groups_dotplot(macrogene_adata)

The numbers below refer to macrogenes; each number is that macrogene's index among the 2000 macrogenes.
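To interpret one of these macrogenes, you can look up which genes carry the largest weights for it in macrogene_weights. The helper below is written here just for illustration; it assumes the macrogene index in the dotplot corresponds to the position in each gene's weight vector (i.e. the column order of adata.obsm["macrogenes"]).
def top_genes_for_macrogene(macrogene_idx, macrogene_weights, k=10):
    # Rank the species-prefixed genes by their weight on this macrogene
    scores = {gene: weights[macrogene_idx] for gene, weights in macrogene_weights.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Example: the 10 genes contributing most to macrogene 0
top_genes_for_macrogene(0, macrogene_weights)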