This post continues the part of SATURN: cross-species single-cell dataset integration that was left unfinished last time.
A small piece of the protein-embedding step is still outstanding:
The earlier extract.py script saved the embedding vector of each protein sequence as a separate .pt file under embedding/Homo_sapiens.GRCh38.pep.all_clean.fa_esm1b and embedding/Mus_musculus.GRCm39.pep.all_clean.fa_esm1b. The next step is to build the mapping between gene symbols and their corresponding protein IDs:
python ../../SATURN-main/protein_embeddings/map_gene_symbol_to_protein_ids.py \
--fasta_path proteome/Homo_sapiens.GRCh38.pep.all.fa \
--save_path proteome/Homo_sapiens.GRCh38.gene_symbol_to_protein_ID.json
python ../../SATURN-main/protein_embeddings/map_gene_symbol_to_protein_ids.py \
--fasta_path proteome/Mus_musculus.GRCm39.pep.all.fa \
--save_path proteome/Mus_musculus.GRCm39.gene_symbol_to_protein_ID.json
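To sanity-check the mapping, the JSON can be loaded directly. A minimal peek, assuming the structure is one gene symbol mapping to a list of protein IDs (which is what the script name suggests and what the next step consumes):
import json
# Load the gene symbol -> protein ID mapping written above
with open("proteome/Homo_sapiens.GRCh38.gene_symbol_to_protein_ID.json") as f:
    gene_to_proteins = json.load(f)
# Number of gene symbols, plus one example entry; a gene with several isoforms
# should map to several protein IDs
print(len(gene_to_proteins))
print(next(iter(gene_to_proteins.items())))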
For each gene, aggregate (average) the embeddings of all its proteins, which handles the case where one gene has multiple protein isoforms, and write the final gene-symbol embedding files:
#!/bin/bash
#SBATCH --job-name=convert_human
#SBATCH --output=logs/convert_human_%j.log
#SBATCH --error=logs/convert_human_%j.err
#SBATCH --mem=30G
#SBATCH --cpus-per-task=16
python ../../SATURN-main/protein_embeddings/convert_protein_embeddings_to_gene_embeddings.py \
--embedding_dir embedding/Homo_sapiens.GRCh38.pep.all_clean.fa_esm1b \
--gene_symbol_to_protein_ids_path proteome/Homo_sapiens.GRCh38.gene_symbol_to_protein_ID.json \
--embedding_model ESM1b \
--save_path embedding/Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM1b.pt
#!/bin/bash
#SBATCH --job-name=convert_mouse
#SBATCH --output=logs/convert_mouse_%j.log
#SBATCH --error=logs/convert_mouse_%j.err
#SBATCH --mem=30G
#SBATCH --cpus-per-task=16
python ../../SATURN-main/protein_embeddings/convert_protein_embeddings_to_gene_embeddings.py \
--embedding_dir embedding/Mus_musculus.GRCm39.pep.all_clean.fa_esm1b \
--gene_symbol_to_protein_ids_path proteome/Mus_musculus.GRCm39.gene_symbol_to_protein_ID.json \
--embedding_model ESM1b \
--save_path embedding/Mus_musculus.GRCm39.gene_symbol_to_embedding_ESM1b.pt
The results are stored in embedding/Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM1b.pt and embedding/Mus_musculus.GRCm39.gene_symbol_to_embedding_ESM1b.pt.
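Conceptually, the conversion script just averages the per-isoform embeddings of each gene. The sketch below only illustrates that idea; it is not the actual SATURN code, and the .pt layout (one <protein_id>.pt file containing a "mean_representations" dict keyed by layer, 33 for ESM-1b) is an assumption based on how extract.py saves its output.
import json
import torch

embedding_dir = "embedding/Homo_sapiens.GRCh38.pep.all_clean.fa_esm1b"
with open("proteome/Homo_sapiens.GRCh38.gene_symbol_to_protein_ID.json") as f:
    gene_to_proteins = json.load(f)

gene_embeddings = {}
for gene, protein_ids in gene_to_proteins.items():
    # Average the per-sequence mean embeddings over all isoforms of this gene
    embs = [torch.load(f"{embedding_dir}/{pid}.pt")["mean_representations"][33]
            for pid in protein_ids]
    gene_embeddings[gene] = torch.stack(embs).mean(dim=0)

torch.save(gene_embeddings, "embedding/gene_embeddings_sketch.pt")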
Human and mouse lung single-cell datasets were selected from the cellxgene database:
wget https://datasets.cellxgene.cziscience.com/5c27e357-4786-46d4-9635-cb7b37cce452.h5ad -O human_lung.h5ad
wget https://datasets.cellxgene.cziscience.com/5f566cc5-2818-4d5e-bb70-1d8c55f0570b.h5ad -O mouse_lung.h5ad
Take a quick look at the data:
import scanpy as sc
import pandas as pd
human = sc.read("human_lung.h5ad")
human
AnnData object with n_obs × n_vars = 5499 × 29371
obs: 'assay_ontology_term_id', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'is_primary_data', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'author_age', 'author_GA_at_birth', 'author_weight', 'author_weight_percentile', 'author_cause_of_death', 'author_type_of_death', 'author_health_status', 'Sample', 'author_cell_type', 'author_cluster', 'donor_id', 'suspension_type', 'tissue_type', 'cell_type', 'assay', 'disease', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type'
uns: 'citation', 'donor_id_colors', 'organism', 'organism_ontology_term_id', 'schema_reference', 'schema_version', 'title'
obsm: 'X_tsne'
mouse = sc.read("mouse_lung.h5ad")
mouse
AnnData object with n_obs × n_vars = 5218 × 21025
obs: 'FACS.selection', 'age', 'cell', 'free_annotation', 'method', 'donor_id', 'subtissue', 'n_genes', 'n_counts', 'louvain', 'leiden', 'assay_ontology_term_id', 'disease_ontology_term_id', 'cell_type_ontology_term_id', 'tissue_ontology_term_id', 'development_stage_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'sex_ontology_term_id', 'is_primary_data', 'suspension_type', 'tissue_type', 'cell_type', 'assay', 'disease', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
var: 'n_cells', 'means', 'dispersions', 'dispersions_norm', 'highly_variable', 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type'
uns: 'age_colors', 'citation', 'leiden', 'louvain', 'neighbors', 'organism', 'organism_ontology_term_id', 'pca', 'schema_reference', 'schema_version', 'title'
obsm: 'X_pca', 'X_tsne', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
Because the protein-embedding step above produced embeddings keyed by gene symbol, the .var_names of the h5ad data must be gene symbols, so a little processing is needed here:
human.var["ensembl_id"] = human.var_names
human.var_names = human.var["feature_name"].astype(str)
human.var.index.name = "gene_symbol"
mouse.var["ensembl_id"] = mouse.var_names
mouse.var_names = mouse.var["feature_name"].astype(str)
mouse.var.index.name = "gene_symbol"
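One caveat: different Ensembl IDs can share the same gene symbol, so the new var_names may contain duplicates. A quick check (and AnnData's var_names_make_unique if needed) avoids surprises later:
# How many gene symbols are duplicated after switching the index?
print(human.var_names.duplicated().sum(), mouse.var_names.duplicated().sum())
# If there are duplicates, append suffixes to make them unique
human.var_names_make_unique()
mouse.var_names_make_unique()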
The granularity of the cell-type annotations differs between the two datasets:
set(human.obs["cell_type"])
{'B cell',
'T cell',
'endothelial cell',
'fibroblast of lung',
'macrophage',
'myofibroblast cell',
'pericyte',
'pulmonary alveolar type 1 cell',
'pulmonary alveolar type 2 cell',
'smooth muscle cell',
'stromal cell'}
set(mouse.obs["cell_type"])
{'B cell',
'CD4-positive, alpha-beta T cell',
'CD8-positive, alpha-beta T cell',
'T cell',
'adventitial cell',
'bronchial smooth muscle cell',
'classical monocyte',
'club cell',
'dendritic cell',
'endothelial cell of lymphatic vessel',
'fibroblast of lung',
'intermediate monocyte',
'leukocyte',
'lung macrophage',
'lymphocyte',
'mature NK T cell',
'multiciliated columnar cell of tracheobronchial tree',
'myeloid dendritic cell',
'natural killer cell',
'neutrophil',
'non-classical monocyte',
'pericyte',
'plasma cell',
'plasmacytoid dendritic cell',
'pulmonary alveolar type 1 cell',
'pulmonary alveolar type 2 cell',
'pulmonary interstitial fibroblast',
'pulmonary neuroendocrine cell',
'regulatory T cell',
'respiratory basal cell',
'smooth muscle cell of the pulmonary artery',
'vein endothelial cell'}
So the annotations need to be aligned manually; for the integration it is enough to work at the level of broad cell-type categories:
unified_cell_types = {
'Endothelial cell': ['endothelial cell', 'vein endothelial cell', 'endothelial cell of lymphatic vessel'],
'Pericyte': ['pericyte'],
'Fibroblast': ['myofibroblast cell', 'fibroblast of lung', 'pulmonary interstitial fibroblast'],
'Smooth muscle cell': ['smooth muscle cell', 'bronchial smooth muscle cell', 'smooth muscle cell of the pulmonary artery'],
'Stromal cell': ['stromal cell', 'adventitial cell'],
'Macrophage': ['macrophage', 'lung macrophage'],
'T cell': ['T cell', 'CD8-positive, alpha-beta T cell', 'CD4-positive, alpha-beta T cell', 'regulatory T cell', 'mature NK T cell'],
'Pulmonary alveolar type 1 cell': ['pulmonary alveolar type 1 cell'],
'Pulmonary alveolar type 2 cell': ['pulmonary alveolar type 2 cell'],
'B cell': ['B cell', 'plasma cell'],
'Dendritic cell': ['myeloid dendritic cell', 'dendritic cell', 'plasmacytoid dendritic cell'],
'Monocyte': ['classical monocyte', 'non-classical monocyte', 'intermediate monocyte'],
'Neutrophil': ['neutrophil'],
'NK cell': ['natural killer cell'],
'Epithelial': ['multiciliated columnar cell of tracheobronchial tree', 'respiratory basal cell', 'club cell', 'pulmonary neuroendocrine cell'],
'Other immune cell': ['leukocyte', 'lymphocyte']
}
mapping_dict = {}
for unified_type, original_types in unified_cell_types.items():
    for original_type in original_types:
        mapping_dict[original_type] = unified_type
Save the original .obs["cell_type"] into another column. This is optional; you could also just specify the cell-type annotation column name later when running train-saturn.py.
human.obs["original_cell_type"] = human.obs["cell_type"].copy()
mouse.obs["original_cell_type"] = mouse.obs["cell_type"].copy()
mapped_cell_types = []
unmapped_cells = 0
for cell_type in human.obs['cell_type']:
    if cell_type in mapping_dict:
        mapped_cell_types.append(mapping_dict[cell_type])
    else:
        mapped_cell_types.append('Unclassified')
        unmapped_cells += 1
human.obs['cell_type'] = mapped_cell_types
unmapped_cells
0
mapped_cell_types = []
unmapped_cells = 0
for cell_type in mouse.obs['cell_type']:
    if cell_type in mapping_dict:
        mapped_cell_types.append(mapping_dict[cell_type])
    else:
        mapped_cell_types.append('Unclassified')
        unmapped_cells += 1
mouse.obs['cell_type'] = mapped_cell_types
unmapped_cells
0
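The two loops above can also be written more compactly with pandas' map; this is equivalent under the same mapping_dict, with unmapped types becoming 'Unclassified':
# Equivalent to the loops above, using the saved original labels
human.obs["cell_type"] = human.obs["original_cell_type"].astype(str).map(mapping_dict).fillna("Unclassified")
mouse.obs["cell_type"] = mouse.obs["original_cell_type"].astype(str).map(mapping_dict).fillna("Unclassified")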
Save the data:
human.write("data/human.h5ad")
mouse.write("data/mouse.h5ad")
SATURN needs two CSV files as input. The first is the cell-type alignment file:
human_cell_type = list(human.obs["cell_type"].value_counts().keys())
mouse_cell_type = list(mouse.obs["cell_type"].value_counts().keys())
all_cell_types = set(human_cell_type + mouse_cell_type)
records = []
records.append(["mouse_cell_type", "human_cell_type"])
for cell_type in all_cell_types:
    human_value = cell_type if cell_type in human_cell_type else ''
    mouse_value = cell_type if cell_type in mouse_cell_type else ''
    records.append([mouse_value, human_value])
df = pd.DataFrame(records[1:], columns=records[0])
df.to_csv("human_mouse_cell_type_map.csv")

The second is a run file containing the dataset paths, the paths to the gene-symbol embeddings, and the species information:
import os
current_dir = os.getcwd()
embedding_dir = "06-SATURN/02-SATURN_human_mouse/02-protein_embedding/embedding"
csv_df = pd.DataFrame(columns=["path", "species", "embedding_path"])
csv_df["species"] = ["human", "mouse"]
csv_df["path"] = [os.path.join(current_dir, "data/human.h5ad"), os.path.join(current_dir, "data/mouse.h5ad")]
human_embedding_path = os.path.join(embedding_dir, "Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM1b.pt")
mouse_embedding_path = os.path.join(embedding_dir, "Mus_musculus.GRCm39.gene_symbol_to_embedding_ESM1b.pt")
csv_df["embedding_path"] = [human_embedding_path, mouse_embedding_path]
csv_df.to_csv("human_mouse_run.csv", index=False)
#!/bin/bash
#SBATCH --job-name=train_saturn
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:1
#SBATCH --mem=200G
#SBATCH --output=logs/train_saturn_%j.log
#SBATCH --error=logs/train_saturn_%j.err
export PYTHONUNBUFFERED=1
python ../../SATURN-main/train-saturn.py \
--in_data=human_mouse_run.csv \
--in_label_col=cell_type \
--ref_label_col=cell_type \
--num_macrogenes=2000 \
--hv_genes=8000 \
--centroids_init_path=saturn_results/human_mouse_centroids.pkl \
--score_adata \
--ct_map_path=human_mouse_cell_type_map.csv \
--work_dir=.
Here, --in_label_col and --ref_label_col give the column names in the h5ad datasets that store the cell-type labels. --num_macrogenes is the number of macrogenes (default 2000), and --hv_genes is the number of highly variable genes to select (default 8000). --centroids_init_path points to the initial macrogene centroids: on the first run they are computed and saved to saturn_results/human_mouse_centroids.pkl (make sure the saturn_results folder exists), and on later runs this computation can be skipped. With --work_dir=. the results are stored in ./saturn_results.
--score_adata controls whether to train on species A (the first column of human_mouse_cell_type_map.csv) and then predict the cell-type labels of species B (the second column). If --score_adata is set to False this step is skipped, and the human_mouse_cell_type_map.csv file is then not needed at all.
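One practical note: since --centroids_init_path points into saturn_results/ and the SLURM logs go to logs/, both directories should exist before the first submission, e.g.:
import os
# train-saturn.py saves the centroids pickle and all results under saturn_results/
os.makedirs("saturn_results", exist_ok=True)
os.makedirs("logs", exist_ok=True)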
The final results in ./saturn_results include:
human_mouse_centroids.pkl
test256_data_human_mouse_org_saturn_seed_0_celltype_id.pkl
test256_data_human_mouse_org_saturn_seed_0_ep_25.h5ad
test256_data_human_mouse_org_saturn_seed_0_ep_50.h5ad
test256_data_human_mouse_org_saturn_seed_0_epoch_scores.csv
test256_data_human_mouse_org_saturn_seed_0_genes_to_macrogenes.pkl
test256_data_human_mouse_org_saturn_seed_0.h5ad
test256_data_human_mouse_org_saturn_seed_0_pretrain.h5ad
test256_data_human_mouse_org_saturn_seed_0_triplets.csv
The main output files are:
test256_data_human_mouse_org_saturn_seed_0.h5ad: the output AnnData object, containing the final cell embeddings. test256_data_human_mouse_org_saturn_seed_0_genes_to_macrogenes.pkl: the mapping from each gene to its weight on each macrogene.
There are also some additional files:
ep_25.h5ad and ep_50.h5ad are intermediate results from epochs 25 and 50 of the metric-learning stage (the fine-tuning step for cross-species cell-type alignment described in the previous post on the method). Since --epochs was not set it defaults to 50, so ep_50.h5ad is identical to the final result. epoch_scores.csv holds the scores used to decide whether training should stop early at a given epoch. pretrain.h5ad is the cell-embedding result after only the pretraining stage.
triplets.csv records, for each epoch, the counts of mined triplets (anchor, positive, negative) of cells.
celltype_id.pkl is the dictionary mapping cell-type names to numeric IDs, used to keep cell-type labels consistent across species.
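For example, the celltype_id mapping can be loaded like any other pickle to see which numeric ID each cell-type label was assigned:
import pickle
# Cell-type name -> numeric ID mapping used during training
with open("saturn_results/test256_data_human_mouse_org_saturn_seed_0_celltype_id.pkl", "rb") as f:
    celltype_id = pickle.load(f)
celltype_id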
Read in the results:
import scanpy as sc
import pandas as pd
import pickle
adata = sc.read("saturn_results/test256_data_human_mouse_org_saturn_seed_0.h5ad")
adata
AnnData object with n_obs × n_vars = 10717 × 256
obs: 'labels', 'labels2', 'ref_labels', 'species'
obsm: 'macrogenes'
labels has the form species_celltype, and labels2 is labels with the species prefix removed. Here labels2 and ref_labels are identical; in other setups ref_labels may hold a more conservative, coarser cell class. species indicates the species (human/mouse).
adata.obsm["macrogenes"] is a matrix with one row per cell and one column per macrogene, representing each cell's expression level over the macrogenes.
sc.pp.pca(adata)
sc.pl.pca(adata, color="species", title="Species")
sc.pl.pca(adata, color="labels2", title="Cell Type")


sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color="species", title="Species")
sc.pl.umap(adata, color="labels2", title="Cell Type")


Differential expression can also be performed at the macrogene level:
with open("saturn_results/test256_data_human_mouse_org_saturn_seed_0_genes_to_macrogenes.pkl", "rb") as f:
    macrogene_weights = pickle.load(f)
macrogene_weights
{'human_SAMD11': array([6.0659357e-07, 1.6696581e-04, 1.2915584e-06, ..., 7.9972560e-06,
4.0127437e-05, 5.9080173e-07], dtype=float32),
'human_KLHL17': array([1.0999464e-06, 4.6917098e-06, 5.1430852e-06, ..., 1.3106070e-06,
1.3870905e-06, 5.2875851e-07], dtype=float32),
'human_PERM1': array([5.3866364e-07, 2.3613821e-04, 1.2086666e-06, ..., 1.5812535e-05,
2.0359473e-04, 2.6407813e-06], dtype=float32),
...}
macrogene_weights is a dictionary whose keys are genes and whose values are the gene's weights on the 2000 macrogenes.
Create a new AnnData object named macrogene_adata:
macrogene_adata = sc.AnnData(adata.obsm["macrogenes"])
macrogene_adata.obs = adata.obs
Taking monocytes as an example, find the macrogenes that are differentially expressed in monocytes compared with the other cell types:
sc.tl.rank_genes_groups(macrogene_adata, groupby="labels2", groups=["Monocyte"], method="wilcoxon")
sc.pl.rank_genes_groups_dotplot(macrogene_adata)

The numbers below refer to macrogenes; each number is that macrogene's index among the 2000 macrogenes.
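To interpret one of these macrogenes, you can look up which genes carry the largest weights for it in macrogene_weights. The helper below is written here just for illustration; it assumes the macrogene index in the dotplot corresponds to the position in each gene's weight vector (i.e. the column order of adata.obsm["macrogenes"]).
def top_genes_for_macrogene(macrogene_idx, macrogene_weights, k=10):
    # Rank the species-prefixed genes by their weight on this macrogene
    scores = {gene: weights[macrogene_idx] for gene, weights in macrogene_weights.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Example: the 10 genes contributing most to macrogene 0
top_genes_for_macrogene(0, macrogene_weights)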