CellPhoneDB 更新4.0 版速度更快

原创

生信探索

发布于 2023-03-31 13:01:23

1.2K0

之前版本的CellPhoneDB依赖的anndata不兼容导致使用h5ad的文件作为count matrix输入报错，没想到CellPhoneDB更新到了4.0解决了这个问题，而且运行速度超级快几分钟就跑完了10几万细胞的主要流程

cd ~
mamba create -n cpdb python=3.8
mamba activate cpdb
mamba install -y ipykernel numpy pandas scikit-learn
pip install cellphonedb gseapy -i https://pypi.tuna.tsinghua.edu.cn/simple

下载数据库文件

https://www.cellphonedb.org

Count data

如果是人的基因就直接使用adata_noramlised_annotated.h5ad文件就可以不需要下边的步骤
如果是小鼠基因，则需要把基因名转换为人的基因名
输入的count文件也可以是文本文件，但是h5ad文件速度更快

from gseapy import Biomart
import pandas as pd

bm = Biomart()
m2h_df = bm.query(dataset='mmusculus_gene_ensembl',
               attributes=['ensembl_gene_id','external_gene_name',
                           'hsapiens_homolog_ensembl_gene',
                           'hsapiens_homolog_associated_gene_name'])
m2h=m2h_df.dropna(subset=['hsapiens_homolog_associated_gene_name'])
m2h=m2h.loc[:,['external_gene_name','hsapiens_homolog_associated_gene_name']]

import anndata

adata = anndata.read_h5ad('adata_noramlised_annotated.h5ad')
adata_raw=adata.raw.to_adata()
bdata=anndata.AnnData(X=adata_raw.X,
    obs=pd.DataFrame({'cell_type':adata_raw.obs.CellType2},index=adata_raw.obs_names),
    var=pd.DataFrame(index=adata_raw.var_names)
    )
merged =pd.merge(bdata.var,m2h,left_index=True,right_on='external_gene_name')
bdata=bdata[:,merged.external_gene_name]
bdata.var_names=merged.hsapiens_homolog_associated_gene_name.values
bdata.write_h5ad('adata_for_cellphonedb.h5ad',compression='lzf')

Meta data

meta_file = pd.DataFrame({'Cell':bdata.obs.index,'cell_type':bdata.obs.cell_type})
meta_file.to_csv("meta_file.csv",index=False)

删除不需要的变量

del adata,meta_file,bdata,merged,m2h

cpdb_statistical_analysis_method

from cellphonedb.src.core.methods import cpdb_statistical_analysis_method

deconvoluted, means, pvalues, significant_means = cpdb_statistical_analysis_method.call(
    cpdb_file_path = './cellphonedb.zip',            # mandatory: CellPhoneDB database zip file.
    meta_file_path = "./meta_file.csv",              # mandatory: tsv file defining barcodes to cell label.
    counts_file_path = './adata_for_cellphonedb.h5ad',# mandatory: normalized count matrix.
    counts_data = 'hgnc_symbol',                     # defines the gene annotation in counts matrix.
    microenvs_file_path = None,                      # optional (default: None): defines cells per microenvironment.
    iterations = 1000,                               # denotes the number of shufflings performed in the analysis.
    threshold = 0.1,                                 # defines the min % of cells expressing a gene for this to be employed in the analysis.
    threads = 8,                                     # number of threads to use in the analysis.
    debug_seed = 42,                                 # debug randome seed. To disable >=0.
    result_precision = 3,                            # Sets the rounding for the mean values in significan_means.
    pvalue = 0.05,                                   # P-value threshold to employ for significance.
    subsampling = False,                             # To enable subsampling the data (geometri sketching).
    subsampling_log = False,                         # (mandatory) enable subsampling log1p for non log-transformed data inputs.
    subsampling_num_pc = 100,                        # Number of componets to subsample via geometric skectching (dafault: 100).
    subsampling_num_cells = 1000,                    # Number of cells to subsample (integer) (default: 1/3 of the dataset).
    separator = '|',                                 # Sets the string to employ to separate cells in the results dataframes "cellA|CellB".
    debug = False,                                   # Saves all intermediate tables employed during the analysis in pkl format.
    output_path = './',                              # Path to save results.
    output_suffix = None                             # Replaces the timestamp in the output files by a user defined string in the  (default: None).
    )

文件目录

├── adata.h5ad
├── meta_file.csv
├── statistical_analysis_deconvoluted_03_30_2023_11:11:04.txt
├── statistical_analysis_means_03_30_2023_11:11:04.txt
├── statistical_analysis_pvalues_03_30_2023_11:11:04.txt
└── statistical_analysis_significant_means_03_30_2023_11:11:04.txt

输出文件和可视化可以参考

https://mp.weixin.qq.com/s/FG8oQJEoM1BRclcaSFC3mw

Reference

https://cellphonedb.readthedocs.io/en/latest/RESULTS-DOCUMENTATION.html

原创声明：本文系作者授权腾讯云开发者社区发表，未经许可，不得转载。

如有侵权，请联系 cloudcommunity@tencent.com 删除。

python

原创声明：本文系作者授权腾讯云开发者社区发表，未经许可，不得转载。

如有侵权，请联系 cloudcommunity@tencent.com 删除。

python

登录后参与评论

暂无评论

CellPhoneDB 更新4.0 版速度更快

CellPhoneDB 更新4.0 版速度更快

下载数据库文件

Count data

Meta data

cpdb_statistical_analysis_method

Reference

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐