PhyloPhlAn 3.0: Microbiome Phylogenetic Analysis

生信菜鸟团

Published 2020-06-02 15:39:17

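Many tools and algorithms are available for phylogenetic analysis of microbial genomes and metagenomes, such as PhyloPhlAn, PhyloSift, ezTree, GToTree, and AMPHORA. Most of them have limitations, however: for example, none can select different genomic regions to achieve optimal classification, and none fully integrates public databases into the analysis. To address these pain points, PhyloPhlAn recently received a major upgrade: the new version is a complete rewrite of the previous one and adds many new features.

GitHub: https://github.com/biobakery/phylophlan

PhyloPhlAn 3.0 can integrate more than 80,000 isolate genomes and 150,000 MAGs when analyzing newly generated microbial genomes, and it supports phylogenetic analysis from the strain level up to the phylum level. It automatically selects the most informative loci for the clade under study and produces multiple sequence alignments, mutation rates, and phylogenies.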

Installation

The recommended approach is to install PhyloPhlAn into a new Conda environment:

conda create -n "phylophlan" -c bioconda phylophlan=3.0

After installation, configuration files must be generated before the tool can be used. The phylophlan_write_default_configs.sh script generates the default ones:

phylophlan_write_default_configs.sh [output_folder]

Verify that PhyloPhlAn is installed correctly:

phylophlan --version

Output:

PhyloPhlAn version 3.0.51 (11 May 2020)
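
The version check only confirms that PhyloPhlAn itself is installed; the external tools named in a configuration file also need to be reachable. A minimal sketch of such a check follows. The function name check_tools is hypothetical, and the executable names to pass depend on your own configuration and installation.

```shell
# Report which of the given executables are missing from PATH.
# Returns 0 if every tool is found, 1 otherwise.
check_tools() {
    ct_missing=0
    for ct_tool in "$@"; do
        if command -v "$ct_tool" >/dev/null 2>&1; then
            echo "found:   $ct_tool"
        else
            echo "missing: $ct_tool"
            ct_missing=1
        fi
    done
    return "$ct_missing"
}

# Example call (tool names are placeholders; use the ones in your .cfg file):
# check_tools phylophlan mash
```

The function only looks up names on PATH; it does not verify tool versions.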

Basic usage

phylophlan -i <input_folder> \
    -d <database> \
    --diversity <low-medium-high> \
    -f <configuration_file>

Input folder `<input_folder>`

The folder containing the input genomes (.fna) and/or amino acid sequences (.faa). Files may be compressed as .gz or .bz2.
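
Before starting a run, it can help to confirm what the input folder actually contains. The sketch below counts genome and proteome files by the extensions listed above. The function name count_inputs is hypothetical, and looking only at the top level of the folder is an assumption made here, not documented PhyloPhlAn behavior.

```shell
# Count candidate PhyloPhlAn inputs in a folder (top level only, by assumption).
# .fna = genomes, .faa = proteomes; .gz and .bz2 compressed variants are included.
count_inputs() {
    ci_dir=$1
    ci_fna=$(find "$ci_dir" -maxdepth 1 -type f \
        \( -name '*.fna' -o -name '*.fna.gz' -o -name '*.fna.bz2' \) | wc -l | tr -d ' ')
    ci_faa=$(find "$ci_dir" -maxdepth 1 -type f \
        \( -name '*.faa' -o -name '*.faa.gz' -o -name '*.faa.bz2' \) | wc -l | tr -d ' ')
    echo "genomes=$ci_fna proteomes=$ci_faa"
}

# Example call:
# count_inputs input_metagenomic
```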

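The configuration file specifies the tool used at each step, so every step of the tree-building pipeline can be customized (marker gene identification, multiple sequence alignment, concatenation or gene tree inference, and phylogeny reconstruction). These steps should be adjusted to the type of markers in the database and the type of input used in the analysis:

• When both the markers and the inputs are nucleotide sequences, the phylogeny is built from nucleotides, and the configuration file should specify the corresponding tools and parameters.
• When the markers are proteins and the inputs are genomes and proteomes, the software translates the genome sequences. If the inputs are genomes, the user can instead pass --force_nucleotides to build the phylogeny from nucleotides; a matching configuration file can be generated with the --force_nucleotides option of the phylophlan_write_config_file script.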

Database `<database>`

The name of the marker database to use.

PhyloPhlAn 3.0 can automatically download two databases of universal prokaryotic markers:

1. PhyloPhlAn (-d phylophlan, 400 universal marker genes) presented in Segata, N et al. NatComm 4:2304 (2013)[1]
2. AMPHORA2 (-d amphora2, 136 universal marker genes) presented in Wu M, Scott AJ Bioinformatics 28.7 (2012)[2]

Besides these two databases, PhyloPhlAn also supports building custom databases.

`--diversity`

Accepts one of low, medium, or high, and sets the type of phylogeny to build.

Diversity	Description
low	for species- and strain-level phylogenies
medium	for genus- and family-level phylogenies
high	for tree-of-life and higher-ranked taxonomic levels phylogenies
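
The table can be turned into a small lookup when scripting runs over several clades. This is only a sketch: the function name diversity_for_rank is hypothetical, and reading "higher-ranked taxonomic levels" as order, class, and phylum is an interpretation made here.

```shell
# Map a taxonomic rank to a --diversity value, following the table above.
diversity_for_rank() {
    case "$1" in
        strain|species)     echo low ;;
        genus|family)       echo medium ;;
        # Assumption: ranks above family count as "higher-ranked".
        order|class|phylum) echo high ;;
        *) echo "unknown rank: $1" >&2; return 1 ;;
    esac
}

# Example call:
# phylophlan ... --diversity "$(diversity_for_rank species)"
```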

Configuration file `<configuration_file>`

The path to the configuration file.

The phylophlan_write_default_configs.sh script generates four default configuration files:

• supermatrix_aa.cfg
• supermatrix_nt.cfg
• supertree_aa.cfg
• supertree_nt.cfg
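
A quick way to confirm that the script produced all four files is to check the output folder for them. The sketch below does that; the function name missing_default_configs is hypothetical, and the folder argument is whatever was passed to phylophlan_write_default_configs.sh.

```shell
# Print the default configuration files that are missing from a folder.
# Returns 0 when all four are present, 1 otherwise.
missing_default_configs() {
    mdc_dir=$1
    mdc_status=0
    for mdc_name in supermatrix_aa supermatrix_nt supertree_aa supertree_nt; do
        if [ ! -f "$mdc_dir/$mdc_name.cfg" ]; then
            echo "$mdc_name.cfg"
            mdc_status=1
        fi
    done
    return "$mdc_status"
}

# Example call:
# missing_default_configs ./configs && echo "all default configs present"
```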

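Custom configuration files

The phylophlan_write_config_file script creates custom configuration files. The example below creates one for a nucleotide supermatrix analysis, using diamond instead of blastn and muscle instead of mafft: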

phylophlan_write_config_file \
    -o custom_config_nt.cfg \
    -d n \
    --db_dna makeblastdb \
    --map_dna diamond \
    --msa muscle \
    --trim trimal \
    --tree1 fasttree \
    --tree2 raxml

• -o: output file name
• -d: the type of database the configuration file targets (n, for nucleotide, in this example)

Parallel computation

Use the --nproc option to set the number of CPUs:

phylophlan -i <input_folder> \
    -d <database> \
    --diversity <low-medium-high> \
    -f <configuration_file> \
    --nproc <N>

Note that regardless of the value passed to --nproc, RAxML never uses more than 20 CPUs, because going beyond 20 does not shorten the time needed to reconstruct the phylogeny; FastTree uses only 3 CPUs.
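
When requesting cluster resources, the caps described above can be expressed as a simple minimum. The sketch below is illustrative only: the function name effective_cpus is hypothetical, the caps of 20 and 3 are taken from the note above, and treating every other tool as uncapped is an assumption.

```shell
# Effective CPU count for a tree-building tool given the --nproc value.
# RAxML is capped at 20 CPUs and FastTree at 3, as described above.
effective_cpus() {
    ec_tool=$1
    ec_nproc=$2
    case "$ec_tool" in
        raxml)    ec_cap=20 ;;
        fasttree) ec_cap=3 ;;
        *)        ec_cap=$ec_nproc ;;   # assumption: other tools use the full --nproc
    esac
    if [ "$ec_nproc" -gt "$ec_cap" ]; then
        echo "$ec_cap"
    else
        echo "$ec_nproc"
    fi
}

# Example call:
# effective_cpus raxml 32
```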

Output

The default output folder is <input_folder>_<database> (use --output_folder to change it). The temporary folder <input_folder>_<database>/tmp stores all intermediate and temporary files generated during the analysis.
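
To locate results from a script, the default folder name can be derived from the two arguments. This is a sketch under assumptions: the function name default_output_folder is hypothetical, and using only the base name of the input folder (without its parent path) is an assumption made here.

```shell
# Build the default output folder name <input_folder>_<database>.
# Assumption: only the base name of the input folder is used.
default_output_folder() {
    echo "$(basename "$1")_$2"
}

# Example calls:
# out=$(default_output_folder input_folder phylophlan)
# ls "$out/tmp"    # intermediate and temporary files
```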

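Output file names vary with the configuration file and the type of analysis performed.

For example, with the default supermatrix_aa.cfg configuration file, the outputs are: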

Filename	Description
RAxML_bestTree.input_folder_refined.tre	Final phylogenetic tree produced by RAxML
input_folder.tre	Phylogenetic tree produced by FastTree
input_folder.aln	Multiple sequence alignment used to build the tree

The above covers only the most basic options; to really learn the tool, work through the example tutorials. The PhyloPhlAn team provides five tutorials for different research scenarios, all worth reading in detail:

• Phylogenetic characterization of isolate genomes of a given species (S. aureus)[3]
• Prokaryotes Tree of life reconstruction[4]
• Metagenomic analysis of the Ethiopian cohort[5]
• High-resolution phylogeny of genomes and MAGs of a known species (E. coli)[6]
• Phylogenetic characterization of an unknown SGB from the Proteobacteria phylum[7]

Because space is limited, only the third tutorial (Metagenomic analysis of the Ethiopian cohort) is translated and presented here.

Example | Metagenomic application: analysis of the Ethiopian cohort

Download the example data:

git clone https://github.com/biobakery/phylophlan.git

Activate the analysis environment:

conda activate phylophlan

Change to the tutorial directory examples/03_metagenomic:

cd phylophlan/phylophlan/examples/03_metagenomic

Step 1. Download the Ethiopian genome bins

The raw data are available at PRJNA504891[8].

Download the binned genomes:

wget "https://www.dropbox.com/s/fuafzwj67tguj31/ethiopian_mags.tar.bz2?dl=1" -O ethiopian_mags.tar.bz2
mkdir -p input_metagenomic
tar -xjf ethiopian_mags.tar.bz2 -C input_metagenomic/

Step 2. Assign taxonomic labels to the bins

Use the SGB (species-level genome bins) database released in January 2019 (MetaRefSGB, Pasolli E, et al. Cell 176.3 (2019)[9]) to assign each bin its closest SGB.

mkdir -p logs
phylophlan_metagenomic -i input_metagenomic \
    -o output_metagenomic \
    --nproc 4 \
    -n 1 \
    -d SGB.Jan19 \
    --verbose 2>&1 | tee logs/phylophlan_metagenomic.log

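With the command above, only the closest SGB is reported for each bin (-n 1). If a bin lies within a 5% Mash[10] distance of a reported SGB, it can be considered part of that SGB and given the SGB's taxonomic label.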
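
The 5% rule can be written as a one-line numeric test. The sketch below is illustrative: the function name within_sgb is hypothetical, and treating a distance of exactly 0.05 as inside the SGB is an assumption made here. It does not parse PhyloPhlAn output; the distance is passed in directly.

```shell
# Decide whether a bin can be assigned to its closest SGB.
# $1 = Mash distance to the closest SGB, $2 = threshold (default 0.05, i.e. 5%).
# Exit status 0 means "within the SGB"; 1 means "too distant".
within_sgb() {
    awk -v d="$1" -v t="${2:-0.05}" 'BEGIN { exit !(d + 0 <= t + 0) }'
}

# Example call:
# if within_sgb 0.012; then echo "assign SGB label"; else echo "leave unassigned"; fi
```

awk is used because plain POSIX shell arithmetic does not handle decimal fractions.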

If an error reports that mash is not installed, download Mash from https://github.com/marbl/Mash/releases and add it to your PATH.

The SGB.Jan19 database is downloaded automatically.

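Step 3. Draw a heatmap of the top 20 SGBs in the Ethiopian metagenomes

This step needs a mapping file that links each bin to its metagenomic sample; the example mapping file is bin2meta.tsv.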

phylophlan_draw_metagenomic -i output_metagenomic.tsv \
    -o output_heatmap \
    --map bin2meta.tsv \
    --top 20 \
    --verbose 2>&1 | tee logs/phylophlan_draw_metagenomic.log

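The command produces two heatmaps:

1. The first heatmap shows the presence/absence of the top 20 SGBs found in the Ethiopian cohort;
2. The second heatmap shows how many uSGBs, kSGBs, and unassigned bins each metagenomic sample contains.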

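What to analyze next

For this Ethiopian cohort, we can look more closely at specific known or unknown SGBs.

For example, we can focus on a common gut commensal, Escherichia coli, and build a phylogeny of the 8 bins assigned to kSGB 10068. For the steps, see the fourth tutorial, High-resolution phylogeny of genomes and MAGs of a known species (E. coli).

We can also study the most prevalent unknown SGB in this population (uSGB 19436) and explore how it relates to the reference genomes. For the steps, see the fifth tutorial, Phylogenetic characterization of an unknown SGB from the Proteobacteria phylum.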

Reference links

[1] Segata, N et al. NatComm 4:2304 (2013): https://www.nature.com/articles/ncomms3304
[2] Wu M, Scott AJ Bioinformatics 28.7 (2012): https://academic.oup.com/bioinformatics/article/28/7/1033/210898
[3] Phylogenetic characterization of isolate genomes of a given species (S. aureus): https://github.com/biobakery/biobakery/wiki/PhyloPhlAn-3.0:-Example-01:-S.-aureus
[4] Prokaryotes Tree of life reconstruction: https://github.com/biobakery/biobakery/wiki/PhyloPhlAn-3.0:-Example-02:-Tree-of-life
[5] Metagenomic analysis of the Ethiopian cohort: https://github.com/biobakery/biobakery/wiki/PhyloPhlAn-3.0:-Example-03:-Metagenomic-application
[6] High-resolution phylogeny of genomes and MAGs of a known species (E. coli): https://github.com/biobakery/biobakery/wiki/PhyloPhlAn-3.0:-Example-04:-E.-coli
[7] Phylogenetic characterization of an unknown SGB from the Proteobacteria phylum: https://github.com/biobakery/biobakery/wiki/PhyloPhlAn-3.0:-Example-05:-Proteobacteria
[8] PRJNA504891: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA504891
[9] Pasolli E, et al. Cell 176.3 (2019): https://www.cell.com/cell/fulltext/S0092-8674(19)30001-7?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS0092867419300017%3Fshowall%3Dtrue
[10] Mash: http://mash.readthedocs.org/

Originally published 2020-05-24 on the 生信菜鸟团 WeChat public account.
