基因组分析中多物种同源基因的鉴定和筛选

OrthoMCL能做什么

Orthologs are homologs separated by speciation events. Paralogs are homologs separated by duplication events. Detection of orthologs is becoming much more important with the rapid progress in genome sequencing.

OrthoMCL is a genome-scale algorithm for grouping orthologous protein sequences. It provides not only groups shared by two or more species/genomes, but also groups representing species-specific gene expansion families. So it serves as an important utility for automated eukaryotic genome annotation. OrthoMCL starts with reciprocal best hits within each genome as potential in-paralog/recent paralog pairs and reciprocal best hits across any two genomes as potential ortholog pairs. Related proteins are interlinked in a similarity graph. Then MCL (Markov Clustering algorithm, Van Dongen 2000; www.micans.org/mcl) is invoked to split mega-clusters. This process is analogous to the manual review in COG construction. MCL clustering is based on weights between each pair of proteins, so to correct for differences in evolutionary distance the weights are normalized before running MCL.

OrthoMCL is similar to the INPARANOID algorithm (Remm, Storm et al. 2001), but is extended to cluster orthologs from multiple species. OrthoMCL clusters are coherent with groups identified by EGO (Lee, Sultana et al. 2002), and an analysis using EC number suggests a high degree of reliability (Li, Stoeckert et al. 2003).

In a recent assessment (Chen, et al. 2007), the performance of seven widely used orthology detection algorithms, representing three kinds of strategies (phylogeny-based, evolutionary distance-based and BLAST-based), are evaluated using the statistical technique Latent Class Analysis (LCA). LCA is useful when there are large data sets available but no gold standard. The results show an overall trade-off between sensitivity and specificity among these algorithms, with INPARANOID and OrthoMCL as the two best methods having both False Positive (FP) and False Negative (FN) error rates lower than 20%.

安装和使用

  • 统一配置环境变量,一劳永逸
    • export PERL5LIB=${PERL5LIB}:~/perl5lib/加到~/.bashrc
    • export PATH=${PATH}:~/bin 加到 ~/.bashrc
    • 环境变量配置:在系统中新建目录 ~/bin,将其完整路径加入到环境变量。
    • PERL5LIB配置:在系统中新建目录 ~/perl5lib,将其完整路径加入到环境变量。
    • 更新环境变量配置 source ~/.bashrc
  • mcl安装 wget http://www.micans.org/mcl/src/mcl-latest.tar.gz tar xvzf mcl-latest.tar.gz cd mcl-latest ./configure --prefix=`pwd`/../mcl_bin make make install ln -s `pwd`/../mcl_bin/bin/* ~/bin/
  • orthoMCL安装 wget http://orthomcl.org/common/downloads/software/v2.0/orthomclSoftware-v2.0.9.tar.gz tar xvzf orthomclSoftware-v2.0.9.tar.gz cd orthomclSoftware-v2.0.9 ln -s `pwd`/bin/* ~/bin/ ln -s `pwd`/lib/perl/* ~/perl5lib
  • 配置Mysql数据库
    • 新建名字为orthomcl的数据库 CREATE DATABASE orthomcl;
    • 新建用户orthomcl,密码为152108, 该用户对数据库orthomcl有完全操作 权限 GRANT SELECT,INSERT,UPDATE,DELETE,CREATE VIEW, CREATE,INDEX,DROP on orthomcl.* TO 'orthomcl'@'localhost' IDENTIFIED BY '152108'; FLUSH PRIVILEGES;
    • 若启动失败,查看log文件 /var/log/mysqld.log中的错误信息。
    • /usr/libexec/mysqld: Can't change dir to [Error code 13]
    • 确保datadir的所有上层目录有x属性
    • 若依然启动不了,在终端运行setenforce 0 关闭SELINUX
    • 查看mysql服务 service mysqld status
    • 关掉mysql服务 service mysqld stop
    • 移动数据库目录到目标位置
    • mkdir ~/mysql; chown mysql:mysql ~/mysql
    • mv /var/lib/mysql/* ~/mysql/
    • /etc/my.cnf文件中修改datadir~/mysql
    • mysql -uroot登录mysql数据库
    • 在mysql操作界面依次输入sql语句 SET PASSWORD=PASSWORD("passwd"); FLUSH PRIVILEGES;
    • yum install mysql mysql-server
    • 安装mysql数据库
    • 设置mysql根用户的密码
    • 因为OrthoMCL运行时需要较大的存储空间,而我的根目录下空间不够, 因此需要更换数据库目录;如果根目录下空间足够,则不需要这部分操 作。
    • 修改/etc/my.cnf配置文件 [mysqld] datadir=~/mysql #[OPTIMIZATION] ##Set this value to 50% of available RAM if your environment permits. myisam_sort_buffer_size=60G ##[OPTIMIZATION] ##This value should be at least 50% of free hard drive space. Use #caution if setting it to 100% of free space however. Your hard disk #may fill up! myisam_max_sort_file_size=200G ##[OPTIMIZATION] ##Our default of 2G is probably fine for this value. Change this value #only if you are using a machine with little resources available. read_buffer_size=2G
    • 启动mysql服务 service mysqld start
    • 新建用户和数据库
    • centos7中使用mariadb取代了mysql, 但所有命令的执行相同 (忽略掉这一段) yum install mariadb mariadb-server systemctl start mariadb ==> 启动mariadb systemctl enable mariadb ==> 开机自启动 mysql_secure_installation ==> 设置 root密码等相关 mysql -uroot -pPASSWD ==> 测试登录!
  • 配置OrthoMCL工作文件
    • orthomclInstallSchema orthomcl.config inst_schema.log species
    • 建一个目录 (~/orthmcl_work),存储OrthoMCL配置文件
    • 拷贝orthomclSoftware-v2.0.9/doc/OrthoMCLEngine/Main/orthomcl.config.template 到~/orthmcl_work,重命名为orthomcl.config
    • 修改内容为: # this config assumes a mysql database named 'orthomcl'. # Adjust according to your situation. dbVendor=mysql #Databsename: orthmcl dbConnectString=dbi:mysql:orthomcl #Database username dbLogin=orthomcl #Database password dbPassword=152108 # Change strings as you like similarSequencesTable=SimilarSequences orthologTable=Ortholog inParalogTable=InParalog coOrthologTable=CoOrtholog #Standards interTaxonMatchView=InterTaxonMatch percentMatchCutoff=50 evalueExponentCutoff=-5 oracleIndexTblSpc=NONE
    • 生成数据表:
  • 创建OrthoMCL输入文件
    • orthomclFilterFasta orthlMCL 10 20
    • OrthoMCL的输入文件为fasta格式文件,其中fasta序列的名字格式为>taxoncode|unique_prot_id。序列名称为空格或下划线分开的两列, 第一列为3到4个字母的物种代码,第二列为蛋白序列的唯一ID。
    • 通常一个基因选择一条代表性蛋白序列。
    • 这些文件使用统一后缀.fasta,并存储于同一文件夹orthlMCL下 (这个文件夹下只能存储fasta格式序列,不然运行 orthomclBlastParser时会报错)。
    • 序列过滤,允许最短的蛋白长度为10,stop codons最大比例为20%, 默认得到goodProteins.fasta
    • 将得到的goodProteins.fasta与orthoMCL的数据合并, 得到orthoMCL.fa
    • 通常我们需要准备研究物种及其多个近缘或者有代表性物种的蛋白质序列 ,因此可不与orthoMCL数据库中的蛋白质序列合并,直接用我们的goodProteins.fasta作为orthoMCL.fa
  • 序列BLAST makeblastdb -in orthoMCL.fa -dbtype prot -title orthomcl \ -out orthomcl -logfile orthomcl.log` blastp -db orthomcl -query goodProteins.fasta -seg yes \ -out orthomcl.blastout -evalue 1e-5 -outfmt 7 -num_threads 70`
  • 略却其它步骤,都整合到一个bash脚本中。
  • 整合的分析脚本orthoMcl.sh ```bash Usage: /MPATHB/self/NGS/orthoMcl.sh options Function: This script is used to perform orthoMcl analysis using MySql, MCL and orthomcl. Before running this script, one must have one mysql database and a mysql user which can perform operation on this database.

OPTIONS: -d Mysql database name (using user_name as prefix to avoid duplication) [Necessary] -u Mysql database username [Necessary] -p Mysql database password [Necessary] -s Target species of this analysis (Any representing string is OK, the shorter the better) [Necessary] -D A directory containing FASTA files for all proteins. [Necessary] -S Sequences downloaded from orthMCL website. [Optional, not used anymore] -t Number of threads for blast. [Default 50]

* [`parseOrthoMclResult.py`](https://github.com/Tong-Chen/NGS/blob/master/parseOrthoMclResult.py)解析orthoMCL的输出结果,主要是`groups.xls`文件
    * 获得每个物种各个基因簇中基因数目的矩阵。
    * 提取在所有物种中
      都只有一个拷贝的基因,提交给工具
      [`orthoMclPhyloGenetic.py`](https://github.com/Tong-Chen/NGS/blob/master/orthoMclPhyloGenetic.py)用于做进化分析。
    * 提取特定物种特有的基因簇。
    * 提取多个物种共有相对于其它物种特异的基因簇。
    * 提取某物种特异扩增或缺失的基因家族。

    * `parseOrthoMclResult.py`
  Program description:
      This is designed to parse orthmcl results.

      Input file format:

      cluster_name<colon><any blank>spe1<vertical_line>prot1<any blank>spe2<verticial_line>prot2<any blank>.....

      C10000: Aco|Aco000153.1 Aco|Aco004369.1 Aco|Aco010005.1
      C10001: Aco|Aco000153.1 Cla|Cla004369.1 Dec|Dec010005.1

      Tasks:

      1. Get a matrix showing the number of proteins in each cluster.
      2. Extract single gene clusters and their sequences in all given
      species. In the output nucleotide file, ending stop codon (TAA,
      TAG, TGA) will be removed for compatible with
      `translatorx_vLocal.pl` and `trimal`.
      3. Extract species specific clusters for given species.
      4. Extract gene-expansion clusters for given species.
      5. Extract multiple-species specific clusters.

  Usage: parseOrthoMclResult.py -i file

  Options:
    -h, --help            show this help message and exit
    -i FILEIN, --input-file=FILEIN
                          Output of `orthomclMclToGroups`.
    -t MAIN_SPE, --target-species=MAIN_SPE
                          Specify the `species` name used for extracting species
                          specific clusters or specially expanded clusters.
    -E EXCLUDE_WHEN_READING, --exclude-all=EXCLUDE_WHEN_READING
                          Comma or blank separated strings representing species
                          excluded when reading in the result. It will affect
                          all tasks. Default including all species.
    -e EXCLUDE_SINGLE_CONSERVE, --exclude-2=EXCLUDE_SINGLE_CONSERVE
                          Comma or blank separated strings representing species
                          should not be considered when performing task <2>.
                          Default including all species.
    -s SPECIFIC_MULTIPLE, --specific-multiple-5=SPECIFIC_MULTIPLE
                          Comma or blank separated strings representing multiple
                          species used for task <5>. Default muting task 5.
    -P DIR_PROT, --directory-prot=DIR_PROT
                          Directory containing all protein sequences used for
                          `orthoMcl.sh`. All sequences have a suffix `.fasta`.
    -N DIR_NUCL, --directory-nucl=DIR_NUCL
                          Directory containing all nucleotide sequences used for
                          `orthoMcl.sh`. All sequences have a suffix `.fasta`.
    -o OUTP, --output-prefix=OUTP
                          Prefix for output files.
    -v, --verbose         Show process information
    -d, --debug           Debug the program
  ```

原文发布于微信公众号 - 生信宝典(Bio_data)

原文发表时间:2017-05-31

本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。

发表于

我来说两句

0 条评论
登录 后参与评论

相关文章

来自专栏JAVA高级架构

结合RPC框架通信谈 netty如何解决TCP粘包问题

因为自己造一个RPC框架的轮子时,需要解决TCP的粘包问题,特此记录,希望方便他人。这是我写的RPC框架的 GitHub地址 https://github.co...

11830
来自专栏菩提树下的杨过

mybatis: 利用多数据源实现分库存储

之前写过一篇mybatis 使用经验小结 提到过多数据源的处理方式,虽然简单但是姿势不太优雅,今天介绍一些更美观的办法: spring中有一个AbstractR...

25250
来自专栏JetpropelledSnake

Django学习笔记之利用Form和Ajax实现注册功能

18550
来自专栏牛客网

京东提前批研发面经

【每日一语】真实人生中,我们往往在大势底定无可更改时才迟迟进场,却又在胜败未分的浑沌中提早离席。——翁贝托·埃科《开头与结尾》

12920
来自专栏高性能服务器开发

从零学习开源项目系列(四)LogServer源码探究

这是从零学习开源项目的第四篇,上一篇是《从零学习开源项目系列(三) CSBattleMgr服务源码研究》,这篇文章我们一起来学习LogServer,中文意思可能...

27220
来自专栏点滴积累

Cesium中Clock控件及时间序列瓦片动态加载

前言 前面已经写了两篇博客介绍Cesium,一篇整体上简单介绍了Cesium如何上手,还有一篇介绍了如何将Cesium与分布式地理信息处理框架Geotrelli...

51540
来自专栏nimomeng的自我进阶

OC优化指南

a) Reusing UITableViewCell:利用cellWithTableView:cellIdentifier:nibName: b)...

16610
来自专栏blackheart的专栏

[信息安全] 4.一次性密码 && 身份认证三要素

在信息安全领域,一般把Cryptography称为密码,而把Password称为口令。日常用户的认知中,以及我们开发人员沟通过程中,绝大多数被称作密码的东西其...

33160
来自专栏hrscy

RxSwift - Why

官方建议总是使用 .addDisposableTo(disposeBag) 即使对于简单绑定来说那不是必要的。

16820
来自专栏iOS开发日记

Object-C特性埋点

Objective-C是一门简单的语言,95%是C。只是在语言层面上加了些关键字和语法。真正让Objective-C如此强大的是它的运行时。它很小但却很强大。它...

46260

扫码关注云+社区

领取腾讯云代金券