Orthologs are homologs separated by speciation events. Paralogs are homologs separated by duplication events. Detection of orthologs is becoming much more important with the rapid progress in genome sequencing.
OrthoMCL is a genome-scale algorithm for grouping orthologous protein sequences. It provides not only groups shared by two or more species/genomes, but also groups representing species-specific gene expansion families, which makes it a useful tool for automated eukaryotic genome annotation. OrthoMCL starts with reciprocal best hits within each genome as potential in-paralog (recent paralog) pairs and reciprocal best hits across any two genomes as potential ortholog pairs. Related proteins are linked in a similarity graph, and MCL (the Markov Clustering algorithm, Van Dongen 2000; www.micans.org/mcl) is then invoked to split mega-clusters, a step analogous to the manual review in COG construction. Because MCL clusters on the weights between each pair of proteins, the weights are normalized before running MCL to correct for differences in evolutionary distance.
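The pairing and normalization steps described above can be illustrated with a small sketch. This is not OrthoMCL's own code: the function names are hypothetical, the input is assumed to be a list of (query, subject, score) tuples with IDs written as taxon|protein and higher scores meaning better hits, and both the edge weight (mean of the two scores) and the normalization (division by the mean weight of the species pair) are simplifying assumptions.

```python
from collections import defaultdict


def taxon(protein_id):
    """Return the species code of an ID written as 'taxon|protein'."""
    return protein_id.split("|", 1)[0]


def best_hits(hits):
    """Keep, for every query, its best-scoring hit in each target species.

    `hits` holds (query, subject, score) tuples; a higher score is better.
    Self-hits are ignored.
    """
    best = {}
    for query, subject, score in hits:
        if query == subject:
            continue
        key = (query, taxon(subject))
        if key not in best or score > best[key][1]:
            best[key] = (subject, score)
    return best


def reciprocal_best_pairs(hits):
    """Return {frozenset({a, b}): weight} for every reciprocal best hit."""
    best = best_hits(hits)
    pairs = {}
    for (query, _), (subject, score) in best.items():
        back = best.get((subject, taxon(query)))
        if back is not None and back[0] == query:
            # Assumption: the edge weight is the mean of the two scores.
            pairs[frozenset((query, subject))] = (score + back[1]) / 2
    return pairs


def normalize(pairs):
    """Divide each weight by the mean weight of its species pair."""
    by_taxa = defaultdict(list)
    for pair, weight in pairs.items():
        by_taxa[frozenset(taxon(p) for p in pair)].append(weight)
    means = {taxa: sum(ws) / len(ws) for taxa, ws in by_taxa.items()}
    return {
        pair: weight / means[frozenset(taxon(p) for p in pair)]
        for pair, weight in pairs.items()
    }
```

Pairs whose two members share a species code are the potential in-paralog pairs, and pairs spanning two species are the potential ortholog pairs; the normalized weights are what a clustering step such as MCL would then receive.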
OrthoMCL is similar to the INPARANOID algorithm (Remm, Storm et al. 2001), but is extended to cluster orthologs from multiple species. OrthoMCL clusters are consistent with the groups identified by EGO (Lee, Sultana et al. 2002), and an analysis using EC numbers suggests a high degree of reliability (Li, Stoeckert et al. 2003).
In a recent assessment (Chen et al. 2007), the performance of seven widely used orthology detection algorithms, representing three kinds of strategies (phylogeny-based, evolutionary distance-based and BLAST-based), was evaluated using the statistical technique Latent Class Analysis (LCA). LCA is useful when large data sets are available but there is no gold standard. The results show an overall trade-off between sensitivity and specificity among these algorithms, with INPARANOID and OrthoMCL as the two best methods, both having False Positive (FP) and False Negative (FN) error rates lower than 20%.
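Add the full path of ~/bin to the PATH environment variable by adding this line to ~/.bashrc:
    export PATH=${PATH}:~/bin
Add the full path of ~/perl5lib to the PERL5LIB environment variable by adding this line to ~/.bashrc:
    export PERL5LIB=${PERL5LIB}:~/perl5lib/
Then reload the settings:
    source ~/.bashrc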
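Whether the two directories actually ended up on the search paths can be checked with a few lines of Python. This is only a convenience sketch; the helper name on_search_path is hypothetical and not part of OrthoMCL.

```python
import os


def on_search_path(directory, variable, environ=None):
    """Return True if `directory` is listed in the path-like `variable`.

    `environ` defaults to the current process environment; passing a dict
    makes the check easy to try out on made-up values.
    """
    environ = os.environ if environ is None else environ
    target = os.path.realpath(os.path.expanduser(directory))
    entries = environ.get(variable, "").split(os.pathsep)
    return any(
        entry and os.path.realpath(os.path.expanduser(entry)) == target
        for entry in entries
    )


if __name__ == "__main__":
    for directory, variable in (("~/bin", "PATH"), ("~/perl5lib", "PERL5LIB")):
        print(variable, directory, on_search_path(directory, variable))
```

Run it in a new shell after source ~/.bashrc; a False for either line means the corresponding export did not take effect.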
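Create the orthomcl database:
    CREATE DATABASE orthomcl;
Create the user orthomcl with password 152108; this user has full privileges on the orthomcl database:
    GRANT SELECT,INSERT,UPDATE,DELETE,CREATE VIEW,CREATE,INDEX,DROP ON orthomcl.* TO 'orthomcl'@'localhost' IDENTIFIED BY '152108';
    FLUSH PRIVILEGES;
Check the error messages in /var/log/mysqld.log. For an error such as
    /usr/libexec/mysqld: Can't change dir to [Error code 13]
check the x (execute) attribute of the directory, and run setenforce 0 to turn off SELinux.
To move the MySQL data directory to ~/mysql:
    service mysqld status
    service mysqld stop
    mkdir ~/mysql; chown mysql:mysql ~/mysql
    mv /var/lib/mysql/* ~/mysql/
Then change datadir to ~/mysql in the /etc/my.cnf file.
Log in to the MySQL database with mysql -uroot and set the password:
    SET PASSWORD=PASSWORD("passwd");
    FLUSH PRIVILEGES;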
Install MySQL:
    yum install mysql mysql-server
The /etc/my.cnf configuration file:
    [mysqld]
    datadir=~/mysql
    # [OPTIMIZATION]
    # Set this value to 50% of available RAM if your environment permits.
    myisam_sort_buffer_size=60G
    # [OPTIMIZATION]
    # This value should be at least 50% of free hard drive space. Use
    # caution if setting it to 100% of free space however. Your hard disk
    # may fill up!
    myisam_max_sort_file_size=200G
    # [OPTIMIZATION]
    # Our default of 2G is probably fine for this value. Change this value
    # only if you are using a machine with little resources available.
    read_buffer_size=2G
Start the server:
    service mysqld start
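The sizing rules in the configuration comments (half of the available RAM, at least half of the free disk space, and a fixed 2G read buffer) can be turned into numbers with a short helper. This is a sketch only: the function names are hypothetical, and the RAM and disk figures must be supplied by the caller.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def _format_size(n_bytes):
    """Format a byte count with the G/M suffixes accepted in my.cnf."""
    if n_bytes >= GIB:
        return f"{n_bytes // GIB}G"
    return f"{max(n_bytes // MIB, 1)}M"


def suggest_myisam_settings(total_ram_bytes, free_disk_bytes):
    """Apply the 50% rules quoted in the my.cnf comments.

    Returns a dict of option name -> value string. These are starting
    points, not tuned values.
    """
    return {
        "myisam_sort_buffer_size": _format_size(total_ram_bytes // 2),
        "myisam_max_sort_file_size": _format_size(free_disk_bytes // 2),
        "read_buffer_size": "2G",
    }


if __name__ == "__main__":
    # Example figures only: 120 GiB of RAM and 400 GiB of free disk space.
    for option, value in suggest_myisam_settings(120 * GIB, 400 * GIB).items():
        print(f"{option}={value}")
```

With the example figures the helper reproduces the 60G and 200G values used in the configuration file.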
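Install the OrthoMCL schema into the database:
    orthomclInstallSchema orthomcl.config inst_schema.log species
Filter the protein sequences in the folder orthlMCL (minimum length 10, at most 20% stop codons):
    orthomclFilterFasta orthlMCL 10 20
Sequence names must have the form
    >taxoncode|unique_prot_id
that is, two fields separated by a vertical bar: the first field is a species code of 3 to 4 letters, and the second is the unique ID of the protein sequence. Each sequence file has the suffix .fasta, and all files are stored in the same folder, orthlMCL (this folder may contain only FASTA-format sequences, otherwise orthomclBlastParser reports an error).
Filtering produces goodProteins.fasta. Either merge goodProteins.fasta with the OrthoMCL data to obtain orthoMCL.fa, or use goodProteins.fasta directly as orthoMCL.fa.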
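The header format and the two filter thresholds can be illustrated with a small sketch. It is not the orthomclFilterFasta implementation: the function names are hypothetical, the protein ID is assumed to be the first whitespace-separated word of the original header, and stop codons are assumed to appear as `*` in the protein sequence.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def adjust_and_filter(lines, taxon, min_length=10, max_percent_stops=20):
    """Rename records to 'taxon|id' and split them into good and poor.

    A record is poor if it is shorter than `min_length` or if more than
    `max_percent_stops` percent of its characters are '*'.
    """
    good, poor = [], []
    for name, seq in read_fasta(lines):
        fields = name.split()
        record = (f"{taxon}|{fields[0] if fields else ''}", seq)
        percent_stops = 100 * seq.count("*") / len(seq) if seq else 100
        if len(seq) < min_length or percent_stops > max_percent_stops:
            poor.append(record)
        else:
            good.append(record)
    return good, poor


def to_fasta(records):
    """Format (name, sequence) records as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)
```

Writing to_fasta(good) for every species into one file gives a file comparable in layout to goodProteins.fasta.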
Help text of orthoMcl.sh:

Usage:

    /MPATHB/self/NGS/orthoMcl.sh options

Function:

    This script is used to perform orthoMcl analysis using MySql, MCL and
    orthomcl.

    Before running this script, one must have one mysql database and a
    mysql user which can perform operations on this database.

OPTIONS:
    -d  Mysql database name (using user_name as prefix to avoid
        duplication) [Necessary]
    -u  Mysql database username [Necessary]
    -p  Mysql database password [Necessary]
    -s  Target species of this analysis
        (Any representing string is OK, the shorter the better)
        [Necessary]
    -D  A directory containing FASTA files for all proteins.
        [Necessary]
    -S  Sequences downloaded from the orthoMCL website.
        [Optional, not used anymore]
    -t  Number of threads for blast. [Default 50]
parseOrthoMclResult.py parses the output of orthoMCL, mainly the groups.xls file.
orthoMclPhyloGenetic.py is used for evolutionary analysis.

Help text of parseOrthoMclResult.py:
Program description:

    This is designed to parse orthoMCL results.

Input file format:

    cluster_name<colon><any blank>spe1<vertical_line>prot1<any blank>spe2<vertical_line>prot2<any blank>.....

    C10000: Aco|Aco000153.1 Aco|Aco004369.1 Aco|Aco010005.1
    C10001: Aco|Aco000153.1 Cla|Cla004369.1 Dec|Dec010005.1

Tasks:

    1. Get a matrix showing the number of proteins in each cluster.
    2. Extract single gene clusters and their sequences in all given
       species. In the output nucleotide file, the ending stop codon (TAA,
       TAG, TGA) will be removed for compatibility with
       `translatorx_vLocal.pl` and `trimal`.
    3. Extract species specific clusters for given species.
    4. Extract gene-expansion clusters for given species.
    5. Extract multiple-species specific clusters.
Usage: parseOrthoMclResult.py -i file

Options:
  -h, --help            show this help message and exit
  -i FILEIN, --input-file=FILEIN
                        Output of `orthomclMclToGroups`.
  -t MAIN_SPE, --target-species=MAIN_SPE
                        Specify the `species` name used for extracting species
                        specific clusters or specially expanded clusters.
  -E EXCLUDE_WHEN_READING, --exclude-all=EXCLUDE_WHEN_READING
                        Comma or blank separated strings representing species
                        excluded when reading in the result. It will affect
                        all tasks. Default including all species.
  -e EXCLUDE_SINGLE_CONSERVE, --exclude-2=EXCLUDE_SINGLE_CONSERVE
                        Comma or blank separated strings representing species
                        should not be considered when performing task <2>.
                        Default including all species.
  -s SPECIFIC_MULTIPLE, --specific-multiple-5=SPECIFIC_MULTIPLE
                        Comma or blank separated strings representing multiple
                        species used for task <5>. Default muting task 5.
  -P DIR_PROT, --directory-prot=DIR_PROT
                        Directory containing all protein sequences used for
                        `orthoMcl.sh`. All sequences have a suffix `.fasta`.
  -N DIR_NUCL, --directory-nucl=DIR_NUCL
                        Directory containing all nucleotide sequences used for
                        `orthoMcl.sh`. All sequences have a suffix `.fasta`.
  -o OUTP, --output-prefix=OUTP
                        Prefix for output files.
  -v, --verbose         Show process information
  -d, --debug           Debug the program
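
The input format and tasks 1 to 3 from the help text can be sketched in a few functions. This is an illustration, not the code of parseOrthoMclResult.py: all function names are hypothetical, and a "single gene cluster" is assumed to mean a cluster with exactly one protein from each of the given species and none from any other.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def parse_groups(lines):
    """Parse 'name: spe|prot spe|prot ...' lines into {name: [(spe, prot)]}."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, members = line.split(":", 1)
        groups[name.strip()] = [tuple(m.split("|", 1)) for m in members.split()]
    return groups


def count_matrix(groups):
    """Task 1: number of proteins per species in each cluster."""
    return {
        name: Counter(spe for spe, _ in members)
        for name, members in groups.items()
    }


def single_copy_clusters(groups, species):
    """Task 2 (assumed definition): exactly one protein per given species."""
    wanted = set(species)
    return [
        name
        for name, counts in count_matrix(groups).items()
        if set(counts) == wanted and all(n == 1 for n in counts.values())
    ]


def species_specific_clusters(groups, target):
    """Task 3: clusters whose members all come from `target`."""
    return [
        name
        for name, counts in count_matrix(groups).items()
        if set(counts) == {target}
    ]


def strip_stop_codon(nucleotides):
    """Remove one trailing stop codon (TAA, TAG or TGA), if present."""
    if len(nucleotides) >= 3 and nucleotides[-3:].upper() in STOP_CODONS:
        return nucleotides[:-3]
    return nucleotides
```

Applied to the two example lines from the help text, C10000 is reported as specific to Aco, and C10001 as a single gene cluster for Aco, Cla and Dec.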