Evolutionary Distance Among Species For Orthology Analysis
11.6 years ago

I am using OrthoMCL on 4-5 species (complete proteomes) to try to find groups of orthologous genes. Will the evolutionary distances among these species affect my results? If species A and species B are in the same genus and all other species are in separate phyla, will OrthoMCL be biased towards putting A and B exclusively in an orthologous group?

From my results, it seems like I am getting proportionally more exclusive A-B groups, which makes sense since they are closer together. But when I BLAST some of the genes in the A-B groups, I get decent hits to the other species I used in OrthoMCL.
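To put a number on this rather than spot-checking, something like the following rough sketch could list the A-B-only groups whose members still have good hits elsewhere. The function name, the input shapes (`groups`, `species_of`, `hits`) and the `1e-10` e-value cutoff are all assumptions made for illustration; none of this comes from OrthoMCL itself.

```python
def exclusive_groups_with_outside_hits(groups, species_of, hits,
                                       pair=("A", "B"), max_evalue=1e-10):
    """Flag groups containing only the species in `pair` whose members
    nevertheless have BLAST hits in other species.

    groups:     dict group_id -> list of protein IDs (hypothetical layout)
    species_of: dict protein ID -> species label
    hits:       iterable of (query, subject, evalue) from all-vs-all BLAST
    Returns dict group_id -> set of other species that were hit.
    """
    pair_set = set(pair)

    # For every query, record which species it hits at or below the cutoff.
    hit_species = {}
    for query, subject, evalue in hits:
        if evalue <= max_evalue:
            hit_species.setdefault(query, set()).add(species_of[subject])

    flagged = {}
    for group_id, members in groups.items():
        # Keep only groups made up of exactly the two close species.
        if {species_of[m] for m in members} != pair_set:
            continue
        outside = set()
        for m in members:
            outside |= hit_species.get(m, set()) - pair_set
        if outside:
            flagged[group_id] = outside
    return flagged
```

The fraction of A-B groups that get flagged gives a rough idea of how often distant homologs are being left out of a group.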

The algorithm doesn't seem to be described very well in their paper, and the source code for the orthology finding is basically a set of messy SQL calls. There does seem to be some kind of weighting procedure to normalize the BLAST scores. Does anyone have any thoughts or suggestions for alternative methods or software?


I think OrthoMCL is essentially clustering based on similarities calculated from all-to-all BLAST results. Could you try some phylogeny-based methods? I imagine that would give you some A and B lineage-specific duplications.
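To make the "clustering on all-to-all BLAST similarities" idea concrete, here is a minimal sketch of the reciprocal-best-hit step that similarity-based methods typically start from. It is a simplification and not OrthoMCL's actual code: it ignores in-paralogs, score normalization and the MCL clustering itself, and the function names and input shapes are made up for illustration.

```python
def best_hits(hits, species_of):
    """For each query protein, keep its best-scoring hit in every other species.

    hits:       iterable of (query, subject, bitscore)
    species_of: dict protein ID -> species label
    Returns dict (query, subject_species) -> (subject, bitscore).
    """
    best = {}
    for query, subject, score in hits:
        query_sp, subject_sp = species_of[query], species_of[subject]
        if query_sp == subject_sp:
            continue  # within-species hits are ignored in this sketch
        key = (query, subject_sp)
        if key not in best or score > best[key][1]:
            best[key] = (subject, score)
    return best


def reciprocal_best_hits(hits, species_of):
    """Return the set of protein pairs that are each other's best hit."""
    best = best_hits(hits, species_of)
    pairs = set()
    for (query, _), (subject, _) in best.items():
        back = best.get((subject, species_of[query]))
        if back is not None and back[0] == query:
            pairs.add(tuple(sorted((query, subject))))
    return pairs
```

With a close pair A and B plus distant species, the A-B pairs tend to have much higher scores than pairs involving the distant species, which is where the question about bias comes from.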

11.6 years ago
Asaf 10k

I don't know if the software is available, but maybe you will find SYNERGY useful. Another option for finding clusters of orthologous proteins is oma-browser, where you can find precomputed clusters.

11.6 years ago

The original OrthoMCL algorithm didn't do any ad hoc weighting of the species in the sets, although this might have changed. There is an alternative to OrthoMCL that does take ingroup and outgroup species into account, which is to use hcluster_sg to do the clustering on your BLAST scores:

http://treesoft.svn.sourceforge.net/viewvc/treesoft/branches/lh3/hcluster/

Click on the Download GNU tarball at the bottom of the page.

The input file is in a three-column "A B C" format, where proteins A and B are followed by the BLAST score or e-value (scaled to, say, 0-100). Optionally, a second file defines "categories"; see exam-1.cat for an example. With these categories you can split your sets into species that are very close together (ingroups) and species that can be called outgroups, and outgroups can also have different levels. The clustering then takes the close ingroups and the outgroups of different levels into account, so that you are not leaving too many outgroup proteins behind just because they are more distant in the phylogenetic tree than the ingroup species.
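As a rough sketch of how the three-column file could be built from tabular BLAST output (`-outfmt 6`, where the e-value is the 11th column): the scaling used here, half of -log10(e-value) capped at 100, is just one possible choice and is an assumption, as are the function names. Check the hcluster_sg documentation for the exact delimiter and value range it expects.

```python
import math


def evalue_to_weight(evalue, max_weight=100):
    """Map a BLAST e-value to an integer in [0, max_weight] (assumed scaling)."""
    if evalue <= 0.0:
        return max_weight  # BLAST prints 0.0 for extremely significant hits
    score = round(-math.log10(evalue) / 2)
    return max(0, min(max_weight, score))


def blast_to_edges(lines):
    """Yield (protein_a, protein_b, weight) from BLAST -outfmt 6 lines.

    Self-hits are skipped, and only the strongest weight is kept when a
    pair appears several times (for example, multiple HSPs).
    """
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query == subject:
            continue
        weight = evalue_to_weight(evalue)
        if weight > best.get((query, subject), -1):
            best[(query, subject)] = weight
    for (query, subject), weight in sorted(best.items()):
        if weight > 0:  # drop pairs whose scaled weight rounds to zero
            yield query, subject, weight
```

Each yielded tuple can then be written out as one line, e.g. `print(a, b, w, sep="\t")` (tab-delimited output is an assumption here).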

The hcluster_sg software was (and I think still is) the software used in the EnsemblCompara GeneTrees pipeline. It scales really well, is used for trees that encompass the whole tree of life, including eukarya, bacteria and archaea, and produces very decent protein clusters given the right categories.

11.6 years ago
qiyunzhu ▴ 430

I have tried several of the popular programs, but none of them works perfectly for my data. I decided to use the OrthoMCL results because I guess it is the most popular one.

Here is an old but handy review of orthology-identification algorithms and programs. Hope you will find it useful!

Kuzniar A, van Ham RC, Pongor S, Leunissen JA. The quest for orthologs: finding the corresponding gene across genomes. Trends Genet. 2008 Nov;24(11):539-51. http://www.ncbi.nlm.nih.gov/pubmed/18819722
