I have 2 fasta files containing all of the proteins from 2 distance organisms (A Spirochaetes
and a Firmicutes
). I want to map the genes from the Firmicutes
to it's best hit in the Spirochaetes
.
What is the best way to do this and the most accepted way?
I'm very familiar with Python
and my first thought was to use skbio
and do a pairwise alignment for all of the proteins.(http://scikitbio.org/docs/0.4.1/generated/skbio.alignment.StripedSmithWaterman.html). However, since it's local alignment then it may give me a high score for a single domain which is not what I want.
I then thought about using BioPython
and the blast
wrapper but I don't know how to specify the query database and a length threshold (http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc87).
I quickly checked onto OMA documentation, it looks promising. I'm wondering, have you compared your tool with the standard reciprocal blast approach? Why your tool is better? Besides, as i understand output of the tool is a table of pairwaise relations. Is there an option to directly extract all orthologs pairs from initial files (e.g. in FASTA format)?
Conceptually, the main limitation of reciprocal best hit is that it cannot cope with one-to-many or many-to-many orthology, which exist whenever a gene has duplicated after the speciation of interest. More discussion here:
https://academic.oup.com/gbe/article/5/10/1800/520875/Bidirectional-Best-Hits-Miss-Many-Orthologs-in
Now, if you are interested in the relative performance of different orthology inference methods, including OMA and reciprocal blast hit, please refer to this paper:
http://www.nature.com/nmeth/journal/v13/n5/full/nmeth.3830.html