Hi,
I would like to find the orthologs/inparalogs for a set of human genes across a wide range of species in terms of evolutionary distance: bacteria/archaea, fungi, plants...
Initially I thought of a pairwise method such as InParanoid, but its performance is not good enough for a large number of proteomes (unless there is a version that supports MPI or multithreading). I could use methods based on groups of orthologs (OGs), but I am not really interested in OGs, since I want to map human genes to both closely related and distant species/groups.
Any suggestions are very welcome!
Thanks a lot in advance
Because orthology and paralogy are defined by speciation and duplication events, building a phylogenetic tree is the proper way to go, but that is usually computationally expensive. The main strategy to avoid this cost is to rely on pairwise alignment and clustering (e.g. OrthoMCL, COG, InParanoid...). However, without a phylogenetic tree, detection of inparalogs may be difficult or inaccurate, and distant homologies are hard to detect with pairwise alignments. For some ideas on how to proceed, have a look at the TreeFam paper; the approach is used by Ensembl, so it seems scalable enough. For any serious phylogeny work you would need a compute farm. My suggestion would be to start from an existing resource and add the missing pieces: for example, start from existing TreeFam trees and add the genes from the missing species.
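
To make the pairwise idea concrete, here is a minimal sketch of reciprocal best hits (RBH), the simplest form of pairwise ortholog calling. It is not what InParanoid or OrthoMCL actually do (they add inparalog detection and clustering on top), and the function names, gene IDs and scores below are made up for illustration. The input is assumed to be (query, subject, score) tuples, e.g. bit scores parsed from an all-vs-all search in each direction.

```python
def best_hits(hits):
    """Map each query to its single highest-scoring subject.

    hits: iterable of (query, subject, score) tuples (hypothetical format).
    On ties, the first hit seen is kept.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Toy data: invented IDs and scores, not real search output.
human_vs_yeast = [
    ("HsA", "ScA", 250.0),
    ("HsA", "ScB", 90.0),
    ("HsB", "ScA", 120.0),
    ("HsB", "ScB", 80.0),
]
yeast_vs_human = [
    ("ScA", "HsA", 245.0),
    ("ScA", "HsB", 118.0),
    ("ScB", "HsA", 88.0),
    ("ScB", "HsB", 60.0),
]

pairs = reciprocal_best_hits(human_vs_yeast, yeast_vs_human)
print(pairs)
```

In this toy case only HsA/ScA is reported. HsB also hits ScA best, but ScA prefers HsA, so HsB is dropped: if HsB were an inparalog of HsA, plain RBH would miss it, which is the limitation of tree-free methods mentioned above.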