Background: I have ~18K proteins from 30+ plant species. The only thing they have in common is a certain domain of ~50 aa, which itself varies in length, so a domain-based phylogeny is ruled out. This domain is very promiscuous, i.e. it occurs in several different domain combinations.
For ortholog inference, I binned the proteins by their protein domain architectures (PDAs) and ran OrthoFinder within each bin. OrthoFinder processes BLAST hits using the standard MCL software. It also aligns the putative orthologs using MAFFT with the L-INS-i option and builds a tree from each alignment using FastTree (an approximate ML method).
Problem: Even within each PDA bin, the proteins vary considerably in length (due to recombination events). As a result, I think, the alignments returned by OrthoFinder are extremely gappy, which suggests that proteins of disparate lengths, and therefore likely of different evolutionary origins, are being incorrectly placed in the same orthologous group. Is there a way to prevent or compensate for this?
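To put a number on "extremely gappy", the gap fraction and the spread of ungapped lengths can be computed for each orthogroup alignment. The following is only a minimal sketch, assuming the alignments are available as aligned FASTA text; the function names `read_fasta` and `gap_summary` and the toy alignment are made up for illustration and are not part of OrthoFinder.

```python
from statistics import median


def read_fasta(text):
    """Parse FASTA text into {name: sequence}; name is the first word of the header."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {k: "".join(v) for k, v in records.items()}


def gap_summary(alignment):
    """Return the overall gap fraction and ungapped length statistics of an alignment."""
    seqs = list(alignment.values())
    width = len(seqs[0])
    if any(len(s) != width for s in seqs):
        raise ValueError("sequences do not all have the same aligned length")
    gaps = sum(s.count("-") for s in seqs)
    lengths = [len(s.replace("-", "")) for s in seqs]
    return {
        "gap_fraction": gaps / (width * len(seqs)),
        "min_len": min(lengths),
        "median_len": median(lengths),
        "max_len": max(lengths),
    }


# Toy alignment (hypothetical data, for illustration only).
example = """>a
MKT--LLV
>b
MK---L-V
>c
MKTAALLV
"""

summary = gap_summary(read_fasta(example))
print(summary)
```

An orthogroup with a high gap fraction together with a large ratio between `max_len` and `min_len` would be a candidate for the length-driven misgrouping described above.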
My specific questions are:
- Should I pre-process my dataset differently before the OrthoFinder step? Should that involve PDA-based binning or not? For example, should I cluster the proteins by length, perhaps using UCLUST? And if so, before or after PDA-based binning?
- I have not used OrthoMCL; would it suffer from the same sort of misclassification? AND / OR
- Should I post-process my alignments instead, using something like Gblocks?
- OrthoFinder uses MAFFT L-INS-i (probably most accurate; recommended for <200 sequences; iterative refinement method incorporating local pairwise alignment information). Perhaps I should force OrthoFinder to use E-INS-i (suitable for sequences containing large unalignable regions; recommended for <200 sequences) instead?
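To make the last two options concrete, here is a rough sketch of realigning an orthogroup outside OrthoFinder with either MAFFT mode and then dropping gap-rich columns. The MAFFT flags are the documented equivalents of L-INS-i (`--localpair --maxiterate 1000`) and E-INS-i (`--genafpair --maxiterate 1000`). The column filter is a plain gap-fraction threshold, not the Gblocks algorithm, and the helper names, the 0.5 cutoff, the file name `og.fa` and the toy alignment are assumptions for illustration.

```python
# MAFFT flags for the two modes mentioned above.
MAFFT_MODES = {
    "linsi": ["--localpair", "--maxiterate", "1000"],  # L-INS-i
    "einsi": ["--genafpair", "--maxiterate", "1000"],  # E-INS-i
}


def mafft_command(mode, fasta_path):
    """Build the argument list for a MAFFT run; MAFFT writes the alignment to stdout."""
    return ["mafft", *MAFFT_MODES[mode], fasta_path]


def drop_gappy_columns(alignment, max_gap_fraction=0.5):
    """Keep only alignment columns whose gap fraction is <= max_gap_fraction.

    This is a simple threshold filter standing in for a dedicated trimmer.
    """
    names = list(alignment)
    columns = zip(*(alignment[n] for n in names))
    kept = [col for col in columns
            if col.count("-") / len(col) <= max_gap_fraction]
    return {n: "".join(col[i] for col in kept) for i, n in enumerate(names)}


# The command could be passed to subprocess.run(..., stdout=some_file);
# here it is only printed so the sketch runs without MAFFT installed.
cmd = mafft_command("einsi", "og.fa")  # "og.fa" is a hypothetical file name
print(" ".join(cmd))

# Toy alignment (hypothetical data, for illustration only).
toy = {"a": "MKT--LLV", "b": "MK---L-V", "c": "MKTAALLV"}
trimmed = drop_gappy_columns(toy, max_gap_fraction=0.5)
print(trimmed)
```

Trimming only changes what the tree-building step sees; it does not change which proteins were grouped together, so it would not by itself fix a misclassification made at the clustering step.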
The short version of my question:
For a short, promiscuous domain that is itself of variable length, found in plant proteins that also vary in length and domain architecture, what would be the best bioinformatics pipeline to infer orthologs accurately?