I am doing a phylogenetic analysis of two sets of proteins (A and B) that are functionally very closely related and share a high degree of sequence similarity. From various species I have identified the protein sequences I want to include in the analysis (for both proteins). Each sequence was included based on its similarity (BLAST results) to the known, functionally characterized proteins A and B in Arabidopsis. I am worried, however, that some of the sequences included might represent paralogs rather than orthologs. Is there an analysis where I can "plug and play" the data I have and see whether the sequences come out as orthologs (hypothetically, one orthologous group for protein A and one for protein B)? I do not want an analysis that searches a database for orthologs; I want to identify them among the sequences already in my dataset (which were, obviously, included based on certain preselected criteria).
You can use OrthoMCL for this purpose.
Thank you for the answer. I have read up along similar lines, but was not quite sure whether it was the best approach. Will give it a try though.
I think it should help you.
Here is a very brief introduction to how it works. It takes a set of protein sequences (say, the proteomes of three different species) and performs homology-based clustering: first it runs BLAST (for sequence similarity), then it clusters the results with the MCL program. Finally, it predicts which proteins are paralogs and which are orthologs and groups them accordingly.
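To make the "similarity search, then clustering" idea concrete, here is a minimal Python sketch. It is not OrthoMCL's actual algorithm: it uses plain reciprocal best hits followed by connected-component grouping as a stand-in for OrthoMCL's score normalization and MCL clustering. The sequence IDs, the `species|gene` naming scheme, and the bit scores are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical BLAST-style hits: (query, subject, bit_score).
# IDs are invented; the part before "|" is treated as the species.
hits = [
    ("ath|A", "osa|A1", 410.0),
    ("osa|A1", "ath|A", 405.0),
    ("ath|B", "osa|B1", 390.0),
    ("osa|B1", "ath|B", 395.0),
    ("ath|A", "osa|B1", 250.0),
    ("osa|B1", "ath|A", 240.0),
    ("ath|A", "ath|B", 260.0),
    ("ath|B", "ath|A", 260.0),
]


def species(seq_id):
    """Return the species prefix of a 'species|gene' identifier."""
    return seq_id.split("|")[0]


def best_hits(hits):
    """For each (query, target species), keep the highest-scoring subject."""
    best = {}
    for query, subject, score in hits:
        if species(query) == species(subject):
            continue  # within-species hits are ignored in this sketch
        key = (query, species(subject))
        if key not in best or score > best[key][1]:
            best[key] = (subject, score)
    return {key: value[0] for key, value in best.items()}


def reciprocal_best_hits(hits):
    """Return pairs of sequences that are each other's best cross-species hit."""
    best = best_hits(hits)
    pairs = set()
    for (query, _), subject in best.items():
        if best.get((subject, species(query))) == query:
            pairs.add(frozenset((query, subject)))
    return pairs


def cluster(pairs):
    """Group sequences linked by reciprocal best hits (connected components)."""
    graph = defaultdict(set)
    for pair in pairs:
        a, b = tuple(pair)
        graph[a].add(b)
        graph[b].add(a)
    seen, groups = set(), []
    for node in sorted(graph):
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(graph[current] - component)
        seen |= component
        groups.append(sorted(component))
    return groups


groups = cluster(reciprocal_best_hits(hits))
print(groups)
```

With this toy input the A-like and B-like sequences fall into two separate groups, which is the kind of result the question is hoping for. On real data the hits would come from an all-against-all BLAST of the dataset, and a single best-hit rule like this one can be misled by gene loss or missing sequences.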
Note that this is a shortcut that I would be wary of using in this case. Strictly speaking, by their very definition, paralogy and orthology can only be inferred from a phylogenetic tree. I would add the sequences to be tested to the relevant multiple sequence alignment, rebuild a phylogenetic tree from it, and then infer the relationships.
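As a rough illustration of the last step (inferring relationships from the tree), here is a small Python sketch of the species-overlap heuristic on a rooted, binary gene tree. An internal node whose two child clades share a species is treated as a duplication, otherwise as a speciation; two sequences are then called paralogs or orthologs according to the event at their last common ancestor. The tree, the sequence IDs, and the `species|gene` naming are hypothetical, and a real analysis would read the tree from the output of a tree-building program rather than hard-code it.

```python
# Hypothetical rooted gene tree as nested 2-tuples; leaves are "species|gene".
tree = (
    (("ath|A", "osa|A1"), "ppa|A"),
    (("ath|B", "osa|B1"), "ppa|B"),
)


def species(leaf):
    """Return the species prefix of a 'species|gene' identifier."""
    return leaf.split("|")[0]


def leaves(node):
    """List all leaf names below a node."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def classify(node, result=None):
    """Label each leaf pair by the event at their last common ancestor."""
    if result is None:
        result = {}
    if isinstance(node, str):
        return result
    left, right = node  # assumes a strictly binary tree
    left_leaves, right_leaves = leaves(left), leaves(right)
    shared = {species(x) for x in left_leaves} & {species(x) for x in right_leaves}
    # Shared species on both sides suggests a duplication at this node.
    label = "paralogs" if shared else "orthologs"
    for a in left_leaves:
        for b in right_leaves:
            result[frozenset((a, b))] = label
    classify(left, result)
    classify(right, result)
    return result


relations = classify(tree)
for pair, label in sorted((sorted(p), l) for p, l in relations.items()):
    print(pair, label)
```

The heuristic depends on the tree being correctly rooted and reasonably well supported, so it is worth checking branch support before trusting the labels.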
Perhaps I should also mention that not all the species/sequences we include are necessarily from fully sequenced and annotated genomes. Is OrthoMCL not too "specialised" in that regard, given that it relies on these assumptions?