I'm trying to map human orthologs to the allotetraploid Xenopus laevis. I've tried RBH BLAST and looked at a number of other ortholog-finding tools, but they all seem to work on finding one-to-one relationships. This is problematic for X. laevis, which has two copies of almost every gene. Can anyone recommend a workflow that is capable of handling many-to-one relationships, or does anyone have any bright ideas?
I've also tried Xenbase's manually curated human orthology, but I need Ensembl IDs, and converting from Xenbase to Entrez to Ensembl is very messy.
It depends on your workflow. If you are using FASTA sequences as a starting point, you only need to filter out near-identical sequences, which will hopefully get rid of all duplicated proteins. This can be done using CD-HIT.
When two or more sequences share >=95% identity, this program will remove everything but the longest sequence in that cluster. After this step you do the orthology finding as usual.
Two problems. First, the WGD is ancient: many of the duplicated genes have low sequence identity (> 70%) but still have functionally identical roles. Second, I need to know the human ortholog for both gene copies.
CD-HIT can cluster at 70% identity, and even down to 40%.
When you find a human ortholog for one of the two protein copies, presumably you have found it for the other copy as well. It is a simple functional transfer. CD-HIT creates .clstr files which tell you what proteins were grouped together.
I am developing software for finding local alignments. Could you please send me one (or more) of the sequences from Xenopus laevis? I'd like to check whether the results could be helpful for you. I have had success aligning some highly diverged species. Then maybe I could think of a workflow...
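For the functional transfer via CD-HIT's .clstr files suggested above, here is a minimal Python sketch. It assumes the usual .clstr layout, where each cluster starts with a ">Cluster N" line and the representative sequence is marked with "*". The sequence IDs (geneA.L, geneA.S, geneB.L), the HUMAN_GENE_A label, and both function names are made up for illustration. Real IDs may be truncated in the .clstr file depending on how CD-HIT was run, so check a few by hand.

```python
import re

# Matches one member line of a CD-HIT .clstr file, for example:
#   "0\t350aa, >geneA.L... *"            (cluster representative)
#   "1\t340aa, >geneA.S... at 82.50%"    (other cluster member)
MEMBER_RE = re.compile(r">(.+?)\.\.\.\s*(\*|at\s.+)$")


def parse_clstr(lines):
    """Return {representative_id: [all member ids]} from .clstr lines."""
    clusters = []
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = {"rep": None, "members": []}
            clusters.append(current)
            continue
        match = MEMBER_RE.search(line)
        if match is None or current is None:
            continue  # skip anything that does not look like a member line
        seq_id, tag = match.groups()
        current["members"].append(seq_id)
        if tag == "*":
            current["rep"] = seq_id
    return {c["rep"]: c["members"] for c in clusters if c["rep"] is not None}


def transfer_orthologs(rep_to_human, clusters):
    """Copy the human ortholog found for each representative to every member."""
    transferred = {}
    for rep, members in clusters.items():
        human = rep_to_human.get(rep)
        if human is None:
            continue  # no ortholog was found for this representative
        for member in members:
            transferred[member] = human
    return transferred


# Hypothetical example data, not real CD-HIT output.
SAMPLE = (
    ">Cluster 0\n"
    "0\t350aa, >geneA.L... *\n"
    "1\t340aa, >geneA.S... at 82.50%\n"
    ">Cluster 1\n"
    "0\t200aa, >geneB.L... *\n"
)

clusters = parse_clstr(SAMPLE.splitlines())
mapping = transfer_orthologs({"geneA.L": "HUMAN_GENE_A"}, clusters)
print(mapping)
```

With a real file you would pass the open file handle to parse_clstr instead of SAMPLE.splitlines(). Note that this assigns the same human gene to every member of a cluster, so it is only as good as the assumption that clustered copies share an ortholog.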
You can use OrthoFinder, which will give you one-to-one, one-to-many, many-to-one, and many-to-many.
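If you go the OrthoFinder route, something like the following Python sketch could label each row of a pairwise orthologue table by relationship type. It assumes a tab-separated table with a header row, one column per species, and comma-separated gene IDs in each cell, which is how I understand OrthoFinder's pairwise orthologue files to be laid out; please check against your own output. The column names, gene IDs, and the function name are hypothetical.

```python
import csv
import io


def classify_orthologs(handle, human_col, frog_col):
    """Return (human_genes, frog_genes, kind) tuples, one per table row.

    The kind is named from the human side, so one human gene with two
    X. laevis copies is reported as "one-to-many".
    """
    results = []
    for row in csv.DictReader(handle, delimiter="\t"):
        human = [g.strip() for g in row[human_col].split(",") if g.strip()]
        frog = [g.strip() for g in row[frog_col].split(",") if g.strip()]
        if not human or not frog:
            continue  # skip rows that lack genes for one of the species
        if len(human) == 1 and len(frog) == 1:
            kind = "one-to-one"
        elif len(human) == 1:
            kind = "one-to-many"
        elif len(frog) == 1:
            kind = "many-to-one"
        else:
            kind = "many-to-many"
        results.append((human, frog, kind))
    return results


# Hypothetical table contents, not real OrthoFinder output.
SAMPLE_TABLE = (
    "Orthogroup\tHomo_sapiens\tXenopus_laevis\n"
    "OG0000001\tHUMAN_GENE_A\tgeneA.L, geneA.S\n"
    "OG0000002\tHUMAN_GENE_B\tgeneB.L\n"
)

rows = classify_orthologs(io.StringIO(SAMPLE_TABLE), "Homo_sapiens", "Xenopus_laevis")
kinds = [kind for _, _, kind in rows]
print(kinds)
```

Filtering the result for "one-to-many" rows would give the human genes that keep both X. laevis copies, which is the case the question is about.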