Selection Of Orthologs For Building Phylogenetic Tree
10.7 years ago
Pappu ★ 2.1k

I am having difficulty selecting orthologous sequences for phylogenetic tree construction. Searching the TrEMBL database yields too many hits. For example, since human is known to express three paralogs, closely related species should have an equivalent number of orthologs. However, sequence searches in UniProt yield either too few or too many hits. I guess I will have to review all the sequences manually, which would be painful.

phylogenetic-tree • 3.5k views

Not sure, but can you use BioMart?

10.7 years ago
DG 7.3k

You definitely cannot assume that a closely related species has the same number of orthologs as human has paralogs. We know that even among the Great Apes there are differences in the number of paralogs, with differential gain/loss (mostly loss) between species. As you expand out phylogenetically, this becomes even more apparent.

There are many ways to approach this problem, and the choice partly depends on your taxonomic selection. If you are mostly working within vertebrates with whole genomes sequenced, you could, for instance, rely on Ensembl's paralog/ortholog definitions to gather all of the 1:1 orthologs for a given sequence in other organisms. This isn't perfect, but in my experience those relationships are fairly well curated.
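As a rough illustration of keeping only the 1:1 orthologs, here is a minimal Python sketch that filters a tab-separated homology table. The column names (`gene_id`, `species`, `homology_type`) and the gene IDs are made up for the example, and the label `ortholog_one2one` is assumed to match whatever your own export uses, so check the header and values of your file before relying on it.

```python
import csv
import io


def one_to_one_orthologs(tsv_text, type_column="homology_type",
                         wanted="ortholog_one2one"):
    """Return the rows whose homology type marks a 1:1 ortholog.

    The column name and the label are assumptions about the export format.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row[type_column] == wanted]


# Hypothetical table; the IDs are placeholders, not real accessions.
example = (
    "gene_id\tspecies\thomology_type\n"
    "GENE_A\tchimpanzee\tortholog_one2one\n"
    "GENE_B\tmouse\tortholog_one2many\n"
    "GENE_C\tmouse\tortholog_one2many\n"
    "GENE_D\tdog\tortholog_one2one\n"
)

kept = one_to_one_orthologs(example)
print([row["gene_id"] for row in kept])
```

Species that only have one-to-many relationships (mouse in this toy table) drop out entirely, which is exactly the kind of gain/loss case worth inspecting by hand rather than discarding silently.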

Of course, increased taxon sampling generally gives you a better ability to make broader statements about the biological question at hand. I have found OrthoMCL a good way to cluster large numbers of sequences into orthologous groups, although it can sometimes be a little overzealous. You can also use its precomputed OrthoMCL group assignments on sequences you are interested in and have already defined (for instance, those TrEMBL hits).

If you are working with lots of sequences and really broad taxonomic coverage, paralog removal for getting accurate phylogenies is basically a whole field of study in and of itself. It is a hard problem.


Thank you. I don't see any option in Ensembl for downloading ortholog sequences. In BioMart I obtain many transcript variants of one ortholog, for one species at a time. I am wondering how to download all the ortholog CDSs so that I can check them manually.


Well, you can link a transcript ID to its genomic DNA gene ID. Otherwise, you can take the longest canonical transcript sequence as (probably) being representative of the CDS. As an alternative, you can use a BioMart query to get the IDs used by other databases (UniProt, etc.). I tend to work almost exclusively with protein sequences, though, when I am doing phylogenetics.
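To collapse many transcript variants down to one sequence per gene, something like the sketch below could work. It simply keeps the longest sequence seen for each gene ID; the record layout (gene ID, transcript ID, sequence) and the IDs are hypothetical, and "longest" is only a stand-in for "representative", as noted above.

```python
def longest_per_gene(records):
    """Keep the longest sequence for each gene.

    records: iterable of (gene_id, transcript_id, sequence) tuples.
    Returns a dict mapping gene_id -> (transcript_id, sequence).
    On ties, the first transcript encountered is kept.
    """
    best = {}
    for gene_id, transcript_id, seq in records:
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (transcript_id, seq)
    return best


# Hypothetical transcript variants for two genes.
variants = [
    ("GENE_A", "TX_A1", "ATGAAATAA"),
    ("GENE_A", "TX_A2", "ATGAAACCCGGGTAA"),
    ("GENE_B", "TX_B1", "ATGTTTTAA"),
    ("GENE_A", "TX_A3", "ATGTAA"),
]

chosen = longest_per_gene(variants)
for gene_id, (transcript_id, seq) in chosen.items():
    print(gene_id, transcript_id, len(seq))
```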


Could you cite some papers that review the state of the art on the problem of paralog removal in broad taxonomic groups? Thanks.

9.5 years ago
cdsouthan ★ 1.9k

The trees are pre-cooked for you in TreeFam and Ensembl gene trees, so you need to learn to read them to detect the possible gain/loss events. You can then extract and sequence-confirm whatever you want to re-tree. The main caveat is that a high proportion of the ORFs from genomic pipelines (and some in TrEMBL) are incorrect (e.g. often truncated); see http://cdsouthan.blogspot.se/2014/01/a-tale-of-two-targets-bace1-and-bace2.html
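One crude way to shortlist suspicious ORFs before re-treeing is to compare each protein's length with the rest of its ortholog group. The sketch below is only a heuristic of my own choosing, not the method from the linked post: it flags sequences shorter than an arbitrary fraction (`min_fraction`, here 0.8) of the group's median length, or lacking an initial methionine. The sequences are synthetic placeholders.

```python
from statistics import median


def flag_possibly_truncated(seqs, min_fraction=0.8):
    """Flag protein sequences that look truncated relative to their group.

    seqs: dict mapping a label to a protein sequence.
    A sequence is flagged if it is shorter than min_fraction times the
    median length of the group, or if it does not start with 'M'.
    Both criteria are rough assumptions, not validated thresholds.
    """
    typical = median(len(s) for s in seqs.values())
    flagged = []
    for name, s in seqs.items():
        if len(s) < min_fraction * typical or not s.startswith("M"):
            flagged.append(name)
    return flagged


# Synthetic example: 'mouse' is short, 'dog' lacks the start methionine.
group = {
    "human": "M" + "A" * 99,
    "chimp": "M" + "A" * 98,
    "mouse": "M" + "A" * 40,
    "dog": "A" * 100,
}

suspects = flag_possibly_truncated(group)
print(suspects)
```

Anything flagged this way still needs checking against the alignment and the genomic locus, since a short sequence can also be a genuine isoform or a real lineage-specific change.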
