I would like to have your opinion about an issue I am dealing with quite often during my work.
I often build phylogenetic trees for the proteins I work with. For some of them, the phylogenetic relationships with other proteins have been well studied, so it is easy to go back to the literature and choose protein sequences to use as an outgroup.
In other cases, no studies of the proteins' phylogenetic relationships are available. How would you then choose outgroup sequences to root the tree?
I often analyze the domain composition of the protein and then pick, as an outgroup, proteins that have similar catalytic domains but are involved in other biological functions or processes.
There appears to be a bit of a fundamental misunderstanding present in what you're doing, at least based on my interpretation of your question and responses. I also disagree with the discussion. Please correct me if I'm misinterpreting what you're saying.
The first step in building a tree, the sequence alignment, is an inference of homology. In other words, you are assuming that each site is truly homologous - has shared ancestry - across all individuals/samples. If you are using sequences that are not homologous, your tree is meaningless in an evolutionary context. An alignment should consist of DNA or amino acid sequences from the same protein across all your samples. You don't want to make a tree consisting of multiple proteins; this, too, is meaningless. If you are not confident with respect to the homology of your sequences and just want to build a tree describing similarity (a dendrogram), go ahead, but realize that it is not phylogenetic. You might also consider restricting analyses to the sites you can be confident are homologous, thus salvaging some of your data.
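To make the last suggestion concrete, here is a minimal sketch of restricting an alignment to columns you are more willing to trust. It uses the fraction of gaps per column as a crude proxy for confidence in homology; that proxy, the `filter_columns` name, the `max_gap_fraction` threshold, and the toy sequences are all assumptions for illustration, not something prescribed above.

```python
def filter_columns(alignment, max_gap_fraction=0.2, gap_chars="-."):
    """Keep only alignment columns whose gap fraction is at most max_gap_fraction.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    # Indices of the columns that pass the gap threshold.
    keep = [
        i for i in range(length)
        if sum(s[i] in gap_chars for s in seqs) / len(seqs) <= max_gap_fraction
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


# Hypothetical toy alignment; the fourth column is mostly gaps.
toy_alignment = {
    "protA": "MKT-LLVA",
    "protB": "MKTALLVA",
    "protC": "MR--LLIA",
    "protD": "MKT-LLV-",
}
filtered = filter_columns(toy_alignment, max_gap_fraction=0.25)
```

With the 0.25 threshold only the mostly-gapped fourth column is dropped. A gap count says nothing about whether the remaining residues are correctly aligned, so treat this as a first pass rather than a test of homology.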
The choice of an outgroup - where to root the tree - is also not arbitrary. It's a hypothesis, and the choice of which individual you use has lots of implications for the conclusions you might make using the tree. Due to substitution rate heterogeneity and the influence of various evolutionary forces on a locus, the most distantly related individual or sample might not be an appropriate outgroup. In this sense, I disagree with Adrian in the comments, but I do agree with his recommendation to midpoint root in the absence of any other information. Remember, a tree doesn't have to be rooted, and many popular analyses recommend/require an unrooted tree.
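For readers unsure what midpoint rooting actually does, here is a rough sketch in plain Python. It finds the two tips that are farthest apart on an unrooted tree and reports the branch, and the position on that branch, that lies halfway between them. The edge-list representation, the `midpoint` function name, and the toy branch lengths are assumptions made for illustration; in practice you would use the rooting function of your tree library or viewer.

```python
from collections import defaultdict


def midpoint(edges):
    """Locate the midpoint of the longest tip-to-tip path in an unrooted tree.

    edges: list of (node_a, node_b, branch_length) tuples.
    Returns (tip1, tip2, path_length, (parent, child, offset)): the root would
    sit on the branch parent-child, `offset` away from parent.
    """
    adj = defaultdict(list)
    for a, b, w in edges:
        adj[a].append((b, w))
        adj[b].append((a, w))

    def farthest(start):
        # Iterative depth-first walk recording distance and predecessor.
        dist, prev, stack = {start: 0.0}, {start: None}, [start]
        while stack:
            node = stack.pop()
            for nxt, w in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + w
                    prev[nxt] = node
                    stack.append(nxt)
        tips = [n for n in dist if len(adj[n]) == 1]
        return max(tips, key=dist.get), dist, prev

    # Two sweeps find the longest path in a tree with non-negative lengths.
    tip1, _, _ = farthest(next(iter(adj)))
    tip2, dist, prev = farthest(tip1)

    half = dist[tip2] / 2.0
    # Walk back from tip2 towards tip1 until the halfway point is crossed.
    child = tip2
    while dist[prev[child]] > half:
        child = prev[child]
    parent = prev[child]
    return tip1, tip2, dist[tip2], (parent, child, half - dist[parent])


# Hypothetical unrooted tree: tips A-D, internal nodes n1 and n2.
toy_edges = [
    ("A", "n1", 1.0),
    ("B", "n1", 2.0),
    ("n1", "n2", 1.0),
    ("C", "n2", 1.0),
    ("D", "n2", 6.0),
]
result = midpoint(toy_edges)
```

In this toy tree the longest path runs between D and B, so the root lands on D's long branch. That is exactly the behaviour to be wary of: midpoint rooting implicitly assumes roughly equal rates across lineages, so a single fast-evolving sequence can pull the root onto its own branch.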
Hi Brice, thanks very much for your comment. I totally agree with you concerning the first part of your post. The proteins I am dealing with are all homologous. They have been described as involved in similar biological processes and as having been acquired by bacteria via HGT. As far as the outgroup is concerned, I agree that choosing an appropriate outgroup is a fundamental part of addressing specific phylogenetic questions. However, I also realize that it seems to be common practice to add outgroups in order to increase the robustness of the ingroup of the tree. Maybe this is the part I am not completely sure about: when is it necessary to use an outgroup? I have found quite contrasting views on this topic. Can you suggest any good reading?
written 9.7 years ago by dago, updated 2.5 years ago by Ram
I'm facing a similar issue. I'm working on constructing a phylogenetic tree for Arabidopsis protein orthologs, selecting one representative from each plant order. Additionally, I'm dividing the protein into intra- and extracellular domains to create two distinct phylogenies. The unrooted trees for the extracellular and intracellular domains show differences. I'm also aware of the well-established phylogenetic relationships between the species.
Given this context, how should I root my tree? Is it feasible to examine the alignment identity matrix and select the protein with the lowest percentage in each tree as an outgroup (even if this means having different outgroups between the phylogenies)? Alternatively, should I use a known species as the outgroup based on the established Viridiplantae phylogeny?
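The identity-matrix idea in the question can be sketched as follows. The code computes pairwise percent identity over ungapped columns of an alignment and reports the sequence with the lowest mean identity to the others. The function names, the choice to ignore gapped columns, and the toy sequences labelled `order1` to `order4` are assumptions for illustration only.

```python
from itertools import combinations


def percent_identity(seq_a, seq_b, gap_chars="-."):
    """Percent identity over columns where neither sequence has a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b)
             if a not in gap_chars and b not in gap_chars]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def mean_identity(alignment):
    """Mean pairwise identity of each sequence to all the others."""
    totals = {name: 0.0 for name in alignment}
    for a, b in combinations(alignment, 2):
        pid = percent_identity(alignment[a], alignment[b])
        totals[a] += pid
        totals[b] += pid
    return {name: total / (len(alignment) - 1) for name, total in totals.items()}


# Hypothetical aligned orthologs, one per plant order.
toy_orthologs = {
    "order1": "MKTLLVAG",
    "order2": "MKTLLVAG",
    "order3": "MKTLIVAG",
    "order4": "MRSLIQAG",
}
means = mean_identity(toy_orthologs)
candidate = min(means, key=means.get)  # sequence least similar to the rest
```

Note that the least similar sequence is only a candidate. Low identity can reflect a fast-evolving sequence rather than an early-diverging lineage, so it is worth checking the candidate against the species phylogeny before using it as the outgroup.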
What I would do is look at the unrooted tree. This will allow you to see which group is segregating further than the rest of your taxa, and you can pick it as an outgroup. Otherwise, just root the tree in the middle for convenience.
Thanks for your suggestion. What do you think about the approach of looking for proteins containing similar domains, as I describe in the post?
I think it could work. If the proteins are truly different, in the sense that they have unique catalytic domains, they should show longer branch lengths. You should then be able to spot them on the unrooted tree, grouping together and diverging from the rest, and use them as an outgroup.
Any other suggestions?