Question

How Do Biologists Use Inferred Phylogenies?

13.5 years ago

Ehamberg ▴ 130

I work on molecular phylogenetic inference using maximum likelihood methods, but from the computer science side of things, as it is an interesting problem area for heuristic search. However, I would like to know a bit more about how biologists actually use inferred trees, and I would appreciate it if someone could answer some questions I have after working on this for the last few months:

1. If one infers a phylogenetic tree for a set of sequences, is the resulting tree considered “probably correct”? Only a hypothesis?
2. Are techniques such as bootstrapping used to help guide this belief? Other techniques?
3. How many sequences is it common to infer a phylogeny from? I ask this because I know the problem complexity grows very quickly as more species are added (hence the need for heuristic search).
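
To put a number on how quickly the search space grows: the count of distinct unrooted, fully bifurcating topologies on n labelled tips is the standard double factorial (2n-5)!!. A minimal sketch follows; the function name is just an illustrative choice.

```python
def num_unrooted_topologies(n_taxa: int) -> int:
    """Count unrooted, fully bifurcating trees on n labelled tips: (2n-5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3 * 5 * ... * (2n - 5); empty product for n = 3.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20, 50):
    print(n, num_unrooted_topologies(n))
```

Exhaustive search is therefore only feasible for a handful of taxa, which is why heuristics are used in practice.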

phylogenetics • 4.8k views

Updated 13.5 years ago by David W 4.9k • written 13.5 years ago by Ehamberg ▴ 130

Bootstrapping is not a panacea for quantifying uncertainty. Say that you use a stupid inference procedure that always infers ((a,b),(c,d)) regardless of the data. This would have 100% bootstrap support. Other less stupid inference procedures can give similarly misleading bootstrap percentages when interpreted as the posterior probability that the branch exists.
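
A toy sketch of that point, assuming a made-up four-taxon alignment stored as a list of site columns and a deliberately data-blind "inference" function (both hypothetical, for illustration only). Whatever the resampled columns look like, the bootstrap support comes out as 100%.

```python
import random


def bootstrap_support(columns, infer, n_reps=200, seed=0):
    """Fraction of bootstrap replicates whose inferred tree matches the full-data tree."""
    rng = random.Random(seed)
    reference = infer(columns)
    hits = 0
    for _ in range(n_reps):
        # Resample alignment columns with replacement (the nonparametric bootstrap).
        resampled = [rng.choice(columns) for _ in columns]
        if infer(resampled) == reference:
            hits += 1
    return hits / n_reps


def constant_inference(columns):
    """Deliberately useless procedure: ignores the data and always returns the same tree."""
    return "((a,b),(c,d))"


# Hypothetical toy data: each string is one site, with characters for taxa a, b, c, d.
columns = ["AACC", "ACAC", "ACCA", "ACAC", "ACCA"]
support = bootstrap_support(columns, constant_inference)
print(support)
```

The support value here measures only how repeatable the procedure is under resampling, not whether the tree is right.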

Reply • 13.5 years ago by Asdf ▴ 50

score 5 · Answer 1 · 2011-05-16

Hello,

to answer your questions:

1 and 2. I wouldn't say it is just a hypothesis. The phylogenetic tree you build is there to answer a question, so it is closer to an answer to a hypothesis than to a hypothesis itself. Normally you have already chosen a specific program/algorithm for your alignments and a given method (maximum likelihood, neighbor joining, ...) to best address your original question, taking into account the potential specificities of your dataset. From this perspective the tree is already a result per se; you just have to be careful when you draw conclusions from it. Indeed, you have to test how confident you can be in your results. This can start with checking (and sometimes improving) the quality of your alignments, and then of the tree itself. Bootstrapping is a common method to test the robustness of a tree. The final result of such an analysis tells you how confident you can be in each node/branch of your tree. Sometimes the whole tree will be well supported while, at other times, part of it will have low bootstrap support.
3. In my opinion there is no precise answer to this question. Three sequences are enough to build a phylogeny. Generally the number of sequences you include is determined by the question you ask: typically, which sequence you want to study (the whole genome, a specific gene, a gene family) and in which species (related bacteria, all mammals, ...). Then you take into account the sequences available for this set of genes/species. You can sometimes shrink this sequence set if you think some sequences are of poor quality and might reduce the confidence you can have in the results.

I hope it has been helpful.

score 4 · Answer 2 · 2011-05-16

13.5 years ago

Botond Sipos ★ 1.7k

I would recommend the following reviews:

Holder M, Lewis PO. Phylogeny estimation: traditional and Bayesian approaches. Nat Rev Genet. 2003 4(4):275-84.
Huelsenbeck JP, Rannala B. Phylogenetic methods come of age: testing hypotheses in an evolutionary context. Science. 1997 276(5310):227-32.

From the first you can learn about the common practices and the second gives an overview of how phylogenies can be used for testing evolutionary hypotheses.

Thanks for these! Downloaded both now. Looks really interesting.

Reply • 13.5 years ago by Ehamberg ▴ 130

score 3 · Answer 3 · 2011-05-16

13.5 years ago

Asdf ▴ 50

It sounds like all three of your points are related to quantifying the uncertainty in the tree inference. The most common such quantification seems to be to put a "branch support" between 0 and 100 on each inferred branch. There are also things like split networks, majority consensus trees, and wandering taxa which all address uncertainty in the inferred tree. A coherent way to "use" inferred phylogenies (for example if your goal is to estimate some function of the tree) is to integrate over trees sampled from a posterior distribution. For example if you want to estimate the total branch length of a tree, then you can take the average total branch length over all posterior tree samples in a Bayesian framework. In this case I guess your job as a computer scientist would be to provide this posterior sample. Another puzzle that could probably use more research is to visualise uncertainty in the posterior samples.
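
A minimal sketch of the total-branch-length example, assuming the posterior sample is available as Newick strings with branch lengths after each colon and no colons inside taxon labels. The three sample trees and the function names are invented for illustration; a real sample would come from an MCMC run.

```python
import re
from statistics import mean

# Matches the numeric branch length that follows each ':' in a Newick string.
BRANCH_LENGTH = re.compile(r":([0-9.eE+-]+)")


def total_branch_length(newick: str) -> float:
    """Sum all branch lengths found in one Newick tree string."""
    return sum(float(x) for x in BRANCH_LENGTH.findall(newick))


def posterior_mean_tree_length(samples) -> float:
    """Average the tree length over all sampled trees, whatever their topology."""
    return mean(total_branch_length(tree) for tree in samples)


# Hypothetical posterior sample; note that the topologies differ between samples.
samples = [
    "((a:0.1,b:0.2):0.05,(c:0.1,d:0.1):0.05);",
    "((a:0.1,c:0.3):0.02,(b:0.2,d:0.1):0.08);",
    "((a:0.2,b:0.2):0.05,(c:0.1,d:0.2):0.05);",
]
print(posterior_mean_tree_length(samples))
```

The estimate never commits to a single topology, so the uncertainty in the tree is carried through to the quantity of interest.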

Good answer. Since you talked about visualising uncertainty in posteriors, I thought I'd point out DensiTree, which does a great job of this: http://www.cs.auckland.ac.nz/~remco/DensiTree/DensiTree.html

Reply • 13.5 years ago by David W 4.9k

Interesting. There are quite a few new terms I have to look up here. :)

Thanks!

Reply • 13.5 years ago by Ehamberg ▴ 130

score 3 · Answer 4 · 2011-05-17

Hi Ehamberg,

You've got some good answers already, but I thought I'd add mine from the point of view of someone that does "whole organism biology":

To get to your numbered points I think you need to know why someone is estimating a phylogeny. Like pretty much all science, it's most interesting when you start with a hypothesis you want to test: "Have moa lived in New Zealand since Gondwana broke up?", "Are Neanderthals and humans really different species?", "How many times has flightlessness evolved in birds?" etc. Then the answers here make more sense:

1. I think of inferred phylogenies as estimates, which include uncertainty from the sequences and from the methods used to infer the tree. If the results are robust to different methods/models, nodes have good support, and diagnostic tests on the data (saturation etc.) all line up, then I think there's a pretty good chance we have the right relationships.
2. As I'm sure you know, bootstraps tell us how consistently the data match the consensus tree, NOT how sure we are that we have the right tree. I really prefer Bayesian methods, in which the support values make more sense (the number of times that node was sampled in the MCMC). Of course, then you have the problem of how sensible your priors are etc., but there are just too many ways to get bootstraps of 1.0 on the 'wrong' tree (in fact, throw enough data at a problem and you will get 1.0 even if the data is simulated using a 49:51 mix of two conflicting topologies). The Bayesian approach also makes more sense in molecular dating and studying character evolution: your MCMC has the uncertainty in the tree built into the estimate of the parameters you care about (which relates to 1.).
3. This is often the bit in which people writing software packages and people doing lab work have the hardest time talking to each other. The answer is as many as you can reasonably get, but getting them can be really hard. At the moment, I would say that to get a decent estimate of a species tree you would want to use three loci, each with at least 600bp of sequence. Plenty of phylogenies will use 20 if the organisms in question are close enough to a model organism to make getting the sequences easy (next gen sequencing will no doubt slowly drag us non-model types in the same direction ;). If you mean the number of OTUs, it's really anything between 10s and 1000s depending on the question.
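
The remark above about a 49:51 mix can be illustrated with a deliberately crude sketch. The assumptions are mine, not a real phylogenetic method: each site independently "votes" for topology B with probability 0.51 (otherwise A), and the toy inference simply picks the topology with more votes. The function names are hypothetical.

```python
import random


def simulate_sites(n_sites, p_b=0.51, seed=1):
    """Each site independently favours topology B (True) with probability p_b, else A."""
    rng = random.Random(seed)
    return [rng.random() < p_b for _ in range(n_sites)]


def majority_topology(sites):
    """Toy stand-in for tree inference: the topology favoured by more sites wins."""
    return "B" if 2 * sum(sites) > len(sites) else "A"


def bootstrap_support(sites, n_reps=100, seed=2):
    """Fraction of bootstrap replicates that recover the full-data topology."""
    rng = random.Random(seed)
    reference = majority_topology(sites)
    hits = sum(
        # Resample sites with replacement and re-run the toy inference.
        majority_topology(rng.choices(sites, k=len(sites))) == reference
        for _ in range(n_reps)
    )
    return hits / n_reps


small_support = bootstrap_support(simulate_sites(100))
large_support = bootstrap_support(simulate_sites(100_000))
print("100 sites:", small_support)
print("100000 sites:", large_support)
```

With few sites the support is unstable, but with enough sites it climbs towards 1.0 for a single topology, even though nearly half of the data was generated under the other one.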

If you are interested in a text on the maths/comp. sci. of phylogeny, these two (1,2) are meant to be good (I'm a humble empiricist, so can't judge them).

score 2 · Answer 5 · 2011-05-16

Hi.

1 and 2. You always need to bootstrap (typically 100-1000 times) and see how often a given clade stays together. That gives you an idea of "how likely" it is that the clade means anything. Whether the tree is a hypothesis depends on the study: for some the phylogenetic tree is a start, for others the end product.
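
As a rough sketch of "how often a given clade is together", assuming the bootstrap replicate trees are already inferred and stored as rooted nested tuples of taxon names (a simplification; real tools read Newick files and usually work with unrooted splits). The replicate trees and function names below are invented for illustration.

```python
from collections import Counter


def collect_clades(tree, found):
    """Return the tip set under this node, recording every internal node's tip set in `found`."""
    if isinstance(tree, str):
        return frozenset([tree])
    tips = frozenset().union(*(collect_clades(child, found) for child in tree))
    found.add(tips)
    return tips


def clade_support(replicates):
    """Map each clade to the fraction of replicate trees that contain it."""
    counts = Counter()
    for tree in replicates:
        found = set()
        collect_clades(tree, found)
        counts.update(found)
    return {clade: n / len(replicates) for clade, n in counts.items()}


# Hypothetical bootstrap replicates for four taxa.
replicates = [
    (("a", "b"), ("c", "d")),
    (("a", "b"), ("c", "d")),
    (("a", "c"), ("b", "d")),
    (("a", "b"), ("c", "d")),
]
support = clade_support(replicates)
print(support[frozenset({"a", "b"})])
```

The same counting applies to trees sampled from a posterior, where the resulting fraction is read as a posterior clade probability instead of a bootstrap percentage.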

3. This can go from a handful of sequences to thousands. If there are many, the problem is typically solved stepwise: you use a short sequence and/or very few bootstraps, and then redo the tree using longer sequences within defined branches, including outgroups from other clades. Or you run a computer for days or weeks.

I think there are already some heuristic methods, or some "genetic algorithms", that find a good tree.