Question

Building A Phylogenetic Tree

1

12.3 years ago

Mark ▴ 10

Dear All,

What I am stating here is pretty much a textbook problem so please do not be offended if you find it too trivial. I have never been into phylogenetics but at the moment, I need to make a phylogenetic tree for a gene family of interest.

I extracted the sequences for this gene from different bacterial genomes, aligned the sequences and now I am ready to start. I looked at the available programs and there are just too many of them. For some reason, I liked PHYLIP. So I started to use it.

What I propose to do is the following: use the alignment to build trees with distance-based methods (both NJ and UPGMA) and see whether they produce the same tree. If they do, I will be happy to use that tree. If not, I might need to explore parsimony- or likelihood-based methods. However, from what I have read so far, I understand that these methods take longer to run and sometimes do not even converge. I actually tried the dnapenny program from the PHYLIP package and got the message "search broken off" after the program had been running for at least a couple of hours. Any feedback or suggestions on my problem will be highly appreciated.

Thanks and regards, Mark.

phylogenetics tree • 5.5k views

ADD COMMENT • link updated 10.8 years ago by Biostar 20 • written 12.3 years ago by Mark ▴ 10

score 3 · Answer 1 · 2012-07-30

12.3 years ago

Leonor Palmeira 3.9k

Here is a little background on phylogenetic inference methods. First of all, I would say there are four big types of inference methods (roughly in their historical order):

Maximum parsimony methods
Distance-based methods
Maximum likelihood methods
Bayesian methods

Nowadays, the gold standard is maximum likelihood as well as Bayesian methods, especially because they implement probabilistic models of evolution, which can be made more complex to model evolution more accurately (there are many papers on this; I could link you to a few if you are interested). The main issues with the other methods are (i) consistency and (ii) robustness. See this paper or this website for some insights. I would, for instance, never use UPGMA (inconsistencies) or parsimony (problems with high substitution rates or with long branches).
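To make the UPGMA point concrete, here is a minimal sketch in Python. It only looks at the first pair each method would join. The four taxa and their distances are made up for illustration: they are the path lengths on a tree where A and B are neighbours, C and D are neighbours, A and C sit on short branches (length 1), B and D sit on long branches (length 5), and the internal branch has length 1. The function names are mine, not from any package.

```python
from itertools import combinations

# Hypothetical distances, additive on the tree described above:
# (A,B) and (C,D) are the true neighbour pairs.
DIST = {
    ("A", "B"): 6, ("A", "C"): 3, ("A", "D"): 7,
    ("B", "C"): 7, ("B", "D"): 11, ("C", "D"): 6,
}
TAXA = ["A", "B", "C", "D"]


def d(x, y):
    """Symmetric lookup in the distance table."""
    return DIST[(x, y)] if (x, y) in DIST else DIST[(y, x)]


def upgma_first_join(taxa):
    """UPGMA starts by joining the pair with the smallest distance."""
    return min(combinations(taxa, 2), key=lambda pair: d(*pair))


def nj_first_join(taxa):
    """NJ starts by joining the pair minimising Q = (n-2)*d(i,j) - r_i - r_j."""
    n = len(taxa)
    row_sum = {t: sum(d(t, u) for u in taxa if u != t) for t in taxa}

    def q(pair):
        i, j = pair
        return (n - 2) * d(i, j) - row_sum[i] - row_sum[j]

    return min(combinations(taxa, 2), key=q)


print("UPGMA joins first:", upgma_first_join(TAXA))
print("NJ joins first:   ", nj_first_join(TAXA))
```

On these numbers UPGMA groups A with C, the two short-branch taxa, although they are not neighbours on the tree, while the NJ criterion picks a true neighbour pair. This is only a toy case with unequal rates, not a general proof.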

You can use PhyML (maximum likelihood), which is very fast. It can be used from within Seaview which I find quite useful.

0

Thanks Leonor. I will look at PhyML and see if I need to run it. Cheers

ADD REPLY • link 12.3 years ago by Mark ▴ 10

0

I always go for RAxML for ML and PhyloBayes for Bayesian inference. FastTree is excellent too if you just want a general idea of the topology.

ADD REPLY • link 10.8 years ago by 5heikki 11k

score 1 · Answer 2 · 2012-07-30

12.3 years ago

Stefano Berri 4.4k

Some info you want to provide:

Nucleic or protein sequences? How many sequences do you have? How long are they? What is the typical similarity? Do they contain a particular motif/domain?

Whichever methods you use, it is very likely that any two of them will give you different results, unless there are very few sequences or they are very "easy". In any case, you won't be able to say which one is correct. The approach is to do some bootstrapping (PHYLIP has a program that does it): produce 100 or 1000 resampled datasets, run the same programs on them, and then see what the consensus is. You will then be able to say how much you "trust" your phylogenetic tree.
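As a rough illustration of what the bootstrap does, here is a small Python sketch. It assumes the alignment is held as a dict of equal-length strings; the toy sequences, the function names and the clade-set representation of trees are my own assumptions, and the tree building itself is left to whatever program you use. In PHYLIP, the resampling and consensus steps are handled by seqboot and consense.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (same length as the original)."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def clade_support(clade, replicate_trees):
    """Fraction of replicate trees that contain the given clade.

    Each tree is assumed to be a set of clades, each clade a frozenset of taxa.
    """
    return sum(clade in tree for tree in replicate_trees) / len(replicate_trees)


# Toy alignment, for illustration only.
alignment = {"seq1": "ACGTACGT", "seq2": "ACGTTCGA", "seq3": "ACCTACGA"}
rng = random.Random(42)  # fixed seed so the run is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]

# Hypothetical trees from three replicates, already reduced to clade sets.
trees = [{frozenset("AB")}, {frozenset("AB")}, {frozenset("AC")}]
print(clade_support(frozenset("AB"), trees))
```

Each replicate is then run through the same tree-building program, and the support for a clade is how often it turns up across the replicate trees.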

Also, a lot of care and "manual" work needs to be spent "cleaning" the multiple alignment: remove columns that are gaps in most of the sequences and limit the analysis to regions of relative similarity. If there are MANY sequences (say > 100), you usually make a rough tree to find groups and then run the analysis within groups.
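The gap-removal part can be sketched in a few lines of Python. This is only a sketch under my own assumptions: the alignment is a dict of equal-length strings, gaps are written as "-" or ".", and the 50% cutoff is an arbitrary choice rather than a recommendation. It does not replace looking at the alignment by eye.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5, gap_chars="-."):
    """Keep only columns whose fraction of gap characters is at most the cutoff."""
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    n = len(seqs)
    keep = [
        i
        for i, column in enumerate(zip(*seqs))
        if sum(ch in gap_chars for ch in column) / n <= max_gap_fraction
    ]
    return {name: "".join(alignment[name][i] for i in keep) for name in names}


# Toy alignment, for illustration only: columns 3 and 6 are mostly gaps.
alignment = {"s1": "AC-GT-", "s2": "AC-GTA", "s3": "A--GT-", "s4": "ACTGT-"}
trimmed = drop_gappy_columns(alignment)
for name, seq in trimmed.items():
    print(name, seq)
```

Columns with only a few gaps are kept, so a sequence like s3 still carries a gap after trimming.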

Hope this helps

0

Thanks Stefano for your rather quick response. My alignment is a set of nucleotide sequences; it has ~500 sequences and is 1500 nt long. The similarity (at the protein level, since I did a BLASTP to find the homologs in the first place) is > 35%. I actually started to look into the bootstrapping and consensus programs in PHYLIP. I suppose I can run 1000 replicates, get a consensus tree for these replicates, and see if the two results (NJ vs UPGMA) match. The overall alignment is OK, with not too many "gapped regions", so I suppose I might not need to clean the alignment manually.

If there is something else that I should be wary of, please let me know. Cheers

ADD REPLY • link 12.3 years ago by Mark ▴ 10

0

May I ask whether it is possible that a given conserved motif does not align at the required position for some of the sequences in a multiple sequence alignment? What is the solution in this case? Hope it's not too much of a digression. Thanks

ADD REPLY • link 12.3 years ago by Olivier ▴ 440