Question

Calculating Bootstrap Support From Newick Trees

4

14.8 years ago

Michael Kuhn 5.0k

I'm using PhyML to compute phylogenetic trees. In principle, there's a parallel option (via MPI), but it doesn't work for me. Instead of spending lots of time debugging MPI, I was wondering if I could run the bootstraps independently (with a user-supplied tree) on my cluster. So the real question is: Given a list of trees in Newick format, how do I calculate the bootstrap support for the original tree?

phylogenetics • 7.5k views

3

14.8 years ago

Michael Kuhn 5.0k

Here's one approach (proposed 17 years ago), but I think it doesn't fully capture the "normal" way PhyML operates: use consense from Phylip to calculate the consensus tree from bootstrap trees generated by independent instances of PhyML. The consensus tree contains numbers that might be similar to the bootstrap values, but they are not projected onto the original tree.

(Marked as community wiki for further enlightenment.)

score 5 · Accepted Answer · 2010-07-08

You can use a strategy similar to what Phylip does. Generate 1000 resampled input files with SEQBOOT and run PhyML on each of them. This is not PhyML's real parallel mode; you simply start separate PhyML processes on different nodes/cores at the same time (using something like this).

At the end CONSENSE will calculate a consensus tree and give you the actual bootstrap values.
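To make the resampling step concrete, here is a minimal Python sketch of what a bootstrap replicate of an alignment is: the columns are drawn with replacement, and the same draw is applied to every sequence. It is not SEQBOOT's implementation, and the function name `bootstrap_alignment` and the toy alignment are made up for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    rng: a random.Random instance, so that replicates are reproducible.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_cols = lengths.pop()
    # Draw n_cols column indices with replacement. The same indices are used
    # for every sequence, so each resampled column stays intact.
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


# Toy alignment (hypothetical data, for illustration only).
alignment = {"A": "ACGTAC", "B": "ACGTTC", "C": "ATGTAC"}
rng = random.Random(42)
for replicate in (bootstrap_alignment(alignment, rng) for _ in range(3)):
    print(replicate)
```

Each replicate has the same taxa and the same length as the input, so it can be written out and handed to an independent tree search.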

Ram · Accepted Answer · 2010-12-22

Bootstrap replicates are independent, so to the first part of your question: yes, you can simply create however many bootstrapped matrices you require (e.g. 1000) and run a tree search on each of those on separate nodes of your cluster. You can use seqboot for this, or other utilities with similar functionality. As a plug for Bio::Phylo you might, for example, do the following:

use Bio::Phylo::IO 'parse';

my ($matrix) = @{ parse(
  -format => 'nexus', # or any of the other supported formats, e.g. 'phylip'
  -file   => 'myfile.nex', # or a string, url or handle
  -as_project => 1,
)->get_matrices };

for my $i ( 1 .. 1000 ) {
  my $bootstrapped = $matrix->bootstrap;
  # note: "myfile{$_}.nex" would put literal braces into the file name
  open my $outfh, '>', "myfile$i.nex" or die $!;
  print $outfh "#NEXUS\n", $bootstrapped->to_nexus;
  close $outfh or die $!;
}

...which gives you a thousand bootstrapped versions of 'myfile.nex', with names 'myfile[1..1000].nex'. Then, you run a tree search on each of those, concatenate the resulting trees into a list of newick strings (as per your original query) and do the following:

use Bio::Phylo::IO 'parse';

my $forest = parse(
  -format => 'newick',
  -file   => 'mytrees.dnd',
);

my $consensus = $forest->make_consensus( -branches => 'frequency' );
open my $outfh, '>', 'consensus.dnd' or die $!;
print $outfh $consensus->to_newick;

This gives you a little more flexibility in the file formats you can use beyond PHYLIP and Newick (and, consequently, in the tree searching programs you can use), but other than that it is equivalent to using seqboot and consense. Depending on the number of sequences and bootstrap replicates, there might be performance issues with using Perl.
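The consensus approaches above report frequencies on the consensus tree. The original question asks for support projected onto a given tree, which can be sketched as counting how often each bipartition of that tree recurs among the replicate trees. The Python below is a minimal sketch, not what PhyML or consense do internally. It assumes plain Newick strings (no quoted labels or comments), an identical taxon set in every tree, and unrooted comparison; all function names are hypothetical.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[(),;]|[^(),;]+")


def clades(newick):
    """Return (set of all leaf names, list of leaf sets, one per internal node)."""
    stack, found, leaves = [], [], set()
    prev = None
    for tok in TOKEN.findall(newick):
        tok = tok.strip()
        if not tok:
            continue
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            found.append(frozenset(done))
            if stack:
                stack[-1] |= done
        elif tok not in ",;" and prev != ")":
            # A label that does not follow ")" is a leaf; drop its branch length.
            name = tok.split(":")[0].strip()
            leaves.add(name)
            if stack:
                stack[-1].add(name)
        # Tokens right after ")" are internal labels or branch lengths: ignored.
        prev = tok
    return frozenset(leaves), found


def splits(newick):
    """Return (taxa, set of non-trivial bipartitions) for one tree."""
    taxa, found = clades(newick)
    anchor = min(taxa)  # fixed taxon used to pick one side of each bipartition
    result = set()
    for clade in found:
        side = taxa - clade if anchor in clade else clade
        if 2 <= len(side) <= len(taxa) - 2:  # skip trivial splits and the root
            result.add(side)
    return taxa, result


def bootstrap_support(reference, replicates):
    """Fraction of replicate trees containing each split of the reference tree."""
    taxa, ref_splits = splits(reference)
    counts, n = Counter(), 0
    for nwk in replicates:
        rep_taxa, rep_splits = splits(nwk)
        if rep_taxa != taxa:
            raise ValueError("replicate tree has a different taxon set")
        counts.update(ref_splits & rep_splits)
        n += 1
    if n == 0:
        raise ValueError("no replicate trees given")
    return {s: counts[s] / n for s in ref_splits}


# Toy trees (hypothetical data, for illustration only).
reference = "((A,B),(C,D),E);"
replicates = [
    "((A,B),(C,D),E);",
    "((A:0.1,B:0.2):0.3,C,(D,E));",
    "((A,B),(C,E),D);",
    "(A,(B,(C,D)),E);",
]
support = bootstrap_support(reference, replicates)
for side, value in sorted(support.items(), key=lambda kv: sorted(kv[0])):
    print(",".join(sorted(side)), value)
```

In practice the replicate list would come from the concatenated tree file, one Newick string per line. Each split is reported as the side that does not contain the alphabetically first taxon, so mapping the values back onto the branches of the original tree is left to the caller.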