Question

Why Is Bootstrapping With PAUP* So Slow?

14.2 years ago

Ijessie ▴ 70

Hi,

Yesterday I tried to run a bootstrap analysis with 1000 replicates using PAUP* on Windows 7. My dataset consists of 107 coding sequences of 675 characters each. First I used jModelTest to select the best-fit model, then appended the following PAUP block to the DATA block and executed it in the PAUP console.

BEGIN PAUP;           
log start=yes file=test.log replace=yes;
set criterion=likelihood;

lset base=(0.3037 0.2083 0.2478) nst=6  rmat=(1.4346 9.4669 0.8990 0.3065 7.1675) rates=gamma shape=0.6360 ncat=4 pinvar=0;

bootstrap nreps=1000 bseed=12345 conlevel=60 treefile=bootstrap.tre replace=yes format=NEXUS search=HEURISTIC;

savetrees brlens=yes savebootp=both maxdecimals=0 from=1 to=1 file=boots.tre replace=yes;
log stop;
END;

But after 17 hours it is only at bootstrap replicate 6. Each replicate takes more than 3 hours.

So, is there anything wrong with the commands? Or is it simply that PAUP* is not recommended once the number of sequences goes above a certain number, such as 100?

Thanks in advance for your answer and help!

phylogenetics • 7.4k views

ADD COMMENT • link updated 14.2 years ago by Paulo Nuin ★ 3.7k • written 14.2 years ago by Ijessie ▴ 70

score 4 · Answer 1 · 2010-10-27

14.2 years ago

Paulo Nuin ★ 3.7k

The performance you're getting sounds about right. The main bottleneck here is the number of taxa; the length of the sequences doesn't have a huge impact on performance.
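As a rough illustration of why the number of taxa dominates: the number of distinct unrooted, fully bifurcating topologies for n taxa is (2n-5)!!, a standard counting result. A heuristic search only visits a tiny fraction of these, but the space it has to work in grows this fast. The Python sketch below is only an illustration; the function name is made up for this example and has nothing to do with PAUP itself.

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Count unrooted, fully bifurcating topologies: (2n-5)!! for n >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (empty product for n = 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


print(num_unrooted_trees(10))              # 2027025
print(len(str(num_unrooted_trees(107))))   # number of digits for 107 taxa
```

Going from 10 to 107 taxa takes the count from about two million to a number with far too many digits to enumerate, which is why every bootstrap replicate needs its own expensive heuristic search.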

PAUP is well known to be slow on bootstrap calculations. In some tests I ran, PAUP was at least 2-3 times slower than Phylip. My suggestion would be to use Phylip to calculate the bootstrap, but you might not be able to set the same evolutionary model in it.

In Phylip you need to use SEQBOOT to generate the bootstrap matrices, then PROML (or DNAML for nucleotide data) to infer a maximum-likelihood tree for each of these matrices, and finally CONSENSE to get a consensus tree from all the PROML results.
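For anyone unsure what the resampling step does conceptually, here is a minimal Python sketch of nonparametric bootstrapping of an alignment: columns are drawn with replacement until the replicate is as long as the original. This is not SEQBOOT's code or file format; the function name, the dict-based alignment, and the toy sequences are all assumptions made for the example.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns resampled with replacement."""
    n_sites = len(next(iter(alignment.values())))
    # Draw n_sites column indices, allowing repeats.
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    # Apply the same column choice to every sequence so sites stay aligned.
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


# Toy alignment (hypothetical data, three taxa, six sites).
toy = {"taxonA": "ACGTAC", "taxonB": "ACGTTC", "taxonC": "ATGTAC"}
rng = random.Random(12345)
replicates = [bootstrap_alignment(toy, rng) for _ in range(3)]
for rep in replicates:
    print(rep)
```

Each replicate is then analysed exactly like the original matrix, which is why the total cost is roughly the cost of one tree search multiplied by the number of replicates.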

ADD COMMENT • link 14.2 years ago by Paulo Nuin ★ 3.7k

Thanks. I will give Phylip a try.

When I read papers, I've noticed that most of them use PAUP* to construct an ML tree with 1000 bootstrap replicates. In some cases their datasets are large, too. So I'm just curious how they did it...

ADD REPLY • link 14.2 years ago by Ijessie ▴ 70

They use PAUP too, and the performance is the same as what you're getting. In some tests I ran some years ago, generating 1000 bootstrap replicates for a 100-taxon matrix took more than two weeks.

ADD REPLY • link 14.2 years ago by Paulo Nuin ★ 3.7k

score 2 · Answer 2 · 2010-10-27

14.2 years ago

Stefano Berri 4.4k

I have never used PAUP, but I have done similar things with Phylip. Bootstrapping should increase the run time linearly with the number of replicates. However, 107 proteins of 675 amino acids each is very computationally intensive, and I would not be surprised if each replicate takes 3 hours.
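To put the linear scaling into numbers, here is a back-of-the-envelope Python sketch. It assumes every replicate costs about the same as the roughly 3 hours per replicate reported in the question, which is only an approximation since individual heuristic searches vary; the function name is just for illustration.

```python
def estimated_total_hours(hours_per_replicate: float, n_replicates: int) -> float:
    """Linear estimate: total time = time per replicate x number of replicates."""
    return hours_per_replicate * n_replicates


hours = estimated_total_hours(3.0, 1000)
days = hours / 24
print(f"{hours:.0f} hours, about {days:.0f} days")  # 3000 hours, about 125 days
```

At that rate, 1000 replicates on a single machine would take about four months.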

Usually only a conserved region is used for phylogenetic analysis. Are you sure 675 nucleotides are conserved across your 107 proteins?

ADD COMMENT • link 14.2 years ago by Stefano Berri 4.4k

Yes, I think they're pretty well conserved at most positions. Besides, I excluded positions with gaps. I'm not sure whether I need to exclude those gaps, but I read an article entitled "Phylogeny for the faint of heart: a tutorial", which says: The general rule is to delete all positions with gaps plus any adjacent, ambiguously aligned positions.
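The gap-removal part of that rule can be sketched in a few lines of Python. This is only an illustration under some assumptions: the alignment is held as a dict of equal-length strings, and "-" and "?" are treated as gap or missing characters. It does not attempt the second half of the rule (removing adjacent, ambiguously aligned positions), which needs manual judgment.

```python
def strip_gapped_columns(alignment, gap_chars="-?"):
    """Drop every alignment column in which any sequence has a gap character."""
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    # zip(*seqs) walks the alignment column by column.
    kept = [col for col in zip(*seqs) if not any(ch in gap_chars for ch in col)]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


# Toy alignment (hypothetical data): columns 2 and 3 contain gaps.
gapped = {"seq1": "AC-GT", "seq2": "ACGGT", "seq3": "A-GGT"}
print(strip_gapped_columns(gapped))
```

Only the columns where all three toy sequences have a residue survive, so each sequence shrinks from five sites to three.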

When you conduct such an analysis, what do you do with the data after the multiple alignment?