Hi there,
I'm a newbie who's trying to make use of the Bio::PopGen packages to analyze measurements of diversity in different populations of viral quasispecies. I can turn to DNAsp on Windows to analyze different FASTA alignments. This is straightforward and easy but takes a long time for 100 different alignments.
I've tried turning to BioPerl to do this and have written a script to take a fasta alignment and run Tajima's analyses on it to return statistics. It runs fine, but I am getting very different values for certain metrics than what I would get on DNAsp.
Here is my code (my input file is a fasta formatted alignment file):
#!/usr/bin/perl
use warnings;
use strict;
use Bio::AlignIO;
use Bio::PopGen::IO;
use Bio::PopGen::Statistics;
use Bio::PopGen::Utilities;
my $infile = shift;
my $in = Bio::AlignIO->new(-format => 'fasta', -file => $infile);
my $aln;
if($aln = $in->next_aln){
}
my $pop = Bio::PopGen::Utilities->aln_to_population(-alignment=>$aln, -include_monomorphic =>1);
my $pi = Bio::PopGen::Statistics->pi($pop);
my $theta = Bio::PopGen::Statistics->theta($pop);
my $D = Bio::PopGen::Statistics->tajima_D($pop);
my $segsites = Bio::PopGen::Statistics->segregating_sites_count($pop);
print "Pi = $pi"."\n"."Theta = $theta"."\n"."Tajima's D = $D"."\n"."SegSites = $segsites"."\n";
If I compare the results from DNAsp and Bioperl I get the following:
DNAsp
Pi: 0.00020 Theta (per site): 0.00958 Tajima's D: -2.21621 SegSites: 18
BioPerl
Pi: 0.04944 Theta: 2.16841 Tajima's D: -2.21623 SegSites: 18
Does anyone know why Pi and Theta are so different? Moreover, is there any way that I can get it to spit out the number of sequences in the alignment file and measure the statistical significance of the Tajima's D-test?
Thanks,
Matt
My guess but DNASP gives per site value (divided by the length of ungapped alignment) whereas BioPerl seems go give added diversity for the whole sequence. Can you divide by the ungapped alignment length and see if the values start matching ?
So this makes sense for the Pi value! However, this doesn't work for the theta statistic (2.16841/249 = 0.00871 which is pretty off from 0.00958). Of note, my sequence length is 249bp.
You can use
$aln->no_sequences
to get the number of sequences and I agree with the other comment about the statistics. Dividing Theta and Pi by the length works out. I think the different implementation of the programs and how they deal with floating point numbers accounts for the slight variation. I'd be pretty confident to have that level of agreement.Thanks for addressing the number of sequences flag! However, I'm still concerned with the theta value. Do you still have confidence in a difference between 0.00871 and 0.00958?
No, I don't and would not want to speculate until I looked into what both are doing exactly. It could be an artifact of the floating point issue I mentioned or it could be differences in what is calculated, and it's important to know which. You can look at the documentation to dive into what is being calculated for BioPerl (or the code, though it's obviously very dense and mathematical in nature). Hope that helps!
Ahh, gotcha. Thanks for the insight!
Matthew: Did you figure out the answers to your questions and resolve it to your satisfaction so that you may now have a version of your Perl script that can be shared for pop gen analyses? Could you let us know please. Thanks!
Sorry for the late response Anand. I did not figure out what was going on with the BioPerl analysis. I ended up just sticking to DNAsp :-/