Question

Fasta alignment to Tajima's D and population statistics

1

Entering edit mode

8.7 years ago

matthew.m.hernandez ▴ 30

Hi there,

I'm a newbie who's trying to make use of the Bio::PopGen packages to analyze measurements of diversity in different populations of viral quasispecies. I can turn to DNAsp on Windows to analyze different FASTA alignments. This is straightforward and easy but takes a long time for 100 different alignments.

I've tried turning to BioPerl to do this and have written a script to take a fasta alignment and run Tajima's analyses on it to return statistics. It runs fine, but I am getting very different values for certain metrics than what I would get on DNAsp.

Here is my code (my input file is a fasta formatted alignment file):

#!/usr/bin/perl                                                                                                                                                                                    
use warnings;
use strict;
use Bio::AlignIO;
use Bio::PopGen::IO;
use Bio::PopGen::Statistics;
use Bio::PopGen::Utilities;

my $infile = shift;

my $in = Bio::AlignIO->new(-format => 'fasta', -file => $infile);
my $aln;
if($aln = $in->next_aln){
}

my $pop = Bio::PopGen::Utilities->aln_to_population(-alignment=>$aln, -include_monomorphic =>1);

my $pi = Bio::PopGen::Statistics->pi($pop);
my $theta = Bio::PopGen::Statistics->theta($pop);
my $D = Bio::PopGen::Statistics->tajima_D($pop);
my $segsites = Bio::PopGen::Statistics->segregating_sites_count($pop);

print "Pi = $pi"."\n"."Theta = $theta"."\n"."Tajima's D = $D"."\n"."SegSites = $segsites"."\n";

If I compare the results from DNAsp and Bioperl I get the following:

DNAsp

Pi: 0.00020 Theta (per site): 0.00958 Tajima's D: -2.21621 SegSites: 18

BioPerl

Pi: 0.04944 Theta: 2.16841 Tajima's D: -2.21623 SegSites: 18

Does anyone know why Pi and Theta are so different? Moreover, is there any way that I can get it to spit out the number of sequences in the alignment file and measure the statistical significance of the Tajima's D-test?

Thanks,

Matt

alignment tajima PopGen Statistics BioPerl • 4.6k views

ADD COMMENT • link 8.7 years ago by matthew.m.hernandez ▴ 30

3

Entering edit mode

My guess but DNASP gives per site value (divided by the length of ungapped alignment) whereas BioPerl seems go give added diversity for the whole sequence. Can you divide by the ungapped alignment length and see if the values start matching ?

ADD REPLY • link 8.7 years ago by microfuge ★ 2.0k

0

Entering edit mode

So this makes sense for the Pi value! However, this doesn't work for the theta statistic (2.16841/249 = 0.00871 which is pretty off from 0.00958). Of note, my sequence length is 249bp.

ADD REPLY • link 8.7 years ago by matthew.m.hernandez ▴ 30

1

Entering edit mode

You can use $aln->no_sequences to get the number of sequences and I agree with the other comment about the statistics. Dividing Theta and Pi by the length works out. I think the different implementation of the programs and how they deal with floating point numbers accounts for the slight variation. I'd be pretty confident to have that level of agreement.

ADD REPLY • link 8.7 years ago by SES 8.6k

0

Entering edit mode

Thanks for addressing the number of sequences flag! However, I'm still concerned with the theta value. Do you still have confidence in a difference between 0.00871 and 0.00958?

ADD REPLY • link 8.7 years ago by matthew.m.hernandez ▴ 30

0

Entering edit mode

No, I don't and would not want to speculate until I looked into what both are doing exactly. It could be an artifact of the floating point issue I mentioned or it could be differences in what is calculated, and it's important to know which. You can look at the documentation to dive into what is being calculated for BioPerl (or the code, though it's obviously very dense and mathematical in nature). Hope that helps!

ADD REPLY • link 8.7 years ago by SES 8.6k

0

Entering edit mode

Ahh, gotcha. Thanks for the insight!

ADD REPLY • link 8.7 years ago by matthew.m.hernandez ▴ 30

0

Entering edit mode

Matthew: Did you figure out the answers to your questions and resolve it to your satisfaction so that you may now have a version of your Perl script that can be shared for pop gen analyses? Could you let us know please. Thanks!

ADD REPLY • link 8.0 years ago by Anand Rao ▴ 640

1

Entering edit mode

Sorry for the late response Anand. I did not figure out what was going on with the BioPerl analysis. I ended up just sticking to DNAsp :-/

ADD REPLY • link 7.0 years ago by matthew.m.hernandez ▴ 30