Question

Convert Sequence Alignment Into A Prettybase Polymorphic Sites Table And Calculate Pi, Theta, And Tajima'S D Using Bioperl?

0

Entering edit mode

12.3 years ago

anin.gregory ▴ 110

Hi I am new to this forum! I am trying to pull all the Polymorphic Sites (SNPs) out of a Fasta or Clustal alignment and create a table in Prettybase format (http://genapps.uchicago.edu:8081/maxdip/sample.pretty), so I can calculate pi, theta, and tajima's D.

Fasta Alignment:
>A1
ATGCA
>A2
ATGCG
>A3
ATGCA
>A4
ATCGG

Convert to (position \t individual \t allele) - PRETTYBASE format:
01    A1    A
01    A2    A
01    A3    A
01    A4    A
02    A1    T
02    A2    T
02    A3    T
02    A4    T
03    A1    G
03    A2    G
03    A3    G
03    A4    C
04    A1    C
04    A2    C
04    A3    C
04    A4    G
05    A1    A
05    A2    G
05    A3    A
05    A4    G

I am am fairly proficient in BioPerl, but have come to a wall. I have looked at all the CPAN and BioPerl websites about the different modules and I can't figure out how to do it? Any help would be great! Or do you another way to calculate pi, theta, and tajima's D from alignment data?

Here is my script so far:

#!/usr/bin/perl
use warnings;
use strict;
use Bio::SearchIO;
use Bio::SeqIO;

my $seqio  = Bio::SeqIO->new(-format => 'fasta', -file   => "$ARGV[0]");

while (my $seqobj = $seqio->next_seq) {
        my $seqid = $seqobj->display_id;
        my $nuc = $seqobj->seq();
        push (my @ids, $seqid);
        push (my @nucs, $nuc);
        my @spl = split(//, $nucs[0]);
    my $count=1;
    my $count2=1;
    my $count3=1;
    my $count4=1;
        foreach my $i (@spl) {
         my $position = $count++;
                my $individual = $ids[0];
            print $position."\t".$individual."\t".$i."\n";
        }
        my @spl2 = split(//, $nucs[1]);
        foreach my $k (@spl2) {
                my $position2 = $count2++;
                my $individual2 = $ids[1];
            print $position2."\t".$individual2."\t".$k."\n";
        }
        my @spl3 = split(//, $nucs[2]);
        foreach my $j (@spl3) {
                my $position3 = $count3++;
                my $individual3 = $ids[2];
            print $position3."\t".$individual3."\t".$j."\n";
        }
        my @spl4 = split(//, $nucs[3]);
        foreach my $l (@spl4) {
                my $position4 = $count4++;
                my $individual4 = $ids[3];
            print $position4."\t".$individual4."\t".$l."\n";
        }
}

bioperl perl snps alignment • 4.9k views

ADD COMMENT • link 12.3 years ago by anin.gregory ▴ 110

score 2 · Answer 1 · 2013-04-18

2

Entering edit mode

12.3 years ago

anin.gregory ▴ 110

Figured it out! The script above I edited and is correct for 4 sequences...you will need to modify it if you are comparing more than 4 sequences.

Once you have your prettybase format, you can pipe the prettybase format into this script to calculate pi, theta and tajima's D:

#!/usr/bin/perl
use warnings;
use strict;
use Bio::PopGen::IO;
use Bio::PopGen::Statistics;

my $pop =Bio::PopGen::IO->new(-format => "prettybase", -file => $ARGV[0])->next_population;
my $pi = Bio::PopGen::Statistics->pi($pop);
my $theta = Bio::PopGen::Statistics->theta($pop);
my $D = Bio::PopGen::Statistics->tajima_D($pop);
print "Pi = $pi"."\n"."Theta = $theta"."\n"."Tajima's D = $D"."\n";

ADD COMMENT • link 12.3 years ago by anin.gregory ▴ 110

0

Entering edit mode

+1 Great! Thanks for posting your solution.

ADD REPLY • link 12.3 years ago by SES 8.6k

0

Entering edit mode

Excellent script, would be really useful to determine nr of sequences automatically. I did have a problem however, the Tajima D value never printed, not sure what the error was:

@BioPower3-IBM ~/nosema/annotation/diversity/no_stop_codons $ perl trial/Fasta2Prettybase.pl M715_870002501/aligned_nt.fasta > tmp1; perl trial/Prettybase2ThetaPiTajima.pl tmp1; rm -rf tmp1 Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 24, <GEN0> line 1. Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 30, <GEN0> line 1. Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 36, <GEN0> line 1. Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 24, <GEN0> line 2. Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 30, <GEN0> line 2. Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 36, <GEN0> line 2. Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 24, <GEN0> line 3. Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 30, <GEN0> line 3. Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 36, <GEN0> line 3. Use of uninitialized value $D in concatenation (.) or string at trial/Prettybase2ThetaPiTajima.pl line 11. Pi = 383.333333333335 Theta = 364.666666666667 Tajima's D =

This all may be because I did not change the script but only had 3 sequences in my alignment.

Would be also very useful to merge the 2 scripts into 1. Too bad I am not pro efficient in perl.

ADD REPLY • link 10.5 years ago by Adrian Pelin ★ 2.7k