Hi I am new to this forum! I am trying to pull all the Polymorphic Sites (SNPs) out of a Fasta or Clustal alignment and create a table in Prettybase format (http://genapps.uchicago.edu:8081/maxdip/sample.pretty), so I can calculate pi, theta, and tajima's D.
Fasta Alignment:
>A1
ATGCA
>A2
ATGCG
>A3
ATGCA
>A4
ATCGG
Convert to (position \t individual \t allele) - PRETTYBASE format:
01 A1 A
01 A2 A
01 A3 A
01 A4 A
02 A1 T
02 A2 T
02 A3 T
02 A4 T
03 A1 G
03 A2 G
03 A3 G
03 A4 C
04 A1 C
04 A2 C
04 A3 C
04 A4 G
05 A1 A
05 A2 G
05 A3 A
05 A4 G
I am am fairly proficient in BioPerl, but have come to a wall. I have looked at all the CPAN and BioPerl websites about the different modules and I can't figure out how to do it? Any help would be great! Or do you another way to calculate pi, theta, and tajima's D from alignment data?
Here is my script so far:
#!/usr/bin/perl
use warnings;
use strict;
use Bio::SearchIO;
use Bio::SeqIO;
my $seqio = Bio::SeqIO->new(-format => 'fasta', -file => "$ARGV[0]");
while (my $seqobj = $seqio->next_seq) {
my $seqid = $seqobj->display_id;
my $nuc = $seqobj->seq();
push (my @ids, $seqid);
push (my @nucs, $nuc);
my @spl = split(//, $nucs[0]);
my $count=1;
my $count2=1;
my $count3=1;
my $count4=1;
foreach my $i (@spl) {
my $position = $count++;
my $individual = $ids[0];
print $position."\t".$individual."\t".$i."\n";
}
my @spl2 = split(//, $nucs[1]);
foreach my $k (@spl2) {
my $position2 = $count2++;
my $individual2 = $ids[1];
print $position2."\t".$individual2."\t".$k."\n";
}
my @spl3 = split(//, $nucs[2]);
foreach my $j (@spl3) {
my $position3 = $count3++;
my $individual3 = $ids[2];
print $position3."\t".$individual3."\t".$j."\n";
}
my @spl4 = split(//, $nucs[3]);
foreach my $l (@spl4) {
my $position4 = $count4++;
my $individual4 = $ids[3];
print $position4."\t".$individual4."\t".$l."\n";
}
}
+1 Great! Thanks for posting your solution.
Excellent script, would be really useful to determine nr of sequences automatically. I did have a problem however, the Tajima D value never printed, not sure what the error was:
@BioPower3-IBM ~/nosema/annotation/diversity/no_stop_codons $ perl trial/Fasta2Prettybase.pl M715_870002501/aligned_nt.fasta > tmp1; perl trial/Prettybase2ThetaPiTajima.pl tmp1; rm -rf tmp1
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 24, <GEN0> line 1.
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 30, <GEN0> line 1.
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 36, <GEN0> line 1.
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 24, <GEN0> line 2.
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 30, <GEN0> line 2.
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 36, <GEN0> line 2.
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 24, <GEN0> line 3.
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 30, <GEN0> line 3.
Use of uninitialized value in split at trial/Fasta2Prettybase.pl line 36, <GEN0> line 3.
Use of uninitialized value $D in concatenation (.) or string at trial/Prettybase2ThetaPiTajima.pl line 11.
Pi = 383.333333333335
Theta = 364.666666666667
Tajima's D =
This all may be because I did not change the script but only had 3 sequences in my alignment.
Would be also very useful to merge the 2 scripts into 1. Too bad I am not pro efficient in perl.