Counting nucleotide frequency using perl script
1
1
Entering edit mode
7.0 years ago
MAPK ★ 2.1k

I have this perl script below to calculate sequence length and their frequency along with nucleotide frequency(A,T,G and C). This script works fine for a file with large number of sequences, but it does not give the right result for a file only a few sequences( compared to a file with 1000's of sequences):

infile.fasta

>NC_013_1051_1114
TTGTCCCTTTGAGTCTCTGG
>NC_013_1051_1114
GCGCAGCCGATATGGATAA
>NC_013_1051_1114
TCGAGACTTTGTAATGTTTGGG
>NC_013_1051_1114
TATTCCACGTCAGGTGCTTTT
>NC_013_1051_1114
TAGAGCCGATTCCAGACTGTTCC
>NC_013_1051_1114
TACAGGACCAAGCTCTTCACTC
>NC_013_1051_1114
CCGTCAAGTTCAGCTCCAATAA
>NC_013_6_301
CCACGCAACGGACAATCAAACA
>NC_013_6_301
GGACACTTCCAACTATAAATA
>NC_013_6_301
CCACGCAACGGACAATCAAACA
>NC_013_1051_1114
GCTCTTCACTCTTCCTCGTCT
>NC_013_1051_1114
TTGGGAAAAAGAAGTTGCTGCAGC
>NC_013_1051_1114
TCGCAGTATCTCTGAAGTTG

count.pl

#!/usr/bin/perl -w

#usage ./count.pl infile min_length max_length
#usage ./count.pl infile 18 34

my $min_len = $ARGV[1];
my $max_len = $ARGV[2];
my $read_len = 0;
my @lines = ("header1","sequence","header2","quality");
my @lray = ();
my $count = 0;
my $total = 0;
my $i = 0;

my @Aray = ();
my @Cray = ();
my @Gray = ();
my @Tray = ();

my$FN = "";

for ($i=$min_len; $i<=$max_len; $i++){
   $lray[$i] = 0;
}

open (INFILE, "<$ARGV[0]") || die "couldn't open input file!";
   while (<INFILE>) {
      $lines[$count] = $_;
      chomp($lines[$count]);
      $count++;
      if($count eq 4){
         $read_len = length($lines[1]); 
#         print "$read_len $lines[1]\n";
         $FN = substr $lines[1], 0, 1;  
         $lray[$read_len]++;
         if ($FN eq "T") { $Tray[$read_len]++;} 
         else {        
            if ($FN eq "A"){ $Aray[$read_len]++;}
            else {
               if ($FN eq "C"){ $Cray[$read_len]++;}
               else {
                 if ($FN eq "G"){ $Gray[$read_len]++;}
               }   
            }
         }           
         $count = 0;
      }
   }
print "length\tnumber\tA\tC\tG\tT\n";
for ($i=$min_len; $i<=$max_len; $i++){
   print "$i\t$lray[$i]\t$Aray[$i]\t$Cray[$i]\t$Gray[$i]\t$Tray[$i]\n";
}
exit;

This is the result I get for the above infile (which should look something like the result below for another file with many sequences): length number A C G T 18 0
19 0
20 1 1 21 2 2
22 2 1 1 23 1 1 24 0
25 0
26 0
27 0
28 0
29 0
30 0
31 0
32 0
33 0
34 0

This is the correct form of result I get from a big infile.fasta with many sequences.

length  number  A   C   G   T
18  4473    542 710 471 2750
19  12647   990 1680    1103    8874
20  31194   3010    3354    2743    22087
21  61214   6288    7196    5784    41946
22  128642  14596   11902   12518   89626
23  65190   6859    6525    7773    44033
24  10012   611 1401    1112    6888
25  1406    231 192 435 548
26  661 169 91  105 296
27  407 126 81  65  135
28  602 391 49  68  94
29  520 54  30  370 66
30  175 26  93  18  38
31  156 35  28  29  64
32  106 22  16  24  44
33  97  45  17  16  19
34  0

I would really appreciate if you could help me correct this code. Thanks

perl nucleotide_frequency • 6.6k views
ADD COMMENT
1
Entering edit mode

First, use strict;, it will save you a lot of headache.

Second, I guess this script was created for fastq files ( if($count eq 4){ ), but you are using it for fasta files.

Finally, a more readable (for me, at least) way of counting characters is:

my $seq = "aaATATGAAGCTc";
my $Acount = () = $seq =~ /A/gi;
my $Tcount = () = $seq =~ /T/gi;
my $Ccount = () = $seq =~ /C/gi;
my $Gcount = () = $seq =~ /G/gi;
ADD REPLY
0
Entering edit mode

Is this an assignment question? Why do you want a Perl script that does this - have you tried looking for existing tools?

ADD REPLY
0
Entering edit mode

No, I am not much familiar with perl scripting. Had this code from a colleague to do smRNA analysis, but doesn't seem to give the expected output for a fasta file with only few sequences. Works fine with a fasta file with many sequences though. Of course, I wanted to know what is causing the problem so I could learn something new with perl.

ADD REPLY
0
Entering edit mode

I think it doesn't work fine with any fasta due to if($count eq 4).

ADD REPLY
2
Entering edit mode
7.0 years ago
h.mon 35k

Two solutions (both ignore non-ACGT bases):

count1.pl

#!/usr/bin/perl
use warnings;
use strict;

my $infile = $ARGV[0];
my $min = $ARGV[1];
my $max = $ARGV[2];
my %hash = ();
$/ = ">";

open (INFILE, "<$infile") || die "couldn't open input file!";
while (<INFILE>) {
    chomp;
    next if ( /^\s*$/ );
    my ($header, $seq) = split("\n");
    my $seq_len = length($seq);
    next if ($seq_len < $min || $seq_len > $max);
    $hash{$seq_len}[1] += 1;
    $hash{$seq_len}[2] += () = $seq =~ /A/gi;
    $hash{$seq_len}[3] += () = $seq =~ /C/gi;
    $hash{$seq_len}[4] += () = $seq =~ /G/gi;
    $hash{$seq_len}[5] += () = $seq =~ /T/gi;
}
close(INFILE);
print "length\tnumber\tA\tC\tG\tT\n";
for my $len ( sort keys %hash ) {
    print "$len\t";
    for my $i ( 1 .. 5 ) {
        print "$hash{$len}[$i]\t";
    }
    print "\n";
}

count2.pl

#!/usr/bin/perl
use warnings;
use strict;

my $infile = $ARGV[0];
my $min = $ARGV[1];
my $max = $ARGV[2];
my( %lenhash, %Ahash, %Chash, %Ghash, %Thash );
$/ = ">";

open (INFILE, "<$infile") or die "couldn't open input file!";
while (<INFILE>) {
    my %count;
    chomp;
    next if ( /^\s*$/ );
    my ($header, $seq) = split("\n");
    my $seq_len = length($seq);
    next if ($seq_len < $min || $seq_len > $max);
    foreach my $base (split "", $seq) {
        $count{$base}++;
    }
    $lenhash{$seq_len} += 1;
    $Ahash{$seq_len} += $count{"A"};
    $Chash{$seq_len} += $count{"C"};
    $Ghash{$seq_len} += $count{"G"};
    $Thash{$seq_len} += $count{"T"};
}
close(INFILE);
print "length\tnumber\tA\tC\tG\tT\n";
for my $len ( sort keys %lenhash ) {
    print "$len\t$lenhash{$len}\t$Ahash{$len}\t$Chash{$len}\t$Ghash{$len}\t$Thash{$len}\n";
}
ADD COMMENT
0
Entering edit mode

Thank you so much. How do you add rows for sequences that are not present? For example, If the input parameters are 18 34, I want to show the frequency for sequences of length 18 to 34 even if they are not there. If sequence length of 18 is not present, I still want to show zeros for 18.

ADD REPLY
0
Entering edit mode

I want to show the table like this even if there is no frequency for some of the sequences (you may notice 18 bases long and 34 bases long sequences are not present in the table below, but I still want to include them).

length  number  A   C   G   T
18  0       0     0    0        0
19  12647   990 1680    1103    8874
20  31194   3010    3354    2743    22087
21  61214   6288    7196    5784    41946
22  128642  14596   11902   12518   89626
23  65190   6859    6525    7773    44033
24  10012   611 1401    1112    6888
25  1406    231 192 435 548
26  661 169 91  105 296
27  407 126 81  65  135
28  602 391 49  68  94
29  520 54  30  370 66
30  175 26  93  18  38
31  156 35  28  29  64
32  106 22  16  24  44
33  97  45  17  16  19
34  0    0     0    0    0
ADD REPLY
1
Entering edit mode

For the count1.pl script, replace the final part with:

print "length\tnumber\tA\tC\tG\tT\n";
foreach my $len ( $min .. $max ) {
    print "$len\t";
    if ( exists( $hash{$len} ) ) {
        foreach my $i ( 1 .. 5 ) {
            print "$hash{$len}[$i]\t";
        }
        print "\n";
    }
    else { print "0\t0\t0\t0\t0\n"; }
}

The same logic applies to count2.pl: if $lenhash{$len} exists, print base count hashes, if it doesn't, print zeros.

ADD REPLY
0
Entering edit mode

Thank you so much. This helped me a lot. I usually work with R and was not familiar with perl.

ADD REPLY

Login before adding your answer.

Traffic: 1678 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6