I have this perl script below to calculate sequence length and their frequency along with nucleotide frequency(A
,T
,G
and C
). This script works fine for a file with large number of sequences, but it does not give the right result for a file only a few sequences( compared to a file with 1000's of sequences):
infile.fasta
>NC_013_1051_1114
TTGTCCCTTTGAGTCTCTGG
>NC_013_1051_1114
GCGCAGCCGATATGGATAA
>NC_013_1051_1114
TCGAGACTTTGTAATGTTTGGG
>NC_013_1051_1114
TATTCCACGTCAGGTGCTTTT
>NC_013_1051_1114
TAGAGCCGATTCCAGACTGTTCC
>NC_013_1051_1114
TACAGGACCAAGCTCTTCACTC
>NC_013_1051_1114
CCGTCAAGTTCAGCTCCAATAA
>NC_013_6_301
CCACGCAACGGACAATCAAACA
>NC_013_6_301
GGACACTTCCAACTATAAATA
>NC_013_6_301
CCACGCAACGGACAATCAAACA
>NC_013_1051_1114
GCTCTTCACTCTTCCTCGTCT
>NC_013_1051_1114
TTGGGAAAAAGAAGTTGCTGCAGC
>NC_013_1051_1114
TCGCAGTATCTCTGAAGTTG
count.pl
#!/usr/bin/perl -w
#usage ./count.pl infile min_length max_length
#usage ./count.pl infile 18 34
my $min_len = $ARGV[1];
my $max_len = $ARGV[2];
my $read_len = 0;
my @lines = ("header1","sequence","header2","quality");
my @lray = ();
my $count = 0;
my $total = 0;
my $i = 0;
my @Aray = ();
my @Cray = ();
my @Gray = ();
my @Tray = ();
my$FN = "";
for ($i=$min_len; $i<=$max_len; $i++){
$lray[$i] = 0;
}
open (INFILE, "<$ARGV[0]") || die "couldn't open input file!";
while (<INFILE>) {
$lines[$count] = $_;
chomp($lines[$count]);
$count++;
if($count eq 4){
$read_len = length($lines[1]);
# print "$read_len $lines[1]\n";
$FN = substr $lines[1], 0, 1;
$lray[$read_len]++;
if ($FN eq "T") { $Tray[$read_len]++;}
else {
if ($FN eq "A"){ $Aray[$read_len]++;}
else {
if ($FN eq "C"){ $Cray[$read_len]++;}
else {
if ($FN eq "G"){ $Gray[$read_len]++;}
}
}
}
$count = 0;
}
}
print "length\tnumber\tA\tC\tG\tT\n";
for ($i=$min_len; $i<=$max_len; $i++){
print "$i\t$lray[$i]\t$Aray[$i]\t$Cray[$i]\t$Gray[$i]\t$Tray[$i]\n";
}
exit;
This is the result I get for the above infile (which should look something like the result below for another file with many sequences):
length number A C G T
18 0
19 0
20 1 1
21 2 2
22 2 1 1
23 1 1
24 0
25 0
26 0
27 0
28 0
29 0
30 0
31 0
32 0
33 0
34 0
This is the correct form of result I get from a big infile.fasta with many sequences.
length number A C G T
18 4473 542 710 471 2750
19 12647 990 1680 1103 8874
20 31194 3010 3354 2743 22087
21 61214 6288 7196 5784 41946
22 128642 14596 11902 12518 89626
23 65190 6859 6525 7773 44033
24 10012 611 1401 1112 6888
25 1406 231 192 435 548
26 661 169 91 105 296
27 407 126 81 65 135
28 602 391 49 68 94
29 520 54 30 370 66
30 175 26 93 18 38
31 156 35 28 29 64
32 106 22 16 24 44
33 97 45 17 16 19
34 0
I would really appreciate if you could help me correct this code. Thanks
First,
use strict;
, it will save you a lot of headache.Second, I guess this script was created for fastq files (
if($count eq 4){
), but you are using it for fasta files.Finally, a more readable (for me, at least) way of counting characters is:
Is this an assignment question? Why do you want a Perl script that does this - have you tried looking for existing tools?
No, I am not much familiar with perl scripting. Had this code from a colleague to do smRNA analysis, but doesn't seem to give the expected output for a fasta file with only few sequences. Works fine with a fasta file with many sequences though. Of course, I wanted to know what is causing the problem so I could learn something new with perl.
I think it doesn't work fine with any fasta due to
if($count eq 4)
.