I want to count how many bases for each of them and what is the frequency of nucleotide (A,T,G,C) in each of the sequence. I tried this one, but it gave total base count whereas I want a count for each sequence.
#!/usr/bin/perl
use strict;
use warnings;
my %seqs;
$/ = "\n>"; # read fasta by sequence, not by lines
while (<>) {
s/>//g;
my ($seq_id, @seq) = split (/\n/, $_);
my $seq = uc(join "", @seq); # rebuild sequence as a single string
my $len = length $seq;
my $numA = $seq =~ tr/A//; # removing A's from sequence returns total counts
my $numC = $seq =~ tr/C//;
my $numG = $seq =~ tr/G//;
my $numT = $seq =~ tr/T//;
print "$seq_id: Size=$len A=$numA C=$numC G=$numG T=$numT\n";
}
I had a similar question and found this response really helpful, thank you! In my case, there are some sequences that don't contain any C's and those are all I want to count, so my output is as below. Could you let me know if there is a way to have it output 0 C where there is none?
Thank you!
AAC71248.1 >AAC71255.1 3 C
AAC71256.1 1 C
AAC71261.1 1 C
AAC71285.1 1 C
AAC71286.1 1 C
AAC71293.1 >AAC71313.1 1 C
AAC71314.1 >AAC71345.1 1 C
You can use bioawk (
bioawk -c fastx
) to get this done.