I want to count how many bases for each of them and what is the frequency of nucleotide (A,T,G,C) in each of the sequence. I tried this one, but it gave total base count whereas I want a count for each sequence.
#!/usr/bin/perl
use strict;
use warnings;
my %seqs;
$/ ="\n>";# read fasta by sequence, not by lineswhile(<>){
s/>//g;
my ($seq_id, @seq)=split(/\n/, $_);
my $seq= uc(join "", @seq);# rebuild sequence as a single string
my $len= length $seq;
my $numA=$seq=~ tr/A//;# removing A's from sequence returns total counts
my $numC=$seq=~ tr/C//;
my $numG=$seq=~ tr/G//;
my $numT=$seq=~ tr/T//;
print "$seq_id: Size=$len A=$numA C=$numC G=$numG T=$numT\n";}
I had a similar question and found this response really helpful, thank you! In my case, there are some sequences that don't contain any C's and those are all I want to count, so my output is as below. Could you let me know if there is a way to have it output 0 C where there is none?
Thank you!
AAC71248.1 >AAC71255.1 3 C
AAC71256.1 1 C
AAC71261.1 1 C
AAC71285.1 1 C
AAC71286.1 1 C
AAC71293.1 >AAC71313.1 1 C
AAC71314.1 >AAC71345.1 1 C
You can use bioawk (
bioawk -c fastx
) to get this done.