I have a bunch of FASTA files from RefSeq and I want to know how many bases are in each one. I want to count all characters after the header line of each sequence. I have something along these lines:
for file in *.faa; do echo $file; grep -c ">" "$file" | wc -c; done;
But this counts all characters, including the sequence headers. I just want the number of nucleotides.
I have tried some other grep options, but they don't seem to make it any better. I would very much appreciate some help.
seqkit stats *.faa

file                    format  type  num_seqs    sum_len  min_len  avg_len  max_len
tests/seqs4amplicon.fa  FASTA   DNA          2         33       16     16.5       17
tests/hairpin.fa        FASTA   RNA     28,645  2,949,871       39      103    2,354
tests/contigs.fa        FASTA   DNA          9         54        2        6       10
tests/hsa.fa            FASTA   DNA          4         40       10       10       10
grep -c '>' counts the lines that contain '>', i.e. the number of headers, not the number of bases.
If you want to count characters but omit the header line from the count, do something like this:
for file in *.faa; do echo $file; sed '1d' "$file" | wc -c; done;
The sed '1d' part deletes the first line of the file and passes the rest through to wc. Note that this only works if the file contains a single sequence (one header line).
If you want to use grep instead, use grep -v '>' (this prints every line that does not contain '>').
You should also keep in mind that wc -c counts every character in the input, even the ones you can't see (such as line breaks), so the result is a slight overcount, but it gives a good approximation of the total.
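Putting these suggestions together, here is a sketch (assuming a directory of .faa files) that strips the header lines and the newlines before counting, so the count is exact rather than approximate:

```shell
# Count bases per file: drop header lines, delete newlines, then count bytes.
for file in *.faa; do
  printf '%s\t' "$file"
  grep -v '>' "$file" | tr -d '\n' | wc -c
done
```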
Assuming you have only standard FASTA/FASTQ sequence characters (i.e. only G, C, T, A, and N), you can pipe two greps together. The first grep -v removes the header lines (as suggested by Lieven and boczniak767), and the second grep -E -o matches the individual bases (catching multiple matches per line, one per output line), with a final pipe to wc -l to count the number of bases.
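The pipeline described above might look like this (a sketch; the exact pattern and the file name genome.faa are my reconstruction):

```shell
# Drop header lines, then emit one matched base per line and count the lines.
grep -v '>' genome.faa | grep -E -o 'G|C|T|A|N' | wc -l
```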
You could try putting the above in a loop. I haven't benchmarked it against the other approaches, though, and matching multiple alternatives will likely be slower than a simpler pattern. Also, if the genome is quite large there may be memory issues, but again, I haven't tested that. In any case, purpose-built tools (like seqkit stats, as suggested by shenwei356) would be the way to go for more memory-intensive genomes.
I'm interested to hear if anyone sees any holes or issues with this approach; long-term learning is the best!
Just use seqkit stats
For a large number of files:
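One way to handle many files (a sketch, assuming GNU find and xargs are available) is to batch the file names through xargs, which avoids the shell's argument-length limit that a bare *.faa glob can hit:

```shell
# Feed all .faa files under the current directory to seqkit in batches.
find . -name '*.faa' -print0 | xargs -0 seqkit stats
```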