I have a bunch of FASTA files from RefSeq and I want to know how many bases are in each one. In other words, I want to count every character that comes after the header line of each sequence. I have something along these lines:
```
$ seqkit stats *.fa
file                    format  type  num_seqs    sum_len  min_len  avg_len  max_len
tests/seqs4amplicon.fa  FASTA   DNA          2         33       16     16.5       17
tests/hairpin.fa        FASTA   RNA     28,645  2,949,871       39      103    2,354
tests/contigs.fa        FASTA   DNA          9         54        2        6       10
tests/hsa.fa            FASTA   DNA          4         40       10       10       10
```
The sed '1d' part deletes the first line of <file> and passes the rest of the file through to wc.
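A minimal sketch of that pipeline (with <file> standing in for your FASTA file; note it only strips the first line, so it suits single-sequence files):

```
# Drop the header (line 1) and count every remaining character.
sed '1d' <file> | wc -c
```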
If you want to use grep instead, use grep -v '>' (this prints every line that does not contain '>'), which also skips the headers of files containing more than one sequence.
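For example (same placeholder <file> as above):

```
# Keep only the sequence lines, then count characters.
grep -v '>' <file> | wc -c
```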
You should also keep in mind that wc -c counts every character in its input, even the ones you can't see (such as line breaks), so the result is a slight overcount; still, it is a good approximation of the total.
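If you want an exact figure, one small refinement (my addition, not from the answer above) is to delete the newlines before counting:

```
# tr -d '\n' removes the invisible line breaks that wc -c would otherwise count.
grep -v '>' <file> | tr -d '\n' | wc -c
```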
Assuming you have only standard FASTA/FASTQ sequence characters (i.e. only G, C, T, A, and N), you can pipe two greps together. The first grep -v removes the header lines (as suggested by Lieven and boczniak767), and the second grep -E -o matches the bases themselves, printing each match on its own line; piping that into wc -l then counts the number of bases.
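A sketch of that pipeline; the exact pattern isn't given above, so the alternation here is my reconstruction:

```
# Strip headers, emit one matching base per line, count the lines.
# Uppercase bases are assumed, per the text above.
grep -v '>' <file> | grep -E -o 'G|C|T|A|N' | wc -l
```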
You could try putting the above in a loop (see the sketch below). I haven't tested the performance relative to other approaches, though, and matching multiple alternative patterns will likely be slower than a single regex. Also, if the genome is quite large there may be RAM issues, but again, I haven't tested that. In any case, purpose-built applications (like seqkit stats, as suggested by shenwei356) would be the way to go for larger, more memory-intensive genomes.
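A loop along those lines; the *.fa glob is an assumption, so adjust it to match your files:

```
# Print each file name alongside its base count, one file per line.
for f in *.fa; do
    printf '%s\t' "$f"
    grep -v '>' "$f" | grep -E -o 'G|C|T|A|N' | wc -l
done
```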
I'm interested to hear if anyone sees any holes or issues with this approach; long-term learning is the best!
Just use seqkit stats (see the example output near the top of this thread).
For a large number of files:
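A sketch using seqkit's --infile-list global flag, which reads input file names from a file instead of the command line; the find pattern here is an assumption, so adjust it to your file extensions:

```
# Build a list of FASTA files, then hand the list to seqkit.
find . -name '*.fa' > fasta_files.txt
seqkit stats --infile-list fasta_files.txt
```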