Question

How to get number of characters (nucleotides) in fasta file

1

Entering edit mode

4.4 years ago

er.doug.ragnar ▴ 30

Hi,

I have a bunch of fasta files from refseq and I want to know how many bases are on each one. I want to count all characters after the first line of each sequence. I have something among these lines:

for file in *.faa; do echo $file; grep -c ">" "$file" | wc -c; done;

But this gets all characters after > including the headers of the sequences. I just want the number of nucleotides.

I have tried some other grep options but they don't seem to make it better, I would very much appreciate some help.

count fasta linux characters commands • 15k views

ADD COMMENT • link updated 3.5 years ago by bio_driven • 0 • written 4.4 years ago by er.doug.ragnar ▴ 30

1

Entering edit mode

Just use seqkit stats

seqkit stats *.faa

file                    format  type  num_seqs    sum_len  min_len  avg_len  max_len
tests/seqs4amplicon.fa  FASTA   DNA          2         33       16     16.5       17
tests/hairpin.fa        FASTA   RNA     28,645  2,949,871       39      103    2,354
tests/contigs.fa        FASTA   DNA          9         54        2        6       10
tests/hsa.fa            FASTA   DNA          4         40       10       10       10

For a large number of files:

seqkit stats --infile-list <(find seq-dir/ -name "*.faa") --tabular -j 12 -o stats.tsv

ADD REPLY • link 4.4 years ago by shenwei356 8.7k

score 1 · Answer 1 · 2021-06-18

grep -c will do a counting of all the '>' occurrences in the file.

if you want to count but omit the header line from the counting do something like this:

for file in *.faa; do echo $file; sed '1d' "$file" | wc -c; done;

the sed '1d' part will delete the first line of <file> and pass the rest of the file through to wc

if you want to use grep, you should use grep -v '>' (this will grep any line that does not match '>' )

You should also keep in mind that wc will count any character in the file (even the ones you can't see, such as line breaks etc), but that will give a good approximation of the total count.

score 1 · Answer 2 · 2021-06-18

1

Entering edit mode

4.4 years ago

cpad0112 21k

Assuming that sequences are in single line in each record:

for file in *.fa; do awk -v OFS="\t" '/^>/ {getline seq}; {gsub(/>/,"",$0); print FILENAME, $0, length(seq)}' $file; done

This would print: file name, fasta header (without >), sequence length (count)

ADD COMMENT • link 4.4 years ago by cpad0112 21k

score 1 · Answer 3 · 2021-06-18

1

Entering edit mode

4.4 years ago

boczniak767 ▴ 880

Maybe grep -v '>' file.fa | sed -e 's/\(.\)/&\n/g' | wc -l

It removes the line with name, then translates any character to the same character followed by the newline character and finally counts lines.

It works ok only if you have only one sequence in the file.

ADD COMMENT • link 4.4 years ago by boczniak767 ▴ 880

score 0 · Answer 4 · 2022-05-27

If you really love grep there might be an easier-to-read solution - though it may have drawbacks.

grep -v ">" fasta_file.fa | grep -E -o "G|C|T|A|N" | wc -l

Assuming you have only standard FASTA/FASTQ sequence characters (i.e. only G, C, T, A, and N) then you can pipe two greps together. The first grep -v removes the header line (as suggested by Lieven and boczniak767) and then the subsequent grep -E -o uses multiple pattern matching (and catching multiple matches per line) and a pipe to wc -l to count the number of bases.

You could try putting the above in a loop. I haven't tested the performance relative to other approaches though and multiple comparisons will likely be slower than regex. Also, if the genome is quite large then there may possibly be some RAM issues but again - I haven't tested that. In any case purpose-built applications (like seqkit stats as suggested by shenwei356) would be the go with more memory-intense genomes.

I'm interested to hear if anyone sees any holes/issues with this approach, long-term learning is the best!