Question

Count number of basepairs and exclude other characters

3.6 years ago

genomes_and_MGEs ▴ 10

Hey everyone,

When I want to count base pairs (A, C, G, T) in many FASTA files, I usually use:

for F in *.fna ; do N=$(basename $F .fna)_count_bps.txt ; grep -v ">" $F | wc | awk '{print $3-$1}' > $N ; done

However, if my fasta files have characters other than A, C, G, and T, these will be included in the total count. Is there a way to optimize my code, so that I only get the total count of A, C, G and T in each fasta file?

Thanks!

sequence • 1.7k views

updated 3.6 years ago by Kevin Blighe 89k • written 3.6 years ago by genomes_and_MGEs ▴ 10

score 1 · Answer 1 · 2021-11-03

There are many scripts and programs dedicated to residue counting in FASTA files.

compseq in EMBOSS package:

http://emboss.sourceforge.net/apps/cvs/emboss/apps/compseq.html

stats.sh in BBtools package:

https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/statistics-guide/

They have a breakdown by residue type, which should be easy to parse so you count only ACGTs.

score 1 · Answer 2 · 2021-11-03

You can also add a sed command after your grep command, e.g., on a Linux system:

for F in *.fna ; do N=$(basename $F .fna)_count_bps.txt ; grep -v ">" $F | sed 's/[^ACTGactg]//g' | wc | awk '{print $3-$1}' > $N ; done

The command sed 's/[^ACTGactg]//g' deletes every character except A, C, G, T and their lowercase (soft-masked) equivalents; change the character class as needed. sed works line by line, so the newline at the end of each line is kept. The filtered sequence is then passed to wc, and awk subtracts the line count ($1) from the character count ($3) so that those newlines are not counted as bases.
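As a variation on the same idea, here is a minimal sketch that uses tr instead of sed, so there are no newlines left to subtract afterwards. The function name count_acgt and the file demo.fna are made up for illustration, and the demo sequences are invented data.

```shell
# Hypothetical helper: print the number of A/C/G/T characters (either case)
# found in the sequence lines of one FASTA file.
count_acgt() {
  # 1. drop header lines
  # 2. delete every byte that is NOT in ACGTacgt (newlines included)
  # 3. count the remaining bytes; the last tr strips the padding some wc versions add
  grep -v '^>' "$1" | tr -cd 'ACGTacgt' | wc -c | tr -d ' '
}

# Made-up demo file containing N, a gap character and IUPAC codes.
printf '>seq1\nACGTNNAC-GT\n>seq2\nacgtRYacgt\n' > demo.fna

count_acgt demo.fna   # prints 16
```

Inside the original loop this would presumably become count_acgt "$F" > "$N".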

score 1 · Answer 3 · 2021-11-04

awk can of course do this:

cat test.fasta 
>header1
ATGCATGC
>header2
TACGTCGAAGTAAG
>header3
CGTGTACAGGTGGGAGC

awk 'BEGIN {totA=0; totT=0; totG=0; totC=0} !/^>/ {nA=gsub(/A/,"&",$0); nT=gsub(/T/,"&",$0); nC=gsub(/C/,"&",$0); nG=gsub(/G/,"&",$0); totA+=nA; totT+=nT; totG+=nG; totC+=nC} END {print "A="totA"; T="totT"; G="totG"; C="totC}' test.fasta
A=10; T=8; G=14; C=7

Kind regards,

Kevin

PS - awk is not necessarily a one-liner

awk '
  BEGIN {
    totA=0; totT=0; totG=0; totC=0
  } !/^>/ {
    # gsub returns the number of matches; "&" puts each match back unchanged
    nA=gsub(/A/,"&",$0);
    nT=gsub(/T/,"&",$0);
    nG=gsub(/G/,"&",$0);
    nC=gsub(/C/,"&",$0);
    totA+=nA;
    totT+=nT;
    totG+=nG;
    totC+=nC
  } END {
    print "A="totA"; T="totT"; G="totG"; C="totC
  }' test.fasta
A=10; T=8; G=14; C=7
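Since the original question asks for one ACGT total per file rather than a per-base breakdown, the same gsub counting trick can be adapted as in the sketch below. This is only an illustration: the function name acgt_total_per_file is hypothetical, and counting lowercase bases as well is an assumption (drop acgt from the character class if soft-masked bases should be excluded).

```shell
# Hypothetical helper: print "<file> <count>" for every FASTA file given,
# where <count> is the number of A/C/G/T characters (either case).
acgt_total_per_file() {
  awk '
    FNR == 1 { n[FILENAME] += 0 }   # register the file so header-only files print 0
    !/^>/    { n[FILENAME] += gsub(/[ACGTacgt]/, "&") }   # "&" keeps the line unchanged
    END      { for (f in n) print f, n[f] }
  ' "$@"
}

# Recreate the test.fasta shown above, then count.
printf '>header1\nATGCATGC\n>header2\nTACGTCGAAGTAAG\n>header3\nCGTGTACAGGTGGGAGC\n' > test.fasta

acgt_total_per_file test.fasta   # prints: test.fasta 39
```

The total of 39 matches the sum of the per-base counts above (10 + 8 + 14 + 7). For many files, acgt_total_per_file *.fna prints one line per file; the order of the lines is not guaranteed.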