Count number of basepairs and exclude other characters
3
0
Entering edit mode
3.1 years ago

Hey everyone,

When I want to count basepairs (A, C, G, T) on many fasta files, I usually use

for F in *.fna ; do N=$(basename $F .fna)_count_bps.txt ; grep -v ">" $F | wc | awk '{print $3-$1}' > $N ; done

However, if my fasta files have characters other than A, C, G, and T, these will be included in the total count. Is there a way to optimize my code, so that I only get the total count of A, C, G and T in each fasta file?

Thanks!

sequence • 1.5k views
ADD COMMENT
1
Entering edit mode
3.1 years ago
Mensur Dlakic ★ 28k

There are many scripts and programs dedicated to residue counting in FASTA files.

compseq in EMBOSS package:

http://emboss.sourceforge.net/apps/cvs/emboss/apps/compseq.html

stats.sh in BBtools package:

https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/statistics-guide/

They have a breakdown by residue type, which should be easy to parse so you count only ACGTs.

ADD COMMENT
1
Entering edit mode
3.1 years ago

You can also add a sed statement after your grep statement, i.e., on a Linux system:

for F in *.fna ; do N=$(basename $F .fna)_count_bps.txt ; grep -v ">" $F | sed 's/[^ACTGactg]//g' | wc | awk '{print $3-$1}' > $N ; done

The command sed 's/[^ACTGactg]//g' removes all characters but those which are A, C, G, T, and their masked lowercase equivalents (change as needed). Whatever sequence you pass to this will have other characters removed. The filtered sequence is then passed along to wc for counting.

ADD COMMENT
1
Entering edit mode
3.1 years ago

awk can of course do this:

cat test.fasta 
>header1
ATGCATGC
>header2
TACGTCGAAGTAAG
>header3
CGTGTACAGGTGGGAGC

awk -F "" 'BEGIN {totA=0; totT=0; totG=0; totC=0} !/^>/ {nA=gsub(/A/,A,$0); nT=gsub(/T/,T,$0); nC=gsub(/C/,C,$0); nG=gsub(/G/,G,$0); totA+=nA; totT+=nT; totG+=nG; totC+=nC} END {print "A="totA"; T="totT"; G="totG"; C="totC}' test.fasta 
A=10; T=8; G=14; C=7

Kind regards,

Kevin

PS - awk is not necessarily a one-liner

awk -F "" '
  BEGIN {
    totA=0; totT=0; totG=0; totC=0
  } !/^>/ {
    nA=gsub(/A/,A,$0);
    nT=gsub(/T/,T,$0);
    nG=gsub(/G/,G,$0);
    nC=gsub(/C/,C,$0);
    totA+=nA;
    totT+=nT;
    totG+=nG;
    totC+=nC
 } END {
    print "A="totA"; T="totT"; G="totG"; C="totC
 }' test.fasta 
A=10; T=8; G=14; C=7
ADD COMMENT

Login before adding your answer.

Traffic: 1563 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6