Question

Counting base and nucleotide frequency of multifasta file

6.8 years ago

saadleeshehreen ▴ 140

Hi, I have a multi-FASTA file with 2000 sequences. The file looks like this:

>spacer_1
ATCCCGGGGGGTTTA...............
>spacer_2
TCAGGTTT.......
.
.

I want to count the number of bases in each sequence and the frequency of each nucleotide (A, T, G, C) in each sequence. I tried the command below, but it gives the total base count, whereas I want a count for each sequence.

grep -v ">" file.fasta | wc | awk '{print $3-$1}'

Any script for this purpose?

Cheers

multifasta base count nucleotide frequency • 11k views

updated 4.1 years ago by ainsley.beaton • 0 • written 6.8 years ago by saadleeshehreen ▴ 140

You can use bioawk (bioawk -c fastx) to get this done.

6.8 years ago by Ram 45k

score 5 · Answer 1 · 2018-12-06

It can be done with Perl:

#!/usr/bin/perl

use strict;
use warnings;

my %seqs;
$/ = "\n>"; # read fasta by sequence, not by lines

while (<>) {
    s/>//g;
    my ($seq_id, @seq) = split (/\n/, $_);
    my $seq = uc(join "", @seq); # rebuild sequence as a single string
    my $len = length $seq;
    my $numA = $seq =~ tr/A//; # tr with an empty replacement leaves the sequence unchanged and returns the number of A's
    my $numC = $seq =~ tr/C//;
    my $numG = $seq =~ tr/G//;
    my $numT = $seq =~ tr/T//;
    print "$seq_id: Size=$len  A=$numA  C=$numC  G=$numG  T=$numT\n";
}

Testing it:

$ perl count.pl < seqs.fa
spacer_1: Size=15  A=2  C=3  G=6  T=4
spacer_2: Size=8  A=1  C=1  G=2  T=4
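
For comparison, the same per-sequence report can be sketched in plain Python, with each base's frequency added as a fraction of the sequence length, since the question asks for frequencies as well as counts. This is only a minimal sketch and not part of the original answer: the function names `read_fasta` and `base_report` are made up for illustration, and it assumes an uncompressed plain-text FASTA file and nothing beyond the standard library.

```python
import sys


def read_fasta(lines):
    """Yield (seq_id, sequence) pairs from an iterable of FASTA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            seq_id, chunks = line[1:], []
        elif seq_id is not None:
            chunks.append(line.upper())  # join wrapped lines, ignore case
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def base_report(seq, alphabet="ACGT"):
    """Return (length, counts, frequencies) for one sequence."""
    length = len(seq)
    counts = {base: seq.count(base) for base in alphabet}
    # Guard against empty records so the division cannot fail.
    freqs = {base: (counts[base] / length if length else 0.0) for base in alphabet}
    return length, counts, freqs


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        for seq_id, seq in read_fasta(handle):
            length, counts, freqs = base_report(seq)
            fields = "  ".join(f"{b}={counts[b]} ({freqs[b]:.3f})" for b in "ACGT")
            print(f"{seq_id}: Size={length}  {fields}")
```

Saved under a hypothetical name such as count.py, it would be run as `python count.py seqs.fa`. Characters outside A, C, G and T (for example N) still count towards the length, so the four frequencies need not sum to 1.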

score 2 · Answer 2 · 2018-12-06

6.8 years ago

FX ▴ 20

Using shell (this works when each sequence sits on a single line):

while IFS= read -r line; do echo "$line" | grep -v '>' | grep -o "[ACGT]" | sort | uniq -c; \
echo "$line" | grep '>' ; done < file.fasta

The result:

>spacer_1
      2 A
      3 C
      6 G
      4 T
>spacer_2
      1 A
      1 C
      2 G
      4 T

Or use

while IFS= read -r line; do echo "$line" | grep -v '>' | grep -o "[ACGT]" | sort | uniq -c \
| paste - - - - ; echo "$line" | grep '>' | tr "\n" "\t" ; done < file.fasta

for more convenient output:

>spacer_1         2 A         3 C         6 G         4 T
>spacer_2         1 A         1 C         2 G         4 T

I had a similar question and found this response really helpful, thank you! In my case, there are some sequences that don't contain any C's and those are all I want to count, so my output is as below. Could you let me know if there is a way to have it output 0 C where there is none?

Thank you!

AAC71248.1 >AAC71255.1 3 C
AAC71256.1 1 C
AAC71261.1 1 C
AAC71285.1 1 C
AAC71286.1 1 C
AAC71293.1 >AAC71313.1 1 C
AAC71314.1 >AAC71345.1 1 C

4.1 years ago by ainsley.beaton • 0
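
One way to get an explicit `0 C` line is to count against a fixed character instead of listing only the characters that actually occur, which is all that `grep -o | sort | uniq -c` can do. The Python sketch below illustrates that idea. It is not a patch to the shell loop; the function name `count_base` and the example records are hypothetical, and it assumes that a sequence may wrap across several lines.

```python
def count_base(lines, base="C"):
    """Return [(seq_id, count)] for every record, including counts of 0."""
    results = []
    seq_id, count = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                results.append((seq_id, count))
            seq_id, count = line[1:], 0  # start a new record at zero
        elif seq_id is not None:
            count += line.upper().count(base)
    if seq_id is not None:
        results.append((seq_id, count))
    return results


# Made-up records for illustration; seq_b contains no C at all.
example = [">seq_a", "MKCLLC", ">seq_b", "MKLLV", ">seq_c", "ACDE", "FGHC"]
for seq_id, count in count_base(example):
    print(f"{seq_id}\t{count} C")
```

Because every record starts at zero and is always reported, a sequence without the base still gets its own line, so headers can no longer run together in the output. To read a real file, pass an open file handle in place of the `example` list.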

score 0 · Answer 3 · 2020-11-27

4.8 years ago

William ★ 5.4k

pyfastx can also report the base composition of a FASTA file.

https://pypi.org/project/pyfastx/

import pyfastx

fa = pyfastx.Fasta('test/data/test.fa.gz')
print(fa.composition)  # base composition of the whole file
# {'A': 24534, 'C': 18694, 'G': 18855, 'T': 24179}
