Counting base and nucleotide frequency of multifasta file
4
0
Entering edit mode
6.0 years ago

Hi, I have a mulifasta with 2000 sequences. The file is like this.

>spacer_1
ATCCCGGGGGGTTTA...............
>spacer_2
TCAGGTTT.......
.
.

I want to count how many bases for each of them and what is the frequency of nucleotide (A,T,G,C) in each of the sequence. I tried this one, but it gave total base count whereas I want a count for each sequence.

grep -v ">" file.fasta | wc | awk '{print $3-$1}'

Any script for this purpose?

Cheers

multifasta base count nucleotide frequency • 11k views
ADD COMMENT
1
Entering edit mode

You can use bioawk (bioawk -c fastx) to get this done.

ADD REPLY
4
Entering edit mode
6.0 years ago
JC 13k

It can be done with Perl:

#!/usr/bin/perl

use strict;
use warnings;

my %seqs;
$/ = "\n>"; # read fasta by sequence, not by lines

while (<>) {
    s/>//g;
    my ($seq_id, @seq) = split (/\n/, $_);
    my $seq = uc(join "", @seq); # rebuild sequence as a single string
    my $len = length $seq;
    my $numA = $seq =~ tr/A//; # removing A's from sequence returns total counts
    my $numC = $seq =~ tr/C//;
    my $numG = $seq =~ tr/G//;
    my $numT = $seq =~ tr/T//;
    print "$seq_id: Size=$len  A=$numA  C=$numC  G=$numG  T=$numT\n";
}

Testing it:

$ perl count.pl < seqs.fa
spacer_1: Size=15 A=2  C=3  G=6  T=4
spacer_2: Size=8 A=1  C=1  G=2  T=4
ADD COMMENT
2
Entering edit mode
6.0 years ago
FX ▴ 20

Using shell

while read line; do echo $line | grep -v '>' | grep -o "[ACGT]" | sort | uniq -c; \
echo $line | grep '>' ; done < file.fasta

The result:

>spacer_1
      2 A
      3 C
      6 G
      4 T
>spacer_2
      1 A
      1 C
      2 G
      4 T

Or use

while read line; do echo $line | grep -v '>' | grep -o "[ACGT]" | sort | uniq -c \
| paste - - - - ; echo $line | grep '>' | tr "\n" "\t" ; done < file.fasta

for a more convenient output

>spacer_1         2 A         3 C         6 G         4 T
>spacer_2         1 A         1 C         2 G         4 T
ADD COMMENT
0
Entering edit mode

I had a similar question and found this response really helpful, thank you! In my case, there are some sequences that don't contain any C's and those are all I want to count, so my output is as below. Could you let me know if there is a way to have it output 0 C where there is none?

Thank you!

AAC71248.1 >AAC71255.1 3 C
AAC71256.1 1 C
AAC71261.1 1 C
AAC71285.1 1 C
AAC71286.1 1 C
AAC71293.1 >AAC71313.1 1 C
AAC71314.1 >AAC71345.1 1 C

ADD REPLY
0
Entering edit mode
4.0 years ago
William ★ 5.3k

pyfastx can also get the base composition of a fasta file.

https://pypi.org/project/pyfastx/

import pyfastx
fa = pyfastx.Fasta('test/data/test.fa.gz')
fa.composition
{'A': 24534, 'C': 18694, 'G': 18855, 'T': 24179}
ADD COMMENT

Login before adding your answer.

Traffic: 1877 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6