Question

count word ocurrences in fasta file

0

Entering edit mode

7.1 years ago

erick_rc93 ▴ 30

I would like to do a perl script which can count all word ocurrences in all multifasta files in a directory and the output looks like

Number of occurrences in >name_sequence is : #

I have tried with

for file in *.fasta; do perl -lne 'if($_ =~ /word/) { $a++;} ; END { print "No of words in the file:"; print $a}' $file; done

I also would like that the script count the sequences length, I put the next script in a for loop in shell

#!/usr/bin/perl -w

use strict;
open( FASTA, $ARGV[0] ) or die "Cannot open fasta file: $!\n";
my $total = 0; #create a counter set to zero
my $name = ''; #to record the name of the current sequence
while( <FASTA> )
{
        chomp;
        if( $_ =~ />/ )
        {
        if( $total > 0 )
        {
        print "$name $total \n";
        }
        $total = 0; #reset the counter
        $name = $_;
}
else
{
$total = $total + length( $_ ); #add to the length
}
}
close( FASTA );

I saved the above code in a text file named count.pl an run the next code

for file in *.fna; do ./count.pl $file; done

I'd like to do these two task in one alone, but I don't know how to do it.

sequence • 2.2k views

ADD COMMENT • link updated 7.1 years ago by JC 13k • written 7.1 years ago by erick_rc93 ▴ 30

0

Entering edit mode

Looks like an assignment exercise. Is it one, erick_rc93?

ADD REPLY • link 7.1 years ago by Ram 45k

0

Entering edit mode

Your Perl one-liner doesn't really count the number of occurrences of word, it counts the numbers of lines where word occurred. Test example, save it as test.fasta:

>seq1
ATCGCCCCCCCCCCCCCCATCG

The pattern ATCG occurs twice, but it will be counted only once.

ADD REPLY • link 7.1 years ago by h.mon 35k

score 2 · Answer 1 · 2018-07-23

There are several ways to do this, also there are unknown conditions, do you need all words including overlapped words? Do you care for words in the reverse-complement chain? The simplest solution I can think is to split your sequences and count array elements:

#!/usr/bin/perl

use strict;
use warnings;

my $word = "ATG"; # here you define the word to count, non-overlaps in forward chain are considered only
$/ = "\n>"; # reads sequences in Fasta by blocks
while (<>) {
    s/>//g;
    my ($id, @seq) = split (/\n/, $_);
    my $seq = join "", @seq; # reconstruct sequence in one single string
    my @subseq = split (/$word/, $seq);
    print "sequence $id has $#subseq occurences of $word\n";
}