count word ocurrences in fasta file
1
0
Entering edit mode
6.3 years ago
erick_rc93 ▴ 30

I would like to do a perl script which can count all word ocurrences in all multifasta files in a directory and the output looks like

Number of occurrences in >name_sequence is : #

I have tried with

for file in *.fasta; do perl -lne 'if($_ =~ /word/) { $a++;} ; END { print "No of words in the file:"; print $a}' $file; done

I also would like that the script count the sequences length, I put the next script in a for loop in shell

#!/usr/bin/perl -w

use strict;
open( FASTA, $ARGV[0] ) or die "Cannot open fasta file: $!\n";
my $total = 0; #create a counter set to zero
my $name = ''; #to record the name of the current sequence
while( <FASTA> )
{
        chomp;
        if( $_ =~ />/ )
        {
        if( $total > 0 )
        {
        print "$name $total \n";
        }
        $total = 0; #reset the counter
        $name = $_;
}
else
{
$total = $total + length( $_ ); #add to the length
}
}
close( FASTA );

I saved the above code in a text file named count.pl an run the next code

for file in *.fna; do ./count.pl $file; done

I'd like to do these two task in one alone, but I don't know how to do it.

sequence • 2.0k views
ADD COMMENT
0
Entering edit mode

Looks like an assignment exercise. Is it one, erick_rc93?

ADD REPLY
0
Entering edit mode

Your Perl one-liner doesn't really count the number of occurrences of word, it counts the numbers of lines where word occurred. Test example, save it as test.fasta:

>seq1
ATCGCCCCCCCCCCCCCCATCG

The pattern ATCG occurs twice, but it will be counted only once.

ADD REPLY
2
Entering edit mode
6.3 years ago
JC 13k

There are several ways to do this, also there are unknown conditions, do you need all words including overlapped words? Do you care for words in the reverse-complement chain? The simplest solution I can think is to split your sequences and count array elements:

#!/usr/bin/perl

use strict;
use warnings;

my $word = "ATG"; # here you define the word to count, non-overlaps in forward chain are considered only
$/ = "\n>"; # reads sequences in Fasta by blocks
while (<>) {
    s/>//g;
    my ($id, @seq) = split (/\n/, $_);
    my $seq = join "", @seq; # reconstruct sequence in one single string
    my @subseq = split (/$word/, $seq);
    print "sequence $id has $#subseq occurences of $word\n";
}
ADD COMMENT

Login before adding your answer.

Traffic: 1814 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6