Question

How To Parse Fasta Files In Perl

14.9 years ago

nikulina ▴ 300

Dear colleagues! I have a file with lots of sequences in FASTA format. I want to write a Perl script to analyze each sequence (to count the length of a certain fragment). How can I treat each sequence as a variable? Should I use an array to read my file?

So, here is my script. It might not be very nice, but it works. I would like to modify it to work with FASTA data.

$string_filename = 'file.txt';  

open(FILE, $string_filename);  

@array = FILE;     

close FILE;  

foreach $string(@array) {
    $R = length $string;  
    if ( $string =~ /ggc/ ) {   
        $M = $';   
        $W = length $M;  
        if ( $string =~ /atg/ ) {   
            $K = $`;   
            $Z = length $K;  
            $x = $W + $Z - $R;     
            print " \n\ the distance is the following: \n\n ";  
            print $x;  
        } else {  
            print "\n\ I couldn\'t find the start codone.\n\n";  
        }  
    } else {  
        print "\n\ I couldn\'t find the binding site.\n\n"; }  
}  
exit;

I will be grateful for your help :)

fasta perl • 36k views

ADD COMMENT • link updated 6.6 years ago by Ram 45k • written 14.9 years ago by nikulina ▴ 300

Could you also show us an example sequence for which this code works? If the code is supposed to do what I think it is supposed to do, I think there may be quite a few problems with it.

ADD REPLY • link 14.9 years ago by Neilfws 49k

Are you really sure that none of the 10000 topics about how to parse file XXX matched your needs?

ADD REPLY • link 13.4 years ago by Fabian Bull ★ 1.3k

Ram · Answer 1 · 2010-06-11

First, there is no need to reinvent the wheel. As Stefano wrote, BioPerl will parse FASTA sequences for you and do a whole lot more besides. Once installed, it is as simple as:

use Bio::SeqIO;
my $seqio = Bio::SeqIO->new(-file => "file.fa", '-format' => 'Fasta');
while(my $seq = $seqio->next_seq) {
    my $string = $seq->seq;
    # do stuff with $string
}

Second, there are some issues with your code. It should be "@array = <FILE>" (angle brackets around the filehandle, which is named FILE), although as Stefano points out, you should not read the whole file into an array.

So far as I can tell, you are trying to find sub-sequences which begin "atg" and end with "ggc". Some other issues with your code:

It seems to assume that there is only one each of "atg" and "ggc", because you use if() to match the regular expressions, not while().
It returns negative values for length of the sub-sequence. Is this what you want? It is unclear whether you are looking for "atg" which lie upstream of "ggc" or whether they can be at any position in the sequence.
It looks as though you are looking for start codons. There may be alternatives to atg: gtg or ttg.
Your regular expressions are case-sensitive and would miss, for example, ATG.

Assuming that you are trying to find the region atg -> ggc, you could try something like:

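# $string must already hold the sequence; writing "my $string" here would
# declare a fresh, empty variable and the match would never succeed.
# .*? is non-greedy, so each match stops at the nearest downstream ggc.
while ($string =~ /atg(.*?)ggc/gi) {
    # do something with match
    # e.g. match start = $-[0]+1, match end = $+[0]
}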

That example uses the special Perl variables @- and @+ to get match positions, but Bioperl will also provide you with plenty of methods for analysing sub-sequences.
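To make the points above concrete, here is a minimal sketch of collecting every start-codon-to-ggc region with @- and @+. The helper name `find_regions`, the choice of the nearest downstream "ggc" as the end point, and the example sequence are all assumptions for illustration, not something taken from the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: return [start, end, length] (1-based, inclusive) for each
# region that begins at a start codon and ends at the nearest downstream "ggc".
sub find_regions {
    my ($string) = @_;
    my @regions;
    # /i ignores case; (?:atg|gtg|ttg) also accepts the alternative start codons;
    # .*? is non-greedy, so each match stops at the first "ggc" after the codon.
    while ($string =~ /(?:atg|gtg|ttg).*?ggc/gi) {
        my $start = $-[0] + 1;   # $-[0] is the 0-based offset where the match starts
        my $end   = $+[0];       # $+[0] is the offset just past the end of the match
        push @regions, [$start, $end, $end - $start + 1];
    }
    return @regions;
}

my $example = "ccATGaaattGGCttatgcccggcaa";   # made-up sequence
for my $r (find_regions($example)) {
    printf "region %d-%d, length %d\n", @$r;
}
```

Matches found this way do not overlap: after one region is reported, the search resumes just past its closing "ggc". If overlapping or nested regions matter for your analysis, this approach would need adjusting.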

Ram · Answer 2 · 2010-06-11

14.9 years ago

Stefano Berri 4.4k

If you are planning to read and manipulate a lot of files with FASTA sequences, do it properly: use BioPerl. It makes life easier (see an example here). It takes some time to set up and to learn the "philosophy" behind it, but then you can do much more: read from NCBI/EMBL, read and write different formats... all with the same interface, and already debugged for you.

Also, if you use big files, don't do this:

open(FILE, $string_filename);  
@array = FILE;

It will load the whole file into memory. Nowadays FASTA files might be Huuuuuge.
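If you want to avoid loading the whole file without installing anything, one possible approach is to read line by line and hand over each record as soon as it is complete. This is only a sketch under assumptions: the helper name `each_fasta_record` and the callback interface are made up for illustration, and the demo reads from an in-memory string rather than a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: call $callback->($header, $sequence) once per FASTA record,
# keeping only the current record in memory.
sub each_fasta_record {
    my ($fh, $callback) = @_;
    my ($header, $seq);
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;                  # strip Unix or Windows line ending
        if ($line =~ /^>(.*)/) {               # a header line starts a new record
            my $next_header = $1;
            $callback->($header, $seq) if defined $header;   # emit previous record
            ($header, $seq) = ($next_header, '');
        } elsif (defined $header) {
            $seq .= $line;                     # a sequence may span several lines
        }
    }
    $callback->($header, $seq) if defined $header;           # emit the last record
    return;
}

# Demo on an in-memory file handle; for a real file use open(my $fh, '<', 'file.fa').
my $fasta = ">seq1 example\nacgtatg\nggcaa\n>seq2\nttgacc\n";
open(my $fh, '<', \$fasta) or die "Cannot open in-memory file: $!";
each_fasta_record($fh, sub {
    my ($header, $seq) = @_;
    printf "%s\t%d\n", $header, length $seq;   # header and sequence length
});
close $fh;
```

Memory use then depends on the longest single sequence rather than on the size of the file.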

ADD COMMENT • link updated 6.6 years ago by Ram 45k • written 14.9 years ago by Stefano Berri 4.4k

Thank you! Indeed, I recognise that my variant is not very convenient and consumes lots of memory. I'll try to look into BioPerl and use it for further tasks.

ADD REPLY • link 14.9 years ago by nikulina ▴ 300

Ram · Answer 3 · 2010-06-11

Along the lines of answers to this question, you can read/process one FASTA sequence at a time. I'd modify your code like this:

$string_filename = 'file.txt';
open(FILE, '<', $string_filename) || die("Couldn't read file $string_filename\n");

local $/ = "\n>";  # read by FASTA record

while (my $seq = <FILE>) {  # read from the opened filehandle, not from STDIN
    chomp $seq;  # strip the trailing "\n>" record separator
    $seq =~ s/^>*.+\n//;  # remove FASTA header
    $seq =~ s/\n//g;  # remove endlines

    $R = length $seq;
    if ( $seq =~ /ggc/ ) {
        $M = $';  # everything after the first ggc
        $W = length $M;
        if ( $seq =~ /atg/ ) {
            $K = $`;  # everything before the first atg
            $Z = length $K;
            $x = $W + $Z - $R;
            print "\n the distance is the following: $x\n\n";
        } else {
            print "\n I couldn't find the start codon.\n\n";
        }
    } else {
        print "\n I couldn't find the binding site.\n\n";
    }

}  # end while

close FILE;

exit;

score 0 · Answer 4 · 2011-12-05

13.4 years ago

Tarah • 0

Can I please add a question to this? What if you want to remove the string/sequence that you are looking for? I have a control phage in my Illumina data that I want to remove, but I am having a hard time finding out how to do this. Thanks so much!