How To Parse Fasta Files In Perl
4
7
Entering edit mode
14.5 years ago
nikulina ▴ 300

Dear colleagues! I have a file with lots of sequences in FASTA format. I want to write a perl script to analyze each sequence (to count the length of certain fragment). So, how can I manage to treat each sequence as a variable? Should I use an array to read my file?

So, here is my script. It might be not very nice, but it works. I would like to modify it in order to work with FASTA data.

$string_filename = 'file.txt';  

open(FILE, $string_filename);  

@array = FILE;     

close FILE;  

foreach $string(@array) {
    $R = length $string;  
    if ( $string =~ /ggc/ ) {   
        $M = $';   
        $W = length $M;  
        if ( $string =~ /atg/ ) {   
            $K = $`;   
            $Z = length $K;  
            $x = $W + $Z - $R;     
            print " \n\ the distance is the following: \n\n ";  
            print $x;  
        } else {  
            print "\n\ I couldn\'t find the start codone.\n\n";  
        }  
    } else {  
        print "\n\ I couldn\'t find the binding site.\n\n"; }  
}  
exit;

I will be grateful for your help :)

fasta perl • 35k views
ADD COMMENT
0
Entering edit mode

Could you also show us an example sequence for which this code works? If the code is supposed to do what I think it is supposed to do, I think there may be quite a few problems with it.

ADD REPLY
0
Entering edit mode

Are you really sure none of to 10000 topics about how to parse file XXX did match your needs?

ADD REPLY
21
Entering edit mode
14.5 years ago
Neilfws 49k

First, there is no need to reinvent the wheel. As Stefano wrote, Bioperl will parse fasta sequences for you and do a whole lot more besides. Once installed, it is as simple as:

use Bio::SeqIO;
my $seqio = Bio::SeqIO->new(-file => "file.fa", '-format' => 'Fasta');
while(my $seq = $seqio->next_seq) {
    my $string = $seq->seq;
    # do stuff with $string
}

Second, there are some issues with your code. It should be "@array = <file>" - although as Stefano points out, you should not read the whole file into an array.

So far as I can tell, you are trying to find sub-sequences which begin "atg" and end with "ggc". Some other issues with your code:

  1. It seems to assume that there is only one each of "atg" and "ggc", because you use if() to match the regular expressions, not while().
  2. It returns negative values for length of the sub-sequence. Is this what you want? It is unclear whether you are looking for "atg" which lie upstream of "ggc" or whether they can be at any position in the sequence.
  3. It looks as though you are looking for start codons. There may be alternatives to atg: gtg or ttg.
  4. Your regular expressions are case-sensitive and would miss, for example, ATG.

Assuming that you are trying to find the region atg -> ggc, you could try something like:

while(my $string =~/atg(.*)ggc/gi) {
    # do something with match
    # e.g. match start = $-[0]+1, match end = $+[0]
}

That example uses the special Perl variables @- and @+ to get match positions, but Bioperl will also provide you with plenty of methods for analysing sub-sequences.

ADD COMMENT
0
Entering edit mode

Thank you for your attention to my question. In fact I would like to find the distance between binding sites for RNAP II and start of transcription in certain human genes. So, I used some sample motifs ('ggc' and 'atg') in my perl script, just to make the task easier and to test how it works in this simple variant. The binding site is situated before the start codone, that's why i didn't take into consideration those variants, where the first motif is situated after the second. Once more, thank you for your help.

ADD REPLY
7
Entering edit mode
14.5 years ago

If you are planning to read and manipulate a lot of files with fasta sequences, do it properly. Use Bioperl. It make life easier (see an example here ). It takes some time to set it up and learn the "philosophy" behind, but then you can do much more: read from NCBI/EMBL, read/write to different formats... all with the same interface. Already debugged for you.

Also, if you use big files, don't do this:

open(FILE, $string_filename);  
@array = FILE;

It will load the whole file in memory. Nowadays fasta files might be Huuuuuge.

ADD COMMENT
0
Entering edit mode

Thank you! indeed i recognise that my variant is not very convinient and consumes lots of memory. I'll try to examine BIOperl and use it for futher tasks.

ADD REPLY
4
Entering edit mode
14.5 years ago
Hanif Khalak ★ 1.3k

Along the lines of answers to this question, you can read/process one FASTA sequence at a time. I'd modify your code like this:

$string_filename = 'file.txt';  
open(FILE, $string_filename) || die("Couldn't read file $string_filename\n");  

local $/ = "\n>";  # read by FASTA record

while (my $seq = <>) {
chomp $seq;
    $seq =~ s/^>*.+\n//;  # remove FASTA header
    $seq =~ s/\n//g;  # remove endlines

    $R = length $seq;
    if ( $seq =~ /ggc/ ) {   
        $M = $';
        $W = length $M;
        if ( $seq =~ /atg/ ) {   
            $K = $`;   
            $Z = length $K;  
            $x = $W + $Z - $R;     
            print "\n\ the distance is the following: $x\n\n";
        } else {  
            print "\n\ I couldn't find the start codon.\n\n";
        }
    } else {  
        print "\n\ I couldn't find the binding site.\n\n"; }  
    }

}  # end while

close FILE;  

exit;
ADD COMMENT
0
Entering edit mode

Thank you! it works!

ADD REPLY
0
Entering edit mode
13.0 years ago
Tarah • 0

Can I please add a question to this? What if you want to remove that string/sequence that you are looking for? I have a control phage in my illumina data that I want to remove, but am having a hard time finding out how to do this. Thanks so much!

ADD COMMENT
1
Entering edit mode

Tarah, please delete this and post as a separate question.

ADD REPLY

Login before adding your answer.

Traffic: 2283 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6