Help With Perl Script: Store Fasta Sequences Into A Hash.
3
0
Entering edit mode
10.9 years ago
biolab ★ 1.4k

Hi everyone I am working on a fasta file. I want to format it to a hash(SeqID as the key and Sequence as the value). I write a script, but somewhere is wrong. Could you please indicate for me? Thank you very much!

#!/bin/perl
use strict;
use warnings;

open IN, $ARGV[0];
while (<>){

$_ =~ s/[\r\n]/\t/g;  ##replace all newlines with tabs

my @a;
my %h; 
@a = split (/\t/, $_);  ##change to array

my $i;                  ##change array to hash
for ($i=0, $i<=$#a/2, $i++){
    my $id = shift @a;
    my $seq = shift @a;
    $h{$id} = $seq;
}
}
close IN;
perl • 9.9k views
ADD COMMENT
6
Entering edit mode
10.9 years ago
Varun Gupta ★ 1.3k
    #!usr/bin/perl
    use strict;
    use warnings;

    my %id2seq = ();
    my $id = '';
    open F,"test.fa",or die $!;
    while(<F>){
        chomp;
        if($_ =~ /^>(.+)/){
            $id = $1;
        }else{
            $id2seq{$id} .= $_;
        }
    }
close F;

Hope this helps. You can then use the foreach loop to loop over the keys of the hash and manipulate the sequences associated with each id accordingly.

Varun

ADD COMMENT
0
Entering edit mode

Really helpful again! Thanks!

ADD REPLY
5
Entering edit mode
10.9 years ago
Neilfws 49k

Bioperl provides libraries for sequence parsing so as you don't have to write them. Have a look at Bio::SeqIO.

use strict;
use Bio::SeqIO;

my %sequences;
my $seqio = Bio::SeqIO->new(-file => "myfastafile.fa", -format => "fasta");
while(my$seqobj = $seqio->next_seq) {
    my $id  = $seqobj->display_id;    # there's your key
    my $seq = $seqobj->seq;           # and there's your value
    $sequences{$id} = $seq;
}
ADD COMMENT
1
Entering edit mode

-format => "fasta" is programmatically unnecessary here, since Bio::SeqIO 'knows' the format by the file's extension.

Why not just:

while ( my $seqobj = $seqio->next_seq ) {
    $sequences{ $seqobj->display_id } = $seqobj->seq;
}

since the object's methods are self-documenting?

ADD REPLY
1
Entering edit mode

Because being explicit is helpful for beginners.

ADD REPLY
2
Entering edit mode
10.9 years ago
Kenosis ★ 1.3k

Since the sequences can be multi-line, you can set Perl's record separator to '>', so one fasta record at a time is read. Then, a regex can be used to capture the id and seq: the id is all before the first space and the seq is all after the first newline. At this point, you can add the id/seq pair as a key/val pair in a hash, where $1 is the id and $2 is the seq:

use strict;
use warnings;

my %h;
local $/ = '>';

while (<>) {
    chomp;
    /(\w+).+?\n(.+)/s and $h{$1} = $2 or next;
}

Hope this helps!

ADD COMMENT
0
Entering edit mode

It really helps! Thank you very much!

ADD REPLY
0
Entering edit mode

You're most welcome!

ADD REPLY

Login before adding your answer.

Traffic: 1565 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6