Question

Help With Perl Script: Store Fasta Sequences Into A Hash.

0

11.3 years ago

biolab ★ 1.4k

Hi everyone, I am working with a FASTA file. I want to load it into a hash (SeqID as the key and sequence as the value). I wrote a script, but something in it is wrong. Could you please point out the problem? Thank you very much!

#!/bin/perl
use strict;
use warnings;

open IN, $ARGV[0];
while (<>){

$_ =~ s/[\r\n]/\t/g;  ##replace all newlines with tabs

my @a;
my %h; 
@a = split (/\t/, $_);  ##change to array

my $i;                  ##change array to hash
for ($i=0, $i<=$#a/2, $i++){
    my $id = shift @a;
    my $seq = shift @a;
    $h{$id} = $seq;
}
}
close IN;

perl • 10k views

ADD COMMENT • link updated 11.3 years ago by Neilfws 49k • written 11.3 years ago by biolab ★ 1.4k

score 6 · Answer 1 · 2014-01-17

11.3 years ago

Varun Gupta ★ 1.3k

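    #!/usr/bin/perl
    use strict;
    use warnings;

    my %id2seq = ();
    my $id = '';
    open(my $fh, '<', 'test.fa') or die "Cannot open test.fa: $!";
    while (<$fh>) {
        chomp;
        if ($_ =~ /^>(.+)/) {
            $id = $1;              # header line: remember the current ID
        } else {
            $id2seq{$id} .= $_;    # sequence line: append to the current record
        }
    }
    close $fh;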

Hope this helps. You can then use a foreach loop to iterate over the keys of the hash and work with the sequence associated with each ID.
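As a minimal sketch of such a loop: the hash contents below are made-up example data standing in for the hash built from the file, and `report_lengths` is just an illustrative helper name. It reports the length of each sequence, but any per-sequence processing could go in the loop body.

```perl
use strict;
use warnings;

# Hypothetical example data standing in for the %id2seq hash built from the file.
my %id2seq = (
    seq1 => 'ATGCGT',
    seq2 => 'GGCCTTAA',
);

# Build one "ID<TAB>length" line per sequence.
sub report_lengths {
    my ($seqs) = @_;
    my @lines;
    foreach my $id (sort keys %$seqs) {    # sort the keys for a stable order
        my $seq = $seqs->{$id};
        push @lines, sprintf("%s\t%d", $id, length $seq);
    }
    return @lines;
}

print "$_\n" for report_lengths(\%id2seq);
```

Hash keys come back in no particular order, so sorting them (or keeping a separate array of IDs) is worth doing whenever the output order matters.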

Varun

ADD COMMENT • link 11.3 years ago by Varun Gupta ★ 1.3k

0

Really helpful again! Thanks!

ADD REPLY • link 11.3 years ago by biolab ★ 1.4k

score 5 · Answer 2 · 2014-01-17

11.3 years ago

Neilfws 49k

BioPerl provides libraries for sequence parsing, so you don't have to write your own. Have a look at Bio::SeqIO.

use strict;
use warnings;
use Bio::SeqIO;

my %sequences;
my $seqio = Bio::SeqIO->new(-file => "myfastafile.fa", -format => "fasta");
while (my $seqobj = $seqio->next_seq) {
    my $id  = $seqobj->display_id;    # there's your key
    my $seq = $seqobj->seq;           # and there's your value
    $sequences{$id} = $seq;
}

ADD COMMENT • link 11.3 years ago by Neilfws 49k

1

-format => "fasta" is not strictly necessary here, since Bio::SeqIO guesses the format from the file's extension.

Why not just:

while ( my $seqobj = $seqio->next_seq ) {
    $sequences{ $seqobj->display_id } = $seqobj->seq;
}

since the object's methods are self-documenting?

ADD REPLY • link 11.3 years ago by Kenosis ★ 1.3k

1

Because being explicit is helpful for beginners.

ADD REPLY • link 11.3 years ago by Neilfws 49k

score 2 · Answer 3 · 2014-01-16

11.3 years ago

Kenosis ★ 1.3k

Since sequences can span multiple lines, you can set Perl's input record separator ($/) to '>', so that one FASTA record is read at a time. A regex can then capture the ID and the sequence: the ID is everything before the first space, and the sequence is everything after the first newline. At that point, you can add the pair to a hash as a key/value pair, where $1 is the ID and $2 is the sequence:

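use strict;
use warnings;

my %h;
local $/ = '>';    # read one FASTA record per iteration

while (<>) {
    chomp;                              # strip the trailing '>' separator
    /^(\S+)[^\n]*\n(.+)/s or next;      # $1 = ID, $2 = sequence; skip the empty first chunk
    ( $h{$1} = $2 ) =~ s/\s+//g;        # store the sequence without line breaks
}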
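
To see what the loop receives on each iteration, and why it needs the `or next`, here is a small sketch that reads FASTA text from an in-memory filehandle with `$/` set to '>'. The helper name `split_records` and the sample text are illustrative assumptions, not part of the script above.

```perl
use strict;
use warnings;

# Split FASTA text into chunks using '>' as the input record separator.
sub split_records {
    my ($text) = @_;
    local $/ = '>';                      # change is limited to this sub's scope
    open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";
    my @chunks;
    while (my $chunk = <$fh>) {
        chomp $chunk;                    # chomp removes a trailing '>' (the current $/)
        push @chunks, $chunk;
    }
    close $fh;
    return @chunks;
}

my @chunks = split_records(">seq1 demo\nACGT\nTTGA\n>seq2\nGGCC\n");
for my $i (0 .. $#chunks) {
    (my $shown = $chunks[$i]) =~ s/\n/\\n/g;    # make newlines visible
    print "$i: [$shown]\n";
}
```

The first chunk is empty because the file starts with '>', so there is nothing in it for the regex to match. Each later chunk holds the header line followed by the sequence lines, with their newlines still embedded.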

Hope this helps!

ADD COMMENT • link 11.3 years ago by Kenosis ★ 1.3k

0

It really helps! Thank you very much!

ADD REPLY • link 11.3 years ago by biolab ★ 1.4k

0

You're most welcome!

ADD REPLY • link 11.3 years ago by Kenosis ★ 1.3k