Question

Problem With Perl Script When Extracting Sequences According To The Ids

2

Entering edit mode

10.8 years ago

biolab ★ 1.4k

Dear All,

I have a Fasta file and a List file. I need to extract sequences according to the IDs in each line of the List file, and output to a file. An examplified output result is shown below (Files out_1 and out_2). I wrote a perl script to do this job (useage: perl extract_seq.pl fastafile listfile), but got error message : Use of uninitialized value in string eq at ... Could you please point out my error and give me some suggesttions? Thank you very much!!

Fasta file

>a1
tggtacgtgtcagtcaacgtgggggggggggggg
>a2
cgtacgtgacgtacactacgtttttttttttttt
>a3
cgctagacgtataaaaatatatattataaaaaaaa
>a4
cccgcgcggcgcgcggcccccccccccccccccccc
>a5
tggtttttttttttttttttttttttttttttttttttt
>a6
cgtacgggggggggggggggggggggggggggggggg
>a7
cgctagacgtataaaaaaaaaaaaaaaaaaaaaaaaa
>a8
cccgcgcattattattatatattatattttattatat

List file

a1    a3    a4
a5    a6    a8

Ideal output files:

out_1

>a1
tggtacgtgtcagtcaacgtgggggggggggggg
>a3
cgctagacgtataaaaatatatattataaaaaaaa
>a4
cccgcgcggcgcgcggcccccccccccccccccccc

out_2

>a5
tggtttttttttttttttttttttttttttttttttttt
>a6
cgtacgggggggggggggggggggggggggggggggg
>a8
cccgcgcattattattatatattatattttattatat

Problematic Perl script

#!/usr/bin/perl
use strict;
use warnings;


#Firstly store Fasta sequences into a hash;
open SEQ, $ARGV[0];
my $id; 
my %seq;  

while (<SEQ>) {
    chomp;
    if (/^>(.+)/){ 
        $id = $1;
        }else{$seq{$id} .= $_;
        }
}
close SEQ;

#Next open LIST file and generate output files according to IDs in each line;
open LIST, $ARGV[1];
my $i;
while (<LIST>){
        $i ++;
        my @a = (split /\t/, $_);
        open OUT, ">> out_$i";  
        foreach (sort keys %seq){
                if($a[0] eq $_){
                        print OUT ">$_\n$seq{$_}\n";
                }elsif($a[1] eq $_){
                        print OUT ">$_\n$seq{$_}\n";
                }elsif($a[2] eq $_){
                        print OUT ">$_\n$seq{$_}\n";
                }
        }  
        close OUT;
}
close LIST;

perl • 6.2k views

ADD COMMENT • link updated 10.8 years ago by Istvan Albert 102k • written 10.8 years ago by biolab ★ 1.4k

3

Entering edit mode

Don't write your own code for basic tasks that have been scripted by so many other people already. Just google your question and you will easily get working solutions

ADD REPLY • link 10.8 years ago by Irsan ★ 7.8k

4

Entering edit mode

Exercising is always good. Why not to appreciate the effort made.

ADD REPLY • link 10.8 years ago by Pavel Senin ★ 1.9k

1

Entering edit mode

OK, I agree. I didn't think of this being an exercise. But sometimes I see people spending 3 days working on a buggy script for a task that can easily be accomplished with well established widely used tools. I think it is just as important to know what is already available.

ADD REPLY • link 10.8 years ago by Irsan ★ 7.8k

1

Entering edit mode

Try writing simple bash loop: for each line split IDs & grep from FASTA file & redirect into specific file.

ADD REPLY • link 10.8 years ago by PoGibas 5.1k

1

Entering edit mode

This site kind of is becoming as site for programming help. Learn debugging your code line for line and try to find the error yourself. A first semester student could see the bug and fix it. This is no scientific question but a programmatic one. If you can't programm, then I suggest to learn it first and dont copy your code in here for someone to help debug your code. Sorry for being rude (and honest)

ADD REPLY • link 10.8 years ago by my_name_is_simon ▴ 10

5

Entering edit mode

while I don't disagree completely, I think in this case, it's a bioinformatics question and the OP has shown his code and what he has tried rather than just asking how to do it without any effort.

ADD REPLY • link 10.8 years ago by brentp 24k

score 3 · Answer 1 · 2014-01-31

First, there are some really good parts of your code.

I'm not going to write the code for you because you are close. The warning you are getting is because those array elements don't contain what you think they do, therefore you're trying to match an array element that was never initialized. Kudos for using use strict; and use warnings;, those will always tell you what is wrong.

Your approach seems fine, but try to think about what is more efficient, storing all the sequences and IDs vs just a list of IDs. One last thing, $_ is the default scalar, and you should not have to explicitly write that. The exception being with map/grep blocks. In my opinion, writing $_ is not expressive and it makes it hard to debug or even understand the code (it's also very ugly).

Here's the bad part:

 open SEQ, $ARGV[0];

Seeing this in code usually indicates someone either learned Perl 15 years ago or is just learning from some very dated materials. That line may be the single worst line of Perl you could write, partly because that typeglob has global scope. If you are not extremely careful, you will clobber your data very fast doing this. I encourage you pick up a modern Perl book and read about open (or on the web, or at the command line: perldoc perlfunc). You'll find the authors of the language routinely say they are trying to "rid the world" of this syntax (quoted from the latest camel book). Aside from possibly ruining your data, you are not checking if there is a file and if can be opened. This may be part of your uninitialized warnings.

I know people that do Perl training for a living and the only reason they would write this is to show you what kind or horrors might be encountered in the wild. In the Perl open tutorial this single argument version of open is demonstrated, followed by the statement, "Why is this here? Someone has to cater to the hysterical porpoises." Without being too technical, they are trying to say this is no longer acceptable, so don't do it. Remember this and it will save you loads of trouble. Don't feel bad though, we all wrote this kind of thing at one point.

score 2 · Answer 2 · 2014-01-31

Perhaps the following will be helpful:

use strict;
use warnings;

my ( %list, %FHs, $id );

while (<>) {
    $list{$_} = "out_$." for split;
    last if eof;
}

local $/ = '>';
while (<>) {
    chomp;
    if ( ($id) = /(.+)/ and exists $list{$id} ) {
        open $FHs{ $list{$id} }, '>', $list{$id} or die $! unless defined $FHs{ $list{$id} };
        print { $FHs{ $list{$id} } } ">$_";
    }
}

Usage: perl extract_seq.pl listfile fastafile

The script first processes the list file, building a hash where the keys are the IDs and the associated values are the file names. Next, it sets Perl's record separator $/ to >, so the fasta file is read in chunks of fasta records. When it's processing the fasta file, it captures the record's ID and checks whether it exists as a key in %list. If there is such a key, it checks whether a file handle (held in %FHs) has been defined for it. If there's no file handle associated with the captured ID, then one's created. Finally, the fasta record is printed to the associated file.

score 1 · Answer 3 · 2014-01-31

1

Entering edit mode

10.8 years ago

brentp 24k

You can accomplish what you're trying to do in perl. Usually, it's better not to read the sequences in to memory. Here's an example that reads the list file into memory and splits the output files according to the line the header appears on in the list file:

open(LIST, $ARGV[1]);

my $i = 0;
my %names;
my %FHS;
while(<LIST>){
    $i++;
    open($FHS{$i}, ">seqs_$i.fa");
    my @names = split(/\s+/, $_);
    $names{$i} = \@names;
}

open(SEQ, $ARGV[0]);
my $id;
while(<SEQ>){
    if (/^>(.+)/){
        $id = $1;
        next;
    } else {
        $seq = $_;
    }
    foreach my $k (keys %names){
        my $fh = $FHS{$k};
        if(grep { $_ eq $id } @{$names{$k}}){
            print $fh ">$id\n$seq";
        } 
    }
}

ADD COMMENT • link 10.8 years ago by brentp 24k

2

Entering edit mode

Always: use strict; use warnings;.
Use the three-argument form of open.
Your script doesn't handle multi-line sequences; the OP's does.
Your script creates n-1 empty files, where n is the number of blank lines in the list file.

How efficient is your algorithm? For each fasta sequence, you're effectively interating through all of the lines of the list file, examining all of the IDs in those lines.

ADD REPLY • link 10.8 years ago by Kenosis ★ 1.3k

0

Entering edit mode

cheers. that'll teach me to mind my manners when answering perl questions. :) though you'd have to have a pretty big list for the "algorithm" to matter.

ADD REPLY • link 10.8 years ago by brentp 24k

0

Entering edit mode

The major issue is that this code will fail silently, so you would never understand why it will not work. You don't have to worry about the first point with modern Perl (these pragmas are implicit with modern web frameworks, object systems, and by specifying a version >= 5.12 will import strictures and anything >= 5.014 (I think) will enable warnings), but often people are using decades old Perls without knowing it. These are therefore best practices. The issue with open() is major (I explain this in my answer), and the only reason it's allowed is because of backwards compatibility (people relying on decades old code).

ADD REPLY • link 10.8 years ago by SES 8.6k

score 1 · Answer 4 · 2014-01-31

1

Entering edit mode

10.8 years ago

Pavel Senin ★ 1.9k

try too change this line

my @a = (split /\t/, $_);

to this

my @a = ( split /\s+/, $_ );

ADD COMMENT • link 10.8 years ago by Pavel Senin ★ 1.9k

1

Entering edit mode

Dear all, Thank you all for correcting and commenting on my script, whatever positively or negatively. I am a true perl beginner. I especially thank Kenosis, SES and brentp for writing new code for me. Cheers!

ADD REPLY • link 10.8 years ago by biolab ★ 1.4k