Need Help Writing A Perl Script For Extracting Sequences Based On A List
5
1
Entering edit mode
11.0 years ago
biolab ★ 1.4k

Dear All, I am a true starter of perl. I have finished reading several chapters of LEARNING PERL. Now I tried to write script. SEQ file contains multiple gene sequences, while LIST file is a list of gene names. Below is an example.

SEQ file:

>gene 1
gctagtcagtacgtacgtac
>gene 2
cgagagagaggccgcgagcatcagtacgtgtac
>gene 3
ccgcggggcggccgcggcccccgcggcgcgcgatta

LIST file:

>gene2
>gene3

My script is as follows:

#!/usr/bin/perl
use strict;

my ($seq, $list) =@ARGV;
open SEQ, "$seq";
open LIST, "$list";
my @seq= <SEQ>;
my @seq = split;
my @list= <LIST>;
my @list = split;

my $i=0;
my $j=0;
while ($i<$#seq){
    while ($j < $#list){
        if ($seq[$i] eq $list[$j]){
            print "$seq[$i] is found\n";
        }
        $j++;
    }
    $i +=2;
}
close SEQ;
close LIST;

Could anyone working on perl make some corrections, suggestions and comments. I appreciate your help, which I believe drive me forward in the bioinformatics field. THANKS A LOT!

perl • 6.7k views
ADD COMMENT
1
Entering edit mode

if you are working on a unix machine and your 'list' file isn't too long you should probably go with:

grep -f list.txt sequences.fa

if your list-file is really large have a look at Kent tools and there the progamme faSomeRecords.

Cheers

ADD REPLY
0
Entering edit mode

THANKS for handy method!

ADD REPLY
0
Entering edit mode

no worries... ;)

ADD REPLY
0
Entering edit mode

I think you should mentioned what do you want your script to do.

ADD REPLY
0
Entering edit mode

Sorry for unclear description. I aim to extract sequences from SEQ file according to LIST file. In my script the line print "$seq[$i] is found\n" should be print "$seq[$i] ($seq[$i+1]) is found\n". Anyway, my script is wrong somewhere, as when running perl test.pl seq.fa list.txt, I didn't get any responding. THANKS for your help!

ADD REPLY
1
Entering edit mode

First ask yourself, what does this do :

my @list= <LIST>; 
my @list = split;

What do you hope to achieve ? Tip, use the line "use Data::Dumper;" and print a variable using print Dumper($variable); to start debugging. I will not give you the answer since I feel this is a learning opportunity for you.

ADD REPLY
3
Entering edit mode
11.0 years ago
Kenosis ★ 1.3k

In a comment, you said, "I aim to extract sequences from SEQ file according to LIST file." Given your list and sequence IDs, you'll never find a match, since the list doesn't contain any spaces, but the sequence IDs do.

Here are just a few items for you to consider:

Always:

use strict;
use warnings;

When opening files, use the three-argument form of open, e.g.,

open my $seq, '<', $seq or die "Unable to open $seq: $!";

Note that this form of open uses lexical varibles (e.g., my $seq) for file handles and not bare words (e.g., SEQ and LIST). This form also traps errors.

It's not necessary to read both files into an array; no arrays are necessary here. In fact, in this case, a hash whose keys are the list items would work better:

  1. Create a hash where the keys are the list items. Use chomp on each line read to remove the trailing EOL character.
  2. Read through the seq file, line-by-line, and solate the seq ID. If that seq ID exists within the list hash, you've got a match, so print the seq.

split is not needed for your task.

Hope this helps!

ADD COMMENT
2
Entering edit mode
11.0 years ago

In case it's helpful, here's how I would solve this. Feel free to ask questions about anything that doesn't make sense :)

use v5.10.0;
use strict;
use warnings;

# checks that you have input files
unless ($ARGV[0] and $ARGV[1]) {
    die "Missing input file\nUsage: $0 list.txt seq.fa\n";
}

# opens the list.txt file and stores the names in a hash
my %interestingnames;
open FILE, $ARGV[0];
while (my $line = <FILE>) {
    chomp $line;
    $interestingnames{$line}++;
}
close FILE;

# opens the fasta file
# goes through line by line checking for a match
my $name;
my $seq;
open FILE1, $ARGV[1];
while (my $line = <FILE1>) {
    chomp $line;

    #checks whether it's a header line
    if ($line =~ />(.+)/) {
        # stores the regular expression match as a name
        $name = $1; 
    } else {
        $seq = $line;

        #checks whether name exists in list file, by checking hash keys
        if(exists($interestingnames{$name})) {
            print ">$name\n";
            print "$seq\n";
        }
    }
}
close FILE1;
ADD COMMENT
1
Entering edit mode

Hi, Jelena Aleksic, THANKS a lot for your script. It's much better than my script. THANKS again.

ADD REPLY
0
Entering edit mode
11.0 years ago

One advantage to using Perl (or Python, or R) is having pre-defined packages for bioinformatics analysis: You should definitely become familiar with BioPerl:

http://www.bioperl.org/wiki/Main_Page

http://www.bioperl.org/wiki/Getting_Started

For example, there is a module that is specifically designed for parsing FASTA files:

http://www.bioperl.org/wiki/HOWTO:Beginners#Retrieving_a_sequence_from_a_file

You just need to selectively print the values from the keys matching your list to a new file.

ADD COMMENT
1
Entering edit mode

Excellent suggestion to use bioperl! But the OP first needs to get a handle (no pun intended) on some Perl basics as a foundation for using such modules. This is scope and sequence in pedagogical terms, and the OP did say, "I am a true starter of perl..." Additionally, bioperl will not work on the OP's list file, but only on the fasta-type file. Even then, though, no match will be found between the two files, given the OP's datasets.

ADD REPLY
0
Entering edit mode
11.0 years ago
biolab ★ 1.4k

Dear all, Thank you for your answers. I still faced trouble with the script to extract seqeunces based on a list. An example is shown below.

FASTA FILE:
>gene1
cgcggccccccccccccccccccc
>gene2
ggggggggggggggggggcgcgcg
>gene3
tagctacgatagcgtcagatgcaacg
>gene4
attttttttttttttttttttttttttttaaaaaaa

LIST FILE:

>gene4
>gene3

MY SCRIPT: #!/usr/bin/perl use strict; use warnings;

open SEQ, "seq.txt";
my @seq=<SEQ>;
open LIST, "list.txt";
my @list=<LIST>;

my $i=0;
while ($i<$#seq){
    my $j=0;
    while ($j<$#list){
        if ($seq[$i] eq $list[$j]){
            print "$seq[$i]\t$seq[$i+1]";
        }
        $j++;
    }
    $i +=2;
}
close SEQ;
close LIST;

THE OUTPUT RESULT:

>gene4
attttttttttttttttttttttttttttaaaaaaa

Where is the error in my script? I have two names in the LIST file, but only got one sequence. THANKS A LOT!!

ADD COMMENT
0
Entering edit mode

Formatting is really important for writing scripts that you and other people can read and debug. While Perl lets you format your scripts however you like, following best practices is still recommended. These are:

  • One command per line
  • Use whitespace for clarity, to separate different sections of your code.
  • Always, always indent loops. Always.
  • Your aim should not be to write a script as short as possible - it should be to write a script that's as humanly readable as possible. This includes giving variables informative names, and leaving comments about what different sections of your code are doing.

This sort of one line script is not particularly human-readable, therefore finding errors in it is basically impossible. If you post a friendly, formatted version, I'd be happy to have a look :)

ADD REPLY
0
Entering edit mode

Though this could just be a formatting issue on the forum rather than in your script :)

ADD REPLY
1
Entering edit mode

I've fixed that for biolab, though he/she should really not post that as an answer.

ADD REPLY
0
Entering edit mode

THANKS for advices. Actually when coping my original formatted script here, its format changed. Anyway I will try to sort out, as I am a new member in Biostars.

ADD REPLY
0
Entering edit mode

No worries, the formatting takes some time to get used to!

ADD REPLY
0
Entering edit mode
11.0 years ago
emazep ▴ 30

Assuming that your files are always as shown (no extraneous lines from your examples), a more concise and faster (and probably more readable too) script is possible, which is maybe also interesting as learning material since it uses a couple of useful Perl features:

#!/usr/bin/perl
use strict;
use warnings;

my $listfile = shift;
my %wanted_seq;

{
    local @ARGV = $listfile;
    while ( my $name = <> ) {
        chomp $name;
        $wanted_seq{$name} = 1 unless exists $wanted_seq{$name}
    }
}

while ( my $name = <> ) {
    chomp $name; my $seq = <>;
    print $name, "\n", $seq if $wanted_seq{$name}
}

Usage:

perl script.pl listfile seqfile > outfile

A couple of general advices, besides the ones already offered by others:

  • unless you need sophisticate error checking, don't explicitly open files in your script, pass instead the filenames on the command line and let Perl open them for you: this way your script can be (more easily) combined with other commands;
  • avoid slurping files into arrays: it's less efficient both in time and in space - really in this particular case it's better to keep listfile in memory (inside %wanted_seq in my implementation), but seqfile can be examined one line (well, two lines) at a time, with potentially considerable savings;
  • when iterating over arrays, foreach is nearly always preferable over while (or C-style for loops).

Good work!

ADD COMMENT
0
Entering edit mode

Helpful advices for beginners! THANKS!

ADD REPLY
0
Entering edit mode

You're welcome! Once you are comfortable with the previous solution, consider that, since we've decided to keep listfile in memory, it can be further reduced to:

#!/usr/bin/perl
use strict;
use warnings;

my $listfile = shift;
my %wanted_seq;

{ local @ARGV = $listfile; @wanted_seq{ map { chomp; $_ } <> } = () }

while ( my $name = <> ) {
    chomp $name; my $seq = <>;
    print $name, "\n", $seq if exists $wanted_seq{$name}
}

which takes advantage of a couple of further tricks ;-) (Don't hesitate to ask for any clarification, should you need it).

To learn Perl, this is one of the most useful on-line resources: http://perl-tutorial.org/

Welcome to Perl!

ADD REPLY

Login before adding your answer.

Traffic: 990 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6