Question

Perl How To Isolate Fasta Sequences With A Specific Keyword

14.1 years ago

Raghul ▴ 200

Hi all, I have a file with lots of sequences, but I want to extract only the sequences with the keyword "FULL-LENGTH". I don't want sequences with the keyword "NON-FULL-LENGTH". My text file has 8,000 sequences, split equally between these two keywords. Can anybody suggest a Perl program for this problem?

>isotig07104 FULL-LENGTH (BLAST)
GGTGAGTACTAAATTATaCGAAAGATTGAaGTCCAGTTATAGCTCTGCCT
ATAaTTAAAGCATGAATATCGTGAGTTCCTTCGTATGTGTTTACAGTTTC

>isotig07106 NON-FULL-LENGTH (BLAST)
TTAGCATATTCTAtCTTTTTtAGAcTAAGGAAaGATGGAAgTGtAaTtAA
aGAATTTGAaCCAAAAATTCATAGAtCTGTtATTAAGTCATGTGCTAAaT

Thank you, Raghul

Hi to neilfws and all others. Thanks for the response. I wrote the code below; please correct it. Can you suggest tutorial links or textbooks for BioPerl?

#!/usr/bin/perl -w
use strict;

use Bio::SeqIO;

my $seqin = Bio::SeqIO->new(-file => "euplotes.txt", -format => "fasta");

my $seqout = Bio::SeqIO->new(-file => ">outfile.txt", -format => "fasta");

# Write out only records whose description starts with FULL-LENGTH.
while (my $seq = $seqin->next_seq) {
    if ($seq->desc =~ /^FULL-LENGTH\s+/) {
        $seqout->write_seq($seq);
    }
}

perl fasta sequence retrieval • 4.1k views

ADD COMMENT • link updated 14.1 years ago by Echo ▴ 70 • written 14.1 years ago by Raghul ▴ 200

Hi to neilfws and all others. Thanks for the response. I wrote the code below; please correct it. Can you suggest tutorial links or textbooks on BioPerl?

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

my $seqin  = Bio::SeqIO->new(-file => "euplotes.txt", -format => "fasta");
my $seqout = Bio::SeqIO->new(-file => ">outfile.txt", -format => "fasta");
while (my $seq = $seqin->next_seq) {
    if ($seq->desc =~ /^FULL-LENGTH\s+/) {
        $seqout->write_seq($seq);
    }
}

ADD REPLY • link 14.1 years ago by Raghul ▴ 200

Your code looks fine. As for Bioperl, everything you need is right there on the website - http://www.bioperl.org/wiki/Main_Page. Link to tutorials - http://www.bioperl.org/wiki/Tutorials.

ADD REPLY • link 14.1 years ago by Neilfws 49k

score 2 · Answer 1 · 2011-04-03

My answer uses awk, not Perl. If a line starting with '>' contains "NON-FULL-LENGTH", then neither that header nor the sequence lines that follow it are printed.

cat biostar7114.fasta | awk '/^>/   {
    ok=(index($0,"NON-FULL-LENGTH")==0);
    if(ok) print $0;
    next;
    }
    {
    if(ok) print $0;
    }'
>isotig07104 FULL-LENGTH (BLAST)
GGTGAGTACTAAATTATaCGAAAGATTGAaGTCCAGTTATAGCTCTGCCT
ATAaTTAAAGCATGAATATCGTGAGTTCCTTCGTATGTGTTTACAGTTTC
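
Since the question asked for Perl, the same flag-based logic might look roughly like this. It is only a sketch: `filter_fasta` is a made-up helper name, and the demo reads the two sample records from an in-memory string rather than from a file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper mirroring the awk logic: a header line sets a flag,
# and every line is printed only while that flag is true.
sub filter_fasta {
    my ($in, $out) = @_;
    my $ok = 0;
    while (my $line = <$in>) {
        # On a header line, decide whether to keep the record that follows.
        $ok = index($line, 'NON-FULL-LENGTH') == -1 if $line =~ /^>/;
        print {$out} $line if $ok;
    }
    return;
}

# Demo input: the two sample records from the question.
my $fasta = <<'END';
>isotig07104 FULL-LENGTH (BLAST)
GGTGAGTACTAAATTATaCGAAAGATTGAaGTCCAGTTATAGCTCTGCCT
ATAaTTAAAGCATGAATATCGTGAGTTCCTTCGTATGTGTTTACAGTTTC

>isotig07106 NON-FULL-LENGTH (BLAST)
TTAGCATATTCTAtCTTTTTtAGAcTAAGGAAaGATGGAAgTGtAaTtAA
aGAATTTGAaCCAAAAATTCATAGAtCTGTtATTAAGTCATGTGCTAAaT
END

open my $in, '<', \$fasta or die "Cannot open in-memory input: $!";
filter_fasta($in, \*STDOUT);
close $in;
```

To run it on a real file, one could open that file for reading and pass the resulting filehandle as the first argument instead.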

Ram · Answer 2 · 2011-04-03

1

14.1 years ago

Neilfws 49k

You can use my answer to your previous question as a starting point.

In this case, the Bioperl method to use is $seq->desc. In your sample sequence fasta header:

>isotig07104 FULL-LENGTH (BLAST)

The description is everything after the first space. Since you are looking for descriptions that begin with the words "FULL-LENGTH", then something like:

if ($seq->desc =~ /^FULL-LENGTH\s+/) {
  # write sequence to file as per my previous answer
}

should work.
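
To see why the anchored pattern separates the two kinds of header, here is a minimal sketch in plain Perl (no BioPerl needed). The helper `split_header` is hypothetical; it only approximates how a FASTA header divides into an ID and a description at the first run of whitespace.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a FASTA header into ID and description.
# This approximates, and is not, BioPerl's own header parsing.
sub split_header {
    my ($header) = @_;
    $header =~ s/^>//;                            # drop the leading '>'
    my ($id, $desc) = split /\s+/, $header, 2;    # split at first whitespace
    return ($id, defined $desc ? $desc : '');
}

for my $header ('>isotig07104 FULL-LENGTH (BLAST)',
                '>isotig07106 NON-FULL-LENGTH (BLAST)') {
    my ($id, $desc) = split_header($header);
    # The ^ anchor means "NON-FULL-LENGTH ..." does not match.
    my $verdict = $desc =~ /^FULL-LENGTH\s+/ ? 'keep' : 'skip';
    print "$id\t$desc\t$verdict\n";
}
```

Without the `^` anchor the pattern would also match inside "NON-FULL-LENGTH", which is the case the question wants to exclude.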

I suggest investing some effort in learning at least a few of the Bioperl methods (and regular expressions); it will make solving these "variations on a theme" problems very easy indeed.

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 14.1 years ago by Neilfws 49k

I did as you suggested and am giving the code below; please correct it.

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;
my $seqin  = Bio::SeqIO->new(-file => "inputfile.txt",      -format => "fasta");
my $seqout = Bio::SeqIO->new(-file => ">full_length.txt", -format => "fasta");
while(my $seq = $seqin->next_seq) {
  if ($seq->desc =~ /^FULL-LENGTH\s+/) {
    $seqout->write_seq($seq);
  }
}

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 14.1 years ago by Raghul ▴ 200

Hi, I wrote the code below; please correct it.

Can you please give me some tutorial links for BioPerl (for beginners) or suggest a textbook?

Thank you.

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;
my $seqin  = Bio::SeqIO->new(-file => "euplotes.txt",      -format => "fasta");
my $seqout = Bio::SeqIO->new(-file => ">outfile.txt", -format => "fasta");
while(my $seq = $seqin->next_seq) 
 {
  if ($seq->desc =~ /^FULL-LENGTH\s+/) {
    $seqout->write_seq($seq);
  }
}

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 14.1 years ago by Raghul ▴ 200

Hi to neilfws & all others

Thanks for the response.

I wrote the code below; please correct it.

Can you suggest tutorial links or textbooks for BioPerl?

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;
my $seqin  = Bio::SeqIO->new(-file => "euplotes.txt",      -format => "fasta");
my $seqout = Bio::SeqIO->new(-file => ">outfile.txt", -format => "fasta");
while(my $seq = $seqin->next_seq)
{
    if ($seq->desc =~ /^FULL-LENGTH\s+/) {
        $seqout->write_seq($seq);
    }
}

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 14.1 years ago by Raghul ▴ 200

score 0 · Answer 3 · 2011-04-03

14.1 years ago

Echo ▴ 70

Obviously, BioPerl is highly recommended, but the script below also works.

# Read one paragraph at a time (records must be separated by blank lines).
# You may want to look up $/ in perldoc perlvar.

    $/ = "";
    while (<>) {
        unless (/NON-FULL-LENGTH/) {
            print;
        }
    }
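
Note that paragraph mode relies on the blank line between records, as in the sample in the question. If a file has no blank lines, one possible variation (not the script above, and only a sketch) is to set `$/` to `"\n>"` so that each read ends just before the next header. The function name `full_length_records` is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the FASTA records whose header line
# does not contain NON-FULL-LENGTH.
sub full_length_records {
    my ($in) = @_;
    local $/ = "\n>";              # each read ends just before the next header
    my @kept;
    while (my $record = <$in>) {
        chomp $record;             # remove the trailing "\n>" separator
        $record =~ s/\s+\z//;      # trim trailing newlines and blank lines
        $record =~ s/^>//;         # the first record still carries its '>'
        next unless length $record;
        my ($header) = split /\n/, $record, 2;
        next if $header =~ /NON-FULL-LENGTH/;
        push @kept, ">$record\n";
    }
    return @kept;
}

# Demo input: sample lines from the question, deliberately written
# without a blank line between the two records.
my $fasta = <<'END';
>isotig07104 FULL-LENGTH (BLAST)
GGTGAGTACTAAATTATaCGAAAGATTGAaGTCCAGTTATAGCTCTGCCT
>isotig07106 NON-FULL-LENGTH (BLAST)
TTAGCATATTCTAtCTTTTTtAGAcTAAGGAAaGATGGAAgTGtAaTtAA
END

open my $in, '<', \$fasta or die "Cannot open in-memory input: $!";
print full_length_records($in);
close $in;
```

Only the header line is tested here, so a sequence line can never trigger the filter by accident.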