Perl How To Isolate Fasta Sequences With A Specific Keyword
13.6 years ago
Raghul ▴ 200

Hi all, I have a file with lots of sequences, but I want to extract only the sequences with the keyword "FULL-LENGTH". I don't want sequences with the keyword NON-FULL-LENGTH. My text file has 8,000 sequences split about equally between these two keywords. Can anybody suggest a Perl program for this problem?

>isotig07104 FULL-LENGTH (BLAST)
GGTGAGTACTAAATTATaCGAAAGATTGAaGTCCAGTTATAGCTCTGCCT
ATAaTTAAAGCATGAATATCGTGAGTTCCTTCGTATGTGTTTACAGTTTC

>isotig07106 NON-FULL-LENGTH (BLAST)
TTAGCATATTCTAtCTTTTTtAGAcTAAGGAAaGATGGAAgTGtAaTtAA
aGAATTTGAaCCAAAAATTCATAGAtCTGTtATTAAGTCATGTGCTAAaT

Thank you, Raghul

Hi neilfws and all others, thanks for the response. I wrote the code below; please correct it. Can anyone suggest tutorial links or textbooks for Bioperl?

#!/usr/bin/perl -w
use strict;

use Bio::SeqIO;

# Input FASTA file and output file for the filtered records
my $seqin = Bio::SeqIO->new(-file => "euplotes.txt", -format => "fasta");

my $seqout = Bio::SeqIO->new(-file => ">outfile.txt", -format => "fasta");

while (my $seq = $seqin->next_seq) {
  # desc is the header text after the first space
  if ($seq->desc =~ /^FULL-LENGTH\s+/) {
    $seqout->write_seq($seq);
  }
}

perl fasta sequence retrieval • 3.8k views

Your code looks fine. As for Bioperl, everything you need is right there on the website - http://www.bioperl.org/wiki/Main_Page. Link to tutorials - http://www.bioperl.org/wiki/Tutorials.

13.6 years ago

My answer uses awk, not Perl. If a line starting with '>' contains "NON-FULL-LENGTH", the sequence lines that follow it are not printed.

cat biostar7114.fasta | awk '/^>/   {
    ok=(index($0,"NON-FULL-LENGTH")==0);
    if(ok) print $0;
    next;
    }
    {
    if(ok) print $0;
    }'
>isotig07104 FULL-LENGTH (BLAST)
GGTGAGTACTAAATTATaCGAAAGATTGAaGTCCAGTTATAGCTCTGCCT
ATAaTTAAAGCATGAATATCGTGAGTTCCTTCGTATGTGTTTACAGTTTC
13.6 years ago
Neilfws 49k

You can use my answer to your previous question as a starting point.

In this case, the Bioperl method to use is $seq->desc. In your sample sequence fasta header:

>isotig07104 FULL-LENGTH (BLAST)

The description is everything after the first space. Since you are looking for descriptions that begin with "FULL-LENGTH", something like:

if ($seq->desc =~ /^FULL-LENGTH\s+/) {
  # write sequence to file as per my previous answer
}

should work.

I suggest investing some effort in learning at least a few of the Bioperl methods (and regular expressions); it will make solving these "variations on a theme" problems very easy indeed.
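
To see what the regular expression does without installing Bioperl, here is a minimal sketch in plain Perl. It assumes the description is whatever follows the first run of whitespace in the header, which is how the answer above describes it; is_full_length is a hypothetical helper name, not a Bioperl method.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of Bioperl): true when the description
# part of a FASTA header line begins with FULL-LENGTH.
sub is_full_length {
    my ($header) = @_;
    # Drop the leading '>' and split off the ID (first token);
    # the rest of the line is treated as the description.
    my (undef, $desc) = split /\s+/, substr($header, 1), 2;
    return (defined($desc) && $desc =~ /^FULL-LENGTH\s+/) ? 1 : 0;
}

for my $h ('>isotig07104 FULL-LENGTH (BLAST)',
           '>isotig07106 NON-FULL-LENGTH (BLAST)') {
    printf "%s => %s\n", $h, is_full_length($h) ? 'keep' : 'skip';
}
```

The ^ anchor is what rejects NON-FULL-LENGTH: that description contains the text FULL-LENGTH but does not begin with it, so the first header is reported as keep and the second as skip.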


I did as you suggested and am giving the code below; please correct it.

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;
my $seqin  = Bio::SeqIO->new(-file => "inputfile.txt",    -format => "fasta");
my $seqout = Bio::SeqIO->new(-file => ">full_length.txt", -format => "fasta");
while (my $seq = $seqin->next_seq) {
  # keep only records whose description starts with FULL-LENGTH
  if ($seq->desc =~ /^FULL-LENGTH\s+/) {
    $seqout->write_seq($seq);
  }
}

13.6 years ago
Echo ▴ 70

Obviously, Bioperl is highly recommended, but the script below also works when the records are separated by blank lines, as in your sample.

# Read one paragraph at a time; you may want to look up $/ in perldoc perlvar.

    $/ = "";
    while (<>) {
        unless (/NON-FULL-LENGTH/) {
            print;
        }
    }

AFAIK, it won't work. Test your script with a set of sequences in FASTA format.
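
The likely problem is that paragraph mode ($/ = "") only splits on blank lines, so a FASTA file with no blank lines between records is read as one big chunk. A rough sketch of one way around that, assuming every record starts with '>' at the beginning of a line, is to set $/ to "\n>" instead. The helper name full_length_records and the test data are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read FASTA records from a filehandle and return
# those whose header line does not contain NON-FULL-LENGTH.
sub full_length_records {
    my ($fh) = @_;
    local $/ = "\n>";                    # one FASTA record per read
    my @keep;
    while (my $rec = <$fh>) {
        chomp $rec;                      # strips the trailing "\n>"
        $rec =~ s/^>//;                  # the first record still has its '>'
        my ($header) = $rec =~ /^([^\n]*)/;
        next if $header =~ /NON-FULL-LENGTH/;
        $rec =~ s/\s+\z//;               # drop trailing newlines or blank lines
        push @keep, ">$rec\n";
    }
    return @keep;
}

# Made-up records with no blank lines between them, read from memory
my $fasta = ">s1 FULL-LENGTH (BLAST)\nACGT\nTTGA\n"
          . ">s2 NON-FULL-LENGTH (BLAST)\nGGCC\n"
          . ">s3 FULL-LENGTH (BLAST)\nCCAA\n";
open my $fh, '<', \$fasta or die "cannot open in-memory file: $!";
print full_length_records($fh);
close $fh;
```

On the made-up input this prints the s1 and s3 records and drops s2. To run it on a real file, open that file instead of the in-memory string.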
