Fasta Format Sequence Parsing With Bioperl
2
1
Entering edit mode
13.4 years ago
Raghul ▴ 200

Hi I am having a fasta format sequence but the header line is not like NCBI description(please see below).

>contig07303  length=480  numreads=24  gene=isogroup00008  status=isotig
CACTATCCTcTTCATTTAGGTTTTTGATAtCTTGAATTACTTTCTtCATTTTTCTAGTAG...................

I used a program to extract sequences with length >500 for "normal" fasta sequences (from NCBI) using Bioperl Bio::SeqIO & it worked. The same program when used modifying “keyword" to retrieve sequences with numreads >500, it is not working. Is it possible to do with Bio::SeqIO module with the sequences having the above FASTA descriptions or Do I have to create different code without Bio::SeqIO module.

Thank you very much for replying raghul

Previous code given by neilfws

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

my $seqin  = Bio::SeqIO->new(-file => "myfile.fa",      -format => "fasta");
my $seqout = Bio::SeqIO->new(-file => ">myfile_100.fa", -format => "fasta");
while(my $seq = $seqin->next_seq) {
  if($seq->length <= 100) {
    $seqout->write_seq($seq);
  }
}
fasta parsing bioperl • 11k views
ADD COMMENT
0
Entering edit mode

Assuming that the lack of ">" in your example fasta file is just a typo can you clarify this somewhat? Your pasted code example doesn't match what you are trying to do when you say modifying "keyword" to retrieve sequences with numthreads > 500. What exactly are you trying to do and what is the code? Question just isn't very clear.

ADD REPLY
0
Entering edit mode

Yes, I misread this question too; it's not clear how the header looks because the sequence may not be formatted properly as shown here (that is, does it start with ">") ?

ADD REPLY
0
Entering edit mode

Yes, the sequence starts with > The keyword here I mean the description in FASTA header line (sorry for the confusion).The above program detects sequences less than (<)100 base in length. When I changed the symbol to > to identify sequences with greater than 100 nt in length, I got wrong answer. The output also included sequences with length less than 100 in length. Thank you for replying raghul

ADD REPLY
1
Entering edit mode
13.4 years ago
Neilfws 49k

EDIT: This is a new answer which assumes that your sequence header line begins with ">"

If I understand your question correctly, you are trying to extract the part of the header line which reads numreads=NN and determine whether NN > 500.

The answer is yes, this can be done using Bio::SeqIO. The portion of the header line directly following the ">" is the sequence ID. Anything after the first space is the sequence description. Bio::SeqIO has a method to retrieve the description, like so:

my $desc = $seq->desc;

So you need to get $desc, then use a regular expression for numreads:

if($seq->desc =~/\s+numreads=(\d+)\s+/) {
  my $numreads = $1;
  $seqout->write_seq($seq) if $numreads > 500;
}

A thorough read of the Bio::SeqIO documentation will help you answer many of these "variations on the same theme" problems.

ADD COMMENT
0
Entering edit mode

The input is a fasta file with thousands of sequences. The numreads values for sequences range from 1 to 1000. I want to extract sequences with "numreads" values greater than 500.

I am reading this BioSeqIO documentation. Is there a textbook for bioperl just like James Tisdall book for Perl Thank you very much for the reply raghul

ADD REPLY
0
Entering edit mode

There's a very old Bioperl book and newer Bioinformatics/Perl books with sections on Bioperl - go and search Amazon for "Bioperl". However, the Bioperl website documentation is better and more up to date.

ADD REPLY
0
Entering edit mode

I am getting the following error, can you say what it means.

C:Perl>numreads.pl Replacement list is longer than search list at C:/Perl/site/lib/Bio/Range.pm line 251.

Thank you for patience I am giving the altered code below

!/usr/bin/perl -w

use strict; use Bio::SeqIO;

my $desc = $seq->desc;

my $seqin=Bio::SeqIO->new(-file =>"num.txt", -format => "fasta"); my $seqout=Bio::SeqIO->new(-file =>">readoutput.txt", -format => "fasta"); while(my $seq=$seqin->next_seq) { if($seq->desc =~/s+numreads=(d+)s+/) { my $numreads=$1; $seqout->write_seq($seq) if $numreads > 500; } }

ADD REPLY
0
Entering edit mode

I am getting the following error, can you say what it means. C:Perl>numreads.pl Replacement list is longer than search list at C:/Perl/site/lib/Bio/Range.pm line 251.

Thank you for the patience I am giving the altered code below

!/usr/bin/perl -w

use strict; use Bio::SeqIO;

my $desc = $seq->desc;

my $seqin=Bio::SeqIO->new(-file =>"num.txt", -format => "fasta"); my $seqout=Bio::SeqIO->new(-file =>">readoutput.txt", -format => "fasta"); while(my $seq=$seqin->next_seq) { if($seq->desc =~/s+numreads=(d+)s+/) { my $numreads=$1; $seqout->write_seq($seq) if $numreads > 500; } }

ADD REPLY
0
Entering edit mode

I am getting the following error, can you say what it means. C:Perl>numreads.pl Replacement list is longer than search list at C:/Perl/site/lib/Bio/Range.pm line 251.

Thank you for the patience I am giving the altered code below

!/usr/bin/perl -w

use strict; use Bio::SeqIO;

my $desc = $seq->desc;

my $seqin=Bio::SeqIO->new(-file =>"num.txt", -format => "fasta"); my $seqout=Bio::SeqIO->new(-file =>">readoutput.txt", -format => "fasta"); while(my $seq=$seqin->next_seq) { if($seq->desc =~/s+numreads=(d+)s+/) { my $numreads=$1; $seqout->write_seq($seq) if $numreads > 500; }

ADD REPLY
0
Entering edit mode

Well, my Google search for that message (always a good first step) suggests that the warning can be ignored and is fixed in the latest bioperl-live from github. Did the script generate the expected output file? Also, pasting code in comments doesn't work too well; try something like pastebin - http://pastebin.com/.

ADD REPLY
0
Entering edit mode

Yes I got the output.The output available appears correct. All the output sequence is as asked,ie.>500 nucleotide in length. But I cannot understand this. thank you raghul

ADD REPLY
1
Entering edit mode
13.4 years ago
Dvs ▴ 20

It's easier to use inplace editing sed mode.

Furthermore, '1s' command would add '>' at the beginning of the first line only (if you have only one sequence):

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

`sed -i '1s/^/>/' myfile.fa`;
my $seqin  = Bio::SeqIO->new(-file => "myfile.fa",      -format => "fasta");
my $seqout = Bio::SeqIO->new(-file => ">myfile_100.fa", -format => "fasta");
while(my $seq = $seqin->next_seq) {
  if($seq->length <= 100) {
    $seqout->write_seq($seq);
  }
}
ADD COMMENT

Login before adding your answer.

Traffic: 1608 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6