Parsing A Genbank Record With Bioperl
1
0
Entering edit mode
11.1 years ago
robjohn7000 ▴ 110

Hi,

I would be happy if someone could help me get the following code to output the sequences (including the ids) of all records in the input file (http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk):

use Bio::SeqIO;
use Bio::Seq;
use Bio::DB::GenBank;

$seq_obj = Bio::SeqIO->new(-file => "ls_orchid.gbk" , '-format' => 'genbank');
$seqio_obj2 = Bio::SeqIO->new(-file => '>sequence.fasta', -format => 'genbank' );

while ( my $seq = $seq_obj->next_seq() ) {
print "Sequence ",$seq->id ($seq_obj)"\n";
#print $seq_obj->seq,"\n";
$seqio_obj2->write_seq($seq_obj);
}

I would like to see an output consisting of purely sequences (preferably with headers) as follows:

seq1:
ATTCGCTGCATGACAT..........
Seq2:
ACTGCGATGGATGGAT..

Thank you

bioperl perl biopython python parsing • 5.0k views
ADD COMMENT
2
Entering edit mode
11.1 years ago
Kenosis ★ 1.3k

You were very close:

use strict;
use warnings;
use Bio::SeqIO;

my $seq_obj = Bio::SeqIO->new( -file => "ls_orchid.gbk", '-format' => 'genbank' );

while ( my $seq = $seq_obj->next_seq() ) {
    print 'ID/Desc: ', $seq->id, ' ', $seq->desc, "\n";
    print 'Seq: ', $seq->seq, "\n";
}

Sample output using the data you've linked to:

ID/Desc: Z78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA.
Seq: CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAGAATATATGATCGAGTGAATCTGGAGGACCTGTGGTAACTCAGCTCGTCGTGGCACTGCTTTTGTCGTGACCCTGCTTTGTTGTTGGGCCTCCTCAAGAGCTTTCATGGCAGGTTTGAACTTTAGTACGGTGCAGTTTGCGCCAAGTCATATAAAGCATCACTGATGAATGACATTATTGTCAGAAAAAATCAGAGGGGCAGTATGCTACTGAGCATGCCAGTGAATTTTTATGACTCTCGCAACGGATATCTTGGCTCTAACATCGATGAAGAACGCAGCTAAATGCGATAAGTGGTGTGAATTGCAGAATCCCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCTCGAGGCCATCAGGCTAAGGGCACGCCTGCCTGGGCGTCGTGTGTTGCGTCTCTCCTACCAATGCTTGCTTGGCATATCGCTAAGCTGGCATTATACGGATGTGAATGATTGGCCCCTTGTGCCTAGGTGCGGTGGGTCTAAGGATTGTTGCTTTGATGGGTAGGAATGTGGCACGAGGTGGAGAATGCTAACAGTCATAAGGCTGCTATTTGAATCCCCCATGTTGTTGTATTTTTTCGAACCTACACAAGAACCTAATTGAACCCCAATGGAGCTAAAATAACCATTGGGCAGTTGATTTCCATTCAGATGCGACCCCAGGTCAGGCGGGGCCACCCGCTGAGTTGAGGC

Hope this helps!

ADD COMMENT
0
Entering edit mode

That worked brilliantly! But I ran into a problem when I used the code to a multi-record file that I concatenated by myself. The output contained only the ID/Desc and Seq: lines. I'm not sure the best way to upload the file, but the error message pointed to every line with '\\' in the file (e.g., line 853 below): ID/Desc: NZ_ACVQ01000001 Campylobacter showae RM3277 contig00122, whole genome shotgun sequence. Use of uninitialized value in print at script.pl line 10, <gen0> line 853. Seq:

ADD REPLY
0
Entering edit mode

Am glad this worked for you. Am not sure I understand you correctly, but did you attempt to parse a file (that you created) with the above script and it threw an error?

ADD REPLY
0
Entering edit mode

Yes, It threw an error with my new file and poited to the EOL (\\). I simply used cat function in linux to concatenate the genbanlk files. Thanks.

Here's the link to my file (org.rar): http://wikisend.com/download/642792/org.rar

ADD REPLY
0
Entering edit mode

Hi Kenosis, this is the first script in perl that I have tried that is successful in extracting info from genbank files. I am having a hard time understanding the tutorials from bioperl seqio. Do you know how to extract the references (really I just need a title) in a multi-file genbank file? Thank you!

ADD REPLY

Login before adding your answer.

Traffic: 1460 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6