Question

Parsing A Genbank Record With Bioperl

0

11.1 years ago

robjohn7000 ▴ 110

Hi,

I would be happy if someone could help me get the following code to output the sequences (including the ids) of all records in the input file (http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk):

use Bio::SeqIO;
use Bio::Seq;
use Bio::DB::GenBank;

$seq_obj = Bio::SeqIO->new(-file => "ls_orchid.gbk" , '-format' => 'genbank');
$seqio_obj2 = Bio::SeqIO->new(-file => '>sequence.fasta', -format => 'genbank' );

while ( my $seq = $seq_obj->next_seq() ) {
print "Sequence ",$seq->id ($seq_obj)"\n";
#print $seq_obj->seq,"\n";
$seqio_obj2->write_seq($seq_obj);
}

I would like the output to consist purely of sequences (preferably with headers), as follows:

seq1:
ATTCGCTGCATGACAT..........
Seq2:
ACTGCGATGGATGGAT..

Thank you

bioperl perl biopython python parsing • 5.0k views

ADD COMMENT • link updated 2.1 years ago by Ram 44k • written 11.1 years ago by robjohn7000 ▴ 110

Ram · Answer 1 · 2013-11-08

2

11.1 years ago

Kenosis ★ 1.3k

You were very close:

use strict;
use warnings;
use Bio::SeqIO;

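# Open the multi-record GenBank file for reading
my $seq_obj = Bio::SeqIO->new( -file => "ls_orchid.gbk", '-format' => 'genbank' );

# next_seq() returns one record per call and a false value at end of file
while ( my $seq = $seq_obj->next_seq() ) {
    print 'ID/Desc: ', $seq->id, ' ', $seq->desc, "\n";
    print 'Seq: ', $seq->seq, "\n";
}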

Sample output using the data you've linked to:

ID/Desc: Z78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA.
Seq: CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAGAATATATGATCGAGTGAATCTGGAGGACCTGTGGTAACTCAGCTCGTCGTGGCACTGCTTTTGTCGTGACCCTGCTTTGTTGTTGGGCCTCCTCAAGAGCTTTCATGGCAGGTTTGAACTTTAGTACGGTGCAGTTTGCGCCAAGTCATATAAAGCATCACTGATGAATGACATTATTGTCAGAAAAAATCAGAGGGGCAGTATGCTACTGAGCATGCCAGTGAATTTTTATGACTCTCGCAACGGATATCTTGGCTCTAACATCGATGAAGAACGCAGCTAAATGCGATAAGTGGTGTGAATTGCAGAATCCCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCTCGAGGCCATCAGGCTAAGGGCACGCCTGCCTGGGCGTCGTGTGTTGCGTCTCTCCTACCAATGCTTGCTTGGCATATCGCTAAGCTGGCATTATACGGATGTGAATGATTGGCCCCTTGTGCCTAGGTGCGGTGGGTCTAAGGATTGTTGCTTTGATGGGTAGGAATGTGGCACGAGGTGGAGAATGCTAACAGTCATAAGGCTGCTATTTGAATCCCCCATGTTGTTGTATTTTTTCGAACCTACACAAGAACCTAATTGAACCCCAATGGAGCTAAAATAACCATTGGGCAGTTGATTTCCATTCAGATGCGACCCCAGGTCAGGCGGGGCCACCCGCTGAGTTGAGGC
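
If a FASTA file is wanted instead of printed lines, the same loop can hand each record to a second Bio::SeqIO object opened for writing. This is only a minimal sketch: the sub name genbank_to_fasta is made up for illustration, and the file names come from the command line. Compared with the code in the question, the writer's format is 'fasta' rather than 'genbank', and write_seq() receives the record ($seq) rather than the reader object.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Hypothetical helper (name chosen for this sketch): copy every record
# of a GenBank file into a FASTA file.
sub genbank_to_fasta {
    my ( $in_file, $out_file ) = @_;

    # Reader for the multi-record GenBank file
    my $in = Bio::SeqIO->new( -file => $in_file, -format => 'genbank' );

    # Writer; the leading '>' opens the file for writing
    my $out = Bio::SeqIO->new( -file => ">$out_file", -format => 'fasta' );

    my $count = 0;
    while ( my $seq = $in->next_seq() ) {
        $out->write_seq($seq);    # pass the record, not the reader object
        $count++;
    }
    $out->close();                # flush the output file
    return $count;                # number of records written
}

# Runs only when two file names are given, e.g.:
#     perl gb2fasta.pl ls_orchid.gbk sequence.fasta
genbank_to_fasta(@ARGV) if @ARGV == 2;
```

Each record then appears in the output file as a '>' header line (ID and description) followed by its sequence.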

Hope this helps!

ADD COMMENT • link 11.1 years ago by Kenosis ★ 1.3k

0

That worked brilliantly! But I ran into a problem when I ran the code on a multi-record file that I concatenated myself: the output contained only the ID/Desc and Seq: lines, with nothing after Seq:. I'm not sure of the best way to upload the file, but the error message pointed to every line with '\\' in the file (e.g., line 853 below):

ID/Desc: NZ_ACVQ01000001 Campylobacter showae RM3277 contig00122, whole genome shotgun sequence.
Use of uninitialized value in print at script.pl line 10, <gen0> line 853.
Seq:

ADD REPLY • link 11.1 years ago by robjohn7000 ▴ 110
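
A note on that warning: "Use of uninitialized value in print" means one of the printed values was undef. Since the Seq: line comes out empty, the likely culprit is $seq->seq. One possible cause, which is an assumption about this particular file, is that the records carry no sequence data at all (for example, a GenBank entry with a CONTIG line but no ORIGIN block). A small sketch of a guard follows; format_record is a hypothetical helper, not part of BioPerl.

```perl
use strict;
use warnings;

# Hypothetical helper: build the two output lines for one record,
# tolerating a missing description or sequence.
sub format_record {
    my ( $id, $desc, $sequence ) = @_;
    $desc = '' unless defined $desc;
    my $seq_text =
      ( defined $sequence && length $sequence )
      ? $sequence
      : '(no sequence in record)';
    return "ID/Desc: $id $desc\nSeq: $seq_text\n";
}

# Inside the loop from the answer this would be called as:
#     print format_record( $seq->id, $seq->desc, $seq->seq );
# Demonstration with an undefined sequence:
print format_record( 'NZ_ACVQ01000001', 'Campylobacter showae RM3277 contig00122', undef );
```

If the placeholder shows up for every record, it is worth checking whether the source files were downloaded with their sequences included.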

0

I'm glad this worked for you. I'm not sure I understand you correctly, though: did you attempt to parse a file (that you created) with the above script, and it threw an error?

ADD REPLY • link 11.1 years ago by Kenosis ★ 1.3k

0

Yes, it threw an error with my new file and pointed to the EOL (\\). I simply used the cat command in Linux to concatenate the GenBank files. Thanks.

Here's the link to my file (org.rar): http://wikisend.com/download/642792/org.rar

ADD REPLY • link 11.1 years ago by robjohn7000 ▴ 110

0

Hi Kenosis, this is the first Perl script I have tried that successfully extracts info from GenBank files. I am having a hard time understanding the BioPerl SeqIO tutorials. Do you know how to extract the references (really I just need the title) from a multi-record GenBank file? Thank you!

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 9.4 years ago by kbrann3 • 0
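
One possible approach to the reference question, offered as a sketch rather than a definitive recipe: BioPerl's GenBank parser stores each REFERENCE block as a Bio::Annotation::Reference object in the record's annotation collection under the key 'reference', and those objects have a title() method. The sub name reference_titles is made up for this example, and the input file name is taken from the command line.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Hypothetical helper: return "ID<TAB>title" strings for every
# reference title found in a multi-record GenBank file.
sub reference_titles {
    my ($file) = @_;
    my $in = Bio::SeqIO->new( -file => $file, -format => 'genbank' );

    my @rows;
    while ( my $seq = $in->next_seq() ) {
        # REFERENCE blocks are kept as annotations under 'reference'
        for my $ref ( $seq->annotation->get_Annotations('reference') ) {
            my $title = $ref->title;
            next unless defined $title && length $title;
            push @rows, $seq->id . "\t" . $title;
        }
    }
    return @rows;
}

# Runs only when a file name is given, e.g.:
#     perl ref_titles.pl ls_orchid.gbk
print "$_\n" for ( @ARGV ? reference_titles( $ARGV[0] ) : () );
```

Records without a TITLE line in a reference are skipped, so the output can contain fewer lines than there are REFERENCE blocks.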