Question

Parsing Genbank Format In Bioperl.

0

12.6 years ago

Daniel ★ 4.0k

I am attempting (with BioPerl) to extract the JOURNAL field from a set of GenBank records, but I can't find a list of the method names that are used, i.e.

while (my $seq = $in->next_seq() ) {
        print $seq->accession . "\n";
}

prints the accession number,

while (my $seq = $in->next_seq() ) {
        print $seq->desc . "\n";
}

prints the description, and

while (my $seq = $in->next_seq() ) {
        print $seq->seq . "\n";
}

prints the gene sequence, and so on.

This has just been gleaned from the BioPerl site and other questions, as I can't find a reference for the whole scheme. Can anyone point me in the right direction? Unfortunately, http://www.bioperl.org/wiki/Module:Bio::SeqIO::genbank is a dead end.

Thanks

For reference:

LOCUS       JQ354682                1420 bp    DNA     linear   PLN 01-JAN-2013
DEFINITION  Gomphonema clevei strain TCC507 ribulose-1,5-bisphosphate
            carboxylase/oxygenase large subunit (rbcL) gene, partial cds;
            chloroplast.
ACCESSION   JQ354682
VERSION     JQ354682.1  GI:410947001
KEYWORDS    .
SOURCE      chloroplast Gomphonema clevei
  ORGANISM  Gomphonema clevei
            Eukaryota; Stramenopiles; Bacillariophyta; Bacillariophyceae;
            Bacillariophycidae; Cymbellales; Gomphonemataceae; Gomphonema.
REFERENCE   1  (bases 1 to 1420)
  AUTHORS   Kermarrec,L., Bouchez,A., Rimet,F. and Humbert,J.-F.
  TITLE     Using a polyphasic approach to explore the diversity and
            geographical distribution of the Gomphonema parvulum (Kutzing)
            Kutzing complex (Bacillariophyta)
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 1420)
  AUTHORS   Kermarrec,L., Bouchez,A., Rimet,F. and Humbert,J.-F.
  TITLE     Direct Submission
  JOURNAL   Submitted (05-JAN-2012) Asconit Consultants, 3 bld de Clairfont
            Bat. G, Toulouges F-66350, France
FEATURES             Location/Qualifiers
     source          1..1420
                     /organism="Gomphonema clevei"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /strain="TCC507"
                     /isolation_source="river"
                     /db_xref="taxon:1223578"
                     /country="Mayotte"
                     /collection_date="20-Apr-2009"
     gene            <1..>1420
                     /gene="rbcL"
     CDS             <1..>1420
                     /gene="rbcL"
                     /codon_start=1
                     /transl_table=11
                     /product="ribulose-1,5-bisphosphate carboxylase/oxygenase
                     large subunit"
                     /protein_id="AFV95053.1"
                     /db_xref="GI:410947002"
                     /translation="DRYESGVIPYAKMGYWDASYAVKTTDVLALFRITPQPGVDPVEA
                     AAAVAGESSTATWTVVWTDLLTACDRYRAKAYRVDPVPNTTDQFFAFIAYECDLFEEG

bioperl genbank • 8.7k views

ADD COMMENT • link updated 12.6 years ago by Ryan Dale 5.0k • written 12.6 years ago by Daniel ★ 4.0k

0

Hi Daniel, did you find something here that works? I am looking to extract the references from GenBank files as well. I have read all the links here but have not managed to write a Perl script that works. Thanks!

ADD REPLY • link 10.1 years ago by kbrann3 • 0

0

Please read the first answer by Ryan and comments underneath for the solution.

ADD REPLY • link 10.1 years ago by Neilfws 49k

0

Hello Neilfws. I have looked at those references and it is not straightforward for me. I have the following code that gives me one reference title, but my genbank file has many sequences.

#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

my $io = Bio::SeqIO->new(-file => "sequence.gb", -format => "genbank" );
my $seq_obj = $io->next_seq();
my $anno_collection = $seq_obj->annotation;

for my $key ( $anno_collection->get_all_annotation_keys ) {
    my @annotations = $anno_collection->get_Annotations($key);
    for my $value ( @annotations ) {
        if ($value->tagname eq "reference") {
            print "title: ",$value->title(), "\n";
        }
    }
}
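
A minimal sketch of one way to cover every record rather than only the first: call next_seq in a while loop and collect the titles for each sequence object it returns. Here `reference_titles` is a hypothetical helper name (not a BioPerl function), the file name is taken from the command line as an assumption, and `'reference'` is the same annotation key the code above tests for.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): return the title of every
# REFERENCE block attached to one sequence object.
sub reference_titles {
    my ($seq) = @_;
    # Ask the annotation collection directly for the 'reference' entries
    # instead of looping over every annotation key.
    return map { $_->title() // '(no title)' }
        $seq->annotation->get_Annotations('reference');
}

# Runs only when a file name is given, e.g.: perl titles.pl sequence.gb
if (@ARGV) {
    require Bio::SeqIO;    # loaded here, only when a file is actually parsed
    my $io = Bio::SeqIO->new(-file => $ARGV[0], -format => 'genbank');

    # next_seq returns one record per call and undef at the end of the file,
    # so the loop visits every sequence in a multi-record file.
    while (my $seq_obj = $io->next_seq) {
        for my $title (reference_titles($seq_obj)) {
            print $seq_obj->accession_number, "\ttitle: $title\n";
        }
    }
}
```

The only structural change from the code above is that the single next_seq call becomes the loop condition; everything inside the loop body is per-record.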

ADD REPLY • link updated 5.8 years ago by Ram 45k • written 10.1 years ago by kbrann3 • 0

2

12.6 years ago

Istvan Albert 103k

This is a more complex topic that you will need to spend some time with. Find a good guide on BioPerl in general via Google.

For example, I found this good chapter: Beginning Perl for Bioinformatics: Genbank


1

The book just instructed me to parse the file in standard Perl, which would have been my default anyway. I was trying to use this opportunity to learn more about BioPerl, and it was just a single parameter that was catching me out. Found it now, though.

ADD REPLY • link 12.6 years ago by Daniel ★ 4.0k

1

My mistake: on cursory examination I thought it used BioPerl, but it seems to use its own custom module, BeginPerlBioinfo.

ADD REPLY • link 12.6 years ago by Istvan Albert 103k

score 4 · Accepted Answer · 2013-01-17

4

12.6 years ago

Ryan Dale 5.0k

Searching CPAN for "genbank" finds more detailed Bio::SeqIO::genbank docs, and near the end of those is a link to a how-to on feature annotation (presumably the JOURNAL field is treated as an annotation).


1

And indeed, extracting the REFERENCE section (to which JOURNAL belongs) is right there in the feature annotation how-to. Just search the page for the phrase "Some Annotation objects, like Reference".
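
For the JOURNAL line specifically, here is a minimal sketch, assuming (as I understand the GenBank parser to behave) that the JOURNAL text ends up in the Reference object's location field. `reference_fields` is a hypothetical helper name and the command-line file name is an assumption; `authors`, `title` and `location` are accessors on Bio::Annotation::Reference.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): one hash per REFERENCE block
# of a sequence object, with the JOURNAL text stored under 'journal'.
sub reference_fields {
    my ($seq) = @_;
    return map {
        +{
            # scalar() keeps the key/value pairs aligned even if an
            # accessor were to return an empty list
            authors => scalar $_->authors(),
            title   => scalar $_->title(),
            journal => scalar $_->location(),
        }
    } $seq->annotation->get_Annotations('reference');
}

# Runs only when a file name is given, e.g.: perl journals.pl sequence.gb
if (@ARGV) {
    require Bio::SeqIO;    # loaded here, only when a file is actually parsed
    my $in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'genbank');

    while (my $seq = $in->next_seq) {
        for my $ref (reference_fields($seq)) {
            # One tab-separated line per reference: accession, then journal
            print join("\t", $seq->accession_number, $ref->{journal} // ''), "\n";
        }
    }
}
```

A record with two REFERENCE blocks, like the one in the question, would produce two lines, one per JOURNAL entry.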

ADD REPLY • link 12.6 years ago by Neilfws 49k

1

That's great, thanks. Found what I needed. The link to the CPAN Bio::SeqIO::genbank page from the wiki doesn't work, so I went searching in a different direction. If I had found that, I think I would have been sorted from the outset.

ADD REPLY • link 12.6 years ago by Daniel ★ 4.0k