Question

Problems Parsing Genbank Flatfiles Generated By Ensembl

4

14.4 years ago

Jarretinha 3.5k

Hi BioStar people,

Here I go again in my odyssey to master EnsEMBL. After learning how to use the API and the site, I'm having trouble parsing site-generated flatfiles. I tried BioPerl and Biopython without success. A common error from Python, just to illustrate:

/usr/local/lib/python2.6/dist-packages/Bio/GenBank/Scanner.py:950: UserWarning: Malformed LOCUS line found - is this correct?
LOCUS       11 7208 bp DNA HTG 25-MAR-2011

If I edit the LOCUS line by hand, other parsing errors appear elsewhere. Am I wrong to suppose that these EnsEMBL flatfiles are parseable by the Bio* toolkits? My Biopython was built from the latest source version and my BioPerl came from an Ubuntu package.

ensembl genbank parsing • 7.4k views

ADD COMMENT • link updated 14.4 years ago by Monika Komorowska ▴ 20 • written 14.4 years ago by Jarretinha 3.5k

0

Can you include a small code snippet which will retrieve a flatfile, and/or link to an example of a retrieved flatfile? It will help diagnosis if we can look at the file and try parsing it.

ADD REPLY • link 14.4 years ago by Neilfws 49k

0

I just loaded SeqIO in a Python/Perl shell and tried to parse the file. I was exploring the flatfile when I noticed this problem.

ADD REPLY • link 14.4 years ago by Jarretinha 3.5k

0

@Jarretinha, do you still remember (or have you maybe posted somewhere) the issues you had with non-standard features/annotations in EnsEMBL GenBank files, and your efforts to work around them? I'm asking in the context of the LOCUS header line handling fixed recently in Biopython: https://github.com/biopython/biopython/pull/16

ADD REPLY • link 14.0 years ago by Chronos ▴ 620

Ram · Answer 1 · 2011-03-26

2

14.4 years ago

Neilfws 49k

The NCBI has a link to an annotated GenBank sample record. The LOCUS line is supposed to contain (after the word LOCUS):

Locus name
Sequence length
Molecule type
GenBank division
Modification date

In your example, it looks as though locus name = 11. This is not a valid accession. In addition, it may be that the parser is reading "11" and "7208" together as "11 7208" (sequence length). If you look at the Biopython code, all it does is split the LOCUS line on whitespace, and the warning is generated if the resulting list has an unexpected length.
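
To make the field layout concrete, here is a minimal Python sketch of a whitespace-split check on a LOCUS line. It is not Biopython's actual code: the function name `inspect_locus_line`, the seven-token assumption (the five fields above plus the `LOCUS` keyword and the `bp` unit) and the "purely numeric name" heuristic are all illustrative assumptions.

```python
# Illustrative only: NOT Biopython's parser, just a whitespace-split check.
from typing import Dict, List, Tuple

# Assumed layout: LOCUS, name, length, unit, molecule type, division, date.
# Records that also carry strandedness/topology tokens are not handled here.
EXPECTED_TOKENS = 7


def inspect_locus_line(line: str) -> Tuple[Dict[str, str], List[str]]:
    """Split a LOCUS line on whitespace and list anything that looks off."""
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        return {}, ["line does not start with LOCUS"]
    if len(tokens) != EXPECTED_TOKENS:
        return {}, [f"expected {EXPECTED_TOKENS} tokens, found {len(tokens)}"]

    _, name, length, unit, molecule, division, date = tokens
    fields = {
        "name": name,
        "length": length,
        "unit": unit,
        "molecule": molecule,
        "division": division,
        "date": date,
    }

    problems: List[str] = []
    if not length.isdigit():
        problems.append("sequence length is not an integer")
    if unit not in ("bp", "aa"):
        problems.append("unit is neither 'bp' nor 'aa'")
    # Heuristic (an assumption): a name made only of digits, such as a
    # chromosome number, does not look like an accession.
    if name.isdigit():
        problems.append("locus name is purely numeric")
    return fields, problems


if __name__ == "__main__":
    fields, problems = inspect_locus_line(
        "LOCUS       11 7208 bp DNA HTG 25-MAR-2011"
    )
    print(fields)
    print(problems)
```

For the line in the question, this sketch reports the locus name `11` as the only suspicious field.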

I'm not sure what the best solution is, other than to try and get an accession/identifier from Ensembl, then use that to get valid GenBank format from another source (e.g. NCBI).

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 14.4 years ago by Neilfws 49k

0

I've noticed this inconsistency, but this is just how the file was generated. You can reproduce it yourself. My steps in the EnsEMBL genome browser: search for MEN1 -> enter the location -> Export data -> Flat File (GenBank) with everything selected. This generates a lot of non-standard annotations/features. With a little effort the features are parseable, but the annotations are another story . . .

ADD REPLY • link 14.4 years ago by Jarretinha 3.5k

Ram · Answer 2 · 2011-03-31

I used the following script to parse a few files generated by Ensembl, without any problems:

use strict;
use warnings;
use Bio::SeqIO;

# Open the Ensembl-exported flat file; the format is stated explicitly
# rather than guessed from the file extension.
my $seqio_object = Bio::SeqIO->new(-file => "MEN1.gb", -format => "genbank");

# Loop over every record in the file.
while (my $seq_object = $seqio_object->next_seq) {
   my $accession = $seq_object->accession();
   print "$accession\n";
   my $display_id = $seq_object->display_id();
   print "$display_id\n";
   my $length = $seq_object->length();
   print "$length\n";

   # Record-level annotations (comments, references, and so on).
   print "Print sequence object annotation:\n";
   my $anno_collection = $seq_object->annotation;

   my @annotations = $anno_collection->get_Annotations();
   for my $value ( @annotations ) {
      print "tagname : ", $value->tagname, "\n";
      print "  annotation value: ", $value->as_text, "\n";
   }

   # Every feature, with all of its tags and tag values.
   print "Print all the data in the features of a Seq object:\n";
   for my $feat_object ($seq_object->get_SeqFeatures) {
      print "primary tag: ", $feat_object->primary_tag, "\n";
      for my $tag ($feat_object->get_all_tags) {
         print "  tag: ", $tag, "\n";
         for my $value ($feat_object->get_tag_values($tag)) {
            print "    value: ", $value, "\n";
         }
      }
   }
}

If you're still encountering parsing issues, please email the Ensembl Helpdesk at helpdesk@ensembl.org, attaching your script and a URL to the file you're trying to parse.

Hope this helps

Monika Komorowska
Ensembl Developer