Here I go again in my odyssey to master EnsEMBL. After learning how to use the API and the site, I'm having difficulty parsing site-generated flatfiles. I tried BioPerl and Biopython without success. A common error from Python, just to illustrate:
/usr/local/lib/python2.6/dist-packages/Bio/GenBank/Scanner.py:950: UserWarning: Malformed LOCUS line found - is this correct?
LOCUS 11 7208 bp DNA HTG 25-MAR-2011
If I edit the LOCUS line by hand, other parsing errors appear elsewhere. Am I wrong to suppose that these EnsEMBL flatfiles are Bio* parseable? My Biopython was built from the latest source version and my BioPerl came from an Ubuntu package.
The NCBI has a link to an annotated GenBank sample record. The LOCUS line is supposed to contain (after the word LOCUS):
Locus name
Sequence length
Molecule type
GenBank division
Modification date
In your example, it looks as though locus name = 11. This is not a valid accession. In addition, it may be that the parser is reading "11" and "7208" as "11 7208" (sequence length). If you look at the Biopython code, all it does is split the LOCUS line on space and the warning is generated if the resulting array length looks wrong.
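To make the "split on space and check the array length" idea concrete, here is a minimal sketch in plain Python. It is not Biopython's actual Scanner code; the helper name parse_locus_line and the two accepted field layouts (with or without a topology token) are assumptions made for illustration only.

```python
# Illustrative sketch only: this is NOT Biopython's Scanner implementation.
# parse_locus_line is a hypothetical helper name.

def parse_locus_line(line):
    """Split a GenBank LOCUS line on runs of whitespace and label the fields."""
    tokens = line.split()  # no argument: any run of whitespace is one separator
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line: %r" % line)

    fields = tokens[1:]
    # Assumed layouts (token counts after the LOCUS keyword):
    #   6 tokens: name, length, unit, molecule type, division, date
    #   7 tokens: the same, with a topology token before the division
    if len(fields) == 6:
        name, length, unit, molecule, division, date = fields
        topology = None
    elif len(fields) == 7:
        name, length, unit, molecule, topology, division, date = fields
    else:
        raise ValueError("unexpected number of LOCUS fields: %d" % len(fields))

    return {
        "name": name,
        "length": int(length),  # raises ValueError if the length is not numeric
        "unit": unit,
        "molecule": molecule,
        "topology": topology,
        "division": division,
        "date": date,
    }


if __name__ == "__main__":
    record = parse_locus_line("LOCUS 11 7208 bp DNA HTG 25-MAR-2011")
    print(record["name"], record["length"], record["division"])
```

With the line as quoted in the question (spacing collapsed), this sketch finds six fields and treats "11" as the locus name. Whether a real parser complains may depend on the exact spacing in the original file, which is not preserved in the quoted line.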
I'm not sure what the best solution is, other than to try and get an accession/identifier from Ensembl, then use that to get valid GenBank format from another source (e.g. NCBI).
I've noticed this incongruence, but this is just how the file was generated. You can do it yourself too. My steps in the EnsEMBL genome browser: search for MEN1 -> enter the location -> Export data -> Flat File (GenBank) with everything selected. It generates a lot of non-standard annotations/features. With a little effort the features are parseable, but the annotations are another story...
I used the following script to parse a few files generated by Ensembl, without any problems:
use strict;
use warnings;
use Bio::SeqIO;

# Open the Ensembl export, stating the format explicitly instead of relying on the file extension.
my $seqio_object = Bio::SeqIO->new(-file => "MEN1.gb", -format => "genbank");

# Loop over every record in the file.
while (my $seq_object = $seqio_object->next_seq) {
    # Basic identifiers and sequence length.
    my $accession = $seq_object->accession();
    print "$accession\n";
    my $display_id = $seq_object->display_id();
    print "$display_id\n";
    my $length = $seq_object->length();
    print "$length\n";

    # Record-level annotations.
    print "Print sequence object annotation:\n";
    my $anno_collection = $seq_object->annotation;
    my @annotations = $anno_collection->get_Annotations();
    for my $value (@annotations) {
        print "tagname : ", $value->tagname, "\n";
        print " annotation value: ", $value->as_text, "\n";
    }

    # Features, with every tag and tag value.
    print "Print all the data in the features of a Seq object:\n";
    for my $feat_object ($seq_object->get_SeqFeatures) {
        print "primary tag: ", $feat_object->primary_tag, "\n";
        for my $tag ($feat_object->get_all_tags) {
            print " tag: ", $tag, "\n";
            for my $value ($feat_object->get_tag_values($tag)) {
                print " value: ", $value, "\n";
            }
        }
    }
}
If you're still encountering issues with parsing, please email the Ensembl Helpdesk at helpdesk@ensembl.org, attaching your script and a URL for the file you're trying to parse.
I was able to parse from the API just as you demonstrated. But I think that is strange. Why should I construct a Seq/SeqFeature object from scratch when I could (in theory) request one already built? I still haven't been able to understand why EnsEMBL generates non-conforming fields/annotations, as illustrated by the LOCUS case.
I am not 100% sure what you mean by "request one already built". As for the non-conforming records, this is due to the Ensembl dumping code not using offset fields. The only way to code around this is to treat any run of spaces as a field separator, which looks like the way BioPerl parses the records. We will look into these flat-file dumps soon, so your feedback is useful for seeing how we can improve.
Can you include a small code snippet which will retrieve a flatfile, and/or link to an example of a retrieved flatfile? It will help diagnosis if we can look at the file and try parsing it.
I just loaded SeqIO in a Python/Perl shell and tried to parse the file. I was only exploring the flatfile when I noticed this problem.
@Jarretinha, do you still remember (or have you perhaps posted somewhere) the issues you had with non-standard features/annotations in EnsEMBL GenBank files, and your efforts to work around them? I'm asking in the context of the LOCUS header line handling fixed recently in Biopython: https://github.com/biopython/biopython/pull/16