Hello.
I am struggling with some problems reading Genebank files (.gbff) through Bioperl. I am trying to extract CDS and translation sequences using $feat->spliced_seq->seq
and $feat->get_tag_values("translation")
. My problem is that many of the genebank files are incomplete or are not matching the "correct" (example) format:
LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999
DEFINITION Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
(AXL2) and Rev7p (REV7) genes, complete cds.
...
JOURNAL Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New
Haven, CT, USA
FEATURES Location/Qualifiers
source 1..5028
/organism="Saccharomyces cerevisiae"
/db_xref="taxon:4932"
/chromosome="IX"
/map="9"
CDS <1..206
/codon_start=3
/product="TCP1-beta"
/protein_id="AAA98665.1"
/db_xref="GI:1293614"
/translation="SSIYN...RTANRQHM"
gene 687..3158
/gene="AXL2"
CDS 687..3158
/gene="AXL2"
/note="plasma membrane glycoprotein"
/codon_start=1
/function="required for axial budding pattern of S.
cerevisiae"
/product="Axl2p"
/protein_id="AAA98666.1"
/db_xref="GI:1293615"
/translation="MTQL...GRIPEML"
gene complement(3300..4037)
/gene="REV7"
CDS complement(3300..4037)
/gene="REV7"
/codon_start=1
/product="Rev7p"
/protein_id="AAA98667.1"
/db_xref="GI:1293616"
/translation="MNRWVEK...IFGSLF"
ORIGIN
1 gatcctccat ata...tgatc
//
If all files were in the correct format, it would be relatively straightforward to extract FASTA files with each gene or protein in the format
>FEAT | FEAT | FEAT | FEAT
agcgtgacgattvcatgt...catgcatgcag
and
>FEAT | FEAT | FEAT | FEAT
AYTKRCL...MRTCKYS
Many of the files that I have either do not have the "Origin" field at the bottom (example), or have multiple "Origin" fields (example), each just after a "CDS" field, resulting in warnings and die errors that prevent me from doing what I need to do. Most of the warnings indicate that Bioperl hasn't been able to infer the sequence (because they are lacking that "ORIGIN" field).
So my questions are the following:
- How would you circumvent this kind of problem? What would you do?
- Could you give me any tips on how I can find which of the files have this incorrect file format? I am figuring that a
if($feat->spliced_seq->seq)
fails, push those filenames to a list and manually download them again :( But I haven't been able to test this correctly yet, and maybe there is something already in Bioperl for these cases? - How can I prevent the automatic die everytime a warning comes out, so that I can find the whole list of files that is not designed as it should? Curiously, through the ~1000 files that I am running, the script runs for a few hundreds, outputing those errors but quits at some point. I must say that I have use autodie;in the preamble, but I think the die command is being given by Bioperl.
Note: this post was posted in Perlmonks a few days ago in here, but I'm trying to find a more efficient solution to what was posted there.
The problem is that once a file is incorrect it becomes somewhat of a russian roulette trying to fix it up. Errors almost always come in groups and batches. Try readseq and see if their parser is more robust to your errors.
Yeah.. Right now I'm
trying to fix those that lack the genome completely by retrieving the whole genome from the .fna files in each directory (example directory has .fna and the .gbff files), append it to the .gbff files and interpret it normally since this seems like it is what is lacking. I'll certainly check your suggestion.trying out readseq.I also think I discovered why those other files have multiple genome entries: maybe it is because they pertain to whole-genome-sequencing shotgun projects and what is deposited is all the chunks that were found? (I'm sorry if I'm not using the correct terms, I'm new to this :) )
But why reinvent the wheel?
Do you mean that I should focus on using readseq, or were you thinking about something else?
I think that trying readseq and emboss first before embarking on building our own program to do it would be prudent. What I learnt over the past couple of years is that contributing to existing codebase is more helpful to us and to the community than writing our own tool. Unless we're designing a new algorithm or a new implementation, of course.
Yes, you are absolutely right! I'm going for readseq instead of trying to come up with something that may not work and give me more problems than I already have