How to debug (and solve) Bioperl problems when a gbk file doesn't have the genome?
0
0
Entering edit mode
10.1 years ago
fr ▴ 220

Hello.

I am struggling with some problems reading Genebank files (.gbff) through Bioperl. I am trying to extract CDS and translation sequences using $feat->spliced_seq->seq and $feat->get_tag_values("translation"). My problem is that many of the genebank files are incomplete or are not matching the "correct" (example) format:

LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
...
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New
            Haven, CT, USA
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
                     /map="9"
     CDS             <1..206
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA98665.1"
                     /db_xref="GI:1293614"
                     /translation="SSIYN...RTANRQHM"
     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQL...GRIPEML"
     gene            complement(3300..4037)
                     /gene="REV7"
     CDS             complement(3300..4037)
                     /gene="REV7"
                     /codon_start=1
                     /product="Rev7p"
                     /protein_id="AAA98667.1"
                     /db_xref="GI:1293616"
                     /translation="MNRWVEK...IFGSLF"
ORIGIN
        1 gatcctccat ata...tgatc
//

If all files were in the correct format, it would be relatively straightforward to extract FASTA files with each gene or protein in the format

>FEAT | FEAT | FEAT | FEAT
agcgtgacgattvcatgt...catgcatgcag

and

>FEAT | FEAT | FEAT | FEAT
AYTKRCL...MRTCKYS

Many of the files that I have either do not have the "Origin" field at the bottom (example), or have multiple "Origin" fields (example), each just after a "CDS" field, resulting in warnings and die errors that prevent me from doing what I need to do. Most of the warnings indicate that Bioperl hasn't been able to infer the sequence (because they are lacking that "ORIGIN" field).

So my questions are the following:

  1. How would you circumvent this kind of problem? What would you do?
  2. Could you give me any tips on how I can find which of the files have this incorrect file format? I am figuring that a if($feat->spliced_seq->seq) fails, push those filenames to a list and manually download them again :( But I haven't been able to test this correctly yet, and maybe there is something already in Bioperl for these cases?
  3. How can I prevent the automatic die everytime a warning comes out, so that I can find the whole list of files that is not designed as it should? Curiously, through the ~1000 files that I am running, the script runs for a few hundreds, outputing those errors but quits at some point. I must say that I have use autodie;in the preamble, but I think the die command is being given by Bioperl.

Note: this post was posted in Perlmonks a few days ago in here, but I'm trying to find a more efficient solution to what was posted there.

bioperl genbank NCBI • 3.3k views
ADD COMMENT
2
Entering edit mode

The problem is that once a file is incorrect it becomes somewhat of a russian roulette trying to fix it up. Errors almost always come in groups and batches. Try readseq and see if their parser is more robust to your errors.

ADD REPLY
0
Entering edit mode

Yeah.. Right now I'm trying to fix those that lack the genome completely by retrieving the whole genome from the .fna files in each directory (example directory has .fna and the .gbff files), append it to the .gbff files and interpret it normally since this seems like it is what is lacking. I'll certainly check your suggestion. trying out readseq.

I also think I discovered why those other files have multiple genome entries: maybe it is because they pertain to whole-genome-sequencing shotgun projects and what is deposited is all the chunks that were found? (I'm sorry if I'm not using the correct terms, I'm new to this :) )

ADD REPLY
0
Entering edit mode

But why reinvent the wheel?

ADD REPLY
0
Entering edit mode

Do you mean that I should focus on using readseq, or were you thinking about something else?

ADD REPLY
1
Entering edit mode

I think that trying readseq and emboss first before embarking on building our own program to do it would be prudent. What I learnt over the past couple of years is that contributing to existing codebase is more helpful to us and to the community than writing our own tool. Unless we're designing a new algorithm or a new implementation, of course.

ADD REPLY
1
Entering edit mode

Yes, you are absolutely right! I'm going for readseq instead of trying to come up with something that may not work and give me more problems than I already have

ADD REPLY

Login before adding your answer.

Traffic: 2264 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6