Question

gbk2faa: why is it not working for refseq gbk?


8.8 years ago

fhsantanna ▴ 620

I have the following script to extract protein sequences from genbank files.

#!/usr/bin/env python

#This script is a modification of the script found in Peter Cock's site (http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/python/genbank2fasta/).
# Usage: python gbk2faa.py <input> <output>

import sys
from Bio import GenBank
from Bio import SeqIO

input_handle = open(sys.argv[1], "r")
output_handle = open(sys.argv[2], "w")

for seq_record in SeqIO.parse(input_handle, "genbank") :
    print "Dealing with GenBank record %s" % seq_record.id
    for seq_feature in seq_record.features :
        if seq_feature.type=="CDS" :
            assert len(seq_feature.qualifiers['translation'])==1
            output_handle.write(">%s from %s\n%s\n" % (
                   seq_feature.qualifiers['locus_tag'][0],
                   seq_record.name,
                   seq_feature.qualifiers['translation'][0]))

output_handle.close()
input_handle.close()
print "Done"

The problem is that it does not work for RefSeq GenBank files (for example, http://www.ncbi.nlm.nih.gov/nuccore/NZ_JYFS01000106). Here is the error message:

Traceback (most recent call last):

File "gbk2faa.py", line 17, in <module>
    assert len(seq_feature.qualifiers['translation'])==1
KeyError: 'translation'

Oddly, it works for other gb files such as http://www.ncbi.nlm.nih.gov/nuccore/JYFS01000106. Could you please give a hint about what is going on?

biopython • 3.4k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 8.8 years ago by fhsantanna ▴ 620


I found the problem. When the script encounters a pseudogene (a CDS without a translation qualifier), it crashes. I updated the script to use try and continue, and now it works perfectly!
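
For anyone hitting the same KeyError, here is a minimal sketch of the try/continue approach described above. It is not necessarily identical to the corrected script; the function names `cds_to_fasta` and `main` and the "unknown" fallback for a missing locus_tag are assumptions made for illustration.

```python
import sys


def cds_to_fasta(record_name, features):
    """Yield one FASTA entry per CDS feature that carries a translation.

    `features` is any iterable of objects with `.type` and `.qualifiers`,
    such as the `features` list of a Biopython SeqRecord.
    """
    for feature in features:
        if feature.type != "CDS":
            continue
        try:
            protein = feature.qualifiers["translation"][0]
        except KeyError:
            # No translation qualifier (e.g. a pseudogene): skip this CDS.
            continue
        # Assumption: fall back to "unknown" if there is no locus_tag.
        locus = feature.qualifiers.get("locus_tag", ["unknown"])[0]
        yield ">%s from %s\n%s\n" % (locus, record_name, protein)


def main(in_path, out_path):
    # Imported here so the helper above can be used without Biopython.
    from Bio import SeqIO

    with open(in_path) as in_handle, open(out_path, "w") as out_handle:
        for record in SeqIO.parse(in_handle, "genbank"):
            for entry in cds_to_fasta(record.name, record.features):
                out_handle.write(entry)


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Usage would be the same as before: python gbk2faa.py <input> <output>. CDS features without a translation are silently skipped, so the output contains fewer entries than there are CDS features in such files.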

ADD REPLY • link 8.8 years ago by fhsantanna ▴ 620


Can you post your corrected script here? Thank you, Felipe

ADD REPLY • link 7.0 years ago by felipelira3 ▴ 40


https://github.com/fhsantanna/bioinfo_scripts/blob/master/gbk2faa.py

Too late, I know...

ADD REPLY • link 5.4 years ago by fhsantanna ▴ 620

Ram · Answer 1 · 2016-01-21


8.8 years ago

dylan.storey ▴ 60

I don't know Python well, but I'd guess that the 'translation' key doesn't exist.

Add a conditional to test for the existence of the key 'translation' before you access it.

Something like

if (defined seq_feature.qualifiers['translation']){
         do something
     }
else {
     warn you, possibly printing out the record you're trying to access
}
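
In Python, the same idea could look roughly like the sketch below, which uses the `in` operator to test for the key instead of `defined`. The helper name `get_translation` and its `label` parameter are hypothetical, chosen only for this example.

```python
import sys


def get_translation(qualifiers, label="unknown feature"):
    """Return the protein sequence, or None (with a warning) if it is absent.

    `qualifiers` is a dict-like mapping such as `seq_feature.qualifiers`.
    """
    if "translation" in qualifiers:
        return qualifiers["translation"][0]
    # Warn on stderr so the FASTA output itself stays clean.
    sys.stderr.write("Warning: no translation for %s\n" % label)
    return None
```

The caller can then skip the feature whenever the function returns None.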

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by dylan.storey ▴ 60


Sorry the code and the warning were imprecise. I edited the text.

I believe "assert" does what you mean...
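
One caveat on relying on assert here: the dictionary lookup inside the assert expression is evaluated first, so a missing key raises KeyError before assert gets to check anything. That matches the traceback in the question. A small sketch with made-up qualifier dictionaries (the function name `check_with_assert` is hypothetical):

```python
def check_with_assert(qualifiers):
    """Report which exception, if any, the original assert line would raise."""
    try:
        assert len(qualifiers["translation"]) == 1
    except KeyError:
        # The key is missing: the lookup fails before assert can test anything.
        return "KeyError"
    except AssertionError:
        # The key exists, but it does not hold exactly one value.
        return "AssertionError"
    return "ok"


# Hypothetical qualifier dictionaries for illustration.
normal_cds = {"locus_tag": ["ABC_0001"], "translation": ["MKV"]}
pseudogene = {"locus_tag": ["ABC_0002"], "pseudo": [""]}

print(check_with_assert(normal_cds))
print(check_with_assert(pseudogene))
```

So assert only guards against a translation list of the wrong length; it does not test whether the key exists.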

ADD REPLY • link 8.8 years ago by fhsantanna ▴ 620