Hi all,
I'm trying to translate a GenBank record using BioPython 1.53, ignoring the translation already given in the CDS feature. The code I've written for this is pretty straightforward:
from Bio import SeqIO
from Bio.Data.CodonTable import TranslationError
...
for gb_record in SeqIO.parse(file_handle, 'genbank'):  # Bio.SeqRecord.SeqRecord
    for gb_feature in gb_record.features:  # Bio.SeqFeature.SeqFeature
        # Skip any non-coding-sequence features
        if gb_feature.type != 'CDS':
            continue
        # Protein identifier is a qualifier of the GenBank feature
        protein_id = gb_feature.qualifiers['protein_id'][0]
        # Original sequence retrieved through BioPython 1.53+'s internal method
        extracted_seq = gb_feature.extract(gb_record.seq)  # Bio.Seq.Seq
        # Translation table is a qualifier of the GenBank feature
        transl_table = gb_feature.qualifiers['transl_table'][0]
        # Translate entire sequence as coding sequence using translation table
        # Additional CodonTables optionally available from Bio.Data.CodonTable
        try:
            protein_seq = extracted_seq.translate(table=transl_table, cds=True)
        except TranslationError as err:
            log.error('%s: Error in translating %s\n%s', gb_record.id, protein_id, extracted_seq)
            raise err
        # Write out fasta. Header format as requested: >genome_ac|protein_id
        _write_fasta_line(write_handle, '{0}|{1}'.format(gb_record.id, protein_id), str(protein_seq))
The translate line throws a TranslationError on the following feature:
CDS complement(2276255..2279302)
/locus_tag="ECBD_2165"
/EC_number="1.7.99.4"
/inference="protein motif:TFAM:TIGR01553"
/note="KEGG: ssn:SSON_1650 formate dehydrogenase-N,
nitrate-inducible, alpha subunit;
TIGRFAM: formate dehydrogenase, alpha subunit;
PFAM: molybdopterin oxidoreductase; molybdopterin
oxidoreductase Fe4S4 region; molydopterin
dinucleotide-binding region"
/codon_start=1
/transl_except=(pos:complement(2278715..2278717),aa:Sec)
/transl_table=11
/product="formate dehydrogenase, alpha subunit"
/protein_id="YP_003036386.1"
/db_xref="GI:253773555"
/db_xref="InterPro:IPR006311"
/db_xref="InterPro:IPR006443"
/db_xref="InterPro:IPR006655"
/db_xref="InterPro:IPR006656"
/db_xref="InterPro:IPR006657"
/db_xref="InterPro:IPR006963"
/db_xref="GeneID:8157271"
/translation="MDVSRRQFFKICAGGMAGTTVAALGFAPKQALAQARNYKLLRAK
EIRNTCTYCSVGCGLLMYSLGDGAKNAREAIYHIEGDPDHPVSRGALCPKGAGLLDYV
NSENRLRYPEYRAPGSDKWQRISWEEAFSRIAKLMKADRDANFIEKNEQGVTVNRWLS
TGMLCASGASNETGMLTQKFARSLGMLAVDNQARVUHGPTVASLAPTFGRGAMTNHWV
DIKNANVVMVMGGNAAEAHPVGFRWAMEAKNNNDATLIVVDPRFTRTASVADIYAPIR
SGTDITFLSGVLRYLIENNKINAEYVKHYTNASLLVRDDFAFEDGLFSGYDAEKRQYD
KSSWNYQFDENGYAKRDETLTHPRCVWNLLKEHVSRYTPDVVENICGTPKADFLKVCE
VLASTSAPDRTTTFLYALGWTQHTVGAQNIRTMAMIQLLLGNMGMAGGGVNALRGHSN
IQGLTDLGLLSTSLPGYLTLPSEKQVDLQSYLEANTPKATLADQVNYWSNYPKFFVSL
MKSFYGDAAQKENNWGYDWLPKWDQTYDVIKYFNMMDEGKVTGYFCQGFNPVASFPDK
NKVVSCLSKLKYMVVIDPLVTETSTFWQNHGESNDVDPASIQTEVFRLPSTCFAEEDG
SIANSGRWLQWHWKGQDAPGEARNDGEILAGIYHHLRELYQAEGGKGVEPLMKMSWNY
KQPHEPQSDEVAKENNGYALEDLYDANGVLIAKKGQLLSSFAHLRDDGTTASSCWIYT
GSWTEQGNQMANRDNSDPSGLGNTLGWAWAWPLNRRVLYNRASADINGKPWDPKRMLI
QWNGSKWTGNDIPDFGNAAPGTPTGPFIMQPEGMGRLFAINKMAEGPFPEHYEPIETP
LGTNPLHPNVVSNPVVRLYEQDALRMGKKEQFPYVGTTYRLTEHFHTWTKHALLNAIA
QPEQFVEISETLAAAKGINNGDRVTVSSKRGFIRAVAVVTRRLKPLNVNGQQVETVGI
PIHWGFEGVARKGYIANTLTPNVGDANSQTPEYKAFLVNIEKA"
root: ERROR: NC_012947.1: Error in translating YP_003036386.1
ATGGACGTCAGTCGCAGACAATTTTTTAAAATCTGCGCGGGCGGTATGGCTGGAACAACGGTAGCGGCATTGGGCTTTGCCCCGAAGCAAGCACTGGCTCAGGCGCGAAACTACAAATTATTACGCGCTAAAGAGATCCGTAACACCTGCACATACTGTTCCGTAGGTTGCGGGCTATTGATGTATAGCCTGGGTGATGGCGCGAAAAACGCCAGAGAAGCGATTTATCACATTGAAGGTGACCCGGATCATCCGGTAAGCCGTGGTGCGCTGTGCCCAAAAGGGGCCGGTTTGCTGGATTACGTCAACAGCGAAAACCGTCTGCGCTACCCGGAATATCGTGCGCCAGGTTCTGACAAATGGCAGCGCATTAGCTGGGAAGAAGCATTCTCCCGTATTGCAAAGCTGATGAAAGCTGACCGTGACGCTAACTTTATTGAAAAGAACGAGCAGGGCGTAACGGTAAACCGTTGGCTTTCTACCGGTATGCTGTGTGCCTCCGGTGCCAGCAACGAAACCGGGATGCTGACACAGAAATTTGCCCGCTCCCTCGGGATGCTGGCGGTAGACAACCAGGCGCGCGTCTGACACGGACCAACGGTAGCAAGTCTTGCTCCAACATTTGGTCGCGGTGCGATGACCAACCACTGGGTGGATATCAAAAACGCTAACGTCGTAATGGTAATGGGCGGTAACGCTGCTGAAGCGCATCCCGTCGGTTTCCGCTGGGCGATGGAAGCGAAAAACAACAACGATGCAACCTTGATCGTTGTCGATCCTCGTTTTACGCGTACCGCTTCTGTGGCGGATATTTACGCACCTATTCGTTCCGGTACGGACATTACGTTCCTGTCTGGCGTTTTGCGCTACCTGATCGAAAACAACAAAATCAACGCCGAATACGTTAAACATTACACCAACGCCAGCCTGCTGGTGCGTGATGATTTTGCTTTCGAAGATGGCCTGTTCAGCGGTTATGACGCTGAAAAACGCCAGTACGACAAATCGTCCTGGAACTATCAGTTCGATGAAAACGGCTATGCGAAACGCGATGAAACACTGACTCATCCGCGCTGTGTGTGGAACCTGCTGAAAGAGCACGTTTCCCGCTACACGCCGGACGTCGTTGAAAACATCTGCGGTACGCCAAAAGCCGACTTCCTGAAAGTGTGTGAAGTGCTGGCCTCCACCAGCGCACCGGATCGCACAACCACCTTCCTGTACGCGCTGGGCTGGACGCAGCACACCGTGGGTGCGCAGAACATCCGTACTATGGCGATGATCCAGTTACTGCTCGGTAACATGGGTATGGCCGGTGGCGGCGTGAACGCATTGCGTGGTCACTCCAACATTCAGGGCCTGACTGACTTAGGTCTGCTCTCTACCAGCCTGCCAGGTTATCTGACGCTGCCGTCAGAAAAACAGGTTGATTTGCAGTCGTATCTGGAAGCGAACACGCCGAAAGCGACGCTGGCTGATCAGGTGAACTACTGGAGCAACTATCCGAAGTTCTTCGTTAGCCTGATGAAATCTTTCTATGGCGATGCCGCGCAGAAAGAGAACAACTGGGGCTATGACTGGCTGCCGAAGTGGGACCAGACCTACGACGTCATCAAGTATTTCAACATGATGGACGAAGGCAAAGTCACCGGTTATTTCTGCCAGGGCTTTAACCCGGTTGCGTCCTTCCCGGACAAAAACAAAGTGGTGAGCTGCCTGAGCAAGCTGAAGTACATGGTGGTTATCGATCCGCTGGTGACTGAAACCTCTACCTTCTGGCAGAACCACGGCGAGTCGAACGATGTCGATCCGGCGTCTATTCAGACTGAAGTATTCCGTCTGCCTTCGACCTGCTTTGCTGAAGAAGATGGTTCTATTGCTAACTCCGGTCGCTGGCTGCAGTGGCACTGGAAAGGTCAGGATGCGCCGGGCGAAGCGCGTAACGACGGTGAAATTCTGGCGGGTATCTACCATCACCTGCGCGAGCTGTACCAGGCCGA
AGGTGGTAAAGGCGTAGAACCGCTGATGAAGATGAGCTGGAACTACAAGCAGCCGCACGAACCGCAATCTGACGAAGTAGCTAAAGAGAACAACGGCTATGCGCTGGAAGATCTCTATGATGCTAATGGCGTGCTGATTGCGAAGAAAGGTCAGTTGCTGAGTAGCTTTGCGCATCTGCGTGATGACGGTACAACCGCATCTTCTTGCTGGATCTACACCGGTAGCTGGACAGAGCAGGGCAACCAGATGGCTAACCGCGATAACTCCGACCCGTCCGGTCTGGGGAATACGCTGGGATGGGCCTGGGCGTGGCCGCTCAACCGTCGCGTGCTGTACAACCGTGCTTCGGCGGATATCAACGGTAAACCGTGGGATCCGAAACGGATGCTGATCCAGTGGAACGGCAGCAAGTGGACGGGTAACGATATTCCTGACTTCGGCAATGCCGCACCGGGTACGCCAACCGGGCCGTTTATCATGCAGCCGGAAGGGATGGGACGCCTGTTTGCTATCAACAAAATGGCGGAAGGTCCGTTCCCGGAACACTACGAGCCGATTGAAACGCCGCTGGGCACTAACCCGCTGCATCCGAACGTGGTGTCTAACCCGGTTGTTCGTCTGTATGAACAAGACGCACTGCGGATGGGTAAAAAAGAGCAGTTCCCGTATGTGGGTACGACCTATCGTCTGACCGAGCACTTCCACACCTGGACCAAGCACGCATTGCTCAACGCAATTGCTCAGCCGGAACAGTTTGTGGAAATCAGCGAAACGCTGGCGGCGGCGAAAGGCATTAATAATGGCGATCGTGTCACTGTCTCAAGCAAGCGTGGCTTTATCCGCGCGGTGGCTGTGGTAACGCGTCGTCTGAAACCACTGAATGTAAATGGTCAGCAGGTTGAAACGGTGGGTATTCCAATCCACTGGGGCTTTGAGGGTGTCGCGCGTAAAGGTTATATCGCTAACACTCTGACGCCGAATGTCGGTGATGCAAACTCGCAAACGCCGGAATATAAAGCGTTCTTAGTCAACATCGAGAAGGCGTAA
Error:
Traceback (most recent call last):
File "/usr/lib/python2.6/unittest.py", line 279, in run
testMethod()
File "/home/user/jenkins/workspace/Divergence/divergence/src/divergence/test/test_translate.py", line 33, in test_translate_ecoli_and_salmo
fasta_file = translate_genbank_to_protein(genbank_file, ptt_file)
File "/home/user/jenkins/workspace/Divergence/divergence/src/divergence/translate.py", line 73, in translate_genbank_to_protein
raise err
TranslationError: Extra in frame stop codon found.
Now I'm guessing this has something to do with the /transl_except I'm seeing in the GenBank record, but I'm not (yet) sure. (The GenBank-supplied translation contains a selenocysteine.) But even if this is the cause: how would I properly handle this in my BioPython translation? I can't find any method to exclude certain sections from translation.
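For what it's worth, the guess can be checked from the coordinates alone. The sketch below is not BioPython API: `parse_transl_except` and `codon_index` are hypothetical helper names, and it assumes a single-span CDS and a simple `pos:start..end` exception (no `join(...)`). It parses the qualifier value and works out which codon of the extracted sequence the exception covers.

```python
import re

# Matches simple values such as "(pos:complement(2278715..2278717),aa:Sec)"
# or "(pos:123..125,aa:Sec)". Compound locations are NOT handled (assumption).
_TRANSL_EXCEPT = re.compile(
    r"\(pos:(?P<comp>complement\()?(?P<start>\d+)\.\.(?P<end>\d+)\)?,aa:(?P<aa>\w+)\)"
)


def parse_transl_except(value):
    """Return (start, end, aa, on_complement) with 1-based inclusive coordinates."""
    match = _TRANSL_EXCEPT.match(value.replace(" ", ""))
    if match is None:
        raise ValueError("unsupported transl_except value: %r" % value)
    return (
        int(match.group("start")),
        int(match.group("end")),
        match.group("aa"),
        match.group("comp") is not None,
    )


def codon_index(feature_start, feature_end, exc_start, exc_end,
                on_complement, codon_start=1):
    """0-based codon index of the exception within a single-span CDS.

    All coordinates are 1-based and inclusive, as written in the GenBank file.
    """
    if on_complement:
        # On the reverse strand the CDS begins at its highest coordinate.
        offset = feature_end - exc_end
    else:
        offset = exc_start - feature_start
    offset -= codon_start - 1
    if offset < 0 or offset % 3 != 0:
        raise ValueError("exception is not aligned to the reading frame")
    return offset // 3


start, end, aa, comp = parse_transl_except(
    "(pos:complement(2278715..2278717),aa:Sec)"
)
index = codon_index(2276255, 2279302, start, end, comp)  # 195
```

For the feature above this gives codon index 195, i.e. residue 196, which is where the U sits in the supplied /translation and where the extracted sequence has an in-frame TGA.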
Can anyone help me fix the translation?
Best regards, Tim
(P.S. Should anyone wonder why I'm not using the translation in the GenBank file directly: it's a requirement that I translate from the DNA sequence to protein myself...)
Thanks, this seems like a good method to fall back upon.
I was hoping I had simply overlooked a way to handle this case within BioPython, as I've seen quite a few people running into similar problems when Googling "BioPython transl_except". Any suggestions in that direction are still welcome! :)
Tim, those other messages look like an old problem from 2005 with parsing GenBank records containing transl_except (woah, memories). You're dealing with a pretty special case here, so you will just need some custom code to handle it. If only the record had a translation in it you could use.
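As a rough idea of what such custom code might look like (a sketch under assumptions, not a BioPython feature): translate without `cds=True`, so the internal stop comes back as `*` instead of raising, then patch the residue named by /transl_except. `apply_transl_except` and `ONE_LETTER` are hypothetical names; `protein` is assumed to be `str(extracted_seq.translate(table=transl_table))` and `index` the 0-based codon position of the exception. It only covers the case where a stop codon is recoded, and dropping `cds=True` also drops its start-codon handling, so an alternative start codon would not be turned into M.

```python
# Three-letter names as used in /transl_except, mapped to one-letter codes.
# Only the two non-standard residues are listed here; extend as needed.
ONE_LETTER = {"Sec": "U", "Pyl": "O"}


def apply_transl_except(protein, index, aa):
    """Replace the in-frame stop at `index` with the residue named `aa`."""
    if protein.endswith("*"):
        protein = protein[:-1]  # drop the terminal stop codon
    if index >= len(protein) or protein[index] != "*":
        raise ValueError("no in-frame stop at codon index %d" % index)
    patched = protein[:index] + ONE_LETTER[aa] + protein[index + 1:]
    if "*" in patched:
        # Any remaining stop is not explained by the exception: fail loudly.
        raise ValueError("unexplained in-frame stop remains")
    return patched


# Toy input: "*" at index 3 is the recoded stop, the final "*" the real one.
patched = apply_transl_except("MAV*HG*", 3, "Sec")  # "MAVUHG"
```

Failing on any leftover `*` keeps the strictness that `cds=True` provided for every feature without an exception.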
Thanks for the explanation. As for the provided translation: I'll discuss again with my supervisors whether I can use it when encountering /transl_except records. I think writing my own code to handle the /transl_except cases is far more error-prone than using the provided translation in these rare cases.
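That fallback could be as small as the following sketch. `choose_protein` and its `translate_own` callback are hypothetical names; `qualifiers` is assumed to behave like the `gb_feature.qualifiers` mapping in the original code (qualifier name to a list of strings).

```python
def choose_protein(qualifiers, translate_own):
    """Use the supplied /translation only for features with /transl_except.

    `translate_own` is a zero-argument callable that performs the normal
    DNA-to-protein translation for every other feature.
    """
    if "transl_except" in qualifiers:
        provided = qualifiers.get("translation")
        if not provided:
            raise ValueError("transl_except present but no /translation given")
        return provided[0]
    return translate_own()


# Plain dicts stand in for gb_feature.qualifiers in this example.
special = {"transl_except": ["(pos:4..6,aa:Sec)"], "translation": ["MUK"]}
ordinary = {"translation": ["MKK"]}

from_record = choose_protein(special, lambda: "M*K")  # "MUK"
from_dna = choose_protein(ordinary, lambda: "MKK")    # "MKK"
```

Passing the translation step in as a callable means the strict `cds=True` call is never attempted for the exceptional features.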