I am trying to get all proteins from a genbank file considering a sequence interval. Here is the code of the script written in python:
#!/usr/bin/env python
#Script for getting proteins in a range from a genbank file
#Usage: python get_proteins.py file start:end
import sys
from Bio import SeqIO
rec = SeqIO.read(sys.argv[1], 'genbank')
feats = [feat for feat in rec.features if feat.type == "CDS"]
start, end = sys.argv[2].split(':')
desired = set(range(int(start),int(end),1))
for f in feats:
span = set(range(f.location._start.position, f.location._end.position))
if span & desired:
print("%s,%s,%d:%d\n%s\n" % (
f.qualifiers['locus_tag'][0],
f.qualifiers['product'][0],
f.location._start.position,
f.location._end.position,
f.qualifiers["translation"][0]))
There are two problems: For some regions containing pseudogenes, printing "f.qualifiers["translation"][0]" breaks the script;
$ python get_proteins.py ../gbk/NZ_AOLM01000025.1.gbk 92686:113021 > island_proteins.faa
Traceback (most recent call last):
File "get_proteins.py", line 24, in <module>
f.qualifiers["translation"][0]))
KeyError: 'translation'
I know that I must use "try", but I was not successful in doing it.
The second problem is that some genbank files give the following error:
$ python get_proteins.py ../gbk/NZ_CM001555.1.gbk 530979:551299 > island_proteins.faa
Traceback (most recent call last): File "get_proteins.py", line 17, in> <module>
span = set(range(f.location._start.position, f.location._end.position)) AttributeError: 'CompoundLocation' object
has no attribute '_start'
I did not understand this last problem.
Thank you.
I think I managed the problem: