Dear all,
I have downloaded complete mitochondrial genomes for one species from genbank and have written them to a file called myseqs.gbk (in genbank format - like a flat file). I now want to parse these genomes and extract only the sequence data relating to the cox3 gene, and save them as fasta files. In my first ever experience with biopython I tried to adapt the script kindly posted on this site last year (by microbeatic ), but I am getting the error:
Traceback (most recent call last):
File "parser2.py", line 21, in <module>
print ">GeneId|%s|COX3|%s\n%s" % (gi,genome.description,genome.seq[start:end])
NameError: name 'start' is not defined
I am sure my mistakes are obvious to the regulars on this site and I would be very thankful for some advice. The script is below. I am guessing that I am not using the correct qualifiers and thereby filtering everything out.
Thank you very much.
TB
from Bio import SeqIO, SeqFeature
import sys
gbank=SeqIO.parse('myseqs.gbk','genbank')
for genome in gbank:
for gene in genome.features:
if(gene.type =="CDS"):
if 'gene' in gene.qualifiers:
if 'cox3' in gene.qualifiers['gene'][0].lower():
start=gene.location.start.position
end=gene.location.end.position
if 'db_xref' in gene.qualifiers:
gi=[]
gi=str(gene.qualifiers['db_xref'])
gi=gi.split(":")[1]
gi=gi.split("'")[0]
print ">GeneId|%s|COX3|%s\n%s" % (gi,genome.description,genome.seq[start:end])
else:
print ">GeneId|NoGenID|COX3|%s\n%s" % (genome.description,genome.seq[start:end])
You're welcome. Let me know if you need it for another species I can make you a datawrap that I'll then put on the website. You can send me an email or reach me via twitter ( http://bioinfo.iric.ca/~daoudat/pyGeno/ ).
Cheers,