Question

How to extract DNA sequences for individual genes from GenBank complete genome files using biopython?

0

Entering edit mode

10.7 years ago

cq151 ▴ 10

Dear all,

I have downloaded complete mitochondrial genomes for one species from genbank and have written them to a file called myseqs.gbk (in genbank format - like a flat file). I now want to parse these genomes and extract only the sequence data relating to the cox3 gene, and save them as fasta files. In my first ever experience with biopython I tried to adapt the script kindly posted on this site last year (by microbeatic ), but I am getting the error:

Traceback (most recent call last):
  File "parser2.py", line 21, in <module>
    print ">GeneId|%s|COX3|%s\n%s" % (gi,genome.description,genome.seq[start:end])
NameError: name 'start' is not defined

I am sure my mistakes are obvious to the regulars on this site and I would be very thankful for some advice. The script is below. I am guessing that I am not using the correct qualifiers and thereby filtering everything out.

Thank you very much.

TB

from Bio import SeqIO, SeqFeature
import sys

gbank=SeqIO.parse('myseqs.gbk','genbank')

for genome in gbank:

    for gene in genome.features:
        if(gene.type =="CDS"):
            if 'gene' in gene.qualifiers:
                if 'cox3' in gene.qualifiers['gene'][0].lower():
                    start=gene.location.start.position
                    end=gene.location.end.position
                if 'db_xref' in gene.qualifiers:
                    gi=[]
                    gi=str(gene.qualifiers['db_xref'])
                    gi=gi.split(":")[1]
                    gi=gi.split("'")[0]
                    print ">GeneId|%s|COX3|%s\n%s" % (gi,genome.description,genome.seq[start:end])
                else:
                    print ">GeneId|NoGenID|COX3|%s\n%s" % (genome.description,genome.seq[start:end])

biopython genbank genomes parsing extract • 12k views

ADD COMMENT • link updated 4.3 years ago by Ram 45k • written 10.7 years ago by cq151 ▴ 10

1

Entering edit mode

10.7 years ago

cq151 ▴ 10

Thank you very much Peter and Tariq!

As Peter pointed out, I had forgotten to indent the final block of code - after indenting, the original script worked perfectly. I am dealing with "nice" genes at the moment; however, I will bear in mind the gene.extract(genome.seq) approach should I have to deal with more messy genes - thank you Peter.

Thank you Tariq for the suggestion of pyGeno - I will look into that, it looks neat, particularly when I have to deal with human sequences.

The original problem is now solved - and so quickly.

Many thanks

TB

ADD COMMENT • link updated 4.3 years ago by Ram 45k • written 10.7 years ago by cq151 ▴ 10

0

Entering edit mode

You're welcome. Let me know if you need it for another species I can make you a datawrap that I'll then put on the website. You can send me an email or reach me via twitter ( http://bioinfo.iric.ca/~daoudat/pyGeno/ ).

Cheers,

ADD REPLY • link updated 4.3 years ago by Ram 45k • written 10.7 years ago by Tariq Daouda ▴ 220

Ram · Accepted Answer · 2014-12-08

Hey,

I'm the developer of pyGeno, If you don't mind switching to another python package, you could try it.

If you're not working on humans, you will have to make you own pyGeno datawrap, but the process is easy if the data you need is available from Ensembl.

Then these few lines might do the trick:

from pyGeno.Genome import *

ref = Genome(name = "GRCh37.75")
cox = ref.get(Gene, name = "COX3")[0]

for trans in cox.get(Transcript) :

  print trans.cDNA, trans.sequence #get the sequences

pyGeno's doc: http://bioinfo.iric.ca/~daoudat/pyGeno/

Cheers,

Ram · Accepted Answer · 2014-12-08

The NameError is due to you only defining start and end if it was a cox gene. Probably you wanted to indent the last block of code to also only happen for cox genes?

Also try reading http://www.warwick.ac.uk/go/peter_cock/python/genbank/ which has examples.

In particular, your approach of getting the sequence using start and end will only work on nice genes - for anything complicated like a ribosomal slippage or an intron/exon gene model, I would suggest you use gene.extract(genome.seq)

Try:

from Bio import SeqIO

for genome in SeqIO.parse('myseqs.gbk','genbank'):
    for gene in gnome.features:
        if gene.type != "CDS":
            continue
        if 'gene' not in gene.qualifiers:
            continue
        if 'cox3' not in gene.qualifiers['gene'][0].lower():
            continue
        gene_seq = gene.extract(genome.seq)
        if 'db_xref' in gene.qualifiers:
            gi = str(gene.qualifiers['db_xref']).split(":")[1].split("'")[0]
        else:
            gi = "NoGenID"
        print ">GeneId|%s|COX3|%s\n%s" % (gi, genome.description, gene_seq)