Question

Problems with extracting genes from a genbank file using biopython

0

6.2 years ago

beginner_problem ▴ 10

I am trying to extract the gene positions from a GenBank file using Biopython. This is the function I have written so far:

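from Bio import SeqIO

def get_CDS(file):
    # SeqIO.read expects exactly one record in the file
    record = SeqIO.read(file, "genbank")
    cds = []
    for feature in record.features:
        if feature.type == 'CDS':
            print(feature.location)
            # start/end are the outer bounds of the whole location
            start_i = feature.location.start
            end_i = feature.location.end
            cds.append((start_i, end_i))
    return cds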

However, I noticed that sometimes there are entries like:

join{[4585844:4586295](-), [4584940:4585845](-)}

In that case, feature.location.start and feature.location.end return 4584940 and 4586295, the outer bounds of the whole join.

Does anyone know how I can also get the positions of the individual parts of the gene, i.e. first [4585844:4586295] and then [4584940:4585845]?

gene genome biopython python • 1.7k views

ADD COMMENT • link updated 6.2 years ago by Sej Modha 5.3k • written 6.2 years ago by beginner_problem ▴ 10

0

Could you please provide the accession number of the GenBank file you are trying to parse with this code?

ADD REPLY • link 6.2 years ago by Sej Modha 5.3k

0

For example, one of the pestis genomes causes this problem: NC_003143

ADD REPLY • link 6.2 years ago by beginner_problem ▴ 10

score 1 · Answer 1 · 2018-09-11

1

6.2 years ago

Sej Modha 5.3k

Hi there,

I have used a sample GenBank file here; the following should work for you too.

This produces the following output:

['335:4642', '335:1838', '4586:5165', '5104:5396', '5376:7970', '5515:8199', '5607:5856', '5770:8341', '6918:7488', '8342:8963']
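
For the join{...} case in the question, one possible approach is to iterate over feature.location.parts, which Biopython provides on both simple and compound locations. The sketch below is an illustration under that assumption, not necessarily the code that produced the output above; the function names get_cds_parts and get_cds_parts_from_file are made up for this example.

```python
def get_cds_parts(features):
    """Return one list per CDS feature, holding a (start, end, strand)
    tuple for every segment of that feature's location."""
    cds = []
    for feature in features:
        if feature.type != "CDS":
            continue
        # For a join(...) location, .parts lists the individual segments.
        # For a simple location, .parts is a one-element list containing
        # the location itself, so both cases are handled the same way.
        parts = [(int(p.start), int(p.end), p.strand)
                 for p in feature.location.parts]
        cds.append(parts)
    return cds


def get_cds_parts_from_file(path):
    """Hypothetical wrapper: read a single-record GenBank file and
    return the per-segment CDS coordinates."""
    # Imported here because Biopython is only needed for reading the file.
    from Bio import SeqIO
    record = SeqIO.read(path, "genbank")
    return get_cds_parts(record.features)
```

Calling get_cds_parts_from_file on a downloaded copy of NC_003143 should then give each CDS as a list of segments rather than a single outer (start, end) pair. The coordinates are Biopython's usual 0-based, end-exclusive positions, and strand is 1, -1 or None.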

ADD COMMENT • link 6.2 years ago by Sej Modha 5.3k