Fixing a biopython script
1
1
Entering edit mode
8.0 years ago
tpaisie ▴ 80

So I'm trying to actively learn how to use Biopython, and python for that matter. I made a script that extracts the Genbank accession number and the country from a .gb (genbank) file and puts the accession number and country in a text file. Here is my script:

from Bio import SeqIO
gb_file = "denv1.gb"

for gb_record in SeqIO.parse(open(gb_file,"r"), "genbank") :
     printgb_record.name),
     for feat in gb_record.features:
                if feat.type == 'source':
                        source = gb_record.features[0]
                        for qualifiers in source.qualifiers:
                                if qualifiers == 'country':
                                        country = source.qualifiers['country']
                                        print(country[0])

Now the output text file, every accession number with a country is listed perfectly, but when a country is not present for a particular sequence in the genbank file, it does this:

example of .gb file

Instead I would like it to be blank where the name of the country would go. Hopefully this makes sense and isn't too confusing. Can anyone help me out? As I said I'm trying to learn, so pointers or links to help would be greatly appreciated!

biopython python GenBank • 2.8k views
ADD COMMENT
1
Entering edit mode

Couple of other things to point out. with SeqIO.parse() you don't also need to call open(). My suggestion would be to throw in a check for existence of location... have a look at this thread from the other day: C: How do can I use Biopython and SeqIO to parse out multiple genes from several NC

from Bio import SeqIO
import sys

with open(sys.argv[2], 'w') as nfh:
        for rec in SeqIO.parse(sys.argv[1], "genbank"):
                if rec.features:
                        for feature in rec.features:
                                if feature.type == "CDS":
                                        nfh.write(">%s from %s\n%s\n" % (
                                        feature.qualifiers['gene'][0],
                                        rec.name,
                                        feature.location.extract(rec).seq))

On line 5, you're making sure that feature exists before you do something with it. My guess is that your code is failing to detect that a source isnt found in the next record, and populating it with the last thing source evaluated to from a previous record.

ADD REPLY
2
Entering edit mode

ok so i added this to my script:

from Bio import SeqIO
gb_file = "denv1.gb"

# parse genbank file
for gb_record in SeqIO.parse(gb_file, "genbank") :
    # now do something with the record
        printgb_record.name),   # print genbank accession number
        for feat in gb_record.features:
                if feat.type == 'source':
                        source = gb_record.features[0]
                        for qualifiers in source.qualifiers:
                            if qualifiers == 'country':
                                country = source.qualifiers['country']
                                print(country[0]) # prints the country
                        else:
                            print("")

and it does leave a the accession numbers country column blank but it also adds an extra line inbetween each accession number and country:

script text file

ADD REPLY
0
Entering edit mode

I suspect this will be to do with carry over of a newline character in the case where the source can actually be found from the file, but not if it's blank.

ADD REPLY
0
Entering edit mode

So I need to add something to my code to tell it when no country is present in the file it should print nothing? Are you showing that in the above code?

ADD REPLY
0
Entering edit mode

I haven't tested your code but that would be my guess. I could be wrong :p

The code above is similar to what you're doing. It extracts info from Genbanks.

I believe you need a statement something like:

source = gb_record.features[0]
if not source:
    source = 'no source information'
ADD REPLY
0
Entering edit mode

Can you upload your genbank that you're trying this on so I can replicate what you've got?

ADD REPLY
0
Entering edit mode

So the entire file is to large, but there is a shortened version of the file: http://gobin.io/OBcO

ADD REPLY
0
Entering edit mode

Could you format the code using in the biostar post editor ?

ADD REPLY
0
Entering edit mode

There! Thanks for letting me know!

ADD REPLY
0
Entering edit mode
8.0 years ago
Joe 22k

Ok here's what I've ended up with:

from Bio import SeqIO
gb_file = "test.gbk"

# parse genbank file
for gb_record in SeqIO.parse(gb_file, "genbank") :
    # now do something with the record
        for feat in gb_record.features:
                if feat.type == 'source':
                        source = gb_record.features[0]
                        for qualifiers in source.qualifiers:
                            if qualifiers == 'country':
                                country = str(source.qualifiers['country'][0])
                            elif country == '':
                                country = 'none'

        printgb_record.name + ' ' + country)

This works on the test data. The file as you provided it returns:

KT279761 Haiti
EU482591 USA: Puerto Rico
HQ332181 Venezuela
FJ810415 Venezuela: Aragua

If I manually change one of the gbk entries so that it doesn't contain any country data:

            ##Assembly-Data-END##
FEATURES             Location/Qualifiers
     source          1..10735
                     /organism="Dengue virus 1"
                     /mol_type="viral cRNA"
                     /strain="Haiti/1207/2014"
                     /serotype="1"
                     /host="Homo sapiens"
                     /db_xref="taxon:11053"
                     /country=""
                     /collection_date="27-Nov-2014"
     5'UTR           1..94

The same code will return:

KT279761 none
EU482591 USA: Puerto Rico
HQ332181 Venezuela
FJ810415 Venezuela: Aragua

However, without seeing one of your genbanks which doesnt contain the country information I'm not sure if this will work 100% of the time. If the /country= field is missing altogether I think it may still break. For your own practice you may want to alter the code to try and handle that (another simple elif statement would probably do it).

ADD COMMENT

Login before adding your answer.

Traffic: 3150 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6