So I'm trying to actively learn how to use Biopython, and Python for that matter. I made a script that extracts the GenBank accession number and the country from a .gb (GenBank) file and writes both to a text file. Here is my script:
from Bio import SeqIO
gb_file = "denv1.gb"
for gb_record in SeqIO.parse(open(gb_file, "r"), "genbank"):
    print(gb_record.name),
    for feat in gb_record.features:
        if feat.type == 'source':
            source = gb_record.features[0]
            for qualifiers in source.qualifiers:
                if qualifiers == 'country':
                    country = source.qualifiers['country']
                    print(country[0])
In the output text file, every accession number with a country is listed perfectly, but when a country is not present for a particular sequence in the GenBank file, it does this:
Instead I would like it to be blank where the name of the country would go. Hopefully this makes sense and isn't too confusing. Can anyone help me out? As I said I'm trying to learn, so pointers or links to help would be greatly appreciated!
A couple of other things to point out. With SeqIO.parse() you don't also need to call open(); it accepts the filename directly. My suggestion would be to throw in a check for existence of location... have a look at this thread from the other day:
C: How do can I use Biopython and SeqIO to parse out multiple genes from several NC
On line 5, you're making sure that the feature exists before you do something with it. My guess is that your code is failing to detect that a source isn't found in the next record, and is populating it with the last thing source evaluated to from a previous record.
OK, so I added this to my script:
and it does leave the accession number's country column blank, but it also adds an extra line in between each accession number and country:
I suspect this has to do with a newline character being carried over when the source can actually be found in the file, but not when it's blank.
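If the stray blank lines do come from newline handling, one way to rule that out is to build each output line as a single string and write it once, so every record ends in exactly one newline whether or not it has a country. This is only a minimal sketch assuming Python 3; format_row, format_table and the accession values are made up for illustration and are not taken from the original file.

```python
def format_row(accession, country=None):
    """Return one tab-separated output line, without a trailing newline."""
    # A missing country becomes an empty string, so the column stays blank.
    return f"{accession}\t{country or ''}"


def format_table(rows):
    """Join the rows so that every record ends in exactly one newline."""
    return "".join(format_row(acc, country) + "\n" for acc, country in rows)


# Hypothetical example values, not taken from the original GenBank file.
rows = [("ACC0001", "Brazil"), ("ACC0002", None), ("ACC0003", "Viet Nam")]

# end="" because format_table() already terminates each row with "\n".
print(format_table(rows), end="")
```

Because the newline is added in exactly one place, a record without a country cannot end up with zero or two line breaks.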
So I need to add something to my code to tell it that when no country is present in the file, it should print nothing? Are you showing that in the above code?
I haven't tested your code but that would be my guess. I could be wrong :p
The code above is similar to what you're doing. It extracts info from GenBank files.
I believe you need a statement something like:
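As a rough illustration of that kind of check (a sketch, not necessarily the exact statement meant here), the country can be looked up fresh for every record with a blank default, so a value from an earlier record is never reused. This assumes Python 3 and that Biopython stores qualifier values as lists of strings; country_of, write_table and the output filename are hypothetical names introduced for the example.

```python
def country_of(record):
    """Return the country from the record's 'source' feature, or ''."""
    for feature in record.features:
        if feature.type == "source":
            # Qualifier values are lists of strings; fall back to [""]
            # when the 'country' key is absent.
            return feature.qualifiers.get("country", [""])[0]
    return ""  # the record has no 'source' feature at all


def write_table(gb_path, out_path):
    """Write one 'accession<TAB>country' line per record in gb_path."""
    # Imported here so country_of() can be tried without Biopython installed.
    from Bio import SeqIO

    with open(out_path, "w") as out:
        # SeqIO.parse accepts a filename directly, so no open() is needed.
        for record in SeqIO.parse(gb_path, "genbank"):
            out.write(f"{record.name}\t{country_of(record)}\n")
```

It could then be called as write_table("denv1.gb", "countries.txt"), where countries.txt is just a placeholder name for the output file.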
Can you upload your genbank that you're trying this on so I can replicate what you've got?
So the entire file is too large, but here is a shortened version of the file: http://gobin.io/OBcO
Could you format the code using the Biostar post editor?
There! Thanks for letting me know!