Question

More file parsing :) ***EDIT*** how do I make a fast and bed file from Genbank file - SOLVED

1

Entering edit mode

9.3 years ago

skbrimer ▴ 740

Hi All,

I'm trying to make a script that will allow me to take a given genbank file and convert it a fasta and a bed file in the same script so I can then use it with the mapper and viewers downstream. The file conversion itself is pretty straight forward. I'm not sure which of the developers for Biopython made SeqIO.convert but they have all my gratitude. In fact all of Biopython I have found to be really empowering and enjoyable, however as I'm still a inexperienced I'm having some trouble extracting the information I want.

So what I want is the name of the gene, start, stop, and strand just the basics. and when I read in the file I can see this.

genome.features

[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1192), strand=1), type='source'),SeqFeature(FeatureLocation(ExactPosition(23), ExactPosition(1127), strand=1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(23), ExactPosition(1127), strand=1), type='CDS')]

which I know is a list of the SeqFeatures but I'm having trouble getting to that information so I was exploring and printed what each item was:

for I in genome.features:
    print(i,type(i))

and got:

type: source
location: [0:1192](+)
qualifiers:
    Key: country, Value: ['USA']
    Key: db_xref, Value: ['taxon:38170']
    Key: mol_type, Value: ['genomic RNA']
    Key: organism, Value: ['Avian orthoreovirus']
    Key: segment, Value: ['S4']
    Key: strain, Value: ['AVS-B']
<class 'Bio.SeqFeature.SeqFeature'>
type: gene
location: [23:1127](+)
qualifiers:
    Key: gene, Value: ['sigma-NS']
<class 'Bio.SeqFeature.SeqFeature'>
type: CDS
location: [23:1127](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GI:315466580']
    Key: gene, Value: ['sigma-NS']
    Key: product, Value: ['sigma-NS protein']
    Key: protein_id, Value: ['CBX25032.1']
    Key: translation, Value: ['MDNTVRVGVSRNTSGAAGQTVFRNYYLLRCNISADGRNATKAVQSHFPFLSRAVRCLSPLAAHCADRTLRRDNVKQILTRELPFPSDLINYAHHVNSSSLTTSQGVEAARLVAQVYGEQLSFDHIYPTGSATYCPGAIANAISRIMAGFVPHEGDNFTPDGAIDYLAADLVAYKFVLPYMLDIVDGRPQIVLPSHTVEEMLSNTSLLNSIDASFGIESKSDQRMTRDAAEMSSRSLNELEDHEQRGRMPWKIMTAMFAAQLKVELDALADERVESQANAHVTSFGSRLFNQMSAFVPIDRELMELALLIKEQGFAMNPGQVASKWSLIRRSGPTRPLSGARLEIRNGNWTIREGDQTLLSVSPARMA']
<class 'Bio.SeqFeature.SeqFeature'>

I know that the Key Value pair means its a dictionary and when I try to just print the genome.features.type I get a list error so I think each feature is a list with two lists and a dictionary inside them but I can not seem to figure out how to extract the information.

Can anybody point me in the right direction in either an explanation or the correct documentation I need to read I would be very grateful.

biopython • 2.5k views

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 9.3 years ago by skbrimer ▴ 740

0

Entering edit mode

Can you change the title to something better and more meaningful? (keeping in mind about the people visiting the site, how could the get benefit from this question).

ADD REPLY • link 9.3 years ago by Sukhi Singh 11k

0

Entering edit mode

sure, no problem

ADD REPLY • link 9.3 years ago by skbrimer ▴ 740

Ram · Accepted Answer · 2016-01-21

HAHAHAHA!!! after a lot of trial and error and reading I got a script that works. Please find it below is anyone could use it for themselves.

''' file conversion for from genbank to fasta
    and to make a bedfile as well. Code modified from 
    Brant Faircloth's "bed_from_genbank.py" for the bed file
    you can find the original code here
    https://gist [dot] github [dot] com/brantfaircloth/893580'''

from Bio import SeqIO
import sys

# Lets gets some files 
gb_file = sys.argv[1]
lables = gb_file.split(".")

# Producing the bed file, first make a new writeable file and creat header
# Added the line break \n for a cleaner look with a text editor
# sorry my OCD. 

out_bedfile = open(lables[0]+".bed", 'w')
bed_header = "Track name="+ lables[0] + " description="+lables[0]+" genes\n"
out_bedfile.write(bed_header)

# loop will work with multi genbank files. 

for record in SeqIO.parse(open(gb_file,"rU"),"genbank"):
    for feature in record.features:
        if feature.type == 'gene':
            start = int(feature.location.start)
            stop = int(feature.location.end)
            try:
                name = feature.qualifiers['gene'][0]
            except:
                #some features only have locus tags
                name = feature.qualifiers['locus_tag'][0]
            if feature.strand < 0:
                strand = "-"
            else:
                strand = "+"
            bed_line = lables[0]+" dna\t{0}\t{1}\t{2}\t1000\t{3}\t{0}\t{1}\n".format(start, stop, name, strand)
            out_bedfile.write(bed_line)

out_bedfile.close()

#convert to fasta - so easy (thank you biopython)
SeqIO.convert(gb_file,"genbank",lables[0]+".fasta","fasta")

EDIT

correct a typo, code should be good now :)