Question

Parsing Genbank To Fasta Format In Biopython For Metagenomic Classification

0

Entering edit mode

13.2 years ago

Josh Herr 5.8k

Please forgive the newbie question, but I am indeed new to BioPython. I'm just simply trying to parse a large file in Genbank format to FASTA format and am using Bio.SeqIO in BioPython.

I'm looking to parse an output file with the Accession number and Taxon in the FASTA > header and then the Genbank Taxonomy instead of the nucleotide sequence. I am comfortable with parsing just the fasta title and sequence. What I am doing is constructing a file to train RDP classifier for a Eukaryote marker gene (one does not already exist for my marker).

The output I am looking for is:

X62988Emericellanidulans
Eukaryota; Fungi; Dikarya; Ascomycota; Pezizomycotina; Eurotiomycetes; Eurotiomycetidae; Eurotiales; Trichocomaceae; Emericella; Emericella nidulans.

or:

573145
Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Escherichia; Escherichia sp.

This is what I have used for simple nucleotide parsing.

from Bio import SeqIO

gbk_filename = "genbank.gbk"
faa_filename = "fasta.fna"

input_handle  = open(gbk_input, "r")
output_handle = open(faa_output, "w")

for seq_record in SeqIO.parse(input_handle, "genbank") :
    print "Parsing GenBank record %s" % seq_record.id
    output_handle.write(">%s %s\n%s\n" % (
           seq_record.id,
           seq_record.description,
           seq_record.seq.tostring()))

output_handle.close()
input_handle.close()
print "Completed"

I know this is probably a simple fix, but I've searched for a long while and can't find an output in SeqIO for the taxonomy string, does anyone have any recommendations for modifying the above script?

Thanks so much for helping me out... I'm pretty new to the BioPython parsing here. My best to you all.

biopython parsing metagenomics fasta • 6.0k views

ADD COMMENT • link updated 13.2 years ago by Damian Kao 16k • written 13.2 years ago by Josh Herr 5.8k

score 4 · Answer 1 · 2012-03-16

4

Entering edit mode

13.2 years ago

Damian Kao 16k

The annotations method returns a dictionary that holds the various attributes in the genbank file. This:

seq_record.annotations['taxonomy']

will return an array of the taxonomy line. To get it in the format you want, you can

'; '.join(seq_record.annotations['taxonomy'])

ADD COMMENT • link 13.2 years ago by Damian Kao 16k