Please forgive the newbie question; I am new to BioPython. I am trying to convert a large file from GenBank format to FASTA format using Bio.SeqIO.
I want to write an output file that has the accession number and taxon in the FASTA > header, followed by the GenBank taxonomy instead of the nucleotide sequence. I am already comfortable writing just the FASTA title and sequence. My goal is to build a file to train the RDP classifier for a eukaryote marker gene (one does not already exist for my marker).
The output I am looking for is:
>X62988Emericellanidulans
Eukaryota; Fungi; Dikarya; Ascomycota; Pezizomycotina; Eurotiomycetes; Eurotiomycetidae; Eurotiales; Trichocomaceae; Emericella; Emericella nidulans.
or:
>573145
Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Escherichia; Escherichia sp.
This is what I have used for simple nucleotide parsing:
    from Bio import SeqIO

    gbk_filename = "genbank.gbk"
    faa_filename = "fasta.fna"
    # Open the files using the names defined above.
    input_handle = open(gbk_filename, "r")
    output_handle = open(faa_filename, "w")
    # Write each GenBank record as ">id description", then its sequence.
    for seq_record in SeqIO.parse(input_handle, "genbank"):
        print("Parsing GenBank record %s" % seq_record.id)
        output_handle.write(">%s %s\n%s\n" % (
            seq_record.id,
            seq_record.description,
            str(seq_record.seq)))
    output_handle.close()
    input_handle.close()
    print("Completed")
I know this is probably a simple fix, but I have searched for a long while and cannot find where SeqIO exposes the taxonomy string. Does anyone have recommendations for modifying the script above?
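A possible starting point, offered as a sketch and not a confirmed solution: Biopython's GenBank parser stores the ORGANISM lineage in seq_record.annotations["taxonomy"] (a list of strings), the species name in seq_record.annotations["organism"], and the accession(s) in seq_record.annotations["accessions"]. The function names rdp_entry and genbank_to_taxonomy_fasta are hypothetical, and the layout simply mirrors the sample output above; whether the RDP classifier accepts exactly this layout is an assumption that still needs checking.

```python
from types import SimpleNamespace


def rdp_entry(record):
    """Return '>AccessionTaxon' plus a '; '-joined lineage line for one record."""
    annotations = record.annotations
    # Assumption: use the bare accession when present, otherwise fall back to record.id.
    accessions = annotations.get("accessions") or [record.id]
    organism = annotations.get("organism", "")
    lineage = list(annotations.get("taxonomy", []))
    # The sample output ends the lineage with the organism name itself.
    if organism:
        lineage.append(organism)
    header = ">%s%s" % (accessions[0], organism.replace(" ", ""))
    return "%s\n%s.\n" % (header, "; ".join(lineage))


def genbank_to_taxonomy_fasta(gbk_path, out_path):
    """Write one taxonomy entry per GenBank record (hypothetical helper)."""
    # Imported here so rdp_entry() can be tried without Biopython installed.
    from Bio import SeqIO

    with open(gbk_path) as input_handle, open(out_path, "w") as output_handle:
        for seq_record in SeqIO.parse(input_handle, "genbank"):
            output_handle.write(rdp_entry(seq_record))


# Stand-in record for illustration only; real records come from SeqIO.parse.
example = SimpleNamespace(
    id="X62988.1",
    annotations={
        "accessions": ["X62988"],
        "organism": "Emericella nidulans",
        "taxonomy": ["Eukaryota", "Fungi", "Emericella"],
    },
)
print(rdp_entry(example))
```

With the file names from the script above, the conversion would be run as genbank_to_taxonomy_fasta("genbank.gbk", "fasta.fna"). Records that lack a taxonomy annotation would produce a lineage holding only the organism name, so it may be worth checking for those before training.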
Thanks so much for helping me out... I'm pretty new to the BioPython parsing here. My best to you all.