I have the script below, and it seems to be hard-coded for a single file in the working directory. I know basically nothing about Python, but I want to change the script so that it can read in many GenBank files instead of being coded for one file at a time.
from Bio import SeqIO  # the script calls SeqIO.parse, so SeqIO (not GenBank) must be imported

gbk_filename = "NC_005213.gbk"
faa_filename = "NC_005213_converted.faa"
input_handle = open(gbk_filename, "r")
output_handle = open(faa_filename, "w")

# Write the translation of every CDS feature as a FASTA protein record
for seq_record in SeqIO.parse(input_handle, "genbank"):
    print("Dealing with GenBank record %s" % seq_record.id)
    for seq_feature in seq_record.features:
        if seq_feature.type == "CDS":
            assert len(seq_feature.qualifiers['translation']) == 1
            output_handle.write(">%s from %s\n%s\n" % (
                seq_feature.qualifiers['locus_tag'][0],
                seq_record.name,
                seq_feature.qualifiers['translation'][0]))

output_handle.close()
input_handle.close()
print("Done")
I am not familiar with scripts, but I think you can instead retrieve the GenBank Protein GI numbers (from http://biodbnet.abcc.ncifcrf.gov/db/db2db.php) and then use Batch Entrez to retrieve the protein sequences in bulk.
Yes, but the main function I am looking for is to get the sequences parsed into separate files, one for each GenBank accession of the overall contig.
Don't worry, friend, your problem will be solved soon...
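One way the script could be extended to many GenBank files, with one output file per accession, is sketched below. This is only a rough sketch under some assumptions: the input files are assumed to match the pattern `*.gbk` in the working directory, each output file is named after the record's `id` (which usually includes the version, e.g. `NC_005213.1`), and the function names `cds_fasta_entries` and `convert_genbank_files` and the `unknown_locus` placeholder are made up for illustration. Unlike the original script, it skips CDS features without a `translation` qualifier instead of stopping on them.

```python
import glob
import os


def cds_fasta_entries(seq_record):
    """Yield one FASTA-formatted string per CDS feature that has a translation."""
    for feature in seq_record.features:
        if feature.type != "CDS":
            continue
        translations = feature.qualifiers.get("translation")
        if not translations:
            # Some CDS features (e.g. pseudogenes) carry no translation; skip them.
            continue
        # Assumption: use a placeholder name when a CDS has no locus_tag.
        locus_tag = feature.qualifiers.get("locus_tag", ["unknown_locus"])[0]
        yield ">%s from %s\n%s\n" % (locus_tag, seq_record.name, translations[0])


def convert_genbank_files(pattern="*.gbk", out_dir="."):
    """Write one .faa file per GenBank record found in the files matching pattern."""
    # Imported here so cds_fasta_entries can be used without Biopython installed.
    from Bio import SeqIO

    written = []
    for gbk_filename in sorted(glob.glob(pattern)):
        # A single GenBank file may hold several records (accessions).
        for seq_record in SeqIO.parse(gbk_filename, "genbank"):
            faa_filename = os.path.join(out_dir, seq_record.id + "_converted.faa")
            with open(faa_filename, "w") as output_handle:
                for entry in cds_fasta_entries(seq_record):
                    output_handle.write(entry)
            written.append(faa_filename)
    return written
```

Calling `convert_genbank_files()` from the directory that holds the `.gbk` files should then return the list of `.faa` files it wrote. Note that two records with the same `id` in different input files would overwrite each other's output, so the naming scheme may need adjusting for your data.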