I'm not super handy with the Genbank submodule as I've never tried to make one from scratch, but I'm fairly sure what you tried to do above doesn't work like you think it does.
A Genbank.Record
describes the whole file (I think). What you're building there is the header of the genbank file, not a sequence record/feature. If you try to create that Record
you would be thrown an error from Record.__init__()
that there is no keyword argument for features
. I'm pretty sure this is the case, because it wouldnt make much sense for every sequence feature in the genbank to also accept a data_file_division
/locus_line
etc.
If you read a genbank file in to BioPython via SeqIO
and inspect its contents, you'll see all the sequence information is the .features
attribute, and they are all objects of the form:
SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(29330), strand=1), type='source'),
SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1065), strand=1), type='gene'),
SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1065), strand=1), type='CDS')
Therefore, I'm pretty sure you'll need to populate your Record()
with SeqFeature()
objects, which in turn contain FeatureLocation
objects and associated metadata/methods.
In my toy example, the data structure looks something a little like this, within the genbank Record()
, the attributes for actual entries are as follows (not an exhaustive example).
genbank.features[0]
-> this is the source
object, i.e. the full genbank sequence:
>>> repr(gbk.features[0])
"SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(29330), strand=1), type='source')"
Next up are the first 'real' features:
>>> repr(gbk.features[1])
"SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1065), strand=1), type='gene')"
>>> gbk.features[1].qualifiers
"OrderedDict([('locus_tag', ['pPAU_00002'])])"
A 'gene' doesn't have much in the way of information (in my example at least), just a locus tag and location information. The cds
, however, has much more interesting stuff:
>>> gbk.features[2]
SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1065), strand=1), type='CDS')
>>> gbk.features[2].qualifiers
OrderedDict([('gene', ['repB']), ('locus_tag', ['pPAU_00001']), ('inference', ['ab initio prediction:Prodigal:2.60', 'similar to AA sequence:UniProtKB:Q57154']), ('codon_start', ['1']), ('transl_table', ['11']), ('product', ['RepFIB replication protein A']), ('translation', ['MIENFNENNDMSDMFWEVEKGTGEVINLVPNTSNTVQPVVLMRLGLFVPTLKSTKRGHQGEMSSMDATAELRQLAIVKTEGYENIHITGARLDMDNDFKTWVGIIHSFAKHKVIGDAVTLSFVDFIKLCGIPSSRSSKRLRERLGASLRRIATNTLSFSSQNKSYHTHLVQSAYYDMVKDTVTIQADPKIFELYQFDRKVLLQLRAINELGRKESAQALYTYIESLPPSPAPISLARLRARLNLRSRVTTQNAIVRKAMEQLKGIGYLDYTEIKRGSSVYFIVHARRPKLKALKSSKSSFKRKKETQEESILTELTREELELLEIIRAEKIIKVTRNHRRKKQTLLTFAEDESQ'])])
Long story short, you need to build your file up from progressively smaller objects in order for the final genbank record to have all the information it needs, but you're incorrectly conflating a GenBank.Record()
and a sequence SeqRecord
/SeqFeature
. To my knowledge, this will be pretty tedious - I don't know of a particularly easy way of doing this.
You can specify the organism the
SeqRecord
for output in GenBank format viaBio.SeqIO.write(...)
as theSOURCE
andORGANISM
lines. You would need to populateseq_record.annotations['source']
, and this is the approach I would recommend here. Relevant code: https://github.com/biopython/biopython/blob/biopython-173/Bio/SeqIO/InsdcIO.py#L920