Question

Creating Genbank files with Python using BioGenBankRecord

3

Entering edit mode

5.8 years ago

toth.joe ▴ 60

I want to create my own Genbank files for import of sequence and annotations into a commercial DNA Software package.

I have a working version using Bio.SeqRecord but the SeqRecord format is limiting (I can't specify organism and some other attributes).

I am looking at Bio.GenBank.Record as a better implementation but I don't know enough python create a script using the source code Bio.GenBank.Record

Can someone point me in the right direction to get this test script to work?

#!/usr/bin/python

from Bio.GenBank.Record import Record
from Bio.Alphabet import IUPAC
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.SeqIO import InsdcIO

# Create a container with the data
container = Record(
    locus_line="SCU49845",
    size="5028 bp",
    residue_type="PROTEIN",
    data_file_division="PLN",
    accession="U49845",
    keywords="test",
    source="Saccharomyces cerevisiae (baker's yeast)",
    organism="Saccharomyces cerevisiae",
    features=['gene','687..3158','gene="AXL2"'],
    sequence="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEAAEVLLRVDNIIRARPRTANRQHM"
   )

# Save as GenBank file
output_file = open('test.gb', 'w')
InsdcIO.write_record(container, output_file, 'genbank')

python Biopython Genbank • 6.0k views

ADD COMMENT • link 5.8 years ago by toth.joe ▴ 60

1

Entering edit mode

You can specify the organism the SeqRecord for output in GenBank format via Bio.SeqIO.write(...) as the SOURCE and ORGANISM lines. You would need to populate seq_record.annotations['source'], and this is the approach I would recommend here. Relevant code: https://github.com/biopython/biopython/blob/biopython-173/Bio/SeqIO/InsdcIO.py#L920

ADD REPLY • link 5.8 years ago by Peter 6.0k

score 2 · Answer 1 · 2019-02-22

Thanks for the information. I realized that I had a problem with the Genbank record format.

Here is the code that works:

from Bio.GenBank.Record import Record

 def main():  
/# open csv read entries row by row then insert in a dictionary  
    with open(args.sampleFile,'r') as f:  
        csvf=csv.DictReader(f)        
        for row in csvf:  
/# Create a container with the data using the dictionary  
            container = Record()  
            container.locus = row['Sample']  
            container.size = len(row['Seq'])  
            container.residue_type="PROTEIN"  
            container.data_file_division="PRI"  
            container.date = (datetime.date.today().strftime("%d-%b-%Y")) # today's date  
            container.definition = row['FullCloneName']  
            container.accession = [row['Vgene']]  
            container.comment = 'project xyz'  
            container.version = getpass.getuser()  
            container.keywords = [row['ProjectName']]  
/# Save as GenBank file  
            with open(row['ProjectName'] + '_' + row['FullCloneName'] + '.gb', 'w') as output_file:  
                output_file.write(str(container))

score 0 · Answer 2 · 2019-02-18

I'm not super handy with the Genbank submodule as I've never tried to make one from scratch, but I'm fairly sure what you tried to do above doesn't work like you think it does.

A Genbank.Record describes the whole file (I think). What you're building there is the header of the genbank file, not a sequence record/feature. If you try to create that Record you would be thrown an error from Record.__init__() that there is no keyword argument for features. I'm pretty sure this is the case, because it wouldnt make much sense for every sequence feature in the genbank to also accept a data_file_division/locus_line etc.

If you read a genbank file in to BioPython via SeqIO and inspect its contents, you'll see all the sequence information is the .features attribute, and they are all objects of the form:

SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(29330), strand=1), type='source'),
SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1065), strand=1), type='gene'), 
SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1065), strand=1), type='CDS')

Therefore, I'm pretty sure you'll need to populate your Record() with SeqFeature() objects, which in turn contain FeatureLocation objects and associated metadata/methods.

In my toy example, the data structure looks something a little like this, within the genbank Record(), the attributes for actual entries are as follows (not an exhaustive example).

genbank.features[0] -> this is the source object, i.e. the full genbank sequence:

>>> repr(gbk.features[0])
"SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(29330), strand=1), type='source')"

Next up are the first 'real' features:

>>> repr(gbk.features[1])
"SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1065), strand=1), type='gene')"
>>> gbk.features[1].qualifiers
"OrderedDict([('locus_tag', ['pPAU_00002'])])"

A 'gene' doesn't have much in the way of information (in my example at least), just a locus tag and location information. The cds, however, has much more interesting stuff:

>>> gbk.features[2]
SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1065), strand=1), type='CDS')
>>> gbk.features[2].qualifiers
OrderedDict([('gene', ['repB']), ('locus_tag', ['pPAU_00001']), ('inference', ['ab initio prediction:Prodigal:2.60', 'similar to AA sequence:UniProtKB:Q57154']), ('codon_start', ['1']), ('transl_table', ['11']), ('product', ['RepFIB replication protein A']), ('translation', ['MIENFNENNDMSDMFWEVEKGTGEVINLVPNTSNTVQPVVLMRLGLFVPTLKSTKRGHQGEMSSMDATAELRQLAIVKTEGYENIHITGARLDMDNDFKTWVGIIHSFAKHKVIGDAVTLSFVDFIKLCGIPSSRSSKRLRERLGASLRRIATNTLSFSSQNKSYHTHLVQSAYYDMVKDTVTIQADPKIFELYQFDRKVLLQLRAINELGRKESAQALYTYIESLPPSPAPISLARLRARLNLRSRVTTQNAIVRKAMEQLKGIGYLDYTEIKRGSSVYFIVHARRPKLKALKSSKSSFKRKKETQEESILTELTREELELLEIIRAEKIIKVTRNHRRKKQTLLTFAEDESQ'])])

Long story short, you need to build your file up from progressively smaller objects in order for the final genbank record to have all the information it needs, but you're incorrectly conflating a GenBank.Record() and a sequence SeqRecord/SeqFeature. To my knowledge, this will be pretty tedious - I don't know of a particularly easy way of doing this.