How can I print and write the strain /isolate/voucher number of a SeqRecord objec in biopython?
1
0
Entering edit mode
2.2 years ago

I would like to print and write (in fasta) the strain /isolate/voucher number of multiple DNA sequences parsed in biopython (and also organism name and sequence). However, although this information (strain/isolate/voucher) is within feature-qualifiers (genbank format), this information is not extracted with atributes annotation and features of a seqrecord object.

Does anyone know how can I do this?

I'm a newbie with python, so any tip would be very helpful.

Thanks

below there is part of a gb file as an example.

[...]
FEATURES             Location/Qualifiers
     source          1..841
                     /organism="Amauroderma calcitum"
                     /mol_type="genomic DNA"
                     /**isolate**="FLOR50931"
                     /db_xref="taxon:1774182"
                     /country="Brazil"
                     /collection_date="07-Jan-2013"
                     /collected_by="D.H. Costa-Rezende"
     rRNA            <1..>841
                     /product="large subunit ribosomal RNA"
[...]
biopython DNA number strain sequence • 1.2k views
ADD COMMENT
0
Entering edit mode

Please provide your current code used for parsing.

ADD REPLY
0
Entering edit mode

Hi Sej Thanks for your reply. The code is below:

from Bio import SeqIO
import sys
print(sys.argv[0])
arquivodeentrada = sys.argv[1]
arquivodesaida = sys.argv[2]

sequences=SeqIO.parse(arquivodeentrada,'genbank')

f=open(arquivodesaida,'w')

for seq in sequences:
    acesso=seq.name
    gensp=seq.annotations['organism'].split(' ')
    sequencia=seq.seq
    f.write(str('>'+gensp[0]+'_'+gensp[1]+'_'+acesso+'\n'))
    print(gensp[0]+'_'+gensp[1]+'_'+acesso)
    f.write(str(sequencia+'\n'))
    print(sequencia)

f.close()
ADD REPLY
1
Entering edit mode
2.2 years ago

The isolate is a qualifier of the source feature that you can access like so:

from Bio import SeqIO
from pprint import pprint

# Read genbank file
for rec in SeqIO.parse("genome.gb", "genbank"):
    source = rec.features[0]
    pprint(source.qualifiers)

will print:

OrderedDict([('organism', ['Amauroderma calcitum']),
             ('mol_type', ['genomic DNA']),
             ('isolate', ['FLOR 50931']),
             ('db_xref', ['taxon:1774182']),
             ('country', ['Brazil']),
             ('collection_date', ['07-Jan-2013']),
             ('collected_by', ['D.H. Costa-Rezende'])])

alternatively, IMHO, the simplest way to get this info is with bio

bio fetch KU315207 | bio json > data.json

now you have a json file that can be immediately materialized in your program, no GenBank parsing needed anymore

ADD COMMENT
0
Entering edit mode

Thank you very much for your help Istvan! First option worked fine for me.

ADD REPLY

Login before adding your answer.

Traffic: 2092 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6