Question

How can I print and write the strain /isolate/voucher number of a SeqRecord objec in biopython?

0

Entering edit mode

2.2 years ago

diogobio.dh • 0

I would like to print and write (in fasta) the strain /isolate/voucher number of multiple DNA sequences parsed in biopython (and also organism name and sequence). However, although this information (strain/isolate/voucher) is within feature-qualifiers (genbank format), this information is not extracted with atributes annotation and features of a seqrecord object.

Does anyone know how can I do this?

I'm a newbie with python, so any tip would be very helpful.

Thanks

below there is part of a gb file as an example.

[...]
FEATURES             Location/Qualifiers
     source          1..841
                     /organism="Amauroderma calcitum"
                     /mol_type="genomic DNA"
                     /**isolate**="FLOR50931"
                     /db_xref="taxon:1774182"
                     /country="Brazil"
                     /collection_date="07-Jan-2013"
                     /collected_by="D.H. Costa-Rezende"
     rRNA            <1..>841
                     /product="large subunit ribosomal RNA"
[...]

biopython DNA number strain sequence • 1.2k views

ADD COMMENT • link 2.2 years ago by diogobio.dh • 0

0

Entering edit mode

Please provide your current code used for parsing.

ADD REPLY • link 2.2 years ago by Sej Modha 5.3k

0

Entering edit mode

Hi Sej Thanks for your reply. The code is below:

from Bio import SeqIO
import sys
print(sys.argv[0])
arquivodeentrada = sys.argv[1]
arquivodesaida = sys.argv[2]

sequences=SeqIO.parse(arquivodeentrada,'genbank')

f=open(arquivodesaida,'w')

for seq in sequences:
    acesso=seq.name
    gensp=seq.annotations['organism'].split(' ')
    sequencia=seq.seq
    f.write(str('>'+gensp[0]+'_'+gensp[1]+'_'+acesso+'\n'))
    print(gensp[0]+'_'+gensp[1]+'_'+acesso)
    f.write(str(sequencia+'\n'))
    print(sequencia)

f.close()

ADD REPLY • link updated 2.2 years ago by GenoMax 147k • written 2.2 years ago by diogobio.dh • 0

score 1 · Answer 1 · 2022-09-07

The isolate is a qualifier of the source feature that you can access like so:

from Bio import SeqIO
from pprint import pprint

# Read genbank file
for rec in SeqIO.parse("genome.gb", "genbank"):
    source = rec.features[0]
    pprint(source.qualifiers)

will print:

OrderedDict([('organism', ['Amauroderma calcitum']),
             ('mol_type', ['genomic DNA']),
             ('isolate', ['FLOR 50931']),
             ('db_xref', ['taxon:1774182']),
             ('country', ['Brazil']),
             ('collection_date', ['07-Jan-2013']),
             ('collected_by', ['D.H. Costa-Rezende'])])

alternatively, IMHO, the simplest way to get this info is with bio

bio fetch KU315207 | bio json > data.json

now you have a json file that can be immediately materialized in your program, no GenBank parsing needed anymore

https://www.bioinfo.help/