Question

How can I obtain specimen_voucher information for each accession number in a python dictionary prepared with biopython?

0

Entering edit mode

6.9 years ago

sw.knudsen • 0

With this piece of code in python using the module biopython:

from Bio import Entrez
Entrez.email = "email@any_email.com"

organisms=["Diaphus anderseni"]
genes = ["H3"]
specs = {}
acclist = {}
for org in organisms:
    for gene in genes:
        query= org+"[organism] AND "+gene+"[gene]"
        res = Entrez.esearch(db="nucleotide", term=query, retmax=10000)
        rec = Entrez.read(res)
        res = Entrez.efetch(db="nucleotide", id=rec["IdList"],  retmode = "xml")
        for record in Entrez.read(res):
            speciesName = record["GBSeq_organism"]
            accn = record["GBSeq_accession-version"]
            if accn in acclist:
                acclist[accn].append(speciesName)
            else:
                acclist[accn] = [speciesName]

I get a dictionary with two entries like this:

{'KJ555688.1': ['Diaphus anderseni'], 'KJ555689.1': ['Diaphus anderseni']}

But I would also like to prepare a dictionary that has the 'specimen_voucher' information , so it looks like this:

{'KJ555688.1': ['SIO:10-169'], 'KJ555689.1': ['SIO:10-170']}

I prepared this last dictionary manually, by looking up the complete GenBank record for KJ555688 and KJ555689. But I would like to be able to do it in python, to do it on a grander scale with hundreds of accession numbers. Any advice on this would be greatly appreciated. Thanks in advance for your time and help.

py python biopython GenBank record • 1.3k views

ADD COMMENT • link updated 5.6 years ago by AK ★ 2.2k • written 6.9 years ago by sw.knudsen • 0

score 0 · Answer 1 · 2019-05-23

Hi sw.knudsen,

specimen_voucher will be record["GBSeq_feature-table"][0]["GBFeature_quals"][3]["GBQualifier_value"]:

$ cat test.py
from Bio import Entrez
Entrez.email = "email@any_email.com"

organisms=["Diaphus anderseni"]
genes = ["H3"]
specs = {}
acclist = {}
for org in organisms:
    for gene in genes:
        query= org+"[organism] AND "+gene+"[gene]"
        res = Entrez.esearch(db="nucleotide", term=query, retmax=10000)
        rec = Entrez.read(res)
        res = Entrez.efetch(db="nucleotide", id=rec["IdList"],  retmode = "xml")
        for record in Entrez.read(res):
            speciesName = record["GBSeq_organism"]
            accn = record["GBSeq_accession-version"]
            specimenVoucher = record["GBSeq_feature-table"][0]["GBFeature_quals"][3]["GBQualifier_value"]
            if accn in acclist:
                acclist[accn].append(specimenVoucher)
            else:
                acclist[accn] = [specimenVoucher]

print(acclist)

$ python test.py
{'KJ555689.1': ['SIO:10-170'], 'KJ555688.1': ['SIO:10-169']}