Question

How to use Biopython for fetching metadata of NCBI/GenBank/RefSeq assembly identifiers?

1

6.3 years ago

O.rka ▴ 740

I'm trying to use Python and Biopython to fetch metadata for a given assembly identifier. In this case I'm looking for GCF_000005845.2.

from Bio import Entrez

# GCF_000005845.2
id_ecoli = "GCF_000005845.2"
esummary_handle = Entrez.esummary(db="assembly", id=id_ecoli, report="full")
record = Entrez.read(esummary_handle, validate=False)
record
# DictElement({'DocumentSummarySet': DictElement({'DocumentSummary': []}, attributes={'status': 'OK'})}, attributes={})

This is the type of data I'm looking for: https://www.ncbi.nlm.nih.gov/assembly/GCF_000005845.2/

I could write an HTML scraper, but I don't want to reinvent the wheel if something is already available.

biopython ncbi entrez Assembly fetch • 6.7k views

updated 6.3 years ago by Arup Ghosh 3.2k • written 6.3 years ago by O.rka ▴ 740

0

handle = Entrez.efetch(db="assembly", id="GCF_000005845.2")
record = Entrez.read(handle)
record

['5845', '2']

6.3 years ago by O.rka ▴ 740

score 11 · Accepted Answer · 2018-10-24

If JSON/XML output would be useful to you, you can use the following script.

#!/usr/bin/python

from Bio import Entrez
import json

#Set your email so NCBI can warn you before blocking excessive requests
Entrez.email = ""
#An API key raises the query limit to 10/s; get one from the https://www.ncbi.nlm.nih.gov/account/settings/ page
Entrez.api_key=""

term="GCF_000005845.2"
#Find the UIDs associated with the assembly accession
def get_ids(term):
    handle = Entrez.esearch(db="assembly", term=term)
    record = Entrez.read(handle)
    handle.close()
    #IdList is already a list of UID strings, so return it flat instead of nesting it
    return list(record["IdList"])

#Fetch raw output
def get_raw_assembly_summary(id):
    handle = Entrez.esummary(db="assembly",id=id,report="full")
    record = Entrez.read(handle)
    #Return individual fields
    #XML output: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=assembly&id=79781&report=%22full%22
    #return(record['DocumentSummarySet']['DocumentSummary'][0]['AssemblyName']) #This will return the Assembly name
    return(record)

#JSON formatted output
def get_assembly_summary_json(id):
    handle = Entrez.esummary(db="assembly",id=id,report="full")
    record = Entrez.read(handle)
    #Convert raw output to json
    return(json.dumps(record, sort_keys=True,indent=4, separators=(',', ': ')))

#Test
for id in get_ids(term):
    #print(get_raw_assembly_summary(id)) #For raw output
    print(get_assembly_summary_json(id)) #JSON Formatted
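
To pull out individual fields rather than dump the whole record, something like the following might work. It is a minimal sketch that operates on an already-parsed record, assuming the nesting `DocumentSummarySet` then `DocumentSummary` seen in the output in the question. `pick_fields` is a hypothetical helper name, and the field names other than `AssemblyName` are assumptions about what the assembly summary contains, so check them against the XML output linked in the script. The sample record is hand-written for illustration and is not real NCBI output.

```python
def pick_fields(record, fields):
    """Return one dict per DocumentSummary, keeping only the requested fields.

    Fields that are absent map to None. An empty DocumentSummary list (what
    the question got back when passing an accession instead of a UID)
    yields an empty list.
    """
    summaries = record["DocumentSummarySet"]["DocumentSummary"]
    return [{name: doc.get(name) for name in fields} for doc in summaries]


# Hand-written stand-in for the parsed esummary record; values are placeholders.
sample = {
    "DocumentSummarySet": {
        "DocumentSummary": [
            {"AssemblyName": "example_name", "Organism": "example organism"}
        ]
    }
}

# Field names besides AssemblyName are assumed, not confirmed by this thread.
wanted = ["AssemblyName", "Organism", "FtpPath_RefSeq"]
rows = pick_fields(sample, wanted)
print(rows)
```

With the script above, the equivalent call would be `pick_fields(get_raw_assembly_summary(id), wanted)`, since the parsed record behaves like nested dicts and lists.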