Question

Esearch, Epost, and Efetch for Large Datasets in Biopython

16 months ago

Salem • 0

Hi, I'm working on a project that involves downloading over a million SARS-CoV-2 sequences from NCBI. As this will eventually be an open-source project, I'm trying to script as many steps as I can for reproducibility. Currently, I'm stuck trying to use Biopython's Entrez tools (Esearch, Epost, and Efetch) to download complete sequences in FASTA format.

My code so far is as follows (parts with help from this Stack Overflow answer):

from urllib.error import HTTPError
from Bio import Entrez
import time

Entrez.api_key = "<censored>"
Entrez.email = "<censored>"

db = "nuccore"
query = "txid2697049[organism:exp] AND biomol_genomic[prop] AND viruses[filter] AND 'USA'[Text Word] AND 'complete sequence'[Text Word]"

handle = Entrez.esearch(db=db, term=query)
record = Entrez.read(handle)

count = int(record['Count'])

handle = Entrez.esearch(db=db, term=query, retmax=count, usehistory="y")
record = Entrez.read(handle)

id_list = record['IdList']
webenv = record['WebEnv']

batch_size = 3
for start in range(0, count, batch_size):
    end = min(count, start+batch_size)
    print("Going to post accession numbers %i to %i" % (start+1, end))
    attempt = 0
    success = False
    while attempt < 3 and not success:
        attempt += 1
        post_xml = Entrez.epost(db, webenv=webenv, id=",".join(id_list))
        success = True
    search_results = Entrez.read(post_xml)


webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]


batch_size = 2
out_handle = open("sarscov2.txt", "w")
for start in range(0, count, batch_size):
    end = min(count, start+batch_size)
    print("Going to download record %i to %i" % (start+1, end))
    attempt = 0
    success = False
    while attempt < 3 and not success:
        attempt += 1
        try:
            fetch_handle = Entrez.efetch(db=db, rettype="fasta",
                                         retstart=start, retmax=batch_size,
                                         webenv=webenv, query_key=query_key)
            success = True
            time.sleep(10)
        except HTTPError as err:
            if 500 <= err.code <= 599:
                print("Received error from server %s" % err)
                print("Attempt %i of 3" % attempt)
                time.sleep(15)
            else:
                raise
    data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)
out_handle.close()

I'm repeatedly getting an HTTP 504: Gateway Timeout error when running the epost line. I think this is because I'm sending too many requests, but I'm not sure how to fix it. Could anyone point me in the right direction? Thank you!
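One thing that stands out in the code above: the epost call joins the entire id_list on every pass through the loop, so the full ID set is posted again for each batch of three. Also, because the second esearch already uses usehistory="y", its result should already carry a WebEnv and QueryKey, so the epost step may not be needed at all. Below is a minimal sketch of fetching straight from the history set in batches. The function name fetch_in_batches, the batch size of 200, and the pause length are assumptions for illustration, not tested recommendations; the efetch argument is expected to behave like Bio.Entrez.efetch and is passed in so the batching and retry logic can be tried without network access.

```python
import time
from urllib.error import HTTPError


def fetch_in_batches(efetch, out_path, count, webenv, query_key,
                     db="nuccore", batch_size=200, max_attempts=3, pause=1.0):
    """Download `count` records from an Entrez history set in batches.

    `efetch` should behave like Bio.Entrez.efetch (hypothetical wiring;
    pass the real function in actual use). Returns the number of batches
    written to `out_path`.
    """
    batches = 0
    with open(out_path, "w") as out_handle:
        for start in range(0, count, batch_size):
            for attempt in range(1, max_attempts + 1):
                try:
                    handle = efetch(db=db, rettype="fasta", retmode="text",
                                    retstart=start, retmax=batch_size,
                                    webenv=webenv, query_key=query_key)
                    data = handle.read()
                    handle.close()
                    break
                except HTTPError as err:
                    # Retry only server-side (5xx) errors; give up after
                    # the last attempt instead of writing stale data.
                    if not 500 <= err.code <= 599 or attempt == max_attempts:
                        raise
                    time.sleep(pause * attempt)
            out_handle.write(data)
            batches += 1
            time.sleep(pause)  # stay polite between requests
    return batches
```

With Biopython this might be called as fetch_in_batches(Entrez.efetch, "sarscov2.fasta", int(record["Count"]), record["WebEnv"], record["QueryKey"]), where record is the parsed result of Entrez.esearch(db=db, term=query, usehistory="y"). Unlike the original loop, a batch that fails all of its attempts raises instead of silently reusing the previous handle.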

eutils biopython entrez • 782 views

Updated 16 months ago by Ram 44k • written 16 months ago by Salem • 0

Instead of this, your best option would be to use NCBI's datasets tool: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/how-tos/virus/get-sars2-genomes/

Filter anything you need locally.
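As a rough illustration of local filtering, here is a minimal sketch that keeps only FASTA records whose header contains a keyword and whose sequence meets a minimum length. The names read_fasta and filter_fasta and the example keyword are hypothetical, and the sketch assumes the criteria you care about actually appear in the header line, which may not hold for every download (metadata such as location may live in a separate report file). It uses only the standard library; Bio.SeqIO would work just as well.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_fasta(lines, keyword, min_length=0):
    """Yield records whose header contains `keyword` (case-insensitive)
    and whose sequence has at least `min_length` characters."""
    for header, seq in read_fasta(lines):
        if keyword.lower() in header.lower() and len(seq) >= min_length:
            yield header, seq
```

For a real download, pass an open file handle as lines, for example filter_fasta(open("genomic.fna"), "complete genome", min_length=29000); the file name and threshold here are placeholders.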

16 months ago by GenoMax 148k