Hi, I'm working on a project that involves downloading over a million SARS-CoV-2 sequences from NCBI. As this will eventually be an open source project, I'm trying to script as many steps as I can for repeatability. Currently, I'm stuck trying to use Biopython's Entrez tools (ESearch, EPost, and EFetch) to download complete sequences in FASTA format.
My code so far is as follows (parts with help from this Stack Overflow answer):
    from urllib.error import HTTPError
    from Bio import Entrez
    import time

    Entrez.api_key = "<censored>"
    Entrez.email = "<censored>"

    db = "nuccore"
    query = "txid2697049[organism:exp] AND biomol_genomic[prop] AND viruses[filter] AND 'USA'[Text Word] AND 'complete sequence'[Text Word]"

    handle = Entrez.esearch(db=db, term=query)
    record = Entrez.read(handle)
    count = int(record['Count'])

    handle = Entrez.esearch(db=db, term=query, retmax=count, usehistory="y")
    record = Entrez.read(handle)
    id_list = record['IdList']
    webenv = record['WebEnv']

    batch_size = 3
    for start in range(0, count, batch_size):
        end = min(count, start+batch_size)
        print("Going to post accession numbers %i to %i" % (start+1, end))
        attempt = 0
        success = False
        while attempt < 3 and not success:
            attempt += 1
            post_xml = Entrez.epost(db, webenv=webenv, id=",".join(id_list))
            success = True
        search_results = Entrez.read(post_xml)
        webenv = search_results["WebEnv"]
        query_key = search_results["QueryKey"]

    batch_size = 2
    out_handle = open("sarscov2.txt", "w")
    for start in range(0, count, batch_size):
        end = min(count, start+batch_size)
        print("Going to download record %i to %i" % (start+1, end))
        attempt = 0
        success = False
        while attempt < 3 and not success:
            attempt += 1
            try:
                fetch_handle = Entrez.efetch(db=db, rettype="fasta",
                                             retstart=start, retmax=batch_size,
                                             webenv=webenv, query_key=query_key)
                success = True
                time.sleep(10)
            except HTTPError as err:
                if 500 <= err.code <= 599:
                    print("Received error from server %s" % err)
                    print("Attempt %i of 3" % attempt)
                    time.sleep(15)
                else:
                    raise
        data = fetch_handle.read()
        fetch_handle.close()
        out_handle.write(data)
    out_handle.close()
I'm repeatedly getting an HTTP 504: Gateway Timeout error when running the epost line. I think this is because I'm sending too many requests, but I'm not sure how to fix it. Could anyone point me in the right direction? Thank you!
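A note on the epost loop as posted: each pass joins and posts the whole of id_list rather than the slice from start to end, so every request carries the full list, which may well be what times out. The sketch below shows one way to slice the IDs into batches and retry a call on 5xx errors with an increasing delay. The helper names (iter_batches, call_with_retry) and the delay values are made up for illustration, and the network call is passed in as a function so the logic can be tried without contacting NCBI.

```python
import time
from urllib.error import HTTPError


def iter_batches(ids, batch_size):
    """Yield consecutive slices of ids, each at most batch_size long."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def call_with_retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an HTTP 5xx error wait and retry, doubling the delay."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except HTTPError as err:
            # Re-raise client errors immediately, and give up after the last attempt.
            if not 500 <= err.code <= 599 or attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    demo_ids = [str(n) for n in range(1, 8)]  # stand-in for record["IdList"]
    for batch in iter_batches(demo_ids, 3):
        # In the real script this is where the batch would be posted, e.g.
        #   call_with_retry(lambda: Entrez.epost(db, id=",".join(batch)))
        print(",".join(batch))
```

With this shape, each request carries only batch_size IDs, and a failed request is retried instead of being treated as a success.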
Instead of this, your best option would be to use NCBI's datasets tool: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/how-tos/virus/get-sars2-genomes/
Then filter anything you need locally.
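For the local filtering step, here is a minimal sketch that streams a FASTA file one record at a time, so a download with over a million sequences never has to sit in memory at once. It uses only the standard library. The function names and the example headers are made up for illustration; what the real headers contain depends on how the data was downloaded, so the keep condition would need adapting.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_fasta(lines, keep):
    """Yield only the records for which keep(header, sequence) is true."""
    for header, seq in read_fasta(lines):
        if keep(header, seq):
            yield header, seq


if __name__ == "__main__":
    # Made-up records; a real run would pass an open file handle instead.
    demo = [
        ">seq1 example record, complete genome\n",
        "ACGT\n",
        "ACGT\n",
        ">seq2 example record, partial\n",
        "ACG\n",
    ]
    for header, seq in filter_fasta(demo, lambda h, s: "complete" in h):
        print(f">{header}\n{seq}")
```

Because both functions are generators, passing an open file handle in place of the demo list processes the file line by line.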