I want to download a large number of protein sequences for a metaproteomics study, covering hundreds of genera (each of which contains multiple species). My esearch/efetch command looks like this. It is quite slow, even though my university has a high-speed connection, and the connection often breaks partway through the download.
esearch -db "protein" -query "txid374666[Organism]" | efetch -format fasta > txid_374666.fasta
I then adapted the following Python code from a Biostars thread. It is also slow, and it sometimes fails with a "Bad Gateway" error. Can anybody suggest a faster way of downloading the sequences? Thanks.
from Bio import Entrez
import json
import pandas as pd
import time

def get_ids(response) -> list:
    j = json.loads(response.read())
    return list(j['esearchresult']['idlist'])

Entrez.email = "my.name@myuniv.edu"
RETMAX = 990000
txids = [187492]  # 100K sequences

for txid in txids:
    prids = get_ids(Entrez.esearch(db="Protein", term=f"txid{txid}[Organism]", retmax=RETMAX, retmode="json"))
    with open(f"taxid_{txid}.fasta", 'w') as file:
        start_time = time.time()
        for prid in prids:
            # print(json.loads(Entrez.esummary(db="Protein", id=prid, retmode="json").read())['result'][prid])
            fasta = Entrez.efetch(db="Protein", id=prid, rettype="fasta", retmode="text").read()
            file.write(fasta)
    print("--- %s minutes, %s proteins , taxid_%s ---" % ((time.time() - start_time) // 60, len(prids), txid))
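The loop above makes one efetch request per protein ID, so 100K sequences means 100K round trips. One variation that may be worth comparing is to pass many IDs to a single efetch call and to retry a batch when the request fails. The sketch below is only an illustration under stated assumptions, not a tested fix: `batched`, `fetch_batch` and `download_fasta` are hypothetical helper names, the batch size of 200 and the retry settings are arbitrary starting points, and `Entrez.email` is assumed to be set as in the script above.

```python
import time
from typing import Callable, Iterator, List, Sequence, TextIO


def batched(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def fetch_batch(ids: List[str]) -> str:
    """Fetch FASTA text for several protein IDs with a single efetch call."""
    # Imported here so the helpers above can be used without Biopython installed.
    from Bio import Entrez  # assumes Entrez.email was set by the caller
    handle = Entrez.efetch(db="protein", id=",".join(ids),
                           rettype="fasta", retmode="text")
    try:
        return handle.read()
    finally:
        handle.close()


def download_fasta(ids: Sequence[str], out: TextIO,
                   fetch: Callable[[List[str]], str] = fetch_batch,
                   batch_size: int = 200,  # assumption: arbitrary starting point
                   retries: int = 3,       # assumption: attempts per batch
                   pause: float = 1.0) -> int:
    """Write FASTA for all IDs to `out`; return the number of batches written."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    written = 0
    for batch in batched(ids, batch_size):
        for attempt in range(1, retries + 1):
            try:
                text = fetch(batch)
                break
            except OSError:  # urllib's HTTPError and URLError are OSError subclasses
                if attempt == retries:
                    raise
                time.sleep(pause * attempt)  # wait a little longer each time
        out.write(text)
        written += 1
    return written
```

In the script above, this would replace the inner `for prid in prids:` loop with a single call such as `download_fasta(prids, file)` inside the `with open(...)` block. Whether it actually helps with the speed and the "Bad Gateway" errors would need to be measured.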
Thanks a ton