Question

Protein Sequence from Entrez by Taxonomic_ID: Very slow

12 months ago

The ▴ 180

I want to download a large number of protein sequences for a metaproteomics study, covering hundreds of genera (each containing multiple species). My ESearch/EFetch command looks like the one below, but it is quite slow. Although my university has a high-speed network connection, the download is slow and the connection often breaks.

esearch -db "protein" -query "txid374666[Organism]" | efetch -format fasta > txid_374666.fasta

Then I took the following Python code from a Biostars thread. It is also slow, and it sometimes returns a "bad gateway" error. Can anybody suggest a faster way of downloading the sequences? Thanks.

from Bio import Entrez
import json
import time


def get_ids(response) -> list:
    """Return the list of record IDs from an ESearch JSON response."""
    j = json.loads(response.read())
    return list(j['esearchresult']['idlist'])


Entrez.email = "my.name@myuniv.edu"
RETMAX = 990000

txids = [187492]  # 100K sequences

for txid in txids:
    # Collect every protein ID for this taxon
    prids = get_ids(Entrez.esearch(db="protein", term=f"txid{txid}[Organism]", retmax=RETMAX, retmode="json"))
    with open(f"taxid_{txid}.fasta", 'w') as file:
        start_time = time.time()
        for prid in prids:
            # print(json.loads(Entrez.esummary(db="protein", id=prid, retmode="json").read())['result'][prid])
            # One EFetch request per protein ID
            fasta = Entrez.efetch(db="protein", id=prid, rettype="fasta", retmode="text").read()
            file.write(fasta)

        print("--- %s minutes, %s proteins  , taxid_%s ---" % ((time.time() - start_time) // 60, len(prids), txid))

Efetch entrez python protein sequence • 651 views

score 3 · Accepted Answer · 2023-11-02

12 months ago

GenoMax 147k

Please use the NCBI Datasets command-line tool (datasets) for this kind of workload.

An example download using the taxID from your post above:

datasets download genome taxon 187492 --include protein

This currently gets you 103 genomes in 2 minutes.
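Since the question mentions hundreds of genera, the same command can be driven from a loop. Below is a minimal Python sketch, not a tested pipeline. The helper names (`build_command`, `merge_proteins`, `download_taxon`) are made up for illustration. It assumes the `datasets` executable is on your PATH, that its `--filename` option sets the output archive name, and that each genome's proteins sit in a file named `protein.faa` inside the downloaded zip.

```python
import shutil
import subprocess
import zipfile
from pathlib import Path


def build_command(txid: int, zip_path: Path) -> list:
    """Build the `datasets` call shown above, with an explicit output archive name."""
    return [
        "datasets", "download", "genome", "taxon", str(txid),
        "--include", "protein",
        "--filename", str(zip_path),  # assumption: flag that names the zip
    ]


def merge_proteins(zip_path: Path, fasta_path: Path) -> int:
    """Concatenate every protein.faa in the archive into one FASTA file.

    Returns the number of protein.faa files found (one per genome, assuming
    the archive layout described in the lead-in).
    """
    count = 0
    with zipfile.ZipFile(zip_path) as archive, open(fasta_path, "wb") as out:
        for name in archive.namelist():
            if name.endswith("protein.faa"):
                # Stream the member to disk instead of loading it into memory
                with archive.open(name) as member:
                    shutil.copyfileobj(member, out)
                count += 1
    return count


def download_taxon(txid: int, workdir: Path = Path(".")) -> int:
    """Download one taxon with `datasets` and write taxid_<txid>.fasta."""
    zip_path = workdir / f"taxid_{txid}.zip"
    # check=True raises CalledProcessError if the download fails
    subprocess.run(build_command(txid, zip_path), check=True)
    return merge_proteins(zip_path, workdir / f"taxid_{txid}.fasta")
```

Calling `download_taxon(187492)` for each taxID in your list should leave one `taxid_<txid>.fasta` per genus. Note that this returns proteins annotated on genome assemblies, which is not necessarily the same set as the `txid...[Organism]` query against the protein database.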

Thanks a ton

ADD REPLY • link 12 months ago by The ▴ 180