15 months ago
Eugene
I run this command to download ~4,000 gene sequences of the invA gene for taxonomy ID 28901. It works fine for smaller datasets, but for this large dataset it takes a very long time and never finishes:
esearch -db nuccore -query 'gbdiv BCT[PROP] AND ( invA[gene] ) AND txid28901[ORGN] ' | efetch -format gbc | xtract -insd CDS gene sub_sequence | sed 's/ /_/g' | awk '{ IGNORECASE=1; if ( $2 ~ /invA/ ) print $0 }' > file
The command generates tab-delimited output for all genes in all genomes under taxid 28901. That is a very large output (many genomes x ~4,000 genes each), even though I need sequences for only a single gene, invA, which I select with awk or grep.
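For reference, the filtering step on its own can be sketched as below. This is only an illustration: `filter_gene` is a hypothetical helper name, and it assumes the xtract output is tab-delimited with the gene symbol in the second column. Unlike the regex match in the command above, it keeps exact, case-insensitive matches only, so a symbol such as invA_1 would be dropped.

```shell
# Hypothetical helper: keep only rows whose 2nd tab-delimited column
# equals the requested gene symbol, ignoring case.
# Assumes input rows look like: accession <TAB> gene <TAB> sequence
filter_gene() {
    awk -F'\t' -v g="$1" 'tolower($2) == tolower(g)'
}

# Example use (not run here):
# ... | xtract -insd CDS gene sub_sequence | filter_gene invA > file
```

Using `tolower` instead of `IGNORECASE` avoids depending on gawk, and it does not reduce the amount of data efetch has to download; it only tidies the last step.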
Here are errors I get:
curl: (22) The requested URL returned error: 400
ERROR: curl command failed with: 22
-X POST https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi -d query_key=1&WebEnv=MCID_64dbfbab6b6e680dea649326&retstart=33000&retmax=100&db=nuccore&rettype=gbc&retmode=xml&api_key=xxx&tool=edirect&edirect=20.0&edirect_os=Linux
HTTP/1.1 400 Bad Request
WARNING: FAILURE
nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ efetch.fcgi -query_key 1 -WebEnv MCID_64dbfbab6b6e680dea649326 -retstart 33000 -retmax 100 -db nuccore -rettype gbc -retmode xml -api_key xxxx -tool edirect -edirect 20.0 -edirect_os Linux
EMPTY RESULT
SECOND ATTEMPT
curl: (22) The requested URL returned error: 500
ERROR: curl command failed with: 22
-X POST https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi -d query_key=1&WebEnv=MCID_64dbfbab6b6e680dea649326&retstart=34100&retmax=100&db=nuccore&rettype=gbc&retmode=xml&api_key=xxx&tool=edirect&edirect=20.0&edirect_os=Linux
Is there a way to speed this command up, or to break the query into smaller chunks so that it does not time out?
Thank you
--
Gene
I assume you are using an NCBI API key; otherwise this would not be working. If the query is timing out because of the large amount of data, there may not be much you can do.
Consider using datasets instead of EntrezDirect as an alternative (LINK). You may also want to get the accession numbers of the records and then submit this query in chunks, with a certain number of records at a time.
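A rough sketch of the chunked approach, under some assumptions: the accessions have already been saved one per line (for example with `efetch -format acc`), `batch_ids` and `fetch_in_batches` are hypothetical helper names, the batch size of 100 is arbitrary, and efetch is assumed to exit non-zero when a request fails. Treat it as a starting point rather than a tested recipe.

```shell
# Hypothetical helper: read one ID per line on stdin and print
# comma-separated batches of at most N IDs, one batch per line.
batch_ids() {
    awk -v n="${1:-100}" '
        NF { printf "%s%s", (c % n ? "," : ""), $1; c++
             if (c % n == 0) print "" }
        END { if (c % n) print "" }
    '
}

# Hypothetical helper: fetch and parse each batch separately, so one
# failed request costs a single batch instead of the whole run.
# Usage: fetch_in_batches accessions.txt [batch_size]
fetch_in_batches() {
    batch_ids "${2:-100}" < "$1" | while IFS= read -r ids; do
        tmp=$(mktemp)
        # Assumption: efetch returns a non-zero status on HTTP errors.
        if efetch -db nuccore -id "$ids" -format gbc > "$tmp"; then
            xtract -insd CDS gene sub_sequence < "$tmp"
        else
            echo "batch failed: $ids" >&2
        fi
        rm -f "$tmp"
        sleep 1   # brief pause between requests
    done
}

# Example use (not run here):
# esearch -db nuccore -query '...' | efetch -format acc > accessions.txt
# fetch_in_batches accessions.txt 100 > all_cds.tsv 2> failed_batches.log
```

Failed batches are written to stderr, so they can be collected and retried later. Running xtract once per batch also keeps each parse small, which may help if memory use grows over a long run.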
Thank you! I used an NCBI API key. Using NCBI datasets seems a very good idea, but neither the command line nor the NCBI web interface returns any results:
datasets download gene symbol invA --taxon 28901 --include gene,cds
Error: No genes found that match selection
I tried an E. coli gene as a positive control, used the species name instead of the taxID, and tried upper and lower case for the gene symbol, all with the same result. I will follow your suggestion and download all accessions first, then run efetch | xtract | awk for each accession separately.
PS: My esearch | efetch pipeline generates output, but it does not seem to release memory for the data it has generated. I got this error message on Ubuntu 22.04 with >100 GB RAM: ecommon.sh: xrealloc: cannot allocate 18446744072361744256 bytes