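As @cpad0112 suggested, adding < /dev/null to your esearch call is the way to go if you want to keep the while loop. Without it, esearch reads from the loop's standard input and swallows the remaining lines of accs.txt, so the loop stops after the first accession.
I recommend skipping the while loop altogether and using epost, as shown below. It is quicker.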
$ time cat accs.txt | while read p ; do esearch -db assembly -query $p < /dev/null | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession,Taxid ; done > out.txt
real 0m5.361s
user 0m1.828s
sys 0m0.553s
$ time epost -db assembly -input accs.txt | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession,Taxid > out.txt
real 0m1.567s
user 0m0.375s
sys 0m0.150s
If you have a lot of accessions and you are concerned about hitting the rate limit, I suggest you create an NCBI API key. See https://support.nlm.nih.gov/knowledgebase/article/KA-05316/en-us and the "Programmatic Access" section of https://www.ncbi.nlm.nih.gov/books/NBK179288/ for more details.
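For what it's worth, EDirect picks the key up from the NCBI_API_KEY environment variable, so once you have a key it is a one-line change (for example in your shell profile). The value below is only a placeholder, not a real key. With a key, the limit goes from 3 to 10 requests per second.

```shell
# Placeholder value (assumption): replace with the key from your NCBI account settings.
export NCBI_API_KEY="your_api_key_here"
```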
Finally, if you would rather not use EntrezDirect at all because it still takes too long to process the thousands of accessions you have, you can obtain this information from the assembly report files on the FTP site: ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS. For example, the file assembly_summary_refseq.txt is a tab-delimited table with 23 fields, including the assembly accession (column 1) and taxid (column 6).
You should be able to join your accession list to this table on the assembly accession:
$ join -j 1 <(sort accs.txt) <(grep -v '^#' assembly_summary_refseq.txt | sort -k1,1 -t $'\t') -t $'\t' | cut -f1,6
GCF_002993365.1 2079529
GCF_003999975.1 2486853
GCF_006496635.1 2027405
GCF_016924235.1 2811233
GCF_017084525.1 2812560
GCF_017104785.1 2703894
Note that assembly_summary_refseq.txt contains only the latest assembly accessions. Older data are in the assembly_summary_refseq_historical.txt file located in the same FTP path. If the input list has a mix of current and old assemblies, the best way to go is to first concatenate the two files and then do the join.
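If sorting two large files for join is a nuisance, a rough awk sketch of the same lookup across both files could look like this. The function name lookup_taxids is made up for illustration, and it assumes the accession is in column 1 and the taxid in column 6, as in the join above. Passing the current and historical files together has the same effect as concatenating them first.

```shell
# Hypothetical helper (not part of EDirect): print accession<TAB>taxid for
# every accession listed in the first file, searching all remaining files.
# Usage: lookup_taxids accs.txt assembly_summary_refseq.txt \
#            assembly_summary_refseq_historical.txt > out.txt
lookup_taxids() {
    accs=$1
    shift
    awk -F '\t' '
        NR == FNR  { want[$1] = 1; next }   # first file: remember wanted accessions
        /^#/       { next }                 # skip the header/comment lines
        $1 in want { print $1 "\t" $6 }     # column 6 is assumed to hold the taxid
    ' "$accs" "$@"
}
```

Unlike join, this does not need sorted input, and the output follows the order of the summary files rather than the order of accs.txt. It assumes accs.txt is not empty.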
Your while loop probably needs the < /dev/null redirection, as explained here: https://www.ncbi.nlm.nih.gov/books/NBK179288/. Otherwise, it will stop at the first line/record.
It works! I have also added a sleep 3s to avoid any problems with NCBI.
Thank you!
Be careful here. Requests in a loop will hammer the server and could get your IP (or that of your institution) banned. I think NCBI now rate-limits access to 3 requests per second to avoid this kind of problem, but you could still get banned.
I have thousands of accessions. Perhaps I need to find a different solution.
Thank you for warning me.
It may be simpler to just get the assembly summary report file and extract the taxIDs. They are in columns 6 and 7. You can search by either GCA* or GCF* accessions. There are other reports in that directory if you need RefSeq etc.
Yup, that's what I mentioned in my answer. But one caveat with this file is that it only includes assemblies that are current; the older assemblies are in the assembly_summary_{genbank,refseq}_historical.txt files in the same directory. If the input list has a mix of current and old assemblies, the best way to go is to first concatenate the two files and then do the grep.
I did not read all the answers completely. Sorry about that. Moving mine to a comment.
I think downloading this file once and doing the searches locally would save NCBI bandwidth/server load, especially if OP has literally thousands of these to look through.