Question

NCBI E-eutilitis not working properly inside a while loop

1

Entering edit mode

2.9 years ago

andres.firrincieli 3.8k

Hello everyone

I have the following loop to get the TaxID from a list (aaa.txt) of Assembly Accession.

head aaa.txt
GCF_016924235.1
GCF_003999975.1
GCF_002993365.1
GCF_017084525.1
GCF_006496635.1
GCF_017104785.1

The loop:

cat aaa.txt | while read p; do 
    echo $p; 
    esearch -db assembly -query $p | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession,Taxid > out.txt; 
done;

I can't figure out why the loop stops after the first line iteration.

Thank you

A

E-eutilitis NCBI • 2.2k views

ADD COMMENT • link updated 2.9 years ago by GenoMax 148k • written 2.9 years ago by andres.firrincieli 3.8k

4

Entering edit mode

probably your while loop needs /dev/null redirection as explained here: https://www.ncbi.nlm.nih.gov/books/NBK179288/. Otherwise, it will stop at first line/record.

ADD REPLY • link 2.9 years ago by cpad0112 21k

2

Entering edit mode

It works!

cat aaa.txt | while read p; do 
    echo $p; 
    esearch -db assembly -query $p < /dev/null | 
    esummary | 
    xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession,Taxid > $p.txt
    sleep 3s 
done;

I have also added a sleep 3s to avoid any problem with NCBI.

Thank you!

ADD REPLY • link 2.9 years ago by andres.firrincieli 3.8k

0

Entering edit mode

Be careful here. Requests in a loop will hammer the server and could get your IP (or that of your institution) banned. I think now the NCBI is rate limiting access to 3 requests per second to avoid this kind of problem but you could still get banned.

ADD REPLY • link 2.9 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

I have thousands of accessions. Perhaps I need to find a different solution.

Thank you for warning me.

ADD REPLY • link 2.9 years ago by andres.firrincieli 3.8k

0

Entering edit mode

It may be simple to just get the assembly summary report file and extract the taxID's. They are in column 6 and 7. You can either search by GCA* or GCF* accessions. There are other reports in that directory if you need RefSeq etc.

$ grep GCF_016924235 assembly_summary_genbank.txt 
GCA_016924235.1 PRJNA575502     SAMN17818122    JAFFZP000000000.1       representative genome   2811233 2811233 Amphritea pacifica      strain=RP18W            latest  Scaffold        Major   Full    2021/02/23 ASM1692423v1     Guangdong Institute of Microbiology     GCF_016924235.1 identical       https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/016/924/235/GCA_016924235.1_ASM1692423v1                   na

ADD REPLY • link 2.9 years ago by GenoMax 148k

0

Entering edit mode

Yup, that's what I mentioned in my answer. But one caveat with this file is that it only includes assemblies that are current; the older assemblies are in assembly_summary_{genbank,refseq}_historical.txt file in the same directory. If the input list has a mix of current and old assemblies, the best way to go is to first concatenate the two files and then do the grep.

ADD REPLY • link 2.9 years ago by vkkodali_ncbi ★ 3.8k

0

Entering edit mode

I did not read all the answers completely. Sorry about that. Moving mine to a comment.

I think downloading this file once and doing the searches locally would save NCBI bandwidth/server load, especially if OP has literally thousands of these to look through.

ADD REPLY • link 2.9 years ago by GenoMax 148k

0

Entering edit mode

$ grep -f aaa.txt assembly_summary_genbank.txt | awk -F "\t" -v OFS="\t" '{print $18,$6}'
$ parallel --plus grep {} assembly_summary_genbank.txt :::: aaa.txt |  awk -F "\t" -v OFS="\t" '{print $18,$6}'

ADD REPLY • link 2.9 years ago by cpad0112 21k

6

Entering edit mode

2.9 years ago

MirianT_NCBI ▴ 770

Hi,
Another option would be to use the summary option from NCBI Datasets in combination with jq. Here's the command:

datasets summary genome accession --inputfile aaa.txt | jq -r '.assemblies[].assembly | [.assembly_accession,.org.tax_id] | @tsv' > out.txt

From the list you provided as an example, here's the output:

GCF_003999975.1 2486853
GCF_002993365.1 2079529
GCF_016924235.1 2811233
GCF_006496635.1 2027405
GCF_017104785.1 2703894
GCF_017084525.1 2812560

If you prefer the output as csv, you can change it in the last part of the command (@csv instead of @tsv).

Let me know if you have any questions or if you find any issues. :)

ADD COMMENT • link 2.9 years ago by MirianT_NCBI ▴ 770

2

Entering edit mode

2.9 years ago

Jean-Karim Heriche 27k

Given that there's a limit of 3 requests per second and your command is already making at least two (i.e. esearch and esummary), you won't complete the second iteration unless a full second has passed after the first.

ADD COMMENT • link 2.9 years ago by Jean-Karim Heriche 27k

score 5 · Accepted Answer · 2022-02-09

As @cpad0112 suggested, adding < /dev/null to your esearch is the way to go if you want to use the while loop. I recommend skipping the while loop altogether and use epost as shown below. It is quicker.

$ time cat accs.txt | while read p ; do esearch -db assembly -query $p < /dev/null | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession,Taxid ; done > out.txt
real    0m5.361s
user    0m1.828s
sys     0m0.553s

$ time epost -db assembly -input accs.txt | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession,Taxid > out.txt
real    0m1.567s
user    0m0.375s
sys     0m0.150s

If you have a lot of accessions and you are concerned about hitting the rate limit, I suggest you create an NCBI API key. See https://support.nlm.nih.gov/knowledgebase/article/KA-05316/en-us and the "Programmatic Access" section of https://www.ncbi.nlm.nih.gov/books/NBK179288/ for more details.

Finally, if you don't want to deal with EntrezDirect at all because it still takes a long time to process thousands of accessions you have, you can obtain this information from the assembly report files located on the FTP here: ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS. For example, the file assembly_summary_refseq.txt is a tab-delimited table with 23 fields including assembly accession and taxid.

You should be able to join on the assembly accession to this table:

$ join -j 1 <(sort accs.txt) <(grep -v '^#' assembly_summary_refseq.txt | sort -k1,1 -t $'\t') -t $'\t' | cut -f1,6
GCF_002993365.1 2079529
GCF_003999975.1 2486853
GCF_006496635.1 2027405
GCF_016924235.1 2811233
GCF_017084525.1 2812560
GCF_017104785.1 2703894

Note, assembly_summary_refseq.txt contains only the latest assembly accessions. Older data are in assembly_summary_refseq_historical.txt file located in the same FTP path. If the input list has a mix of current and old assemblies, the best way to go is to first concatenate the two files and then do the join.