Question

How to set sleep in GNU parallel in an esearch/efetch script

0

4.7 years ago

MAPK ★ 2.1k

I am requesting data from NCBI, and it looks like it only allows three requests per second, so I wanted to parallelize requests for three query IDs from ${IDLIST} per second. I would like to know how I can set a sleep time of 2 seconds in this code. I know that in a for-loop we can just do sleep 2, but what is the syntax to do this with parallel?

For example, if I run it for just three IDs, as below (head -3 "${IDLIST}"), the download request works:

  parallel -j1 \
  "IFS=$'\n';"'for hit in \
   $(esearch -db sra -query {} | efetch --format runinfo | grep SRR); do \
     echo "{},${hit}"; done' \
  ::: "$(head -3 "${IDLIST}")" \
  | sort -t, -k9,9rn >> out.csv

But it won't work for the full list:

parallel -j1 \
  "IFS=$'\n';"'for hit in \
   $(esearch -db sra -query {} | efetch --format runinfo | grep SRR); do \
     echo "{},${hit}"; done' \
  :::: "${IDLIST}" \
  | sort -t, -k9,9rn >> out.csv

Is there a way to limit this code to three requests per second?

These are some of the IDs in IDLIST:

A-ADC-AD000037-BR-NCR-09AD14648
A-ADC-AD000044-BR-NCR-09AD14647
A-ADC-AD000068-BR-NCR-08AD8038
A-ADC-AD000075-BR-NCR-08AD9964
A-ADC-AD000092-BR-NCR-09AD13601
A-ADC-AD000096-BR-NCR-08AD9891
A-ADC-AD000097-BR-NCR-08AD9961
A-ADC-AD000104-BR-NCR-09AD14644

sra ncbi programming shell • 1.6k views

ADD COMMENT • link updated 4.7 years ago by ole.tange ★ 4.5k • written 4.7 years ago by MAPK ★ 2.1k

1

it only allows three requests per second, so I wanted to parallelize requests for three query ids ${IDLIST} per second.

You are only going to make this worse. NCBI counts the queries per IP address. Have you signed up for an NCBI_API_KEY? If not, you should do that first. Ultimately NCBI counts the number of requests per domain at a higher level (if I recall right).

ADD REPLY • link 4.7 years ago by GenoMax 150k

1

NCBI may have some of this information available in the form of reports. Look around in ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Metadata/. If you have a really large number of queries, you can download the files and parse the info locally.

ADD REPLY • link 4.7 years ago by GenoMax 150k

0

@genomax I couldn't find anything older than "NCBI_SRA_Metadata_20181202.tar.gz". I need this from 201802. I just created the api_key and exported the variable export api_key="key", but that still won't solve the problem. Where do I add this key? Thank you for your help.

ADD REPLY • link 4.7 years ago by MAPK ★ 2.1k

0

Add the key to your .bashrc file for automatic export, or export it each time in the terminal you are going to run the searches from. Export it as the variable NCBI_API_KEY.
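A minimal sketch of what that line might look like, either in .bashrc or typed into the terminal before running the searches. The key value here is a hypothetical placeholder, not a real key; EDirect tools such as esearch and efetch read the NCBI_API_KEY environment variable.

```shell
# Placeholder value (an assumption for illustration): substitute the key
# generated in your own NCBI account settings.
export NCBI_API_KEY="your_api_key_here"
```

Note that the variable name matters: a variable called api_key would not be picked up.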

ADD REPLY • link 4.7 years ago by GenoMax 150k

score 2 · Accepted Answer · 2020-08-07

Something like this:

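# Path to the file that holds one query ID per line
IDLIST=IDLIST

# Run esearch/efetch for one query and prefix every SRR hit with the query ID
mysearch() {
    local query="$1"
    local IFS=$'\n'   # split the efetch output on newlines only
    for hit in $(esearch -db sra -query "$query" |
                     efetch --format runinfo |
                     grep SRR); do
        echo "$query,${hit}"
    done
}
# Make the function visible to the shells that parallel starts (bash only)
export -f mysearch

parallel -j0 --delay 0.34 mysearch :::: "$IDLIST" |
    sort -t, -k9,9rn >> out.csv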

The magic is --delay 0.34, which makes sure that a new job is started at most once every 0.34 seconds, i.e. no more than three job starts per second.
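The 0.34 figure follows from the three-requests-per-second limit mentioned in the question: 1/3 of a second, rounded up so the rate stays under the limit. As a rough sketch, the same arithmetic can be done for other limits, for example the 10 requests per second that NCBI documents for requests made with an API key. The helper name delay_for_rate is hypothetical (it is not part of GNU parallel or EDirect), and the calculation assumes one request per job start, which may undercount because each job calls both esearch and efetch.

```shell
#!/usr/bin/env bash
# delay_for_rate is a hypothetical helper, not part of GNU parallel or EDirect.
# It prints the smallest delay in seconds (two decimals, rounded up) that keeps
# job starts at or below the given number per second.
delay_for_rate() {
    awk -v rate="$1" 'BEGIN {
        hundredths = 100 / rate                 # delay in hundredths of a second
        rounded = int(hundredths)
        if (rounded < hundredths) rounded++     # round up, never down
        printf "%.2f\n", rounded / 100
    }'
}

delay_for_rate 3     # limit without an API key; prints 0.34
delay_for_rate 10    # limit with an API key; prints 0.10
```

The result could then be passed to parallel as --delay "$(delay_for_rate 3)". If requests are still rejected, a larger delay is the safer choice.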