How to avoid http failure when using rentrez to fetch records from a long list of species names
6.5 years ago
lvogel ▴ 30

I have a list of almost 5000 species of interest, for which I would like to download the sequences from GenBank to create a custom database. I've been using rentrez and following the tutorial, specifically this example:

snail_coi <- entrez_search(db="nuccore", term="COI[Gene] AND Gastropoda[ORGN]", use_history=TRUE)

But my problem is, my species list is not monophyletic. So I can't just use a single search term as [ORGN]. Instead, I read the species list into R, convert it to a character vector, and loop through it, using entrez_search, like this:

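library(rentrez)

# Pre-allocate the result list before filling it, one search per species.
org_search <- vector("list", length(speciesvec))
for (i in seq_along(speciesvec)) {
  org_search[[i]] <- entrez_search(db="nuccore", term=paste(speciesvec[i], "AND COI[Gene]"), use_history=TRUE)
}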

But usually after a couple hundred iterations or so, I get kicked out with a 502 Bad Gateway error. The error message says that this often happens when trying to download many records at once, and suggests using web history. I believe the problem is that I'm only adding entries to a list object, not creating an actual web history object. I'm running the command thousands of times instead of once, as in the example, but I can't think of any other way to do it. I appreciate any advice.
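
One way to cut the number of requests is to combine several species into a single search term with OR, so that 5000 species become on the order of a hundred searches. The following is only a sketch under assumptions: `build_batched_terms` is a hypothetical helper, the chunk size of 50 is arbitrary (very long terms may be rejected by the server), and the species names are illustrative.

```r
# Build one Entrez query per chunk of species instead of one per species.
# build_batched_terms() is a hypothetical helper, not part of rentrez.
build_batched_terms <- function(species, gene = "COI", chunk_size = 50) {
  # Split the species vector into consecutive chunks of at most chunk_size names.
  chunks <- split(species, ceiling(seq_along(species) / chunk_size))
  vapply(chunks, function(sp) {
    # Quote each name and tag it as an organism, then join the names with OR.
    organisms <- paste0('"', sp, '"[ORGN]', collapse = " OR ")
    paste0("(", organisms, ") AND ", gene, "[Gene]")
  }, character(1), USE.NAMES = FALSE)
}

# Illustrative species names; replace with the real species vector.
speciesvec <- c("Helix pomatia", "Apis mellifera", "Danio rerio")
terms <- build_batched_terms(speciesvec, chunk_size = 2)
print(terms)
```

Each element of `terms` could then be passed as the `term` argument of `entrez_search(db="nuccore", ..., use_history=TRUE)`, giving one web history object per chunk rather than one per species.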

R rentrez server error NCBI • 2.5k views

Be aware that NCBI will soon move to an API key system (YouTube link).

I do not know when this will be set up.

Also, you could try pausing your process (for example with Sys.sleep) after every couple hundred iterations.
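
A rough sketch of the pausing idea: wrap each request so that a failure triggers a wait and a retry instead of ending the whole loop. `with_retry` is a hypothetical helper, not a rentrez function, and the wait times are guesses. NCBI asks for no more than 3 requests per second without an API key (10 with one), so a short pause between successful requests is probably also worthwhile. The rentrez call is left commented out because it needs network access; a deliberately flaky function stands in for it.

```r
# Call f(); on error, wait and try again, doubling the wait each time.
# with_retry() is a hypothetical helper written for this example.
with_retry <- function(f, max_tries = 5, wait = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (attempt == max_tries) stop(result)  # give up and re-signal the last error
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    Sys.sleep(wait)
    wait <- wait * 2
  }
}

# Stand-in for a request that fails twice and then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("HTTP failure: 502") else "ok"
}
result <- with_retry(flaky, max_tries = 5, wait = 0.01)
print(result)

# Assumed usage with rentrez (requires network access, so not run here):
# library(rentrez)
# org_search <- vector("list", length(speciesvec))
# for (i in seq_along(speciesvec)) {
#   term <- paste(speciesvec[i], "AND COI[Gene]")
#   org_search[[i]] <- with_retry(function() {
#     entrez_search(db = "nuccore", term = term, use_history = TRUE)
#   })
#   Sys.sleep(0.4)  # stay under roughly 3 requests per second
# }
```

Because results are stored as they arrive, a run that still fails partway through can be resumed from the first empty slot of `org_search` rather than restarted.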


@Bastien has already noted a need to create an API key for NCBI programmatic queries.

I think this could be done faster using blastdbcmd and a local copy of nt blast database if you have that available.


I hope OP has a good fiber connection (around 60GB for nt)


I got blastdbcmd to work for one id at a time, like in the example on the web page:

blastdbcmd -db nt -entry all -outfmt "%g %T" | \
  awk ' { if ($2 == 9606) { print $1 } } ' | \
  blastdbcmd -db nt -entry_batch - -out human_sequences.txt

But I have a list of almost 5000 species, and putting the above in a loop seems unfeasible.


How about getting the corresponding fasta file for nt here and then retrieving the sequences you need from it?


genomax: good idea. I'm downloading it now. I imagine I'll use the descriptions in the fasta headers to search for my species names, because I think the taxids aren't in there. Will have to parse it somehow. If I have trouble I'll post again. Thanks.
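
A minimal sketch of that header-matching approach, reading the FASTA file in chunks so it never has to fit in memory. Everything here is an assumption for illustration: `filter_fasta_by_species` is a hypothetical function, the records are toy data, and matching is a plain substring search, so a name that is contained in another name will also match. As far as I know, NCBI's nt FASTA joins the deflines of identical sequences into one header line separated by Ctrl-A (`\001`) characters, so a single header may mention several species.

```r
# Stream a FASTA file and keep records whose header mentions a wanted species.
# filter_fasta_by_species() is a hypothetical helper written for this example.
# con_in and con_out must be open connections, e.g. file("nt.fa", "r").
filter_fasta_by_species <- function(con_in, con_out, species, chunk_lines = 10000) {
  keep <- FALSE  # whether the record currently being read should be written
  repeat {
    lines <- readLines(con_in, n = chunk_lines)
    if (length(lines) == 0) break
    is_header <- startsWith(lines, ">")
    headers <- lines[is_header]
    # TRUE for each header that contains at least one species name verbatim.
    hits <- Reduce(`|`,
                   lapply(species, function(sp) grepl(sp, headers, fixed = TRUE)),
                   rep(FALSE, length(headers)))
    # Each line inherits the decision of the most recent header; lines before
    # the first header of a chunk continue the record from the previous chunk.
    line_keep <- c(keep, hits)[cumsum(is_header) + 1]
    writeLines(lines[line_keep], con_out)
    keep <- line_keep[length(line_keep)]
  }
  invisible(NULL)
}

# Toy records; the first header holds two deflines joined by Ctrl-A.
fasta <- c(
  ">seq1 Helix pomatia cytochrome oxidase subunit I\001seq2 Helix lucorum COI",
  "ACGTACGT",
  "TTGA",
  ">seq3 Homo sapiens mitochondrion",
  "GGGCCC",
  ">seq4 Danio rerio COI",
  "AAAT"
)
in_con <- textConnection(fasta)
out_con <- textConnection(NULL, open = "w")
filter_fasta_by_species(in_con, out_con, c("Helix lucorum", "Danio rerio"), chunk_lines = 2)
kept <- textConnectionValue(out_con)
close(in_con)
close(out_con)
print(kept)
```

On the full nt file this will still take a long time, since every header is checked against every species name; it is meant to show the logic rather than to be fast.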


It's working, except for the fact that in some places the file contains ^A characters, followed by what appears to be missing data. I would guess I should delete the whole nt file, and download it again, and see if that works.


Thanks for the information. blastdbcmd looks like an interesting tool. I'll either figure out a way to use it, or just modify my current way with Sys.sleep, etc. Internet connection not so bad--last time I downloaded nt, it only took about a day or two. ;)
