I have a list of almost 5000 species of interest, for which I would like to download the sequences from GenBank to create a custom database. I've been using rentrez and following the tutorial, specifically this example:
snail_coi <- entrez_search(db="nuccore", term="COI[Gene] AND Gastropoda[ORGN]", use_history=TRUE)
But my problem is, my species list is not monophyletic. So I can't just use a single search term as [ORGN]. Instead, I read the species list into R, convert it to a character vector, and loop through it, using entrez_search, like this:
i <- 1
while(i <= length(speciesvec)){
    org_search[[i]] <- entrez_search(db="nuccore", term=paste(speciesvec[i], "AND COI[Gene]", sep=" "), use_history=TRUE)
    i <- i + 1
}
But usually after a couple hundred iterations or so, I get kicked out with a 502 Bad Gateway error. The error message says that this often happens when trying to download many records at once, and suggests using web history. I believe the problem is that I'm only adding entries to a list object, not creating an actual web history object. I'm running the command thousands of times, instead of once as in the example, but I can't think of any other way to do it. I appreciate any advice.
Be aware that NCBI will soon move to an API key system (YouTube link).
I do not know when this will be set up.
Also, you could try pausing your process after every couple hundred iterations.
@Bastien has already noted a need to create an API key for NCBI programmatic queries.
I think this could be done faster using blastdbcmd and a local copy of the nt blast database, if you have that available.
I hope OP has a good fiber connection (around 60GB for nt)
I got blastdbcmd to work for one id at a time, like in the example on the web page:
blastdbcmd -db nt -entry all -outfmt "%g %T" | \
    awk ' { if ($2 == 9606) { print $1 } } ' | \
    blastdbcmd -db nt -entry_batch - -out human_sequences.txt
But I have a list of almost 5000 species, and putting the above in a loop seems unfeasible.
How about getting the corresponding fasta file for nt here and then retrieving the sequences you need from it?
genomax: good idea. I'm downloading it now. I imagine I'll use the descriptions in the fasta headers to search for my species names, because I think the taxids aren't in there. Will have to parse it somehow. If I have trouble I'll post again. Thanks.
It's working, except for the fact that in some places the file contains ^A characters, followed by what appears to be missing data. I would guess I should delete the whole nt file, and download it again, and see if that works.
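A possible way to do the header parsing in one pass with awk. As far as I know, the ^A (Ctrl-A) characters are probably not corruption: in the nt FASTA, identical sequences appear to be merged into one record, with their description lines joined by Ctrl-A, so a fresh download would likely look the same. The sketch below assumes each description looks like `accession Genus species ...` and that the species list is a plain file with one binomial per line. The function name and file names are only illustrative.

```shell
# Print FASTA records whose header mentions a species from the list.
# $1: species list, one "Genus species" per line (hypothetical file)
# $2: FASTA file, e.g. the decompressed nt dump
filter_fasta_by_species() {
    awk '
        # First file: remember every wanted species name.
        NR == FNR { if ($0 != "") wanted[$0] = 1; next }
        # Header line: split merged descriptions on Ctrl-A and test each one.
        /^>/ {
            keep = 0
            n = split(substr($0, 2), defline, "\001")
            for (i = 1; i <= n; i++) {
                # Assumed layout: "accession Genus species rest of title"
                m = split(defline[i], word, " ")
                if (m >= 3) {
                    name = word[2] " " word[3]
                    if (name in wanted) { keep = 1; break }
                }
            }
        }
        # Print header and sequence lines of records that matched.
        keep
    ' "$1" "$2"
}
```

Usage would be something like `filter_fasta_by_species species.txt nt.fasta > selected.fasta`. Names that do not sit in the second and third words of the description (subspecies, "cf.", hybrids) would be missed, so it is worth checking which species come back empty.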
Thanks for the information. blastdbcmd looks like an interesting tool. I'll either figure out a way to use it, or just modify my current way with Sys.sleep, etc. Internet connection not so bad--last time I downloaded nt, it only took about a day or two. ;)
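For the blastdbcmd route, the one-taxid awk filter from the web page example can be turned into a lookup table, so the database is dumped only once no matter how many species are on the list. A rough sketch, assuming the taxids of interest are already in a file with one taxid per line (they would have to be looked up from the species names first). It uses `%a` (accession) in place of `%g`. The function names and the output file name are made up for illustration, and an empty taxid file is not handled.

```shell
# Keep the IDs (column 1) whose taxid (column 2) is in the list file $1.
# Reads "id taxid" lines on stdin, as printed by blastdbcmd -outfmt "%a %T".
select_ids_by_taxid() {
    awk '
        NR == FNR { wanted[$1] = 1; next }   # first input: the taxid list
        ($2 in wanted) { print $1 }          # second input (stdin): id/taxid pairs
    ' "$1" -
}

# Dump id/taxid pairs once, filter them, and fetch all matches in one batch.
# Needs BLAST+ and a local nt database; the output file name is a placeholder.
fetch_species_from_nt() {
    blastdbcmd -db nt -entry all -outfmt "%a %T" |
        select_ids_by_taxid "$1" |
        blastdbcmd -db nt -entry_batch - -out species_sequences.fasta
}
```

It would be called as `fetch_species_from_nt taxids.txt`. The first blastdbcmd pass still walks the whole database, so it is slow, but it only happens once instead of once per species.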