I think I have the answer to my own question, but I'm somewhat new to bioinformatics, and I want to make sure my strategy is sound (and that there are no easier solutions to my problem).
We need to search against the vertebrate subset of the nr database. Hence I've been tasked with finding or building a vertebrate subset of the nr database. So far, I can't find one. So to build one, I'm planning on doing the following.
- Get the nr database (I believe I have the one from ncbi or ensembl).
- Get the taxonomy database from ncbi.
- Somehow traverse the taxonomy db, and extract the taxids of all children of the parent vertebrate node (looks like the parent vertebrate is taxid 7742 from names.dump from taxdump.tar.gz) (see ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt).
- Use the list of vertebrate taxids from the previous step to extract the GIs of proteins from gi_taxid_prot.dmp.gz (see ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid.readme).
Use the list of vertebrate GIs from the previous step with the blastdb_aliastool to build an aliased blastdb of just vertebrates. Something along the lines of this:
blastdb_aliastool -gilist vertebrate_gis.txt -db nr -out nr_vertebrates -title nr_vertebrates
Am I on the right track? Am I reinventing the wheel? Does the vertebrate gi_list or vertebrate taxid_list that I'm trying to build already exist elsewhere?
Thanks!
Update It appears that blastx via the ncbi web page allows you to specify a taxid via their "Organism" field. However, the command line version of blastx doesn't offer that functionality (at least my blastx doesn't).
In addition, the command line blastn tool does allow you to set -window_masker_taxid 7742, but there isn't an equivalent for command line blastx.
It's odd that the web version and the command line version are so different. Does the web version simply do a full nr search then return only those results that match the taxid argument? That would seem odd. But if it does use the actual blast database to filter out unwanted taxonomies, why isn't that functionality included with the command line version of the tool?
(We're wanting to run 100,000 - 1,000,000 blastx queries, so doing this over the web isn't practical.)
Update Thanks to neilfws for pointing out the gilist from entrez trick! That saved me a lot of time and allowed me to implement a solution immediately. This is what I went with:
- Download the prebuilt nr database (ncbi?).
- Search the Entrez Protein database with query: "txid7742[ORGN]"
- Select "Send to File" and choose format "GI list"
Use the list of vertebrate GIs from the previous step with the blastdb_aliastool to build an aliased blastdb of just vertebrates (takes several seconds):
blastdb_aliastool -gilist vertebrates.gi_list.txt -db nr -out nr_vertebrates -title nr_vertebrates
Search against your new (aliased) database:
blastx -query query.fa -db nr_vertebrates
This seems to be working for us. The queries return good hits, and (after loading into cache) the time to search has been reduced from 6.5 minutes (all of nr) to 1.0 minutes (just vertebrate nr). I'm not sure if it would be faster taking the more native approach suggested by neilfws (I haven't gone to the trouble to test that out), but I'm happy enough with my speedier blastx! Thanks again neilfws!
Hi, I tried blastdbaliastool -gilist vertebrates.gilist.txt -db nr -out nrvertebrates -title nrvertebrates
but I got an error "BLAST Database error: GI list specified but no ISAM file found for GI."
Any idea what this is? Thanks so much
Hi, Thanks for providing your updates, they helped me to create my database.
Quick comment/question: I noticed that the number of sequences in my newly created database was smaller than the original number of converted GIs.
Everything works fine but I was wondering if this was normal? For example, maybe there are multiple identifical GIs that are attributed to a specific sequences in the new database?
I found this blog post and the author showed similar results but didn't mention any issue (http://johnstantongeddes.org/aptranscriptome/2013/12/31/notes.html)
Any insights appreciated.
This has been confirmed by the NCBI help (through email) that this is normal and probably caused by redundant sequences matching to single GIs.