I have downloaded the entire NR database from NCBI and built one giant DIAMOND database that I query. I'm wondering whether it would be more computationally efficient to break NR into roughly 100 smaller databases and query them individually.
Would this help with the resource requirements and compute time?
Keep in mind the potential effect on e-values of splitting a database into chunks and then combining the results, discussed here: Blast E-Value To Database Size. While that post focuses on NCBI BLAST, I assume the same holds for DIAMOND.
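If you do go the chunked route, one workaround (a sketch, not tested; it assumes DIAMOND's --dbsize option behaves like BLAST's -dbsize, and the chunk names and query file are placeholders) is to pin the effective database size to that of the full NR, so per-chunk e-values stay comparable:

```
# Residue count of the full NR database (e.g. as reported by
# `diamond dbinfo`); the number below is only a placeholder.
FULL_DB_LETTERS=100000000000

# Search each chunk with the same effective database size, so e-values
# are computed as if the full database had been searched.
for chunk in nr_chunk_*.dmnd; do
    diamond blastp \
        --db "$chunk" \
        --query query.faa \
        --dbsize "$FULL_DB_LETTERS" \
        --out "${chunk%.dmnd}.tsv"
done

# Concatenated hits can then be ranked by e-value directly.
cat nr_chunk_*.tsv > combined.tsv
```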
That is no longer needed. Recent DIAMOND versions can now use pre-formatted NCBI databases. NCBI now offers a clustered nr database for web searches, though it is not yet downloadable for local use.

This will save me a lot of time and compute resources! As long as it contains the taxonomy info, I should be good to go. Any word on when it will be available to the public?
The normal nr database contains taxonomy info; I don't know if the clustered DB will include it. The clustered DB is available now via the web interface, so you can try a sequence to confirm.

In terms of using pre-formatted NCBI databases, would we just download all of the nr db files individually:

https://ftp.ncbi.nlm.nih.gov/blast/db/nr.00.tar.gz
https://ftp.ncbi.nlm.nih.gov/blast/db/nr.01.tar.gz
...
https://ftp.ncbi.nlm.nih.gov/blast/db/nr.66.tar.gz

and then give diamond the prefix? For example:
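(A guess at what that would look like; the loop bound matches the current volume count and query.faa is just a placeholder query file.)

```
# Fetch and unpack every pre-formatted nr volume.
for i in $(seq -w 0 66); do
    wget "https://ftp.ncbi.nlm.nih.gov/blast/db/nr.${i}.tar.gz"
    tar -xzf "nr.${i}.tar.gz"
done

# Then point diamond at the shared "nr" prefix.
diamond blastp --db nr --query query.faa --out hits.tsv
```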
Would it be that type of usage?
I also see the prepdb command, but I'm not sure whether it has to be run on each component of nr: https://github.com/bbuchfink/diamond/wiki
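From the wiki it looks like prepdb is run once on the shared database prefix rather than per volume; something like this (untested, assuming all volumes are already extracted into the working directory):

```
# Run once against the shared prefix covering all extracted volumes.
diamond prepdb --db nr

# After that, searches can use the BLAST-formatted database directly.
diamond blastp --db nr --query query.faa --out hits.tsv
```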
Is there a way to run DIAMOND against the online NR database without downloading it to the local computer?

No, there is not.
I think it depends on the speed of your local disks and the amount of memory. On a single node, breaking up the database doesn't sound like a good idea, if it is feasible at all, since you would likely run into I/O problems. If you have access to a cluster with fast disks and can run the searches on independent nodes without contending for memory and disk I/O, I suspect there could be some speed-up. Even then, I would break the database into 5-10 parts rather than 100; that would be more productive and could avoid the I/O bottleneck. A rough sketch of that pattern is below.
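(A sketch of the cluster pattern, assuming a FASTA dump of NR named nr.faa, a SLURM scheduler, seqkit for the splitting, and the --dbsize correction mentioned earlier in the thread; every file name here is a placeholder.)

```
# Split the FASTA into 10 roughly equal parts and build one DIAMOND
# database per part (seqkit split2 is one option; any splitter works).
seqkit split2 -p 10 nr.faa -O chunks/
for f in chunks/*.faa; do
    diamond makedb --in "$f" --db "${f%.faa}"
done

# One search per chunk on its own node; --dbsize keeps e-values scaled
# to the full database rather than to each chunk.
FULL_DB_LETTERS=100000000000   # placeholder; use the real residue count
for db in chunks/*.dmnd; do
    sbatch --wrap "diamond blastp --db $db --query query.faa \
        --dbsize $FULL_DB_LETTERS --out ${db%.dmnd}.tsv"
done
```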