Most efficient way to run Diamond against a very very large database (i.e., NCBI's NR)?
21 months ago
O.rka ▴ 740

I have downloaded the entire NR from NCBI and then I create a giant diamond database that I query. I'm wondering if it would be more efficient computationally if I break NR into about 100 smaller databases that I query individually.

Would this help with the resource requirements and compute time?

protein annotation alignment diamond nr • 3.0k views

Keep in mind the potential effect on e-values brought about by splitting a database into chunks then combining the results, discussed here: Blast E-Value To Database Size. While that's focused on NCBI Blast, I assume the same is true for Diamond.
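As a rough illustration of that effect: at a fixed bit score, the E-value is approximately proportional to the size of the search space, so a hit found in a chunk holding 1/100 of the database reports an E-value about 100 times smaller than the same hit against the full database. The sketch below rescales column 11 (the E-value in the default 12-column tabular output) by a chosen factor. The function name `rescale_evalues` and the file names are hypothetical, and plain linear scaling ignores effective-length corrections, so treat the result as an approximation.

```shell
# Multiply the E-value column (field 11 of BLAST-style tabular output) by a factor.
# Assumption: E-values scale linearly with database size; FACTOR = full size / chunk size.
# Usage: rescale_evalues FACTOR < chunk_hits.tsv > rescaled_hits.tsv
rescale_evalues() {
    awk -v factor="$1" 'BEGIN { FS = OFS = "\t" }
        { $11 = sprintf("%.2e", $11 * factor); print }'
}
```

DIAMOND's help also lists a `--dbsize` option for setting the effective database size in letters, which may make this kind of post hoc rescaling unnecessary; check the documentation for your version.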


"then I create a giant diamond database that I query."

That is no longer needed. Recent DIAMOND versions can now use pre-formatted NCBI databases.

NCBI now offers a clustered nr database for web searches, though it is not yet downloadable for local use.


This will save me a lot of time and compute resources! As long as it contains the taxonomy info I should be good to go. Any word on when it will be available to the public?


The normal nr database contains taxonomy info. I don't know if the clustered DB will include that info. It is available now via the web interface, so you can try a sequence to confirm.


In terms of using pre-formatted NCBI databases, would we just download all the different NR db files individually:

https://ftp.ncbi.nlm.nih.gov/blast/db/nr.00.tar.gz
https://ftp.ncbi.nlm.nih.gov/blast/db/nr.01.tar.gz
...
https://ftp.ncbi.nlm.nih.gov/blast/db/nr.66.tar.gz

and then give diamond the prefix? For example,

mkdir -p ncbi_nr/
wget -P ncbi_nr/ https://ftp.ncbi.nlm.nih.gov/blast/db/nr.00.tar.gz
wget -P ncbi_nr/ https://ftp.ncbi.nlm.nih.gov/blast/db/nr.01.tar.gz
...
wget -P ncbi_nr/ https://ftp.ncbi.nlm.nih.gov/blast/db/nr.66.tar.gz

# Decompress the archives
for f in ncbi_nr/nr.*.tar.gz; do tar -xzf "$f" -C ncbi_nr/; done

# Search against the database prefix
diamond blastp -d ncbi_nr/nr -q queries.fasta -o matches.tsv

Would it be that type of usage?

I also see the prepdb command but I'm not sure if that has to be run on each component of nr. https://github.com/bbuchfink/diamond/wiki
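For what it's worth, here is a minimal sketch of that workflow. As I read the DIAMOND wiki, `prepdb` is run once on the database prefix rather than on each volume, but please confirm that against the documentation for your version. The function names `nr_volume_urls` and `fetch_nr` are hypothetical, and the volume range 00 to 66 is taken from the listing above; it changes as nr grows.

```shell
# Sketch: fetch the pre-formatted nr volumes and unpack them into one directory.
# Assumption: volumes are numbered nr.00 .. nr.LAST, zero-padded to two digits.
BASE_URL="https://ftp.ncbi.nlm.nih.gov/blast/db"

# Print one download URL per volume, from 00 up to the given last index.
nr_volume_urls() {
    last="$1"
    i=0
    while [ "$i" -le "$last" ]; do
        printf '%s/nr.%02d.tar.gz\n' "$BASE_URL" "$i"
        i=$((i + 1))
    done
}

# Download every volume into DIR and extract it there.
fetch_nr() {
    dir="$1"
    mkdir -p "$dir"
    nr_volume_urls "$2" | while read -r url; do
        wget -P "$dir" "$url"
        tar -xzf "$dir/$(basename "$url")" -C "$dir"
    done
}

# Typical use (not run here; needs network access and DIAMOND installed):
#   fetch_nr ncbi_nr 66
#   diamond prepdb -d ncbi_nr/nr     # once, on the prefix, not per volume
#   diamond blastp -d ncbi_nr/nr -q queries.fasta -o matches.tsv
```

BLAST+ also ships `update_blastdb.pl`, which can download and decompress all volumes of a named database, so the download loop is only needed if that script is not available.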


Is there a way to run Diamond against the online NR database without downloading it to the local computer?


No, there is not.


I think it depends on the speed of your local disks and the amount of memory. On a single node, breaking up the database doesn't sound like a good idea, and may not even be feasible, since you would likely run into I/O problems. If you have access to a cluster with fast disks and can run these processes on independent nodes without worrying about memory and disk I/O, I suspect there could be some speed-up. Even then, I would think that breaking it into 5-10 parts, rather than 100, would be more productive and could avoid the I/O bottleneck.
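On the memory side, DIAMOND already processes the database in blocks, so the footprint can be tuned with `--block-size` (`-b`, in billions of sequence letters) instead of physically splitting the database. The sketch below turns a memory budget into a block-size suggestion using the rule of thumb from the DIAMOND documentation that memory use is roughly six times the block size in GB. The function name `suggest_block_size` is hypothetical and the factor is only approximate, so leave some headroom.

```shell
# Suggest a DIAMOND --block-size value from a memory budget in GB.
# Assumption: peak memory is roughly 6x the block size (documented rule of thumb).
# Usage: suggest_block_size MEM_GB
suggest_block_size() {
    awk -v mem_gb="$1" 'BEGIN { printf "%.1f\n", mem_gb / 6 }'
}

# Example (not run here; needs DIAMOND and a prepared database):
#   diamond blastp -d ncbi_nr/nr -q queries.fasta -o matches.tsv \
#       --block-size "$(suggest_block_size 60)"
```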

21 months ago
Asaf 10k

The dataset used by diamond is a table of k-mers and, for each k-mer, a list of the sequences it appears in. Assuming k-mers in the nr database are not very unique, you can expect a big overlap between any two 1/100th slices of nr, so each resulting dataset will not be 1/100 the size of the complete dataset but much bigger.

In addition, it's a good idea to run all your queries together, as the queries are also indexed and search time does not scale linearly with the size of the query set.
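A minimal sketch of batching queries, assuming one protein FASTA file per sample: merge them into a single query file and prefix each sequence ID with the file name, so hits can still be traced back to their sample after one combined search. The function name `merge_queries`, the `|` separator, and the file names are all hypothetical.

```shell
# Merge several FASTA files into one stream, prefixing each sequence ID
# with the source file's base name (extension removed).
# Usage: merge_queries sampleA.faa sampleB.faa > all_queries.faa
merge_queries() {
    for f in "$@"; do
        tag=$(basename "$f")
        tag=${tag%.*}
        # Rewrite header lines only; sequence lines pass through unchanged.
        awk -v tag="$tag" '/^>/ { print ">" tag "|" substr($0, 2); next } { print }' "$f"
    done
}

# Example (not run here):
#   merge_queries sampleA.faa sampleB.faa > all_queries.faa
#   diamond blastp -d ncbi_nr/nr -q all_queries.faa -o matches.tsv
```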
