I have been using a local copy of the BLAST nr database to run BLASTX for protein identification, but I would like to set up a local version of the clustered nr database to speed up processing of our samples.
However, NCBI does not offer the clustered nr database for download, so I wanted to ask whether anyone knows the best way to set up a clustered version of nr locally that gives results similar to those from NCBI's online ClusteredNR.
NCBI clusters nr with MMseqs2, as noted here and copied below.
Be ready to have gobs of hardware available to do this.
We generate ClusteredNR from the standard protein nr database with
MMseqs2 so each cluster contains proteins that are more than 90%
identical to each other and within 90% of the length of the longest
member. We select a single well-annotated protein that indicates the
function of the proteins in the cluster as the lead or representative
protein. The title of the representative protein is the title that
shows in the BLAST results. Each cluster may contain sequences for
multiple organisms (species). On the BLAST results, clusters are
identified by the name of the organism for the title protein as well
as the most recent common ancestor taxon for all organisms in the
cluster. This makes it clear when the cluster includes multiple
species.
The last part, about how the results are displayed, likely requires code magic on the BLAST server that you will not be able to replicate locally.
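A rough sketch of how the clustering step could be approximated locally. NCBI has not published its exact MMseqs2 parameters, so everything here is an assumption: `--min-seq-id 0.9 -c 0.9 --cov-mode 0` is only one reading of the "90% identical, within 90% of the length" description, and the file names (`nr.fasta`, the `nr90` prefix, `tmp`) are placeholders. The nr FASTA itself could be dumped from an existing local database with `blastdbcmd -db nr -entry all -out nr.fasta`. The script only prints the commands unless `dry_run=False` is passed.

```python
import shlex
import subprocess


def build_commands(nr_fasta, prefix="nr90", tmp_dir="tmp", min_id=0.9, cov=0.9):
    """Build two command lines: cluster with MMseqs2, then index the representatives for BLAST.

    The thresholds are an assumed approximation of NCBI's 90/90 clustering,
    not NCBI's actual settings.
    """
    cluster = [
        "mmseqs", "easy-cluster", nr_fasta, prefix, tmp_dir,
        "--min-seq-id", str(min_id),  # minimum sequence identity
        "-c", str(cov),               # minimum alignment coverage
        "--cov-mode", "0",            # coverage required on both sequences
    ]
    # easy-cluster writes the representatives to <prefix>_rep_seq.fasta
    make_db = [
        "makeblastdb", "-in", f"{prefix}_rep_seq.fasta",
        "-dbtype", "prot", "-out", prefix,
    ]
    return [cluster, make_db]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(build_commands("nr.fasta"))
```

At the scale of nr, `mmseqs easy-linclust` (same arguments) is probably the more realistic choice, at the cost of somewhat less sensitive clustering. Either way the representatives will be whatever MMseqs2 picks, not the well-annotated title proteins NCBI selects.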
PeterC_NCBI Is there any update on whether and when the clustered database will become available for local download?
GenoMax Sorry for the delay in reply. We're still working out the details of what exactly to provide. One idea is to provide just the representative sequences as a first pass. This could be done fairly soon. But without access to the cluster contents, those probably won't be terribly useful. Providing access to the cluster contents, whether within the database or through an accessory database or script, would take much longer, probably not before late this year at the earliest. Happy to hear any of your thoughts about what would be useful.
One idea is to provide just the representative sequences as a first
pass.
This seems reasonable, since it should capture a significant part of the sequence space to search against (90% identity and 90% of length within a cluster). Can you give us an idea of what kind of % reduction this results in compared to the full nr db?
But without access to the cluster contents,
Do you mean without access to all sequence headers (or actual sequences) that are in a particular cluster?
This is a far-fetched idea, but perhaps a new kind of accession designation (like WP*) could be created for clustered sequences, so that the cluster header stays concise and people can retrieve the full cluster contents by independent lookups.
This seems reasonable, since it should capture a significant part of the sequence space to search against (90% identity and 90% of length within a cluster). Can you give us an idea of what kind of % reduction this results in compared to the full nr db?
Here are the counts for the 90/90 ClusteredNR and nr:
clustered nr: 260,861,887 sequences; 86,666,979,341 total residues
database nr: 546,502,144 sequences; 216,836,000,941 total residues
So ClusteredNR is about 40% of the size of nr by residue count (about 48% by sequence count).
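For anyone who wants to check the arithmetic, a short calculation using only the counts quoted above (the variable and function names are just for illustration):

```python
# Counts as posted above for the 90/90 ClusteredNR and the full nr
clustered = {"sequences": 260_861_887, "residues": 86_666_979_341}
full_nr = {"sequences": 546_502_144, "residues": 216_836_000_941}


def percent_of(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


for key in ("sequences", "residues"):
    pct = percent_of(clustered[key], full_nr[key])
    print(f"{key}: ClusteredNR is {pct:.1f}% of nr")
```

This prints 47.7% for sequences and 40.0% for residues, so the "about 40%" figure refers to total residues.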
Do you mean without access to all sequence headers (or actual sequences) that are in a particular cluster?
Yes, as a first pass anyway. The more difficult problem is coming up with the right way to give you access to the cluster contents (the members of each cluster), either as part of the standalone database or through some service.
This is a far-fetched idea, but perhaps a new kind of accession designation (like WP*) could be created for clustered sequences, so that the cluster header stays concise and people can retrieve the full cluster contents by independent lookups.
Yes, something like that would be helpful. We probably want to avoid creating yet another Entrez database for this, however.
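Until something official exists, cluster membership for a self-built clustering can be looked up from a plain table. A minimal sketch, assuming a two-column tab-separated file of representative and member accessions (the layout MMseqs2 writes to its `_cluster.tsv` output); the accessions in the example are made up for illustration, and this says nothing about how NCBI stores its clusters.

```python
import csv
import io
from collections import defaultdict


def load_clusters(handle):
    """Map each representative accession to the list of its cluster members.

    Expects tab-separated rows of the form: representative<TAB>member.
    """
    clusters = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        representative, member = row[0], row[1]
        clusters[representative].append(member)
    return dict(clusters)


# Made-up accessions, standing in for a real cluster table
example = io.StringIO(
    "WP_000001.1\tWP_000001.1\n"
    "WP_000001.1\tXP_000002.1\n"
    "NP_000003.1\tNP_000003.1\n"
)
clusters = load_clusters(example)
print(clusters["WP_000001.1"])
```

With a real file, pass `open("nr90_cluster.tsv")` instead of the `StringIO` object (the file name is a placeholder). For a table with hundreds of millions of rows an in-memory dict will likely not fit, so an on-disk index such as SQLite would be the more practical choice.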
Thanks for those stats. So using clustered nr will still require significant hardware, though it will save time as well.