Question

Creating a local version of Clustered NR database

0

19 months ago

Wilford203 ▴ 10

I have been using a local copy of the BLAST nr database to run BLASTX for protein identification, and I would like to set up a local version of the clustered NR database to speed up processing of our samples.

However, NCBI does not offer the clustered NR database for local use. Does anyone know the best way to build a clustered version of nr locally that gives results similar to those from NCBI's online ClusteredNR?

diamond clustered-nr blast • 1.6k views

ADD COMMENT • link updated 18 months ago by GenoMax 148k • written 19 months ago by Wilford203 ▴ 10

score 1 · Answer 1 · 2023-06-02

1

19 months ago

GenoMax 148k

NCBI clusters nr with MMseqs2, as noted here and quoted below.

Be ready to have gobs of hardware available to do this.

We generate ClusteredNR from the standard protein nr database with MMseqs2 so each cluster contains proteins that are more than 90% identical to each other and within 90% of the length of the longest member. We select a single well-annotated protein that indicates the function of the proteins in the cluster as the lead or representative protein. The title of the representative protein is the title that shows in the BLAST results. Each cluster may contain sequences for multiple organisms (species). On the BLAST results, clusters are identified by the name of the organism for the title protein as well as the most recent common ancestor taxon for all organisms in the cluster. This makes it clear when the cluster includes multiple species.
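As a rough starting point, the 90% identity / 90% length criteria quoted above could be approximated with `mmseqs easy-cluster`. The sketch below only assembles the command line. The file names (`nr.fasta`, `nr90`, `tmp`) are placeholders, and the choice of `--cov-mode 0` (bidirectional coverage) is an assumption, since the quoted text does not state the exact parameters NCBI uses.

```python
def build_mmseqs_command(fasta, out_prefix, tmp_dir,
                         min_seq_id=0.9, coverage=0.9, cov_mode=0):
    """Return an `mmseqs easy-cluster` command line as a list of arguments."""
    return [
        "mmseqs", "easy-cluster",
        fasta, out_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),  # minimum sequence identity (0.9 = 90%)
        "-c", str(coverage),              # minimum alignment coverage (0.9 = 90%)
        "--cov-mode", str(cov_mode),      # 0 = coverage of both sequences (assumed)
    ]


if __name__ == "__main__":
    # Placeholder file names; nr.fasta would be the nr protein FASTA.
    cmd = build_mmseqs_command("nr.fasta", "nr90", "tmp")
    # Print the command so it can be reviewed before being run,
    # e.g. with subprocess.run(cmd, check=True) on a suitably large machine.
    print(" ".join(cmd))
```

With the `nr90` prefix, MMseqs2 would write `nr90_rep_seq.fasta` (representatives), `nr90_all_seqs.fasta` and `nr90_cluster.tsv` (cluster membership). For a database of this size, `mmseqs easy-linclust` may be worth trying as a faster alternative. Note that MMseqs2 picks representatives by its own criteria, not by annotation quality as NCBI describes.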

The last part, about how the results are displayed, likely relies on code on the BLAST server that you will not be able to replicate locally.
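The "most recent common ancestor" part of that description can at least be illustrated. The sketch below assumes you already have a root-to-species lineage for each organism in a cluster (for example from the NCBI taxonomy dump) and takes the deepest taxon they share. The lineages are simplified for illustration, and this is not NCBI's code.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages (each ordered root -> species)."""
    if not lineages:
        return None
    common = None
    for ranks in zip(*lineages):   # walk all lineages level by level from the root
        if len(set(ranks)) != 1:   # stop at the first level where they disagree
            break
        common = ranks[0]
    return common


if __name__ == "__main__":
    # Simplified lineages, for illustration only.
    e_coli = ["Bacteria", "Pseudomonadota", "Gammaproteobacteria",
              "Enterobacterales", "Enterobacteriaceae",
              "Escherichia", "Escherichia coli"]
    s_enterica = ["Bacteria", "Pseudomonadota", "Gammaproteobacteria",
                  "Enterobacterales", "Enterobacteriaceae",
                  "Salmonella", "Salmonella enterica"]
    print(lowest_common_ancestor([e_coli, s_enterica]))  # Enterobacteriaceae
```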

PeterC_NCBI Is there any update on if and when the clustered database will become available for local downloads?

ADD COMMENT • link 19 months ago by GenoMax 148k

0

GenoMax Sorry for the delay in replying. We're still working out the details of what exactly to provide. One idea is to provide just the representative sequences as a first pass. This could be done fairly soon. But without access to the cluster contents, those probably won't be terribly useful. Providing access to the cluster contents, whether within the database or through an accessory database or script, would take much longer, probably not before late this year at the earliest. Happy to hear any of your thoughts about what would be useful.

ADD REPLY • link 18 months ago by PeterC_NCBI ▴ 520

0

One idea is to provide just the representative sequences as a first pass.

This seems reasonable, since it should capture a significant part of the sequence space to search against (90% identity and 90% of length within a cluster). Can you give us an idea of what kind of % reduction this gives compared to the full nr db?

But without access to the cluster contents,

Do you mean without access to all sequence headers (or actual sequences) that are in a particular cluster?

This is a far-fetched idea, but perhaps a new kind of accession (like WP*) could be designated for clustered sequences, so that the cluster header stays concise and people can retrieve the full cluster contents by independent lookups.

ADD REPLY • link 18 months ago by GenoMax 148k

0

Thanks GenoMax

This seems reasonable, since it should capture a significant part of the sequence space to search against (90% identity and 90% of length within a cluster). Can you give us an idea of what kind of % reduction this gives compared to the full nr db?

Here are the counts for the 90/90 ClusteredNR and nr:

ClusteredNR: 260,861,887 sequences; 86,666,979,341 total residues
nr: 546,502,144 sequences; 216,836,000,941 total residues

So ClusteredNR is about 40% of the size of nr by total residues (about 48% by sequence count).
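The ratios can be checked directly from the counts quoted above. This is just the arithmetic, using only the numbers in this post.

```python
# Counts as quoted in this thread.
clustered_seqs, clustered_residues = 260_861_887, 86_666_979_341
nr_seqs, nr_residues = 546_502_144, 216_836_000_941

# Fraction of nr that remains after clustering, by each measure.
seq_fraction = clustered_seqs / nr_seqs
residue_fraction = clustered_residues / nr_residues

print(f"sequences: {seq_fraction:.1%} of nr")
print(f"residues:  {residue_fraction:.1%} of nr")
```

By residues, which is what mostly drives search time and database size, ClusteredNR comes out at roughly 40% of nr. By sequence count it is closer to 48%.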

Do you mean without access to all sequence headers (or actual sequences) that are in a particular cluster?

Yes, as a first pass anyway. The more difficult problem is finding the right way to give you access to the cluster contents (the members of each cluster), either as part of the standalone database or through some service.
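For a locally built clustering, one way to look up cluster contents is the two-column TSV that MMseqs2 writes (representative, then member, tab-separated). The sketch below assumes that format and uses made-up accessions. It says nothing about what NCBI will eventually provide.

```python
import csv
import io
from collections import defaultdict


def load_clusters(handle):
    """Map each representative ID to the list of member IDs in its cluster."""
    clusters = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:          # skip blank or malformed lines
            continue
        representative, member = row[0], row[1]
        clusters[representative].append(member)
    return dict(clusters)


if __name__ == "__main__":
    # Hypothetical accessions standing in for a real *_cluster.tsv file.
    demo = io.StringIO("REP_A\tREP_A\nREP_A\tMEMBER_1\nREP_B\tREP_B\n")
    clusters = load_clusters(demo)
    print(clusters["REP_A"])
```

For the full nr, a plain dict like this may need a lot of memory. An on-disk index (for example SQLite) would probably be a better fit at that scale.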

This is a far-fetched idea, but perhaps a new kind of accession (like WP*) could be designated for clustered sequences, so that the cluster header stays concise and people can retrieve the full cluster contents by independent lookups.

Yes, something like that would be helpful. We probably want to avoid creating yet another Entrez database for this, however.

ADD REPLY • link 18 months ago by PeterC_NCBI ▴ 520

1

Thanks for those stats. So using clustered nr will still require significant hardware, though it should save time as well.

ADD REPLY • link 18 months ago by GenoMax 148k