Question

Number of Entries in Non Redundant DB

4.6 years ago

6schulte ▴ 30

Hello,

Does anyone know whether there is a website that reports how many sequences (as in entries) are currently in NCBI's non-redundant database (nr.gz from NCBI)?

I know I can run a shell command over the downloaded and unpacked database and count the entries myself, but with about 2,000,000,000 lines that will take a very long time. Creating an index with esl-sfetch would also tell me how many entries are in nr.fa, but the index creation is taking very long as well (SSI index written to file nr.fa.ssi):

esl-sfetch --index nr.fa

So yes, I am looking for an estimate of the number of entries in nr. Thanks for your help :)

nr ncbi • 1.3k views

score 3 · Accepted Answer · 2020-12-30

Grepping (or zgrepping, for the compressed file) for ^> in a FASTA file gives you the number of unique sequences. Because a single sequence record can stand for multiple entries, this does not give you the exact number of accessions. If you need that number, you will have to parse the FASTA headers (an example header is below).

>MBD3193859.1 hypothetical protein [Candidatus Lokiarchaeota archaeon]MBD3198741.1 hypothetical protein [Candidatus Lokiarchaeota archaeon]

$ zgrep "^>" nr.gz | wc -l
338057725
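The header parsing mentioned above can be sketched as follows. This is a minimal sketch, not a tested recipe for the full nr: it assumes that merged entries in an nr.gz header are joined by the Ctrl-A byte (`\x01`), which is invisible in the pasted example header, and the function name `count_nr_entries` is a hypothetical helper introduced here for illustration.

```shell
# Hypothetical helper: count sequences and accessions in a single pass
# over a gzipped nr-style FASTA file given as the first argument.
# Assumption: merged header entries are separated by the Ctrl-A byte (\001).
count_nr_entries() {
    gzip -dc < "$1" | awk -F '\001' '
        # Each header line is one unique sequence; every Ctrl-A separated
        # field on that line is one entry (accession) for that sequence.
        /^>/ { seqs++; accs += NF }
        END  { printf "sequences: %d\naccessions: %d\n", seqs, accs }
    '
}
```

Calling `count_nr_entries nr.gz` would decompress the file once and report both numbers together, so the file does not have to be read twice. It still has to stream the whole file, so it will not be faster than the zgrep command.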

If you have the nr BLAST indexes available, then the following is another option (count as of this week).

$ blastdbcmd -db nr -entry all -outfmt %a | wc -l
593806742

Looking at the results above it would appear that there are 338057725 unique sequences representing a total of 593806742 accessions.

Edit: @Mensur's method is simple to follow and can get you an updated number of unique sequences (but not accessions) for each day.

Note: The number of entries in nr likely changes each day as the indexes are regenerated.

score 3 · Accepted Answer · 2020-12-30

4.6 years ago

Mensur Dlakic ★ 29k

Run a protein BLAST search; the results page has a pull-down menu next to nr where you can show the database details. As of yesterday, nr had 338057725 sequences.
