Question

What is the difference between UniProt and 'nr. Non-redundant GenBank...'?

1

4.1 years ago

matt ▴ 20

I would like to understand the difference between

'UniProt', e.g. UP000005640 URL: https://www.ebi.ac.uk/interpro/proteome/uniprot/UP000005640/

and

'nr. Non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF, excluding those in env_nr.' as used e.g. for BLAST-ing on NCBI website, URL: https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome

Any comments would be much appreciated. Btw, I am only interested in the 'human' subset of both.

Genebank blast UniProt • 5.0k views

ADD COMMENT • link updated 4.1 years ago by GenoMax 151k • written 4.1 years ago by matt ▴ 20

1

Take a look at this answer here.

But I'm wondering about this too now, given that this page on UniProt seems to suggest (to me) that NR and UniProt (i.e., TrEMBL and SwissProt) are more equivalent to one another now than they were years ago (that answer is over three quarters of a decade old). BTW, these are the kinds of proteins UniProt excludes.

ADD REPLY • link 4.1 years ago by Dunois ★ 2.9k

2

Since @matt is specifically looking at the proteome (LINK) on the UniProt page, it refers only to curated protein entries (20,380 reviewed + 56,647 unreviewed). If you simply look at the UniProtKB 2021_20 results for human, there are 20,395 reviewed and 175,716 unreviewed entries as of today.

The nr database is, as its name says, a non-redundant collection of sequences. It may contain partial sequences. As of today, the number of entries labeled as human (taxID 9606) is:

 $ blastdbcmd -db nr -taxids 9606 -outfmt %a | wc -l
2929407

ADD REPLY • link 4.1 years ago by GenoMax 151k

0

As a test, I searched locally with the amino-acid sequence of 'Immunoglobulin kappa constant', P01834 (https://www.uniprot.org/uniprot/P01834), against both 'UniProtKB' and 'nr'. With the former I get what one expects (left); with 'nr' I do not (right, the match with the highest score).

I am obviously missing something here, as I would expect 'nr' to store such basic sequences. Perhaps https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_protein.faa.gz is not the entire human subset of 'nr'?

Thanks for all the answers so far anyways.

ADD REPLY • link 4.1 years ago by matt ▴ 20

1

That's RefSeq you're looking at, I think. If you use the blast webserver and search against NR with the taxonomic scope set to Homo sapiens, you will get this as your best hit.

ADD REPLY • link 4.1 years ago by Dunois ★ 2.9k

0

OK, I could reproduce it using the blast webserver (https://blast.ncbi.nlm.nih.gov/Blast.cgi). How do I get it locally? The file GRCh38_latest_protein.faa does not seem to have it.

ADD REPLY • link 4.1 years ago by matt ▴ 20

1

I don't know if there's any other way that's easier than just downloading all of NR and restricting the search target to Homo sapiens. But I think that's a bit irrelevant: the record in question comes from UniProt, so the easier way would just be to search through a UniProt database (of H. sapiens sequences in this case).

I don't think this is helping you though. I have a feeling the problem you're trying to solve is something else, and this is just something you encountered along the way.

ADD REPLY • link 4.1 years ago by Dunois ★ 2.9k

0

The UniProt entry says that there is 'Experimental evidence at protein level', with the following caveat:

This indicates the type of evidence that supports the existence of the protein. Note that the 'protein existence' evidence does not give information on the accuracy or correctness of the sequence(s) displayed.

Further explanation is provided in the help:

The value 'Experimental evidence at protein level' indicates that there is clear experimental evidence for the existence of the protein. The criteria include partial or complete Edman sequencing, clear identification by mass spectrometry, X-ray or NMR structure, good quality protein-protein interaction or detection of the protein by antibodies.

ADD REPLY • link 4.1 years ago by GenoMax 151k

0

Thanks GenoMax, but it's not exactly what I asked. How can I get the same results locally as in the BLAST search you did to find https://www.ncbi.nlm.nih.gov/protein/P01834.2?

ADD REPLY • link 4.1 years ago by matt ▴ 20

1

As Dunois said below, you should be able to get the same or a similar result (you will have to test parameters for local blast) by downloading the nr database and then limiting your searches to human entries with the -taxids 9606 (human) option on your blastp command line. This will include protein entries from UniProt.

-taxids <String>
   Restrict search of database to include only the specified taxonomy IDs

That protein is there in the nr db.

$ blastdbcmd -db nr -taxids 9606 -outfmt %a | grep "P01834"
P01834.2

While the human proteome is reasonably complete, it is still evolving, so there is bound to be some discrepancy between databases. Not sure what it is that you want to do, but the UniProt proteome will likely be the best representation you have at the moment. RefSeq should be close behind, since those are also human-curated datasets.
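
A minimal sketch of such a local search, assuming a BLAST+ release that supports `-taxids` and a taxonomy-aware copy of nr that is already installed. `human_blastp`, `query.faa` and `hits.tsv` are hypothetical names, and the `BLASTP` / `BLAST_DB` variables exist only so the binary and database can be swapped:

```shell
# Run blastp against a local database, restricted to human (taxid 9606).
# Usage: human_blastp query.faa hits.tsv
human_blastp() {
    # BLASTP and BLAST_DB are optional overrides (assumed names, not BLAST+ settings).
    # -outfmt 6 writes tabular hits; -taxids limits the search to the given taxonomy ID.
    "${BLASTP:-blastp}" -db "${BLAST_DB:-nr}" -query "$1" \
        -taxids 9606 -outfmt 6 -out "$2"
}
```

Setting `BLASTP=echo` prints the arguments instead of running the search, which is a cheap way to check the command line before committing hours of compute.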

ADD REPLY • link 4.1 years ago by GenoMax 151k

0

I would like to be able to run 'blastp' with clients' proprietary sequences, which cannot be processed on a public website.

Until now I did it with UniProtKB, but I wanted to run it on the 'nr' database as well, which some of my team colleagues used initially. Regarding the 'nr' database, I got the following comment from NCBI User Services:

'The protein nr database is NOT organized by taxonomic breakdown. In other word, human sequences could be (and more likely are) present in every volumes.'

That means I would have to download all 47 files (2-3 GB each), which I wanted to avoid. I started to work with 'https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_protein.faa.gz', but that didn't pass the P01834 test.

I am in the process of downloading all 47 nr files, which will take hours to days and eat up 80% of my free hard drive space.
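
The 'P01834 test' can be sketched roughly as follows, assuming a gzipped protein FASTA whose header lines carry the accession. `count_accession` is a hypothetical helper name, and the match is a plain substring match on header lines only:

```shell
# Count FASTA header lines that mention a given accession in a gzipped file.
# Usage: count_accession GRCh38_latest_protein.faa.gz P01834
count_accession() {
    # Decompress to stdout, keep header lines, then count fixed-string matches.
    gzip -dc < "$1" | grep '^>' | grep -c -F -- "$2"
}
```

A count of 0 means no header in that file mentions the accession, so a search against it cannot return that record.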

ADD REPLY • link 4.1 years ago by matt ▴ 20

1

clients' proprietary sequences, which cannot be processed on a public website.

No option but to download nr indexes and do the search locally then. Be sure to download the taxid files as well from where you are getting the nr files.

ADD REPLY • link 4.1 years ago by GenoMax 151k

0

After all, I realised I don't have that much space on my hard drive: after unpacking, these 47 files would need at least about 750GB. However, all your comments have convinced me that doing it is not of much use, given that we have UniProtKB. Thank you all for the comments!

ADD REPLY • link 4.1 years ago by matt ▴ 20

0

NR is 750GB? My local copy clocks in at 135GB uncompressed, and it's less than a year old. The diamond database is about the same size. Did you get the NR FASTA off of here: https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ ?

ADD REPLY • link 4.1 years ago by Dunois ★ 2.9k

1

Ah, I was downloading the 47 nr files from https://ftp.ncbi.nlm.nih.gov/blast/db

I didn't know there was a FASTA subfolder... Thanks, I might try after all!

ADD REPLY • link 4.1 years ago by matt ▴ 20

0

I didn't know that for a long time either, it's really cool that NCBI offers all that so accessibly.

Good luck, and let us know how it went!!

Edit: just want to mention, I think it is possible to restrict a search with diamond to a specific taxon just like in blast. Take a look at their help documentation and probably also these GitHub issues.

ADD REPLY • link 4.1 years ago by Dunois ★ 2.9k

0

There is no point in downloading the fasta files since you will need to make the index yourself.

Using diamond or blast locally will require tens of GB of RAM (or it will be very slow if swap disks come into play). There are no simple solutions here. If you don't have the necessary hardware available locally, consider using a cloud environment.

ADD REPLY • link 4.1 years ago by GenoMax 151k

1

I'd add to GenoMax's point that if you're going to run a search against NR on a local machine, it's probably unwise at this point to use blastp. Diamond or MMseqs2 under maximum sensitivity would be way, way faster at some loss of sensitivity (the latest version of Diamond is as sensitive as blast is but is ~80x faster).

ADD REPLY • link 4.1 years ago by Dunois ★ 2.9k

0

Thanks Dunois, very useful links. On this forum I was redirected to https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_protein.faa.gz for the human part of the 'nr. Non-redundant GenBank...' database. It is twice as big as the UniProt one and I wonder why.

ADD REPLY • link 4.1 years ago by matt ▴ 20

0

RefSeq entries contain isoforms.

>NP_001372292.1 tudor domain-containing protein 1 isoform 1 [Homo sapiens]
>NP_001372293.1 tudor domain-containing protein 1 isoform 3 [Homo sapiens]
>NP_001372294.1 tudor domain-containing protein 1 isoform 4 [Homo sapiens]
>NP_001372295.1 tudor domain-containing protein 1 isoform 5 [Homo sapiens]
>NP_001372296.1 tudor domain-containing protein 1 isoform 6 [Homo sapiens]
>NP_001372297.1 tudor domain-containing protein 1 isoform 7 [Homo sapiens]
>NP_001372298.1 tudor domain-containing protein 1 isoform 7 [Homo sapiens]
>NP_001372299.1 tudor domain-containing protein 1 isoform 8 [Homo sapiens]
>NP_001372300.1 tudor domain-containing protein 1 isoform 9 [Homo sapiens]
>NP_001372301.1 tudor domain-containing protein 1 isoform 10 [Homo sapiens]
>NP_001372302.1 neuroblastoma breakpoint family member 15 isoform 1 [Homo sapiens]
>NP_001372303.1 neuroblastoma breakpoint family member 15 isoform 1 [Homo sapiens]
>NP_001372304.1 neuroblastoma breakpoint family member 15 isoform 1 [Homo sapiens]
>NP_001372305.1 neuroblastoma breakpoint family member 15 isoform 1 [Homo sapiens]
>NP_001372306.1 neuroblastoma breakpoint family member 15 isoform 1 [Homo sapiens]
>NP_001372307.1 neuroblastoma breakpoint family member 15 isoform 1 [Homo sapiens]
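
To see how much isoforms inflate the record count, one can tally FASTA headers per protein name. A rough sketch, assuming RefSeq-style headers as in the list above (accession, name, optional `isoform N`, organism in square brackets); `count_isoform_records` is a hypothetical helper name:

```shell
# Read a protein FASTA on stdin and print "count<TAB>protein name",
# where records that differ only in their isoform label are counted together.
count_isoform_records() {
    grep '^>' |
    # Drop the accession, the trailing [organism] tag, and any " isoform ..." suffix.
    sed -E 's/^>[^ ]+ //; s/ \[[^]]*\]$//; s/ isoform .*$//' |
    sort | uniq -c |
    # Turn the "   N name" output of uniq -c into "N<TAB>name".
    awk '{ n = $1; $1 = ""; sub(/^ /, ""); print n "\t" $0 }'
}
```

It could be used as `gzip -dc < GRCh38_latest_protein.faa.gz | count_isoform_records`. For the headers listed above it would report 10 records for the tudor domain-containing protein and 6 for the neuroblastoma breakpoint family member.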

ADD REPLY • link 4.1 years ago by GenoMax 151k