Question

NCBI non-redundant dataset (nr) in protein BLAST to look for homologs?

1


12.6 years ago

miquelduranfrigola ▴ 790

Hi all,

this must be very basic, but still. I have a protein sequence for which I want to find homologs. I go to BLAST and do, for simplicity here, a regular BLASTp.

I know that blasting against refseq_protein or swissprot is common practice, but how about nr (non-redundant protein sequences)? This includes "All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects", and as far as I've seen, it includes not only hypothetical proteins, but also different instances of the same protein (e.g. different combinations of PDB chains, etc.)

Would you guys consider a BLAST search against nr a proper "finding-homologs" exercise?

Thanks!

Miquel

blast • 16k views

updated 9.5 years ago by David Managadze ▴ 50 • written 12.6 years ago by miquelduranfrigola ▴ 790

score 4 · Answer 1 · 2012-10-17

Miquel, the easiest way to start your homologue collection is via Ensembl (orthologues and paralogues) and TreeFam (orthologues); this will save you a lot of BLASTing around. You are right that "nr" is actually highly redundant, for many reasons, so a BLAST against UniProt's UniRef90 is much cleaner. If you really want "all" homologues, you will also have to run TBLASTN against the EST and TSA divisions, which is a tough job.

score 2 · Answer 2 · 2012-10-17


12.6 years ago

John Van Dam ▴ 110

Hi Miquel,

The answer depends a bit on what exactly you want. Do you just want to see if there are "any" homologs? Or are you looking for specific homologs (e.g. homologs in C. elegans)?

If you want to find "any" homologs, nr is fine. If you are looking for more specific homologs, other databases and settings may be more suitable; you could, for instance, run blastp against the RefSeq protein set of a specific organism. Please remember that E-values depend on database size: a hit whose E-value is just below your threshold in a small database can become insignificant in a large database such as nr.

Cheers, John
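
To make the point about database size concrete, here is a minimal sketch using the standard Karlin-Altschul relation E = m · n · 2^(−S'), where S' is the bit score, m the query length and n the total database length. The function name and the two database sizes are assumptions chosen only for illustration; BLAST itself uses effective (edge-corrected) lengths, so real reported values will differ somewhat.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical sizes in total residues, for illustration only.
small_db = 5_000_000          # roughly a single-organism protein set
large_db = 50_000_000_000     # an nr-scale database

# Same alignment (same bit score, same query), two database sizes.
for name, n in (("small", small_db), ("large", large_db)):
    print(f"{name}: E = {expected_hits(50.0, 300, n):.2e}")
```

Because E scales linearly with n, the same alignment gets an E-value 10,000 times larger in the second database than in the first, which is why a fixed cutoff such as 1e-5 can keep a hit in one search and drop it in the other.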


0


Thanks John. I want to see if there are "any" homologs (and, ideally, I'd like to find as many as possible). The problem I've found with "nr" is that I sometimes retrieve several instances of the same protein, perhaps with different lengths for whatever reason, which makes me doubt whether it is suitable for building a proper collection of homologs.

12.6 years ago by miquelduranfrigola ▴ 790

0


Any idea about the command-line options for BLASTing against the protein database of a specific organism (e.g. Homo sapiens)?

Thanks

10.5 years ago by Anushka ▴ 20
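
One possible way to restrict a command-line search to a single organism, sketched here as a small Python helper that only assembles the BLAST+ command. It assumes a recent BLAST+ release and a locally installed version 5 database, which together support the `-taxids` option; the helper name, file names and default values are hypothetical. 9606 is the NCBI taxonomy ID for Homo sapiens.

```python
import shlex


def build_blastp_command(query_fasta, db="refseq_protein", taxid=9606,
                         evalue=1e-5, out="hits.tsv"):
    """Assemble a blastp call limited to one NCBI taxonomy ID."""
    return [
        "blastp",
        "-query", query_fasta,
        "-db", db,                 # name of a locally installed database
        "-taxids", str(taxid),     # assumes a version 5 BLAST database
        "-evalue", str(evalue),
        "-outfmt", "6",            # tabular output
        "-out", out,
    ]


cmd = build_blastp_command("query.fasta")
print(shlex.join(cmd))  # paste into a shell, or pass cmd to subprocess.run
```

If the taxonomy option is not available in your setup, an alternative is to download the protein FASTA for that organism, build a dedicated database with `makeblastdb`, and search that instead.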

score 1 · Answer 3 · 2012-10-17

I agree with cdsouthan: Ensembl might be a good choice for you, as long as you are mainly interested in vertebrates.

You might want to take a look at the new Ensembl REST API, where you can programmatically retrieve all the homologs for a given ID (comparative genomics section). It can be used from several programming languages.

http://beta.rest.ensembl.org/

In addition, you could check the algorithm Ensembl uses to infer orthology and homology relations, which is partly based on BLAST. It might give you some ideas. http://useast.ensembl.org/info/docs/compara/homology_method.html
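
As a rough sketch of what retrieving homologs through the REST API can look like, the snippet below queries the `homology/symbol/:species/:symbol` endpoint. The server address (`https://rest.ensembl.org`, rather than the beta address above), the helper names, and the JSON layout (`data` → `homologies` → `target`) are assumptions based on how the service responds at the time of writing; check the endpoint documentation before relying on them.

```python
import json
import urllib.request

SERVER = "https://rest.ensembl.org"  # assumed current public server


def homology_url(species, symbol):
    """Build the request URL for the homology-by-symbol endpoint."""
    return (f"{SERVER}/homology/symbol/{species}/{symbol}"
            "?content-type=application/json")


def extract_targets(payload):
    """Return (homology type, target species, target ID) tuples.

    Assumes the layout {"data": [{"homologies": [{"type", "target"}]}]}.
    """
    pairs = []
    for entry in payload.get("data", []):
        for hom in entry.get("homologies", []):
            target = hom.get("target", {})
            pairs.append((hom.get("type"),
                          target.get("species"),
                          target.get("id")))
    return pairs


def fetch_homologs(species, symbol):
    """Query the server (needs network access) and parse the reply."""
    with urllib.request.urlopen(homology_url(species, symbol)) as resp:
        return extract_targets(json.load(resp))
```

Calling `fetch_homologs("human", "BRCA2")` would then return one tuple per orthologue or paralogue that Ensembl Compara reports for that gene.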

Ram · Answer 4 · 2015-10-30

BLAST tells you about sequence similarity, but similarity alone is not enough to conclude that two genes are homologs. If you have RefSeq protein accessions, you can simply go to the protein's page at NCBI, e.g.

http://www.ncbi.nlm.nih.gov/protein/NP_000005

Then, in the right sidebar, under "Related information", find and click on "HomoloGene". You could also go to the HomoloGene service and search for your proteins directly there, e.g.

http://www.ncbi.nlm.nih.gov/homologene/?term=NP_000005

HomoloGene also provides its data on FTP. You can download the file here and see all the proteins and genes in the current dataset.
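
If you download that file, a short script can look up the other members of a protein's HomoloGene group. The sketch below assumes the tab-separated layout of `homologene.data` (group ID, taxonomy ID, gene ID, gene symbol, protein GI, protein accession); the function names are hypothetical and the sample rows are made-up placeholders, not real HomoloGene records.

```python
import csv
from collections import defaultdict

# Assumed column order of homologene.data (tab-separated, no header).
COLUMNS = ("hid", "taxid", "gene_id", "symbol", "protein_gi", "protein_acc")


def group_by_hid(lines):
    """Group records by HomoloGene group ID (first column)."""
    groups = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != len(COLUMNS):
            continue  # skip malformed or empty lines
        record = dict(zip(COLUMNS, row))
        groups[record["hid"]].append(record)
    return dict(groups)


def homologs_of(groups, accession):
    """Return the other members of the group containing `accession`."""
    def base(acc):
        return acc.split(".")[0]  # ignore the version suffix

    for members in groups.values():
        if any(base(m["protein_acc"]) == accession for m in members):
            return [m for m in members if base(m["protein_acc"]) != accession]
    return []


# Placeholder rows for illustration only (not real data).
sample = [
    "1\t9606\t111\tGENEA\t1001\tNP_FAKE0001.1",
    "1\t10090\t222\tGenea\t1002\tNP_FAKE0002.1",
    "2\t9606\t333\tGENEB\t1003\tNP_FAKE0003.2",
]
groups = group_by_hid(sample)
print([m["protein_acc"] for m in homologs_of(groups, "NP_FAKE0001")])
```

For the real file, pass an open file handle instead of `sample`, e.g. `group_by_hid(open("homologene.data"))`.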