NCBI Non-Redundant Dataset (nr) in Protein BLAST to Look for Homologs?
12.1 years ago

Hi all,

this must be very basic, but still. I have a protein sequence for which I want to find homologs. I go to BLAST and do, for simplicity here, a regular BLASTp.

I know that BLASTing against refseq_protein or swissprot is common practice, but what about nr (non-redundant protein sequences)? It includes "All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects", and as far as I've seen, it contains not only hypothetical proteins but also multiple instances of the same protein (e.g. different combinations of PDB chains).

Would you guys consider a BLAST search against nr a proper "finding-homologs" exercise?

Thanks!

Miquel

blast • 15k views
12.1 years ago
cdsouthan ★ 1.9k

Miquel, the easiest way to start your homologue collection is via Ensembl (orthologues and paralogues) and TreeFam (orthologues). This will save you a lot of BLASTing around. You are right that "nr" is actually highly redundant, for many reasons, so a BLAST against UniRef90 (UniProt sequences clustered at 90% identity) is much cleaner. If you really want "all", you will also have to TBLASTN against the EST and TSA divisions... a tough job.

12.1 years ago
John Van Dam ▴ 110

Hi Miquel,

The answer depends a bit on what exactly you want. Do you just want to see whether there are "any" homologs, or are you looking for specific homologs (e.g. homologs in C. elegans)?

If you want to find "any" homologs, nr is fine. If you are looking for more specific homologs, other databases and settings may be more suitable. You could, for instance, blastp against the protein set (RefSeq) of a specific organism. Please remember that E-values depend on database size: a hit whose E-value is just below your threshold in a small database can become insignificant in a large database such as nr.
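To make the database-size point concrete, here is a minimal sketch using the standard relation between bit score and E-value, E = m × n × 2^(−S'), where m is the query length and n the total database length. The sequence and database sizes below are made-up round numbers for illustration only, and BLAST itself uses edge-corrected "effective" lengths, so real values will differ somewhat.

```python
def expected_hits(bit_score: float, query_len: float, db_len: float) -> float:
    """E-value for a given bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical sizes, chosen only to show the scaling.
QUERY_LEN = 300          # residues in the query
SMALL_DB = 2e7           # residues in a single-organism protein set
LARGE_DB = 2e11          # residues in a very large database

for label, db_len in (("small db", SMALL_DB), ("large db", LARGE_DB)):
    e = expected_hits(bit_score=45.0, query_len=QUERY_LEN, db_len=db_len)
    print(f"{label}: E = {e:.2e}")
```

The same alignment (same bit score) gets an E-value that grows linearly with database size, which is why a borderline hit in a small database may no longer pass the threshold in nr.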

Cheers, John


Thanks John. I want to see if there are "any" homologs (and, ideally, I'd like to find as many as possible). The problem I've found with "nr" is that I sometimes retrieve several instances of the same protein, perhaps with different lengths for whatever reason, which makes me doubt whether it is suitable for building a proper collection of homologs.
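One pragmatic way to handle this kind of redundancy after the search, sketched here purely as an illustration: collapse hits whose sequence is identical to, or fully contained in, a longer hit. The function name and the toy sequences are hypothetical, and this only catches exact duplicates and exact fragments; near-identical variants would need a proper clustering step (which is what a clustered database such as UniRef90 provides).

```python
from typing import Dict


def collapse_redundant(seqs: Dict[str, str]) -> Dict[str, str]:
    """Drop sequences that are identical to, or a substring of, a longer kept one."""
    kept: Dict[str, str] = {}
    # Visit longest sequences first so fragments are compared against full-length hits.
    for name, seq in sorted(seqs.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        if not any(seq in longer for longer in kept.values()):
            kept[name] = seq
    return kept


# Toy hit list (made-up sequences).
hits = {
    "hit_A": "MKTAYIAKQR",   # full-length
    "hit_B": "MKTAYIAKQR",   # exact duplicate of hit_A
    "hit_C": "TAYIAK",       # fragment of hit_A
    "hit_D": "MSLLNVPAGK",   # unrelated sequence
}
print(sorted(collapse_redundant(hits)))
```

The comparison is quadratic in the number of hits, which is fine for a BLAST result list but not for whole databases.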


Any idea about command-line options for BLASTing against the protein database of a specific organism (e.g. Homo sapiens)?

Thanks
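A possible sketch for the command-line question above, assuming a local BLAST+ installation (2.8.1 or later) and a version 5 database that carries taxonomy information, in which case `blastp` accepts `-taxids` to restrict the search to a taxon (9606 is the NCBI taxonomy ID for Homo sapiens). The file names and the helper function are placeholders, not anything from this thread.

```python
def build_blastp_command(query, db, taxid=None, evalue=1e-5, out="hits.tsv"):
    """Assemble a blastp argument list; nothing is executed here."""
    cmd = [
        "blastp",
        "-query", query,          # FASTA file with the query protein
        "-db", db,                # name of a formatted protein database
        "-evalue", str(evalue),   # E-value cutoff
        "-outfmt", "6",           # tabular output
        "-out", out,
    ]
    if taxid is not None:
        # Requires a v5 database with taxonomy data (assumption stated above).
        cmd += ["-taxids", str(taxid)]
    return cmd


# Placeholder file and database names.
command = build_blastp_command("query.fasta", "refseq_protein", taxid=9606)
print(" ".join(command))
```

The list can be passed to `subprocess.run(command, check=True)` to actually run the search. An alternative that avoids the taxonomy requirement is to download the organism's protein FASTA, format it with `makeblastdb -in proteins.faa -dbtype prot`, and point `-db` at the result.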

12.1 years ago
Biojl ★ 1.7k

I agree with cdsouthan, Ensembl might be a good choice for you... as long as you are interested mainly in vertebrates.

You might want to take a look at the new Ensembl REST API, where you can programmatically retrieve all the homologs for a given ID (comparative genomics section). It can be used from several programming languages.

http://beta.rest.ensembl.org/
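As a rough sketch of what such a programmatic lookup could look like: the code below assumes the current public server at rest.ensembl.org (rather than the beta host linked above) and the `homology/id/:species/:id` endpoint with a JSON response shaped as `data -> homologies -> target`. Both are assumptions to check against the live API documentation; the gene ID is just an example.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumed server; the thread links an older beta host.
SERVER = "https://rest.ensembl.org"


def homology_url(species, gene_id):
    """Build the (assumed) homology endpoint URL for one gene."""
    return (f"{SERVER}/homology/id/{quote(species)}/{quote(gene_id)}"
            "?content-type=application/json")


def fetch_homologies(species, gene_id):
    """Perform the HTTP request and decode the JSON body (needs network access)."""
    req = Request(homology_url(species, gene_id),
                  headers={"Accept": "application/json"})
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)


def list_targets(payload):
    """Flatten the assumed response layout into (type, species, id) tuples."""
    rows = []
    for entry in payload.get("data", []):
        for hom in entry.get("homologies", []):
            target = hom.get("target", {})
            rows.append((hom.get("type"), target.get("species"), target.get("id")))
    return rows


print(homology_url("human", "ENSG00000157764"))
```

Calling `list_targets(fetch_homologies("human", "ENSG00000157764"))` would then return one tuple per reported homolog, provided the response has the layout assumed here.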

In addition, you could check the algorithm Ensembl uses to infer orthology and homology relations, which is partially based on BLAST. It might give you some ideas. http://useast.ensembl.org/info/docs/compara/homology_method.html

9.1 years ago

BLAST reports sequence similarity, but similarity alone is not enough to conclude that two genes are homologs. If you have RefSeq protein accessions, you can simply go to the protein's page at NCBI, e.g.

http://www.ncbi.nlm.nih.gov/protein/NP_000005

Then, in the right sidebar, in the section called "Related information", find and click on "HomoloGene". You could also go to the HomoloGene service and search for your proteins directly there, e.g.

http://www.ncbi.nlm.nih.gov/homologene/?term=NP_000005

HomoloGene also provides its data via FTP. You can download the file here and see all the proteins and genes in the current dataset.
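Once downloaded, the file can be grouped by HomoloGene ID with a few lines of code. This sketch assumes the tab-delimited `homologene.data` layout with six columns (HomoloGene group ID, taxonomy ID, gene ID, gene symbol, protein GI, protein accession); the sample rows are invented placeholders, not real records.

```python
from collections import defaultdict

# Assumed column order of homologene.data.
COLUMNS = ("hid", "taxid", "gene_id", "symbol", "protein_gi", "protein_acc")

# Invented rows for illustration only.
SAMPLE = [
    "1\t9606\t101\tGENE_A\t1001\tACC_HUMAN.1\n",
    "1\t10090\t202\tGene_a\t1002\tACC_MOUSE.1\n",
    "2\t9606\t303\tGENE_B\t1003\tACC_OTHER.1\n",
]


def group_by_hid(lines):
    """Parse tab-delimited rows and group the records by HomoloGene group ID."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        record = dict(zip(COLUMNS, line.split("\t")))
        groups[record["hid"]].append(record)
    return dict(groups)


def homologs_of(protein_acc, groups):
    """Return the other members of the group containing protein_acc (version ignored)."""
    for members in groups.values():
        bases = [m["protein_acc"].split(".")[0] for m in members]
        if protein_acc in bases:
            return [m for m, b in zip(members, bases) if b != protein_acc]
    return []


groups = group_by_hid(SAMPLE)
print([m["protein_acc"] for m in homologs_of("ACC_HUMAN", groups)])
```

With the real file, `group_by_hid(open("homologene.data"))` would work the same way, assuming that file name and layout.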
