Blast Query Against Two Different Taxa
14.0 years ago
User 3822 ▴ 60

Hi,

I have a protein sequence that I first ran through blastp against the NR database without any restrictions. It returned a possible protein family and an organism with a very low E-value (something like 3E-143). This all works fine, but I am told there is a very high possibility that the sequence comes from a nematode. So I reran blastp, this time limiting the taxa to nematodes. I get just one decent hit, with the following header:

ENA|CL652565|CL652565.1 PRI0115a_H01 - PRI0115a.B21 (762) Mixed stage fosmid library of P. pacificus var. California Pristionchus pacificus genomic, genomic survey sequence.

I don't understand this header completely (what does PRI0115a_H01 - PRI0115a.B21 (762) mean?), but from what I can tell it does not show the protein family, just the name of the organism. My question is: the protein family shouldn't differ from the first case (I tried all reading frames), right? So why is it not shown this time? And can I deduce the organism from my second query and the protein family from the first? I am fairly new to biology and bioinformatics and have been put straight into all this, so please don't mind if I have gotten something completely wrong.

blast sequence analysis database • 3.1k views
14.0 years ago
Neilfws 49k

It might be useful to review the purpose of BLAST and how it works. BLAST stands for Basic Local Alignment Search Tool. Its function is to identify sequences in a database that are similar to the query sequence, as quickly as possible.

So first: BLAST results do not include "protein families". They are simply a list of the sequences in the database that match the query sequence well. In your first example, the very low E-value indicates a very good match. Remember that BLAST statistics are influenced by database size and the length of the query sequence, so you also need to look at the percentages of identical/similar residues and the alignment length (does it span most of the query sequence?) to decide if you have a "good hit".
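As a rough illustration of those two checks, here is a minimal Python sketch that computes percent identity and query coverage for a single alignment. The function name `hit_quality` and the example numbers are made up for illustration; the inputs correspond to values you can read off any BLAST alignment (number of identical positions, alignment length, start and end of the alignment on the query, and query length).

```python
def hit_quality(identities, alignment_length, query_start, query_end, query_length):
    """Return (percent identity, percent query coverage) for one alignment.

    query_start and query_end are 1-based, inclusive positions on the query.
    """
    percent_identity = 100.0 * identities / alignment_length
    query_coverage = 100.0 * (query_end - query_start + 1) / query_length
    return percent_identity, query_coverage


# Hypothetical hit: 180 identical positions in a 200-column alignment
# that covers residues 11-210 of a 250-residue query.
pid, cov = hit_quality(identities=180, alignment_length=200,
                       query_start=11, query_end=210, query_length=250)
print(f"identity {pid:.1f}%, query coverage {cov:.1f}%")
```

A hit with a tiny E-value but low query coverage may only match a single conserved domain, which is worth knowing before drawing conclusions about the whole protein.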

In your second example, you limited the database to a taxonomic group. There is no reason why the top hit should be the same as the top hit from the NR database. Also, I don't think that you can have used BLASTP in this case, as you mention trying different reading frames and your hit is a fosmid library sequence. I suspect that you used TBLASTN.

Don't worry about not understanding the header: there is nothing standard about it and no-one could be expected to know what it means just by looking at it. It looks as though it is some kind of code used to identify the library sequence. You would need to investigate further by following links to take you to the genomic survey sequence database for that particular project.

Finally: no, you cannot deduce organism from the second query and protein family from the first. You asked two different questions with the two BLAST searches: (1) of all known protein sequences, which is most similar to mine? and (2) of all known sequences from nematodes, which putative protein sequences are most similar to mine?

I would re-examine the hits from the NR database. If you think that your protein sequence is from a nematode, then some of the hits in that list, not necessarily the top hit, should be from nematodes or closely related organisms.


Adding one thing: if you're searching for protein families you should take a look at http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi


No, blastp and tblastn do not necessarily give the same result. When you search a protein database (blastp), you are searching sequences that the database provider believes to be genuine proteins. When you use tblastn, you are searching a 6-frame translation of a nucleotide database, which is different to a database of known (or putative) proteins.
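To make "6-frame translation" concrete, here is a small self-contained Python sketch of the idea: three reading frames on the forward strand and three on the reverse complement. It uses the standard genetic code and made-up helper names (`six_frame_translation`, `translate`, `reverse_complement`); it only illustrates the concept and is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```

Every stretch of nucleotide sequence gets translated this way, whether or not it is part of a real gene, which is why a tblastn hit does not by itself mean an annotated protein exists in the protein database.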


Thank you for the reply. Yes, you are right. I was using tblastn and thinking it was blastp. But even in this case, tblastn tells me that there is a nematode nucleotide sequence similar to mine in the database (with a good E-value). But when I do a blastx with the same (nematode limited), I get no hits. What should I infer from this? The nematode sequence that tblastn showed doesn't encode anything?


@Michael Thanks, yes I did that. In fact blastp result includes a conserved domain analysis as well.


@neilfws Thanks. Yes, I was using tblastn thinking it was blastp. Let me rephrase what's happening. I have a sequence and first I do a blastp, which gives me a possible protein and the organism. I then do a tblastn on the nematode database and get a hit for an organism. Then I change it to blastp on the nematode database and get no hits. What does this mean? Shouldn't the nematode for which I got a hit in tblastn also show up here? Or does it mean that, according to the database, there is no nematode that encodes this protein?


Also, I wanted to ask again: suppose blastp tells me that this sequence might encode protein X (irrespective of the organism name in the blastp result; I don't think that organism name is so important, because the same protein might be encoded by multiple organisms), and a blastn tells me that this sequence might come from organism Y. Can I not then say that my sequence might come from organism Y and encode protein X?

14.0 years ago

When using TBLASTN as your BLAST algorithm, don't forget to look for the presence of stop codons in the HSP (high-scoring pair, or alignment). If you see stop codons within the HSPs for all reading frames, it is highly likely that you are comparing something other than protein-coding genes. In my experience, the comparison then involves a type of repeat sequence or (retro)transposon.

If in doubt, try this search with an Alu (human) as a query.
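The stop-codon check can also be done programmatically. The sketch below assumes you have copied the translated subject line of an HSP into a string, with stop codons shown as `*` and gaps as `-`, as in the usual BLAST alignment display. The function names are hypothetical.

```python
def stop_codon_positions(translated_subject):
    """Return 1-based alignment columns where the translated subject has '*'."""
    return [i + 1 for i, aa in enumerate(translated_subject) if aa == "*"]


def has_internal_stop(translated_subject):
    """True if a stop codon occurs anywhere except the last alignment column."""
    return "*" in translated_subject[:-1]


# Made-up subject line from a TBLASTN alignment, for illustration only.
subject_line = "MKV*LLT-AG*"
print(stop_codon_positions(subject_line))
print(has_internal_stop(subject_line))
```

A stop in the final column can simply be the natural end of a coding sequence, so only stops inside the alignment are treated as a warning sign here.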
