I have a list of proteins that define a family of enzymes.
I want to check whether these enzymes are present in a genome I am working with, so I tried two approaches:
- blastp the above-mentioned sequences against the genome
- build an HMM profile from the proteins and use hmmscan to search for them in the genome
In both cases I get a variable number of hits.
However, there is something that I am missing from the biological point of view.
Because BLAST performs local alignment, the hits I obtain could be similar to the query proteins in regions that are not directly involved in the enzymatic reaction. In theory, then, those hits may not share the function of the reference proteins.
With an HMM profile, by contrast, I would only consider the regions of the proteins that are highly conserved and likely crucial for the enzyme's function. This approach should therefore give hits that are more likely to share the function of the reference proteins.
Am I missing something, or is my reasoning correct?
Ideally, if there is enough homology between query and subject, you won't have much of an issue with similar domains leading to false positives.
However, hmmscan won't fix this if you're only searching for a single domain or if the conserved region you select shows up in other places. Depending on what your enzymatic domain is, it could be present in other proteins. In other words, the opposite of the case OP describes could be true: the enzymatic parts match but the other regions don't.
HMMs are a more advanced model, but the general concept is the same as PSI-BLAST: you're basically searching with a 'fuzzy' representation of a group of sequences rather than a single sequence.
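To make the 'fuzzy representation' idea concrete, here is a minimal sketch of a position-specific scoring matrix built from a toy gap-free alignment. The sequences, the uniform background and the pseudocount are all made up for illustration; real PSI-BLAST and HMMER models use weighted counts, realistic background frequencies and (for HMMs) insert/delete states.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned_seqs, pseudocount=1.0):
    """Log-odds matrix from gap-free, equal-length aligned sequences.

    Assumes a uniform background (1/20 per residue), which is a simplification.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned_seqs) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned_seqs)
        pssm.append({
            aa: math.log2(((counts.get(aa, 0) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score(pssm, seq):
    """Sum the per-position log-odds scores for an ungapped sequence."""
    if len(seq) != len(pssm):
        raise ValueError("sequence length must match the matrix length")
    return sum(column[aa] for column, aa in zip(pssm, seq))

# Hypothetical aligned motif from four family members.
family = ["HDSGK", "HDSAK", "HESGR", "HDTGK"]
pssm = build_pssm(family)

member_like = score(pssm, "HDSGR")   # never seen as-is, but fits every column
unrelated = score(pssm, "WPLCM")     # fits none of the columns
```

The first query is not identical to any single family member, yet it scores well because each position matches what the family tolerates. That is the advantage over searching with one sequence at a time.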
I think your best bet would be to maximize alignment length or confirm each BLAST/HMM hit with a global aligner (e.g. clustalo).
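As a rough sketch of the "maximize alignment length" idea, the snippet below filters BLAST hits by how much of the query and of the subject the alignment covers. It assumes blastp was run with a custom tabular format, `-outfmt "6 qseqid sseqid pident length qstart qend sstart send evalue bitscore qlen slen"`. The 0.7 cutoff and the example rows are invented, and each row is treated as a single HSP (multiple HSPs for the same pair are not merged).

```python
COLUMNS = ("qseqid sseqid pident length qstart qend sstart send "
           "evalue bitscore qlen slen").split()

def parse_hit(line):
    """Turn one tab-separated BLAST row into a dict with typed values."""
    hit = dict(zip(COLUMNS, line.rstrip("\n").split("\t")))
    for key in ("length", "qstart", "qend", "sstart", "send", "qlen", "slen"):
        hit[key] = int(hit[key])
    for key in ("pident", "evalue", "bitscore"):
        hit[key] = float(hit[key])
    return hit

def coverage(start, end, total):
    """Fraction of a sequence spanned by the aligned interval."""
    return (abs(end - start) + 1) / total

def filter_hits(lines, min_cov=0.7):
    """Keep hits whose alignment spans most of both query and subject."""
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        hit = parse_hit(line)
        qcov = coverage(hit["qstart"], hit["qend"], hit["qlen"])
        scov = coverage(hit["sstart"], hit["send"], hit["slen"])
        if qcov >= min_cov and scov >= min_cov:
            kept.append((hit["sseqid"], round(qcov, 2), round(scov, 2)))
    return kept

# Made-up rows: prot1 aligns end to end, prot2 only shares a short region.
example = [
    "enzA\tprot1\t62.0\t290\t5\t294\t10\t299\t1e-90\t310\t300\t305",
    "enzA\tprot2\t55.0\t60\t200\t259\t15\t74\t1e-12\t70\t300\t820",
]
kept = filter_hits(example)
```

Here only prot1 survives. prot2 has a significant E-value but matches just a fifth of the query, which is exactly the kind of hit worth checking by hand or with a global aligner.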
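I agree with most of what you wrote. There are at least two kinds of false positives in OP's case:
- hits that are similar to the query only in regions not directly involved in the enzymatic reaction
- hits that share the enzymatic domain but differ from the query in the other regions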
OP stated that BLAST might produce the first kind of false positive and that hmmscan will avoid it. This is correct: hmmscan with a profile HMM built from "the region in the proteins that are highly conserved and that are likely crucial for the functionality of the enzyme" will completely avoid the first kind of false positive. You are describing the second kind. Further analyses, such as comparing domain architecture, global alignment or phylogenetic analysis, are needed to eliminate those.
Although PSSM-based and profile-HMM-based searches are similar in concept, PSI-BLAST is worse for the first kind of false positive because its searches are iterative. A PSSM generated from a promiscuous domain is going to pick up all kinds of unrelated hits.
Hi Siva, I came across this thread while searching for the difference between HMM-based alignment and traditional local alignment such as BLAST. As I understand it, both BLAST and the HMM approach can only return alignments optimized for 'local similarity', and neither can tell whether the similar regions are more likely to be homologous or to come from a promiscuous domain. So why should we prefer the HMM approach over BLAST? I am guessing that I missed something about their difference.
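One way to picture the domain-architecture check is the sketch below, which compares the ordered list of domains in each candidate with that of a reference enzyme. The domain names and coordinates are hypothetical. In practice they could come from something like a `hmmscan --domtblout` run against a domain database, after resolving overlapping hits, which this sketch does not attempt.

```python
def architecture(domain_hits):
    """Domain names ordered by start position along the protein.

    Each hit is a (name, start, end) tuple; overlaps are not resolved here.
    """
    return tuple(name for name, start, end in sorted(domain_hits, key=lambda h: h[1]))

def same_architecture(candidate_hits, reference_hits):
    """True if both proteins carry the same domains in the same order."""
    return architecture(candidate_hits) == architecture(reference_hits)

# Hypothetical reference enzyme: a catalytic domain followed by a binding domain.
reference = [("Catalytic", 40, 210), ("Binding", 230, 330)]

# Hypothetical candidates that both hit the catalytic-domain profile.
candidates = {
    "prot1": [("Binding", 225, 320), ("Catalytic", 35, 205)],
    "prot2": [("Catalytic", 400, 570), ("Kinase", 20, 280)],
}

results = {name: same_architecture(hits, reference)
           for name, hits in candidates.items()}
```

Both candidates would be reported by a search with the catalytic-domain profile alone, but only prot1 has the same overall layout as the reference. prot2 is the second kind of false positive: the enzymatic part matches and the rest does not.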