The problem you've described in your post is similar to one of my research problems from a decade back, and I did try to compare results for protein domain discovery using HMMER2 vs HMMER3 vs HHBLITS. I will try to provide some summary points that might help you. Feel free to ask any follow-up questions.
1. As you are probably aware, HHBLITS uses profile-profile comparison to detect remote matches and is the most sensitive of these tools. The deprecated HMMER2 and the current HMMER3 use profile-sequence comparison, which is less sensitive but good enough, and is widely used for domain discovery and annotation of protein sequences. BLAST, on the other hand, uses sequence-sequence heuristics to report matches and is the least sensitive. BLAST is usually not used for protein domain discovery or annotation unless the domain is highly conserved.
I suspect your protein domain is NOT highly conserved, or you would not be thinking about HHBLITS, am I right?
2. When sequence identity falls below 30%, the rate of false positives increases. Below 20%, it becomes very hard to distinguish true positives from false ones. So when you want to detect remotely homologous sequences, this is something you need to be careful about. Usually, a method with higher sensitivity will suffer from lower specificity. As a workaround, the context-specific alphabet of 219 letters (CS219) bumps up the specificity; in fact, CS-BLAST is demonstrably more specific than regular BLAST. So you can reasonably claim that HHBLITS is both sensitive (few false negatives) and reasonably specific (few false positives).
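To make the sensitivity/specificity trade-off concrete, here is a minimal Python sketch. The counts are made-up numbers for illustration only, not results from any benchmark or from my own runs; `permissive` and `strict` are hypothetical labels for two search settings.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true homologs that the search recovers (low false negatives)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-homologs that the search correctly rejects (low false positives)."""
    return tn / (tn + fp)


# Hypothetical counts for two settings run against the same labelled test set
# of 100 true homologs and 900 unrelated sequences.
permissive = {"tp": 95, "fn": 5, "fp": 40, "tn": 860}
strict = {"tp": 80, "fn": 20, "fp": 5, "tn": 895}

for name, c in (("permissive", permissive), ("strict", strict)):
    sens = sensitivity(c["tp"], c["fn"])
    spec = specificity(c["tn"], c["fp"])
    print(f"{name}: sensitivity={sens:.3f} specificity={spec:.3f}")
```

In this toy case the permissive setting finds more of the true homologs but lets in more false positives, which is the pattern to watch for below 30% identity.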
3. The computational requirements to run HHBLITS are significant, and you do not describe what resources you have at your disposal. You'd need a machine with multiple CPUs (the more, the quicker) and a lot of RAM (64 GB for some steps), very different from what you'd find on a typical desktop. I used a High Performance Computing Center facility at my university for my HHBLITS runs.
In case you did not already know this, here is a very simplified summary of how the HHBLITS pipeline works (or used to work):
- download the entire annotated proteome for your species of interest
- find homologs to each of those sequences against a non-redundant (very large) database of proteins
- align those and thereby convert each query sequence => alignment => profile based not on 20 aa residues (like in HMMER), but on a CS219 alphabet
- download one or more HHBLITS profiles from Pfam version 32 or whichever is the latest, pre-made by the HHSUITE research group
- scan your database of protein profiles (for your proteomes) vs. one HHBLITS Pfam profile of interest, OR
- scan your database of HHBLITS Pfam profiles vs. one CS219 profile for one protein from your proteome
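As a rough sketch of the last variant (one protein's profile scanned against a database of Pfam profiles), the two `hhblits` calls could be assembled like this in Python. This is only an illustration under assumptions: the flags `-i`, `-d`, `-oa3m`, `-o`, `-n` and `-cpu` are the ones I know from the HH-suite usage text, so please check them against your installed version; every path and database prefix below is a hypothetical placeholder, and the function names are mine. The code only builds the command lines and does not run anything.

```python
from typing import List


def build_alignment_cmd(query_fasta: str, seq_db: str, out_a3m: str,
                        iterations: int = 2, cpus: int = 4) -> List[str]:
    """Step 1: iterative search of one query against a large sequence
    database, writing the resulting alignment in A3M format."""
    return ["hhblits", "-i", query_fasta, "-d", seq_db,
            "-oa3m", out_a3m, "-n", str(iterations), "-cpu", str(cpus)]


def build_pfam_scan_cmd(query_a3m: str, pfam_db: str, out_hhr: str,
                        cpus: int = 4) -> List[str]:
    """Step 2: single-pass scan of the query alignment against a database
    of pre-built Pfam profiles, writing a results (.hhr) file."""
    return ["hhblits", "-i", query_a3m, "-d", pfam_db,
            "-o", out_hhr, "-n", "1", "-cpu", str(cpus)]


# Hypothetical file names and database prefixes, for illustration only.
step1 = build_alignment_cmd("protein1.fasta", "db/sequence_db", "protein1.a3m")
step2 = build_pfam_scan_cmd("protein1.a3m", "db/pfam_hh", "protein1.hhr")

for cmd in (step1, step2):
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` once the paths point at real files, and the pair would be repeated for every protein in the proteome, which is where the run time adds up.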
You would need to repeat this process for each species / proteome of interest. Is it really worth it? Let me try to answer this question below.
Based on my own experience, also looking at a protein domain ~ 50aa long, with very low sequence conservation, but quite high structural conservation, I concluded the following:
A. It may be better to consider HMMER2 rather than HMMER3 for discovering matches to short, diverse domain sequences, or to use a hybrid approach: https://biologydirect.biomedcentral.com/articles/10.1186/s13062-016-0163-0
B. HHBLITS is extremely computationally intensive. It did let me add a handful of proteins to the list of hundreds that HMMER2, HMMER3 and HHBLITS all found, but for so few additional hits it may not be worth it. Also, these methods will return slightly different domain boundaries. If you need to annotate proteins with domain start-stop coordinates, it becomes another headache to decide which tool's results you will use for consistency and comparison; public databases commonly use InterProScan and, to some extent, PfamScan.
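To illustrate the boundary problem, here is a small Python sketch that measures how well the start-stop calls from different tools agree for one domain on one protein. The coordinates are invented for the example, and `overlap_fraction` is just one reasonable way to score agreement (intersection over union of the two ranges), not something any of these tools report.

```python
from itertools import combinations
from typing import Tuple

Range = Tuple[int, int]  # inclusive (start, stop) residue coordinates


def overlap_fraction(a: Range, b: Range) -> float:
    """Intersection over union of two inclusive residue ranges (0.0 if disjoint)."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


# Hypothetical domain calls for the same ~50 aa domain on one protein.
calls = {"hmmer2": (12, 61), "hmmer3": (15, 58), "hhblits": (10, 64)}

for (tool_a, range_a), (tool_b, range_b) in combinations(calls.items(), 2):
    score = overlap_fraction(range_a, range_b)
    print(f"{tool_a} vs {tool_b}: {score:.2f}")
```

All three calls clearly describe the same domain, yet none of the pairs agree exactly, so you still have to pick one tool's coordinates (or a rule for merging them) and apply it consistently.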
C. In the past (and probably still), HHBLITS was (is?) a top performer for research questions, where it was important to find a template for structural modeling - these are low throughput studies. What I performed, and what you intend to perform are high throughput (proteome scale) studies, which is NOT what HHBLITS was originally intended for.
In conclusion, this is my warning to you: Stick with HMMER2 or HMMER3 or some such variant, rather than using HHBLITS for protein domain discovery.
Thanks Anand for the details... and warning !!!
Indeed, HHBLITS is very slow even for a couple of sequences without substantial computational support. Apart from the identification of remote homologs, the inclusion of secondary structure information in the search criteria would have been another reason for using HHBLITS.
Yes, I would like to ask a follow-up question, just for my general education. How well does CS-BLAST circumvent the problem of false positives when % identity is below 20?
It's been a while and I cannot remember off the top of my head. I never did use CS-BLAST. You can find more info at https://en.wikipedia.org/wiki/CS-BLAST#Performance - and the relevant paper that describes and benchmarks CS-BLAST.
BTW, have you used NCBI CD-Search? https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi I don't think they have a command line interface...
In general, I would be very surprised if HMMER2/3 under-performed compared to CS-BLAST.
If you have the skillset to hack HMMER to use context specificity (CS-219 alphabet) instead of the usual 20 aa code, I expect this to perform better. And it might be a great compromise over using HHBLITS, in terms of run times and computational requirements. Might be worth asking Johannes Soding (HHBLITS PI) whether this has been done...
Good luck!
Many thanks Anand !
BTW, what protein domain are you working on? If you are comfortable sharing (in public or in private), I can take a look at it, and may be able to provide more relevant suggestions....