Greetings!
Is it possible to perform pairwise comparison of protein Secondary Structure where:
1st primary protein sequence = 'Unknown domain' or 'Uncertain annotation' (~ 20K of these, all different)
2nd primary protein sequence = Known protein domain sequence (~ 500 of these, also all different). Here the term domain satisfies both structure & sequence motif definitions of 'domain'
So no two pairwise comparisons will have the same combination of unknown query : known domain.
Goal 1 - To judge whether the unknown has a matching SS to that of the domain or not?
Goal 2 - Ultimately determine if unknown sequence is a functional domain or not?
If yes, could you refer me to such tool(s) and their corresponding manuscript(s), please?
I know there are 3D overlap and comparison tools, but I have not come across tools for protein SS comparison. I look forward to advice from subject matter experts. Thanks in advance.
You can use HHsuite and incorporate secondary structure/domain information from dssp, I believe. I would suggest you build an HMM from your known domain sequences, and then screen all your unknowns against that. Detecting domains is very much a strong suit of HMM-based tools.
My impression is that Anand wanted to score only the secondary structure match. It defeats the purpose of building profile HMMs to score only their secondary structures. If one goes through the trouble of building a profile HMM with SS, scoring both sequence and SS features will provide higher sensitivity. That may very well be what you were suggesting, and it is an excellent strategy for homology detection. However, it is not a direct answer to the original question.
It's not a purely SS based approach no, but this approach will likely be better. If there is secondary structure conservation it should be inherent in the sequence to at least some degree. Sequence identity thresholds can be dropped very low with HMM approaches, so the sequence itself need not dominate, and an HMM built from 500 or so known-good sequences should cover a decent amount of sequence space.
Secondary structure predictions in the absence of good 3D structures are often questionable at best.
There is nothing that can be predicted better from protein sequence than its secondary structure. This is true both for globular (83-85% 3-state accuracy) and trans-membrane proteins (~90%). That level of accuracy is an average and won't hold every single time, but it certainly is not questionable oftentimes. Also, predictions have little to do with the availability of a good 3D structure for that particular protein or domain. Good SS predictions are generated all the time even for brand new folds.
Agree with the rest of your comment, even though it doesn't answer the original question.
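For readers unfamiliar with the term, the 3-state accuracy quoted above is usually reported as Q3: the fraction of residues whose predicted state (H, E or C) matches the observed one. A minimal sketch, assuming both strings are already in the H/E/C alphabet and have the same length (the function name `q3` is just illustrative):

```python
def q3(predicted: str, observed: str) -> float:
    """Fraction of positions where the predicted 3-state SS equals the observed SS."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    # Count positions where both strings carry the same state.
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


# Toy example: 8 of 10 residues agree.
print(q3("HHHHCCEEEE", "HHHCCCEEEC"))  # 0.8
```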
Well, I’d be happy to be proven wrong, but in my experience, you cannot robustly say much about a protein's structure from its sequence alone. How that protein behaves in situ, especially if it has binding partners critical to its function, is not sufficiently captured by sequence alone. In fact, for a great many protein families, the sequences can be almost entirely unalignable, and yet the same protein folds can ensue.
Moreover, even if a protein does have a resolved structure, crystal structures are often artefacts in and of themselves. This is not a trivial problem whichever way you approach it.
As for answering the question, it does address what OP is actually trying to do. The purpose of this forum is not simply to answer the question as posed, but to educate. It is often the case that a poster believes they have identified the appropriate way to answer a question, and the approach may well work, but it may not be the optimal way, or it may be a fully fledged XY question.
Giving the poster a solution to a problem, just because that is what they asked, is not ideal, and would also lead future visitors to this thread down a blind alley, if the OP's premise was not quite right*.
*This is not a direct criticism of you Anand, I’m just speaking generally about offering solutions to the task rather than to the question.
Thanks for sharing your idea. I've used hhblits in the past, so even though it is computationally more intensive, use of the CS219 alphabet and the higher sensitivity of HMM<=>HMM alignment should offer enough info in the output to answer my question.
Speaking of output, here is an example of an HHsuite HMM-HMM pairwise alignment, pasted below:
Thinking further about Healey's suggestion, some questions that arise:
1. Should I turn on the -realign option for my hhblits run? Is that the only way to incorporate confidence scores into my qualitative / quantitative evaluation of SS conservation between Q & T? If alignment quality dips below a certain confidence level (what should that level be?), should I ignore or penalize that column?
2. DSSP alphabet: H, B, E, G, I, T, S
PSIPRED alphabet: E, C, H
DSSP has more SS 'states' than PSIPRED (confirm this), so for the same alignment length there is more information content from DSSP than from PSIPRED. But is there a preference for using DSSP or PSIPRED with hhblits, or should I use both?
3. From example HHM-HHM alignment above, I see lines for
Q ss_pred & T ss_pred => meaning I can compare 'SS string' for BOTH Q & T, predicted by PSIPRED
T ss_dssp - but NOT for Q ss_dssp => So how can I make the pairwise comparison when the DSSP string for JUST T is found in the hhblits output, but NOT for Q, especially since the PSIPRED and DSSP alphabets are different and not interchangeable? Or is this as simple as a toggle to turn on "Q ss_dssp" in the output?!
4. This is probably my most important follow up question :)
It's rudimentary to simply grep the following 3 lines from the example below, but how exactly would I use this info to quantify SS conservation? Is there a script / parser available (maybe in hhsuite?) and a set of rules defined somewhere?