Question

How can I perform pairwise comparison of protein secondary structure?

0

5.4 years ago

Anand Rao ▴ 640

Greetings!

Is it possible to perform pairwise comparison of protein Secondary Structure where:

1st primary protein sequence = 'Unknown domain' or 'Uncertain annotation' (~ 20K of these, all different)

2nd primary protein sequence = Known protein domain sequence (~ 500 of these, also all different). Here the term domain satisfies both structure & sequence motif definitions of 'domain'

So no two pairwise comparisons will have the same combination of unknown query : known domain.

Goal 1 - To judge whether the unknown has a matching SS to that of the domain or not?

Goal 2 - Ultimately determine if unknown sequence is a functional domain or not?

If yes, could you refer me to such tool(s) and their corresponding manuscript(s), please?

I know there are 3D overlap and comparison tools, but I have not come across tools for protein SS comparison. I look forward to advice from subject matter experts. Thanks in advance.

secondary structure protein structure alignment • 1.4k views

ADD COMMENT • link updated 5.4 years ago by Mensur Dlakic ★ 28k • written 5.4 years ago by Anand Rao ▴ 640

0

You can use HHsuite and incorporate secondary structure/domain information from DSSP, I believe. I would suggest you build an HMM from your known domain sequences, and then screen all your unknowns against that.

Detecting domains is very much a strong suit of HMM based tools.

ADD REPLY • link 5.4 years ago by Joe 21k

0

Joe wrote: "I would suggest you build an HMM from your known domain sequences, and then screen all your unknowns against that."

My impression is that Anand wanted to score only the secondary structure match. It defeats the purpose of building profile HMMs to score only their secondary structures. If one goes through the trouble of building a profile HMM with SS, scoring both sequence and SS features will provide higher sensitivity. That may very well be what you were suggesting, and it is an excellent strategy for homology detection. However, it is not a direct answer to the original question.

ADD REPLY • link 5.4 years ago by Mensur Dlakic ★ 28k

0

It's not a purely SS-based approach, no, but this approach will likely be better. If there is secondary structure conservation, it should be inherent in the sequence to at least some degree. Sequence identity thresholds can be dropped very low with HMM approaches, so the sequence itself need not dominate, and an HMM built from 500 or so known-good sequences should cover a decent amount of sequence space.

Secondary structure predictions in the absence of good 3D structures can be questionable at best oftentimes.

ADD REPLY • link 5.4 years ago by Joe 21k

0

Joe wrote: "Secondary structure predictions in the absence of good 3D structures can be questionable at best oftentimes."

There is nothing that can be predicted better from protein sequence than its secondary structure. This is true both for globular (83-85% 3-state accuracy) and trans-membrane proteins (~90%). That level of accuracy is an average and won't hold every single time, but it certainly is not questionable oftentimes. Also, predictions have little to do with the availability of a good 3D structure for that particular protein or domain. Good SS predictions are generated all the time even for brand new folds.

Agree with the rest of your comment, even though it doesn't answer the original question.

ADD REPLY • link 5.4 years ago by Mensur Dlakic ★ 28k

0

Well, I’d be happy to be proven wrong, but in my experience, you cannot robustly say much about a protein's structure from its sequence alone. How that protein behaves in situ, especially if it has binding partners critical to its function, is not sufficiently captured by sequence alone. In fact, for a great many protein families, the sequences can be almost entirely unalignable, and yet the same protein folds can ensue.

Moreover, even if a protein does have a resolved structure, crystal structures are often artefacts in and of themselves. This is not a trivial problem whichever way you approach it.

As for answering the question, it does address what OP is actually trying to do. The purpose of this forum is not simply to answer the question as posed, but to educate. It is often the case that a poster believes they have identified the appropriate way to answer a question, and the approach may well work, but it may not be the optimal way, or it may be a fully fledged XY question.

Giving the poster a solution to a problem, just because that is what they asked, is not ideal, and would also lead future visitors to this thread down a blind alley if the OP's premise was not quite right*.

*This is not a direct criticism of you Anand, I’m just speaking generally about offering solutions to the task rather than to the question.

ADD REPLY • link 5.4 years ago by Joe 21k

0

Thanks for sharing your idea. I've used hhblits in the past, so even though it is computationally more intensive, the use of the CS219 alphabet and the higher sensitivity of HMM-HMM alignment should offer enough info in the output to answer my question.

Speaking of output, here is an example from the HHsuite HMM-HMM pairwise alignment documentation, pasted below:

No 68 
>d1a4pa_ a.39.1.2 (A:) Calcyclin (S100) {Human (Homo sapiens), P11 s100a10, calpactin [TaxId: 9606]}
Probab=91.65  E-value=0.12  Score=40.00  Aligned_cols=62  Identities=16%  Similarity=0.149  Sum_probs=42.0

Q ss_pred             ccCCCCCCcHHHHHHHHHHHHHhhcccCccHHHHHHHHHhhhhccCCCCcCHHHHHHH-HHHHH
Q sp|Q5VUD6|FA69  140 FDKPTRGTSIKEFREMTLSFLKANLGDLPSLPALVGQVLLMADFNKDNRVSLAEAKSV-WALLQ  202 (431)
Q Consensus       140 ~d~p~~g~s~~eF~emv~~~i~~~lg~~~~l~~L~~~~~~~~d~nk~g~vs~~e~~sl-waLlq  202 (431)
                      ||+..-..|.+||.+++.......++.+.+ ...+..++..+|.|+||+|++.|...+ ..|..
T Consensus        18 yd~ddG~is~~El~~~l~~~~~~~~~~~~~-~~~v~~~~~~~D~n~DG~I~F~EF~~li~~l~~   80 (92)
T d1a4pa_          18 FAGDKGYLTKEDLRVLMEKEFPGFLENQKD-PLAVDKIMKDLDQCRDGKVGFQSFFSLIAGLTI   80 (92)
T ss_dssp             HHGGGCSBCHHHHHHHHHHHCHHHHHHSCC-TTHHHHHHHHHCTTSSSCBCHHHHHHHHHHHHH
T ss_pred             HcCCCCEEcHHHHHHHHHHhccccccccCC-HHHHHHHHHHHhCCCCCCCcHHHHHHHHHHHHH
Confidence            444433449999999998876655554332 234566677899999999999997544 44443
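
For readers who want to pull the secondary structure rows out of an alignment block like the one above, here is a minimal Python sketch. It assumes the layout of the pasted example: every row of one block starts its alignment string at the same column as the `Q ss_pred` row. It slices by column rather than splitting on whitespace, because the Confidence row contains blanks at gapped columns. The function name `extract_rows` and the toy lines are hypothetical, and the sketch handles a single block only, not alignments wrapped over several blocks.

```python
import re

# Row labels of interest, as they appear in the pasted alignment block.
WANTED = ("Q ss_pred", "T ss_pred", "T ss_dssp", "Confidence")


def extract_rows(block_lines):
    """Return {label: aligned string} for one alignment block.

    The column span is taken from the 'Q ss_pred' row and reused for the
    other rows, so blanks inside the Confidence row are preserved.
    """
    span = None
    for line in block_lines:
        m = re.match(r"Q ss_pred\s+(\S+)", line)
        if m:
            span = m.span(1)
            break
    if span is None:
        return {}
    start, end = span
    rows = {}
    for line in block_lines:
        for label in WANTED:
            if line.startswith(label):
                # Pad in case trailing blanks were stripped from the line.
                rows[label] = line[start:end].ljust(end - start)
    return rows


# Toy block (not real HHsuite output), built with the same column layout.
toy_block = [
    "Q ss_pred".ljust(22) + "ccHHHHc-EEEc",
    "T ss_pred".ljust(22) + "cCHHhccCEE-c",
    "Confidence".ljust(22) + "4599836 78 5",
]
rows = extract_rows(toy_block)
print(rows)
```

The three returned strings have equal length, so they can be compared column by column.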

Thinking further about Healey's suggestion, some questions that arise:

1.

The Confidence line is only displayed when the -realign option is active.

Should I turn on the -realign option for my hhblits run? Is that the only way to incorporate confidence scores into my qualitative / quantitative evaluation of SS conservation between Q & T? If alignment quality dips below a certain confidence level (what should that level be?), should I ignore or penalize that column?

2. DSSP alphabet: H, B, E, G, I, T, S (From here). PSIPRED alphabet: E, C, H (From here). DSSP has more SS 'states' than PSIPRED (please confirm this), so for the same alignment length there is more information content from DSSP than from PSIPRED. But is there a preference for using DSSP or PSIPRED with hhblits, or should I use both?

T ss_dssp:      the template secondary structure as determined by DSSP (when available)
T ss_pred:      the template secondary structure as predicted by PSIPRED (when available)
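
One way to put the two alphabets on the same footing is to collapse the DSSP states to three. The sketch below uses a commonly used reduction (H, G, I to helix; E, B to strand; everything else to coil). That mapping is an assumption for illustration, not something HHsuite is stated to do in this thread, and other reductions exist. The names `DSSP_TO_3STATE` and `reduce_dssp` are hypothetical.

```python
# Assumed 8-to-3 state reduction; alternative conventions exist.
DSSP_TO_3STATE = {
    "H": "H", "G": "H", "I": "H",  # helices
    "E": "E", "B": "E",            # strands and bridges
}


def reduce_dssp(ss):
    """Collapse a DSSP string to H/E/C, keeping alignment gaps ('-')."""
    return "".join(
        ch if ch == "-" else DSSP_TO_3STATE.get(ch.upper(), "C")
        for ch in ss
    )


# First 13 columns of the 'T ss_dssp' row in the example above.
print(reduce_dssp("HHGGGCSBCHHHH"))  # HHHHHCCECHHHH
```

After this reduction a DSSP row can be compared column by column with a PSIPRED row, at the cost of discarding the finer DSSP distinctions.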

3. From example HHM-HHM alignment above, I see lines for

Q ss_pred & T ss_pred => meaning I can compare 'SS string' for BOTH Q & T, predicted by PSIPRED

T ss_dssp - but NOT for Q ss_dssp => So how can I make the pairwise comparison when the DSSP assignment is found in the hhblits output for JUST T, but NOT for Q, especially since the PSIPRED and DSSP alphabets are different and not interchangeable? Or is this as simple as a toggle to turn on "Q ss_dssp" in the output?

4. This is probably my most important follow up question :)

It's rudimentary to simply grep out the following 3 lines from the example above, but how exactly would I use this info to quantify SS conservation? Is there a script / parser available (maybe in HHsuite?) and a set of rules defined somewhere?

ccCCCCCCcHHHHHHHHHHHHHhhcccCccHHHHHHHHHhhhhccCCCCcCHHHHHHH-HHHHH # Q ss_pred
HcCCCCEEcHHHHHHHHHHhccccccccCC-HHHHHHHHHHHhCCCCCCCcHHHHHHHHHHHHH # T ss_pred 
444433449999999998876655554332 234566677899999999999997544 44443 # Confidence
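
As a starting point, the three rows can be scored with a simple per-column agreement count. The sketch below is only one possible scoring rule, under stated assumptions: gapped columns are skipped, upper and lower case are treated as the same state, and columns whose Confidence digit is below a chosen cutoff are ignored rather than penalized. The function name `ss_agreement`, the `min_conf` parameter, and the toy strings are hypothetical, and the cutoff value is arbitrary.

```python
def ss_agreement(q_ss, t_ss, confidence=None, min_conf=0):
    """Return (matches, compared) over the aligned columns.

    Columns with a gap in either row are skipped. If a confidence row is
    given, columns with a blank or a digit below min_conf are skipped too.
    """
    if len(q_ss) != len(t_ss):
        raise ValueError("Q and T rows must have the same length")
    if confidence is not None and len(confidence) != len(q_ss):
        raise ValueError("confidence row must match the alignment length")

    matches = compared = 0
    for i, (q, t) in enumerate(zip(q_ss, t_ss)):
        if q == "-" or t == "-":
            continue
        if confidence is not None:
            c = confidence[i]
            if not c.isdigit() or int(c) < min_conf:
                continue
        compared += 1
        if q.upper() == t.upper():
            matches += 1
    return matches, compared


# Toy rows (not real output) in the same format as the three lines above.
q = "ccHHHHc-EEEc"
t = "cCHHhccCEE-c"
conf = "4599836 78 5"

print(ss_agreement(q, t))                    # (9, 10)
print(ss_agreement(q, t, conf, min_conf=5))  # (8, 8)
```

Dividing matches by compared gives a fraction of agreeing columns. Whether low-confidence columns should be dropped, down-weighted, or penalized is a design choice this sketch does not settle.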

ADD REPLY • link 5.4 years ago by Anand Rao ▴ 640

score 1 · Answer 1 · 2019-07-02

1

5.4 years ago

Mensur Dlakic ★ 28k

There was a tool for that (see here), but it seems to be unavailable. You may want to check with the authors.

ADD COMMENT • link 5.4 years ago by Mensur Dlakic ★ 28k

0

Thanks for suggesting SSEA; I could not find it elsewhere. I've pinged Silvio Tosatto and hope to hear back from him...

ADD REPLY • link 5.4 years ago by Anand Rao ▴ 640