I am using HHBLITS to predict protein domains. I now want to take the collection of predicted domains that HHBLITS reports for a protein and convert it into a protein domain architecture. As you can see from the example output below for one protein scanned against PfamA, there are multiple hits, some of which overlap and therefore conflict. How do I resolve these conflicts / overlaps, and on what basis do I parse such an output and convert it into a string of protein domains, i.e. a protein domain architecture?
Here is an excerpt from the HHBLITS author regarding how to solve this problem of overlap / conflict: "The probability that a pair of residues is correctly aligned is the product of the probability for the database match to be homologous (given by the values in the Probab column of the hit list) times the posterior probability of the residue pair to be correctly aligned given the database match is correct in the first place. The posterior probabilities are specified by the confidence numbers in the last line of the alignment blocks. A 0 corresponds to 0-10%, a 9 to 90-100%. Therefore, an obvious solution is to prune the alignments in the overlapping region such that the sum of total probabilities is maximized. There is no script yet that does this automatically."
I don't have much of a clue what that means, let alone how to implement this in Perl or some other language! Could someone guide me through this, please?
Thanks all for your help. - AksR
 No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 PF04379 DUF525: Protein of un   99.3 3.2E-17 8.3E-21  123.8   0.0   87  311-409     2-88 (90)
  2 PF00646 F-box: F-box domain;    85.7   0.018 5.5E-06   32.6   0.0   47     1-47     1-47 (48)
  3 PF09346 SMI1_KNR4: SMI1 / KNR   83.5   0.025 8.3E-06   35.6   0.0   27  114-140     1-27 (130)
  4 PF12937 F-box-like: F-box-lik   78.7   0.057 1.7E-05   30.9   0.0   38    10-47     8-45 (47)
  5 PF05743 UEV: UEV domain; Int    38.5     1.2 0.00033   30.4   0.0   43  269-312    36-85 (121)
  6 PF03360 Glyco_transf_43: Glyc   30.9       2 0.00052   32.3   0.0   47  247-293   56-102 (207)
  7 PF05247 FlhD: Flagellar trans   23.0     3.7 0.00089   28.7   0.0   39  168-206    11-50 (104)
  8 PF08745 UPF0278: UPF0278 fami   18.5     5.4  0.0013   30.5   0.0   42  112-153   61-108 (205)
  9 PF09336 Vps4_C: Vps4 C termin   18.0       5  0.0013   24.3   0.0   24  103-126    34-57 (62)
 10 PF10959 DUF2761: Protein of u   15.6     7.8  0.0017   27.0   0.0   16  376-391    54-69 (95)
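
One way to read the quoted suggestion, as a rough Python sketch and not an official HHBLITS script: give every query residue covered by a hit a total probability of Prob/100 times the posterior, then let each residue go to the hit with the highest value, which maximizes the sum residue by residue. Several things here are assumptions and not part of HHBLITS: the function names `residue_scores` and `resolve_overlaps`, the `min_prob` cutoff of 50, the use of the midpoint of each confidence bin (digit 9 becomes 0.95), and a posterior of 1.0 whenever no confidence string is supplied (the alignment blocks are not shown in the excerpt above). The sketch also assumes one confidence digit per query residue, keys confidence strings by Pfam id, and does not force a hit to stay in one contiguous piece.

```python
# Hits copied from the hit list above: (Pfam id, Prob in %, query start, query end).
HITS = [
    ("PF04379", 99.3, 311, 409),
    ("PF00646", 85.7, 1, 47),
    ("PF09346", 83.5, 114, 140),
    ("PF12937", 78.7, 10, 47),
    ("PF05743", 38.5, 269, 312),
    ("PF03360", 30.9, 247, 293),
    ("PF05247", 23.0, 168, 206),
    ("PF08745", 18.5, 112, 153),
    ("PF09336", 18.0, 103, 126),
    ("PF10959", 15.6, 376, 391),
]


def residue_scores(prob, start, end, confidence=None):
    """Map each query position of one hit to Prob/100 * posterior."""
    length = end - start + 1
    if confidence is None:
        # Assumption: without a confidence line, treat every residue as certain.
        posteriors = [1.0] * length
    else:
        if len(confidence) != length:
            raise ValueError("expected one confidence digit per query residue")
        # Digit d stands for d*10% to (d+1)*10%; the bin midpoint is an assumption.
        posteriors = [(int(d) + 0.5) / 10 for d in confidence]
    return {start + i: prob / 100 * p for i, p in enumerate(posteriors)}


def resolve_overlaps(hits, confidences=None, min_prob=50.0):
    """Assign each query residue to its best-scoring hit and return segments.

    confidences: optional dict of Pfam id -> confidence digit string (hypothetical input).
    min_prob: arbitrary cutoff on the Prob column; weaker hits are ignored.
    """
    confidences = confidences or {}
    best = {}  # query position -> (score, Pfam id)
    for pfam, prob, start, end in hits:
        if prob < min_prob:
            continue
        scores = residue_scores(prob, start, end, confidences.get(pfam))
        for pos, score in scores.items():
            if pos not in best or score > best[pos][0]:
                best[pos] = (score, pfam)

    # Merge runs of consecutive positions owned by the same hit.
    segments = []
    for pos in sorted(best):
        pfam = best[pos][1]
        if segments and segments[-1][0] == pfam and segments[-1][2] == pos - 1:
            segments[-1][2] = pos
        else:
            segments.append([pfam, pos, pos])
    return [tuple(seg) for seg in segments]


print(resolve_overlaps(HITS))
```

With the ten hits above and the arbitrary cutoff of 50, this prints `[('PF00646', 1, 47), ('PF09346', 114, 140), ('PF04379', 311, 409)]`, i.e. F-box, SMI1_KNR4, DUF525 from N- to C-terminus. PF12937 (F-box-like) drops out because PF00646 scores higher at every residue of 10-47. Passing real confidence strings from the alignment blocks would let the boundary inside an overlap shift toward whichever hit is better aligned there.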