Dear all :-)
I am trying to compute the conservation score of each position of a protein multiple sequence alignment. I already used the Shannon entropy, but I am not satisfied with it since it is not similarity-based but identity only. So I thought that maybe it could be a good idea to use a substitution matrix. I tried to implement two methods:
- Protein–Protein Interfaces: Analysis of Amino Acid Conservation in Homodimers (https://doi.org/10.1002/1097-0134(20010101)42:1%3C108::AID-PROT110%3E3.0.CO;2-O)
- the "sum-of-pairs" method from AL2CO (https://doi.org/10.1093/bioinformatics/17.8.700)
The first method gives me wrong results (maybe because I used BLOSUM62 instead of PET91 used in the article...). The second method (AL2CO) doesn't give me satisfying results.
In practice, I would like a score in [0,1] with some sensitivity to sequence redundancy. I have a workflow in python that process my alignment and calculate properties, so I try as much as possible to avoid external tools...
Do you have some bits of advice or maybe a hidden magick package that I didn't found :-)?
Best,
Thibault.
Thank you very much for your answer :-) I will have a look to it and try to implement it.
I still need a "similarity" based score, but I didn't think of Distance scipy package. I can maybe find a way to use one of the methods 🤔 combine to a BLOSUM matrix...
My second "easy" option is to use a 10 group classification for amino acids as described here: http://thegrantlab.org/bio3d/reference/entropy.html.
Hydrophobic/Aliphatic [V,I,L,M], Aromatic [F,W,Y], Ser/Thr [S,T], Polar [N,Q], Positive [H,K,R], Negative [D,E], Tiny [A,G], Proline [P], Cysteine [C], and Gaps [-,X].
Best, Thibault.