Dear all,
I am looking for a good way to calculate conservation scores over the columns of an MSA. I usually use Kullback-Leibler divergence (kl_divergence) or Shannon entropy. However, I would like to know whether it makes sense to penalize gaps when calculating conservation and, if so, how this could be implemented. What I have tried so far is a very simple score:
score = kl_divergence * (1 - gap_frequency)
So I simply use gap_frequency (the fraction of gap characters in the column) to penalize columns with a high share of gaps in the alignment. However, I am unsure whether this is biologically meaningful, and I could not find a good solution elsewhere. Are there established methods for this, in particular in combination with Shannon entropy, KL divergence, or similar measures?
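To make the score concrete, here is a minimal sketch of what I mean, in Python. The function names (column_score, msa_scores), the "-" gap character, and the uniform background over 20 amino acids are assumptions for illustration only, not an established method; the KL divergence is computed over the non-gap residues of each column.

```python
import math
from collections import Counter

GAP = "-"  # assumed gap character


def column_score(column, background=None):
    """Return kl_divergence * (1 - gap_frequency) for one alignment column.

    column: string with one character per sequence.
    background: optional dict of residue -> background frequency.
                Residues missing from it fall back to 1/20 (assumed
                uniform distribution over 20 amino acids).
    """
    background = background or {}
    residues = [c for c in column if c != GAP]
    gap_frequency = 1 - len(residues) / len(column)
    if not residues:
        return 0.0  # all-gap column: nothing to score

    n = len(residues)
    kl_divergence = 0.0
    for residue, count in Counter(residues).items():
        p = count / n
        q = background.get(residue, 1 / 20)
        kl_divergence += p * math.log2(p / q)

    return kl_divergence * (1 - gap_frequency)


def msa_scores(sequences, background=None):
    """Score every column of an MSA given as equal-length strings."""
    return [column_score("".join(col), background) for col in zip(*sequences)]
```

For example, msa_scores(["AA-", "AC-", "A--"]) scores the three columns "AAA", "AC-" and "---"; the all-gap column gets 0.0 and the partly gapped column is scaled down by its non-gap fraction.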
Any suggestions are appreciated!
Best, Jonathan