Question

Entropy From A Multiple Sequence Alignment With Gaps

14.0 years ago

hadasa ★ 1.0k

Shannon's entropy is a quantitative measure of uncertainty in a data set. In most instances it is possible to calculate entropy scores for a multiple sequence alignment (MSA). What I would like to understand is how to handle gaps, and how to apply a correction when the MSA contains gaps. Does it matter? (I think it does.) How do you interpret the scores in the presence of gaps?

multiple • 17k views

score 5 · Answer 1 · 2010-11-23

14.0 years ago

Bilouweb ★ 1.1k

When you calculate Shannon's entropy, you consider an alphabet of 21 symbols (20 amino acids and a gap symbol). The problem is that a column full of gaps then looks perfectly conserved: its entropy is zero, the same as for a column containing a single amino acid.
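To make the gap problem concrete, here is a minimal Python sketch of per-column Shannon entropy in which the gap character is treated as just another symbol. The function name `column_entropy`, the use of `-` as the gap character, and base-2 logarithms are my own assumptions for illustration; sequence weighting is left out.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column.

    Assumption: the gap character '-' is counted like any other symbol,
    i.e. the alphabet is 20 amino acids plus the gap.
    """
    counts = Counter(column)
    n = len(column)
    # sum of p * log2(1/p) over the symbols observed in the column
    return sum((c / n) * log2(n / c) for c in counts.values())

print(column_entropy("AAAAAAAA"))  # one residue everywhere
print(column_entropy("--------"))  # gaps everywhere: same entropy as the line above
print(column_entropy("ACDEFGHI"))  # eight different residues
```

The first two calls return the same value, so entropy alone cannot tell a conserved residue column from an all-gap column. That is why some separate gap term is needed.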

I found a good way to take gaps into account in the paper by William Valdar, "Scoring Residue Conservation".

I calculate the entropy with a function that takes into account sequence weights and amino acid frequencies (t(x), where x is a column). Then I calculate the proportion of gaps in the column (g(x)), and finally my score is S = (1-t(x)) * (1-g(x)).
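A rough Python sketch of the score S = (1-t(x)) * (1-g(x)) might look like the following. It is a simplification, not the weighted function described above: t(x) here is the plain, unweighted entropy of the non-gap residues, divided by log2(20) so that it falls between 0 and 1. The names `gap_fraction`, `normalized_entropy`, and `conservation_score`, the `-` gap character, and that normalization are all assumptions made for the example.

```python
from collections import Counter
from math import log2

GAP = "-"            # assumed gap character
ALPHABET_SIZE = 20   # assumption: protein alignment with 20 amino acids

def gap_fraction(column):
    """g(x): proportion of gap characters in the column."""
    return column.count(GAP) / len(column)

def normalized_entropy(column):
    """t(x): unweighted entropy of the non-gap residues, scaled to [0, 1].

    Simplification: no sequence weights are applied here.
    """
    residues = [c for c in column if c != GAP]
    if not residues:
        return 0.0  # all-gap column; the gap term handles it
    n = len(residues)
    counts = Counter(residues)
    h = sum((c / n) * log2(n / c) for c in counts.values())
    return h / log2(ALPHABET_SIZE)

def conservation_score(column):
    """S = (1 - t(x)) * (1 - g(x)); higher means more conserved."""
    return (1 - normalized_entropy(column)) * (1 - gap_fraction(column))

for col in ("AAAAAAAA", "AAAA----", "--------", "ACDEFGHI"):
    print(col, round(conservation_score(col), 3))
```

With this form, a fully conserved column scores 1, a column that is half gaps is penalised by half, and an all-gap column scores 0 instead of looking conserved.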

With the function from Valdar, you can also take into account the stereochemical nature of amino acids.

14.0 years ago by Bilouweb ★ 1.1k

score 1 · Answer 2 · 2010-11-23

There is quite a nice description on p. 119 of the BioEdit documentation PDF. In short, you can either define how many character states are possible at that position, or work from the number of observed character states. In either case gaps are fine; the second approach also deals with ambiguity codes, etc.

In terms of interpreting the scores (from the same pdf)...

"An entropy plot can give an idea of the amount of variability through a column in an alignment. It is a measure of the lack of “information content” at each position in the alignment. More accurately, it is a measure of the lack of predictability for an alignment position. If there are x sequences in an alignment (say x = 40 sequences) of DNA sequences, and at position y (say y = position 5) there is an ‘A’ in all sequences, we can assume we have a lot of information for position 5 and chances are if we had to guess at the base at position 5 of another homologous sequence, we would be correct to guess ‘A’. We have maximum “information” for position 5, and the entropy is 0. Now, if there are four possibilities for each position (A, G, C or T) and each occurs at position 5 with a frequency of 0.25 (equally probable), then our information content (how well we could predict the position for a new incoming sequence) has been reduced to 0, and the entropy is at maximum variability."
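The two cases in the quote can be checked by hand. The working below assumes base-2 logarithms, so entropy is in bits; the BioEdit documentation may use a different base, which changes the scale but not which columns come out as 0 or as the maximum. The symbol p_b (frequency of base b in the column) is my own notation.

```latex
\begin{align*}
H &= -\sum_{b \in \{A, C, G, T\}} p_b \log_2 p_b \\[4pt]
\text{all sequences have A } (p_A = 1):\quad
H &= -\,1 \cdot \log_2 1 = 0 \\[4pt]
\text{uniform } (p_A = p_C = p_G = p_T = \tfrac{1}{4}):\quad
H &= -\,4 \cdot \tfrac{1}{4} \log_2 \tfrac{1}{4} = 2 \text{ bits}
\end{align*}
```

So for DNA without gaps the score runs from 0 (fully conserved) to 2 bits (maximally variable). If the gap is counted as a fifth symbol, the maximum becomes log2(5), roughly 2.32 bits.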