Question

Bit-Score in DNA motifs for Dummies

7.8 years ago

ATpoint 88k

Hi,

I have a problem that I cannot get my head around. I am trying to understand the basis of bit scoring in DNA motifs. Say I have a DNA sequence motif based on ENCODE ChIP-seq data for a transcription factor. So far, I have worked with motifs obtained from de novo motif discovery. There, the relative height of a nucleotide at a given position simply represents its relative fraction in the alignment. But what exactly is the difference between those and bit-scored motifs? Why do certain positions show higher scores than others? Please do not link me to pages that explain bit scores; I have already tried them but simply cannot grasp the concept. An explanation for dummies would be highly appreciated.

DNA motif Bit-score • 4.9k views

updated 7.8 years ago by Alex Reynolds 36k • written 7.8 years ago by ATpoint 88k

score 3 · Answer 1 · 2017-07-12

Ignore the bit units for a moment. It's a measurement of information or certainty. Roughly speaking, the relative height is how certain you are to observe a particular residue or nucleotide at a particular position.

Randomness is maximum uncertainty: all events can happen with equal probability. The simplest case is flipping a fair coin, with an equal chance of getting heads or tails.

The opposite of randomness is certainty — you expect some event to happen to the exclusion of most or all other possibilities, like when you roll a weighted die in a crooked casino run by Ricky Jay, and one face comes up more than all the others.
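To put numbers on "uncertainty", here is a minimal sketch that computes Shannon entropy in bits for a fair coin, a fair die, and a loaded die. The function name `shannon_entropy` and the loaded-die probabilities are made up for illustration; only the entropy formula itself is standard.

```python
import math

def shannon_entropy(probs):
    """Uncertainty, in bits, of a discrete probability distribution.

    Outcomes with probability 0 contribute nothing, so they are skipped.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = [0.5, 0.5]
fair_die = [1 / 6] * 6
# Hypothetical loaded die: one face comes up 75% of the time.
loaded_die = [0.75] + [0.05] * 5

coin_bits = shannon_entropy(fair_coin)          # maximum uncertainty for 2 outcomes
fair_die_bits = shannon_entropy(fair_die)       # maximum uncertainty for 6 outcomes
loaded_die_bits = shannon_entropy(loaded_die)   # lower: one face dominates

print(coin_bits, fair_die_bits, loaded_die_bits)
```

The fair coin comes out at exactly 1 bit, and the loaded die has less uncertainty than the fair one, which is the "certainty" the answer is describing.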

In the original paper by Crooks et al., the measure at each position is called conservation. It is defined as the difference between two uncertainties: the maximum uncertainty you'd expect if where TFs bind were completely random (like a factor that binds without caring about the DNA sequence, so you have a pure 1-in-4 chance of seeing any of A, T, C, or G at any position), and the uncertainty of what you observe in reality (which is low when one or two residues show up more frequently than all others, as in a transcription factor binding site). For DNA, the maximum is log2(4) = 2 bits, so each position scores somewhere between 0 bits (all four bases equally likely) and 2 bits (always the same base).
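The conservation measure described above can be sketched for a single alignment column as follows. The function name `information_content` and the example counts are hypothetical, and this sketch assumes a uniform 1-in-4 background and leaves out the small-sample correction that logo tools may apply.

```python
import math

def information_content(counts, alphabet_size=4):
    """Conservation of one alignment column, in bits.

    Computed as (maximum possible entropy) - (entropy of observed bases).
    Assumes a uniform background and applies no small-sample correction.
    """
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    observed_entropy = -sum(p * math.log2(p) for p in probs)
    return math.log2(alphabet_size) - observed_entropy

# Hypothetical columns from 100 aligned binding sites.
random_col = {"A": 25, "C": 25, "G": 25, "T": 25}   # no preference at all
mixed_col = {"A": 50, "C": 0, "G": 50, "T": 0}      # either A or G
conserved_col = {"A": 100, "C": 0, "G": 0, "T": 0}  # always A

print(information_content(random_col))     # 0.0 bits
print(information_content(mixed_col))      # 1.0 bit
print(information_content(conserved_col))  # 2.0 bits
```

A position where the factor does not care about the base scores 0 bits, and a position that is always the same base scores the full 2 bits.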

Tall stacks indicate high conservation, that is, low uncertainty.
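This is also where a bit-scored logo differs from a plain frequency logo. In the usual sequence-logo convention, each letter's height is its frequency at that position multiplied by the position's conservation in bits, so a stack can be anywhere from 0 to 2 bits tall instead of always adding up to 1. A minimal sketch, with a made-up three-position motif and a hypothetical helper `letter_heights` (uniform background, no small-sample correction):

```python
import math

def letter_heights(counts, alphabet_size=4):
    """Height in bits of each letter in one logo column.

    height(base) = frequency(base) * information content of the column.
    """
    total = sum(counts.values())
    freqs = {base: c / total for base, c in counts.items()}
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    info = math.log2(alphabet_size) - entropy
    return {base: p * info for base, p in freqs.items()}

# Hypothetical motif: one dict of base counts per position.
motif = [
    {"A": 100, "C": 0, "G": 0, "T": 0},    # always A
    {"A": 50, "C": 0, "G": 50, "T": 0},    # A or G
    {"A": 25, "C": 25, "G": 25, "T": 25},  # anything goes
]

for position, column in enumerate(motif, start=1):
    heights = letter_heights(column)
    stack_height = sum(heights.values())
    print(position, round(stack_height, 3), heights)
```

In a frequency logo all three positions would have the same total height. In the bit-scored version the stacks are 2, 1, and 0 bits tall, so the uninformative third position almost vanishes from the picture.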

Transcription factor binding sites are highly conserved, biologically or evolutionarily speaking, because they control how segments of DNA get turned on, and different parts of the DNA need to get controlled at different times and in specific ways, in order for the concert of proteins to do their thing and keep the organism alive.

Like tossing a spanner into the engine of a moving car, mutations to TF binding sites will more often than not break the biological machinery of organisms and so weaken or kill them before they can make copies of themselves.

Over time, therefore, organisms have evolved genomes that conserve these regulatory, functional sites, so as to stay alive long enough to reproduce.

That's why sequence logos are good visual representations of TF sites. Logos show you where and which nucleotides are conserved to the exclusion of others. Logos show how different transcription factors have evolved a preference (a higher "certainty") for binding to different sequences of DNA. Further, logos offer quantitative or informational measures of that certainty — "bit scores" — which are based on mathematics in information theory.