Perhaps consider the simple case of 2mers:
AA
AC
AG
AT
CA
...
TT
Each of these 2mers has a 1/16 probability of occurring, assuming all four bases are equally likely at each position.
We are interested in the 2mer AA and we allow up to one mismatch. Say we hold one base constant as A and allow the other base to vary, i.e.:
A{A, C, G, T}
{A, C, G, T}A
Each of these two events occurs with probability 1/4. However, both events include AA, so adding them counts AA twice:
AA
AC
AG
AT
Or:
AA
CA
GA
TA
The chance of getting AA with up to one mismatch allowed is therefore: 2/4 - 1/16 = 7/16.
If the subtraction part seems strange, you could also start from the exact match AA and count the one-mismatch patterns explicitly:
AC
AG
AT
CA
GA
TA
Each of these mers has a 1/16 probability, so the total is 1/16 (AA) + 6/16 (AC, AG, ..., TA) = 7/16.
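The explicit count can also be checked by brute force. The following is a minimal sketch, assuming uniform and independent bases as above; the helper names hamming and match_probability are made up for this illustration.

```python
from fractions import Fraction
from itertools import product

ALPHABET = "ACGT"

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def match_probability(target, max_mismatches):
    """List every kmer within max_mismatches of target, plus its total probability."""
    k = len(target)
    hits = [
        "".join(kmer)
        for kmer in product(ALPHABET, repeat=k)
        if hamming(kmer, target) <= max_mismatches
    ]
    # Every kmer is equally likely, so the probability is just hits / 4^k.
    return hits, Fraction(len(hits), 4 ** k)

hits, p = match_probability("AA", 1)
print(hits)  # ['AA', 'AC', 'AG', 'AT', 'CA', 'GA', 'TA']
print(p)     # 7/16
```

Enumerating all 4^k kmers is only practical for small k, but it is a handy sanity check on the counting argument.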
In the case of 3mers:
AAA
AAC
AAG
AAT
ACA
ACC
...
TTT
Here, you have all 64 possible 3mers. The probability of getting any one specific 3mer without any mismatches is 1/64, or 1/4^3.
If you want a specific 3mer (say, AAA) while allowing up to one mismatch, you can hold two of the three bases constant and allow the third to change.
This gives three events, each with two bases held constant and the third variable:
AA{A, C, G, T}
A{A, C, G, T}A
{A, C, G, T}AA
Or:
AAA
AAC
AAG
AAT
Or:
AAA
ACA
AGA
ATA
Or:
AAA
CAA
GAA
TAA
The probability of each of those three events is 1/16. If we sum the probabilities of these three events, we get 3/16. However, this counts the exact match AAA three times instead of just once. We only want to count AAA once, so we subtract the probability of the two extra AAA events: 3/16 - 2/64 = 10/64.
More generally, for a kmer with up to one mismatch allowed, this is: k/4^(k-1) - (k-1)/4^k.
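As a quick check, the general expression can be compared against direct enumeration for a few small values of k. This is only a sketch under the same uniform-base assumption; closed_form and brute_force are names invented here. Note that the expression simplifies to (3k+1)/4^k, since there is one exact match plus 3 alternative bases at each of the k positions.

```python
from fractions import Fraction
from itertools import product

def closed_form(k):
    """k/4^(k-1) - (k-1)/4^k, the one-mismatch formula from the text."""
    return Fraction(k, 4 ** (k - 1)) - Fraction(k - 1, 4 ** k)

def brute_force(k):
    """Count kmers within one mismatch of the all-A kmer by enumeration."""
    target = "A" * k
    hits = sum(
        1
        for kmer in product("ACGT", repeat=k)
        if sum(x != y for x, y in zip(kmer, target)) <= 1
    )
    return Fraction(hits, 4 ** k)

for k in range(2, 7):
    # Both routes should agree, and both should equal (3k+1)/4^k.
    assert closed_form(k) == brute_force(k) == Fraction(3 * k + 1, 4 ** k)
    print(k, closed_form(k))  # Fraction prints in lowest terms, e.g. 10/64 as 5/32
```

The choice of the all-A target is arbitrary; by symmetry any specific kmer gives the same count.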
Working through 3mers, 4mers, and so on in the same way, you can then extrapolate how to count the two-mismatch case and beyond.
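One way to carry out that extrapolation is to count the matching kmers directly: choose which i positions mismatch (C(k, i) ways) and pick one of the 3 other bases at each of them (3^i ways), then sum over i from 0 up to the allowed number of mismatches m. The sketch below assumes uniform, independent bases as in the examples above; prob_within is a hypothetical helper name.

```python
from fractions import Fraction
from math import comb

def prob_within(k, m):
    """Probability that a random kmer is within m mismatches of a fixed kmer."""
    # C(k, i) choices of mismatching positions, 3 alternative bases at each.
    count = sum(comb(k, i) * 3 ** i for i in range(m + 1))
    return Fraction(count, 4 ** k)

print(prob_within(2, 1))  # 7/16
print(prob_within(3, 1))  # 5/32, i.e. 10/64
print(prob_within(3, 2))  # 37/64
```

Counting this way avoids the subtraction step entirely, because each kmer is placed in exactly one group according to its number of mismatches.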