Probability of sequence with mismatches
6.6 years ago
stacy734 ▴ 40

Hi all,

I'm drawing a blank as to how to calculate the probability of a sequence of length N with some number of mismatches.

For example, a specific 50-mer will occur once in 4^50 (about 1.267e30) random 50-mers.

(Ignoring reverse complements in this example, for clarity.)

What would it be if one mismatch were allowed, or two, or three, etc?

Any suggestions will be appreciated.

6.6 years ago

Perhaps consider the simple case of 2mers:

AA
AC
AG
AT
CA
...
GG

Each of these 2mers has a 1/16 probability of occurring, assuming the four bases are equally likely and independent.

We are interested in the 2mer AA and we allow up to one mismatch. Say we hold one base constant as A and allow the other base to vary, i.e.:

A{A, C, G, T}
{A, C, G, T}A

Each of these two events occurs with probability 1/4. However, adding them together counts AA twice, because AA appears in both expansions:

AA
AC
AG
AT

Or:

AA
CA
GA
TA

The chance of getting AA with up to one mismatch allowed is therefore: 2/4 - 1/16 = 8/16 - 1/16 = 7/16.

If the subtraction part seems strange, you could instead start from AA and list its six single-mismatch variants explicitly:

AC
AG
AT
CA
GA
TA

Each of these 2mers has a 1/16 probability, so the total is 1/16 (AA) + 6/16 (AC, AG, ..., TA) = 7/16.
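
The explicit count above can be checked by brute force. This is only a minimal sketch: it assumes all four bases are equally likely and independent, and the helper names `hamming` and `prob_within` are made up for this illustration.

```python
from fractions import Fraction
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def prob_within(target, max_mismatches, alphabet="ACGT"):
    """Probability that a uniformly random k-mer has at most
    max_mismatches mismatches against target (hypothetical helper)."""
    k = len(target)
    # Enumerate every possible k-mer and count those close enough to target.
    hits = sum(
        1
        for kmer in product(alphabet, repeat=k)
        if hamming(kmer, target) <= max_mismatches
    )
    return Fraction(hits, len(alphabet) ** k)

print(prob_within("AA", 1))  # 7/16
```

Enumeration grows as 4^k, so this is only practical for short mers; it is meant as a sanity check on the counting, not as a way to handle a 50-mer.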

In the case of 3mers:

AAA
AAC
AAG
AAT
ACA
ACC
...
GGG

Here, you have all 64 possible 3mers. The probability of getting any one specific 3mer without any mismatches is 1/64, or 1/4^3.

If you want a specific 3mer (say, AAA) while allowing up to one mismatch, you can hold two of the three bases constant and allow the third to change.

This gives three events, each with two bases held constant and the third variable:

AA{A, C, G, T}
A{A, C, G, T}A
{A, C, G, T}AA

Or:

AAA
AAC
AAG
AAT

Or:

AAA
ACA
AGA
ATA

Or:

AAA
CAA
GAA
TAA

The probability of each one of those three events is 1/16. If we sum the probability of these three events, we get 3/16. However, this counts the exact match AAA three times, instead of just once. We only want to consider the AAA event once, so we subtract the probability of the two additional AAA events: 3/16 - 2/64 = 12/64 - 2/64 = 10/64.

More generally, for a kmer with up to one mismatch allowed, this is: k/4^(k-1) - (k-1)/4^k.
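
Putting both terms over 4^k gives (4k - (k-1))/4^k = (3k+1)/4^k, which matches the direct count: one exact match plus 3 alternative bases at each of the k positions. The short check below is a sketch under the same equal-base-frequency assumption; the function names are illustrative only.

```python
from fractions import Fraction

def overlap_form(k):
    """k/4^(k-1) - (k-1)/4^k: sum the k patterns, then remove the
    k-1 extra copies of the exact match."""
    return Fraction(k, 4 ** (k - 1)) - Fraction(k - 1, 4 ** k)

def direct_count(k):
    """(1 + 3k)/4^k: the exact match plus 3 substitutions at each position."""
    return Fraction(1 + 3 * k, 4 ** k)

# The two expressions agree for every k tried here.
for k in range(1, 51):
    assert overlap_form(k) == direct_count(k)

print(overlap_form(2))  # 7/16
```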

Working through 3mers, 4mers, and so on in the same way, you can then extrapolate how to count the two-mismatch case, and beyond.
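
One way to sketch that extrapolation (not spelled out in the answer itself, so treat it as an assumption): to get exactly i mismatches in an N-mer, choose which i positions differ, C(N, i) ways, and give each of them one of the 3 other bases, so the number of N-mers with at most m mismatches is the sum of C(N, i) * 3^i for i = 0..m. The function name `prob_up_to` is hypothetical, and bases are again assumed equally likely and independent.

```python
from fractions import Fraction
from math import comb  # available in Python 3.8+

def prob_up_to(n, m, alphabet_size=4):
    """Probability that a uniformly random n-mer matches one specific
    n-mer with at most m mismatches (hypothetical helper)."""
    # Choose i mismatch positions, then one of the other bases at each.
    hits = sum(comb(n, i) * (alphabet_size - 1) ** i for i in range(m + 1))
    return Fraction(hits, alphabet_size ** n)

# The 50-mer from the question, allowing 0 to 3 mismatches.
for m in range(4):
    print(m, float(prob_up_to(50, m)))
```

For N = 2 and N = 3 with m = 1 this reproduces the 7/16 and 10/64 worked out above, and for the 50-mer it gives 1, 151 and 11176 matching sequences out of 4^50 for zero, one and two mismatches respectively.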

6.6 years ago
davidc • 0

Sorry if I have misunderstood:

0 mismatch 4^0 in 4^50
1 mismatch 4^1 in 4^50
2 mismatch 4^2 in 4^50
3 mismatch 4^3 in 4^50

where 4 is the number of possible nucleotides and the exponent is the number of mismatches allowed.

E.g. for 2 mismatches, the following 4^2 possibilities would be allowed at the two mismatched positions:

CC
CA
CT
CG
AC
AA 
AT
AG
TC
TA
TT
TG
GC 
GA
GT
GG
6.6 years ago
stacy734 ▴ 40

Hi Alex

Thank you for the very clear explanation, and for taking the time!

Stacy
