Is there a simple formula to calculate the probability of finding a given sequence of nucleotides in a target sequence? I have seen this formula:
a = (g/2)^(G+C) × ((1-g)/2)^(A+T),
where:
a = probability
g = G+C content of the target genome
G+C = number of G and C in the stretch
A+T = number of A and T in the stretch.
I tried to calculate the occurrence of a primer targeting E. coli: GTGTCCATTTATACGGACATCCATG
as follows. The GC content of E. coli is 50.8% (g = 0.508), and the primer has 11 G/C and 14 A/T, thus:
a = (0.508/2)^11 × (0.492/2)^14 = 2.84×10^-7 × 2.97×10^-9 = 8.44×10^-16
and the expected number of occurrences is:
n = 8.44×10^-16 × 16×10^6 = 1.35×10^-8
It looks to me as if the primer should not occur at all in the E. coli genome (which is fine for a primer, given that it should be present at most once in a genome). Is the formula correct? Or is there a simpler one that does not require raising numbers to such large powers (here I had to use R to get an answer because my scientific calculator could not handle it)?
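For reference, here is a minimal Python sketch of the calculation as I understand it. It sums logarithms instead of raising small numbers to large powers, which may be one way around the calculator problem. It assumes every base is independent and that G and C (and likewise A and T) are equally frequent, exactly as the formula does. The function names and the `n_positions` parameter are my own, and 16×10^6 is simply the number of positions I used in the calculation above.

```python
import math


def log10_match_probability(primer, gc_content):
    """Return log10 of the chance that `primer` matches at one given position.

    Assumes independent bases, with G and C each at frequency gc_content / 2
    and A and T each at frequency (1 - gc_content) / 2.
    """
    if not 0 < gc_content < 1:
        raise ValueError("gc_content must be a fraction strictly between 0 and 1")
    primer = primer.upper()
    n_gc = sum(base in "GC" for base in primer)
    n_at = sum(base in "AT" for base in primer)
    if n_gc + n_at != len(primer):
        raise ValueError("primer must contain only A, C, G and T")
    # A product of powers becomes a weighted sum of logs, so nothing underflows.
    return (n_gc * math.log10(gc_content / 2)
            + n_at * math.log10((1 - gc_content) / 2))


def expected_occurrences(primer, gc_content, n_positions):
    """Expected number of matches over n_positions candidate start sites."""
    return n_positions * 10 ** log10_match_probability(primer, gc_content)


primer = "GTGTCCATTTATACGGACATCCATG"
log_a = log10_match_probability(primer, 0.508)   # g = 50.8% GC
n = expected_occurrences(primer, 0.508, 16e6)    # 16e6 positions, as in my calculation
print(f"log10(a) = {log_a:.2f}")
print(f"expected occurrences = {n:.2e}")
```

Working in log10 also means the result can be checked by hand: each base contributes one term to the sum, so only two logarithms and a couple of multiplications are needed.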
Thank you.