Suppose we have two random sequences A and B of L nucleotide each. What is the probability to observe a common substring (kmer) of at least k nucleotides between those two strings ? What I have got so far :
#param :
k = 3 #common substring must be of at least 3 nucleotides
L = length(A) = length(B) = 12 #the two random sequences are 12 nt long
prob_kmer = 1/(4**k) # 1/64, the probability to observe a particular sequence in a random kmer
N_prob_kmer = 1 - prob_kmer # 63/64, the probability to NOT observe a particular sequence in a random kmer
nb_kmer = L - k + 1 # 10, the number of possible kmer in a sequence of length 12
nb_comparison = nb_kmer**2 # 100, the number of kmer comparison between the two sequences of length 12
P = 1 - ((N_prob_kmer)**(nb_comparison)) # 80%, probability to find at least one of the kmer of sequence A matching a kmer of sequence B.
Is this correct ? I'm concerned that subsequent k-mer do not have independent sequences (they share some nucleotide since they are overlapping) so the probabilities are not independent either... But I have no idea how to take that into account in my calculation.
Thanks.
Gamblers have a similar problem - for example in the case of finding the odds of getting heads n times in a row in m tries of flipping a coin. This problem with the coins was solved by de Moivre in 1738 (for details see the post at math.stackexchange.com). I implemented this in Python (might be helpful) [but I'm not sure this is right]:
Thanks for your answer in python, its easier for me that way. I need to think through this a bit more, but are you sure that this is correct ?
Although I don't fully understand the formula yet, I feel like this is the probability to find a specific k-mer of sequence A in sequence B, which is different than the common substring problem. For instance, what would happen with that setup ?
The longer I think about your problem, the more doubts I have concerning the answer I gave you. I need to think this through too :) I changed my answer to a comment. Please let me know if you find the solution (the problem is very interesting).
Sure. I just found that this problem has a name (Common Substrings in Random Strings) and a with a quick google search, that paper. Didn't read it yet, will update.