We are trying to find the a point to which a k-mer goes missing. If we have a k-mer that as a base of 1.1 mil, what would be the longest point at which the k-mer is too long?
We are trying to find the a point to which a k-mer goes missing. If we have a k-mer that as a base of 1.1 mil, what would be the longest point at which the k-mer is too long?
I think the question is not particularly well-posed. That is "what do you expect is the length of the shortest missing k-mer in the genome (i.e. it does not appear)?" depends on how I define my expectation of missing. However, one reasonable interpretation would be the following. I expect a k-mer to be missing if the expected number of times it occurs is less than 1. Then, in this case, the question is asking at what length do I expect there exist k-mers that I observe less than one time in the genome.
Consider the example in the question. Assume, you have a genome of 1.1M bases, and you are looking at 9-mers. Also, assume that each 9-mer appears with equal frequency (note, this is not exactly the same as assuming each nucleotide occurs with equal frequency, so it's worth noting how detailed of an analysis the question expects). You have 1,100,000 - 9 + 1 = 1099992 9-mers in this genome, and there are 4^9 = 262144 possible 9-mers. Therefore, if each 9-mer occurs with equal probability, you expect each 9-mer to occur 1099992 / 262144 = (approx) 4.196 times. What about 10-mers? That would be 1099991 / 1048576 = (approx) 1.05 times. At what k-mer length would the expected value be < 1? Given the way the question is posed, this is my interpretation of it.
If your dictionary of bases is {A, C, G, T}, and all bases are equally likely in your hypothetical genome, then the expected frequency of an n-mer is 1/4^n. For a 9-mer, then, the frequency is 1:262144.
Say you walk through your 1.1M nt genome. If you want to look at each 9-mer across this genome, there are 1.1M - 9 + 1 possible 9-mers to investigate.
At each 9-base window across the hypothetical genome, there is a 1/262144 probability of finding your 9-mer of interest.
At each position across the genome, define a "success" event as finding your k-mer of interest, and define a "failure" event as not finding it. You can use the binomial distribution to get the probability of finding the number of successes from some number of trials.
For instance, consider a six-sided die with faces "1" through "6". You roll the die twice. What are the odds that you see a face of interest — say, a six — zero, one or two times?
> dbinom(0:2, 2, 1/6)
[1] 0.69444444 0.27777778 0.02777778
In other words, there's a 69% chance that you don't see a "6" after two rolls. There's a 28% chance that you see a "6" once from the two rolls. There's a 3% chance that you roll a "6" twice in a row.
Consider a 262144-sided die. Each face of this weird die is labeled with one of each possible 9-mer. For any 9-mer, you want to know what the odds are that you roll the die 1099992 times, and you never get one particular 9-mer to come up:
> dbinom(0, 1100000-9+1, 1/(4^9))
[1] 0.0150535
In other words, you have a 1.5% chance of not finding any matches for any given 9-mer across your genome.
As k goes down, these odds quickly worsen. For example, it is quite unlikely not to find a 6-mer over a 1.1M base genome with equal base distribution:
> dbinom(0, 1100000-6+1, 1/(4^6))
[1] 2.261891e-117
In reality, bases and kmers are not distributed equally, so these odds are useful for a probability exercise, but not as useful for real-world genomic data.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
I'm not sure why this was deleted after people spent effort answering this. I reverted the deletion.
This sounds like it was copied and pasted from a homework assignment. I encourage teachers to issue homework on paper, as that requires the student to at least put in the effort of retyping it when posting on Biostars.