Hey all,
I have been trying to figure out the equation to figure out what the shortest length a k-mer can be to be unique in a large genome.
Anything would help,
Thank you
Hey all,
I have been trying to figure out the equation to figure out what the shortest length a k-mer can be to be unique in a large genome.
Anything would help,
Thank you
I once had to do this for a wet-lab challenge - "what is the two shortest primer pairs can can amplify something in the human genome".
You'd be surprised, even at 10bp there are quite a few unique 10mers which only crop up once in the genome:
x is kmer length, y is occurrences in human genome.
To be honest, I no longer have the code, but the method was to take a given window size, say 10bp, then take the first 10bp of the genome fasta file and either add it to a hashtable, or, if it was already in the hashtable, mark the existing entry for that fragment in the table as repeated, or if its already marked as repeated, do nothing.
The result was apparently 1,000,000 fragments marked as repeated, and a couple of 1000 marked as unique. I repeated the process for a bunch of different window sizes, which is what gave this graph. There are 101 ways this could have been improved, but at the time I had just started learning python, and this was a nice experiment.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
I am having trouble to understand you question? Are you looking for a single unique k-mer or a length at which all k-mers are unique? Could you elaborate?
probability is ( 1 / 4^n ) (n = length of sequence). Find an n value which makes the denominator larger than your genome size.
There is no shortest length equation. This genome:
...has a shortest unique kmer of 1 (T).
On the other hand, if you remove the T, the genome has no unique kmers except for itself, and if it is circular, there are no unique kmers, period. So, it's completely sequence-specific. Though as Adrian points out, you can estimate the length at which a kmer might be unique.
For a specific genome, you can find it like this (using the BBMap package):
...etc. Eventually, the khist file will indicate that some kmers have a count of 1. At that point, you know the answer.