How to find the shortest k-mer length that is unique in a large genome
1
0
Entering edit mode
9.2 years ago

Hey all,

I have been trying to figure out the equation to figure out what the shortest length a k-mer can be to be unique in a large genome.

Anything would help,

Thank you

genome • 5.5k views
ADD COMMENT
1
Entering edit mode

I am having trouble to understand you question? Are you looking for a single unique k-mer or a length at which all k-mers are unique? Could you elaborate?

ADD REPLY
1
Entering edit mode

probability is ( 1 / 4^n ) (n = length of sequence). Find an n value which makes the denominator larger than your genome size.

ADD REPLY
1
Entering edit mode

There is no shortest length equation. This genome:

GCGCGC...GCTGC...GCGCGC
#          ^

...has a shortest unique kmer of 1 (T).

On the other hand, if you remove the T, the genome has no unique kmers except for itself, and if it is circular, there are no unique kmers, period. So, it's completely sequence-specific. Though as Adrian points out, you can estimate the length at which a kmer might be unique.

For a specific genome, you can find it like this (using the BBMap package):

kmercountexact.sh in=genome.fa k=1 out=khist1.txt
kmercountexact.sh in=genome.fa k=2 out=khist2.txt
kmercountexact.sh in=genome.fa k=3 out=khist3.txt

...etc. Eventually, the khist file will indicate that some kmers have a count of 1. At that point, you know the answer.

ADD REPLY
4
Entering edit mode
9.2 years ago
John 13k

I once had to do this for a wet-lab challenge - "what is the two shortest primer pairs can can amplify something in the human genome".

You'd be surprised, even at 10bp there are quite a few unique 10mers which only crop up once in the genome:

x is kmer length, y is occurrences in human genome.

ADD COMMENT
0
Entering edit mode

How did you compute this?

ADD REPLY
3
Entering edit mode

To be honest, I no longer have the code, but the method was to take a given window size, say 10bp, then take the first 10bp of the genome fasta file and either add it to a hashtable, or, if it was already in the hashtable, mark the existing entry for that fragment in the table as repeated, or if its already marked as repeated, do nothing.

The result was apparently 1,000,000 fragments marked as repeated, and a couple of 1000 marked as unique. I repeated the process for a bunch of different window sizes, which is what gave this graph. There are 101 ways this could have been improved, but at the time I had just started learning python, and this was a nice experiment.

ADD REPLY

Login before adding your answer.

Traffic: 1852 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6