Question

How to find the shortest k-mer length that is unique in a large genome

0

Entering edit mode

9.6 years ago

abalakrishnan95 • 0

Hey all,

I have been trying to figure out the equation to figure out what the shortest length a k-mer can be to be unique in a large genome.

Anything would help,

Thank you

genome • 5.9k views

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 9.6 years ago by abalakrishnan95 • 0

1

Entering edit mode

I am having trouble to understand you question? Are you looking for a single unique k-mer or a length at which all k-mers are unique? Could you elaborate?

ADD REPLY • link 9.6 years ago by thackl ★ 3.0k

1

Entering edit mode

probability is ( 1 / 4^n ) (n = length of sequence). Find an n value which makes the denominator larger than your genome size.

ADD REPLY • link 9.6 years ago by Adrian Pelin ★ 2.7k

1

Entering edit mode

There is no shortest length equation. This genome:

GCGCGC...GCTGC...GCGCGC
#          ^

...has a shortest unique kmer of 1 (T).

On the other hand, if you remove the T, the genome has no unique kmers except for itself, and if it is circular, there are no unique kmers, period. So, it's completely sequence-specific. Though as Adrian points out, you can estimate the length at which a kmer might be unique.

For a specific genome, you can find it like this (using the BBMap package):

kmercountexact.sh in=genome.fa k=1 out=khist1.txt
kmercountexact.sh in=genome.fa k=2 out=khist2.txt
kmercountexact.sh in=genome.fa k=3 out=khist3.txt

...etc. Eventually, the khist file will indicate that some kmers have a count of 1. At that point, you know the answer.

ADD REPLY • link updated 2.7 years ago by Ram 45k • written 9.6 years ago by Brian Bushnell 20k

Ram · Answer 1 · 2015-10-05

4

Entering edit mode

9.6 years ago

John 13k

I once had to do this for a wet-lab challenge - "what is the two shortest primer pairs can can amplify something in the human genome".

You'd be surprised, even at 10bp there are quite a few unique 10mers which only crop up once in the genome:

x is kmer length, y is occurrences in human genome.

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 9.6 years ago by John 13k

0

Entering edit mode

How did you compute this?

ADD REPLY • link 9.6 years ago by reza.jabal ▴ 580

3

Entering edit mode

To be honest, I no longer have the code, but the method was to take a given window size, say 10bp, then take the first 10bp of the genome fasta file and either add it to a hashtable, or, if it was already in the hashtable, mark the existing entry for that fragment in the table as repeated, or if its already marked as repeated, do nothing.

The result was apparently 1,000,000 fragments marked as repeated, and a couple of 1000 marked as unique. I repeated the process for a bunch of different window sizes, which is what gave this graph. There are 101 ways this could have been improved, but at the time I had just started learning python, and this was a nice experiment.

ADD REPLY • link 9.6 years ago by John 13k