Question

Where does the choice of k = 51 for de Bruijn graphs of large genomes come from?

2.3 years ago

sebastian.schmidt.helsinki ▴ 10

From hearsay I know that de Bruijn graphs of large genomes (e.g. human) are usually constructed with k = 51, or that k = 51 is at least a good initial choice.

However, I have been unable to find any source for this. Does anyone know where it comes from?

de-bruijn-graph • 933 views

score 2 · Accepted Answer · 2022-08-03

2.3 years ago

Matthias Zepper 5.0k

For which application and what sequencing technology?

For efficient alignment, k = 51 is clearly too big. For genome assembly, a k of 51 is still within a reasonable range, but already on the high side. You can run KmerGenie to estimate the best k for assembling a given genome, repeats included. Keep in mind, however, that the larger the k, the fewer reads contain any given k-mer, so assemblies with a large k-mer size start to resemble the greedy algorithm. Memory is much less of a concern on today's hardware than it was in the 1990s, so one can be a bit more permissive, but something in the range of 31-35 should still do well for most assemblies, in particular if your base-call error rate isn't 0.
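To make the coverage and error-rate trade-off concrete, here is a minimal Python sketch. It uses the standard approximations that a read of length L contains L - k + 1 k-mers and that a k-mer is error-free with probability (1 - e)^k under independent base errors. The 30x coverage, 100 bp read length and 1% error rate are made-up illustration values, not recommendations, and the function names are hypothetical.

```python
def kmer_coverage(base_coverage, read_length, k):
    """Expected k-mer coverage: C * (L - k + 1) / L.

    A read of length L contains L - k + 1 k-mers, so the share of the
    base coverage that survives as k-mer coverage shrinks as k grows.
    """
    if k > read_length:
        return 0.0
    return base_coverage * (read_length - k + 1) / read_length


def error_free_fraction(error_rate, k):
    """Probability that a k-mer has no sequencing error.

    Assumes independent, uniform per-base errors (a simplification).
    """
    return (1.0 - error_rate) ** k


if __name__ == "__main__":
    # Assumed illustration values, not taken from the thread.
    coverage, read_length, error_rate = 30, 100, 0.01
    for k in (21, 31, 35, 51):
        cov = kmer_coverage(coverage, read_length, k)
        ok = error_free_fraction(error_rate, k)
        print(f"k={k}: k-mer coverage {cov:.1f}x, error-free fraction {ok:.3f}")
```

Under these assumptions, going from k = 31 to k = 51 lowers both the k-mer coverage and the share of error-free k-mers, which is the effect described above.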

What is correct, however, is that odd k-mer sizes are usually preferable. With an even k, a k-mer can be identical to its own reverse complement (a DNA palindrome), which creates ambiguity in the de Bruijn graph.
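The palindrome argument can be checked by brute force for small k. The sketch below (function names are my own, and it enumerates all 4^k k-mers, so it is only meant for small k) counts the k-mers that equal their own reverse complement. With an odd k the middle base would have to be its own complement, which is impossible, so the count is zero.

```python
from itertools import product

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def count_self_reverse_complement(k):
    """Count k-mers identical to their own reverse complement.

    Enumerates all 4**k k-mers, so keep k small.
    """
    return sum(
        1
        for kmer in map("".join, product("ACGT", repeat=k))
        if kmer == reverse_complement(kmer)
    )


if __name__ == "__main__":
    for k in range(2, 8):
        print(k, count_self_reverse_complement(k))
```

For even k the count is 4^(k/2), because the first half of the k-mer determines the second half; for odd k it is always 0.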

Thanks for the detailed answer! The application would be genome assembly from short reads. Well, actually, what we are doing is storing a k-mer set in small space, so the question is really a general one about any kind of k-mer-based method. That probably makes it hard to answer, though.

Reply • 2.3 years ago by sebastian.schmidt.helsinki ▴ 10

Well, unfortunately, the nitty-gritty details required for algorithm design escape me. But I would recommend taking a look at:

Ben Langmead's four lectures on genome assembly in the Data Science of Sequencing course for the basic theory.
The Kallisto paper and the KmerGenie papers, both of which elaborate on the k-mer size and the theoretical background.
This excellent review on challenges and algorithmic solutions in genome assembly.

In general, though, high-quality genome assemblies nowadays use a combination of short reads and long reads or Hi-C data. No amount of k-mer size optimization will give you gains in assembly quality comparable to those from incorporating this additional information.

Reply • 2.3 years ago by Matthias Zepper 5.0k