Question

Kmergenie k-mer estimate and multiple k-mers

8.3 years ago

ed3716 • 0

Hi,

(a) From what I have read, kmergenie does not recommend using its single k-mer estimate with multi-k-mer assemblers. However, if I am not using the assembler's default k-mer values, would it make sense to include kmergenie's single k-mer estimate among the different k-mer values provided to the assembler?

(b) If the kmergenie output graph does not follow the typical curve of a normal distribution (showing multiple peaks rather than a single one), should I distrust the recommended value? And if I am looking for more than one k-mer, should I consider the highest peaks to be the most recommended k-mers for an assembly?

Thanks!

Edu

Assembly kmer kmergenie • 4.0k views

ADD COMMENT • link updated 7.8 years ago by Rayan Chikhi ★ 1.5k • written 8.3 years ago by ed3716 • 0

This is a tough one to answer because you start from the requirement that you will not use kmergenie as intended. In fact, you want to do something that the abstract of the program's guide specifically says not to do. While I'd imagine that using the k-mer sizes suggested by the multiple peaks in the output for your multi-k-mer assembler would be a good idea, it could be that as soon as you pick one peak, the other peaks become less beneficial, and it's impossible to know from kmergenie's output how things change.

So, long story short, I'm not sure anyone will know what to do, and you should probably try lots of different things and see which gives you the best assembly.

ADD REPLY • link 8.3 years ago by John 13k

score 0 · Answer 1 · 2016-10-02

The graphs produced by KmerGenie are specificity vs. sensitivity comparisons: depending on the k-mer size, you get a set of sequences that are unique versus the total number of unique sequences. If you have too few unique sequences, you can't extend your contigs; if they aren't unique enough, you'll end up with a messy assembly in which unrelated sequences are able to overlap. So you will always end up with a choice between long contigs and high-quality contigs. Every assembler that uses de Bruijn graphs will handle this differently. So, to answer your question:

A) No, a single value from KmerGenie does not guarantee the best assembly, so it's best to try different k-mers and compare stats such as N50/L50, assembly size, read coverage, and maybe gene annotations to see how they differ between runs and assemblers. Then pick whichever measurements you think are most important to your project to define what the best assembly is.

B) If you get multiple peaks, keep in mind that the y-axis doesn't run from 0 to 10; it's often on the order of 10^7, so small-looking differences might actually be pretty big if you look at the numbers. Secondly, the optimal number is just the highest point and should not be used as the definitive answer. This comes back to sensitivity vs. specificity: higher k-mer sizes make the k-mers more specific (unique), which might result in higher-quality, although shorter, contigs. So the choice depends on what you need or want from your assembly. Are you building a de novo reference genome? Try a range of higher k-mers to get somewhat higher-quality contigs. Want to do some basic GWAS analysis that doesn't require complete chromosomes? Try some of the lower peaks to get longer contigs and thus more data to mess around with.

But in the end, the differences will probably lie more in the assembler than in the k-mer.
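The comparison suggested in (A) can be sketched in a few lines of Python. This is only an illustration, not part of KmerGenie or any assembler: the function names `n50` and `summarize` are hypothetical, and the contig lengths are made-up numbers standing in for the lengths you would read from each assembly's FASTA output.

```python
def n50(lengths):
    """Return the N50: the length of the contig at which the running total,
    taken from longest to shortest, first reaches half the assembly size."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def summarize(lengths):
    """Collect a few basic statistics for one assembly run."""
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }


# Hypothetical contig lengths (bp) from two runs with different k values.
runs = {
    31: [100, 80, 60, 40, 20],
    51: [150, 90, 30, 20, 10],
}

for k, lengths in sorted(runs.items()):
    print(k, summarize(lengths))
```

N50 alone rewards long contigs regardless of correctness, so it is worth reading it alongside the other measures mentioned above (assembly size, read coverage, annotations) rather than ranking runs by it directly.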

score 0 · Answer 2 · 2017-02-28

Hi,

Apologies for the delay in answering; I haven't been getting Biostar notifications lately.

(a): if you're doing multi-k assembly, I'd recommend using the default k values from the assembler unless you have a good reason not to. Adding more k values between the min-k and the max-k should not hurt, though. Thus, adding kmergenie's k value possibly won't hurt your assembly, but it also possibly won't significantly improve it.

(b): in case of two peaks, please use the diploid model; in case of three or more significant peaks, yes, the histogram is then outside the scope of what kmergenie expects, so I wouldn't trust the results. Lesley's remark that the y axis is logarithmic is a good one. "Peaks" at small y values are irrelevant.

Rayan
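The rule of thumb in (b) can be written down as a rough sketch. This is not how kmergenie itself decides anything: the helper names `significant_peaks` and `advice`, the 10% height cutoff used to discard "peaks" at small y values, and the histogram values are all assumptions made for illustration.

```python
def significant_peaks(counts, min_fraction=0.1):
    """Return indices of interior local maxima whose height is at least
    min_fraction of the tallest bin. The cutoff is an arbitrary assumption."""
    if not counts:
        return []
    floor = max(counts) * min_fraction
    peaks = []
    for i in range(1, len(counts) - 1):
        rises = counts[i] > counts[i - 1]
        falls = counts[i] >= counts[i + 1]
        if rises and falls and counts[i] >= floor:
            peaks.append(i)
    return peaks


def advice(n_peaks):
    """Map the number of significant peaks to the guidance given in (b)."""
    if n_peaks == 1:
        return "single peak: default model"
    if n_peaks == 2:
        return "two peaks: use the diploid model"
    if n_peaks >= 3:
        return "three or more peaks: do not trust the estimate"
    return "no clear peak"


# Made-up histogram (bin counts), with a tiny bump near the end.
histogram = [5, 40, 120, 60, 20, 90, 150, 70, 10, 12, 6]
peaks = significant_peaks(histogram)
print(peaks, advice(len(peaks)))
```

The filter works on raw counts rather than on the plotted curve, because on a logarithmic axis a bump at small y values can look far more prominent than it is.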