I have already answered this question indirectly in one of your previous queries, although in a different context.
Most modern assemblers know how to pick the best k-mer size as long as they are given enough options to work with. SPAdes has a -k option, which by default is set to auto, and the program will sample various k-mer sizes before picking the best. Since you seem to be a fan of error correction: for corrected data you can specify the --only-assembler option, since there is nothing left to correct. Personally, I would give the program uncorrected data and let it do its own error correction. A last piece of advice regarding this assembler: you may be tempted to use the --careful option, but for most datasets that will be unnecessary. In my hands, that option yields better results only for single genomes sequenced at extremely high depth.
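To make the SPAdes options above concrete, here is a minimal sketch that only builds the command line; it does not run the assembler. The function name `spades_command` and the file names are made up for illustration. The flags themselves (-1, -2, -o, -k, --only-assembler) are standard SPAdes options, but check `spades.py --help` for the version you have installed.

```python
from typing import List, Optional, Sequence


def spades_command(
    reads_1: str,
    reads_2: str,
    out_dir: str,
    kmers: Optional[Sequence[int]] = None,
    already_corrected: bool = False,
) -> List[str]:
    """Build a spades.py argument list for one paired-end library.

    Hypothetical helper: names and defaults are illustrative only.
    """
    cmd = ["spades.py", "-1", reads_1, "-2", reads_2, "-o", out_dir]
    if kmers:
        # SPAdes takes a comma-separated list of odd k-mer sizes.
        if any(k % 2 == 0 for k in kmers):
            raise ValueError("SPAdes k-mer sizes must be odd")
        cmd += ["-k", ",".join(str(k) for k in sorted(kmers))]
    # With no -k given, SPAdes keeps its default (auto).
    if already_corrected:
        # Skip the built-in read error correction stage.
        cmd.append("--only-assembler")
    return cmd


if __name__ == "__main__":
    # Uncorrected reads, k-mer choice left to the program.
    print(" ".join(spades_command("reads_R1.fastq.gz", "reads_R2.fastq.gz", "spades_out")))
```

Leaving `kmers` unset mirrors the advice above: the program is left to choose, and --only-assembler is added only when the reads were corrected beforehand.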
Same advice for MEGAHIT: it has several options to specify k-mers as a list, as a min-max range with fixed steps, or as a preset group of numbers. If no option is chosen, it will sample [21,29,39,59,79,99,119,141], which is a sensible choice because it covers a wide range of k-mers. While it is worth checking the other options for more customization, one can't go wrong by going with the default values and letting the assembler figure it out. It is worth feeding error-corrected reads to this assembler; it will generally do better with corrected than with raw reads.
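As a rough illustration of the list and min-max-step styles of specifying k-mers for MEGAHIT, the sketch below expands a range into an explicit list and builds a command line around it. The helper names `kmer_range` and `megahit_command` are hypothetical. The --k-list flag is a real MEGAHIT option, but `kmer_range` only shows what a fixed-step range looks like and is not claimed to reproduce how MEGAHIT expands its own range options internally.

```python
from typing import List, Optional, Sequence

# Default list quoted above, used by MEGAHIT when no k-mer option is given.
MEGAHIT_DEFAULT_KMERS = [21, 29, 39, 59, 79, 99, 119, 141]


def kmer_range(k_min: int, k_max: int, step: int) -> List[int]:
    """Expand a min-max range with a fixed step into an explicit list."""
    if k_min % 2 == 0:
        raise ValueError("k_min must be odd")
    if step <= 0 or step % 2 != 0:
        raise ValueError("step must be a positive even number so every k stays odd")
    return list(range(k_min, k_max + 1, step))


def megahit_command(
    reads_1: str,
    reads_2: str,
    out_dir: str,
    kmers: Optional[Sequence[int]] = None,
) -> List[str]:
    """Build a megahit argument list; omit --k-list to keep the defaults."""
    cmd = ["megahit", "-1", reads_1, "-2", reads_2, "-o", out_dir]
    if kmers is not None:
        cmd += ["--k-list", ",".join(str(k) for k in kmers)]
    return cmd


if __name__ == "__main__":
    # Default behaviour: let MEGAHIT sample its own list.
    print(" ".join(megahit_command("corr_R1.fq.gz", "corr_R2.fq.gz", "megahit_out")))
    # Custom range written out as an explicit list.
    print(" ".join(megahit_command("corr_R1.fq.gz", "corr_R2.fq.gz", "megahit_custom",
                                   kmers=kmer_range(21, 61, 10))))
```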
The thread you linked is for a different version of soapdenovo. Please follow what the program version you plan to use accepts. If soapdenovo2 wants a number between 13 and 31, you are not going to be able to use a number outside those bounds. It is often more worthwhile to try multiple runs and see what works best than to start from what seems to be an optimal setting. Every dataset is different, and general recommendations may not always produce the best result.
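The "try several runs and compare" approach can be sketched as follows. Everything here is an assumption for illustration: the 13 to 31 bounds are simply the ones mentioned above, the odd-only filter reflects the common convention for de Bruijn graph assemblers, and N50 is just one possible way to rank runs. The contig lengths would come from your own assemblies; the numbers in the example are invented.

```python
from typing import Dict, Iterable, List


def candidate_kmers(requested: Iterable[int], k_min: int = 13, k_max: int = 31) -> List[int]:
    """Keep only odd k values inside the range the assembler accepts.

    The default bounds are the ones discussed above; adjust them to
    whatever your program version documents.
    """
    return [k for k in sorted(set(requested)) if k_min <= k <= k_max and k % 2 == 1]


def n50(contig_lengths: Iterable[int]) -> int:
    """Length of the contig at which half of the total assembly is reached."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            return length
    return 0


def best_k(results: Dict[int, List[int]]) -> int:
    """Pick the k whose assembly has the highest N50 (one metric among many)."""
    return max(results, key=lambda k: n50(results[k]))


if __name__ == "__main__":
    ks = candidate_kmers([11, 13, 21, 24, 31, 41])
    print(ks)  # values outside 13..31 and even values are dropped
    # Invented contig lengths standing in for two real assembly runs.
    print(best_k({21: [10, 8, 6, 4, 2], 31: [20, 5, 5]}))
```

N50 alone can be misleading, so in practice it is worth looking at total assembly size and contig count for each run as well.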