Question

Estimate K-mer size for de novo assembly

0

2.6 years ago

melissachua90 ▴ 70

I want to estimate the k-mer size before performing de novo assembly of paired-end Illumina reads (using SOAPdenovo2). My read length is 151 bp.

What is the best k-mer estimation software? I've tried kmergenie installed via conda, but it exited with an error: ModuleNotFoundError: No module named 'readfq'

Hence, I'm looking for an alternative, or for a way to fix the error.

SOAPdenovo2 accepts odd numbers between 13 and 31. However, according to discussions (How To Choose The K Value Of Kmer In Soapdenovo?), it seems that the k-mer size should be 1/2 to 2/3 of the read length, which in my case would be ~75-100, exceeding the SOAPdenovo2 limit.

What are your suggestions?

de novo assembly soapdenovo estimation kmergenie K-mer • 3.2k views

updated 2.6 years ago by Mensur Dlakic ★ 28k • written 2.6 years ago by melissachua90 ▴ 70

0

The thread you linked is for a different version of soapdenovo.

Please follow what the program version you plan to use accepts. If SOAPdenovo2 wants a number between 13 and 31, you will not be able to use a number outside those bounds. It is often better to try several runs and see what works best than to start from what seems to be an optimal setting. Every dataset is different, and general recommendations may not always produce the best result.
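To make the "try multiple runs" advice concrete, here is a minimal sketch that lists odd k values inside whatever bounds the assembler accepts. The function name `odd_kmer_candidates` and the step of 6 are illustrative choices, not part of any tool, and the 13 to 31 bounds are simply the ones quoted in the question.

```python
from __future__ import annotations


def odd_kmer_candidates(k_min: int, k_max: int, step: int = 2) -> list[int]:
    """Return odd k values in [k_min, k_max], spaced by an even step."""
    if k_min > k_max:
        raise ValueError("k_min must not exceed k_max")
    if step <= 0 or step % 2 != 0:
        raise ValueError("step must be a positive even number so every k stays odd")
    # Start from the first odd value at or above k_min.
    first = k_min if k_min % 2 == 1 else k_min + 1
    return list(range(first, k_max + 1, step))


# Bounds taken from the question; substitute whatever your build accepts.
for k in odd_kmer_candidates(13, 31, step=6):
    print(f"plan one assembly run with K={k}")
```

Each value would then go into a separate assembly run, and the resulting assemblies can be compared on whatever metric matters for the project (contiguity, completeness, and so on).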

2.6 years ago by GenoMax 148k

score 0 · Answer 1 · 2022-04-29

I have already answered this question indirectly in one of your previous queries, although in a different context.

Most modern assemblers can pick the best k-mer size themselves as long as they are given enough options to work with. SPAdes has a -k option which is set to auto by default, and the program will sample various k-mer sizes before picking the best. Since you seem to be a fan of error correction: for already-corrected data you can specify the --only-assembler option, since there is nothing left to correct. Personally, I would give the program uncorrected data and let it do its own error correction. A last piece of advice regarding this assembler: you may be tempted to use the --careful option, but for most datasets that will be unnecessary. In my hands that option yields better results only for single genomes sequenced at extremely high depth.
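As a rough sketch of the SPAdes options mentioned above, the helper below only builds the command line and prints it; it does not run the assembler. The function name `spades_command` and the file names are hypothetical placeholders. The flags themselves (-1, -2, -o, -k, --only-assembler) are SPAdes options, and the check on k reflects the manual's requirement that explicit values be odd and below 128.

```python
from __future__ import annotations

import shlex


def spades_command(reads_1: str, reads_2: str, out_dir: str,
                   kmers: list[int] | None = None,
                   only_assembler: bool = False) -> list[str]:
    """Build (but do not run) a SPAdes command for paired-end reads."""
    cmd = ["spades.py", "-1", reads_1, "-2", reads_2, "-o", out_dir]
    if kmers:
        # Explicit k values must be odd and below 128; SPAdes wants them ascending.
        if any(k % 2 == 0 or k >= 128 for k in kmers):
            raise ValueError("k values must be odd and less than 128")
        cmd += ["-k", ",".join(str(k) for k in sorted(kmers))]
    # Leaving -k out keeps the default (auto) k-mer selection.
    if only_assembler:
        # Skip the built-in read error correction, e.g. for pre-corrected reads.
        cmd.append("--only-assembler")
    return cmd


# Hypothetical file names; default k selection, built-in error correction left on.
print(shlex.join(spades_command("reads_1.fastq.gz", "reads_2.fastq.gz", "spades_out")))
```

Passing `only_assembler=True` corresponds to the pre-corrected-reads case; the default call matches the recommendation to hand SPAdes uncorrected data.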

The same advice applies to MEGAHIT: it has options to specify k-mers as a list, as a min-max range with a fixed step, or as a preset group of numbers. If no option is chosen, it will use [21,29,39,59,79,99,119,141], which is a sensible choice because it covers a wide range of k-mer sizes. While it is worth checking the other options for more customization, one can't go wrong by going with the default values and letting the assembler figure it out. It is worth feeding error-corrected reads to this assembler, as it will generally do better with corrected than with raw reads.
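For anyone who does want a custom list, the sketch below validates one and builds a MEGAHIT command without running it. The function names and file names are hypothetical. The constraints checked (odd values, 15 to 255, ascending, increments of at most 28) are what I recall from MEGAHIT's help text for --k-list, so treat them as an assumption and confirm with `megahit --help` for your version.

```python
from __future__ import annotations

import shlex

# Default list quoted in the answer above.
DEFAULT_K_LIST = [21, 29, 39, 59, 79, 99, 119, 141]


def check_k_list(k_list: list[int]) -> None:
    """Raise ValueError if the list breaks the assumed --k-list constraints."""
    if any(k % 2 == 0 or not 15 <= k <= 255 for k in k_list):
        raise ValueError("every k must be odd and between 15 and 255")
    for smaller, larger in zip(k_list, k_list[1:]):
        if not 0 < larger - smaller <= 28:
            raise ValueError("k values must ascend in increments of at most 28")


def megahit_command(reads_1: str, reads_2: str, out_dir: str,
                    k_list: list[int] | None = None) -> list[str]:
    """Build (but do not run) a MEGAHIT command for paired-end reads."""
    cmd = ["megahit", "-1", reads_1, "-2", reads_2, "-o", out_dir]
    if k_list is not None:
        check_k_list(k_list)
        cmd += ["--k-list", ",".join(str(k) for k in k_list)]
    return cmd


check_k_list(DEFAULT_K_LIST)  # the default list passes these checks
# Hypothetical file names; omitting k_list leaves MEGAHIT on its defaults.
print(shlex.join(megahit_command("reads_1.fastq.gz", "reads_2.fastq.gz", "megahit_out")))
```

Omitting `k_list` is the "let the assembler figure it out" route recommended above; the explicit list is only there for experiments.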