I have read the papers about EULER, Velvet and SOAPdenovo, but I am still confused about how to choose the K value. It is common practice to test several K values and choose the best one according to the results. But I think there may be some clues indicating the proper range of K, so that the K values need not be tested blindly. For example, K obviously has to be less than the maximal read length. Is there a way to roughly estimate the proper range of K values from the genome size, sequencing depth, read length or something else? How do you choose the K value? Many thanks.
And then you may need to perform several trial runs with different K-mer sizes around your initial choice, and select the best one.
How do I know which K-mer gives the better results? Thanks!
After assembly, you can calculate some statistics such as contig N50 and N90, scaffold N50 and N90, and total scaffold length. Usually, a better K-mer gives larger contig/scaffold N50/N90 values. But the total scaffold length should not deviate too much from the estimated genome size (you should estimate the genome size using an experimental method such as flow cytometry).
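To make the comparison concrete, here is a minimal Python sketch of those statistics. It assumes you have already pulled the contig or scaffold lengths out of each assembly; the function names (nx, summarize) and the example lengths are made up for illustration and do not come from any assembler.

```python
def nx(lengths, fraction):
    """Nx statistic: walk the sequences from longest to shortest and return
    the length at which the running total first reaches `fraction` of the
    total assembly length (fraction=0.5 gives N50, 0.9 gives N90)."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return ordered[-1]


def summarize(lengths, estimated_genome_size):
    """Collect the statistics used to compare assemblies built with different K."""
    total = sum(lengths)
    return {
        "n50": nx(lengths, 0.5),
        "n90": nx(lengths, 0.9),
        "total_length": total,
        # Ratio near 1.0 means the assembly size matches the genome size estimate.
        "length_ratio": total / estimated_genome_size,
    }


# Hypothetical lengths for two trial runs, keyed by the K used.
trial_runs = {
    31: [100, 80, 60, 40, 20],
    51: [150, 90, 30, 20, 10],
}
for k, lengths in sorted(trial_runs.items()):
    print(k, summarize(lengths, estimated_genome_size=320))
```

You would then prefer the K whose run has the larger N50/N90 while length_ratio stays close to 1.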
Quite reasonable. Many thanks.
I thought that in a perfect experiment we'd want a single contig that covers the whole genome. Why is "longer kmer will result in few long contigs" a bad thing?
Imperfect coverage and sequencing errors. Sufficiently many error-free k-mers need to cover each position in a contig. Take a look at the KmerGenie paper for a longer discussion.
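A rough way to see the effect of sequencing errors: if you assume errors hit each base independently with the same probability e (a simplification, real error profiles are not uniform), a k-mer is error-free with probability (1 - e)^k, which shrinks as k grows. The function name below is just for illustration.

```python
def error_free_kmer_fraction(k, per_base_error_rate):
    """Expected fraction of k-mers with no sequencing error, ASSUMING
    independent, uniform per-base errors (a simplifying assumption)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not 0.0 <= per_base_error_rate <= 1.0:
        raise ValueError("error rate must be between 0 and 1")
    return (1.0 - per_base_error_rate) ** k


# Compare a few k values at a hypothetical 1% per-base error rate.
for k in (21, 31, 51, 71):
    print(k, round(error_free_kmer_fraction(k, 0.01), 3))
```

So with a longer k, a smaller share of the k-mers from each read is usable, and you need more raw coverage to keep every position covered by error-free k-mers.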
Is it really true? Shorter k-mers cannot span repeats of the same or greater length, but on the other hand they help you build a denser graph (basically, two reads are connected in the graph only if they overlap by at least the k-mer size). Therefore I guess it depends a lot on the coverage you have: the lower the coverage, the smaller the k-mer you have to choose, because otherwise even non-complex regions won't be resolved.
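The dependence on coverage can be put into numbers with the k-mer coverage relation given in the Velvet manual, Ck = C * (L - k + 1) / L, where C is the base coverage and L the read length. Below is a small Python sketch of it. The helper largest_odd_k and the minimum k-mer coverage you pass to it are my own illustration, not a rule from any assembler; it only tries odd k because Velvet requires odd values.

```python
def kmer_coverage(base_coverage, read_length, k):
    """k-mer coverage Ck = C * (L - k + 1) / L for base coverage C,
    read length L and k-mer size k."""
    if not 1 <= k <= read_length:
        raise ValueError("k must be between 1 and the read length")
    return base_coverage * (read_length - k + 1) / read_length


def largest_odd_k(base_coverage, read_length, min_kmer_coverage):
    """Largest odd k whose k-mer coverage is still at least
    `min_kmer_coverage` (a user-chosen, hypothetical threshold).
    Returns None if no odd k satisfies it."""
    start = read_length if read_length % 2 == 1 else read_length - 1
    for k in range(start, 0, -2):
        if kmer_coverage(base_coverage, read_length, k) >= min_kmer_coverage:
            return k
    return None


# Hypothetical data sets: same 100 bp reads, different sequencing depth.
for depth in (15, 30, 60):
    print(depth, largest_odd_k(depth, 100, min_kmer_coverage=10))
```

This gives only an upper bound for the trial range: lower depth pushes the usable k down, as argued above, and the actual best k still has to be picked from the trial assemblies.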