I'm working on a non-model organism for which we do have a reference genome. I recently learned about the k-mer spectrum and that it can be used to estimate genome length by following a pretty straightforward tutorial.
This method, however, requires setting a length k for the k-mers, and according to some authors a proper choice of k should itself depend on the genome length (Chor et al. 2009), which is reasonable.
Thus, I ran jellyfish following the tutorial above and estimated the genome length with both k=10 and k=14, based on the 120 Mb genome length of a closely related species.
The conflict comes from the differing outputs. With k=10 the spectrum looks great, a typical unimodal distribution. The peak of the distribution is also close to the real coverage, so again this is great news. But the genome length estimate is only about 3-4 Mb! If I use k=14 the opposite happens: the distribution is all skewed to the left, but the estimate is roughly 128 Mb, which is plausible for this organism.
How would you double-check which of these outcomes is correct? And if I didn't have this prior idea of the genome length (~120 Mb), how would one deal with this problem?
Wouldn't it be simpler to get a VCF file with base-pair resolution and count the total number of positions? Isn't that adequate? How do you measure genome length in practice?
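For context, the back-of-the-envelope estimate most of these tutorials use is: total number of (non-error) k-mers divided by the coverage at the histogram peak. A minimal sketch in Python, assuming a jellyfish `histo`-style list of (coverage, count) pairs with made-up numbers:

```python
# Hypothetical k-mer histogram: (coverage, number of distinct k-mers observed
# at that coverage). The numbers below are invented for illustration only.
histogram = [(1, 9_000_000), (2, 1_200_000), (10, 50_000),
             (24, 800_000), (25, 1_000_000), (26, 900_000), (30, 400_000)]

# Drop the low-coverage tail, which is dominated by sequencing errors.
error_cutoff = 5
filtered = [(cov, n) for cov, n in histogram if cov >= error_cutoff]

# The mode of the remaining distribution estimates the per-base k-mer coverage.
peak_cov = max(filtered, key=lambda cn: cn[1])[0]

# Genome size ~ total k-mers / peak coverage (haploid, single-peak case).
total_kmers = sum(cov * n for cov, n in filtered)
genome_size = total_kmers / peak_cov
print(peak_cov, round(genome_size))  # prints: 25 3204000
```

If the estimate with k=10 came out at 3-4 Mb despite a clean peak, the numerator (distinct k-mers weighted by coverage) rather than the peak position is usually where the problem lies.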
Here is one more guide for genome size estimation; judging by its recommendations, perhaps the k you used is too small. This is a new mapping-based approach.
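A common rule of thumb (not necessarily the exact formula from that guide) is to pick k so that the k-mer space 4^k is larger than the genome, i.e. k > log4(G); with k=10 there are only 4^10 ≈ 1.05M distinct k-mers, far fewer than 120M positions, so nearly every k-mer recurs by chance. A quick check in Python:

```python
import math

def min_k(genome_size):
    # Smallest k such that 4**k > genome_size, i.e. k > log4(G).
    return math.floor(math.log(genome_size, 4)) + 1

G = 120_000_000        # ~120 Mb, from the related species
print(min_k(G))        # prints: 14  (4**13 < 120M < 4**14)
print(4**10)           # prints: 1048576 -- why k=10 saturates
```

This is consistent with the OP's observation: k=14 clears the threshold and gives a plausible ~128 Mb, while k=10 cannot distinguish enough k-mers to represent the genome.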
What about counting the number of bases in the reference?
How do you think that VCF file was obtained in the first place? Could it be based on the reference genome?
The reference genome is given in FASTA format, so it is not that straightforward to count the number of bases.
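For what it's worth, counting bases in a FASTA only requires skipping the header lines; a minimal sketch (the `genome.fa` path is a placeholder, and Ns are tallied separately since assembly gaps inflate the total relative to the resolved sequence):

```python
def fasta_lengths(path):
    """Return (total bases, N count) for a FASTA file, headers excluded."""
    total = n_count = 0
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                continue  # header line, not sequence
            seq = line.strip()
            total += len(seq)
            n_count += seq.upper().count("N")
    return total, n_count
```

Alternatively, `samtools faidx genome.fa` and summing the second column of the resulting `.fai` index gives the same total without custom code.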
The VCF is an option for sure, but it also depends on the quality of the alignments, and the full reference genome may not be completely represented in the alignment files and consequently in the VCF.