Hello everyone,
Most of us working on large repetitive genomes are probably familiar with kmer distribution analysis on raw short reads, where we find the main peak for the diploid portion of the genome, a hump in case of polyploidy, and a smaller, flatter peak for the duplicated portion of the genome. This is usually done to get a kmer-based genome size estimate.
My question is, does it make sense to look at the distribution of kmers in already assembled sequences? And if it does make sense, is it more logical to use long or short kmers?
I looked at the 7mer, 21mer, 55mer and 155mer distributions in an assembly of beet (plant, eudicotyledon). It's 'just' a peakless descending curve, where a hump is sometimes distinguishable. On a biological level, is this informative in any way?
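To make clear what I mean by "distribution", here is a minimal Python sketch of the kind of histogram I am describing: for each multiplicity, how many distinct kmers occur that many times in the assembled sequences. The function name `kmer_spectrum` and the toy contigs are made up for illustration. It counts kmers on the forward strand only and holds everything in memory, whereas dedicated kmer counters typically use canonical kmers and scale to whole genomes, so treat it as a sketch rather than what I actually ran.

```python
from collections import Counter


def kmer_spectrum(sequences, k):
    """Return {multiplicity: number of distinct kmers seen that many times}."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip kmers that span assembly gaps
                counts[kmer] += 1
    # Histogram of the counts: how many kmers have each multiplicity
    return Counter(counts.values())


# Toy contigs, invented for illustration (not beet sequence)
contigs = ["ACGTACGTACGA", "TTACGTACGG"]
spectrum = kmer_spectrum(contigs, k=4)
for multiplicity in sorted(spectrum):
    print(multiplicity, spectrum[multiplicity])
```

In an assembly most kmers are expected to occur once, so the histogram starts high at multiplicity 1 and decays, with repeats forming the tail.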
Cheers, Alex
Not sure about your exact question, but I've used KAT to compare the k-mers from the short reads used to make an assembly against the assembly itself, to get an idea of the assembly quality. See https://kat.readthedocs.io/en/latest/walkthrough.html#genome-assembly-analysis-using-k-mer-spectra
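The idea behind that kind of comparison can be sketched in a few lines of Python. This is not KAT's code or API, just a toy illustration with hypothetical function names (`count_kmers`, `reads_vs_assembly`) and made-up sequences: for every distinct k-mer in the reads, record how often it occurs in the reads and how many copies of it the assembly contains. It assumes forward-strand k-mers and in-memory counting, so it is only suitable for tiny examples.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-mer (forward strand only) across the given sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def reads_vs_assembly(reads, assembly, k):
    """Map (read multiplicity, assembly copy number) -> number of distinct k-mers."""
    read_counts = count_kmers(reads, k)
    asm_counts = count_kmers(assembly, k)
    table = Counter()
    for kmer, depth in read_counts.items():
        # Copy number 0 means the k-mer is in the reads but missing from the assembly
        table[(depth, asm_counts.get(kmer, 0))] += 1
    return table


# Toy data, invented for illustration
reads = ["ACGTA", "ACGTA", "CGTAC"]
assembly = ["ACGTA"]
for (depth, copies), n in sorted(reads_vs_assembly(reads, assembly, k=4).items()):
    print(f"read multiplicity {depth}, assembly copies {copies}: {n} k-mer(s)")
```

K-mers with high read multiplicity but zero copies in the assembly hint at missing sequence, while low-multiplicity k-mers absent from the assembly are mostly sequencing errors. Note that this sketch ignores k-mers that occur only in the assembly.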