Question

Does it make any sense to run a kmer analysis on assembled sequences?

6.2 years ago

al.bodrug ▴ 50

Hello everyone,

Most of us working on large repetitive genomes are probably familiar with kmer distribution analysis on raw short reads, where we find a peak for the diploid portion of the genome, a hump in the case of polyploidy and a smaller, flatter peak for the duplicated portion of the genome. This is usually done for kmer-based genome size estimation.

My question is: does it make sense to look at the distribution of kmers in already assembled sequences? And if it does, is it more logical to use large or short kmers?

I looked at the 7mer, 21mer, 55mer and 155mer distributions in an assembly of beet (plant, eudicotyledon). Each is 'just' a peakless descending curve, where a hump is sometimes distinguishable. On a biological level, is this informative in any way?

Cheers, Alex

Assembly genome sequence • 2.6k views

updated 6.2 years ago by Corentin ▴ 660 • written 6.2 years ago by al.bodrug ▴ 50

I'm not sure about your exact question, but I've used KAT to compare the k-mers in the short reads used to build an assembly against the assembly itself, which gives an idea of the assembly quality. See https://kat.readthedocs.io/en/latest/walkthrough.html#genome-assembly-analysis-using-k-mer-spectra

6.2 years ago by jean.elbers ★ 1.7k

score 1 · Answer 1 · 2019-08-09

This does not really make sense. Kmer analysis is more useful when applied to reads, for two reasons:

Assemblies often represent only one haplotype, so you will not be able to infer the ploidy from the assembly.

Do not forget that the x-axis of the kmer plot represents the frequency of the kmer (how many times it appears in your sequences). With reads, this is often used to assess the coverage. In an assembly, however, each region is represented once, so you have a "coverage of 1" (apart from the repeated sequences). This explains the peakless curve: almost all kmers occur once, and only kmers from repeats show up at higher frequencies.
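To make this concrete, here is a minimal sketch of how such a spectrum can be computed for an assembled sequence. The sequences are made up for illustration, and the function name `kmer_spectrum` is my own. It is a toy: it ignores reverse complements, which real counters usually merge with the forward kmer.

```python
from collections import Counter

def kmer_spectrum(seq, k):
    """Return {multiplicity: number of distinct kmers seen that many times}."""
    seq = seq.upper()
    # Count every overlapping kmer in the sequence.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Histogram of those counts: this is what the kmer plot shows.
    return Counter(counts.values())

# Made-up toy "assembly": unique sequence plus one repeat present twice.
unique_part = "ACGTTGCAAGGCTTAACCGGATCG"
repeat_part = "TTTGGGCCCAAA"
assembly = unique_part + repeat_part + "ACTGACTA" + repeat_part

for multiplicity, n_kmers in sorted(kmer_spectrum(assembly, 5).items()):
    print(multiplicity, n_kmers)
```

Most distinct kmers land at multiplicity 1 and only the repeat contributes to higher multiplicities, which is the descending, peakless shape described in the question.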

However, as jean.elbers mentioned in the comments, if you have access to the raw reads you can perform the kmer analysis on them, and with KAT (the K-mer Analysis Toolkit) you can compare the kmer content of your reads against the assembly to assess its completeness and duplication levels.
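The idea behind that comparison can be sketched in a few lines. This is not KAT's implementation, only a toy illustration with hypothetical function names (`count_kmers`, `compare_reads_to_assembly`): for every kmer seen in the reads, record its read multiplicity together with the number of copies found in the assembly.

```python
from collections import Counter

def count_kmers(seqs, k):
    """Count all overlapping kmers across a list of sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        counts.update(s[i:i + k] for i in range(len(s) - k + 1))
    return counts

def compare_reads_to_assembly(reads, contigs, k):
    """Return {(read_multiplicity, assembly_copies): number of distinct kmers}."""
    read_counts = count_kmers(reads, k)
    asm_counts = count_kmers(contigs, k)
    table = Counter()
    for kmer, depth in read_counts.items():
        # asm_counts.get(...) is 0 when the kmer is absent from the assembly.
        table[(depth, asm_counts.get(kmer, 0))] += 1
    return table

# Made-up example: two identical reads, one short contig.
table = compare_reads_to_assembly(["ACGT", "ACGT"], ["ACG"], 3)
for (depth, copies), n in sorted(table.items()):
    print(depth, copies, n)
```

Kmers that are well covered in the reads but have 0 copies in the assembly point to missing content, while kmers at single-copy read depth with 2 or more assembly copies point to duplication.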

Whether to use large or short kmers depends on the genome. 7 seems very short, though: the assumption is that each kmer should represent a unique sequence, and with a short k the same kmer will turn up at many unrelated positions purely by chance (there are only 4^7 = 16,384 possible 7mers).
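A rough back-of-the-envelope check shows why. The sketch below assumes a random sequence with equal base frequencies, which real genomes are not, and the 500 Mb genome size is an arbitrary illustrative number, not the size of the beet assembly.

```python
def expected_chance_occurrences(genome_size, k):
    """Expected number of times one given kmer occurs in a random sequence
    of length genome_size with equal A/C/G/T frequencies (an assumption)."""
    positions = genome_size - k + 1  # number of kmer start positions
    return positions / 4 ** k

# Illustrative genome size only (hypothetical value).
genome_size = 500_000_000
for k in (7, 21, 55, 155):
    print(k, expected_chance_occurrences(genome_size, k))
```

When the expected value is far above 1, as for k = 7, kmer frequency mostly reflects chance and base composition. When it is far below 1, as for k = 21 and larger, a kmer seen several times most likely comes from a real repeat.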

Here is a tutorial for genome size estimation from a kmer analysis (but there are plenty of other resources online): https://bioinformatics.uconn.edu/genome-size-estimation-tutorial/
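For orientation, the usual simple estimate divides the total number of kmers in the reads by the depth of the main peak. Below is a minimal sketch of that calculation. The histogram values are invented, the function name and the `min_depth` cutoff for discarding low-frequency error kmers are my own assumptions, and the tutorial's exact procedure may differ.

```python
def estimate_genome_size(histogram, min_depth):
    """Estimate genome size from a read kmer histogram.

    histogram: {multiplicity: number of distinct kmers with that multiplicity}
    min_depth: multiplicities below this are treated as sequencing errors
               and ignored (assumed cutoff, to be chosen from the plot).
    """
    kept = {m: n for m, n in histogram.items() if m >= min_depth}
    # Depth of the main peak: multiplicity shared by the most distinct kmers.
    peak_depth = max(kept, key=kept.get)
    # Total number of kmer occurrences outside the error region.
    total_kmers = sum(m * n for m, n in kept.items())
    return total_kmers / peak_depth

# Invented toy histogram: an error spike at low depth, a peak at depth 10.
toy_histogram = {1: 1000, 2: 50, 9: 10, 10: 100, 11: 10}
print(estimate_genome_size(toy_histogram, min_depth=3))
```

On real data the histogram would come from a kmer counter run on the reads, and for a heterozygous or polyploid genome you would have to decide which peak to use.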

Not directly related to your question, but it may still be of interest: the Bandage wiki page on the effect of kmer size in assembly, https://github.com/rrwick/Bandage/wiki/Effect-of-kmer-size