How does the kmer length affect the total assembly size and the N50 statistic with single-end reads, and why?
Very generally speaking:
With a larger kmer size there is a better chance of avoiding ambiguities in the graph between similar regions (repeats, paralogs, ...). Ambiguities occur when a kmer occurs multiple times within the genome. Unresolvable ambiguities terminate contigs, hence larger kmer sizes in theory increase N50. However, large kmer sizes are much more sensitive to sequencing errors, heterozygosity and coverage, since a single error or variant disrupts up to k consecutive kmers and each longer kmer is covered by fewer reads.
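To make the repeat argument concrete, here is a minimal sketch that counts how many distinct kmers occur more than once in a sequence for different values of k. The toy genome, the 12 bp repeat and the function name `repeated_kmer_fraction` are made up for illustration; a real assembler works on reads rather than on the finished genome.

```python
from collections import Counter


def repeated_kmer_fraction(genome: str, k: int) -> float:
    """Fraction of distinct kmers that occur more than once in the sequence."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    if not counts:  # sequence shorter than k
        return 0.0
    repeated = sum(1 for c in counts.values() if c > 1)
    return repeated / len(counts)


# Toy genome (hypothetical): one 12 bp repeat placed twice between unique flanks
repeat = "ACGTTGCAAGCT"
genome = "GGATCCATAGTC" + repeat + "TTAGGCACCTGA" + repeat + "CGATTACAGGTC"

for k in (4, 8, 12, 16):
    print(k, round(repeated_kmer_fraction(genome, k), 3))
```

Every repeated kmer is a potential branch point in the graph. Once k is longer than the repeat, each kmer spanning it also contains unique flanking sequence, so one would expect the repeated fraction to drop.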
Assembly size depends on how the assembler handles small ambiguities (bubbles) in the graph and how it handles low-coverage paths. With small kmer sizes and some sensitivity to bubbles, you are more likely to generate a single contig for a slightly noisy region. This can be good in the case of SNPs, or bad if you end up merging repeats. This effect leads to a smaller assembly size for small kmers. With large kmers you are more likely to generate separate fragments for existing variants, which increases assembly size. However, if these fragments fail, for example, internal coverage or length cutoffs, the final assembly may actually be smaller.
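For reference, both statistics discussed here can be computed from the contig lengths alone. The following is a minimal sketch; the function name `assembly_stats` and the contig lengths are invented for illustration and do not come from any particular assembler.

```python
def assembly_stats(contig_lengths):
    """Return (total assembly size, N50) for a list of contig lengths.

    N50 is the length of the contig at which the running sum of lengths,
    taken from longest to shortest, first reaches half of the total size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            return total, length
    return total, 0  # empty input


# Hypothetical outcomes for the same region:
merged = [100, 80, 50]        # variants collapsed into single contigs
split = [100, 80, 50, 5, 5]   # variants kept as extra short fragments

print(assembly_stats(merged))
print(assembly_stats(split))
```

The extra short fragments in `split` add to the total size; if a length cutoff removed them, the reported assembly would shrink again, which is the effect described above.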
I would like to add something. I have assembled small genomes such as E. coli many times, and compared my assemblies (with different kmer values) against trusted reference genomes using programs such as Mauve.
In my hands..
On 4, for kb-long reads, an overlap graph is usually better. On 6, with PacBio, you often get one contig for a whole bacterial genome. A recent preprint shows that you can achieve similar results with Oxford Nanopore.
PacBio is terrific with bacterial genomes, that is true. For higher organisms the data is still poor and very expensive. And I must confess I am starting to lose hope with Oxford.
Time will tell.