Sorry for a naive question, but I cannot get my head around this. Say I want to do RNA-seq de novo assembly, and my single sample has a lot of bases with erroneous low-quality calls and/or heterozygosity. Should I use low-end or high-end k-mer sizes to get a less fragmented assembly? My intuition is that I should use short words, since reads with ambiguous bases will then generate more words, and some of those (the ones without any mistakes or polymorphisms) can be used to build an unambiguous graph. But from reading around the topic I get the impression that the opposite might be true, i.e. that a longer k-mer size would be advised in this case. What are your thoughts/experience? NB: this question is only about the strategy graph-based assemblers use to deal with ambiguous bases, which I hope to get clarified.
Thanks, Jeremy, that is what I suspected. But could you elaborate on how the graph is resolved with a high k-mer value? Since these ambiguities do not disappear when we use longer words, does it simply mean that words containing the ambiguities, which are present at lower frequency, will be ignored?
Miscalls and heterozygous alleles create bubbles in the de Bruijn graph: the path splits at the variant and rejoins after it. That is different from the ambiguity caused by repeats, where the graph branches into multiple paths that never converge. The bubbles can be popped to produce one or more similar transcripts.
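To make the bubble idea concrete, here is a minimal Python sketch. It is not how any particular assembler does it; the sequence and the helper names (`kmers`, `variant_specific_kmers`) are made up for illustration. It takes a toy sequence, introduces one substitution, and lists the k-mers that exist only in the variant copy. Those k-mers are the nodes of the alternative branch of the bubble, and for an isolated substitution there are k of them.

```python
def kmers(seq, k):
    """All k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def variant_specific_kmers(ref, alt, k):
    """k-mers found in alt but not in ref: the alternative branch of the bubble."""
    ref_set = set(kmers(ref, k))
    return [km for km in kmers(alt, k) if km not in ref_set]


# Toy sequence (hypothetical) with a single substitution at position 15.
ref = "ACGTTGCATCGGATCCATGCAAGTCTAGGCTA"
pos = 15
alt = ref[:pos] + ("A" if ref[pos] != "A" else "C") + ref[pos + 1:]

# Branch length for a few word sizes.
branch_sizes = {k: len(variant_specific_kmers(ref, alt, k)) for k in (5, 9, 13)}
for k, size in branch_sizes.items():
    print(f"k={k}: {size} variant-specific k-mers")
```

The k-mers on either side of the variant are shared by both copies, which is why the two branches rejoin and the structure is a bubble rather than a fork.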
Nice summary, Jeremy. One small thing: sequencing errors and het alleles also create overlaps and connections between different parts of the graph, not just bubbles. For the human genome at k=21, 50% of sequencing errors create what would be a bubble, except that it overlaps the reference genome, i.e. the errors create connections between bits of the genome that aren't supposed to be connected. As you increase k, this happens less: at k=31 it drops to 25%, and at k=55 to 15%. These percentages are a property of the specific genome in question.
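For anyone who wants to see the trend for themselves, here is a rough Python sketch under strong assumptions: the "genome" is a random 20 kb string (so it has none of the repeat structure of a real genome), errors are single substitutions, reverse complements are ignored, and all names are hypothetical. It counts how often at least one of the k-mers covering an error also occurs somewhere in the genome. The numbers it prints will not match the human figures above; it only illustrates that such collisions become rarer as k grows.

```python
import random


def kmer_set(seq, k):
    """Set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fraction_of_errors_hitting_genome(genome, k, n_errors, rng):
    """Fraction of simulated substitutions with an erroneous k-mer that also occurs in the genome."""
    genome_kmers = kmer_set(genome, k)
    hits = 0
    for _ in range(n_errors):
        pos = rng.randrange(k, len(genome) - k)
        wrong = rng.choice([b for b in "ACGT" if b != genome[pos]])
        # Window of length 2k-1: every k-mer in it covers the miscalled base.
        window = genome[pos - k + 1:pos] + wrong + genome[pos + 1:pos + k]
        if any(km in genome_kmers for km in kmer_set(window, k)):
            hits += 1
    return hits / n_errors


rng = random.Random(0)  # fixed seed so the run is repeatable
genome = "".join(rng.choice("ACGT") for _ in range(20000))  # toy random genome

fractions = {k: fraction_of_errors_hitting_genome(genome, k, 500, rng) for k in (8, 12, 16)}
for k, frac in fractions.items():
    print(f"k={k}: {frac:.3f} of errors touch an existing genome k-mer")
```

The small k values are chosen only because the toy genome is tiny; on a real genome you would compare values like the k=21, 31 and 55 mentioned above.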