Question

What K-Mer Size For High Sequencing Error Rate?

2

12.4 years ago

maymay ▴ 30

Sorry for a naive question, but I cannot get my head around this. Say I want to do RNA-seq de novo assembly and I have a lot of bases with erroneous low-quality calls and/or heterozygosity in my single sample. Should I then use low-end or high-end k-mer sizes to get a less fragmented assembly? My intuition is that I should use short words, since the reads with ambiguous bases will then generate more words, and some of them (without any mistakes/polymorphism) can be used to build an unambiguous graph. But from reading around the topic I get the impression that the opposite might be true, i.e. a longer k-mer size would be advised in this case. What are your thoughts/experience? NB: this question is just about the strategy used by graph-based assemblers to deal with ambiguous bases, which I hope to get clarified.
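To make the "short words" intuition concrete, here is a minimal Python sketch (not taken from any assembler) that counts how many k-mers of a single read survive one miscall. The sequence, the error position and the helper names `kmers` and `clean_and_dirty` are made up for illustration.

```python
def kmers(seq, k):
    """Return all k-mers of seq, ordered by start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def clean_and_dirty(read, truth, k):
    """Count k-mers of the read that match / do not match the true sequence."""
    pairs = zip(kmers(read, k), kmers(truth, k))
    clean = sum(1 for observed, expected in pairs if observed == expected)
    total = len(read) - k + 1
    return clean, total - clean


truth = "ACGTACGTTGCAAGCTTAGC"        # hypothetical 20-base transcript fragment
read = truth[:10] + "T" + truth[11:]  # one miscall: C -> T at position 10

for k in (5, 15):
    # Prints k, then (error-free k-mers, k-mers containing the miscall).
    print(k, clean_and_dirty(read, truth, k))
```

With k=5 the read still contributes 11 error-free words out of 16, while with k=15 all 6 words contain the miscall. This only covers the "more usable words" half of the question; it says nothing about how ambiguous the resulting graph is.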

rna-seq differential-expression assembly • 5.1k views

ADD COMMENT • link updated 12.4 years ago by Jeremy Leipzig 22k • written 12.4 years ago by maymay ▴ 30

score 3 · Answer 1 · 2012-07-03

12.4 years ago

Jeremy Leipzig 22k

http://www.biostars.org/post/show/4263/velvet-assembly-problem/#4280

If your kmer value is low there will be more chances for reads to overlap but also many path ambiguities in your graph and your assembly will be very fragmented (but very large). If your kmer value is high you will have a very stringent, small assembly, with a higher N50.

The same principle applies to transcript assembly.
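As a rough illustration of the path ambiguities mentioned above, the following Python sketch builds a toy de Bruijn graph with (k-1)-mers as nodes and lists the nodes that have more than one successor. The sequence `toy` (which contains the repeat GACCTG twice) and the function name `branching_nodes` are assumptions for this example, not part of Velvet or any other assembler.

```python
from collections import defaultdict


def branching_nodes(seq, k):
    """List (k-1)-mer nodes that are followed by more than one distinct node."""
    successors = defaultdict(set)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Each k-mer is an edge from its (k-1)-prefix to its (k-1)-suffix.
        successors[kmer[:-1]].add(kmer[1:])
    return sorted(node for node, nxt in successors.items() if len(nxt) > 1)


# Hypothetical sequence with the 6-base repeat GACCTG occurring twice.
toy = "AAGTCTGACCTGATCGACCTGTTAC"

for k in (4, 8):
    print(k, branching_nodes(toy, k))
```

At k=4 the node CTG (among others, possibly) has two different successors, so the path is ambiguous. At k=8 no 7-mer in this toy sequence is repeated, so there is no branching. The price, not shown here, is that reads must overlap by at least k-1 bases to be joined at all.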

ADD COMMENT • link 12.4 years ago by Jeremy Leipzig 22k

Thanks, Jeremy, that is what I suspected. But could you elaborate on how the graph is resolved with a high k-mer value? Since these ambiguities do not disappear when we have longer words, does it simply mean that words with ambiguities, present at lower frequency, will be ignored?

ADD REPLY • link 12.4 years ago by maymay ▴ 30

Miscalls and heterozygous alleles create bubbles in the de Bruijn graph: the paths diverge and then converge again. These are not ambiguities in the sense of repeats, which cause multiple vertices that never converge. The bubbles can be popped to produce one or more similar transcripts.
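A bubble can be shown with a tiny Python sketch. It assumes two hypothetical sequences that differ at a single base (think of two alleles, or a true read and a miscalled one) and uses (k-1)-mers as nodes. The names `de_bruijn` and `bubble_ends` are made up for this example, and real assemblers pop bubbles with more elaborate logic than this.

```python
from collections import defaultdict


def de_bruijn(seqs, k):
    """Build successor and predecessor maps over (k-1)-mer nodes."""
    succ, pred = defaultdict(set), defaultdict(set)
    for s in seqs:
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            succ[kmer[:-1]].add(kmer[1:])
            pred[kmer[1:]].add(kmer[:-1])
    return succ, pred


def bubble_ends(seqs, k):
    """Return (nodes where paths diverge, nodes where paths converge)."""
    succ, pred = de_bruijn(seqs, k)
    forks = sorted(n for n, nxt in succ.items() if len(nxt) > 1)
    joins = sorted(n for n, prv in pred.items() if len(prv) > 1)
    return forks, joins


ref = "ACGGTCATTGCAGA"  # hypothetical allele 1
alt = "ACGGTCACTGCAGA"  # hypothetical allele 2, T -> C at position 7

print(bubble_ends([ref, alt], 5))
```

For k=5 the two paths split at node GTCA and rejoin at node TGCA, with k-1 = 4 allele-specific nodes on each side. That split-and-rejoin shape is what distinguishes a bubble from repeat-induced branching.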

ADD REPLY • link 12.4 years ago by Jeremy Leipzig 22k

Nice summary, Jeremy. One small thing: sequencing errors and het alleles also create overlaps and connections between different parts of the graph, not just bubbles. For the human genome, at k=21, 50% of sequencing errors create what would be a bubble, except that it overlaps the reference genome, i.e. errors create connections between bits of the genome that aren't supposed to be connected. As you increase k, this happens less: at k=31 it drops to 25%, and at k=55 to 15%. This is a property of the specific genome in question.
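The trend can be reproduced on toy data with the Python sketch below. It is only a simplified proxy: an error is counted as "connecting" when at least one k-mer containing the miscalled base already occurs somewhere in the genome. The genome here is a short random string, so the numbers will not resemble the human-genome figures quoted above, and `fraction_connecting` is a hypothetical helper, not an existing tool.

```python
import random


def kmer_set(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fraction_connecting(genome, error_sites, k):
    """Fraction of single-base errors with an erroneous k-mer already in the genome."""
    known = kmer_set(genome, k)
    hits = 0
    for pos, base in error_sites:
        mutated = genome[:pos] + base + genome[pos + 1:]
        first = max(0, pos - k + 1)          # first window covering pos
        last = min(pos, len(genome) - k)     # last window covering pos
        if any(mutated[i:i + k] in known for i in range(first, last + 1)):
            hits += 1
    return hits / len(error_sites)


rng = random.Random(0)                       # fixed seed, toy data only
genome = "".join(rng.choice("ACGT") for _ in range(2000))
sites = []
for pos in rng.sample(range(len(genome)), 200):
    wrong = rng.choice([b for b in "ACGT" if b != genome[pos]])
    sites.append((pos, wrong))

fractions = {k: fraction_connecting(genome, sites, k) for k in (5, 7, 9, 11)}
print(fractions)
```

Because any long erroneous k-mer found in the genome contains a shorter erroneous k-mer that is also found there, the fraction cannot increase as k grows for a fixed set of errors. How quickly it falls depends on the genome, which matches the last point of the comment.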

ADD REPLY • link 12.4 years ago by zam.iqbal.genome ★ 1.9k

score 1 · Answer 2 · 2012-07-03

12.4 years ago

Leonor Palmeira 3.9k

Hmmm... I wouldn't assemble a dataset that has high sequencing error rate. You know, the "garbage in, garbage out" thingy?

I would rather try to assess the reason for these low-quality calls. Have you looked at the QC report from the sequencing machine? This might give you an indication of what is going on in this sample, and maybe also of how these low qualities are distributed and possibly why this is happening.

ADD COMMENT • link 12.4 years ago by Leonor Palmeira 3.9k

I am not talking about any particular dataset. Also, as stated in my post, ambiguities can come from polymorphic loci too. This is just theoretical reasoning, and I would be grateful for any enlightenment on how de Bruijn graph assembly works in general in such cases.

ADD REPLY • link 12.4 years ago by maymay ▴ 30

My bad... Could you edit your question so that it is clear that you're asking this from a theoretical point of view? Say, replacing "high sequencing error rate" with "low quality values" and adding some clarification in the text.

ADD REPLY • link 12.4 years ago by Leonor Palmeira 3.9k

You are asking a question that mixes two concepts: errors and heterozygosity. Once you introduce errors, those will trump heterozygosity. In general (and this is just my opinion) the quality of a genome assembly is difficult to predict beforehand. It will be determined primarily by the properties of your sample structure, library preparation and sequencing quality. Only after that can you even hope to detect finer structures, and only if the quality and coverage of the data support it.

ADD REPLY • link 12.4 years ago by Istvan Albert 101k

To de Bruijn graph assemblers, it does not matter what the source of an ambiguous base is, whether an error or heterozygosity... I am just interested in how the algorithm deals with such a situation when using low and high k-mer values...

ADD REPLY • link 12.4 years ago by maymay ▴ 30

score 1 · Answer 3 · 2012-07-03

Are you specifically talking about cases where only a few differing bases among reads make assemblers consider them to be different k-mers?

These differing bases can be due to sequencing error or real biological heterozygosity. For sequencing error, I would just replace the base with an ambiguous base. For heterozygosity, I would attempt to "correct" the base to, say, the major allele in the population and just make a note of where that SNP is.
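A minimal Python sketch of those two steps might look like the following. Both functions, the quality cutoff of 20 and the minor-allele fraction of 0.2 are assumptions chosen for illustration. This is not how SAET or any other tool works, and it assumes you already have per-base qualities and a non-empty column of aligned bases.

```python
from collections import Counter


def mask_low_quality(read, quals, min_q=20):
    """Replace bases whose quality is below min_q with the ambiguous base N."""
    return "".join(b if q >= min_q else "N" for b, q in zip(read, quals))


def major_allele(column, min_minor_fraction=0.2):
    """Return (major base, whether the site should be noted as a candidate SNP).

    column is a non-empty string of bases observed at one position.
    """
    counts = Counter(column).most_common()
    major = counts[0][0]
    minor_count = counts[1][1] if len(counts) > 1 else 0
    is_snp = minor_count / len(column) >= min_minor_fraction
    return major, is_snp


print(mask_low_quality("ACGT", [30, 10, 25, 5]))
print(major_allele("AAAAGGG"))     # minor allele is well supported
print(major_allele("AAAAAAAAAG"))  # minor allele looks like noise
```

The thresholds decide everything here, which is exactly the hard part: a rare true allele and a recurrent miscall look the same to this kind of rule.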

Dealing with the error/heterozygosity isn't the problem here, though. Finding and distinguishing them is the difficult part. The ABI SOLiD SAET (SOLiD accuracy enhancing tool) software supposedly does this. I am not aware of any in-depth analysis of how well it does it, though. Here is a brief weblog post on how it works: http://kevin-gattaca.blogspot.co.uk/2010/07/nuts-and-bolts-behind-abis-saet.html

SoftGenetics also has a condensation tool that performs a similar process, described here: http://www.softgenetics.com/NextGENe_1.html