Question

K-Mer Correction In RNA-Seq Data For Transcriptome Assembly?

9

13.0 years ago

Ryan Thompson ★ 3.6k

In whole-genome high-throughput sequencing data, one expects a clear separation between high-frequency k-mers (signal) and low-frequency k-mers (noise arising from sequencing errors):

[Figure: k-mer frequency distribution]

Software such as Quake exploits this separation to identify and correct low-frequency k-mers that represent sequencing errors. Removing these low-frequency k-mers should greatly reduce the memory usage of de Bruijn graph-based assemblers, since every distinct k-mer takes up the same amount of memory regardless of whether it occurs once or 1 billion times.
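To make the signal/noise separation concrete, here is a minimal Python sketch that counts k-mers and builds the abundance histogram (the "k-mer spectrum"). It is not Quake's method, which as I understand it is considerably more sophisticated; the function names, the toy reads, and the cutoff of 3 are all assumptions chosen for illustration, and reverse complements are ignored for simplicity.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def kmer_spectrum(counts):
    """Map each abundance to the number of distinct k-mers seen that often."""
    return Counter(counts.values())

def low_abundance_kmers(counts, cutoff):
    """Return k-mers seen fewer than `cutoff` times (candidate errors)."""
    return {kmer for kmer, n in counts.items() if n < cutoff}

# Toy data (hypothetical): one sequence sampled 20 times,
# plus a single read carrying one substitution error.
true_read = "ACGTTGCAAGCT"
error_read = "ACGTTGGAAGCT"  # C -> G at 0-based position 6
reads = [true_read] * 20 + [error_read]

counts = count_kmers(reads, k=5)
spectrum = kmer_spectrum(counts)
suspects = low_abundance_kmers(counts, cutoff=3)

for abundance in sorted(spectrum):
    print(abundance, spectrum[abundance])
```

With uniform coverage like this, the five k-mers that overlap the error sit alone at abundance 1, far from the genuine k-mers at 20 or more, which is why a simple cutoff works for WGS data.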

However, for RNA-Seq transcriptome assembly, the situation is different. Coverage is not even remotely uniform, so one cannot automatically assume a reasonable separation between the noise and signal peaks. Or to put it another way, the Quake website explicitly mentions that it is designed for use with WGS data with a coverage of at least 15x, and in an RNA-Seq experiment, many low-expressed transcripts will probably occur at well below 15x coverage.

So, is k-mer correction like that performed by Quake appropriate as a preprocessing step for RNA-Seq data before running a de novo assembly with something like Velvet/Oases or Trinity, or is it likely to misidentify k-mers from low-coverage genes as error k-mers and attempt to correct them inappropriately?

rna transcriptome next-gen sequencing assembly • 9.1k views

Updated 11.3 years ago by johnstantongeddes ▴ 410 • written 13.0 years ago by Ryan Thompson ★ 3.6k

score 3 · Answer 1 · 2011-11-12

3

13.0 years ago

Torst ▴ 980

When you align your reads to your reference genome/exome/transcriptome, the alignment process already allows for some substitutions (and maybe insertions and deletions). This works whether the errors are due to small differences between the reference and your organism, or due to actual sequencing errors. All those "low-frequency k-mers" won't get ignored; the reads containing them will still be aligned to their closest match if they aren't too erroneous. You need to assess how many UNALIGNED/UNMAPPED reads you are getting. If that number is too high, you can consider correcting your reads using k-mer frequency methods, but I suspect you won't need to. The chance of a corrected read now aligning to a different part of your genome is low.
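One way to check the unmapped fraction is to read the FLAG column of the aligner's SAM output, where bit 0x4 marks an unmapped record. The sketch below is a minimal pure-Python version; the function name and the toy SAM records are made up for illustration, and it counts primary records rather than read pairs.

```python
def unmapped_fraction(sam_lines):
    """Fraction of primary SAM records with the 0x4 (unmapped) FLAG bit set."""
    total = unmapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        total += 1
        if flag & 0x4:
            unmapped += 1
    return unmapped / total if total else 0.0

# Hypothetical toy records, not real alignments.
toy_sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r3\t16\tchr1\t200\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r3\t2064\tchr2\t50\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]

frac = unmapped_fraction(toy_sam)
print(f"unmapped fraction: {frac:.2f}")
```

For a real file you would pass an open file handle instead of the list, since the function only needs an iterable of lines.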

0

I completely agree, but perhaps the poster was thinking about transcriptome assembly, when there might be a greater need for k-mer correction?

Reply • 13.0 years ago by Mikael Huss 4.8k

0

Oops, yes, I somehow managed to write that entire question without once writing the word "assembly". I'll edit my question to clarify.

Reply • 13.0 years ago by Ryan Thompson ★ 3.6k

score 2 · Answer 2 · 2013-07-25

I realize this is an old post, but I've recently come across the same issue. The best solution I've found is digital normalization from C. Titus Brown's group. The paper on arXiv states that "Digital normalization ... normalizes average coverage to a specified value, reducing sampling variation while removing reads, and also removing the many errors contained within those reads."
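As I understand the idea, each read is kept only if the median abundance of its k-mers, among the reads kept so far, is still below a target coverage. Here is a minimal Python sketch of that rule. It uses an exact in-memory counter, whereas the published implementation (the khmer package) uses, as far as I know, a memory-efficient probabilistic counter; the function names, the toy reads, and the values of `k` and `cutoff` are assumptions for illustration only.

```python
from collections import Counter
from statistics import median

def kmers(read, k):
    """List every k-mer (substring of length k) in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def digital_normalization(reads, k, cutoff):
    """Keep a read only while its median k-mer abundance is below `cutoff`."""
    counts = Counter()  # abundances of k-mers from the reads kept so far
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        # Counter returns 0 for unseen k-mers without storing them.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept

# Hypothetical toy data: one highly expressed and one lowly expressed transcript.
high = "ACGTTGCAAGCTTAGC"
low = "TTTTCCCCGGGGAAAA"
reads = [high] * 100 + [low] * 2

kept = digital_normalization(reads, k=5, cutoff=5)
print(kept.count(high), kept.count(low))
```

The point for RNA-Seq is that reads from the low-coverage transcript are all retained, while the highly expressed one is capped at the cutoff, so no fixed "noise" threshold is applied across transcripts with very different expression levels.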

Hope this helps someone else!