K-mer Correction in RNA-Seq Data for Transcriptome Assembly?
13.1 years ago
Ryan Thompson ★ 3.6k

In whole-genome high-throughput sequencing data, one expects a clear separation between high-frequency k-mers (signal) and low-frequency k-mers (noise arising from sequencing errors):

[Figure: k-mer frequency distribution]

Software such as Quake exploits this separation to identify and correct low-frequency k-mers that represent sequencing errors. Removing these low-frequency k-mers should greatly reduce the memory usage of de Bruijn graph-based assemblers, since every distinct k-mer takes up the same amount of memory regardless of whether it occurs once or 1 billion times.
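To make the k-mer spectrum idea concrete, here is a minimal Python sketch that counts k-mers in a toy read set and tabulates how many distinct k-mers occur at each abundance. It is an illustration only, not how Quake works internally: the function names `count_kmers` and `kmer_spectrum` and the toy reads are made up for this example, and reverse complements are ignored for simplicity.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def kmer_spectrum(counts):
    """Map each abundance to the number of distinct k-mers seen that often."""
    return Counter(counts.values())

# Toy data (hypothetical): five error-free copies of one read,
# plus one copy carrying a single substitution (G -> C at position 7).
reads = ["ACGTACGGTCA"] * 5 + ["ACGTACCGTCA"]

counts = count_kmers(reads, k=5)
spectrum = kmer_spectrum(counts)

for abundance in sorted(spectrum):
    print(abundance, spectrum[abundance])
```

In this toy case a single substitution creates five new distinct 5-mers, each seen once, next to the seven genuine 5-mers seen five or six times. With uniform coverage the two groups are easy to tell apart; the concern with RNA-Seq is that genuine k-mers from weakly expressed transcripts would land in the same low-abundance bin as the errors.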

However, for RNA-Seq transcriptome assembly, the situation is different. Coverage is not even remotely uniform, so one cannot automatically assume a reasonable separation between the noise and signal peaks. Or to put it another way, the Quake website explicitly mentions that it is designed for use with WGS data with a coverage of at least 15x, and in an RNA-Seq experiment, many lowly expressed transcripts will probably occur at well below 15x coverage.

So, is k-mer correction like that performed by Quake appropriate as a preprocessing step for RNA-Seq data before running a de novo assembly with something like Velvet/Oases or Trinity, or is it likely to misidentify k-mers from low-coverage genes as error k-mers and attempt to correct them inappropriately?

rna transcriptome next-gen sequencing assembly • 9.1k views
13.1 years ago
Torst ▴ 980

When you align your reads to your reference genome/exome/transcriptome, the alignment process already allows for some substitutions (and maybe insertions and deletions). This works whether the errors are due to small differences between the reference and your organism, or due to actual sequencing errors. All those "low frequency k-mers" won't get ignored; they will still be aligned to their closest match if they aren't too erroneous. You need to assess how many UNALIGNED/UNMAPPED reads you are getting. If that number is too high, you can consider correcting your reads using k-mer frequency methods, but I suspect you won't need to. The chance of a corrected read now aligning to a different part of your genome is low.
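One way to check the unmapped fraction is to look at the FLAG column of the aligner's SAM output, where bit 0x4 marks a read as unmapped. The snippet below is a minimal sketch under some assumptions: the function name `unmapped_fraction` and the toy records are made up for illustration, the input is plain-text SAM lines, and secondary/supplementary records (bits 0x100 and 0x800) are skipped so each read is counted once. For paired-end data each mate is counted as its own record.

```python
def unmapped_fraction(sam_lines):
    """Return the fraction of primary SAM records flagged as unmapped (0x4)."""
    total = 0
    unmapped = 0
    for line in sam_lines:
        # Skip blank lines and header lines (which start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        # Skip secondary (0x100) and supplementary (0x800) alignments.
        if flag & 0x900:
            continue
        total += 1
        if flag & 0x4:
            unmapped += 1
    return unmapped / total if total else 0.0

# Toy SAM records (hypothetical), including one header and one secondary hit.
toy_sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tGGGGG\tIIIII",
    "r3\t16\tchr1\t200\t60\t5M\t*\t0\t0\tTTTTT\tIIIII",
    "r3\t256\tchr2\t50\t0\t5M\t*\t0\t0\tTTTTT\tIIIII",
]

fraction = unmapped_fraction(toy_sam)
print(f"{fraction:.1%} of reads unmapped")
```

For a real file you would pass an open file handle instead of the toy list, since the function only needs an iterable of lines.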


I completely agree, but perhaps the poster was thinking about transcriptome assembly, when there might be a greater need for k-mer correction?


Oops, yes, I somehow managed to write that entire question without once writing the word "assembly". I'll edit my question to clarify.

11.4 years ago

I realize this is an old post, but I've recently come across the same issue. The best solution I've found is digital normalization from C. Titus Brown's group. The paper on arXiv states that "Digital normalization ... normalizes average coverage to a specified value, reducing sampling variation while removing reads, and also removing the many errors contained within those reads."
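As I understand the approach, the core step is a single pass over the reads: estimate each read's coverage as the median abundance of its k-mers among the reads kept so far, and keep the read only if that median is below a cutoff. The sketch below illustrates just that step with an exact in-memory counter. The function names, the toy reads, and the `k=5`, `cutoff=3` values are assumptions chosen for the example, and a real implementation would need a far more memory-efficient counting structure.

```python
from collections import Counter
from statistics import median

def kmers(read, k):
    """Return the list of k-mers in a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def digital_normalize(reads, k=5, cutoff=3):
    """Keep a read only while the median abundance of its k-mers,
    counted over the reads kept so far, is below the cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if not ks:
            continue  # read shorter than k
        if median(counts[km] for km in ks) < cutoff:
            kept.append(read)
            counts.update(ks)  # only kept reads contribute to the counts
    return kept

# Toy data (hypothetical): a highly expressed transcript sampled 10 times
# and a weakly expressed one sampled twice.
high = "ACGTACGGTCA"
low = "TTGACCATGGA"
reads = [high] * 10 + [low] * 2

kept = digital_normalize(reads, k=5, cutoff=3)
print(len(kept), "of", len(reads), "reads kept")
```

The point relevant to the original question is that reads from the low-coverage transcript are all retained, while redundant reads from the high-coverage one are dropped once the cutoff is reached, so low abundance by itself is not treated as evidence of error.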

Hope this helps someone else!
