Question

Is It Possible That A Uniquely Mapped Read Was Mapped To The Wrong Place?

12.2 years ago

Chai_AF ▴ 80

Is it possible that a uniquely mapped read was mapped to the wrong place? How is that possible, given that all bases in the read have high quality scores?

updated 12.2 years ago by Arun 2.4k • written 12.2 years ago by Chai_AF ▴ 80

This is a very poorly worded question. What do you mean by wrong place? What do you mean by uniquely mapped read?

The quality score is really just the confidence in the base calls themselves, so if you are referring to sequencing errors, I think that would be a minor detail. Many factors affect where a read is mapped on a reference.

Please give us more information and a clear sense of what you are actually asking.

12.2 years ago by Josh Herr 5.8k

score 12 · Answer 1 · 2012-09-12

Because:

Nearly all mapping algorithms use heuristics. They cannot guarantee to find the best alignment in terms of the highest Smith-Waterman (SW) score.
Even if you could afford SW, the best-scoring position may still be wrong, because the underlying scoring matrix used in SW is inaccurate and the appropriate matrix actually changes between regions (e.g. around long homopolymer runs). Sequence evolution may not always follow the SW model (e.g. in a microsatellite), either.
Even if you knew the precise scoring system, your reference genome may be wrong, and in that case nothing you do can fix the problem unless you fix the reference genome first.
Also, as you mentioned, read sequences can be wrong. Even if the base quality is high, there is still a tiny chance that a sequencing error puts the read in the wrong place. Given billions of reads, you will have quite a few reads affected by sequencing errors.
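To put rough numbers on the last point, here is a minimal sketch of the arithmetic, assuming the standard Phred definition (error probability = 10^(-Q/10)), independent errors, and the same quality at every base. The function names and the read length and read count in the example are illustrative assumptions, not values from this thread. Note that it counts reads containing at least one miscalled base; only a subset of those would actually end up at a wrong location.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to a per-base error probability."""
    return 10 ** (-q / 10)


def prob_read_has_error(q, read_length):
    """Probability that a read contains at least one miscalled base.

    Assumes independent errors and the same quality Q at every position.
    """
    p = phred_to_error_prob(q)
    return 1 - (1 - p) ** read_length


def expected_reads_with_error(q, read_length, n_reads):
    """Expected number of reads with at least one miscalled base."""
    return n_reads * prob_read_has_error(q, read_length)


if __name__ == "__main__":
    # Illustrative inputs: 100 bp reads, one billion reads.
    for q in (30, 40):
        frac = prob_read_has_error(q, 100)
        count = expected_reads_with_error(q, 100, 1_000_000_000)
        print(f"Q{q}: fraction with >=1 error = {frac:.4f}, expected reads = {count:.3g}")
```

Even at uniformly high quality, the fraction of reads carrying some error is far from zero, which is why a small number of misplaced reads is expected at scale.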

score 6 · Answer 2 · 2012-09-12

This is almost a philosophical question. With shorter reads I can imagine a situation where a sequence maps uniquely and perfectly, with a high score and good-quality bases, but in fact has a (high-quality base) sequencing error in it that prevented it from mapping where it "was supposed to". Back in the days of 17 bp and 21 bp SAGE we actually thought about this a fair bit. But now, with 100 or 150 bp reads, if you have a perfect/unique match you tend to trust it. Having said that, I can still think of hypothetical situations involving homologous genes, variant bases, and errors where strange things could happen.

It also depends on how you define unique (i.e., unambiguous) mapping. What if a read maps with 100% identity to one place and 90% identity to another? What about 99.0%? What about 99.9%? At what point do you consider two alignments so similar that the read is no longer uniquely mapping? The answer is not simple, and it has to take into account the complexity of your sequence library and of the region the read is mapping to, error rates, variation between the genome being sequenced and the reference genome being aligned to, and so on.
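One way to make the "how similar is too similar" question concrete is to treat uniqueness as a margin between the best and second-best hit. The sketch below is only an illustration under that assumption: `percent_identity`, `is_unique`, and the `min_gap` threshold are hypothetical names, the comparison is ungapped, and real aligners summarise this kind of ambiguity differently (for example through a mapping quality).

```python
def percent_identity(read, ref_segment):
    """Percent of matching positions in an ungapped, equal-length comparison."""
    if len(read) != len(ref_segment):
        raise ValueError("ungapped comparison needs sequences of equal length")
    matches = sum(a == b for a, b in zip(read, ref_segment))
    return 100.0 * matches / len(read)


def is_unique(identities, min_gap):
    """Call a read 'unique' if its best hit beats the runner-up by min_gap points.

    identities: percent identities of all candidate alignments for one read.
    min_gap: hypothetical threshold, in percentage points.
    """
    ranked = sorted(identities, reverse=True)
    if not ranked:
        return False  # no alignment at all
    if len(ranked) == 1:
        return True   # a single hit has no competitor
    return ranked[0] - ranked[1] >= min_gap


if __name__ == "__main__":
    # The same best hit is judged differently as the runner-up gets closer.
    for runner_up in (90.0, 99.0, 99.9):
        print(runner_up, is_unique([100.0, runner_up], min_gap=5.0))
```

The choice of `min_gap` is exactly the judgment call described above; it would need to reflect read length, error rate, and how divergent the sample is from the reference.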

score 3 · Answer 3 · 2012-09-13

I generated test data with biological replicates using flux-simulator to test the mapping efficiency of tophat (2.0.0 and 2.0.4) and GSNAP. The data had between 10 and 15 million reads. I filtered out reads that were not uniquely mapped. About 90-95% of the reads mapped uniquely (with GSNAP > tophat 2.0.4 > tophat 2.0.0), and the difference between GSNAP and tophat 2.0.4 was not that substantial. Because the data is simulated, I was able to go back and check whether the simulated read location was the same as the mapped location. About 1-1.5% of the reads mapped "wrongly" (with tophat 2.0.4 having fewer wrong mappings than GSNAP). By "wrongly", I mean that there is no EXACT match between the simulated and mapped alignments. A read might have the same starting position but a soft clip at the 3' end; I counted these as wrong. To sum up, wrong mappings are very few, in my opinion.
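The check described above can be sketched roughly as follows. This is not the script used for the numbers in this answer; the data layout (dictionaries keyed by read name), the function names, and the toy reads are all assumptions made for illustration. A read counts as wrong if its chromosome or start differs from the simulated one, or if the CIGAR string contains a soft clip.

```python
from collections import Counter


def classify(truth, mapped):
    """Compare one read's simulated origin with its reported alignment.

    truth:  (chrom, start) from the simulator.
    mapped: (chrom, start, cigar) from the aligner, or None if unmapped.
    """
    if mapped is None:
        return "unmapped"
    chrom, start, cigar = mapped
    if (chrom, start) != truth:
        return "wrong_position"
    if "S" in cigar:
        return "soft_clipped"  # same start, but not an exact match
    return "exact"


def wrong_fraction(truth_by_read, mapped_by_read):
    """Fraction of mapped reads that are not an exact match to the simulation."""
    counts = Counter(
        classify(truth, mapped_by_read.get(name))
        for name, truth in truth_by_read.items()
    )
    n_mapped = sum(counts.values()) - counts["unmapped"]
    n_wrong = counts["wrong_position"] + counts["soft_clipped"]
    return (n_wrong / n_mapped if n_mapped else 0.0), counts


if __name__ == "__main__":
    # Made-up toy reads, only to show the bookkeeping.
    truth = {"r1": ("chr1", 100), "r2": ("chr1", 200), "r3": ("chr2", 50), "r4": ("chr2", 80)}
    mapped = {"r1": ("chr1", 100, "50M"), "r2": ("chr1", 200, "45M5S"), "r3": ("chr3", 50, "50M")}
    fraction, counts = wrong_fraction(truth, mapped)
    print(fraction, dict(counts))
```

Counting soft-clipped reads as wrong is a strict definition; relaxing it would lower the reported error rate.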