Question

Reference Sequence Accuracy

11.1 years ago

goodcow ▴ 20

In modern sequencing, reads are aligned with a reference sequence.

What happens if a read differs from the reference genome because of a SNP, a sequencing error, or some other difference? How do we choose where to map it?

More fundamentally, how can we be certain that the reference sequence is correct?

sequencing • 2.3k views

updated 3.8 years ago by Ram 45k • written 11.1 years ago by goodcow ▴ 20

How would you define "correct"? :-)

11.1 years ago by Pierre Lindenbaum 166k

Reflecting what the true sequence is

11.1 years ago by goodcow ▴ 20

That raises the question, "The true sequence of what?" For example, since there's no single universal human genome, the reference is just an abstract approximation that should generally match any given sample (in fact, it's a composite of multiple individuals).

11.1 years ago by Devon Ryan 105k

Good point, I hadn't thought about that. I suppose that by sequencing multiple genomes of the same type, errors in highly conserved regions (which should be identical across samples) could be filtered out by consensus; similarly, areas of variation could be identified.

11.1 years ago by goodcow ▴ 20

11.1 years ago

Ashutosh Pandey 12k

Another somewhat related question I asked a few years back:

Error In Reference Genome


Ram · Accepted Answer · 2014-06-13

N.B., many people do de novo assembly because they have no reference sequence (or only a poor-quality one), so aligning to a reference isn't really a must.

You might want to read any of the papers on an aligner, such as bwa or bowtie, since they discuss the mismatch issue. There are many ways to find imperfect alignments, but a common method is to search for a seed (a short contiguous stretch of the read) and then extend it. You then look for the alignment with the highest score, which often depends on: the edit distance between the read and the reference, the Phred quality of the mismatching bases, the presence of Ns in that region of the reference, and a set of mismatch penalties and match bonuses. Alignments are then given a MAPQ score reflecting how likely they are to be correct (the supplemental material in the original MAQ paper has a really nice section on how this can be done).
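To make the seed-and-extend idea concrete, here is a toy Python sketch. It is not how bwa, bowtie, or MAQ are actually implemented: the function names, the k-mer seed length, the ungapped extension, the choice to let a reference N cost nothing, and the cap of 60 are all assumptions made for illustration. It scores each candidate position by the summed Phred quality of its mismatching bases and derives a MAPQ-like value from how much better the best candidate is than the rest.

```python
import math
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def candidate_starts(read, index, k, ref_len):
    """Seed step: each exact k-mer hit proposes a start position for the read."""
    starts = set()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], ()):
            start = hit - offset
            if 0 <= start <= ref_len - len(read):
                starts.add(start)
    return sorted(starts)


def mismatch_penalty(read, quals, reference, start):
    """Extend step (ungapped): sum Phred qualities at mismatching bases.

    Assumption for this sketch: an N in the reference is never penalised.
    """
    total = 0
    for i, (base, qual) in enumerate(zip(read, quals)):
        ref_base = reference[start + i]
        if ref_base != "N" and base != ref_base:
            total += qual
    return total


def align(read, quals, reference, k=4, max_mapq=60):
    """Return (best_start, penalty, mapq), or None if no seed matches."""
    index = build_kmer_index(reference, k)
    starts = candidate_starts(read, index, k, len(reference))
    if not starts:
        return None
    penalties = {s: mismatch_penalty(read, quals, reference, s) for s in starts}
    best = min(starts, key=lambda s: penalties[s])
    # Treat 10^(-penalty/10) as a relative likelihood for each candidate.
    likelihoods = {s: 10 ** (-p / 10) for s, p in penalties.items()}
    p_wrong = 1 - likelihoods[best] / sum(likelihoods.values())
    if p_wrong <= 0:
        mapq = max_mapq
    else:
        mapq = min(max_mapq, round(-10 * math.log10(p_wrong)))
    return best, penalties[best], mapq


if __name__ == "__main__":
    reference = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
    read = "AGGCTAAACCGG"  # reference[8:20] with one substituted base
    print(align(read, [30] * len(read), reference))
```

A read that fits equally well in two places (a repeat, say) ends up with a very low MAPQ in this scheme, which is the behaviour you want: the aligner still reports a position, but flags that it can't really tell which one is right.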

Regarding how closely the reference matches whatever you're looking at, which is what I would interpret "correct" to mean, there's no single answer. The easiest method is to just perform the alignment and see how well it works. If you get a high non-alignment rate, then maybe de novo assemble some of the unmapped reads and BLAST the resulting contigs. That'll give you an idea of whether they come from a region where the reference matches your sample too poorly (or perhaps they end up mapping to some bacteria that just happen to be contaminating your samples...).
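As a rough way to check the non-alignment rate, here is a small Python sketch that reads SAM-format text and reports the fraction of reads flagged as unmapped. The function name is made up for this example; the only thing it relies on from the SAM format is the FLAG field (second column), where bit 0x4 marks an unmapped read and bits 0x100 and 0x800 mark secondary and supplementary records, which are skipped so that each read is counted once.

```python
def unmapped_fraction(sam_lines):
    """Fraction of primary SAM records whose FLAG has the 'unmapped' bit set."""
    total = 0
    unmapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & 0x100 or flag & 0x800:
            continue  # skip secondary and supplementary alignments
        total += 1
        if flag & 0x4:
            unmapped += 1
    return unmapped / total if total else 0.0


if __name__ == "__main__":
    example = [
        "@HD\tVN:1.6",
        "r1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
        "r1\t256\tchr2\t500\t0\t8M\t*\t0\t0\t*\t*",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTGGGG\tIIIIIIII",
    ]
    print(unmapped_fraction(example))
```

What counts as a "high" rate depends on the organism, library, and aligner settings, so this number is only a starting point for deciding whether to dig into the unmapped reads.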

Actually building a reference genome is a pretty involved process if the result is going to be of good quality, and a variety of metrics are computed along the way that need to be looked at in their totality to get an idea of how well things went.