Question

Reference Bias In Alignments

12.6 years ago

learnerforever ▴ 520

I have read several articles referring to "reference bias in mapping reads to a reference". How is "reference bias" defined? What is it conceptually? Could someone share an example with potential consequences in variant calling/RNA-seq analyses?

Thanks!

mapping • 6.3k views

updated 12.6 years ago by lh3 33k • written 12.6 years ago by learnerforever ▴ 520

score 4 · Answer 1 · 2012-05-02

12.6 years ago

Fwip ▴ 500

If you're mapping reads to a reference, the result is going to resemble the reference fairly closely, or the mapper wouldn't be able to do its job. If you used a different reference, the output would resemble that one instead. Conceptually, that's where the "bias" comes in.

score 2 · Answer 2 · 2012-05-04

12.6 years ago

lh3 33k

I do not think reference bias has a clear definition. In my definition, it denotes the effect that reads carrying the reference allele map better than reads carrying a non-reference allele. This causes many artifacts. For example, at heterozygous sites the reference allele gets more read support. It has a major impact on allele-specific expression analysis with RNA-seq.
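To make the heterozygote effect concrete, here is a minimal toy simulation. It assumes a made-up mapping rule (a read maps only if it has at most `max_mismatches` differences from the reference), which is not how any real aligner scores reads. The function name and all parameter values are hypothetical, chosen only for illustration.

```python
import random

def simulate_het_site(n_reads=10_000, read_len=100, error_rate=0.01,
                      max_mismatches=2, seed=0):
    """Toy model of a true heterozygous site (50% ref, 50% alt reads).

    Assumption: a read 'maps' only if it differs from the reference at
    no more than max_mismatches positions.
    """
    rng = random.Random(seed)
    mapped = {"ref": 0, "alt": 0}
    for _ in range(n_reads):
        allele = "ref" if rng.random() < 0.5 else "alt"
        # Sequencing errors at the other read_len - 1 positions of the read.
        errors = sum(rng.random() < error_rate for _ in range(read_len - 1))
        # A read carrying the alternate allele has one extra mismatch.
        mismatches = errors + (1 if allele == "alt" else 0)
        if mismatches <= max_mismatches:
            mapped[allele] += 1
    return mapped

counts = simulate_het_site()
ref_fraction = counts["ref"] / (counts["ref"] + counts["alt"])
print(counts, f"reference allele fraction = {ref_fraction:.3f}")
```

Under these assumptions the reference allele fraction among mapped reads comes out above 0.5 even though the site is a true 50/50 heterozygote, because alternate-allele reads use up one of their allowed mismatches before any sequencing error is counted.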

I've observed the same effect in a deep-coverage pooled sequencing experiment of mitochondrial DNA. When I attempt to estimate the minor allele frequency (MAF) from the pooled data, my estimate is systematically lower than expected because reads that contain the alternate allele have a lower probability of aligning.
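The size of that underestimate can be sketched with a little arithmetic. The snippet below assumes the minor allele is the non-reference allele and that each read aligns independently with a fixed probability that depends only on which allele it carries. The function names and the alignment probabilities are hypothetical, not measured values.

```python
def observed_alt_fraction(true_alt, p_align_ref, p_align_alt):
    """Expected alternate-allele fraction among the reads that aligned."""
    alt = true_alt * p_align_alt
    ref = (1.0 - true_alt) * p_align_ref
    return alt / (alt + ref)

def corrected_alt_fraction(observed_alt, p_align_ref, p_align_alt):
    """Invert the relation above to recover the true fraction."""
    alt = observed_alt / p_align_alt
    ref = (1.0 - observed_alt) / p_align_ref
    return alt / (alt + ref)

# Hypothetical numbers: a 10% minor allele whose reads align 80% of the
# time, versus 95% for reads carrying the reference allele.
obs = observed_alt_fraction(0.10, p_align_ref=0.95, p_align_alt=0.80)
corrected = corrected_alt_fraction(obs, p_align_ref=0.95, p_align_alt=0.80)
print(f"observed = {obs:.4f}, corrected = {corrected:.4f}")
```

The correction only works if the two alignment probabilities are known, which in practice would have to be estimated, for example by simulating reads with and without the variant.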

12.6 years ago by dfornika ★ 1.1k

score 1 · Answer 3 · 2012-05-03

A specific example: you have measured RNA abundance with RNA-seq in two strains of mice. The mouse reference is the C57BL/6 strain. The first strain in your experiment is a close relative of C57BL/6, while the second strain is a wild-derived strain that is evolutionarily distant from C57BL/6. The close-relative genome is very similar to C57BL/6, while the wild-derived genome is much more divergent. When you align RNA-seq reads from these strains against the reference, reads from the wild-derived strain will carry more polymorphisms relative to the reference than reads from the close relative. These polymorphisms may affect your alignment success rates in a way that biases your read counts against the more distantly related strain.
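A rough back-of-the-envelope version of this, assuming polymorphisms fall independently along the read (so the count per read is approximately Poisson), ignoring sequencing errors, and assuming a read fails to align once it exceeds a fixed mismatch limit. The divergence values, read length, and mismatch limit are hypothetical and not taken from any real strain comparison or aligner.

```python
import math

def mappable_fraction(divergence, read_len=100, max_mismatches=3):
    """P(read carries <= max_mismatches polymorphisms).

    The number of polymorphisms per read is modelled as
    Poisson(divergence * read_len).
    """
    lam = divergence * read_len
    return sum(math.exp(-lam) * lam**k / math.factorial(k)
               for k in range(max_mismatches + 1))

# Hypothetical per-base divergence from the reference for each strain.
close = mappable_fraction(0.0001)   # close relative of the reference
wild = mappable_fraction(0.01)      # wild-derived strain
print(f"close relative: {close:.4f}, wild-derived: {wild:.4f}")
print(f"apparent fold change from mapping alone: {wild / close:.3f}")
```

Even with identical true expression, the more divergent strain loses a larger share of its reads in this model, and the loss would be worse for genes where polymorphisms cluster.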

score 1 · Answer 4 · 2012-05-03

This table (http://www.nature.com/nrg/journal/v10/n4/box/nrg2554_BX1.html) and its caption might shed some light:

There are fewer novel single nucleotide variants in J. Craig Venter's genome owing to the fact that his genome was partially represented in the Celera human genome assembly2 and variants in that assembly were subsequently mined and deposited into dbSNP117.

You can also imagine that in a few cases, probably very few, short reads from some of Craig Venter's loci would align to the reference whereas reads from another individual might not align at all (too many mismatches).

I feel the need to take a shower after writing this.