I have read several articles referring to "reference bias in mapping reads to a reference".
How is "reference bias" defined? What is it conceptually?
Could someone share an example with potential consequences in variant calling/RNA-seq analyses?
If you're mapping reads to a reference, the result is going to resemble the reference fairly closely, or the mapper wouldn't be able to do its job. If you used a different reference, the output would resemble that one instead. Conceptually, that's where the "bias" comes in.
I do not think reference bias has a single agreed definition. As I use the term, it denotes the effect that reads carrying the reference allele are mapped more readily than reads carrying a non-reference allele. This causes many artifacts. For example, at heterozygous sites the reference allele gets more read support than the alternate allele. It has a major impact on allele-specific expression analysis with RNA-seq.
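To make the heterozygote effect concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a deliberately simplified mapper that accepts a read only if it has at most `max_mismatches` mismatches to the reference, and independent sequencing errors at a fixed per-base rate. The function names and default parameter values are my own illustrative choices, not taken from any real aligner.

```python
from math import comb

def prob_mappable(read_len, error_rate, max_mismatches, baseline_mismatches=0):
    """Probability that a read stays within the mismatch budget.

    baseline_mismatches is the number of true differences from the
    reference (0 for a reference-allele read, 1 for a read covering one
    alternate allele). Sequencing errors at the remaining positions are
    modelled as independent, each adding one more mismatch.
    """
    budget = max_mismatches - baseline_mismatches
    if budget < 0:
        return 0.0
    n = read_len - baseline_mismatches
    return sum(
        comb(n, k) * error_rate**k * (1 - error_rate) ** (n - k)
        for k in range(budget + 1)
    )

def expected_ref_fraction(read_len=36, error_rate=0.01, max_mismatches=2):
    """Expected fraction of mapped reads carrying the reference allele
    at a true 50/50 heterozygous site, under the toy model above."""
    p_ref = prob_mappable(read_len, error_rate, max_mismatches, 0)
    p_alt = prob_mappable(read_len, error_rate, max_mismatches, 1)
    return p_ref / (p_ref + p_alt)

if __name__ == "__main__":
    # Hypothetical settings: short reads, 1% error, 2 mismatches allowed.
    print(expected_ref_fraction())
```

Under this model the alternate-allele read has already spent one mismatch before any sequencing error occurs, so the reference fraction comes out above 0.5 whenever the error rate is non-zero. Real aligners score alignments in more elaborate ways, so treat this only as an illustration of the direction of the bias.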
I've observed the same effect in a deep-coverage pooled sequencing experiment of mitochondrial DNA. When I attempt to estimate the MAF from the pooled data, my estimate is systematically lower than expected because reads that contain the alternate allele have a lower probability of aligning.
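The arithmetic behind that underestimate can be sketched as follows. This is only a toy model: it assumes you somehow know the probability that a reference-allele read aligns (`p_map_ref`) and the probability that an alternate-allele read aligns (`p_map_alt`), for example from simulated reads. Both names and the numbers in the usage lines are hypothetical.

```python
def observed_alt_fraction(true_maf, p_map_ref, p_map_alt):
    """Alternate-allele fraction among mapped reads, given the true
    pool frequency and allele-specific mapping probabilities."""
    alt = true_maf * p_map_alt
    ref = (1 - true_maf) * p_map_ref
    return alt / (alt + ref)

def corrected_maf(observed, p_map_ref, p_map_alt):
    """Invert the relationship above: reweight each allele's observed
    share by the inverse of its mapping probability."""
    alt = observed / p_map_alt
    ref = (1 - observed) / p_map_ref
    return alt / (alt + ref)

if __name__ == "__main__":
    # Hypothetical values: alt reads align a bit less often than ref reads.
    obs = observed_alt_fraction(true_maf=0.10, p_map_ref=0.98, p_map_alt=0.90)
    print(obs)                             # lower than the true 0.10
    print(corrected_maf(obs, 0.98, 0.90))  # recovers the true frequency
```

Whenever `p_map_alt` is smaller than `p_map_ref`, the observed fraction falls below the true frequency, which matches the systematic underestimate described above. In practice the mapping probabilities vary by site and are not known exactly, so the correction is only as good as the estimates fed into it.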
A specific example: you have measured RNA abundance with RNA-seq in two strains of mice. The mouse reference is the C57BL/6 strain. The first strain in your experiment is a close relative of C57BL/6, while the second strain is a wild-derived strain that is evolutionarily distant from C57BL/6. The close-relative genome is very similar to C57BL/6, while the wild-derived strain is much more divergent. When you align RNA from these strains against the reference, there will be more polymorphisms distinguishing wild vs. reference than close-relative vs. reference. These polymorphisms may affect your alignment success rates in a way that biases your read counts against the more distantly related strain.
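Here is a small simulation of that scenario, as a sketch only. It uses a random toy "reference", two derived "strains" with made-up divergence rates (the 3% figure is exaggerated for illustration and is not a real mouse estimate), and a stand-in for an aligner that simply counts mismatches at the read's true position and rejects reads above a threshold. All names and parameters are hypothetical.

```python
import random

def mutate(seq, rate, rng):
    """Return a copy of seq with substitutions introduced at the given
    per-base rate (each substitution changes the base to a different one)."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

def count_mapped(reference, genome, read_len, n_reads, max_mismatches, rng):
    """Sample error-free reads from genome and count how many are within
    max_mismatches of the reference at the same coordinates."""
    mapped = 0
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = genome[start:start + read_len]
        window = reference[start:start + read_len]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            mapped += 1
    return mapped

rng = random.Random(0)  # fixed seed so the run is repeatable
reference = "".join(rng.choice("ACGT") for _ in range(5000))
close_strain = mutate(reference, 0.001, rng)  # assumed: near-identical strain
wild_strain = mutate(reference, 0.03, rng)    # assumed: divergent strain

# Same number of reads sampled from each strain, i.e. equal true abundance.
close_count = count_mapped(reference, close_strain, 50, 2000, 2, rng)
wild_count = count_mapped(reference, wild_strain, 50, 2000, 2, rng)
print(close_count, wild_count)
```

Both strains contribute the same number of reads, yet fewer reads from the divergent strain pass the mismatch filter, so a naive comparison of mapped counts would read as lower expression in that strain. A real aligner also searches for the best position and handles indels and soft clipping, none of which is modelled here.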
There are fewer novel single nucleotide variants in J. Craig Venter's
genome owing to the fact that his genome was partially represented in
the Celera human genome assembly2 and variants in that assembly were
subsequently mined and deposited into dbSNP117.
You can also imagine that in a few cases, probably very few, short reads from some of Craig Venter's loci would align to the reference whereas reads from another individual might not align at all (too many mismatches).
I feel the need to take a shower after writing this.