Question

why is variant calling difficult?

1

Entering edit mode

5.8 years ago

abc ▴ 10

I'm very new to this space, but from what I've read I'm not sure I understand what the process of variant calling looks like -- if we're given a reference genome, and then a specific genome of interest, what makes finding variants, which are just differences from the reference, so hard to identify?

For example, at a specific location the sequence, why is it wrong to simply take the majority of whatever bp is there and say "at this position, we are __% confident that individual X has this bp"? And since you would expect see heterozygosity at many loci, what does it even mean if you see "60% A and 40% T" at a given location?

It seems like alignment would be the actual tricky part of the process, not the 'calling' part.

I think i'm probably making some faulty assumptions or do not fully understand the problem, and I'd appreciate any explanation.

genome SNP • 2.4k views

ADD COMMENT • link updated 5.8 years ago by igor 13k • written 5.8 years ago by abc ▴ 10

score 5 · Answer 1 · 2019-02-24

Variant calling is difficult because the alignments are mathematical constructs that work by maximizing various predetermined rewards and penalities and by those imply that the simplest "explanation" is correct. Biology does not work this way. I have a chapter called Misleading alignments in the Biostar Handbook that I will summarize below:

Imagine that the sequence below is subjected to two insertions of Cs at the locations indicated with carets:

CCAAACCCCCCCTCCCCCGCTTC
     ^       ^

The two sequences, when placed next to one another, would look like this:

CCAAACCCCCCCTCCCCCGCTTC
CCAAACCCCCCCCTCCCCCCGCTTC

A better way to visualize what’s happening in this example is shown in the expected alignment that would reflect the changes that we have introduced:

CCAAA-CCCCCCCT-CCCCCGCTTC
||||| |||||||| ||||||||||
CCAAACCCCCCCCTCCCCCCGCTTC

Now suppose we did not know what the changes were. Can we discover the variation that we introduced by using a global aligner? Let’s see:

global-align.sh CCAAACCCCCCCTCCCCCGCTTC CCAAACCCCCCCCTCCCCCCGCTTC

Here is what we obtain:

CCAAACCCCCCC--TCCCCCGCTTC
||||||||||||  .||||||||||
CCAAACCCCCCCCTCCCCCCGCTTC

The variation indicated by the alignment is different: instead of the two insertions of C it shows one insertion of CT followed by a mismatch, adding insult to injury the variation is also shown to take place at a completely different location altogether.

You can see how difficult would be to call this correctly!

score 3 · Answer 2 · 2019-02-24

I feel that I answer some of these questions in my previous answer, here: A: DP in VCF files?

Others can make further comment specifically on the problems relating to alignment. However, a big challenge for aligners in my own mind is the high percent of the genome that exhibits sequence similarity. With short read NGS, it is virtually impossible to accurately align reads derived from certain regions. A few examples are VWF, which has a very large pseudogene on chr22, I believe, exons of some of the cytochrome (CYP) genes, and then genes that were born from duplication events throughout evolution.

The last PI to whom I mentioned this to correct him looked ashen / stone faced. In any case, no wonder that Illumina purchased PacBio, whose focus is on longer reads. Nanopore has also become popular (again focused on long reads).

Kevin

score 1 · Answer 3 · 2019-02-24

As others have already mentioned, alignment is part of the problem. There is additionally the inherent systematic error in the actual reads. Even if the alignment is completely accurate, the observed variants may not be real. Meacham et al. discussed that aspect and provide a nice illustration:

Types of errors

A screenshot from the IGV browser [21] showing three types of error in reads from an Illumina sequencing experiment: (1) A random error likely due to the fact that the position is close to the end of the read. (2) Random error likely due to sequence specific error- in this case a sequence of Cs are probably inducing errors at the end of the low complexity repeat. (3) Systematic error: although it is likely that the GGT sequence motif and the GGC motifs before it created phasing problems leading to the errors, the extent of error is not explained by a random error model.