Question

Is 1 deleted T in a T stretch always detected, assuming enough coverage of course?

0

Entering edit mode

5.7 years ago

Marvin ▴ 220

Let's assume the following reference sequence:

ATTTTTTTTTTTTTTTTTTTTA

The individual has a deletion of 1 T in that stretch of 20 T's. Could be the left-most, the middle, the right-most. We don't know.

Let's assume I have sequenced 100 reads which cover that region. So the mapper has to make a choice for every single one of the 100 reads: which T is deleted? That means that in an unfortunate distribution I might never see the deleted T?

Let's assume the mapper chooses the first T to be deleted for reads 1 to 5. For the other 95 reads it chooses some other position. That means later during the variant valling, the caller will assume "Nothing special here. I see 95 reads which say there's a T and I see 5 reads which say there's nothing. Probably just a T like the reference".

Next position, the second T: Let's assume for read 6 to 10, the mapper has decided that the deleted T is the second T. For the other 95 reads it chooses some other position. Again, the caller will say "Alright I got 95 times a T and 5 times nothing. Probably just a T"

In reads 11 to 15, the 3rd T is deleted. In reads 16 to 20 the 4th T is deleted ... [...] ... and in reads 95 to 100 the 20th T is deleted. So if the mapper chooses the deleted T for every read and the distribution is unfortunate, I will never call that deleted T. If the mapper had instead chosen the first T to be the deleted T for all of the 100 reads, I would have seen the deletion. How is this problem avoided?

I know that a mapper is just an algorithm, so it will probably never come to the situation described above (unless the programmer decided "multiple options for the position of the T deletion? hey, just pick one randomly" -> but for the reason described above that would be a mistake right?). Or is the opposite eventually true: is it 100% certain that a mapper will choose the same T to be the deleted T for ALL (or just some? 90%? 80%? what's the case?) of the 100 reads? That would mean, I would never miss the deletion and later I can left-align, right-align or do whatever with the variant.

stretch mapper caller • 1.7k views

ADD COMMENT • link updated 5.7 years ago by Kevin Blighe 89k • written 5.7 years ago by Marvin ▴ 220

0

Entering edit mode

left align indels

ADD REPLY • link 5.7 years ago by rbagnall ★ 1.8k

0

Entering edit mode

Thank you but that is not the problem that I have described. I know the text is a little bit long but I'm talking about unfortunate mapping which might lead to not even seeing a variant. The link instead describes what to do after seeing the same variant in different callers.

ADD REPLY • link 5.7 years ago by Marvin ▴ 220

0

Entering edit mode

Homopolymer stretches are common sources of false indel and variant calls. I guess you simply do not have a variant and all you see are artifacts. Plase search around for advice on how to deal with longer homopolymer stretches such as this one. Polymerases seem to have trouble dealing with them so both short read NGS and Sanger sequencing is not reliable there.

ADD REPLY • link 5.7 years ago by ATpoint 89k

0

Entering edit mode

"I guess you simply do not have a variant and all you see are artifacts" -> No, in my made up example the biological truth is that there is 1 T deleted. And all reads are perfetly fine without artifacts. The question is, what does the mapper do with regards to asking the question "which T do I mark as deleted" 100 times. My assumption is that it would choose the same position 100 times but I'm not 100% certain about that.

ADD REPLY • link 5.7 years ago by Marvin ▴ 220

2

Entering edit mode

Sorry.., you are right, that was for vcf files, but the principal is the same for BAM files.

You will never which T was deleted. The convention is to left align indels at these positions in each read, and this can be done on BAM files using, for example, the GATK LeftAlignIndels tool.

For some genotyping tools, like GATK Haplotype caller, it is not necessary to post process the BAM file in this way as Haplotype Caller will do this on-th-fly

ADD REPLY • link 5.7 years ago by rbagnall ★ 1.8k

0

Entering edit mode

I totally forgot about the step "indel realignment" in GATK best practices. I think that is exactly the step which avoids the problem I've described above.

Even in a scenario where the programmer had opted for choosing a random T for each read, the probability for that worst-case distribution which I've described above would have been extremely low (1 over 20^100). However in this scenario there would be multiple distributions in which the caller says "the del's are noise, this is a T" at every position (so the overall chances of the swallowed-T-case occuring would be higher than 1 over 20^100).

So basically I think there are 2 factors which come into play: the mapper is an algorithm which always does the same thing for each read, therefore it leads to at least somewhat similar results for each read at that indel (unless of course the programmer chose random positions on purpose). Therefore there is no uniform distribution across the space of possibilities. And then after the "somewhat similar" results are produced, the indel realignment takes care of the rest (second factor).

ADD REPLY • link 5.7 years ago by Marvin ▴ 220

score 0 · Answer 1 · 2019-12-05

These cases are obviously going to be handled differently based on the combination of the aligner and the variant caller that is being used. One would have to delve into the 'inner workings' of the internal algorithms. I don't believe anybody has time to do that, though.

From my own experience, variant calls in 'short' homopolymers (e.g. <15bp) can be consistent IF there is good read coverage that spans the entire homopolymer. For much larger homopolymers and regions of repeat sequence, there really is no hope for short-read NGS technologies.

If you wanted to mask for —or isolate— these problematic regions, then, ideally, you would have a BED file of known homopolymers and utilise that somewhere in your variant calling pipeline. I'm beginning to think that no single-pass pipeline can call all variants with high sensitivity / specificity, and that separate runs of the pipeline will have to be run for special cases like homopolymers, regions of repeat sequence, regions that exhibit sequence homology, etc., with different parameter configurations for each. Take a look here for ideas how to identify homopolymers in your reference genome: Code golf: detecting homopolymers of length N in the (human) genome

Kevin