Question

GATK HC not calling variants at the edges after clipping probes

0

Entering edit mode

7.5 years ago

Floydian_slip ▴ 170

Hi, I am facing a strange issue with GATK 3.4. I have a set of PE fastq files. I first ran the variant calling pipeline using the following steps:

bwa mem --> sam to bam --> mark duplicates using picard --> RealignerTargetCreator GATK 3.4 --> IndelRealigner --> HaplotypeCaller --> GenotypeGVCFs

Then I saw that the probe sequences that were used to design the region of interest (its a custom-designed panel) overlapped with the some variants of interest. So, the genotypes of some of these variants was 0/1 (due to probes seqs in those positions) instead of the the true 1/1 (I am using a gold standard sample).

So, then I clipped the first 27 bases of each read (the probes are all around that range 22-31 nts with 27 being the most common length). And then I ran the same above pipeline but some of the variants were not called in spite of all reads having those variants.

I am attaching a snapshot of the bam file alignment with the variant in orange in the center. Top track is the first case (before clipping the probes) where the probes are present in the upper reference block without the variant and the lower block with the insert that carry the variant and hence the 0/1 call.

https://ibb.co/dTpRfw

Snapshot

Lower track is after the clipping of the probe sequences and so only the insert with the variant are left. But the variant is still not called in the VCF file.

I read in another thread that this happens because the tool needs 50bp on either side to do proper reassembly. Is that correct? If so, how do I get around it? Should I not use the IndelRealignment? Any other suggestion to solve this issue?

Thanks a bunch in advance!!

GATK sequencing • 1.9k views

ADD COMMENT • link 7.3 years ago by Floydian_slip ▴ 170

0

Entering edit mode

Just to let you all know that using Unified Genotyper called the variants that HC missed. So using UG now.

ADD REPLY • link 7.3 years ago by Floydian_slip ▴ 170

0

Entering edit mode

Great - thanks for the feedback. Be wary of the Unified Genotyper, though (not even recommended for use anymore, as far as I'm aware).

ADD REPLY • link 7.1 years ago by Kevin Blighe 89k

score 1 · Answer 1 · 2017-11-14

I imagine that the variant is remaining undetected due to 'read position' bias, because it's now only being called at the end of one set of reads and lacks support from other mappings. Take a look at ReadPosRankSumTest:

Seeing an allele only near the ends of reads is indicative of error, because that is where sequencers tend to make the most errors. However, some variants located near the edges of sequenced regions will necessarily be covered by the ends of reads, so we can't just set an absolute "minimum distance from end of read" threshold. That is why we use a rank sum test to evaluate whether there is a difference in how well the reference allele and the alternate allele are supported.

I think that you will find it difficult to call this. If you managed to reduce the read end bias, you would then call false-positives in other reads.

One method that I would encourage you to try is to 'reshuffle' your reads and then re-call variants. Use Picard DownsampleSam on your aligned BAM and extract reads 'randomly' at frequencies of 0.9, 0.8, 0.7, etc., and then call variants on the outputs. I would expect the variant to be called correctly in one of these as, by 'reshuffling', the probabilities of calling the variant are also altered.