Question

GATK: SNP coverage too high?

0

Entering edit mode

7.4 years ago

Marvin ▴ 220

There are 68 reads covering my SNP of interest:

samtools view mybam.bam 6:1611802-1611802 | wc -l
68

This matches what I see in IGV (after turning off all the filters in the preferences of course!).

Now I call with GATK:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I mybam.bam -L 6:1611802-1611802

and I get this output:

6   1611802 .   G   GGGC    1618.73 .   AC=2;AF=1.00;AN=2;DP=58;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=59.32;QD=34.24;SOR=0.984    GT:AD:DP:GQ:PL  1/1:0,37:37:99:1656,111,0

Take a look at the last column. The depth (DP) at this position is 37. It's not 68 because GATK has filtered out 68-37 = 31 reads apparently. But now check out this output:

INFO  13:01:45,228 MicroScheduler - 44 reads were filtered out during the traversal out of approximately 275 total reads (16.00%) 
INFO  13:01:45,228 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter 
INFO  13:01:45,228 MicroScheduler -   -> 43 reads (15.64% of total) failing DuplicateReadFilter 
INFO  13:01:45,228 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter 
INFO  13:01:45,229 MicroScheduler -   -> 1 reads (0.36% of total) failing HCMappingQualityFilter 
INFO  13:01:45,229 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter 
INFO  13:01:45,229 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter 
INFO  13:01:45,229 MicroScheduler -   -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter 
INFO  13:01:45,229 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter

This cannot be the statistics I'm looking for. I would like to know for what reason those 31 reads have been filtered out. So what is the last output talking about? 275 reads? That's not the coverage of this SNP ... ? I need this exact statistics but the values do not match my expectations. I would expect total reads to equal 68 and then the sum of those "MicroScheduler -> x" values to be 31 (since 31 reads have been filtered out).

gatk • 2.1k views

ADD COMMENT • link updated 7.4 years ago by harold.smith.tarheel ★ 5.0k • written 7.4 years ago by Marvin ▴ 220

0

Entering edit mode

Hello,

it's just a guess, but GATKs HaplotypeCaller is doing a denovo assembly around the region it suspects a variant. So I guess the 275 reads in the log file are the reads with which the assembly is done.

To see how GATK remap the reads you can follow this how-to.

fin swimmer

ADD REPLY • link 7.4 years ago by finswimmer 16k

score 2 · Answer 1 · 2017-07-07

2

Entering edit mode

7.4 years ago

harold.smith.tarheel ★ 5.0k

Your variant is an insertion ( G -> GGGC). Insertions and deletions are challenging to align, especially near the ends of reads. GATK performs local realignment (not de novo assembly) to resolve mapping discrepancies. The realignment window extends beyond the position of the variant, and therefore contains more reads.

As for why some of the reads are filtered, the MicroScheduler info indicates that most of them are duplicate reads (and one has low mapping quality).

ADD COMMENT • link 7.4 years ago by harold.smith.tarheel ★ 5.0k

1

Entering edit mode

From the HaplotypeCaller manual:

The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region.