There are 68 reads covering my SNP of interest:
samtools view mybam.bam 6:1611802-1611802 | wc -l
68
This matches what I see in IGV (after turning off all the filters in the preferences of course!).
Now I call with GATK:
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I mybam.bam -L 6:1611802-1611802
and I get this output:
6 1611802 . G GGGC 1618.73 . AC=2;AF=1.00;AN=2;DP=58;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=59.32;QD=34.24;SOR=0.984 GT:AD:DP:GQ:PL 1/1:0,37:37:99:1656,111,0
Take a look at the last column. The depth (DP) at this position is 37. It's not 68 because GATK has filtered out 68-37 = 31 reads apparently. But now check out this output:
INFO 13:01:45,228 MicroScheduler - 44 reads were filtered out during the traversal out of approximately 275 total reads (16.00%)
INFO 13:01:45,228 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
INFO 13:01:45,228 MicroScheduler - -> 43 reads (15.64% of total) failing DuplicateReadFilter
INFO 13:01:45,228 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 13:01:45,229 MicroScheduler - -> 1 reads (0.36% of total) failing HCMappingQualityFilter
INFO 13:01:45,229 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 13:01:45,229 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 13:01:45,229 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
INFO 13:01:45,229 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
This cannot be the statistics I'm looking for. I would like to know for what reason those 31 reads have been filtered out. So what is the last output talking about? 275 reads? That's not the coverage of this SNP ... ? I need this exact statistics but the values do not match my expectations. I would expect total reads to equal 68 and then the sum of those "MicroScheduler -> x" values to be 31 (since 31 reads have been filtered out).
Hello,
it's just a guess, but GATKs HaplotypeCaller is doing a denovo assembly around the region it suspects a variant. So I guess the 275 reads in the log file are the reads with which the assembly is done.
To see how GATK remap the reads you can follow this how-to.
fin swimmer