This has been a 'burning' issue of mine since 2014 when I was working in a clinical setting and when I naively implemented the GATK pipeline (with HaplotypeCaller; HC) for the purpose of identifying germline variants in patient samples passing through the laboratory. It first began when the clinical scientists noticed that my NGS reports frequently omitted variants that they knew were present in the patient samples via Sanger sequencing. I could then easily spot these missed variants in IGV, but there was no apparent connection between, e.g., GC content, MAPQ, read depth, repetitive sequence, allele fraction, etc., and the missed variants - they just happened in every single sample.
Upon recommendation from a former Sanger Institute employee, I then later tested out samtools mpileup
, and it detected everything. I also tested out GATK and SAMtools on a 1000 Genomes whole genome sample, and again SAMtools detected known variants in the sample, where GATK could not.
After exploration, I found a way to 'assist' the GATK to find the missing variants, a process that basically involved sub-sampling the aligned BAMs to lower read depths and then re-calling variants on each sub-set. A modified version of the pipeline is here, but this version now involves bcftools mpileup
: https://github.com/kevinblighe/ClinicalGradeDNAseq
So, it seems that GATK HC's issues were / are in part dependent on read depth. THIS GitHub issue, started in 2017, points the finger at input padding, i.e., the -ip
command line parameter, and makes for interesting reading. In particular, take a look at THIS comment from that same GitHub issue.
In my conversations over the years to people about this, some seem to not care too much about it, while others are concerned but continue to use the GATK. From my perspective, GATK should only be used in a research setting.
Related thread: A: Best tool for variant calling
Kevin
From where I see it, the missing variants can be devided into two types, one type contains multiple mismatches in short window caused by misalignment on the homologous region. This kind of variants either be called with low LP or not be called in vcf at all. I can see these variants are with low MAPQ from IGV due to multiple mismatch. However, I can't find anything in the HC documentation about taking account of MAPQ during HC calling. It does consider base quality during pairHMM. But that doesn't seem to prevent these variants from being called to vcf (low base quality may cause low likelihood of haplotype by pairHMM. However, they don't set a minimum likelihood of allele and the allele with highest likelihood is emited into vcf anyway )
For the first type that you mention, how can one be sure that the alignments are correct, though? For some repetitive regions / regions of homologous sequences, it is literally impossible for an aligner to align (with confidence) short reads. Calling variants in those regions should be done by alternative technology, such as Sanger, or via alternative capture methods, such as long-range PCR followed by NGS.
Unfortunately, for a clinical lab, this is why so much care must be taken when setting up a clinical test. Despite this fact, though, I have already seen more than enough companies who are just running the standard ~150bp Illumina MiSeq paired-end sequencing, aligning with BWA and calling variants with GATK or BaseSpace, and calling that a clinical test.
I remember sometime people suggest disgarding germline variants if they are occur in vcf within short window. I guess it is reasonable to assume they are more likely to be caused by misalignments.
Alternatively, HC should set up a lower bound for likelihood of allele before normalization are reporting. Say if it got possible genotype of AA, AC and CC at given site and all of these likelihood estimated by PairHMM are really low, it should disgard this site. This is seems so appearent. Maybe they did this but I missed.