Hello,
I produced a few VCFs using Clara Parabricks DeepVariant and GATK HaplotypeCaller (both the regular way and with the Clara Parabricks HaplotypeCaller, which yielded identical results).
The problem is that the QUAL scores of the two differ enormously: with HaplotypeCaller many variants score in the thousands, while with DeepVariant most are in the 30-50 range.
Here is an example of a variant we are interested in:
chr5 112839942 . C T 2002.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.126;DP=187;ExcessHet=3.0103;FS=2.554;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=10.94;ReadPosRankSum=-3.706;SOR=0.775 GT:AD:DP:GQ:PL 0/1:91,92:183:99:2010,0,1931
The example above was made using HaplotypeCaller, while the one below was made using DeepVariant.
chr5 112839942 . C T 35.1 PASS . GT:GQ:DP:AD:VAF:PL 0/1:34:184:92,92:0.5:35,0,41
I am aware that the latter example is filtered and is lacking a column, but I'm wondering why there is such a massive difference in the quality scores of the two. If anyone could give me a clue I'd be very thankful!
Thanks for your time.
I'm afraid the method used to calculate GQ is not defined in the VCF spec. It's up to the caller to produce a value.
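For a sense of scale: the VCF spec does define QUAL as a Phred-scaled value, i.e. -10*log10 of the probability that the ALT call is wrong, even though how each caller arrives at that probability is its own business. Below is a minimal Python sketch that parses the two records posted above and converts their QUAL values back to probabilities. The helper names (`phred_to_prob`, `parse_record`) are made up for illustration and the parsing assumes a single-sample, whitespace-separated record; it is not a general VCF parser.

```python
# The two records from the question, copied as posted (whitespace-separated).
HC_RECORD = (
    "chr5 112839942 . C T 2002.64 . "
    "AC=1;AF=0.500;AN=2;BaseQRankSum=0.126;DP=187;ExcessHet=3.0103;FS=2.554;"
    "MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=10.94;"
    "ReadPosRankSum=-3.706;SOR=0.775 "
    "GT:AD:DP:GQ:PL 0/1:91,92:183:99:2010,0,1931"
)
DV_RECORD = (
    "chr5 112839942 . C T 35.1 PASS . "
    "GT:GQ:DP:AD:VAF:PL 0/1:34:184:92,92:0.5:35,0,41"
)


def phred_to_prob(q: float) -> float:
    """Convert a Phred-scaled score Q to a probability, 10^(-Q/10)."""
    return 10 ** (-q / 10)


def parse_record(line: str) -> dict:
    """Pull QUAL and the per-sample FORMAT fields out of one single-sample record."""
    fields = line.split()
    qual = float(fields[5])
    # FORMAT keys (column 9) are paired with the sample values (column 10).
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    return {
        "qual": qual,
        "gq": int(sample["GQ"]),
        "ad": [int(x) for x in sample["AD"].split(",")],
        "pl": [int(x) for x in sample["PL"].split(",")],
    }


for name, record in (("HaplotypeCaller", HC_RECORD), ("DeepVariant", DV_RECORD)):
    rec = parse_record(record)
    print(name, rec, "P(call is wrong) =", phred_to_prob(rec["qual"]))
```

Both callers see essentially the same evidence here (AD of 91,92 versus 92,92) and both call 0/1; what differs is how confident each model says it is, so the raw QUAL numbers are probably not comparable across the two callers.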
Hi Victor, a few questions to try to diagnose: 1) Was the pre-processing of both, prior to VCF generation, identical? 2) What settings were used to run these samples? The exact commands issued might help. 3) Anything else we should know? E.g., in one case the sample was jointly called, while in the other it was called singly.
Hello,
The treatment of both was the same: both were generated from the exact same .bam file (made with fq2bam) from WES sequencing with Parabricks, one using Clara Parabricks DeepVariant and the other using Parabricks HaplotypeCaller. We also tried regular GATK HaplotypeCaller and Parabricks germline (which generates BAMs and VCFs, also using HaplotypeCaller). All the methods using HaplotypeCaller gave the same results.
With the exception of DeepVariant, which used the --use-wes-model flag, everything else was the default. Each sample was called individually.
Thanks for the help!
So, I've not used these tools or read the docs in detail, so please critically evaluate this...
Having said that, check out the haplotype caller page, which states:
Overall, this seems to me to be saying that the niche application of the HaplotypeCaller tool is to generate sample metadata. That the GATK and NVIDIA implementations of GATK give similar/identical results is not surprising (if anything, it is probably a bit reassuring). I would double-check what NVIDIA's workflow is doing; it may just be calling the appropriate commands from gatk... It seems haplotypecaller could be used to optimize/maximize variant call accuracy, but for that you would want to include the BQSR report from another variant caller. Overall, it doesn't seem to me that HaplotypeCaller is really meant to be a dedicated variant caller per se, but rather is meant to help you understand the quality assurance metrics associated with a .BAM file. You could probably check with NVIDIA support to confirm/disconfirm these ideas, if no one else weighs in here.
What is the sequencing depth of the sample?
For this specific one it is 147 (.584).
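Depth matters for the HaplotypeCaller number in particular. The HaplotypeCaller record above carries QD=10.94, and as far as I understand the GATK documentation, QD is QUAL normalized by the depth of informative reads (the sum of AD) in the variant sample. A quick Python check of that assumption against the two posted records (the function name `qual_per_read` is just for illustration):

```python
def qual_per_read(qual: float, ad: tuple) -> float:
    """QUAL divided by the number of informative reads (sum of the AD field)."""
    return qual / sum(ad)


# Values copied from the two records in the question.
hc_qd = qual_per_read(2002.64, (91, 92))  # HaplotypeCaller: QUAL and AD
dv_qd = qual_per_read(35.1, (92, 92))     # DeepVariant: QUAL and AD

print("HaplotypeCaller QUAL per read:", round(hc_qd, 2))
print("DeepVariant QUAL per read:", round(dv_qd, 2))
```

The HaplotypeCaller value reproduces the QD annotation in the record, which fits the picture of its QUAL growing with the number of supporting reads. The DeepVariant QUAL, by contrast, stays in the tens despite nearly identical depth, so it does not appear to scale with depth in the same way.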