Hi!
I am trying to run the Gatk HaplotypeCaller (human data):
./gatk-4.1.2.0/gatk HaplotypeCaller\
--reference ref.fa \
--input file.bam \
--genotyping-mode GENOTYPE_GIVEN_ALLELES \
--alleles allele_chunk_file.vcf \
--intervals file.bed \
--output out/file.vcf
After running the above command for any given sample, only ~ 3 sites are called and all of them have a 'LowQual' filter attached, for example:
chr16 89919714 . C A 0 LowQual AC=0;AF=0.00;AN=2;DP=35;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.257 GT:AD:DP:GQ:PL 0/0:35,0:35:99:0,105,1210
chr16 89919722 . T C 0 LowQual AC=0;AF=0.00;AN=2;DP=32;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.206 GT:AD:DP:GQ:PL 0/0:32,0:32:96:0,96,1148
chr16 89919736 . C T 0 LowQual AC=0;AF=0.00;AN=2;DP=31;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.223 GT:AD:DP:GQ:PL 0/0:31,0:31:92:0,92,976
chr16 89919746 . G A 168.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.137;DP=34;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=4.96;ReadPosRankSum=0.955;SOR=0.409 GT:AD:DP:GQ:PL 0/1:26,8:34:99:176,0,674
chr16 89920138 . G C 0 LowQual AC=0;AF=0.00;AN=2;DP=30;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.307 GT:AD:DP:GQ:PL 0/0:30,0:30:90:0,90,862
chr16 89957794 . G A 0 LowQual AC=0;AF=0.00;AN=2;DP=43;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.428 GT:AD:DP:GQ:PL 0/0:43,0:43:99:0,129,1554
chr16 89957798 . A G 545.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-4.803;DP=44;ExcessHet=3.0103;FS=15.365;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=12.40;ReadPosRankSum=-0.858;SOR=2.363 GT:AD:DP:GQ:PL 0/1:21,23:44:99:553,0,622
chr17 81629785 . C T 693.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=3.231;DP=57;ExcessHet=3.0103;FS=2.273;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=12.17;ReadPosRankSum=-1.027;SOR=1.012 GT:AD:DP:GQ:PL 0/1:33,24:57:99:701,0,908
chr20 34077942 . A . 0 LowQual AC=0;AF=0.00;AN=2;DP=47;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.509 GT:AD:DP:GQ:PL 0/0:47,0:47:99:0,144,2147483647
chr20 34197406 . C G 0 LowQual AC=0;AF=0.00;AN=2;DP=58;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.327 GT:AD:DP:GQ:PL 0/0:58,0:58:99:0,174,2030
I've checked the bam file and both the known sites VCF and the called sites VCFs in IGV. There are many instances where the call should be homozygous reference but there is no call in the called VCF. There is clearly sufficient read coverage of one of the known sites VCF calls to call it as homozygous reference but there is no call in the called VCF.
Now I'm trying to find a command that can find the 0/0 genotype and distinguish it between really low coverage
I believe GATK's HaptoypeCaller is behaving correctly: "For each potentially variant site, the program applies Bayes' rule, using the likelihoods of alleles given the read data". Generally, GATK will only emit segregating sites, as in most cases homozygote reference sites are uninformative in many use cases.
If you are using a recent version of GATK (which I don't think you are since the
--genotyping-mode
argument appears to be deprecated), you can use either of these--output-mode [EMIT_ALL_CONFIDENT_SITES|EMIT_ALL_ACTIVE_SITES]
asEMIT_VARIANTS_ONLY
is used by default. Docs for this option here.Thanks for your reply! I have another problem with GATK Haplotypecaller. Is there a way to keep the RS field in output vcf file. This is my command:
The input vcf looks like this:
But the output does not include rsid
Thanks)
I've not used a SNP database in my analyses on non-model organisms, but I believe you are looking for the VariantAnnotator tool. It has the
--dbsnp
parameter to annotate variants with dbSNP IDs where applicable. And according to the best practices for germline short variant discovery, VariantAnnotator should be one of the final steps in the pipeline.