Hi,
I'm encountering issues with my post-calling VCF filtration process.
After obtaining an annotated VCF using FreeBayes, I attempted to apply hard filtering parameters similar to those recommended by GATK. However, I noticed that aside from QD, almost all metrics annotated by GATK VariantAnnotator either had unrealistic values, such as FS = 0, or were entirely missing (shown as "."), such as MQ, MQRankSum, and ReadPosRankSum.
Here's a summary of the annotations from the VCF:
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus">
##INFO=<ID=QR,Number=1,Type=Integer,Description="Reference allele quality sum in phred">
##INFO=<ID=QA,Number=A,Type=Integer,Description="Alternate allele quality sum in phred">
##INFO=<ID=SRF,Number=1,Type=Integer,Description="Number of reference observations on the forward strand">
##INFO=<ID=SRR,Number=1,Type=Integer,Description="Number of reference observations on the reverse strand">
##INFO=<ID=SAF,Number=A,Type=Integer,Description="Number of alternate observations on the forward strand">
##INFO=<ID=SAR,Number=A,Type=Integer,Description="Number of alternate observations on the reverse strand">
This absence of expected annotations raises concerns about the integrity of the annotations provided. I'm considering whether I should re-annotate all metrics using GATK or proceed with the existing annotations and focus solely on filtering using QD from the GATK-added file (ensuring CHROM|POS|REF|ALT match the variants called by FreeBayes) or default QUAL. Would using QD or QUAL alone suffice for effective filtration?
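Since the header above already declares per-strand counts (SRF, SRR, SAF, SAR), one possible workaround is to derive an FS-like strand-bias score directly from the FreeBayes fields. The following is only a minimal sketch under assumptions: it treats FS as the Phred-scaled p-value of a two-sided Fisher's exact test on the 2x2 strand table, handles a single alternate allele (SAF and SAR are per-allele lists at multi-allelic sites), and uses hypothetical function names. GATK applies extra processing to its own FS values, so the numbers are not expected to match exactly and the GATK thresholds may not transfer directly.

```python
import math


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = math.comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def strand_bias_phred(srf, srr, saf, sar):
    """FS-like score: Phred-scaled Fisher p-value from FreeBayes strand counts.

    srf/srr: reference observations on the forward/reverse strand.
    saf/sar: observations of ONE alternate allele on the forward/reverse strand.
    """
    p = fisher_exact_two_sided(srf, srr, saf, sar)
    # Guard against log10(0) when the p-value underflows; "+ 0.0" avoids "-0.0".
    return -10 * math.log10(max(p, 1e-300)) + 0.0


if __name__ == "__main__":
    # Balanced strands: no evidence of bias, so the score is 0.
    print(strand_bias_phred(10, 10, 10, 10))
    # Reference reads only forward, alternate reads only reverse: higher score.
    print(strand_bias_phred(3, 0, 0, 3))
```

Higher scores mean stronger evidence that the reference and alternate alleles are seen on different strands. The four counts could be pulled from the INFO column with something like `bcftools query`, but that step is left out here.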
I appreciate any suggestions or insights. I find the GATK annotations like QualByDepth (QD), FisherStrand (FS), StrandOddsRatio (SOR), RMSMappingQuality (MQ), MappingQualityRankSumTest (MQRankSum), and ReadPosRankSumTest (ReadPosRankSum) particularly beneficial due to their comprehensive nature and recommended thresholds, which simplify the filtering process for me as a master's student without extensive background in setting detailed filter thresholds.
Thank you for your assistance.
GATK hard-filtering guide: https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants
Sorry, I don't understand. Your VCF was called using FreeBayes: why do you expect to find GATK annotations in this VCF?
Not quite, sorry. I ran GATK VariantAnnotator separately on the FreeBayes calls, with the reference used for the FreeBayes calling. I thought that this way I could get the annotations and then add them to the annotated file with bcftools annotate. It may still be wrong to annotate with GATK, extract the values with bcftools query, and add them without bcftools annotate. Sorry about the post, it may be badly translated.
The problem with the GATK annotations was that my annotated VCF was called by FreeBayes. Site-level values such as MQ, FS, or DP don't exist, because the metrics are computed per sample.
So technically I can't use GATK, because I would need to call all the variants again. The only useful metric I can create is QD, from the site QUAL and the DP summed across samples.
I forgot to say why I am trying to avoid redoing the calling: mainly because my PC has limited RAM and I have no access to a university machine.
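The QD idea above could be sketched roughly as follows. This is an approximation under assumptions, not GATK's own calculation: it divides the site QUAL by the per-sample FORMAT DP summed over samples that carry at least one non-reference allele (my reading of how QD is normalized, which GATK bases on allele depths rather than DP), and the function name `approx_qd` is hypothetical.

```python
def approx_qd(vcf_line):
    """Approximate QD for one VCF data line: QUAL / summed DP of variant samples.

    Returns None when QUAL is missing, when GT or DP is absent from FORMAT,
    or when no variant sample has a usable depth.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]
    if qual == "." or len(fields) < 10:
        return None

    keys = fields[8].split(":")
    if "GT" not in keys or "DP" not in keys:
        return None

    total_depth = 0
    for sample in fields[9:]:
        values = dict(zip(keys, sample.split(":")))
        alleles = values.get("GT", ".").replace("|", "/").split("/")
        # Assumption: skip samples that are missing or homozygous reference.
        if all(allele in ("0", ".") for allele in alleles):
            continue
        depth = values.get("DP", ".")
        if depth.isdigit():
            total_depth += int(depth)

    return float(qual) / total_depth if total_depth > 0 else None


if __name__ == "__main__":
    # Made-up example record with three samples (not real data).
    line = "chr1\t100\t.\tA\tG\t60\t.\tNS=3\tGT:DP\t0/1:10\t0/0:20\t1/1:20"
    print(approx_qd(line))
```

Because the denominator differs from the one GATK uses, the recommended QD threshold from the hard-filtering guide should be treated as a starting point and checked against the distribution of values in the actual data.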