In my group we had a number of "problematic" sequencing runs, so I was asked to make sure that the variants output by my analyses are sufficiently covered and fall within the sensitivity limit of the validation instrument (5%), so that they can be validated correctly.
Upon looking at my VCFs and the spec, though, I noticed that the DP field for each sample in a multisample VCF reports all the reads found, regardless of whether they carry the reference or the alternate base(s). GATK's VCFs have the AD field, but, at least according to their documentation, using it is not recommended because it includes unfiltered reads.
Considering that I have full access to all the files generated for the analysis, what's the best course of action to extract coverage for the reference and the variant allele at a given site in the VCF file?
Thanks in advance.
Isn't there a DP4 field in the VCF showing read coverage for ref/alt on both strands (that makes 4 numbers)? For some reason, though, the four numbers do not necessarily add up to the DP field; maybe some filtered reads don't count?
Correct. DP is not filtered, DP4 is.
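To make the DP4 layout concrete, here is a minimal sketch in plain Python that pulls the four numbers (ref-forward, ref-reverse, alt-forward, alt-reverse) out of an INFO column and sums them per allele. The function name `parse_dp4` and the INFO string are made up for illustration; nothing here comes from a real run.

```python
def parse_dp4(info):
    """Return (ref_depth, alt_depth) from a VCF INFO string, or None if DP4 is absent."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "DP4":
            # DP4 order: ref-forward, ref-reverse, alt-forward, alt-reverse
            ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in value.split(","))
            return ref_fwd + ref_rev, alt_fwd + alt_rev
    return None


# Hypothetical INFO column, for illustration only
info = "DP=60;DP4=30,20,3,2;MQ=50"
ref_depth, alt_depth = parse_dp4(info)
alt_fraction = alt_depth / (ref_depth + alt_depth)
print(ref_depth, alt_depth, round(alt_fraction, 3))
```

Note that in this made-up record the DP4 values sum to 55 while DP is 60, which mirrors the mismatch described above.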
Yes, DP4 should be good if you want allele counts aggregated across all samples. If you want this broken down per sample, GATK's AD field is the only out-of-the-box solution I know (as far as I know, samtools doesn't do anything similar).
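For the per-sample case, a rough sketch of reading AD straight from a VCF data line could look like the following. It assumes a tab-separated line with a FORMAT column containing AD (reference depth first, then one value per alternate allele), and it keeps the caveat from the question in mind: AD may include unfiltered reads. The helper name `per_sample_ad`, the example record, and the use of 0.05 as a cutoff (taken from the 5% sensitivity mentioned in the question) are assumptions for illustration.

```python
def per_sample_ad(vcf_line):
    """Return one (ref_depth, alt_depth) tuple per sample, or None where AD is missing."""
    fields = vcf_line.rstrip("\n").split("\t")
    keys = fields[8].split(":")          # FORMAT column
    if "AD" not in keys:
        return []
    ad_index = keys.index("AD")
    result = []
    for sample in fields[9:]:            # sample columns start at the 10th field
        values = sample.split(":")
        # Trailing fields may be dropped, and missing values are written as "."
        if ad_index >= len(values) or "." in values[ad_index].split(","):
            result.append(None)
            continue
        depths = [int(x) for x in values[ad_index].split(",")]
        # First value is the reference depth; the rest are alternate alleles
        result.append((depths[0], sum(depths[1:])))
    return result


# Hypothetical two-sample record, for illustration only
line = "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=100\tGT:AD:DP\t0/1:40,10:50\t0/0:48,2:50"
MIN_ALT_FRACTION = 0.05  # assumed cutoff, from the 5% sensitivity in the question

for counts in per_sample_ad(line):
    if counts is None or sum(counts) == 0:
        print("no AD available")
        continue
    ref_depth, alt_depth = counts
    fraction = alt_depth / (ref_depth + alt_depth)
    print(ref_depth, alt_depth, fraction >= MIN_ALT_FRACTION)
```

With this made-up record, the first sample (10 of 50 reads) passes the cutoff and the second (2 of 50) does not.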