I'm doing a matched comparison of samples, and I'm trying to filter the results by depth. However, I'm not sure how to use the per-sample DP field.
For example, suppose we have matched Sample A and Sample B, and at a particular locus we have a mutation (SNP).
Case 1
- DP for Sample A reports 10
- DP for Sample B reports 15
Case 2
- DP for Sample A reports 10
- DP for Sample B reports none (no DP in genotype)
My problem is how to interpret Case 2 (and similar scenarios, e.g. Sample A having no DP). Given that the per-sample DP (at least as reported by the GATK) counts only the reads that pass the quality control metrics, which of these scenarios is most likely here?
- Nothing can be done: the locus for that specific sample may or may not be wild type, but the filtered read depth is not sufficient to determine that (in R terms, this would mean NA)
- The locus is assumed wild type due to lack of supporting information (reads)
- A wild type locus does not have DP information
This matters to me because I'm currently keeping only matched samples where DP is present in both and higher than a threshold, and I was wondering whether I am being too restrictive.
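To make the filter concrete, here is a minimal Python sketch of what I mean. It assumes the FORMAT column and the two sample columns of a VCF record are already available as strings; the function names and the threshold of 8 are made up for illustration and do not come from the GATK. A missing DP is kept as None (the equivalent of NA in R) rather than being silently treated as zero or as wild type.

```python
def sample_dp(format_field, sample_field):
    """Return DP for one sample column as an int, or None if it is absent."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # A sample column may carry fewer fields than FORMAT (e.g. a bare "./."),
    # so zip() simply stops at the shorter of the two lists.
    fields = dict(zip(keys, values))
    dp = fields.get("DP")
    if dp is None or dp in (".", ""):
        return None
    return int(dp)


def passes_depth_filter(format_field, sample_a, sample_b, min_dp=8):
    """True/False if both samples report DP, None if either DP is missing.

    min_dp is a hypothetical threshold chosen only for this example.
    """
    dp_a = sample_dp(format_field, sample_a)
    dp_b = sample_dp(format_field, sample_b)
    if dp_a is None or dp_b is None:
        return None  # undetermined, not the same as failing the threshold
    return dp_a > min_dp and dp_b > min_dp


# Case 1: both samples report DP (10 and 15).
print(passes_depth_filter("GT:DP", "0/1:10", "0/0:15"))
# Case 2: Sample B has no DP in its genotype column.
print(passes_depth_filter("GT:DP", "0/1:10", "./."))
```

Returning None for Case 2 keeps "not enough information" separate from "fails the threshold", so the two can be counted or handled differently downstream.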
For reference, these results refer to indels generated with the GATK's UnifiedGenotyper in indel mode.
You could check the pileup at that particular locus to confirm that the missing DP is caused by a lack of reads spanning that genomic location in that sample. If that is the case, I presume you cannot make a direct comparison for this SNP between the two samples.