Context
After developing our NGS pipeline, we observed that while our sensitivity is 100%, we also call many False Positive (FP) variants. Examining these FPs, I identified two approaches that, combined, can effectively detect the majority of them without eliminating True Positives (TPs).
- The first approach:
For each variant, calculate the mean mapping quality of the reads that support the ALT allele and the mean mapping quality of the reads that support the REF allele, then take the difference (ALT minus REF). Results show that a group of FP variants can be detected this way because these FPs arise from mapping errors when sequencing genes with sequences similar to pseudogenes (e.g., NF1). See the plot below:
The three homozygous TPs at the top right are there because this approach doesn't work for variants with a VAF of 100% (there are no REF-supporting reads to compare against). I need to improve my code, but I am not worried about these cases. The FP with a ~85% VAF is caused by a PCR error; nothing can be done about these cases because both the REF- and ALT-supporting reads show high quality. The group of FPs with a ~45% VAF are calls caused by poly-T tracts; nothing can be done about these either. A simplified version of the per-variant calculation is sketched below.
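The sketch assumes pysam, an indexed BAM, and a single SNV position; the function and argument names are illustrative rather than my production code:

```python
import pysam

def mq_difference(bam_path, chrom, pos, ref_base, alt_base):
    """Mean mapping quality of ALT-supporting reads minus REF-supporting reads.

    pos is 0-based; only reads covering the site with a matching base are counted.
    """
    alt_mq, ref_mq = [], []
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for col in bam.pileup(chrom, pos, pos + 1, truncate=True):
            if col.reference_pos != pos:
                continue
            for pread in col.pileups:
                if pread.is_del or pread.is_refskip:
                    continue
                base = pread.alignment.query_sequence[pread.query_position]
                if base == alt_base:
                    alt_mq.append(pread.alignment.mapping_quality)
                elif base == ref_base:
                    ref_mq.append(pread.alignment.mapping_quality)
    if not alt_mq or not ref_mq:
        return None  # e.g. VAF ~100%: no REF-supporting reads, so the difference is undefined
    return sum(alt_mq) / len(alt_mq) - sum(ref_mq) / len(ref_mq)
```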
- The second approach:
The second approach is similar but uses the base quality score instead of the mapping quality.
Results show that when the sequencing machine makes an error, it assigns a lower Q score than when it does not, so the difference in base quality between the reads that support the REF and the reads that support the ALT can also be used to detect FPs without eliminating TPs. See the plot below:
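The base-quality version is the same pileup walk, just collecting the per-base quality at the variant position instead of the mapping quality (again a simplified, illustrative sketch assuming pysam):

```python
import pysam

def bq_difference(bam_path, chrom, pos, ref_base, alt_base):
    """Mean base quality of ALT-supporting reads minus REF-supporting reads (pos is 0-based)."""
    alt_bq, ref_bq = [], []
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for col in bam.pileup(chrom, pos, pos + 1, truncate=True):
            if col.reference_pos != pos:
                continue
            for pread in col.pileups:
                if pread.is_del or pread.is_refskip:
                    continue
                base = pread.alignment.query_sequence[pread.query_position]
                bq = pread.alignment.query_qualities[pread.query_position]
                if base == alt_base:
                    alt_bq.append(bq)
                elif base == ref_base:
                    ref_bq.append(bq)
    if not alt_bq or not ref_bq:
        return None
    return sum(alt_bq) / len(alt_bq) - sum(ref_bq) / len(ref_bq)
```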
Combining both approaches, I can separate FPs from TPs, at least in the training data I have. These reads come from amplicon sequencing, and the approach works well for both constitutional and somatic analyses.
In the plot, a cutoff of 4 on the mean mapping-quality difference and a cutoff of 5 on the mean base-quality difference separate TPs from FPs quite successfully.
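Applying the two cutoffs together is then just a threshold check. A toy sketch, assuming each candidate variant carries the two deltas under made-up field names, and assuming that FPs show ALT qualities lower than REF (the sign convention depends on how the difference is defined):

```python
MQ_DIFF_CUTOFF = 4  # cutoff on the mean mapping-quality difference (ALT - REF)
BQ_DIFF_CUTOFF = 5  # cutoff on the mean base-quality difference (ALT - REF)

def is_probable_false_positive(variant):
    """Flag a call as a likely FP when either delta falls below its cutoff.

    `variant` is assumed to be a dict like {"mq_diff": -12.3, "bq_diff": -7.8};
    None deltas (e.g. VAF ~100% calls) are left unflagged.
    """
    mq_diff, bq_diff = variant.get("mq_diff"), variant.get("bq_diff")
    if mq_diff is None or bq_diff is None:
        return False
    return mq_diff < -MQ_DIFF_CUTOFF or bq_diff < -BQ_DIFF_CUTOFF
```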
The problem and question:
Instead of estimating cutoffs by eye, I have been trying statistical tests. For instance, GATK's VariantAnnotator has the BaseQualityRankSumTest annotation, which implements my second approach: it takes the base quality scores of the REF- and ALT-supporting reads and applies a Mann-Whitney-Wilcoxon rank-sum test. The outcome is a numerical value; values close to 0 mean there is no difference between the base qualities of the reads that support the REF and those that support the ALT. The problem is that this is not working for me. The following plot shows the same data, but now with the BaseQualityRankSumTest value on the Y-axis:
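For context, this is roughly the kind of calculation such a rank-sum annotation performs. The sketch below uses scipy's Mann-Whitney U test and is only an approximation of the idea, not GATK's exact implementation (GATK computes its own z-score):

```python
from scipy.stats import mannwhitneyu, norm

def base_quality_rank_sum(ref_quals, alt_quals):
    """Approximate rank-sum z-score: ~0 when REF and ALT base qualities look alike,
    negative when ALT-supporting bases tend to have lower quality than REF."""
    if not ref_quals or not alt_quals:
        return None
    _, p = mannwhitneyu(alt_quals, ref_quals, alternative="two-sided")
    z = norm.isf(p / 2.0)  # magnitude of the two-sided z-score
    mean_alt = sum(alt_quals) / len(alt_quals)
    mean_ref = sum(ref_quals) / len(ref_quals)
    return -z if mean_alt < mean_ref else z  # sign by which group ranks lower
```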
I have also tried calculating Youden's J statistic (TPR - FPR) to find the optimal threshold. However, the value it returns is useless because it is completely out of range (e.g., for the mapping quality it is 37).
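For reference, the way Youden's J is usually applied: J = TPR - FPR is evaluated at every candidate threshold along the ROC curve, and the cutoff to use is the threshold at which J is maximal (J itself lies between -1 and 1). A sketch assuming scikit-learn and illustrative array names (labels = 1 for TP, 0 for FP; scores = the metric being thresholded, e.g. the mapping-quality delta):

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_optimal_cutoff(labels, scores):
    """Return (best_threshold, best_J) over all candidate thresholds of the ROC curve."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    j = tpr - fpr                     # Youden's J at each candidate threshold
    best = int(np.argmax(j))
    return thresholds[best], j[best]  # the cutoff is the threshold, not J itself
```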
What am I doing wrong??
Just commenting, but this is really really clever. Nice work! I would imagine you are going to have to work closely with a bio-statistician to get a reliable answer on this.
Thanks for your comment. Interestingly, the same statistical test works for the MappingQualityRankSumTest: the Z-score I get from the mapping quality splits FPs and TPs nicely, and I can use this information to identify FP variants.