Working in a clinical genomics lab, I have been asked to develop a small application that creates a report listing the probes of the somatic NGS analysis that have low depth.
I think the best approach is to use the gVCF file, which looks like this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Patient_name_here.bam
chr1 36931696 . T . 100 PASS DP=839 GT:GQ:AD:DP:VF:NL:SB:NC 0/.:100:830:839:0.0107:24:-100.0000:0.0071
chr1 36931697 . T . 100 PASS DP=832 GT:GQ:AD:DP:VF:NL:SB:NC 0/0:15:829:832:0.0036:24:-100.0000:0.0154
chr1 36931698 . T . 100 PASS DP=837 GT:GQ:AD:DP:VF:NL:SB:NC 0/0:36:836:837:0.0012:24:-100.0000:0.0095
chr1 36931699 . A . 100 PASS DP=836 GT:GQ:AD:DP:VF:NL:SB:NC 0/0:36:835:836:0.0012:24:-100.0000:0.0107
chr1 36931700 . C . 100 PASS DP=818 GT:GQ:AD:DP:VF:NL:SB:NC 0/0:14:814:818:0.0049:24:-100.0000:0.0320
chr1 36931701 . A . 100 PASS DP=841 GT:GQ:AD:DP:VF:NL:SB:NC 0/0:20:838:841:0.0036:24:-100.0000:0.0047
chr1 36931702 . A . 100 PASS DP=825 GT:GQ:AD:DP:VF:NL:SB:NC 0/0:19:822:825:0.0036:24:-100.0000:0.0237
chr1 36931703 . T . 100 PASS DP=833 GT:GQ:AD:DP:VF:NL:SB:NC 0/0:26:832:833:0.0012:24:-100.0000:0.0142
chr1 36931704 . A . 100 PASS DP=833 GT:GQ:AD:DP:VF:NL:SB:NC 0/0:11:829:833:0.0048:24:-100.0000:0.0142
chr1 36931705 . C . 100 PASS DP=838 GT:GQ:AD:DP:VF:NL:SB:NC 0/0:27:837:838:0.0012:24:-100.0000:0.0083
chr1 36931706 . T . 100 PASS DP=791 GT:GQ:AD:DP:VF:NL:SB:NC 0/.:100:783:791:0.0101:24:-100.0000:0.0639
chr1 36931707 . G . 100 PASS DP=836 GT:GQ:AD:DP:VF:NL:SB:NC 0/0:36:836:836:0.0000:24:-100.0000:0.0107
chr1 36931708 . A . 100 PASS DP=833 GT:GQ:AD:DP:VF:NL:SB:NC 0/0:11:829:833:0.0048:24:-100.0000:0.0142
chr1 36931709 . A . 100 PASS DP=831 GT:GQ:AD:DP:VF:NL:SB:NC 0/0:11:827:831:0.0048:24:-100.0000:0.0166
chr1 36931710 . G C 100 PASS DP=831 GT:GQ:AD:DP:VF:NL:SB:NC 0/0:36:831:831:0.0000:24:-100.0000:0.0166
They want to identify probes that contain bases with low DP (i.e. DP<50). I will use a BED file to get the coordinates of these probes and then check whether the bases in those regions have low DP. The key problem is that we don't have a criterion for classifying probes. For example, if I call a probe bad as soon as a single base has a DP lower than 50, that is a very strict method, and in practice the probe would still be good enough to call variants. The geneticists told me that they discard a SNP if its DP is lower than 50, and that they prefer bases with a depth >200. Between 50 and 200 a variant may be considered depending on its frequency, but normally they prefer not to analyse variants between 50 and 200 DP.
With this info, do you have a good criterion to distinguish between good and bad probes?
Extra info: We already have a report that gives the coverage of the sequenced genes. Bases with DP<50 are considered not covered. For example, if a probe has 100 bases and 2 of them have a DP lower than 50, the coverage of that region will be 98%.
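To make that coverage calculation concrete, here is a minimal sketch of how it could be computed per probe from the gVCF. It assumes one record per base (as in the excerpt above), that DP is read from the INFO column, and that the probe coordinates come from a BED file (0-based, half-open). The function names and the example records are hypothetical, not part of our pipeline.

```python
import re

MIN_DP = 50  # threshold from the post: bases below this count as not covered


def parse_depths(gvcf_lines):
    """Return {(chrom, pos): DP} from gVCF records, reading DP from INFO."""
    depths = {}
    for line in gvcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split()
        chrom, pos, info = fields[0], int(fields[1]), fields[7]
        match = re.search(r"(?:^|;)DP=(\d+)", info)
        if match:
            depths[(chrom, pos)] = int(match.group(1))
    return depths


def probe_coverage(depths, chrom, bed_start, bed_end, min_dp=MIN_DP):
    """Fraction of probe bases with DP >= min_dp.

    BED intervals are 0-based half-open, VCF positions are 1-based, so the
    probe spans VCF positions bed_start+1 .. bed_end. Positions missing from
    the gVCF are treated as DP 0 (an assumption of this sketch).
    """
    positions = range(bed_start + 1, bed_end + 1)
    covered = sum(1 for p in positions if depths.get((chrom, p), 0) >= min_dp)
    return covered / len(positions)


# Hypothetical records for illustration only
example = [
    "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample",
    "chr1 101 . T . 100 PASS DP=839 GT 0/0",
    "chr1 102 . T . 100 PASS DP=30 GT 0/0",
    "chr1 103 . A . 100 PASS DP=832 GT 0/0",
    "chr1 104 . C . 100 PASS DP=837 GT 0/0",
]
print(probe_coverage(parse_depths(example), "chr1", 100, 104))  # 0.75
```

If the gVCF uses block records (an END tag covering several bases), the parser would need to expand them first. That case is not handled here.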
The UKAS accreditation assessment raised the following point: if, for example, a gene has a coverage of 95%, how can we know whether the missing 5% is due to an entire probe whose bases all have a DP lower than 50 or, on the contrary, to small gaps of low DP spread over many different probes? Therefore, they asked us to complement that report with the number of probes with low DP. What they didn't say is how to decide which probes have low DP and which have an acceptable DP. Being a new bioinformatician in this field, I am asking for your advice.
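One way I could imagine separating "a whole probe failed" from "small gaps in many probes" is to report a few numbers per probe instead of a single good/bad label. The sketch below is only an illustration: the function name is hypothetical, the 50 and 200 cut-offs are the ones the geneticists mentioned, and the input is assumed to be the list of per-base DP values along one probe in coordinate order.

```python
def summarise_probe(dps, low=50, preferred=200):
    """Summarise per-base depths along one probe.

    dps: DP values for consecutive bases of the probe, in coordinate order.
    Bases with DP < low count as not covered; bases with low <= DP <= preferred
    fall in the intermediate band the geneticists prefer not to analyse.
    """
    longest = run = 0
    for dp in dps:
        # extend the current run of low-depth bases, or reset it
        run = run + 1 if dp < low else 0
        longest = max(longest, run)
    n_low = sum(dp < low for dp in dps)
    return {
        "n_bases": len(dps),
        "n_low": n_low,
        "n_intermediate": sum(low <= dp <= preferred for dp in dps),
        "pct_covered": 100.0 * (len(dps) - n_low) / len(dps),
        "longest_low_run": longest,
        "min_dp": min(dps),
    }


# A 100-base probe with 2 adjacent low-depth bases (made-up values)
print(summarise_probe([300] * 98 + [10, 20]))
```

A probe where longest_low_run equals n_bases failed completely, while a probe with a high pct_covered and a short longest_low_run only has a small gap. Where to draw the line between the two is still the open question.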
Does this make sense to you?
Thanks
Manuel
Not exactly my area of expertise, but this old answer might be helpful?
Other than that: why don't you plot a histogram of the DP distribution of your probes? If you generate such aggregate histograms for each probe over 1000 patients, you should get quite reliable ballpark estimates of what an acceptable coverage for each probe is.
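A rough sketch of the histogram idea, using only the Python standard library so it does not depend on a plotting package. It assumes you have already collected, for one probe, one depth value per patient (for example the mean DP of the probe). The function names, the bin width and the example values are all made up for illustration.

```python
from collections import Counter


def dp_histogram(values, bin_width=50):
    """Count values per bin; each key is the lower edge of its bin."""
    return Counter((int(v) // bin_width) * bin_width for v in values)


def print_histogram(values, bin_width=50):
    """Print a simple text histogram, one row per non-empty bin."""
    hist = dp_histogram(values, bin_width)
    for lower in sorted(hist):
        upper = lower + bin_width - 1
        print(f"{lower:>5}-{upper:<5} {'#' * hist[lower]}")


# Hypothetical per-patient depths for a single probe
print_histogram([10, 49, 50, 120, 199, 830])
```

With real data, a probe whose values cluster in the lowest bins across most patients is probably a systematically weak probe, whereas a probe that is low in only a few patients points to a sample-specific problem.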