Question

A criterion to classify probes based on the DP of their bases

2.3 years ago

Manuel ▴ 50

Working in a clinical genomics lab, I have been asked to develop a small application that creates a report listing the probes in our somatic NGS analysis that have low depth.

I think the best approach is to use the gVCF file, which looks like this:

 #CHROM             POS        ID            REF         ALT         QUAL    FILTER   INFO      FORMAT                                           Patient_name_here.bam
chr1       36931696             .               T              .               100         PASS      DP=839 GT:GQ:AD:DP:VF:NL:SB:NC                0/.:100:830:839:0.0107:24:-100.0000:0.0071
chr1       36931697             .               T              .               100         PASS      DP=832 GT:GQ:AD:DP:VF:NL:SB:NC                0/0:15:829:832:0.0036:24:-100.0000:0.0154
chr1       36931698             .               T              .               100         PASS      DP=837 GT:GQ:AD:DP:VF:NL:SB:NC                0/0:36:836:837:0.0012:24:-100.0000:0.0095
chr1       36931699             .               A             .               100         PASS      DP=836 GT:GQ:AD:DP:VF:NL:SB:NC                0/0:36:835:836:0.0012:24:-100.0000:0.0107
chr1       36931700             .               C             .               100         PASS      DP=818 GT:GQ:AD:DP:VF:NL:SB:NC                0/0:14:814:818:0.0049:24:-100.0000:0.0320
chr1       36931701             .               A             .               100         PASS      DP=841 GT:GQ:AD:DP:VF:NL:SB:NC                0/0:20:838:841:0.0036:24:-100.0000:0.0047
chr1       36931702             .               A             .               100         PASS      DP=825 GT:GQ:AD:DP:VF:NL:SB:NC                0/0:19:822:825:0.0036:24:-100.0000:0.0237
chr1       36931703             .               T              .               100         PASS      DP=833 GT:GQ:AD:DP:VF:NL:SB:NC                0/0:26:832:833:0.0012:24:-100.0000:0.0142
chr1       36931704             .               A             .               100         PASS      DP=833 GT:GQ:AD:DP:VF:NL:SB:NC                0/0:11:829:833:0.0048:24:-100.0000:0.0142
chr1       36931705             .               C             .               100         PASS      DP=838 GT:GQ:AD:DP:VF:NL:SB:NC                0/0:27:837:838:0.0012:24:-100.0000:0.0083 
chr1       36931706             .               T              .               100         PASS      DP=791 GT:GQ:AD:DP:VF:NL:SB:NC                0/.:100:783:791:0.0101:24:-100.0000:0.0639
chr1       36931707             .               G             .               100         PASS      DP=836 GT:GQ:AD:DP:VF:NL:SB:NC                0/0:36:836:836:0.0000:24:-100.0000:0.0107
chr1       36931708             .               A             .               100         PASS      DP=833 GT:GQ:AD:DP:VF:NL:SB:NC                0/0:11:829:833:0.0048:24:-100.0000:0.0142
chr1       36931709             .               A             .               100         PASS      DP=831 GT:GQ:AD:DP:VF:NL:SB:NC                0/0:11:827:831:0.0048:24:-100.0000:0.0166
chr1       36931710             .               G             C             100         PASS      DP=831 GT:GQ:AD:DP:VF:NL:SB:NC                0/0:36:831:831:0.0000:24:-100.0000:0.0166
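A minimal sketch of how the per-base DP could be pulled out of records like the ones above. It assumes one record per base (no gVCF block records with `END=`), a single sample column, and no whitespace inside the INFO field; the function name `parse_depth` is made up for illustration.

```python
def parse_depth(line):
    """Return (chrom, pos, depth) for one gVCF data line, or None for header/blank lines."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    # VCF is tab-delimited; split() also tolerates the space padding in the excerpt.
    fields = stripped.split()
    chrom, pos = fields[0], int(fields[1])
    format_keys = fields[8].split(":")    # e.g. GT:GQ:AD:DP:VF:NL:SB:NC
    sample_values = fields[9].split(":")  # values in the same order as the keys
    depth = int(sample_values[format_keys.index("DP")])
    return chrom, pos, depth


record = ("chr1 36931696 . T . 100 PASS DP=839 "
          "GT:GQ:AD:DP:VF:NL:SB:NC 0/.:100:830:839:0.0107:24:-100.0000:0.0071")
print(parse_depth(record))
```

The DP is read from the sample column using the position given by the FORMAT key list, so the sketch does not depend on DP always being the fourth sub-field.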

They want to identify probes that have bases with low DP (i.e. DP<50). I will use a BED file to get the coordinates of these probes and then check whether the bases in those regions have low DP. The key point is that we don't have a criterion to classify probes. For example, if I consider a probe bad as soon as it has a single base with DP lower than 50, that is a very strict method, and the probe would actually still be good enough to call variants. The geneticists told me that they discard a SNP if its DP is lower than 50, and that they prefer bases with a depth >200. Between 50 and 200 is a range that can be considered depending on the frequency of the variant, but normally they prefer not to analyse variants between 50 and 200 DP.

With this info, do you have a good criterion to distinguish between good and bad probes?
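As a rough sketch of the BED step, the per-base depths could be grouped by probe and counted in the three ranges the geneticists mentioned (below 50, 50 to 200, above 200). The function names, the probe-name column, and the choice to treat positions missing from the gVCF as DP 0 are assumptions, not something our pipeline defines.

```python
LOW_DP = 50    # below this, the geneticists discard the base
GOOD_DP = 200  # above this, they are comfortable analysing variants


def read_bed(lines):
    """Parse BED lines into (chrom, start, end, name) tuples (0-based, half-open)."""
    probes = []
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, *rest = line.split()
        name = rest[0] if rest else f"{chrom}:{start}-{end}"
        probes.append((chrom, int(start), int(end), name))
    return probes


def bin_probe_depths(probes, depth_by_pos):
    """Count the bases of each probe that fall in the low / intermediate / good range.

    depth_by_pos maps (chrom, 1-based POS) -> DP; missing positions count as DP 0.
    """
    summary = {}
    for chrom, start, end, name in probes:
        counts = {"low": 0, "intermediate": 0, "good": 0}
        # BED starts are 0-based and VCF POS is 1-based, so the probe covers
        # POS values start + 1 through end inclusive.
        for pos in range(start + 1, end + 1):
            dp = depth_by_pos.get((chrom, pos), 0)
            if dp < LOW_DP:
                counts["low"] += 1
            elif dp <= GOOD_DP:
                counts["intermediate"] += 1
            else:
                counts["good"] += 1
        summary[name] = counts
    return summary


probes = read_bed(["chr1\t100\t104\tprobeA"])
depths = {("chr1", 101): 839, ("chr1", 102): 45, ("chr1", 103): 120}
print(bin_probe_depths(probes, depths))
```

Keeping the three counts per probe, rather than a single pass/fail flag, leaves the final classification rule open until the geneticists agree on one.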

Extra info: We already have a report that gives the coverage of the sequenced genes. Bases with DP<50 are considered not covered. For example, if a probe has 100 bases and 2 bases have a DP lower than 50, the coverage of that region will be 98%.

UKAS accreditation raised a complaint: if we have, for example, a gene with a coverage of 95%, how can we know whether the missing 5% is due to an entire probe with all bases below DP 50 or, on the contrary, to small gaps of low DP in many different probes? Therefore, they asked us to complement that report by indicating the number of probes with low DP. What they didn't say is how to establish which probes have low DP and which have an acceptable DP. Being a new bioinformatician in this field, I am asking you for some advice.
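The coverage figure in that example is simply the percentage of bases at or above the DP cut-off. A small sketch of the arithmetic, with a hypothetical function name and made-up depth values:

```python
def breadth_of_coverage(depths, min_dp=50):
    """Percentage of bases whose depth is at least min_dp."""
    if not depths:
        raise ValueError("no bases given")
    covered = sum(1 for dp in depths if dp >= min_dp)
    return 100.0 * covered / len(depths)


# A 100-base probe in which 2 bases fall below DP 50.
probe = [300] * 98 + [20, 49]
print(breadth_of_coverage(probe))  # 98.0
```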
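To illustrate the distinction they are asking for, here is a sketch that splits a gene's low-depth bases by the kind of probe they come from. The 0.9 cut-off for calling a probe "failed" and the three labels are placeholders I made up; the real rule is exactly what I am asking about.

```python
def explain_gene_gaps(probe_depths, min_dp=50, failed_fraction=0.9):
    """Split a gene's low-depth bases by the kind of probe they belong to.

    probe_depths maps probe name -> list of per-base DP values.
    A probe is "failed" when at least failed_fraction of its bases are below
    min_dp (hypothetical cut-off), "partial" when only some bases are low,
    and "ok" when none are.
    """
    report = {"failed": [], "partial": [], "ok": [],
              "low_in_failed": 0, "low_in_partial": 0}
    for name, depths in probe_depths.items():
        n_low = sum(1 for dp in depths if dp < min_dp)
        if depths and n_low / len(depths) >= failed_fraction:
            report["failed"].append(name)
            report["low_in_failed"] += n_low
        elif n_low:
            report["partial"].append(name)
            report["low_in_partial"] += n_low
        else:
            report["ok"].append(name)
    return report


# Two made-up genes, each with 20 probes of 100 bases and 95% coverage at DP>=50.
one_dead_probe = {f"p{i}": [300] * 100 for i in range(19)}
one_dead_probe["p19"] = [10] * 100
scattered_gaps = {f"p{i}": [300] * 95 + [10] * 5 for i in range(20)}

for gene in (one_dead_probe, scattered_gaps):
    result = explain_gene_gaps(gene)
    print(len(result["failed"]), "failed,", len(result["partial"]), "partial")
```

Both genes would show 95% in the existing coverage report, but the first has one failed probe and the second has twenty probes with small gaps.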

Does this make sense to you?

Thanks

Manuel

NGS • 501 views

updated 2.3 years ago by Matthias Zepper 5.0k • written 2.3 years ago by Manuel ▴ 50

Not exactly my area of expertise, but this old answer might be helpful?

Other than that: Why don't you plot a histogram of the DP distribution of your probes? If you generate such aggregate histograms for each probe over 1000 patients, you should get quite reliable ballpark estimates of what an acceptable coverage for each probe is.
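A rough sketch of what I mean, using only the standard library. It assumes you already have one summary value per patient for a probe (for example its mean DP); the function names, the bin width of 50, and the use of the 5th percentile as a "typical low" value are arbitrary choices for illustration.

```python
from collections import Counter
from statistics import median, quantiles


def depth_histogram(values, bin_width=50):
    """Count values per bin; each key is the lower edge of its bin."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))


def probe_baseline(dp_per_patient):
    """Median and 5th percentile of one probe's DP summary across patients."""
    cuts = quantiles(dp_per_patient, n=20)  # 19 cut points; the first is the 5th percentile
    return {"median": median(dp_per_patient), "p5": cuts[0]}


# Made-up mean DP of a single probe in 100 patients.
mean_dp = list(range(100, 1100, 10))
print(depth_histogram(mean_dp, bin_width=250))
print(probe_baseline(mean_dp))
```

A probe whose value in a new patient falls well below its own baseline would then stand out, even if a fixed cut-off such as DP 50 is not reached.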

2.3 years ago by Matthias Zepper 5.0k