Question

Statistical tests for NGS profile/enrichment plots

5.6 years ago

ATpoint 89k

I am looking for a general approach / statistical test to claim statistical significance in enrichment plots as shown below, e.g. from ChIP- or ATAC-seq. Let's assume the x-axis is a window around peak centers and the y-axis is normalized counts. I realize this has been discussed before (Chip-Seq Enrichment Profile Significance?) but the thread is quite old and does not fully cover/answer my question, so please see below:

Problems:

1) Would one use the summit read count (so at 0bp in the below plots) or rather a certain window around the center position? I guess it is difficult to define an exact center that is representative (especially if one has many samples/conditions) so a small window would probably make sense, but what about the size? I realize that there is no bullet-proof answer as it is dataset-dependent but please feel free to share your best practices.

2) What if the peak is not "sharp" with one clear summit like for TFs but more complex, such as the ones one gets from H3K27ac or H3K9me3 (Figure D in the image below) with a peak-valley-peak pattern, or even broad marks such as H3K27me3? Would one define a window to span the entire peak-valley-peak area and then summarize counts over this window?

3) Which test to use? Intuitively I guess a Wilcoxon test makes sense? Still, (correct me if wrong) from what I've seen this test easily produces very high significance when sample size is large, which is the case when dealing with counts from thousands of peaks. Is this concern justified? Is Wilcoxon appropriate or should one use something more tailored to large numbers of elements in the test?

4) Finally, say I perform multiple comparisons between different samples, would one need to correct for multiple testing here, even though we are dealing with low numbers of tests in comparison to e.g. DEG analysis where one performs thousands or tens-of-thousands of tests?

I would be interested in your best practices. Eventually I will have to incorporate / automate it into my R pipeline for ChIP-seq so advice on a general approach and suggestions of appropriate statistics would be preferred over tool suggestions. Still, any comments will be appreciated.

enter image description here

ChIP-seq Wilcoxon • 2.7k views

ADD COMMENT • link updated 12 months ago by rls_08 ▴ 40 • written 5.6 years ago by ATpoint 89k

score 2 · Answer 1 · 2020-01-07

You've done 3 random samples in those plots. I think that is the correct approach, but instead of 3, I'd go for 1000. You can then either:

1) Plot the range/density of the random profiles, and overlay the test profiles for visual significance.
2) Compute an empirical p-value/FDR for each point on the x-axis by asking how many of your random profiles exceed your test profile at that point. As you'll have 10,000 of these points, you definitely need to consider multiple testing correction here.

The best way to do this would be to produce the full matrix of all TSS and then select random TSSs to pool over, rather than selecting random TSS and then computing their profile.
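A minimal sketch of that idea in Python (the same logic ports directly to R), assuming you already have the full regions x bins signal matrix. The function names, the simulated Poisson matrix and the one-sided "enrichment" direction are all illustrative assumptions, not part of any existing package:

```python
import numpy as np

def empirical_pvalues(matrix, test_idx, n_random=1000, seed=0):
    """Per-bin empirical p-values for a test profile against random profiles.

    matrix   : regions x bins signal matrix (e.g. all TSSs or all peaks)
    test_idx : row indices of the regions that make up the test set
    """
    rng = np.random.default_rng(seed)
    test_profile = matrix[test_idx].mean(axis=0)
    k = len(test_idx)
    random_profiles = np.empty((n_random, matrix.shape[1]))
    for i in range(n_random):
        # Sample rows from the precomputed matrix, then pool (average) them
        rows = rng.choice(matrix.shape[0], size=k, replace=False)
        random_profiles[i] = matrix[rows].mean(axis=0)
    # One-sided: how often does a random profile reach the test profile?
    # The +1 keeps the empirical p-value from ever being exactly zero.
    exceed = (random_profiles >= test_profile).sum(axis=0)
    pvals = (exceed + 1) / (n_random + 1)
    return test_profile, random_profiles, pvals

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out

# Simulated example: 2000 regions, 50 bins, first 100 regions enriched in bins 20-29
demo_rng = np.random.default_rng(1)
matrix = demo_rng.poisson(5, size=(2000, 50)).astype(float)
test_idx = np.arange(100)
matrix[test_idx, 20:30] += 5.0

test_profile, random_profiles, pvals = empirical_pvalues(matrix, test_idx)
padj = benjamini_hochberg(pvals)
```

Note that the smallest achievable p-value here is 1 / (n_random + 1), so the number of random draws limits how much multiple-testing correction the result can survive.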

score 2 · Answer 2 · 2020-01-07

5.6 years ago

Devon Ryan 105k

We generally prefer to compare each bin in the profile. The deepStats package should aid with this if you're using deepTools, since it uses the output from computeMatrix. A Wilcoxon test can be run for each position and the p-values then corrected for multiple comparisons. I suppose one could use bump hunting or other methods to then determine exact regions of difference with appropriate locally corrected p-values, but I'm not aware of anyone having actually implemented that.
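For anyone who wants to try the per-bin comparison without deepStats, here is a rough sketch using SciPy. It assumes two regions x bins matrices covering the same regions in the same order (so a paired signed-rank test applies); `mat_a`, `mat_b` and the simulated data are hypothetical, and this is not a reproduction of what deepStats does internally. The per-bin median difference is reported alongside the p-values, since with thousands of regions even tiny shifts become significant:

```python
import numpy as np
from scipy.stats import wilcoxon

def per_bin_wilcoxon(mat_a, mat_b):
    """Paired Wilcoxon signed-rank test for every bin (column)."""
    n_bins = mat_a.shape[1]
    pvals = np.ones(n_bins)
    for j in range(n_bins):
        diff = mat_a[:, j] - mat_b[:, j]
        # The test is undefined when every paired difference is zero
        if np.any(diff != 0):
            pvals[j] = wilcoxon(mat_a[:, j], mat_b[:, j]).pvalue
    return pvals

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out

# Simulated example: 500 regions, 40 bins, condition B higher in bins 15-24
rng = np.random.default_rng(42)
mat_a = rng.normal(10.0, 2.0, size=(500, 40))
mat_b = mat_a + rng.normal(0.0, 1.0, size=(500, 40))
mat_b[:, 15:25] += 1.0

pvals = per_bin_wilcoxon(mat_a, mat_b)
padj = benjamini_hochberg(pvals)
# Effect size per bin: median paired difference (A minus B)
effect = np.median(mat_a - mat_b, axis=0)
```

If the two profiles come from different region sets, an unpaired rank-sum test (`scipy.stats.mannwhitneyu`) would be the analogous choice.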

ADD COMMENT • link 5.6 years ago by Devon Ryan 105k

Thanks for the answer. I guess one could try something like calculating p-values for the single bins and then combine adjacent bins that are significant to actually define (or rather approximate depending on bin size) the parts of the profile that are significantly enriched (or depleted). Will follow up on this suggestion.
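A quick sketch of how the merging step could look, assuming a vector of adjusted per-bin p-values is already available. The function name and the `bin_size`, `start` and `min_bins` parameters are made up for illustration, and this is only a simple run-merging heuristic, not bump hunting with region-level p-values:

```python
def significant_runs(padj, alpha=0.05, bin_size=10, start=0, min_bins=2):
    """Merge consecutive bins with padj < alpha into (start_bp, end_bp) intervals.

    start    : coordinate of the left edge of the first bin (e.g. -1000)
    min_bins : runs shorter than this are discarded as isolated hits
    """
    regions = []
    run_start = None
    # The trailing sentinel is never significant, so it closes any open run
    for i, p in enumerate(list(padj) + [float("inf")]):
        if p < alpha:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_bins:
                regions.append((start + run_start * bin_size,
                                start + i * bin_size))
            run_start = None
    return regions

example_padj = [0.5, 0.01, 0.02, 0.9, 0.03, 0.2, 0.001, 0.004, 0.01]
merged = significant_runs(example_padj)
# merged == [(10, 30), (60, 90)]; the single significant bin at index 4 is dropped
```

The resolution of the resulting intervals is limited by the bin size, so the boundaries are approximations.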

ADD REPLY • link 5.6 years ago by ATpoint 89k

Devon Ryan, do you know if there is any tutorial out there on how to use the deepStats package? I don't see any activity on their GitHub for the last 3 years and the documentation there is limited. Do you happen to know of any other package or tool to determine if the curves in the deepTools metaplots are significantly different?

ADD REPLY • link 12 months ago by rls_08 ▴ 40