Question

Interpret Bedtools Overlap?

3.7 years ago

kstangline ▴ 80

I have a fairly simple question regarding bedtools.

I've been asked to find the intersections between two types of sample peaks (ChIP-seq peaks). The goal is to see if they're similar or not (i.e. can we use our new method if it gives similar results/peaks to our old method).

I've used the following command to estimate the reproducibility:

bedtools intersect -u -a sample1.bed -b sample2.bed -wa | wc -l

I then took the intersection value and divided it by the total of -a (sample1) to get the reproducibility rate. In other words, I'm showing the % of peaks in sample 1 that are reproduced in sample 2.
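The calculation described above can be sketched roughly as follows. The function names `overlap_percent` and `peak_overlap` are hypothetical. The sketch assumes `bedtools` is on the `PATH` and that the BED files contain one peak per line with no header or track lines (otherwise `wc -l` overcounts). It drops `-wa`, since `-u` already reports each `-a` entry at most once, so the count should not change.

```shell
#!/usr/bin/env bash

# Hypothetical helper: percentage of A peaks that have at least one overlap in B.
# Arguments: number of overlapping A peaks, total number of A peaks.
overlap_percent() {
    awk -v hit="$1" -v total="$2" 'BEGIN {
        if (total == 0) { print "NA"; exit }   # avoid division by zero
        printf "%.1f\n", 100 * hit / total
    }'
}

# Hypothetical wrapper around the command from the post.
# Arguments: BED file A, BED file B.
peak_overlap() {
    local a="$1" b="$2" hit total
    hit=$(bedtools intersect -u -a "$a" -b "$b" | wc -l)   # A peaks overlapping B
    total=$(wc -l < "$a")                                  # all A peaks
    overlap_percent "$hit" "$total"
}

# Usage (not run here):
#   peak_overlap sample1.bed sample2.bed
#   peak_overlap sample2.bed sample1.bed
```

Note that the measure is asymmetric: swapping the two files changes the denominator, so the two usage lines will generally give different percentages.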

How would I explain these results to a wet lab scientist if the reproducibility (overlap) is > 60%?

From my understanding, an overlap (reproducibility) greater than 60% is considered a good score because it's less likely to have occurred by chance?

Would I need to calculate a p value to show that there is a really good overlap?

updated 3.7 years ago by Istvan Albert 101k • written 3.7 years ago by kstangline ▴ 80

score 1 · Answer 1 · 2021-03-25

Think about it this way: Is 50% a surprising chance to win a coin toss? How about having a 50% chance to win the lottery?

The point I am trying to make is that the value of an observation, interpreted as novel information, relates to how unlikely (aka informative) it is.

60% is only meaningful if you also know how unlikely it is to get 60% by chance alone. In a sense, that likelihood is what p-values try to capture.

In your case, you would need to quantify how likely it is that you would get 60% overlap even if the phenomenon of interest (the one you associate with overlap) were not present. Alternatively, ask what fraction would overlap if you picked ChIP-seq data for similar tissues and states but under conditions that contradict your hypothesis. With that, you can build up a confidence level as to what is credible overlap and what are accidental, systemic similarities.
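One possible way to estimate how much overlap arises by chance is a shuffling test, sketched below. This is an illustration under stated assumptions, not a recommended protocol: the function names `empirical_p` and `null_overlaps` are hypothetical, `genome.txt` is an assumed two-column file of chromosome names and lengths, and the file names follow the question. The empirical p-value uses the common (k + 1) / (n + 1) convention, where k is the number of shuffles with at least as many overlapping peaks as observed and n is the number of shuffles.

```shell
#!/usr/bin/env bash

# Hypothetical helper: empirical p-value for an observed overlap count.
# Reads null overlap counts (one per line) on stdin; argument is the observed count.
empirical_p() {
    awk -v obs="$1" '
        NF { n++; if ($1 + 0 >= obs + 0) k++ }        # count shuffles >= observed
        END { printf "%.4f\n", (k + 1) / (n + 1) }'
}

# Hypothetical helper: overlap counts after randomly relocating the A peaks.
# Arguments: BED file A, BED file B, genome file, number of shuffles (default 1000).
# -chrom keeps each peak on its original chromosome; -seed makes runs repeatable.
null_overlaps() {
    local a="$1" b="$2" genome="$3" n="${4:-1000}" i
    for ((i = 1; i <= n; i++)); do
        bedtools shuffle -i "$a" -g "$genome" -chrom -seed "$i" |
            bedtools intersect -u -a stdin -b "$b" | wc -l
    done
}

# Usage (not run here):
#   obs=$(bedtools intersect -u -a sample1.bed -b sample2.bed | wc -l)
#   null_overlaps sample1.bed sample2.bed genome.txt 1000 | empirical_p "$obs"
```

Shuffling uniformly across the genome is a crude null model, because ChIP-seq peaks tend to concentrate in particular regions. Restricting the shuffle to a plausible background with the `-incl` option of `bedtools shuffle`, or comparing against unrelated ChIP-seq data as suggested above, would likely give a more realistic baseline.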

Also, I would not call this "reproducibility"; in my opinion, that means something else. What you observe is replication: your replicates recapitulate some but not all of the information.