Question

How Do You Calculate If Two Sets Of Genomic Regions Overlap Significantly?

20

Entering edit mode

14.3 years ago

Stew ★ 1.4k

Given two sets of genomic locations, such as two bed files of ChIP-Seq peaks, how would you calculate if they overlap more than would be expected by chance?

I have been trying a couple of methods.

1) Using the BEDTools shuffleBed tool to randomise one or both of the files repeatedly (say 1000 times) then running intersectBed to determine a distribution of expected overlaps. I then work out an empirical p-value based on the number of the random samples that overlaps to a greater degree than my real overlap.

2) Using the GSC (genome structural correction) tools from Encode.

I have been thinking that method 1 limited to specific regions, rather than whole genome, might be a better approach. I think the assumption that the regions could be anywhere on the genome when shuffled is not accurate. This is something that method two tries to address, but I just can not seem to get it work reliably.

So how should I calculate the significance of my overlaps? My aim is to look at the overlap of peaks between biological replicates of CHiP-Seq and between different peaks callers etc.

chip-seq coverage statistics bed • 21k views

ADD COMMENT • link updated 7.0 years ago by bwd ▴ 90 • written 14.3 years ago by Stew ★ 1.4k

0

Entering edit mode

To anyone arriving to this question, at this other question there's a useful list of tools and some more detailed answers

ADD REPLY • link 5.6 years ago by bernatgel ★ 3.4k

Ram · Answer 1 · 2011-02-11

7

Entering edit mode

14.3 years ago

Ian 6.1k

When comparing set A to set B, that are from ChIP-seq data, I have used a method where I obtain random regions to resemble B (the same length) from the mapable region of, in my case, the human genome. This data can be obtained from the test UCSC browser 'CRG mapability'. The overlap between A and B, to A and random can be tested by a Fisher Exact Chi-square. This way random regions cannot come from areas of the genome that could never be binding regions.

I am also looking at alternative methods.

You might be interested in the The Genome HyperBrowser http://hyperbrowser.uio.no/hb/. Where you can ask questions just like this. It offers are range of null hypotheses, some complex like maintaining inter-feature distance, and randomising the feature calls, etc. I quite like the look of this and intend to test it out myself.

David Huen (Cambridge, UK) gave a seminar and discussed this problem. He has an R package called 'cooccur', see http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2910723/ which does a similar sort of comparison.

ADD COMMENT • link 14.3 years ago by Ian 6.1k

0

Entering edit mode

Thanks for the answer, I am trying Cooccur now.

ADD REPLY • link 14.3 years ago by Stew ★ 1.4k

0

Entering edit mode

Hi! I am also on my way to use Cooccur package, but I am experiencing some problem like the package works pretty well with the example data. but when I load in my data

>library(Cooccur)
a<-read.table("path")
b<-read.table("path")

everything goes is ok till here.

>explore_pairs(c("a","b"))

This is where I get stuck.
ERROR:

Error in sum(sapply(tf1_s, score_sample, tf2_hits = tf2_s, hit_list = hit_l)) :
  invalid 'type' (list) of argument

Thank you, It will be very kind of you if you could email me on my id abhisheksinghnl@gmail.com

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 13.7 years ago by Dataminer ★ 2.8k

0

Entering edit mode

make sure your data have exactly three columns named 'seq' 'start' 'end' in that order

ADD REPLY • link 13.7 years ago by Malcolm.Cook ★ 1.5k

score 5 · Answer 2 · 2011-02-11

5

Entering edit mode

14.3 years ago

Casey Bergman 18k

This is a trickier problem than it first seems, and there appears to be no consensus yet on the best method to do this. For a summary of most methods available, see the review by Fu and Adryan on this topic: http://www.ncbi.nlm.nih.gov/pubmed/19763325

For interval-based data, I second Ian's suggestion about Huen's method, which follows on from work by Haiminen et al.: http://www.ncbi.nlm.nih.gov/pubmed/18691400.

ADD COMMENT • link 14.3 years ago by Casey Bergman 18k

0

Entering edit mode

That review looks interesting, but unfortunately it is not open access, and we do not have a subscription. Peter Bickel's group at Berkeley did a lot of work on this for ENCODE, that we used on the last CHiP-chip paper, but his methods do not seem to be taking off in the CHiP-Seq world.

ADD REPLY • link 14.3 years ago by Stew ★ 1.4k

0

Entering edit mode

That review: http://people.pwf.cam.ac.uk/qf205/Papers/FuAdryan2009.pdf

ADD REPLY • link 13.7 years ago by Malcolm.Cook ★ 1.5k

0

Entering edit mode

And now in PubMedCentral, in case the personal link above disappears: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3475982/

ADD REPLY • link 12.2 years ago by David Quigley 11k

score 1 · Answer 3 · 2018-04-30

MSPC is designed specifically for this purpose!

Download it from this page, and use it as the following:

MSPC -i rep1.bed -i rep2.bed -i rep3.bed -r biological -s 1E-8 -w 1E-4 -g 1E-9 -c 2 -a 0.005

where

-s sets stringent threshold: peaks in input samples with p-value more significant than this threshold, are considered "stringent" peaks;
-w sets weak threshold: peaks in input samples with p-values less significant that this threshold, are considered "weak" peaks;
-g how strong should be p-value of overlapping peaks so to consider them "significant";
-c a peak should overlap with how many peaks at minimum to be considered for analysis;
-a BH multiple testing correction.

See this page for a complete list of arguments.