Given two sets of genomic locations, such as two BED files of ChIP-Seq peaks, how would you test whether they overlap more than would be expected by chance?
I have been trying a couple of methods.
1) Using the BEDTools shuffleBed tool to randomise one or both of the files repeatedly (say 1000 times), then running intersectBed on each randomised set to build a distribution of expected overlaps. I then work out an empirical p-value from the number of random samples that overlap to a greater degree than my real data.
2) Using the GSC (genome structural correction) tools from Encode.
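To make the logic of method 1 concrete, here is a minimal pure-Python sketch of the shuffle-and-count procedure. It is a simplified stand-in for shuffleBed and intersectBed, not a reimplementation of either: the function names, the toy `chrom_sizes` dictionary and the choice to keep each interval on its own chromosome are all assumptions made for illustration. It also counts shuffles that overlap at least as much as the real data and adds one to numerator and denominator, a common convention that keeps the p-value from ever being exactly zero.

```python
import random

def count_overlaps(a, b):
    """Count intervals in `a` that overlap at least one interval in `b`.
    Intervals are (chrom, start, end) tuples, half-open as in BED."""
    by_chrom = {}
    for chrom, start, end in b:
        by_chrom.setdefault(chrom, []).append((start, end))
    n = 0
    for chrom, start, end in a:
        if any(start < e and s < end for s, e in by_chrom.get(chrom, [])):
            n += 1
    return n

def shuffle_intervals(intervals, chrom_sizes, rng):
    """Move each interval to a uniform random start on its own chromosome,
    keeping its length (an assumed, simplified null model)."""
    shuffled = []
    for chrom, start, end in intervals:
        length = end - start
        new_start = rng.randint(0, chrom_sizes[chrom] - length)
        shuffled.append((chrom, new_start, new_start + length))
    return shuffled

def empirical_p(a, b, chrom_sizes, n_perm=1000, seed=0):
    """Return (observed overlap count, empirical p-value)."""
    rng = random.Random(seed)
    observed = count_overlaps(a, b)
    # Number of shuffles with at least as many overlaps as the real data.
    hits = sum(
        count_overlaps(shuffle_intervals(a, chrom_sizes, rng), b) >= observed
        for _ in range(n_perm)
    )
    return observed, (hits + 1) / (n_perm + 1)

if __name__ == "__main__":
    # Hypothetical toy data: two identical peak sets on one chromosome.
    sizes = {"chr1": 1_000_000}
    peaks = [("chr1", i * 10_000, i * 10_000 + 200) for i in range(20)]
    print(empirical_p(peaks, peaks, sizes, n_perm=200))
```

With real data the overlap counting would normally be left to intersectBed; the linear scan here is only meant to keep the example self-contained.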
I have been thinking that method 1, limited to specific regions rather than the whole genome, might be a better approach. I think the assumption that the regions could land anywhere on the genome when shuffled is not accurate. This is something that method 2 tries to address, but I just cannot seem to get it to work reliably.
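The restricted shuffle could look something like the sketch below, which only ever places an interval where it fits entirely inside one of a set of allowed regions. The `shuffle_within` name and the `allowed` list are hypothetical, and what counts as an allowed region (mappable sequence, promoters, open chromatin and so on) is an assumption that has to come from the biology, not from the code.

```python
import random

def shuffle_within(intervals, allowed, rng):
    """Place each interval uniformly at random among the positions where it
    fits entirely inside one of the `allowed` regions.
    Both arguments are lists of (chrom, start, end), half-open as in BED."""
    shuffled = []
    for _chrom, start, end in intervals:
        length = end - start
        # Keep only regions large enough to hold this interval.
        fits = [(c, s, e) for c, s, e in allowed if e - s >= length]
        if not fits:
            raise ValueError(f"no allowed region can hold an interval of length {length}")
        # Weight each region by its number of valid start positions, so that
        # every valid placement is equally likely.
        weights = [e - s - length + 1 for _c, s, e in fits]
        c, s, e = rng.choices(fits, weights=weights, k=1)[0]
        new_start = rng.randint(s, e - length)
        shuffled.append((c, new_start, new_start + length))
    return shuffled

if __name__ == "__main__":
    # Hypothetical allowed regions and peaks, for illustration only.
    allowed = [("chr1", 1_000, 2_000), ("chr2", 500, 5_000)]
    peaks = [("chr1", 1_200, 1_400), ("chr2", 3_000, 3_150)]
    print(shuffle_within(peaks, allowed, random.Random(0)))
```

Swapping a function like this in for the whole-genome shuffle leaves the rest of the empirical p-value calculation unchanged; only the null model differs.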
So how should I calculate the significance of my overlaps? My aim is to look at the overlap of peaks between biological replicates of ChIP-Seq, between different peak callers, etc.
For anyone arriving at this question: this other question has a useful list of tools and some more detailed answers.