Hello, I was wondering about general considerations for computing overlaps between genomic regions and doing Monte Carlo-type statistics.
Below is a description of how I do it. Unfortunately I'm not fully confident that this is correct, so I'd appreciate any thoughts on it.
For example, I have an experimental dataset (A) of 10 bp coordinates; this dataset contains approx. 5,000 entries spread across the genome.
Then I have another experimental dataset (B), ChIP-seq peaks of ~1,000 bp each, with ~50,000 entries across the genome.
If I perform the overlap/intersection with BEDTools, I get my overlap, e.g. 2,000 entries from A.
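For reference, this is the step I mean, as a minimal sketch (the file names A.bed and B.bed are placeholders for the two datasets):

    # Count entries in A that overlap at least one peak in B.
    # -u reports each A entry once, no matter how many B peaks it hits.
    bedtools intersect -a A.bed -b B.bed -u | wc -l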
But I also want to find overlaps in the vicinity of the ChIP-seq peaks, so I extend these peaks by e.g. 1,000 bp on each side. There are still 50,000 entries, but the amount of genome searched becomes larger, and some of the extended peaks may now overlap each other.
So I do the intersection of A and B again, counting each entry in A only once. This gives me e.g. 3,000 entries from A.
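A sketch of that step, assuming a BEDTools genome file (here called genome.txt) listing the chromosome sizes:

    # Extend each peak by 1,000 bp on both sides (slop needs chromosome
    # sizes so it can clip extensions at chromosome ends).
    bedtools slop -i B.bed -g genome.txt -b 1000 > B_ext.bed
    # Intersect again; -u still counts each A entry only once, so extended
    # peaks that now overlap each other do not inflate the count.
    bedtools intersect -a A.bed -b B_ext.bed -u | wc -l

If you also want to know how much of the genome is actually being searched after extension, you can merge the extended peaks first (sort -k1,1 -k2,2n B_ext.bed | bedtools merge -i stdin) and sum the merged interval lengths.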
For the simulations, I use random intervals that look like dataset B: I pick 50,000 random 1,000 bp coordinates, intersect them with A, and repeat this 1,000 times. That gives me e.g. an average of 500 entries from A.
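One way to generate exactly this kind of null distribution is bedtools shuffle, which randomly relocates the input intervals while keeping their sizes; a sketch, again using the placeholder file names from above:

    # Randomly relocate the B intervals 1,000 times and record the
    # overlap count with A for each iteration.
    for i in $(seq 1 1000); do
        bedtools shuffle -i B.bed -g genome.txt \
        | bedtools intersect -a A.bed -b stdin -u \
        | wc -l
    done > null_counts.txt

By default shuffle places intervals uniformly across the genome; flags such as -chrom (keep each interval on its original chromosome) or -excl (exclude assembly gaps or blacklist regions) may give a more realistic null model, depending on your data.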
For the overlaps in the vicinity, I calculate the total size of the extended dataset B and generate random intervals with the same individual lengths and total size in bp (size-matched sampling).
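Since bedtools shuffle preserves each interval's length, shuffling the extended file from above is already size-matched, so the same loop works; the observed count can then be compared against the null distribution for an empirical p-value. A sketch, using the example numbers from this post:

    # Size-matched null for the extended peaks: shuffle preserves interval
    # lengths, so the total bp searched matches B_ext.bed in every iteration.
    for i in $(seq 1 1000); do
        bedtools shuffle -i B_ext.bed -g genome.txt \
        | bedtools intersect -a A.bed -b stdin -u \
        | wc -l
    done > null_counts_ext.txt
    # Empirical p-value: fraction of simulations reaching the observed count
    # (3,000 in the example above), with +1 so p is never exactly zero.
    awk -v obs=3000 '$1 >= obs {c++} END {print (c+1)/(NR+1)}' null_counts_ext.txt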
I hope you can follow this way of thinking.
So the question basically is: is this correct? And how far can I extend my intervals before the overlap becomes artificial? At the largest sizes, dataset B covers ~15% of the genome, and the overlap then captures almost all entries from A; this is still far higher than in any of the 1,000 simulations.
Any thoughts are appreciated. For example, would it be better to turn it around and make the entries in A larger instead?