General Considerations For Genomic Overlaps?
10.6 years ago
plfalcon81 • 0

Hello, I was wondering about general considerations for overlapping genomic regions and computing Monte Carlo-type statistics on the result.

Below I describe how I currently do it. Unfortunately I'm not fully confident that this is correct, so I'd appreciate any thoughts on it.

For example, I have an experimental dataset (A) of 10 bp intervals, comprising approx. 5,000 entries spread across the genome.

Then I have another experimental dataset (B) (ChIP-seq) of ~1,000 bp intervals, with ~50,000 entries spread across the genome.

If I intersect A with B using BEDTools, I get my overlap, e.g. 2,000 entries from A.

But I also want to find overlaps in the vicinity of the ChIP-seq peaks, so I extend these peaks, e.g. by 1,000 bp on each side. There are still 50,000 entries, but the portion of the genome that is searched becomes larger, and some B entries may now overlap each other.

So I intersect A and B again, counting each entry in A only once. This gives me e.g. 3,000 entries from A.
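The extend-then-count step can be sketched in plain Python. This is only a toy illustration on a single chromosome with half-open, BED-style coordinates; the helper names `extend_and_merge` and `count_overlapping` are made up for this example and are not BEDTools functions. Merging the extended peaks first is one way to make sure each A entry is counted once even when several extended B entries cover it (with BEDTools itself, `bedtools intersect -u` reports each A entry at most once).

```python
from bisect import bisect_right

def extend_and_merge(intervals, flank):
    """Extend each (start, end) by `flank` bp on both sides, then merge overlaps."""
    extended = sorted((max(0, s - flank), e + flank) for s, e in intervals)
    merged = []
    for s, e in extended:
        if merged and s <= merged[-1][1]:
            # Overlapping or touching the previous interval: grow it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

def count_overlapping(a_intervals, merged_b):
    """Count A entries that overlap at least one merged B interval (each A once)."""
    starts = [s for s, _ in merged_b]
    count = 0
    for a_start, a_end in a_intervals:
        # Last merged B interval that starts before the A entry ends.
        i = bisect_right(starts, a_end - 1) - 1
        if i >= 0 and merged_b[i][1] > a_start:
            count += 1
    return count

# Hypothetical toy data, not taken from the datasets in the question.
a = [(100, 110), (1500, 1510), (5000, 5010)]
b = [(200, 1200), (1300, 2300)]

print(count_overlapping(a, extend_and_merge(b, 0)))     # direct overlap
print(count_overlapping(a, extend_and_merge(b, 1000)))  # peaks extended by 1,000 bp
```

Note that after extension the two toy peaks fuse into one interval, which is exactly the "some entries may now overlap" effect described above: the number of B entries stays the same, but the covered fraction of the genome grows by less than 2 x flank per entry.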

For the simulations, I use random intervals that look like dataset B. E.g. I place 50,000 intervals of 1,000 bp at random positions, intersect them with A, and repeat this 1,000 times. This gives me e.g. an average of 500 entries from A.

For the overlaps in the vicinity, I calculate the total size of the (extended) dataset B and generate random intervals with the same lengths and the same total size in bp as dataset B (size-matched sampling).
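A minimal sketch of this kind of simulation, using only the Python standard library, might look like the following. Everything here is an assumption made for illustration: a single toy chromosome of 1 Mb, uniformly random placement, made-up data sizes, and hypothetical helper names. Uniform placement ignores assembly gaps, unmappable regions, and chromosome structure; `bedtools shuffle` with an exclusion file is one commonly used way to respect those. The p-value uses the add-one correction so that it is never reported as exactly zero.

```python
import random
from bisect import bisect_right

def merge(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for s, e in sorted(intervals):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

def count_hits(a_intervals, b_intervals):
    """Number of A entries overlapping at least one B interval (each A once)."""
    merged = merge(b_intervals)
    starts = [s for s, _ in merged]
    hits = 0
    for a_start, a_end in a_intervals:
        i = bisect_right(starts, a_end - 1) - 1
        if i >= 0 and merged[i][1] > a_start:
            hits += 1
    return hits

def random_intervals(n, width, genome_len, rng):
    """Place n intervals of fixed width uniformly at random on one toy chromosome."""
    return [(s, s + width)
            for s in (rng.randrange(genome_len - width + 1) for _ in range(n))]

def empirical_p(observed, null_counts):
    """One-sided empirical p-value with add-one correction."""
    exceed = sum(c >= observed for c in null_counts)
    return (1 + exceed) / (1 + len(null_counts))

rng = random.Random(0)          # fixed seed so the sketch is reproducible
GENOME_LEN = 1_000_000          # toy chromosome length (assumption)
a = random_intervals(200, 10, GENOME_LEN, rng)
# Toy B: 100 peaks placed around the first 100 A entries, plus 100 random peaks.
b = [(max(0, s - 500), max(0, s - 500) + 1000) for s, _ in a[:100]]
b += random_intervals(100, 1000, GENOME_LEN, rng)

observed = count_hits(a, b)
null_counts = [count_hits(a, random_intervals(len(b), 1000, GENOME_LEN, rng))
               for _ in range(200)]
p_value = empirical_p(observed, null_counts)
print(observed, sum(null_counts) / len(null_counts), p_value)
```

Reporting the whole null distribution (or at least the p-value) rather than only the average of the simulations makes it easier to judge whether an observed count is actually unusual.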

I hope you can follow this way of thinking.

So the question basically is: is this correct? And how far can I extend my intervals before the overlap becomes artificial? With the largest extension, dataset B covers ~15% of the genome and captures almost all entries from A. This is far more than in any of the 1,000 simulations.

Any thoughts are appreciated, e.g. would it be better to turn it around and enlarge the entries in A instead?
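One rough way to reason about how far the extension can go is a back-of-the-envelope calculation. Assuming A entries are short and would land uniformly at random, the expected fraction of A hit by chance is approximately the fraction of the genome that B covers. The sketch below uses the 5,000 entries and 15% coverage from the question; the observed count of 4,800 is a hypothetical stand-in for "almost all entries", and the function names are made up for this example.

```python
def expected_hits(n_a, covered_fraction):
    """Approximate number of A entries hit by chance under uniform placement."""
    return n_a * covered_fraction

def fold_enrichment(observed, n_a, covered_fraction):
    """Observed hits divided by the chance expectation."""
    return observed / expected_hits(n_a, covered_fraction)

def max_fold_enrichment(covered_fraction):
    """Upper bound on fold enrichment, reached when every A entry is hit."""
    return 1.0 / covered_fraction

n_a = 5000            # entries in A (from the question)
coverage = 0.15       # fraction of the genome covered by extended B (from the question)
observed = 4800       # hypothetical value standing in for "almost all entries"

print(expected_hits(n_a, coverage))
print(fold_enrichment(observed, n_a, coverage))
print(max_fold_enrichment(coverage))
```

The upper bound is the useful part: the largest possible fold enrichment is 1 / coverage, so the more the intervals are extended, the less room there is for the real data to differ from the simulations, and the comparison gradually loses its power to discriminate.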

bedtools genomic overlap statistics • 2.7k views
10.6 years ago

If I understand correctly you want to know whether the A intervals are spatially related to the B intervals, right?

Instead of extending the A intervals, I would assign to each A interval its closest B interval and use that distance to compare the real data with the randomizations.
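A minimal sketch of this distance-based comparison, again on a single toy chromosome with made-up data and hypothetical helper names, could look like this. It takes the median gap (in bp, 0 if overlapping) from each A entry to its nearest B interval and compares it with the medians obtained from randomly placed B-like intervals. In practice `bedtools closest -d` can report the per-entry distances; the uniform random placement used here is a simplifying assumption.

```python
import random
from bisect import bisect_right
from statistics import median

def merge(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for s, e in sorted(intervals):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

def nearest_gap(a, merged_b, starts):
    """Gap in bp between interval `a` and the nearest merged B interval (0 if overlapping)."""
    a_start, a_end = a
    i = bisect_right(starts, a_start) - 1
    best = None
    for j in (i, i + 1):  # nearest candidate on the left and on the right
        if 0 <= j < len(merged_b):
            b_start, b_end = merged_b[j]
            gap = max(0, b_start - a_end, a_start - b_end)
            best = gap if best is None else min(best, gap)
    return best

def median_gap(a_intervals, b_intervals):
    """Median nearest-B gap over all A entries."""
    merged = merge(b_intervals)
    starts = [s for s, _ in merged]
    return median(nearest_gap(a, merged, starts) for a in a_intervals)

def random_intervals(n, width, genome_len, rng):
    """Place n intervals of fixed width uniformly at random on one toy chromosome."""
    return [(s, s + width)
            for s in (rng.randrange(genome_len - width + 1) for _ in range(n))]

rng = random.Random(1)          # fixed seed so the sketch is reproducible
GENOME_LEN = 1_000_000          # toy chromosome length (assumption)
a = random_intervals(200, 10, GENOME_LEN, rng)
# Toy B: 100 peaks starting 200 bp downstream of the first 100 A entries
# (close by, but not overlapping), plus 100 random peaks.
b = [(s + 200, s + 1200) for s, _ in a[:100]]
b += random_intervals(100, 1000, GENOME_LEN, rng)

observed_median = median_gap(a, b)
null_medians = [median_gap(a, random_intervals(len(b), 1000, GENOME_LEN, rng))
                for _ in range(200)]
# One-sided p-value: how often is a random set at least as close to A as the real B?
p_value = (1 + sum(m <= observed_median for m in null_medians)) / (1 + len(null_medians))
print(observed_median, median(null_medians), p_value)
```

The advantage over extending intervals is that no window size has to be chosen up front: the distance distribution shows at which scale, if any, A and B are closer than expected by chance.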

I think these ideas have been implemented in these packages:
