I have the following problem:
I have 1bp coordinates in the genome which represent vector integration sites. These integration sites (IS) form clusters, meaning they occur in some relatively small areas (1-10kb) in the genome while the vast majority remains untouched.
I have a function which predicts these clusters based on only two arguments:
- maximum distance between two IS to be counted into the same cluster
- minimum number of IS to be counted as cluster
I need to compare two different data sets, which differ in their number of IS (approx. 100,000 vs. 1,000,000). So far I tried to sample the bigger data set to match the number of IS of the smaller one, however, I need to do this multiple times to get a decent representation of the original data. This is where the problem is: How can I predict the clusters for each sampling and then find a "mean" cluster position or let's say get likelihood for a cluster to occur at that position.
The cluster position is a range (i.e. chr1:8492643-8498245) and i get roughly 5000 of such ranges across the genome for each iteration.
I looked into intersect from bedtools but it seems it needs some sort of reference file, which it compares the other genomic ranges to. But I would rather have to compare all samplings between all samplings, like a pairwise comparison
I also tried a R solution, however, when I aggregate using the exact cluster positions I get a size bias towards bigger clusters, as these probably occur more often when sampling a small proportion of the data.
Any hints on packages or tools I could use or did anyone ever have a similar problem and a solution?
Any help is highly appreciated!!
It's not immediately clear what operations you are doing, but the following might help in that it describes a scalable approach to pairwise operations on an arbitrary number of inputs: http://bedops.readthedocs.io/en/latest/content/usage-examples/multiple-inputs.html
Thanks for pointing out this handy little tool! I checked it out and indeed it does help finding intersecting clusters in a pairwise comparison. It even lets you specify a minimum overlap. However, the output only tells you which cluster or genomic range in general interacts with what. It doesn't seem to have to option to generate new coordinated e.g. a new bed file with "common" or "mean" genomic ranges.
You could pipe the output of a
bedmap --echo-map
to abedmap --merge
operation to get a common genomic range for overlapping intervals within a pairwise combination.I don't know what a "mean" genomic range is, however. Maybe you could investigate piping
bedmap --echo-map-size
tocalc
or similar, generate a mean mapped region size, and pipe that toawk
or similar, along with a start position offset from the original reference BED element, which prints out a computed BED element.I'd definitely read up on the
--echo-map-*
options as they might help you generate new elements with the features you want.