the first awk $1>1 filters out the intervals that do not overlap with any other intervals, so if I want to keep them, I can just get rid of this filter
The awk step filters out any element whose only overlap is with itself. Every element overlaps itself by 100%, which always meets or exceeds any specified threshold, so a count of 1 means that no other element qualified. If you want to keep these elements, then you can remove the awk statement.
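As a rough illustration of what the awk and cut steps do, here is a minimal sketch that runs them on made-up input. The three sample lines are hypothetical, not real bedmap output from any dataset, and `sample_counts` and `keep_mutual` are helper names invented for this example. `sort -u` stands in for the `sort-bed | uniq` steps so that the sketch runs without BEDOPS installed.

```shell
# Hypothetical bedmap-style lines (not real data): overlap count, then the
# merged range of the mapped elements, tab-separated.
sample_counts() {
    printf '1\tchrN\t50\t1000\n'      # overlaps only itself
    printf '2\tchrN\t2000\t3100\n'    # one of a mutually overlapping pair
    printf '2\tchrN\t2000\t3100\n'    # the other member, same merged range
}

# Same filtering as in the thread: keep counts above 1, then drop the count.
keep_mutual() {
    awk '$1 > 1' | cut -f2-
}

# sort -u is a stand-in here for "sort-bed - | uniq -".
sample_counts | keep_mutual | sort -u
```

The element with a count of 1 is dropped, and the two identical merged ranges collapse into one line.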
Not filtering might be problematic where you have elements A and B that overlap, but they do not meet the mutual threshold. Consider the following intervals:
chrN 50 1000
chrN 60 10000
While the first element overlaps the second by ~99%, the second element overlaps the first by ~9%. If you do not filter, then you still get both elements back, despite them not meeting the threshold you set. They overlap, so even if you were to merge them afterwards with bedops --merge or similar, you would end up violating your overlap threshold.
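The arithmetic behind those two percentages can be checked with a short awk sketch. `overlap_fractions` is a hypothetical helper written for this example, not part of BEDOPS, and it assumes both intervals are on the same chromosome and use 0-based, half-open BED coordinates.

```shell
# Prints: overlapping bases, fraction of A covered, fraction of B covered.
overlap_fractions() {
    # args: startA endA startB endB
    awk -v s1="$1" -v e1="$2" -v s2="$3" -v e2="$4" 'BEGIN {
        s = (s1 + 0 > s2 + 0) ? s1 + 0 : s2 + 0   # start of the shared region
        e = (e1 + 0 < e2 + 0) ? e1 + 0 : e2 + 0   # end of the shared region
        ov = (e > s) ? e - s : 0                  # 0 if the intervals are disjoint
        printf "%d\t%.3f\t%.3f\n", ov, ov / (e1 - s1), ov / (e2 - s2)
    }'
}

# The two intervals from the example above.
overlap_fractions 50 1000 60 10000
```

For these intervals the shared region is 940 bases, which is 0.989 of the first element (length 950) but only 0.095 of the second (length 9940).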
Instead, I would suggest you do the first approach, filtering any self-overlapping elements. Then do a union of this result with any elements that overlap themselves and only themselves ("exclusive" intervals). The unioned set should have disjoint (non-overlapping) elements.
To explain further:
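# 1. Merged ranges of elements that mutually overlap another element by at least 90%
$ bedmap --count --echo-map-range --fraction-both 0.9 --delim '\t' intervals.bed \
| awk '$1>1' - \
| cut -f2- - \
| sort-bed - \
| uniq - \
> mutuallyOverlappingIntervals.bed
# 2. Merge all overlapping input intervals
$ bedops --merge intervals.bed > mergedIntervals.bed
# 3. Keep merged regions that exactly match an input interval ("exclusive" intervals)
$ bedmap --echo-map --exact --skip-unmapped intervals.bed mergedIntervals.bed > exclusiveIntervals.bed
# 4. Union of the two sets
$ bedops --everything exclusiveIntervals.bed mutuallyOverlappingIntervals.bed > finalAnswer.bed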
I suspect this works better for combining the exclusive intervals (those that overlap only themselves) with the mutually-overlapping intervals, since it avoids the possibility of overlaps between elements in these two subsets.
In other words, the intervals in the final answer should be disjoint - counting bases in these elements should not result in double-counts - and should also respect the mutual overlap threshold, in the case where elements had overlapped.
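One way to sanity-check the disjointness claim is a small awk pass over the sorted result. This is only a sketch: `check_disjoint` is a hypothetical helper, not a BEDOPS tool, and it assumes tab-separated input sorted by chromosome and then start position, as sort-bed would produce.

```shell
# Report any element that starts before an earlier element on the same
# chromosome has ended. Exit status is 0 if disjoint, 1 otherwise.
check_disjoint() {
    awk 'BEGIN { FS = "\t"; bad = 0 }
    {
        if ($1 == chrom && $2 + 0 < maxEnd) {
            print "overlap\t" $0
            bad = 1
        }
        if ($1 != chrom) { chrom = $1; maxEnd = $3 + 0 }   # new chromosome
        else if ($3 + 0 > maxEnd) { maxEnd = $3 + 0 }      # extend furthest end seen
    }
    END { exit bad }' "$@"
}

# Usage (file name taken from the commands above):
#   check_disjoint finalAnswer.bed && echo "no overlapping elements"
```

It accepts a file name or reads standard input, so it can also be placed at the end of a pipeline.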
For reciprocal overlap (90% w.r.t. both the intervals being considered for merging), shouldn't the fraction option be --fraction-both instead of --fraction-either?
I think you are correct. If the lengths of elements A and B are sufficiently different, then element A's overlap of element B may be of a much higher or lower fraction of A's length than B's length. Please use --fraction-both. Sorry, I'll edit my answer.
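To make the difference concrete, here is a small sketch of the two criteria as I understand them: "either" passes if the overlap covers the threshold fraction of at least one element, "both" only if it covers that fraction of each. `passes` is a hypothetical helper that mimics this logic in awk for a single pair of same-chromosome intervals. It does not call bedmap and may differ from it in edge cases.

```shell
# Does a pair of intervals pass a threshold under "either" or "both" semantics?
# Coordinates are assumed to be 0-based, half-open (BED).
passes() {
    # args: mode(either|both) threshold startA endA startB endB
    awk -v mode="$1" -v t="$2" -v s1="$3" -v e1="$4" -v s2="$5" -v e2="$6" 'BEGIN {
        s = (s1 + 0 > s2 + 0) ? s1 + 0 : s2 + 0
        e = (e1 + 0 < e2 + 0) ? e1 + 0 : e2 + 0
        ov = (e > s) ? e - s : 0
        fa = ov / (e1 - s1)          # fraction of A that is overlapped
        fb = ov / (e2 - s2)          # fraction of B that is overlapped
        if (mode == "both") ok = (fa >= t + 0 && fb >= t + 0)
        else                ok = (fa >= t + 0 || fb >= t + 0)
        print (ok ? "pass" : "fail")
    }'
}

passes either 0.9 50 1000 60 10000   # prints "pass"
passes both   0.9 50 1000 60 10000   # prints "fail"
```

With very different element lengths, as in the chrN example earlier in the thread, the pair passes under "either" but fails under "both", which is why --fraction-both is the right choice for a reciprocal threshold.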
Alex, this is great!
Thanks a ton.
Just two things that I want to clarify:
the first awk $1>1 filters out the intervals that do not overlap with any other intervals, so if I want to keep them, I can just get rid of this filter (...right? That's how I understood this. For what I am trying to do I will want to keep them: I'm trying to split the chromosome into regions such that highly overlapping regions aren't broken, and I will use the bedops complement for this once I gather my intervals)
For reciprocal overlap (90% w.r.t. both the intervals being considered for merging), shouldn't the fraction option be --fraction-both instead of --fraction-either?
Again, thanks a ton for your timely help. It is MUCH appreciated!
You may also consider using Homer. It has options for working with anything from one file to multiple files. Bedops is one of the good tools, but it created duplicates in my case.