It's an interesting quirk. Let's step back and draw some cartoons to see if that might help. I think the following case may help suggest a solution, or the start to one.
It seems like a "worst-case" scenario for you seems to be when merging elements A
and B
that overlap by your threshold, and B
and C
overlap by your threshold, but A
and C
are adjoining (and disjoint) or otherwise do not overlap by your threshold criterion.
Let's sketch this out. Here, chrZ
is some weird abstract chromosome, and we have elements A
, B
, and C
that span this chromosome:
|-----|-----|-----|-----| chrZ
[-- A --)
[-- B --)
[-- C --)
A simple merge on a 0.5 or 50% overlap threshold will correctly give:
|-----|-----|-----|-----| chrZ
[-- AB --)
[-- ABC --)
[-- BC --)
In this worst-case scenario, the middle element from the result set (ABC
) is calculated because B
overlaps both A
and C
by 50%, and so a merge gives ABC
.
It sounds like what you really want is something that looks like this:
|-----|-----|-----|-----| chrZ
[-- AB --)
[-- BC --)
In other words, you want to remove any merge results from the set {AB
, ABC
, BC
} which contain a nested element from the set of {A
, B
, C
}. (Or, the larger set of merges for more than three elements, where there are full or partial pairwise 50% overlaps, more generally.)
A nested element is one for which its start and stop coordinates fall within the bounds of its parent interval.
A merge element result that contains a nested element is one generated from a use case where one of the elements contributing to that merge result does not meet your threshold over all pairwise comparisons. I think this property can be exploited.
In the above worst-case scenario, merge result ABC
contains the nested element B
, where both the start and stop coordinates of B
fall within the start and stop coordinates of ABC
.
You could use awk
to strip out such results from a bedmap
call, by using awk
to test the start and end coordinates of each input coordinate from in.bed
against the merged range result that comes out of --echo-map-range
:
$ bedmap --fraction-map 0.5 --echo-map-range --echo --delim '\t' in.bed | awk -vFS="\t" -vOFS="\t" '{ if (($2 == $5) || ($3 == $6)) { print $1, $2, $3 } }' | sort -u | sort-bed - > out.bed
Note that this also pipes results from bedmap
and awk
to sort -u
, in order take care of duplicate --echo-map-range
results (generated from other, more relaxed overlap situations) and there is a final pipe to sort-bed
, in order to ensure generating a sorted BED file that is ready for further set operations.
I think this works, but I may be wrong by missing a use case. Please feel free to give it a go and post back sample data, if you believe it doesn't meet your needs.
Thanks for the suggestions, but because I wrote the post rapidly, I didn't explain myself correctly as I gave incorrect details. I have used bedtools intersect with -f and -F 0.5 and -e, so if only one region is covered enough. The thing is, [A B], [B C] and [C D] accomplish this requisite, but not A with C.
In short, I don't wish to merge all regions just because B and C also pass the filter. As I stated before, I thought about doing merge sequentially (A with B, then [AB] doesn't overlap 0.5 with C. Now I check B with C so I merge then and finally [BC] with D so I would have [AB] and [BCD]). I will update the initial post.