Question

Bed file: merging intervals that overlap a certain percentage

1

Entering edit mode

9.0 years ago

shwethacm ▴ 240

Hello,

I have a BED file that has many intervals that are overlapping.

My objective is to merge the intervals to see how much of the chromosome the bed file spans.

BEDtools merge was my natural go-to method, however, there's a catch.

I only want to merge overlaps that reciprocally overlap by say, 90%. BEDtools merge as far as I know, doesn't have this feature. Does anyone know any tool that I can use that can do this?

Thanks a bunch and merry Christmas!

BEDtools Bed • 8.4k views

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.0 years ago by shwethacm ▴ 240

0

Entering edit mode

Alex, this is great!

Thanks a ton.

Just two things that I want to clarify:

the first awk $1>1 filters out the intervals that do not overlap with any other intervals, so if I want to keep them, I can just get rid of this filter

(...right? That's how I understood this. For what I am trying to do I will want to keep them ((I'm trying to split the chromosome in regions such that highly overlapping regions aren't broken)) - I will use the bedops complement for this once I gather my intervals)
For reciprocal overlap (90% w.r.t both the intervals being considered for merging), should't the fraction option be --fraction-both instead of fraction-either?

Again, thanks a ton for your timely help. It is MUCH appreciated!

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.0 years ago by shwethacm ▴ 240

0

Entering edit mode

You may also consider using Homer. It has an options from one file to multiple. Bedops is one of the good ones but it created duplicates for my case.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.0 years ago by morovatunc ▴ 560

1

Entering edit mode

9.0 years ago

Alex Reynolds 36k

the first awk $1>1 filters out the intervals that do not overlap with any other intervals, so if I want to keep them, I can just get rid of this filter

The awk step filters out an element which overlaps itself - in that, of course, every element overlaps itself by 100%, which will always be higher than or equal to any specified threshold. If you want to keep these elements, then you can remove the awk statement.

Not filtering might be problematic where you have elements A and B that overlap, but they do not meet the mutual threshold. Consider the following intervals:

chrN      50      1000
chrN      60     10000

While the first element overlaps the second by ~99%, the second element overlaps the first by ~9%. If you do not filter, then you still get both elements back, despite them not meeting the threshold you set. They overlap, so even if you are to merge them afterwards with bedops --merge or similar, then you end up violating your overlap threshold.

Instead, I would suggest you do the first approach, filtering any self-overlapping elements. Then do a union of this result with any elements that overlap themselves and only themselves ("exclusive" intervals). The unioned set should have disjoint (non-overlapping) elements.

To explain further:

$ bedmap --count --echo-map-range --fraction-both 0.9 --delim '\t' intervals.bed \
    | awk '$1>1' - \
    | cut -f2- - \
    | sort-bed - \
    | uniq - \
    > mutuallyOverlappingIntervals.bed

$ bedops --merge intervals.bed > mergedIntervals.bed
$ bedmap --echo-map --exact --skip-unmapped intervals.bed mergedIntervals.bed > exclusiveIntervals.bed
$ bedops --everything exclusiveIntervals.bed mutuallyOverlappingIntervals.bed > finalAnswer.bed

I suspect this would work better at adding self-overlapping intervals with mutually-overlapping intervals, since this avoids the possibility of overlaps between elements in these two subsets.

In other words, the intervals in the final answer should be disjoint - counting bases in these elements should not result in double-counts - and should also respect the mutual overlap threshold, in the case where elements had overlapped.

For reciprocal overlap (90% w.r.t both the intervals being considered for merging), should't the fraction option be --fraction-both instead of fraction-either?

I think you are correct. If the lengths of elements A and B are sufficiently different, then element A's overlap of element B may be of a much higher or lower fraction of A's length than B's length. Please use --fraction-both. Sorry, I'll edit my answer.

ADD COMMENT • link updated 5.0 years ago by Ram 44k • written 9.0 years ago by Alex Reynolds 36k

0

Entering edit mode

$ bedmap --echo-map --exact --skip-unmapped intervals.bed mergedIntervals.bed > exclusiveIntervals.bed

Isn't necessary to add uniq - to that step in case there are duplicated regions in your original intervals.bed file?

As in:

bedmap --echo-map --exact --skip-unmapped intervals.bed mergedIntervals.bed | sort-bed - | uniq - > exclusiveIntervals.bed

I used this as an example of intervals.bed to merge and map:

chr1    4394800 4395357
chr1    4416903 4417046
chr1    4426956 4427083
chr1    4497875 4498980
chr1    4507891 4507997
chr1    4544029 4544271
chr1    4598562 4598840
chr1    14357732    14358035
chr1    14357732    14358035

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 8.3 years ago by Fernando Rossello • 0

1

Entering edit mode

bedmap just calculates overlaps based on whatever input reference file you give it. So you could just pipe unique intervals (assuming they are sorted):

$ uniq intervals.bed | bedmap --options - mergedIntervals.bed | ...

In this case, the use of the hyphen in bedmap in between options and the merged interval file denotes that it is consuming standard input from the upstream process (uniq) in place of the reference file.

You could also do this afterwards, but it is slightly more efficient to filter upstream of bedmap.

ADD REPLY • link 8.3 years ago by Alex Reynolds 36k

0

Entering edit mode

Great! Thanks for clarifying.

ADD REPLY • link 8.3 years ago by Fernando Rossello • 0

Ram · Accepted Answer · 2015-12-21

With BEDOPS bedmap:

$ bedmap --count --echo-map-range --fraction-both 0.9 --delim '\t' intervals.bed \
    | awk '$1>1' - \
    | cut -f2- - \
    | sort-bed - \
    | uniq - \
    > answer.bed

If the intervals are unsorted or are of unknown sort state, first sort before mapping:

$ sort-bed intervals.unsorted.bed > intervals.bed

On further thought, I think you may want to filter single-overlap cases; please see discussion further in this thread.

Edit: Changed --fraction-either to --fraction-both to do a correct mutual overlap test.