EDIT - I suggest using closest-features
in an answer at the bottom of this page. I would use that approach instead of this one. It will run fast and allow more relaxed assumptions about input.
Running under the assumption that your input intervals are entirely disjoint — none overlap each other — you can use BEDOPS tools to solve this easily.
First, sort your intervals with BEDOPS sort-bed:
$ sort-bed < intervals.unsorted.bed > intervals.bed
As a first pass, use bedmap directly with a symmetric one-base padding (via --range):
$ bedmap --echo --echo-map --range 1 --delim '\t' intervals.bed > answer.bed
Take a look at answer.bed
. Because we use a symmetric padding, this result will contain multiple intervals, if one interval either follows before or after another by one base.
Per your question's guidelines, we only want mapped intervals that follow afterwards. One extra step is needed to filter out elements in answer.bed
where an interval contains itself and another interval that comes before it:
$ bedmap --echo --echo-map --range 1 --delim '\t' intervals.bed \
| awk '$3 < $8' - \
> filtered.answer.bed
The addition of the awk test filters out results where the stop position of a reference (unpadded) interval is greater than or equal to the start position of any mapped (padded) interval.
Depending on what "follow" means for reverse-stranded elements, there is an easy way to deal with that case. Feel free to follow up with what that means to you.
Also, if your input intervals are not guaranteed to be disjoint, follow up with that qualification and I'll see if I can modify my answer to help in that case.
I seriously love these detailed answers you give. Sometimes I want to put your name under the tag when I run into a problem that falls into the bedtools / bedops category just so make sure you take a look at it.
Thank you for the detailed answer.
Running under the assumption that your input intervals are entirely disjoint - none overlap each other
There are overlapping intervals.
Depending on what "follow" means for reverse-stranded elements
Sorry, that was not clear. It means (think of reads) <-interval2- <-interval1-. Or in bed format:
I hope it is clearer now.
To deal with the stranded case, you would split intervals into two files, one for each strand:
For each strand, do separate tests.
For the forward strand elements, filter as shown in my original answer:
In the reverse-stranded case, change the test:
If you want the results in one file:
In the case where you have non-disjoint elements, you can filter them out and run the bedmap operation on the disjoint elements:
Working on disjoint elements alone reduces the number of (pairwise) tests on overlapping elements, which can require many comparisons, and can save some time.
Next, generate the set of non-disjoint or overlapping elements:
Run a forwards-sweeping pairwise comparison between all the overlapping elements, such as with the test that piet describes below. In other words, start at element
i
and compare it with elementsi+1
,i+2
, etc.Since the intervals are sorted on the start position (and chromosome), you can optimize this by only doing pairwise tests on an interval pair, up until the point at which the stop position of the first interval is less than the start position of the second interval. Once that condition is met, you can go to the next interval pair and start sweeping again.
For the stranded case, you can split by strand via awk and then do things similarly for disjoint elements. For overlapping, reverse-stranded elements, you could sweep backwards through the file one line at a time, testing interval pairs and applying the opposite position condition to optimize pairwise tests. In other words, you can stop testing a pair when start position of the first interval is greater than the stop position of the second interval.