It looks like a pretty straightforward scripting problem. You can efficiently resolve overlaps at the end with bedops --partition, an operation that, as far as I know, is not available in other toolkits.
Here's a script you can use to subsample from a shuffled list of intervals:
#!/usr/bin/env python
import sys
import random
import argparse

def main():
    parser = argparse.ArgumentParser(description='Subsample within intervals in BED file')
    parser.add_argument("-n", "--n", type=int, required=True, help='sample count')
    args = parser.parse_args()
    c = 0
    for line in sys.stdin:
        elems = line.strip().split('\t')
        start = int(elems[1])
        stop = int(elems[2])
        length = stop - start
        if length == 0:
            continue  # skip zero-length intervals
        # choose a start offset that leaves room for at least a one-base subinterval
        sample_start_offset = random.randint(0, length - 1)
        # choose a length that keeps the subinterval's end at or before the parent's stop
        sample_length = random.randint(1, length - sample_start_offset)
        elems[1] = str(start + sample_start_offset)
        elems[2] = str(start + sample_start_offset + sample_length)
        sys.stdout.write("%s\n" % ('\t'.join(elems)))
        c += 1
        if c == args.n:
            break

if __name__ == "__main__":
    main()
Here's how it could be used to sample 123 random subintervals:
$ subsample.py --n=123 < <(shuf intervals.bed) | sort-bed - | bedops --partition - | shuf -n 123 - | sort-bed - > answer.bed
How this works:
The command <(shuf intervals.bed) is a bash process substitution, used here to feed a shuffled list of intervals into subsample.py (the Python script above). You can use shuf or any other tool to shuffle the lines of a file; in this particular pipeline, a plain pipe (shuf intervals.bed | subsample.py --n=123) would work just as well.
If your original intervals file is whole-genome scale and won't fit into the memory allotment for shuf (i.e., you get out-of-memory errors when using shuf), I have a tool called sample that almost always gets around this limitation.
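For reference, here is a minimal sketch of the idea such a tool can rely on (reservoir sampling, Algorithm R), which holds only k lines in memory no matter how large the input is. This is an illustrative standalone script, not the sample tool itself:

#!/usr/bin/env python
# Minimal reservoir-sampling sketch: uniform sample of k lines from stdin.
import sys
import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, line in enumerate(stream):
        if i < k:
            reservoir.append(line)
        else:
            # line i survives with probability k/(i+1)
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir

if __name__ == "__main__":
    k = int(sys.argv[1])
    sys.stdout.write("".join(reservoir_sample(sys.stdin, k)))

Each incoming line replaces a random reservoir slot with probability k/(i+1), which yields a uniform sample without ever loading the whole file into memory.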
The Python script subsample.py parses each randomly-selected interval for its start and stop positions, using these to generate a subinterval with a randomly-chosen start position and length, such that the subinterval falls entirely within the parent interval.
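To convince yourself of that containment invariant, here is a small sanity check (a hypothetical test, not part of the pipeline) that exercises the same offset-and-length logic as the script:

import random

def sample_subinterval(start, stop):
    length = stop - start
    offset = random.randint(0, length - 1)            # leaves room for at least one base
    sub_length = random.randint(1, length - offset)   # end cannot pass the parent's stop
    return start + offset, start + offset + sub_length

# the sampled subinterval always lies within the parent interval [100, 200)
for _ in range(100000):
    sub_start, sub_stop = sample_subinterval(100, 200)
    assert 100 <= sub_start < sub_stop <= 200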
The output of subsample.py is BED-formatted, but it is unsorted. We pipe it to sort-bed to sort it, so that downstream tools like bedops can work with it.
It is possible that subintervals overlap one another, so the sorted BED output is piped to bedops --partition to split all elements into disjoint pieces.
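To illustrate what the partition step computes, here is a rough Python sketch (a toy under simplifying assumptions: a single chromosome, everything in memory; the real bedops implementation streams sorted input and is far more efficient):

def partition(intervals):
    # Collect every start/stop coordinate as a cut point, then keep each
    # gap between adjacent cut points that is covered by some input interval.
    bounds = sorted({x for start, stop in intervals for x in (start, stop)})
    pieces = []
    for left, right in zip(bounds, bounds[1:]):
        if any(start <= left and right <= stop for start, stop in intervals):
            pieces.append((left, right))
    return pieces

# two overlapping intervals become three disjoint pieces:
print(partition([(10, 50), (30, 70)]))
# [(10, 30), (30, 50), (50, 70)]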
If we split any subintervals, we'll end up with more subintervals than we originally asked for. So we pipe the disjoint elements to shuf -n to grab N random subintervals of interest, and pipe that again to sort-bed to get sorted BED output. (We could use head -n instead, but since the partitioned elements are sorted by position, taking the first N would bias the sample toward lower-coordinate regions.)
It looks a bit involved, but I suspect this will run a lot faster, especially on large inputs and sample sizes.
Thanks for providing another solution. I am going to look into it and see if any other problems come up.