Question

Subsampling Genomic Space

0

Entering edit mode

12.4 years ago

Alex Reynolds 36k

Let's say I have a BED file that defines my genomic space as follows:

chrN    a1   b1
chrN    a2   b2
...

I would like to uniformly sample (with or without replacement) segments of length i from the genomic space that this BED file defines. (Assume that i is equal to or smaller in length than the smallest segment in the BED file.)

My first naive approach is to expand the BED input to:

chrN    a1        a1 + i
chrN    a1 + 1    a1 + i + 1
chrN    a1 + 2    a1 + i + 2
...
chrN    b1 - i    b1
chrN    a2        a2 + i
...
chrN    b2 - i    b2
...

Then I would sample from this new set by mapping each line to an ID (say, a line number) and uniformly sampling line numbers as identifiers. Once I have the identifier, I get back the subrange element.

Blowing up the BED input like this and setting up the map is a lot of work for large inputs. Is there a smarter/faster/less-memory-intensive way to approach this task?

bed r statistics • 1.9k views

ADD COMMENT • link updated 12.4 years ago by matted 7.8k • written 12.4 years ago by Alex Reynolds 36k

score 3 · Answer 1 · 2012-08-17

Consider a hierarchical approach: when making each sample, you'll first select the BED region you want, then select the interval within that region that you will output.

When drawing among possible BED regions, you need to consider the number of possible segments that could come from each region. If a region is length L, the number of possible segments of length i from that is L-i+1. Now you can mentally annotate the BED file by the number of segments that could come from each region. For example, with i=20:

chr1    10    50 (21 possible segments)
chr1    100    120 (1 possible segments)
chr1    200    230 (11 possible segments)

Now you can see that there are 33 possible segments. Pick an integer uniformly between 1 and 33 (inclusive), and use it select the correct segment. The easiest way to do this is to consider the choices as a cumulative distribution function. For example:

chr1    10    50 (21 possible segments, gets values 1 to 21)
chr1    100    120 (1 possible segments, gets values 22 to 22)
chr1    200    230 (11 possible segments, gets values 23 to 33)

To sample without replacement, you can adjust the number of possible segments as you go.

This is the same direction you went in for your question, but it will calculate the quantities explicitly, rather than expanding the BED file to calculate them indirectly. That is, your expanded BED file for my example would have 33 lines.