Question

How To Generate Random Size Matched Background Regions

11.5 years ago

epigene ▴ 590

Guys,

Many papers test for enrichment of a particular factor (TF or histone peaks) at their regions of interest. They typically draw a random, size-matched set of control regions to estimate the background distribution. I only know how to draw a set of fixed-length random background regions using bedtools.

However, the regions of interest are typically of variable length. In this case, how do you get a random group of background regions with the same size distribution?

Thanks!

genome • 4.4k views

updated 3.3 years ago by Ram 45k • written 11.5 years ago by epigene ▴ 590

You could randomize the start coordinate of each entry of your BED file and then set the end coordinate to the new start plus the known length. If you are familiar with running scripts, we can write you a program a few lines long that would do just that.
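
A minimal sketch of that idea in bash and awk, not a tested tool. The function name `randomize_starts`, the optional seed argument, and the choice to keep each region on its original chromosome are assumptions made for illustration. It expects a two-column, tab-separated chromosome sizes file (name, size) and a BED file.

```shell
#!/usr/bin/env bash
# Sketch: move each BED interval to a random start on the same chromosome,
# keeping its length. Names and arguments here are illustrative assumptions.
# Usage: randomize_starts genome.txt input.bed [seed]
randomize_starts() {
    local genome=$1 bed=$2 seed=${3:-$RANDOM}
    awk -v seed="$seed" 'BEGIN { FS = OFS = "\t"; srand(seed) }
        NR == FNR { size[$1] = $2; next }       # first file: chromosome sizes
        !($1 in size) { next }                  # skip chromosomes not in the sizes file
        {
            len  = $3 - $2                      # known length of this entry
            room = size[$1] - len               # largest start that still fits
            if (room < 0) next                  # region longer than its chromosome
            start = int(rand() * (room + 1))    # uniform integer in 0..room
            print $1, start, start + len
        }' "$genome" "$bed"
}
```

Something like `randomize_starts hg19.genome input.bed > random.bed` would then give one shuffled region per input region. Passing a fixed seed makes the output reproducible.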

updated 3.3 years ago by Ram 45k • written 11.5 years ago by Istvan Albert 102k

Thanks. How do you decide at random which entry gets which length? The same way, by drawing a random length from a pool of lengths?

updated 3.3 years ago by Ram 45k • written 11.5 years ago by epigene ▴ 590

Perhaps build bins of observed size ranges, making a table of relative observed frequencies across all the bins. The first bin might contain the number of regions that you observe that are from 1 to 1000 bases long. The second bin would contain the number of regions from 1001 to 2000 bases long, and so on. Divide all counts by the total number of regions to get relative frequencies, which will all sum to 1. Smaller-width bins will reflect observed data more finely.
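
As a rough sketch of that binning step in bash and awk: the function name `size_bin_frequencies` and the three-column output layout (bin start, bin end, relative frequency) are assumptions made for illustration. The default bin width of 1000 matches the example above.

```shell
#!/usr/bin/env bash
# Sketch: table of relative frequencies of region sizes, binned by width.
# Usage: size_bin_frequencies input.bed [bin_width]
# Output columns (tab-separated): bin_start, bin_end, relative_frequency
size_bin_frequencies() {
    local bed=$1 width=${2:-1000}
    awk -v w="$width" 'BEGIN { FS = OFS = "\t" }
        {
            len = $3 - $2
            if (len < 1) next               # ignore empty or malformed intervals
            b = int((len - 1) / w)          # bin 0 holds 1..w, bin 1 holds w+1..2w, ...
            count[b]++; n++
            if (b > maxb) maxb = b
        }
        END {
            for (b = 0; b <= maxb; b++)     # print non-empty bins in ascending order
                if (count[b] > 0)
                    print b * w + 1, (b + 1) * w, count[b] / n
        }' "$bed"
}
```

Running it with a smaller width, for example `size_bin_frequencies input.bed 100`, gives the finer-grained table mentioned above.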

You can then sample uniformly from 0 to 1, inclusive. Apply the sample to the inverse of the cumulative distribution built from your bin frequencies, that is, pick the first bin whose cumulative frequency reaches or exceeds the sampled value. This sample points to a particular bin containing a size range.

Given the range in this bin, as a simplification, you could take a uniform random sample across the range to get a specific size value. If you know more about how the observed regions make up the selected bin, you could sample with a different distribution and parameters.

Sample across that bin to make a random background fragment. Sample over all bins again to make a population of fragments.
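
The two sampling steps could be sketched like this, assuming a tab-separated table with columns bin start, bin end and relative frequency (a layout chosen here for illustration, as is the function name `sample_sizes`). The draw is taken from the half-open interval [0, 1) because that is what awk's `rand()` returns, and it is rescaled by the sum of the frequencies to guard against rounding.

```shell
#!/usr/bin/env bash
# Sketch: draw n sizes from a binned frequency table by inverse-CDF lookup,
# then uniformly within the chosen bin.
# Usage: sample_sizes freq_table.tsv n [seed]
sample_sizes() {
    local table=$1 n=$2 seed=${3:-$RANDOM}
    awk -v n="$n" -v seed="$seed" 'BEGIN { FS = "\t"; srand(seed) }
        { lo[NR] = $1; hi[NR] = $2; total += $3; cum[NR] = total; bins = NR }
        END {
            if (bins == 0) exit
            for (i = 0; i < n; i++) {
                u = rand() * total                      # uniform draw on [0, total)
                k = 1
                while (k < bins && u >= cum[k]) k++     # first bin whose cumulative frequency exceeds u
                print lo[k] + int(rand() * (hi[k] - lo[k] + 1))   # uniform size within that bin
            }
        }' "$table"
}
```

Each printed size can then be paired with a random start coordinate to make one background fragment.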

If your regions have sizes that depend upon some other variable (say, proximity to some feature, like a TF or TSS, which might define a bin of relative start positions), you might expand your observed bin definition to a joint relative probability table made up of two or more variables. When you sample, you retrieve a bin representing ranges for two or more variables, like size, start position relative to some feature, etc. Sample across that bin to make a random background fragment. Sample over all bins again to make a population of fragments.
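
A sketch of the joint version, under some loud assumptions: the input is a hypothetical two-column, tab-separated file with one observed region per line (size, then start position relative to some feature, which may be negative), and the function name `sample_joint_bins` and both default bin widths are made up for illustration. Each observed (size bin, position bin) pair is sampled in proportion to its count, and then a value is drawn uniformly inside each range.

```shell
#!/usr/bin/env bash
# Sketch: sample (size, relative_start) pairs from a joint binned table.
# Usage: sample_joint_bins pairs.tsv n [size_bin_width] [pos_bin_width] [seed]
sample_joint_bins() {
    local pairs=$1 n=$2 size_w=${3:-1000} pos_w=${4:-1000} seed=${5:-$RANDOM}
    awk -v n="$n" -v sw="$size_w" -v pw="$pos_w" -v seed="$seed" '
        function floor(x,  t) { t = int(x); return (t > x) ? t - 1 : t }
        BEGIN { FS = OFS = "\t"; srand(seed) }
        {
            # joint bin key: size bin (1..sw is bin 0) and position bin (0..pw-1 is bin 0)
            key = floor(($1 - 1) / sw) SUBSEP floor($2 / pw)
            if (!(key in count)) order[++bins] = key
            count[key]++; total++
        }
        END {
            for (i = 0; i < n && bins > 0; i++) {
                u = rand() * total; acc = 0
                for (k = 1; k <= bins; k++) {           # inverse-CDF lookup over joint bins
                    acc += count[order[k]]
                    if (u < acc) break
                }
                if (k > bins) k = bins
                split(order[k], b, SUBSEP)
                size = b[1] * sw + 1 + int(rand() * sw) # uniform within the size range
                pos  = b[2] * pw + int(rand() * pw)     # uniform within the position range
                print size, pos
            }
        }' "$pairs"
}
```

Because bins are sampled jointly, a size range only ever appears together with a position range it was actually observed with.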

updated 3.3 years ago by Ram 45k • written 10.6 years ago by Alex Reynolds 36k

Ram · Answer 1 · 2014-09-18

Here is a simple script I wrote for this purpose, using bedtools random and an input.bed that supplies the region lengths.

# ===============================================================
# Script to generate random size-matched background regions
# Author: Xianjun
# Date: Jun 3, 2014
# Usage:
# toGenerateRandomRegions.sh input.bed > output.random.bed
# or
# cat input.bed | toGenerateRandomRegions.sh -
# ===============================================================

bedfile=$1
# other possible options can be the genome (e.g. hg19, mm9 etc.) or an inclusion or exclusion regions (e.g. exons)

# download hg19 chromosome sizes into hg19.genome (current directory)
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg19.chromInfo"  > hg19.genome

# for each input region, draw one random region of the same length
cat "$bedfile" | while read -r chr start end rest
do
    l=$((end - start))
    bedtools random -g hg19.genome -n 1 -l "$l"
done
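
One way to sanity-check the result is to compare the sorted region lengths of the input and the output. This is a small sketch, not part of the script above, and the function name `same_size_distribution` is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: succeed (exit status 0) only if two BED files contain exactly
# the same multiset of region lengths.
# Usage: same_size_distribution input.bed output.random.bed
same_size_distribution() {
    local a=$1 b=$2
    [ "$(awk '{ print $3 - $2 }' "$a" | sort -n)" = "$(awk '{ print $3 - $2 }' "$b" | sort -n)" ]
}
```

For example: `same_size_distribution input.bed output.random.bed && echo "sizes match"`.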