I was also interested in this problem and I tried to find a solution in R:
We can take advantage of the regioneR
package and its createRandomRegions
function: it lets you sample regions and you control region size, variation in size, and overlapping output regions or not. The function allows you to specify a genome and also mask any region from which you do not want to sample regions.
The easiest solution would be to input your genes as the background genome from which to sample the regions, but the genome input can only have one region per chromosome. Thus, the final solution is to input the genome and mask all regions which are not genes. For example:
# Create GRanges for the whole genome and the genes in the genome
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(BSgenome.Hsapiens.UCSC.hg38)
# Genes, or any BED of regions you like
txdb.genes <- genes(txdb)
txdb.genes <- reduce(txdb.genes, ignore.strand = T)
# Genome
genome <- GRanges(
seqnames=seqnames(BSgenome.Hsapiens.UCSC.hg38),
ranges=IRanges(start=rep(1, length(BSgenome.Hsapiens.UCSC.hg38)), end=seqlengths(BSgenome.Hsapiens.UCSC.hg38)),
strand="*")
# Mask anything in the genome which are not genes
mask <- setdiff(genome, txdb.genes, ignore.strand = T)
table(countOverlaps(txdb.genes, mask)) # check no genes are in the mask
# Create random regions
library(regioneR)
regions_in_genes <- createRandomRegions(nregions=1000, length.mean=1000, length.sd=0, genome = genome, mask = mask, non.overlapping = T)
table(countOverlaps(regions_in_genes, txdb.genes)) # Check that all your regions overlap with genes
table(countOverlaps(regions_in_genes, txdb.genes, type = "within")) # Check that all your regions are WITHIN genes
UPDATE:
I have found a sufficient solution:
This bash script takes a 1bp "peak" every 100 bp inside all genes and prints them to a new file. One can then take ranom positions from this File and enlarge the peaks with bedtools slop.
Thank you for your time!
A couple thoughts:
Your sampling approach will favor longer genes. Samples will contain more 1bp "pseudo"-peaks from longer genes than shorter genes, which can introduce unwanted bias. This may or may not present a problem for your analysis, but you should be aware of this bias when calling the result "random".
Further,
RANDOM
is a reserved variable inbash
, which you may or may not be able to override. It is recommended to use lowercase names for variables inbash
, so that you don't end up using reserved POSIX variable names and get unintended (garbage) results.I wrote a
bash
script here that will draw a uniform sample with replacement ofN
intervals of lengthL
from thehg38
genome:There is a short demo of how to use it at the bottom of the page, which samples ten 10kb intervals and writes them as sorted output, ready for further set operations:
The chromosome names and sizes can be replaced if you are using a different genome assembly.
This script uses the rejection sampling approach I mentioned in my answer below. This approach takes into account the length of chromosomes over
hg38
when drawing random integers, by rejecting any integers that are longer than the genome, as well as rejecting those which create intervals that extend over the end bounds of a chromosome.Once you have samples, it is a simple matter to filter them, by way of piping to
bedmap
:Repeat until enough samples are collected, as needed.