Question

Pick Random regions inside genes

0

Entering edit mode

3.2 years ago

tirichl ▴ 20

Hi,

To create a background for an experiment, i want to pick N genomic peaks with length L, but only inside genes. Doing this for the human genome is easily done with bedtools random, but i require only peaks which are located inside genes. I have a bed file with all human genes, the length and the number of peaks, but i struggle to solve this problem. Maybe there is an existing tool or script that could solve this task?

Thank you for your help!

ChIP-seq bash R Linux Python • 1.4k views

ADD COMMENT • link updated 3.1 years ago by Alex Reynolds 36k • written 3.2 years ago by tirichl ▴ 20

0

Entering edit mode

UPDATE:

I have found a sufficient solution:

RANDOM () {

INCR=$(echo "$i + 1" | bc)
echo -e "$CHR\t$i\t$INCR\t$PROT" >> FILE


   }

    while IFS=" " read -r line; do

echo "Starting $PROT."

START=$(echo $line | awk '{print $2}')
export START
END=$(echo $line | awk '{print $3}')
export END
CHR=$(echo $line | awk '{print $1}')
export CHR
PROT=$(echo $line | awk '{print $4}')
export PROT



for (( i=$START; i<$END; i=i+100)); do RANDOM "$i" & done

echo "Finished $PROT."



    done < test_genes

This bash script takes a 1bp "peak" every 100 bp inside all genes and prints them to a new file. One can then take ranom positions from this File and enlarge the peaks with bedtools slop.

Thank you for your time!

ADD REPLY • link 3.2 years ago by tirichl ▴ 20

0

Entering edit mode

A couple thoughts:

Your sampling approach will favor longer genes. Samples will contain more 1bp "pseudo"-peaks from longer genes than shorter genes, which can introduce unwanted bias. This may or may not present a problem for your analysis, but you should be aware of this bias when calling the result "random".

Further, RANDOM is a reserved variable in bash, which you may or may not be able to override. It is recommended to use lowercase names for variables in bash, so that you don't end up using reserved POSIX variable names and get unintended (garbage) results.

I wrote a bash script here that will draw a uniform sample with replacement of N intervals of length L from the hg38 genome:

There is a short demo of how to use it at the bottom of the page, which samples ten 10kb intervals and writes them as sorted output, ready for further set operations:

$ ./sample_uniform_hg38_intervals.sh 10 10000 | sort-bed -
chr1    37174603    37184603
chr10   14542026    14552026
chr11   32419542    32429542
chr12   64434681    64444681
chr21   8019910 8029910
chr5    174494714   174504714
chr6    34761384    34771384
chr7    3579514 3589514
chrX    26732001    26742001
chrY    46893746    46903746

The chromosome names and sizes can be replaced if you are using a different genome assembly.

This script uses the rejection sampling approach I mentioned in my answer below. This approach takes into account the length of chromosomes over hg38 when drawing random integers, by rejecting any integers that are longer than the genome, as well as rejecting those which create intervals that extend over the end bounds of a chromosome.

Once you have samples, it is a simple matter to filter them, by way of piping to bedmap:

$ N=12345
$ L=5000
$ ./sample_uniform_hg38_intervals.sh ${N} ${L} | sort-bed - | bedmap --echo-map --skip-unmapped --fraction-map 1 genes.bed - > answer.bed

Repeat until enough samples are collected, as needed.

ADD REPLY • link 3.1 years ago by Alex Reynolds 36k

score 1 · Answer 1 · 2021-09-30

Rejection sampling may be useful here. Generate M random intervals (M >> N) of length L and put them into a sorted file called samples.bed. Use bedmap --skip-unmapped --echo-map --fraction-map 1 genes.bed samples.bed to get all samples contained entirely within genes.bed. Repeat sampling until you have N intervals.

score 1 · Answer 2 · 2021-09-30

1

Entering edit mode

3.2 years ago

ATpoint 85k

You can create the random regions from the entire genome and then use bedtools intersect to keep only those that overlap fully (check -f and -F options) with the genes. For that you need a BED file with gene coordinates. You may need to repeat that until you have enough regions.

ADD COMMENT • link 3.2 years ago by ATpoint 85k

score 0 · Answer 3 · 2021-09-30

I was also interested in this problem and I tried to find a solution in R:

We can take advantage of the regioneR package and its createRandomRegions function: it lets you sample regions and you control region size, variation in size, and overlapping output regions or not. The function allows you to specify a genome and also mask any region from which you do not want to sample regions.

The easiest solution would be to input your genes as the background genome from which to sample the regions, but the genome input can only have one region per chromosome. Thus, the final solution is to input the genome and mask all regions which are not genes. For example:

# Create GRanges for the whole genome and the genes in the genome
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(BSgenome.Hsapiens.UCSC.hg38)

# Genes, or any BED of regions you like
txdb.genes <- genes(txdb)
txdb.genes <- reduce(txdb.genes, ignore.strand = T)

# Genome
genome <- GRanges(
  seqnames=seqnames(BSgenome.Hsapiens.UCSC.hg38),
  ranges=IRanges(start=rep(1, length(BSgenome.Hsapiens.UCSC.hg38)), end=seqlengths(BSgenome.Hsapiens.UCSC.hg38)),
  strand="*")

# Mask anything in the genome which are not genes
mask <- setdiff(genome, txdb.genes, ignore.strand = T)

table(countOverlaps(txdb.genes, mask)) # check no genes are in the mask

# Create random regions
library(regioneR)

regions_in_genes <- createRandomRegions(nregions=1000, length.mean=1000, length.sd=0, genome = genome, mask = mask, non.overlapping = T)

table(countOverlaps(regions_in_genes, txdb.genes)) # Check that all your regions overlap with genes
table(countOverlaps(regions_in_genes, txdb.genes, type = "within")) # Check that all your regions are WITHIN genes