Pick random position from specific genomic regions
3
2
Entering edit mode
10.6 years ago

Hi,

I've a bed file containing several region of interest and I want to pick randomly N positions in these regions. I know that "bedtools random" can pick random positions into the genome, but I don't find any tool to perform that on specifi region described in a bed file.

Thanks

random position bed • 7.5k views
ADD COMMENT
1
Entering edit mode

How does your bed file differ from the genome file that is required for bedtools random?

ADD REPLY
0
Entering edit mode

bedtools random takes chromosizes in input. <chromName><TAB><chromSize>

ADD REPLY
0
Entering edit mode

Is it required that the tool be pre-made or are you happy with a small R or python script? Also, when you say "pick N positions", do you mean from each of the regions or from all of the regions combined (I assume the latter, but this is ambiguous).

ADD REPLY
0
Entering edit mode

From all the regions combined.

I just wonder if such a tool exists. I can indeed wrote a small script using bedtools random and check if the positions are in the regions in the bed file.

ADD REPLY
4
Entering edit mode
10.6 years ago

If you want to sample N elements without replacement from your set of input BED elements, replace N with your value:

$ shuf input.bed | head -N | sort-bed - > answer.bed

Or you can use my sample program on Github: https://github.com/alexpreynolds/sample

$ sample -k N -p input.bed > answer.bed

This also offers sampling with replacement and some other features that might be useful, depending on the experiment.

If you want to sample N elements without replacement from a population of individual bases made up from your set of input BED elements, use the bedops --chop 1 command to split your input BED file into individual bases, then do the usual shuf or sample operation:

$ bedops --chop 1 input.bed > perBaseInput.bed
$ sample -k N -p perBaseInput.bed > perBaseAnswer.bed

You could pipe the per-base output from bedops into shuf directly, but if you want to draw multiple samples, it can be a waste of time to regenerate the per-base data each time. Saving to a separate file makes for easy re-use (re-sampling).

ADD COMMENT
0
Entering edit mode

I think Nicolas Rosewick is looking for single positions from within the regions, not just random regions.

ADD REPLY
0
Entering edit mode

I have added code for dealing with the second case via the command-line.

ADD REPLY
0
Entering edit mode

Thanks Alex. That's perfect. I adapted a little bit your code in a bash script:

To call : ./getNRandPosFromBed.sh input.bed 10 output.bed

#!/bin/bash

#
# getNRandPosFromBed.sh
# get N random position from regions specified in a bed file

# params
# $1 = in.bed
# $2 = N : number of positions to output
# $3 = output.bed

TFILE="/tmp/$(basename $0).$$.tmp"

awk '\
    BEGIN { \
        OFS = "\t"; \
    } \
    { \
        chr = $1; \
        start = $2; \
        stop = $3; \
        if (NF > 3) { \
            remainder = $4; \
            for (rIdx = 5; rIdx <= NF; rIdx++) { \
                remainder = remainder"\t"$rIdx; \
            } \
        } \
        for (sIdx = start; sIdx < stop; sIdx++) { \
            if (NF > 3) { \
                print chr, sIdx, (sIdx + 1), remainder; \
            } \
            else { \
                print chr, sIdx, (sIdx + 1); \
            } \
        } \
    }' $1 > $TFILE

shuf $TFILE | head -n $2 | awk '{print $1"\t"$2}' | sort -k1,1V -k2,2n > $3

rm $TFILE
ADD REPLY
1
Entering edit mode

Thanks for the feedback. A couple thoughts:

  1. If you plan on taking more than one sample from the split regions file, consider saving it outside the /tmp directory to save time from having to recreate it each time. Compress it, if it takes up a lot of space.
  2. I have done some performance testing on a few "typical" Linux and OS X hosts, and GNU sort can be about 2-3 times as slow as BEDOPS sort-bed at sorting BED files. For large files or when doing lots of sorting, this wasted time can add up.

If your sample is large, consider using sort-bed directly, or if you are doing a non-typical sort, add --parallel=n to the GNU sort call if you are sorting on a multiprocessor machine.

Even with --parallel, BEDOPS sort-bed usually runs faster than GNU sort</code>. But if you have to use it, use--parallel` where you can. (Disclaimer: I am one of the authors of BEDOPS, so take my claims with the necessary skeptical grains of salt and do testing on your end.)

ADD REPLY
0
Entering edit mode

In fact, sorting is not essential for me. I just want to pick random positions to perform a statistical test. But thanks for the information about GNU sort and sort-bed from bedops.

ADD REPLY
5
Entering edit mode
10.6 years ago

If no one posts a premade tool, then the following Rscript will work from the command line (usage script.R file.bed N, where N is the desired number of regions). Note that positions won't occur more than once (this can be trivially changed).

#!/usr/bin/env Rscript
suppressPackageStartupMessages(require(GenomicRanges))
suppressPackageStartupMessages(require(rtracklayer))

args <- commandArgs(T)
if(length(args) != 2) {
    print(args)
    stop("Not enough arguments. Usage: script.R foo.bed N, where N is an integer")
}

b <- import.bed(args[1])
l <- sum(width(b))
positions <- sample(c(1:l), args[2])
new_b <- GRanges(seqnames=as.character(rep(seqnames(b), width(b))),
    ranges=IRanges(start=unlist(mapply(seq, from=start(b), to=end(b))), width=1))
export.bed(new_b[positions], "output.bed")

This will work OK for bed files that aren't terribly large (such that new_b doesn't get enormous).

ADD COMMENT
0
Entering edit mode

Thanks Devon. I'll try it now and give you a feedback.

ADD REPLY
0
Entering edit mode

The problem is that my bed file is quite big ( cpg island for hg19 :p )

ADD REPLY
0
Entering edit mode

That should still be OK unless you're pretty memory limited. The limit with this method is that the new_b object can get quite large and if you only a couple gigs of memory then that'll cause problems (I actually don't know how much space a GRanges object occupies by row). There are alternative implementations of the penultimate line if the memory usage becomes an issue.

ADD REPLY
0
Entering edit mode
10.5 years ago
Ryan Dale 5.0k

Obligatory BEDTools solution: First make a file containing features of the size you'd like to use, and then shuffle them only within the regions of interest. Specifically:

  1. Make a file with N features of the size you want, here, 50 features of 1 bp each. The chromosome, start, stop don't matter since you'll be shuffling them anyway:

    for i in {1..50}; do echo -e "chr1\t1\t2" >> a.bed; done
    
  2. Then shuffle them. Using -incl will only shuffle them within myregions.bed

    bedtools shuffle -i a.bed -incl myregions.bed -g hg19
    

Note the genomes file is only used here by shuffle in case myregions.bed has regions outside the chromosome boundaries.

ADD COMMENT
0
Entering edit mode

Dear all,

I want a slight modification to what Nicolas Rosewick asked in the beginning, so asked a question on existing thread instead of new one.

I have a region of my interest "targets.bed". Now I want to find equal number and same length (6-8nt) of random sequences from the same 3'utr. Can you suggest on how to get this on what i have been trying.

shuffleBed -chrom -excl targets.bed -i utr3p -g assembly.size

This provides me the remaining part of 3putr which do not overlap targets.bed. But it has two limitations

  1. Size of random region is not in the range of targets.bed
  2. Number of random region generated is not equal to count of targets.bed but equal to number of utr3p.

I also tried to input shuffleBED output to randomBED, but was useful. Any hep will be highly appreciated.

Cheers
Chirag

ADD REPLY
0
Entering edit mode

1) and 2) are expected given the command line you provided.

So instead of -excl targets.bed, try -incl targets.bed.

ADD REPLY

Login before adding your answer.

Traffic: 934 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6