How do I randomly sample lines from a bed file? Specifically, I want to make a smaller bed file of ChIP-seq reads from a larger one, while maintaining the relative proportion of lines from each chromosome.
Thanks,
Bede
To preserve the relative proportion of elements per chromosome, I think you will need to do a bit more work.
Overall procedure:
Using BEDOPS sort-bed, first sort your BED data:
$ sort-bed elements.unsorted.bed > elements.bed
Split it with bedextract and count lines. The bedextract tool very efficiently splits BED files into per-chromosome files, and we put per-chromosome line counts into a text file called counts.txt:
$ for chr in `bedextract --list-chr elements.bed`; do \
bedextract $chr elements.bed > elements.$chr.bed; \
count=`wc -l elements.$chr.bed | cut -d' ' -f1`; \
echo -e "$chr\t$count"; done \
> counts.txt
Let's assume that we ultimately want to sample a total of 1000 elements from the input BED file. You can change this value to whatever you need.
We next calculate a table of per-chromosome proportions and how many elements you will pull from each chromosome:
$ sampleSize=1000
$ sum=`cut -f2 counts.txt | perl -nle '$sum += $_ } END { print $sum' -`
$ awk -v sum=$sum -v sampleSize=$sampleSize '{print $0"\t"($2/sum)"\t"int(sampleSize*$2/sum)}' counts.txt > proportions.txt
Finally, do a random sample from each of the per-chromosome BED files, choosing a sample size for each chromosome that reflects its proportion in the larger input:
$ awk '{ \
perChrName=$1; \
perChrN=$4; \
cmd="shuf elements."perChrName".bed | head -n "perChrN; \
system(cmd); \
}' proportions.txt \
| sort-bed - \
> sample.bed
The output is a sort-bed-sorted BED file called sample.bed. This file contains roughly 1000 rows (close, but not exact, depending on the truncation from the int() call in the awk statement where we calculate proportions). The number of samples from each chromosome will therefore roughly reflect the distribution of elements in the larger input BED file.
cat input.bed | shuf | head -n [number of lines you want]
This won't maintain the per-chromosome proportions. Getting the same number of lines from each chromosome (shuffled or not) would require sort -u with cut on the first column to fetch the unique chromosome names, then a combination of grep and head for each chromosome to generate the output file, which can be randomized using shuf. There might be a better solution, as the file can be huge (> 50 million lines).
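A minimal sketch of that idea, assuming GNU coreutils: rather than one grep pass per chromosome, shuffle once and let a single awk pass keep the first k shuffled lines seen for each chromosome (the chromosome name is assumed to be in column 1):

```shell
# Shuffle once, then keep the first k lines encountered per chromosome.
# One pass instead of one grep per chromosome, which matters at >50M lines.
k=1000
shuf input.bed | awk -v k="$k" 'seen[$1]++ < k' > subset.bed
```

Note that this gives the same number of lines per chromosome, as described above, not a proportional sample.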
Based on the comments, it seems that a simple solution is:
$ cat input.bed | shuf > shufinput.bed
$ cat shufinput.bed | head -n [number of lines you want] > randomsubset.bed
This solution should maintain the sorted order of the lines in input.bed, while also generating a random subset of data that should equally represent each chromosome in randomsubset.bed.
However, trying this on a test set of data, the ratios weren't maintained.
Based on how your question is specifically worded, I still don't think this is what you want, but if this is what you want, Irsan is correct that this is a useless use of cat. Just operate on the files directly and pipe the results from one command to the next:
$ shuf input.bed | head -n 1234 > sampleFromEntireSet.bed
As you note, this will not preserve the relative proportions of elements, just like permuting a deck of sorted playing cards will destroy any ordering of suits and ranks. If you want to preserve ratios (i.e., get a randomly sampled subset of ranks for a given suit) then you'll need to permute subsets (i.e., first split the deck of cards by suit, before sampling for a subset of ranks within a suit).
You could also just grab every nth line (which is less random, but will maintain your proportions).
# grabs every 100th line (lines 100, 200, ...)
sed -n '0~100p' input.bed > output.bed
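The 0~100 address is a GNU sed extension; a portable equivalent (a sketch using awk's modulo on the record number) is:

```shell
# Print every 100th line (lines 100, 200, 300, ...) without GNU sed
awk 'NR % 100 == 0' input.bed > output.bed
```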
bedtools sample -i input.bed -n 1000
will randomly sample 1000 rows from input.bed
I wanted to have a one-liner for this tricky little problem, so I consulted Stack Overflow, resulting in the following.
Try using awk (gawk, since asort() is a GNU extension):
#!/bin/bash
numRow=3
awk -v numRow="$numRow" 'n[$1] < numRow { a[$1] = a[$1] "\n" $0; n[$1]++ } END { asort(a, b); for (x in b) print b[x] }' file
Output:
chr1 3003204 3003454 * 37 +
chr1 3003235 3003485 * 37 +
chr1 3003148 3003152 * 37 -
chr11 71863609 71863647 * 37 +
chr11 71864025 71864275 * 37 +
chr11 71864058 71864308 * 37 -
chrY 90828920 90829170 * 23 -
chrY 90829096 90829346 * 23 +
chrY 90828924 90829174 * 23 -
If a randomized order of lines is required inside the groups, then just add a little shuf:
#!/bin/bash
numRow=3
shuf file | awk -v numRow="$numRow" 'n[$1] < numRow { a[$1] = a[$1] "\n" $0; n[$1]++ } END { asort(a, b); for (x in b) print b[x] }'
Very fast and efficient solution.
Time taken for a file of 10 million reads:
real 0m2.923s
user 0m2.859s
sys 0m0.063s
Credits : jkshah
Cheers
I wrote a tool called sample (Github site) that uses reservoir sampling to pull an ordered or randomly-shuffled uniformly random sample from an input text file delimited by newline characters. It can sample with or without replacement.
When sampling without replacement, this application offers a few advantages over common shuf-based approaches:
- Performance: it compared favorably with shuf in informal tests on OS X and Linux hosts.
- Memory: sample stores the start positions of sampled lines (roughly 8 bytes per line).
Using less memory also gives sample two advantages, one being that it avoids shuf: memory exhausted errors for whole-genome-scale input files.
The sample tool stores a pool of line positions and makes two passes through the input file. One pass generates the sample of random positions, while the second pass uses those positions to print the sample to standard output. To minimize the expense of this second pass, we use mmap routines to gain random access to data in the (regular) input file on both passes.
The benefit that mmap provided was significant. For comparison purposes, we also added a --cstdio option to test the performance of standard C I/O routines (fseek(), etc.); predictably, this performed worse than the mmap-based approach in all tests, but timing results were about identical with gshuf on OS X and still an average 1.5x improvement over shuf under Linux.
The sample tool can be used to sample from any text file delimited by single newline characters (BED, SAM, VCF, etc.). Also, the --lines-per-offset option allows sampling and shuffling repeated multiples of newline-delimited lines, useful for sampling from (for example) FASTQ files, where records are split across four lines.
By adding the --preserve-order option, the output sample preserves the input order. For example, when sampling from an input BED file that has been sorted by BEDOPS sort-bed (which applies a lexicographical sort on chromosome names and a numerical sort on start and stop coordinates), the sample will have the same ordering applied, with a relatively small O(k log k) penalty for a sample of size k.
By omitting the sample size parameter, the sample tool will shuffle the entire file. It can be used to shuffle files that shuf has trouble with; however, in this use case it currently runs slower than shuf (where shuf can still be used). We recommend shuf for shuffling an entire file, or specifying the sample size (up to the line count, if known ahead of time), when possible.
One downside at this time is that sample does not process a standard input stream; the input must be a regular file.
I think the following should work:
$ git clone https://github.com/alexpreynolds/sample.git
...
$ cd sample
$ make
...
$ ./sample -k 10000 /path/to/snps.vcf > /path/to/snps_subset.vcf
I've added some features recently, but the current code should work.
Note that I changed the default behavior (as described above) from sample-in-order to sample-and-shuffle. If you need to preserve order, add the --preserve-order option.
This is perfect, thanks so much for the quick response!
Perhaps this is beyond the capabilities of reservoir-shuffle, but is it possible to subsample the sites for all individuals in the VCF file?
E.g., let's say I have a VCF file that contains 100 individuals and 1 million SNP sites. I would like a new file with all 100 individuals and 10000 SNP sites that have been randomly sampled from the larger file.
Adapted this code snippet from Heng Li's answer on random sampling of FASTA files using bioawk; see http://seqanswers.com/forums/showthread.php?t=16845
bioawk -c bed -v k=INT '{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}' in.bed
where INT in k=INT is an integer indicating the number of random lines to extract.
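If bioawk isn't installed, the same reservoir-sampling logic runs in plain awk (you only lose the -c bed column naming); this is a sketch assuming a POSIX awk with rand()/srand():

```shell
# One-pass reservoir sampling in plain awk; k is the sample size.
# The first k lines fill the reservoir; every later line replaces a
# random reservoir slot with probability k/NR, yielding a uniform
# random sample of k lines.
awk -v k=1000 'BEGIN { srand() }
  { y = (x++ < k) ? x - 1 : int(rand() * x); if (y < k) a[y] = $0 }
  END { for (z in a) print a[z] }' in.bed
```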
This is a more thorough solution that seemingly ensures I equally represent each chromosome. However, it seems to me that if shuf is completely random, then data from each chromosome will be distributed randomly through the resulting file, so taking head -somenumber will give me an approximately representative sampling?
Being completely random and maintaining your proportions are incompatible goals. After all, a single random sampling is just as likely to return entirely reads from chromosome 4 as it is to be distributed across all chromosomes.
Point taken, I will try to be more precise, and less colloquial, when seeking help in the future.
But this is not a single random sampling. If the OP wants to subset 10,000 lines then he is performing 10,000 random samplings out of the original population without replacement. Assuming a good distribution of reads across chromosomes in the original population (BED file) the probability of sampling 10,000 times and only collecting reads from chromosome 4 is vanishingly small.
You're not wrong, but if you really want to maintain the proportion with high fidelity, then you have to randomly sample an appropriate number from each chromosome. Whether random sampling from a shuffled file is 'good enough' will depend on the specific application.
I attempted to use this particular approach, but for some reason it only pulled intervals from chr1-3 and X, Y, while the bed file contains all of the chromosomes. Any reason you can see as to why this might occur? Thanks!
Not sure. Is everything sorted?
Actually, I have figured that out and applied it successfully, except for one thing: I input a list of about 800K peaks and am looking for the 1000 that's outlined in your code, but I get about 15K. Here is how I ran it:
I'm wondering if the following in the counts file might have a bearing on this, and whether there is a way to remove the Un and Random scaffolds at this stage of the analysis to focus only on the main mouse chromosomes:
You could use grep to remove non-nuclear elements. This is GNU grep, so make changes as needed for OS X etc.
Thanks! One final thing on this: I'm wondering how to subsample intervals of similar lengths, let's say with an average of 350bp?
Just spitballing:
You could use a normal distribution model in R to generate a pool of interval lengths with a mean length of 350 bases, with a SD of 50:
Or you could use whatever distribution that best models your random intervals.
Generate a pool of start positions for a given chromosome (say, chr1 from hg38):
Subtracting 350 from the max ensures that a random element does not extend past the edge of chr1 (bounds). Subtracting another 1 ensures the interval is zero-indexed, half-open.
For each start position, you add the random length, converted to an integer value:
Then make a vector of chromosomes:
And bundle the sample into a data frame:
From there you could export this to a text file or GenomicRanges object (maybe?).
This would need a little bit of work to package into a function for each chromosome and its assembly's bounds, but hopefully this gives an idea of what you could perhaps do.
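As a rough shell translation of the same idea (a sketch, assuming GNU coreutils, a POSIX awk, and the hg38 chr1 length of 248,956,422; adjust the bound per chromosome and assembly), you can draw normal lengths with the Box-Muller transform:

```shell
# Generate n random chr1 intervals whose lengths are ~ Normal(350, 50).
awk -v n=1000 -v mu=350 -v sd=50 -v bound=248956422 'BEGIN {
  srand()
  pi = 3.141592653589793
  for (i = 0; i < n; i++) {
    # Box-Muller: two uniform deviates -> one standard normal deviate
    z = sqrt(-2 * log(1 - rand())) * cos(2 * pi * rand())
    len = int(mu + sd * z)
    if (len < 1) len = 1                   # clamp pathological draws
    start = int(rand() * (bound - len))    # keep the interval inside chr1
    print "chr1\t" start "\t" start + len
  }
}' > random.bed
```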
I guess I can also take the aggregate bed files representing all chromosomes before downsampling and simply subsample intervals of 350bp in length and thereafter use your approach at the start of this thread to subsample those evenly across all chromosomes? Is there an easy way of taking just a BED file and pulling out intervals that average 350bp in length?
The bedmap tool has options to report the sizes of reference and/or mapped elements, which you could use with awk to report elements of length 350:
Or elements within some CI around 350:
Etc.
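If bedmap isn't at hand, the same length filters can be written with awk alone, since a BED element's length is just the stop coordinate minus the start:

```shell
# Keep elements exactly 350 bp long:
awk -F'\t' '($3 - $2) == 350' input.bed > exact350.bed

# Or keep elements within +/- 50 bp of 350:
awk -F'\t' '($3 - $2) >= 300 && ($3 - $2) <= 400' input.bed > near350.bed
```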
Perfect, thanks! Second option is what I am looking for in this instance.
One other quick thing: is it possible in this scenario to match the subsampled ~350bp peaks in their read counts to the target peak set? I'm basically working with about 10K target peaks at some varying read depth, and I would like to take the downsampled 10K peaks across all chromosomes that I have at 350bp and match their read depth to the target. Thanks in advance for any advice!
Rejection sampling might work. You would generate an element of desired average length, measure reads, and either keep or discard it, depending on whether that measurement falls within the bounds of the read statistics for your target peak set. Keep sampling until you have as many elements as you need for your experiment that match your criteria. https://en.m.wikipedia.org/wiki/Rejection_sampling
Do you know of a package or tool that assists in rejection sampling or would you script it out yourself? Thanks!
I would just script it. It's just the equivalent of a do...while loop, taking samples that meet your criteria and ignoring the rest, until you hit some predefined limit:
You basically need to write a sampleMeetsCriteria function. Here, that means an element having a certain number of reads in line with what is in your target peaks, perhaps.
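A minimal shell sketch of that loop (sampleMeetsCriteria is hypothetical; here a simple length check stands in for it, which you would replace with a read-count comparison against the target peak set):

```shell
# Rejection sampling: draw random elements, keep only those passing the
# criteria check, and stop once `wanted` acceptances have been collected.
wanted=10
kept=0
: > accepted.bed                              # start with an empty output
while [ "$kept" -lt "$wanted" ]; do
  line=$(shuf -n 1 input.bed)                 # draw one candidate at random
  len=$(printf '%s\n' "$line" | awk -F'\t' '{ print $3 - $2 }')
  # hypothetical sampleMeetsCriteria: accept 300-400 bp elements
  if [ "$len" -ge 300 ] && [ "$len" -le 400 ]; then
    printf '%s\n' "$line" >> accepted.bed
    kept=$((kept + 1))
  fi
done
```

Note that each draw is independent, so this samples with replacement; deduplicate afterwards, or shuffle once up front and walk the shuffled file, if sampling without replacement is needed.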