BEDOPS would work well for this, along with standard UNIX utilities.
Let's start with an extents file that defines the bounds of our chromosomes, say for hg19:
We'll grab chromosome 10 and reformat the chromosome name to match your chromosome naming scheme:
$ grep chr10 hg19.extents.bed | awk '{print "10\t"$2"\t"$3}' > chr10.extents.bed
Next, we'll make a BED file of windows or bins from this chromosome, every 500 kb:
$ awk ' \
BEGIN \
{ \
binSize = 500000; \
binIdx = 0; \
} \
{ \
chr = $1; \
start = $2; \
stop = $3; \
for (binStart = start; binStart < (stop - binSize); binStart += binSize) { \
print chr"\t"binStart"\t"(binStart + binSize)"\tbin-"binIdx; \
binIdx++; \
} \
}' chr10.extents.bed \
> chr10.bins.bed
Next, we use BEDOPS bedops --partition
to split your input regions (regions.bed
) that span across bins. We can chain together a series of BEDOPS operations that calculate the number of overlapping bases across now-correctly partitioned regions:
$ bedops --partition chr10.bins.bed regions.bed \
| bedmap --echo --indicator --delim '\t' - regions.bed \
| grep "1$" - \
| cut -f1-3 - \
| bedmap --echo --bases --delim '\t' chr10.bins.bed - \
| awk -v binSize=500000 '{print $0"\t"($5/$binSize)}' - \
> final_answer.bed
The file final_answer.bed
contains all the bins in BED format, with six columns: bin chromosome, bin start, bin stop, bin index, the number of bases from regions.bed
that overlap that bin, and the percentage of overlap.
The chain of commands shown above might look a little overwhelming, but you can use UNIX tee
in between steps so you can examine the output being passed to the next command. This is a great way to learn the command line and particularly standard input and output management, which I think you'll find is a critical part of doing bioinformatics.
BEDOPS is really useful for this task - you can pass in standard input and route standard output between BEDOPS tools like bedops
and bedmap
, and UNIX utilities like grep
, cut
and awk
. After you use BEDOPS for a while, it almost becomes an expressive language for dealing with set operations of this kind.
Finally, the --partition
option in bedops
is a powerful piece of kit, which lets you quickly cut input regions that span bins, among other uses.
have a look at bedtools.
I actually looked at it, but there are several of them (tools). Can you give me an overview/idea for how to use bedtools for this? Thanks!
How do you need to handle the case where an input range straddles bounds between two adjacent 500k intervals?
Thanks for the reply! I want to simply calculate total number of bases that are within 500000 interval and calculate the percentage. Anything beyond that 500000 should be added to the subsequent interval.