One option is to map your regions to disjoint windows over a chromosome, counting the number of regions within each window.
To this end, you can use BEDOPS bedmap
and its --count
operator.
First, decide on chromosome bounds. For instance, for the hg19
build of the human genome, you could use the extents available here in hg19.extents.bed
:
Second, make some 10 kb windows with awk
and store them in a file called hg19.disjointWindows.bed
(or adjust window size, depending on the granularity you want):
$ awk ' \
BEGIN {\
windowSize = 10000; \
} \
{ \
extentChromosome = $1; \
extentStart = $2; \
extentStop = $3; \
for (windowStart = extentStart; windowStart < extentStop; windowStart += windowSize) { \
windowStop = windowStart + windowSize; \
print extentChromosome"\t"windowStart"\t"windowStop; \
} \
}' hg19.extents.bed \
> hg19.disjointWindows.bed
This is guaranteed to be sorted, so we don't need to use any other tools (e.g., sort-bed
) to sort this data. So we can use these windows with BEDOPS tools directly.
Third, we prepare your regions. Let's say your regions are in a BED file called unsortedRegions.bed
, but we don't know its sort order. So we first sort this data with BEDOPS sort-bed
:
$ sort-bed unsortedRegions.bed > regions.bed
Fourth, we count:
$ bedmap --echo-map --count regions.bed hg19.disjointWindows.bed > answer.bed
The file answer.bed
will contain data in the following format:
[ window-1 ] | [ number-of-regions-overlapping-window-1 ]
[ window-2 ] | [ number-of-regions-overlapping-window-2 ]
. . .
[ window-n ] | [ number-of-regions-overlapping-window-n ]
Where each [ window-i ]
is the i
-th element in hg19.disjointWindows.bed
, and [ number-of-regions-... ]
is the count of overlapping regions over that window.
If you want to split this file further by chromosome, use BEDOPS bedextract
. In a bash
shell, this is a very fast one-liner:
$ for chr in `bedextract --list-chr answer.bed`; do bedextract $chr answer.bed > $chr.answer.bed; done
You can then bring answer.bed
or its separated chromosome files into R or gnuplot
, plotting a log-scaled histogram or visualizing it in any number of other ways.
Hi Alex,
Thanks for posting this, wondering about the way you generated the extents file, I need to do essentially the same thing for mm10.
Rob.
Could use Kent utilities:
Yes that works well, weirdly though I get an empty mm10.disjointWindows.bed file after running the awk script.
Yeah, not sure why that is. Maybe do:
and see what you get.
Ah, sorry, I misread. I'm not sure why you get an empty result. I'd break things down and use
head
to see what comes out at various stages.Another option is to use
bedops --chop
to make the 10k windows, instead ofawk
:This option was added after I wrote the initial answer.