BEDOPS bedmap
is an excellent tool to solve this problem. You can use the bedmap --count
operation to count the number of elements that overlap your bins.
First, you need to prepare your data file for use with BEDOPS tools. Secondly, you need to prepare a set of bins with awk
. Then you can use bedmap
. I'll show you how you can do these steps.
Let's set up your data file — let's call it myRawData.txt
. You need to strip the first line (those column headers are unnecessary) and then sort that result with BEDOPS sort-bed
:
$ tail -n +2 myRawData.txt \
| sort-bed - \
> mySortedData.bed
You need to sort the data, but this only needs to be done once. This gets your data ready for use with BEDOPS tools, which take advantage of the ordering in sorted data to operate fast and with a low memory profile, compared with alternative tools.
Now let's set up your bins. Let's say you are working with the human hg19
assembly. You'll need the bounds of each chromosome in hg19
. To do this, retrieve the hg19
version of the chromInfo
table from the UCSC Genome Browser:
- Visit the UCSC Table Browser. With the
All Tables
group selected, for example, select the hg19
database and the chromInfo
table. Output all fields to a text file. (This step can also be performed with Kent-tools' hgsql
commands, if this needs automating.)
- Edit this text file (e.g. run
awk
on it to put in a start coordinate of 0
) and run this through sort-bed
to turn it into a sorted BED file.
Here's a ready-to-use example for hg19
that you can download, which I have made for demonstration purposes. Here's what the first few lines look like:
chr1 0 249250621
chr10 0 135534747
chr11 0 135006516
chr11_gl000202_random 0 40103
chr12 0 133851895
...
Now that we have the bounds for our bins, let's make the bins! We'll save that chromosome list to a file called chrList.bed
and use its region data to define bins:
$ awk ' \
BEGIN \
{ \
binSize = 1000000; \
binIdx = 0; \
} \
{ \
chr = $1; \
start = $2; \
stop = $3; \
for (binStart = start; binStart < (stop - binSize); binStart += binSize) { \
print chr"\t"binStart"\t"(binStart + binSize)"\tbin-"binIdx; \
binIdx++; \
} \
}' chrList.bed \
> myBins.bed
Now that we have our two sorted inputs: mySortedData.bed
and myBins.bed
, we're ready to use bedmap --count
to answer your question:
$ bedmap --echo --count --fraction-map 0.51 myBins.bed mySortedData.bed > myAnswer.bed
This command maps elements in mySortedData.bed
over binned regions in myBins.bed
, counts any overlapping elements that meets the overlap criteria (described below), and prints the bins and their respective counts to the file myAnswer.bed
.
This result file myAnswer.bed
is a sorted BED file and can be consumed by other BEDOPS or command-line tools in a larger analysis pipeline.
Note that we used the --fraction-map
operator with a parameter of 0.51
. This means that a "map" element in mySortedData.bed
must overlap a region in myBins.bed
by 51% or more of the mapped element's length. This deals with the edge case where an element in mySortedData.bed
could straddle two bins.
We set this overlap criteria, because the default overlap criteria in bedmap
is one or more bases of overlaps between an element and a binned region. This would double-count elements which overlap two bins, which you probably do not want.
Using --fraction-map 0.51
forces an element to fall into one or the other bin, depending on which has greater relative overlap.
If you do want to double-count, then just remove the --fraction-map 0.51
setting.
Another option is to pre-process your raw data input; you might instead process it into single-base elements, taken from the first base, e.g.:
$ tail -n +2 myRawData.txt \
| sort-bed - \
| awk '{ print $1"\t"$2"\t"($2+1); }' - \
> mySortedFirstBasesData.bed
Because the bins are disjoint, you can then use bedmap
with its default overlap setting, and each element will associate with one and only one bin, where there is overlap between the element's first base and that bin's genomic range.
However you choose to handle overlap cases, you will want to understand how your elements are being counted, as this can often lead to noise or errors that complicate results generated further downstream in your analysis.
Look into the modulo (%) operator. A corner case you'll have to handle is intervals that overlap two of your bins.