Use awk to split your BED data into two files:
- The first file contains all 0/1 overlaps (where the eighth column is either 0 or 1), with the seventh column moved to the fifth (score) column
- The second file contains 1-only overlaps (where the eighth column is 1), with the seventh column moved to the fifth (score) column
Let's call these two files allOverlaps.bed and oneOverlaps.bed. Make sure they are sorted with BEDOPS sort-bed:
$ awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$7}' overlaps.bed | sort-bed - > allOverlaps.bed
$ awk '{if ($8==1) {print $1"\t"$2"\t"$3"\t"$4"\t"$7}}' overlaps.bed | sort-bed - > oneOverlaps.bed
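As a quick sanity check of the split step, here is a minimal sketch on a made-up overlaps.bed. The sample rows, the placeholder values in columns 5 and 6, and the temporary directory are all assumptions for illustration. Plain sort stands in for sort-bed so the snippet runs without BEDOPS installed; it should give the same order on a simple input like this one, but use sort-bed on real data.

```shell
# Hypothetical sample data: eight tab-delimited columns, with a count in
# column 7 and a 0/1 flag in column 8 (columns 5 and 6 are placeholders).
workdir=$(mktemp -d)
printf 'chr1\t150\t200\ta\t.\t+\t4\t0\n'   >  "$workdir/overlaps.bed"
printf 'chr1\t300\t400\tb\t.\t+\t3\t1\n'   >> "$workdir/overlaps.bed"
printf 'chr1\t120\t140\tc\t.\t+\t4\t1\n'   >> "$workdir/overlaps.bed"
printf 'chr1\t1600\t1700\td\t.\t+\t5\t0\n' >> "$workdir/overlaps.bed"

# Same filters as the awk commands above, written with OFS.
# Plain sort is only a stand-in for sort-bed here.
awk 'BEGIN {OFS = "\t"} {print $1, $2, $3, $4, $7}' "$workdir/overlaps.bed" \
  | LC_ALL=C sort -k1,1 -k2,2n -k3,3n > "$workdir/allOverlaps.bed"
awk 'BEGIN {OFS = "\t"} $8 == 1 {print $1, $2, $3, $4, $7}' "$workdir/overlaps.bed" \
  | LC_ALL=C sort -k1,1 -k2,2n -k3,3n > "$workdir/oneOverlaps.bed"

# Show the 1-only subset: two rows, with the count now in column 5.
cat "$workdir/oneOverlaps.bed"
```

On this toy input, allOverlaps.bed keeps all four rows and oneOverlaps.bed keeps only rows c and b, each with the old seventh column in the score position.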
Set up a third, tab-delimited, sorted BED file called windows.bed, with your windows of interest:
chr1 100 1000
chr1 1500 3000
...
Here's a one-liner where you apply two BEDOPS bedmap operations in piped, serial fashion:
$ bedmap --echo --sum --delim '\t' windows.bed oneOverlaps.bed | bedmap --echo --sum --delim '/' - allOverlaps.bed > answer.bed
The file answer.bed should follow your specified output format, I think.
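To understand how this works: you are applying the --sum operation to the score column of any mapped overlaps (note that this allows multiple rows with 1 in the last column, since their scores are simply added together). The first bedmap call appends the sum over the 1-only overlaps to each window, separated by a tab; the second reads that result from standard input and appends the sum over all overlaps, separated by a slash. By running the map operation on each category of overlaps, and piping the results of the first into the second, we get the final answer.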
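If you want to cross-check the result on a small input without BEDOPS, the same arithmetic can be sketched in plain awk. This is not how bedmap works internally: it is a naive loop over every window and every element, fine for a toy file and far too slow for real data. The sample rows, the helper name sum_per_window, and the temporary files are assumptions for illustration. An element counts as overlapping a window if they share at least one base, which is my understanding of bedmap's default overlap criterion.

```shell
workdir=$(mktemp -d)
# Hypothetical scored overlaps: chrom, start, end, id, score.
printf 'chr1\t120\t140\tc\t4\nchr1\t150\t200\ta\t4\nchr1\t300\t400\tb\t3\nchr1\t1600\t1700\td\t5\n' \
  > "$workdir/allOverlaps.bed"
printf 'chr1\t120\t140\tc\t4\nchr1\t300\t400\tb\t3\n' > "$workdir/oneOverlaps.bed"
printf 'chr1\t100\t1000\nchr1\t1500\t3000\n' > "$workdir/windows.bed"

# Hypothetical helper: for each window in $1, sum column 5 of every element
# in $2 that overlaps it by at least one base (half-open BED coordinates).
sum_per_window() {
  awk 'BEGIN {FS = OFS = "\t"}
       NR == FNR {                      # first file: load the scored elements
         chrom[NR] = $1; start[NR] = $2 + 0; stop[NR] = $3 + 0
         score[NR] = $5 + 0; n = NR
         next
       }
       {                                # second file: one output row per window
         total = 0
         for (i = 1; i <= n; i++)
           if (chrom[i] == $1 && start[i] < $3 + 0 && $2 + 0 < stop[i])
             total += score[i]
         print $1, $2, $3, total
       }' "$2" "$1"
}

sum_per_window "$workdir/windows.bed" "$workdir/oneOverlaps.bed" > "$workdir/one.sum"
sum_per_window "$workdir/windows.bed" "$workdir/allOverlaps.bed" > "$workdir/all.sum"

# Join the two sums as one/all, mirroring the '/' delimiter in the bedmap one-liner.
result=$(paste "$workdir/one.sum" "$workdir/all.sum" \
  | awk 'BEGIN {FS = OFS = "\t"} {print $1, $2, $3, $4 "/" $8}')
echo "$result"
```

On this toy input the first window comes out as 7/11 and the second as 0/5. One difference to keep in mind when comparing against answer.bed: this sketch prints 0 for a window with no matching elements, whereas bedmap, as far as I recall, reports NAN in that case and formats sums with decimal places.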
Should be speedy, I'd think, though awk is tough to beat in straight-up parsing. The sort-bed tool is much faster than alternatives, and the bedmap application works very efficiently with BED files (and since your data are already in that format, more or less, why not use tools built for it?). We've got some performance enhancements coming soon to make that even faster.
So let me try to understand this from what I see in your example. Let's say that for the window chr1 100 1000, you sum up all the numbers in the seventh column that correspond to this window (hence giving the final value of 11), and then, among those rows, you look for the ones that have a value of 1 in the final column. Is that right?
Also, I assume this is a tab-delimited file?
Yes, exactly. My attempt was to take a list of all the windows and loop over it with a while loop in Perl, using awk to tally the counts. But it is very slow.
Can you have more than one row with a last-column value of 1? If so, do you add those values together (so, if the first line in your example had a last-column value of 1, would this be 7/11)?
Yes, you can have more than one. The example you give is correct.