You can use BEDOPS and UCSC Kent tools to do this efficiently.
Get your assembly of interest via fetchChromSizes
, strip out non-nuclear chromosomes, and turn the rest into a BED file, split into 500nt bins.
Then use bedmap
to map it against your example BED file, modified to a BED5 file.
Based on your question, I'll assume you want the --sum
operator, but bedmap
has several statistical operations. Run bedmap --help
or the online docs for more detail.
For example, for assembly hg38
:
$ fetchChromSizes hg38 | grep -v '_*_' | awk -v FS="\t" -v OFS="\t" '{ print $1, "0", $2 }' | sort-bed - | bedops --chop 500 - | bedmap --echo --sum --delim '\t' - <( awk -v FS="\t" -v OFS="\t" '{ print $1, $2, $3, ".", $4 }' signal.bed ) > answer.bed
The file answer.bed
will be in your desired format, but with a 0-based index.
If you want a 1-based index, you can add one to each start coordinate:
$ fetchChromSizes hg38 | grep -v '_*_' | awk -v FS="\t" -v OFS="\t" '{ print $1, "0", $2 }' | sort-bed - | bedops --chop 500 - | bedmap --echo --sum --delim '\t' - <( awk -v FS="\t" -v OFS="\t" '{ print $1, $2, $3, ".", $4 }' signal.bed ) | bedops --everything --range 1:0 - > answer.1based.bed
What is the difference between one- and zero-based indexing, you might ask. There is a previous biostars question on that subject, located here: https://bit.ly/2X1HgF8
@Alex Reynolds your code seems appropriate however this created bins with a small overlap:
I corrected my bins using
--stagger 501
with the--chop 500
command.The additional comments from ATpoint and Alex Reynolds helped me clarify my understanding. I see now that I do not need to use the
--stagger 501
option.I have focused on using the
BEDOPS
solution so that I can become more familiar with it and because I can see a need to use other functionality from it for other work. In the end, thisbedmap
command works for my needsUsing
--fraction-map
overcame an issue where when a coordinate spanned two bins in my reference, it was printed out twice with the same sum values making the output BED file report what look like duplicates.I believe the
Bedtools
version would be a workable solution as well.Thank you for providing such a thorough response to my question.
BED is 0-based, there is no overlap in the intervals you show.
These zero-based intervals do not overlap, even though an interval's start position number matches the previous interval's stop position number.
Take a look at the bit.ly link at the end of my answer, which goes into the difference between zero- and one-based indexing. It's an important detail for doing set operations, because the choice of indexing decides what is called an overlap.
It is convention to use zero-based indexing for BED intervals, though not required. However, I suggest sticking with zero-based indexing to minimize errors when processing with other BEDOPS or other tools, or inputting to a visualization tool like the UCSC Genome Browser, etc.