Question

Splitting even length BED records by an even number of base-pairs produces a single base-pair window

10.0 years ago

James Ashmore ★ 3.5k

I ran into this issue today while plotting TSS occupancy heatmaps. If you take the coordinates of a single transcription start site, extend them by 1000 bp in each direction, and cut the resulting region into 500 bp windows, you end up with 5 windows, not the 4 I had assumed.

Take the TSS of a gene:

# test.bed
chr1    4857693    4857694    Tcea1    1    +

Increase the size by 1000bp upstream and downstream:

bedtools slop -i test.bed -g mm10.chromsizes -b 1000 > test.plusminus1000bp.bed

Check output of bedtools slop:

# test.plusminus1000bp.bed
chr1    4856693    4858694    Tcea1    1    +

Split feature into 500bp windows:

bedtools makewindows -b test.plusminus1000bp.bed -w 500 > test.plusminus1000bp.window500bp.bed

Check output of bedtools makewindows:

# test.plusminus1000bp.window500bp.bed
chr1    4856693    4857193
chr1    4857193    4857693
chr1    4857693    4858193
chr1    4858193    4858693
chr1    4858693    4858694

The last feature in the file is a single base-pair window. I assume this happens because of the 0-based, half-open coordinate system, but it was not obvious to me that such a window would be produced. Could this change the results of an analysis that assumes all windows are the same length? Would it be better to remove the single base-pair window?
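The extra window follows from the interval length: the slopped record spans 4858694 - 4856693 = 2001 bp, which is four full 500 bp windows plus a 1 bp remainder. The sketch below is one way to check this and, if equal-width windows are required, to drop short windows afterwards. The function names bed_remainder and keep_full_windows are hypothetical helpers written in plain awk; they are not part of bedtools.

```shell
# Hypothetical helper: print chrom, start, end, length, and the remainder left
# after cutting each record into windows of width w (default 500).
bed_remainder() {
    awk -v w="${1:-500}" 'BEGIN { FS = OFS = "\t" }
    { len = $3 - $2; print $1, $2, $3, len, len % w }'
}

# Hypothetical helper: keep only windows that are exactly w bp long,
# dropping any shorter trailing window.
keep_full_windows() {
    awk -v w="${1:-500}" 'BEGIN { FS = "\t" } ($3 - $2) == w'
}
```

For example, `bed_remainder 500 < test.plusminus1000bp.bed` reports a length of 2001 and a remainder of 1, and `keep_full_windows 500 < test.plusminus1000bp.window500bp.bed` keeps the four 500 bp windows. Whether dropping the last base is acceptable depends on the downstream analysis.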

bed bedtools • 2.7k views

10.0 years ago

dariober 15k

I think your reasoning is correct. One fix is to pad asymmetrically, using -l 1000 -r 999 instead of -b 1000:

bedtools slop -i test.bed -g mm10.chromsizes -l 1000 -r 999 > test.plusminus1000bp.bed
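To see why the asymmetric padding works, the coordinate arithmetic can be reproduced in plain awk. This is only a stand-in for the arithmetic, not a replacement for bedtools slop: extend_bed is a hypothetical helper that ignores the genome file, so it clamps the start at 0 but never clamps the end at the chromosome size.

```shell
# Hypothetical helper: extend each record by l bp on the left and r bp on the
# right, by coordinate (not by strand). No genome file is consulted, so only
# the start is clamped (at 0).
extend_bed() {
    awk -v l="$1" -v r="$2" 'BEGIN { FS = OFS = "\t" }
    {
        s = $2 - l
        if (s < 0) s = 0
        $2 = s
        $3 = $3 + r
        print
    }'
}
```

With the TSS record above, `extend_bed 1000 999 < test.bed` gives chr1 4856693 4858693, an interval of exactly 2000 bp, which divides into four 500 bp windows with nothing left over.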

Ram · Accepted Answer · 2015-09-03

This is a consequence of the half-open indexing of BED elements. You can fix it by applying an asymmetric range operation with bedops --range (from BEDOPS) and then generating windows with bedops --chop.

For instance:

$ echo -e 'chr1\t4857693\t4857694\tTcea1\t1\t+' | bedops --range -1000:999 --everything - | bedops --chop 500 -
chr1    4856693    4857193
chr1    4857193    4857693
chr1    4857693    4858193
chr1    4858193    4858693

BEDOPS works natively with standard input and output streams, which keeps pipelines like this one concise and fast on large datasets.

Note that --chop operations calculate new elements, which necessarily discards all but the first three columns.
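If the name, score, and strand columns are needed on each window, one alternative is to do the windowing in plain awk and copy the extra columns through. This is a swapped-in technique, not BEDOPS behaviour; window_keep_cols is a hypothetical helper, and it keeps a shorter trailing window when the record length is not a multiple of the window width.

```shell
# Hypothetical helper: cut each record into windows of width w (default 500)
# and append the record's columns 4..NF to every window.
window_keep_cols() {
    awk -v w="${1:-500}" 'BEGIN { FS = OFS = "\t" }
    {
        # Collect any columns beyond the first three, each prefixed by a tab.
        rest = ""
        for (i = 4; i <= NF; i++) rest = rest OFS $i
        # Emit consecutive windows; the last one is truncated at the record end.
        for (s = $2; s < $3; s += w) {
            e = s + w
            if (e > $3) e = $3
            print $1, s, e rest
        }
    }'
}
```

For example, piping the output of the bedops --range step above into `window_keep_cols 500` yields four windows, each still carrying Tcea1, 1, and +.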