Question

How can I bin my bed files into 500bp bins?

0

Entering edit mode

4.5 years ago

Tawny ▴ 180

I have several hundred bed files that will be processed by another colleague. I have been searching for a way to bin these by 500bp bins to reduce their size and have universal bins across sample results. Here is an example bed file:

chr1    33  52  1
chr1    52  53  2
chr1    53  69  3
chr1    369 405 2
chr1    669 632 1

So each chromosome in each sample's file will have the same 500bp bins and I will have a sum of the fourth column in my new bed file. Which I imagine to look like this after processing my example bed file:

chr1    1    500    8
chr1    501    1001    1

What tools should be used to generate my output bed file?

bed • 4.2k views

ADD COMMENT • link updated 4.5 years ago by ATpoint 85k • written 4.5 years ago by Tawny ▴ 180

1

Entering edit mode

4.5 years ago

cpad0112 21k

adding example to @ATPpoint solution:

$ cat file.txt 
chr1    2000

$ bedtools makewindows -g file.txt -w 500 | bedtools map -a - -b test.txt -c 4 -o sum 
chr1    0   500 8
chr1    500 1000    1
chr1    1000    1500    .
chr1    1500    2000    .

ADD COMMENT • link 4.5 years ago by cpad0112 21k

0

Entering edit mode

4.5 years ago

ATpoint 85k

Check bedtools map. This can take a file of bins (e.g. from bedtools makewindows and then map your data onto these bins. It has options to perform an operation on column4, e.g. sum.

ADD COMMENT • link 4.5 years ago by ATpoint 85k

score 3 · Accepted Answer · 2020-05-26

3

Entering edit mode

4.5 years ago

Alex Reynolds 36k

You can use BEDOPS and UCSC Kent tools to do this efficiently.

Get your assembly of interest via fetchChromSizes, strip out non-nuclear chromosomes, and turn the rest into a BED file, split into 500nt bins.

Then use bedmap to map it against your example BED file, modified to a BED5 file.

Based on your question, I'll assume you want the --sum operator, but bedmap has several statistical operations. Run bedmap --help or the online docs for more detail.

For example, for assembly hg38:

$ fetchChromSizes hg38 | grep -v '_*_' | awk -v FS="\t" -v OFS="\t" '{ print $1, "0", $2 }' | sort-bed - | bedops --chop 500 - | bedmap --echo --sum --delim '\t' - <( awk -v FS="\t" -v OFS="\t" '{ print $1, $2, $3, ".", $4 }' signal.bed ) > answer.bed

The file answer.bed will be in your desired format, but with a 0-based index.

If you want a 1-based index, you can add one to each start coordinate:

$ fetchChromSizes hg38 | grep -v '_*_' | awk -v FS="\t" -v OFS="\t" '{ print $1, "0", $2 }' | sort-bed - | bedops --chop 500 - | bedmap --echo --sum --delim '\t' - <( awk -v FS="\t" -v OFS="\t" '{ print $1, $2, $3, ".", $4 }' signal.bed ) | bedops --everything --range 1:0 - > answer.1based.bed

What is the difference between one- and zero-based indexing, you might ask. There is a previous biostars question on that subject, located here: https://bit.ly/2X1HgF8

ADD COMMENT • link 4.5 years ago by Alex Reynolds 36k

0

Entering edit mode

@Alex Reynolds your code seems appropriate however this created bins with a small overlap:

chr2    81993000        81993500
chr2    81993500        81994000
chr2    81994000        81994500

I corrected my bins using --stagger 501 with the --chop 500 command.

ADD REPLY • link 4.5 years ago by Tawny ▴ 180

1

Entering edit mode

The additional comments from ATpoint and Alex Reynolds helped me clarify my understanding. I see now that I do not need to use the --stagger 501 option.

I have focused on using the BEDOPS solution so that I can become more familiar with it and because I can see a need to use other functionality from it for other work. In the end, this bedmap command works for my needs

bedmap --fraction-map 0.5 --ec --echo --skip-unmapped --sum --delim '\t' hg38_chromSizes.bed  Sample_sort.bed \
        > Sample_sort_500bpBins.bed

Using --fraction-map overcame an issue where when a coordinate spanned two bins in my reference, it was printed out twice with the same sum values making the output BED file report what look like duplicates.

I believe the Bedtools version would be a workable solution as well.

Thank you for providing such a thorough response to my question.

ADD REPLY • link 4.5 years ago by Tawny ▴ 180

0

Entering edit mode

BED is 0-based, there is no overlap in the intervals you show.

ADD REPLY • link 4.5 years ago by ATpoint 85k

0

Entering edit mode

These zero-based intervals do not overlap, even though an interval's start position number matches the previous interval's stop position number.

Take a look at the bit.ly link at the end of my answer, which goes into the difference between zero- and one-based indexing. It's an important detail for doing set operations, because the choice of indexing decides what is called an overlap.

It is convention to use zero-based indexing for BED intervals, though not required. However, I suggest sticking with zero-based indexing to minimize errors when processing with other BEDOPS or other tools, or inputting to a visualization tool like the UCSC Genome Browser, etc.

ADD REPLY • link 4.5 years ago by Alex Reynolds 36k