Question

converting multiple bed files to equivalent intervals

0

Entering edit mode

8.4 years ago

chrisclarkson100 ▴ 160

I downloaded a number of bigWig files from the ENCODE project and converted them to bed files.

I did this as follows:

bigWigToWig file.bigWig file.wig
wig2bed -x <file.wig> file.bed

However the file intervals can differ:

ENCFF001.bed
chr1    2999998 2999999 id-1    1.000000
chr1    2999999 3000000 id-2    1.000000
ENCFF002.bed
chr1    3001400 3001500 id-1    0.140000
chr1    3001600 3001700 id-2    0.140000

My first question is why do they start from different points in the genome? And why do genome-wide bed files always start at ~3000000- why not 1?

And I then downloaded a separate dataset from a source other than ENCODE.

HET.bed
chr1    3049360 3053345 Region_1        0       0
chr1    3136664 3138809 Region_2        0       0

What I would like to do is align the bed files' intervals so that I can analyse them parallel to one another.

The interval distance between rows is arbitrary it can be 100 or 1000. All I really need is to be able to consistently manipulate the files so that the data looks something like this:

                             ENCF001           ENCF002             HET
chr1    3000000 3000500        1                 2                  2          
chr1    3001000 3001500        1                 1                  3
#column values are examples and not from real data

So can anyone help me convert a series of bed files to consistent intervals??

alignment • 2.6k views

ADD COMMENT • link updated 8.4 years ago by Alex Reynolds 36k • written 8.4 years ago by chrisclarkson100 ▴ 160

score 0 · Answer 1 · 2017-02-06

Hey, try to use CNVkit tool - there is an option --split in target argument:

cnvkit.py target my_baits.bed --split -o my_targets.bed

cnvkit.py target -h  


 --split               Split large tiled intervals into smaller, consecutive
                        targets.
  -a AVG_SIZE, --avg-size AVG_SIZE
                        Average size of split target bins (results are
                        approximate). [Default: 266.666666667]

INSTALL:

pip install cnvkit

It is very fast.

score 0 · Answer 2 · 2017-02-06

0

Entering edit mode

8.4 years ago

Alex Reynolds 36k

First, get the bounds of the chromosomes of your genomic build and write them to a sorted BED file.

For example, for hg38:

$ mysql  --user=genome --host=genome-mysql.cse.ucsc.edu -N -A -D hg38 -e 'select chrom,0,size from chromInfo' | sort-bed - > hg38-bounds.bed

Then split up the bounds with BEDOPS bedops --chop N.

For example, to chop up the chromosome intervals into 5k windows:

$ bedops --chop 5000 hg38-bounds.bed > hg38-5k-windows.bed

Then you can do set operations with bedops and bedmap etc. with HET.bed and hg38-5k-windows.bed, or other files. For example:

$ bedmap --echo --echo-map-id-uniq hg38-5k-windows.bed HET.bed > hg38-5k-windows-mapped-to-HET-ids.bed

Etc.

To answer your second question, BED describes intervals as half-open and zero-indexed. This means that coordinates for an interval will start from a zero-relative position and end at (length - 1). This convention allows faster position and length calculations on intervals.

ADD COMMENT • link 8.4 years ago by Alex Reynolds 36k

0

Entering edit mode

Thanks so much for the answer. However I cannot seem to re-assign the occupancy values from the original files to the new ones. So far I have:

mysql  --user=genome --host=genome-mysql.cse.ucsc.edu -N -A -D mm9 -e 'select chrom,0,size from chromInfo' | sort-bed - > mm9.bed
bedops --chop 5000 mm9.bed > mm9_5k.bed
bedmap --echo --echo-map-id-uniq mm9_5k.bed HET.bed > HET-5k.bed

And yet the newly formatted HET-5k.bed doesn't have any occupancy values

chr1    0       5000|
chr1    5000    10000|

Unlike the original HET.bed

chr1    3049360 3053345 Region_1        0       0
chr1    3136664 3138809 Region_2        0       0
chr1    3786627 3791240 Region_4        0       0

Thanks again

ADD REPLY • link 8.4 years ago by chrisclarkson100 ▴ 160

0

Entering edit mode

In your HET.bed snippet, there are no elements that overlap with chr1:0-5000 or chr1:5000-10000. That seems correct to me.

If you just want to see the "meat" of the overlaps and not bother where there are no overlaps between windows and HET regions, add --skip-unmapped to bedmap, e.g.:

$ bedmap --echo --echo-map-id-uniq --skip-unmapped mm9_5k.bed HET.bed > HET-5k.bed

ADD REPLY • link 8.4 years ago by Alex Reynolds 36k

0

Entering edit mode

Thanks again but I'm confused. Now HET-5k.bed looks like this:

chr1    3045000 3050000|Region_1
chr1    3050000 3055000|Region_1
chr1    3135000 3140000|Region_2

but no occupancy values... the original looked like this:

chr1    3049360 3053345 Region_1        0       0
chr1    3136664 3138809 Region_2        0       0
chr1    3786627 3791240 Region_4        0       0
#occupancy vans included

ADD REPLY • link 8.4 years ago by chrisclarkson100 ▴ 160