converting multiple bed files to equivalent intervals
2
0
Entering edit mode
7.9 years ago

I downloaded a number of bigWig files from the ENCODE project and converted them to bed files.

I did this as follows:

bigWigToWig file.bigWig file.wig
wig2bed -x <file.wig> file.bed

However the file intervals can differ:

ENCFF001.bed
chr1    2999998 2999999 id-1    1.000000
chr1    2999999 3000000 id-2    1.000000
ENCFF002.bed
chr1    3001400 3001500 id-1    0.140000
chr1    3001600 3001700 id-2    0.140000

My first question is why do they start from different points in the genome? And why do genome-wide bed files always start at ~3000000- why not 1?

And I then downloaded a separate dataset from a source other than ENCODE.

HET.bed
chr1    3049360 3053345 Region_1        0       0
chr1    3136664 3138809 Region_2        0       0

What I would like to do is align the bed files' intervals so that I can analyse them parallel to one another.

The interval distance between rows is arbitrary it can be 100 or 1000. All I really need is to be able to consistently manipulate the files so that the data looks something like this:

                             ENCF001           ENCF002             HET
chr1    3000000 3000500        1                 2                  2          
chr1    3001000 3001500        1                 1                  3
#column values are examples and not from real data

So can anyone help me convert a series of bed files to consistent intervals??

alignment • 2.3k views
ADD COMMENT
0
Entering edit mode
7.9 years ago
Paul ★ 1.5k

Hey, try to use CNVkit tool - there is an option --split in target argument:

cnvkit.py target my_baits.bed --split -o my_targets.bed

cnvkit.py target -h  


 --split               Split large tiled intervals into smaller, consecutive
                        targets.
  -a AVG_SIZE, --avg-size AVG_SIZE
                        Average size of split target bins (results are
                        approximate). [Default: 266.666666667]

INSTALL:

pip install cnvkit

It is very fast.

ADD COMMENT
0
Entering edit mode
7.9 years ago

First, get the bounds of the chromosomes of your genomic build and write them to a sorted BED file.

For example, for hg38:

$ mysql  --user=genome --host=genome-mysql.cse.ucsc.edu -N -A -D hg38 -e 'select chrom,0,size from chromInfo' | sort-bed - > hg38-bounds.bed

Then split up the bounds with BEDOPS bedops --chop N.

For example, to chop up the chromosome intervals into 5k windows:

$ bedops --chop 5000 hg38-bounds.bed > hg38-5k-windows.bed

Then you can do set operations with bedops and bedmap etc. with HET.bed and hg38-5k-windows.bed, or other files. For example:

$ bedmap --echo --echo-map-id-uniq hg38-5k-windows.bed HET.bed > hg38-5k-windows-mapped-to-HET-ids.bed

Etc.

To answer your second question, BED describes intervals as half-open and zero-indexed. This means that coordinates for an interval will start from a zero-relative position and end at (length - 1). This convention allows faster position and length calculations on intervals.

ADD COMMENT
0
Entering edit mode

Thanks so much for the answer. However I cannot seem to re-assign the occupancy values from the original files to the new ones. So far I have:

mysql  --user=genome --host=genome-mysql.cse.ucsc.edu -N -A -D mm9 -e 'select chrom,0,size from chromInfo' | sort-bed - > mm9.bed
bedops --chop 5000 mm9.bed > mm9_5k.bed
bedmap --echo --echo-map-id-uniq mm9_5k.bed HET.bed > HET-5k.bed

And yet the newly formatted HET-5k.bed doesn't have any occupancy values

chr1    0       5000|
chr1    5000    10000|

Unlike the original HET.bed

chr1    3049360 3053345 Region_1        0       0
chr1    3136664 3138809 Region_2        0       0
chr1    3786627 3791240 Region_4        0       0

Thanks again

ADD REPLY
0
Entering edit mode

In your HET.bed snippet, there are no elements that overlap with chr1:0-5000 or chr1:5000-10000. That seems correct to me.

If you just want to see the "meat" of the overlaps and not bother where there are no overlaps between windows and HET regions, add --skip-unmapped to bedmap, e.g.:

$ bedmap --echo --echo-map-id-uniq --skip-unmapped mm9_5k.bed HET.bed > HET-5k.bed
ADD REPLY
0
Entering edit mode

Thanks again but I'm confused. Now HET-5k.bed looks like this:

chr1    3045000 3050000|Region_1
chr1    3050000 3055000|Region_1
chr1    3135000 3140000|Region_2

but no occupancy values... the original looked like this:

chr1    3049360 3053345 Region_1        0       0
chr1    3136664 3138809 Region_2        0       0
chr1    3786627 3791240 Region_4        0       0
#occupancy vans included
ADD REPLY

Login before adding your answer.

Traffic: 1722 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6