converting bed file to heatmap data

0

Entering edit mode

10.5 years ago

arnstrm ★ 1.9k

Hi,

I was wondering if there are any utilities that can convert the bed file to heatmap data, which I can use it for plotting.

For example, if I have a bed file:

10      3843    4059
10      6964    7383
10      17670   17855
10      21897   22762
10      71880   72558
10      73454   74055
10      82310   82654
10      82796   85155
10      85306   86100
10      160810  161076

I want to convert it to number of bases per interval, like:

10    1       500000  10.8223
10    500001  1000000 22.0333
10    1000001 1500000 44.4082
10    1500001 2000000 49.7916
10    2000001 2500000 53.1821
10    2500001 3000000 39.134
10    3000001 3500000 33.7936
10    3500001 4000000 12.4694
10    4000001 4500000 57.7986
10    4500001 5000000 66.1606

where the 4th filed is percent of bases covered per 500000 (or other values).

Thanks for any suggestions!

bed heatmap genome-coverage • 5.4k views

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by arnstrm ★ 1.9k

1

Entering edit mode

have a look at bedtools.

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by Ming Tommy Tang ★ 4.6k

0

Entering edit mode

I actually looked at it, but there are several of them (tools). Can you give me an overview/idea for how to use bedtools for this? Thanks!

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by arnstrm ★ 1.9k

0

Entering edit mode

How do you need to handle the case where an input range straddles bounds between two adjacent 500k intervals?

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by Alex Reynolds 36k

0

Entering edit mode

Thanks for the reply! I want to simply calculate total number of bases that are within 500000 interval and calculate the percentage. Anything beyond that 500000 should be added to the subsequent interval.

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by arnstrm ★ 1.9k

8

Entering edit mode

10.5 years ago

Aaronquinlan 12k

Make 500Kb "windows" tiling the genome. As an example - it looks like you already have your files ready.

bedtools makewindows -w 500000 -g genomes/human.hg19.genome > hg19.500kb.bed

Create a random BED file of 1,000,000 intervals each 100bp in size. As an example - it looks like you already have your files ready.

bedtools random -n 1000000 -l 100 -g genomes/human.hg19.genome > random.bed

Calculate the coverage of the random BED records on the 500Kb windows. The first 3 columns are the window itself, the 4th is the count of BED records that overlapped the window, the fifth is the total number of base pairs in the window with coverage, the 6th is the size of the window, and the 7th is what you are after: the fraction of bases in the window with coverage.

bedtools coverage -a random.bed -b hg19.500kb.bed \
  | sort -k1,1 -k2,2n \
  | awk '$1=="chr10"' \
  | head

chr10    0    500000    162    15947    500000    0.0318940
chr10    500000    1000000    143    14136    500000    0.0282720
chr10    1000000    1500000    147    14487    500000    0.0289740
chr10    1500000    2000000    157    15473    500000    0.0309460
chr10    2000000    2500000    164    16252    500000    0.0325040
chr10    2500000    3000000    160    15501    500000    0.0310020
chr10    3000000    3500000    164    16322    500000    0.0326440
chr10    3500000    4000000    165    16178    500000    0.0323560
chr10    4000000    4500000    150    14931    500000    0.0298620
chr10    4500000    5000000    169    16542    500000    0.0330840

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by Aaronquinlan 12k

0

Entering edit mode

This is even better! Thanks a lot!! Love the BEDTools! Thanks.

ADD REPLY • link 10.5 years ago by arnstrm ★ 1.9k

4

Entering edit mode

10.5 years ago

Alex Reynolds 36k

BEDOPS would work well for this, along with standard UNIX utilities.

Let's start with an extents file that defines the bounds of our chromosomes, say for hg19:

	chr1 0 249250621
	chr10 0 135534747
	chr11 0 135006516
	chr12 0 133851895
	chr13 0 115169878
	chr14 0 107349540
	chr15 0 102531392
	chr16 0 90354753
	chr17 0 81195210
	chr18 0 78077248
	chr19 0 59128983
	chr2 0 243199373
	chr20 0 63025520
	chr21 0 48129895
	chr22 0 51304566
	chr3 0 198022430
	chr4 0 191154276
	chr5 0 180915260
	chr6 0 171115067
	chr7 0 159138663
	chr8 0 146364022
	chr9 0 141213431
	chrM 0 16571
	chrX 0 155270560
	chrY 0 59373566

view raw hg19.extents.bed hosted with ❤ by GitHub

We'll grab chromosome 10 and reformat the chromosome name to match your chromosome naming scheme:

$ grep chr10 hg19.extents.bed | awk '{print "10\t"$2"\t"$3}' > chr10.extents.bed

Next, we'll make a BED file of windows or bins from this chromosome, every 500 kb:

$ awk ' \
    BEGIN \
    { \
        binSize = 500000; \
        binIdx = 0; \
    } \
    { \
        chr = $1; \
        start = $2; \
        stop = $3; \
        for (binStart = start; binStart < (stop - binSize); binStart += binSize) { \
            print chr"\t"binStart"\t"(binStart + binSize)"\tbin-"binIdx; \
            binIdx++; \
        } \
    }' chr10.extents.bed \
    > chr10.bins.bed

Next, we use BEDOPS bedops --partition to split your input regions (regions.bed) that span across bins. We can chain together a series of BEDOPS operations that calculate the number of overlapping bases across now-correctly partitioned regions:

$ bedops --partition chr10.bins.bed regions.bed \
    | bedmap --echo --indicator --delim '\t' - regions.bed \
    | grep "1$" - \
    | cut -f1-3 - \
    | bedmap --echo --bases --delim '\t' chr10.bins.bed - \
    | awk -v binSize=500000 '{print $0"\t"($5/$binSize)}' - \
    > final_answer.bed

The file final_answer.bed contains all the bins in BED format, with six columns: bin chromosome, bin start, bin stop, bin index, the number of bases from regions.bed that overlap that bin, and the percentage of overlap.

The chain of commands shown above might look a little overwhelming, but you can use UNIX tee in between steps so you can examine the output being passed to the next command. This is a great way to learn the command line and particularly standard input and output management, which I think you'll find is a critical part of doing bioinformatics.

BEDOPS is really useful for this task - you can pass in standard input and route standard output between BEDOPS tools like bedops and bedmap, and UNIX utilities like grep, cut and awk. After you use BEDOPS for a while, it almost becomes an expressive language for dealing with set operations of this kind.

Finally, the --partition option in bedops is a powerful piece of kit, which lets you quickly cut input regions that span bins, among other uses.

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by Alex Reynolds 36k

0

Entering edit mode

Just this morning I was thinking how less I use of awk, and you've just shown me the next step to take as well as the next tool to get conversant with (bedops). Thank you!

ADD REPLY • link 10.5 years ago by Ram 45k

0

Entering edit mode

Wow! thanks a lot!! I think you just saved a week of my time (I'm not a good programmer!)/ I greatly appreciate your detailed answer. Thanks!

ADD REPLY • link 10.5 years ago by arnstrm ★ 1.9k

0

Entering edit mode

Hi Alex,

Actually I noticed this approach and decided that the way bins are split up here are more appropriate towards the heat maps I want to plot of distribution, did it on the whole mm10 scale and weirdly got: awk: (FILENAME=- FNR=1) fatal: division by zero attempted

After running the bedops --partition

ADD REPLY • link 7.9 years ago by rbronste ▴ 420

0

Entering edit mode

One other thing I wanted to mention is that the hg19.extents.bed file shown above actually is in the format of the hg19.bed file when pulled with fetchchromsizes.

ADD REPLY • link 7.9 years ago by rbronste ▴ 420

2

Entering edit mode

10.5 years ago

Ram 45k

This level of customization in requirements can best be addressed by a custom Python script. Logic to build such a script:

Set start = 1
Set end = block_size
Set sum = fringe (if fringe exists and is >0)
Set fringe = 0
For each BED line
- read BED line into array
- if array[1] > stop, set perc[stop] = sum/block_size, start = end+1 and end = end+block_size
- else if array[2] > stop, set sum += end - (array[1] - 1) and fringe = array[2] - (end - 1), start = end+1 and end = end+block_size
- else if array[2] < stop, sum += array[2] - (array[1] - 1)

In the end , the perc dictionary will have the percentages you need.

ADD COMMENT • link 3.3 years ago by Ram 45k

0

Entering edit mode

This of course assumed that each block has some data in the BED file. You might wanna address scenarios where an entire block is skipped. You can do that by adding a condition to check if array[1] > stop && array[1] < stop + block_size

I'd do it but I don't wanna end up writing the program for you :-)

ADD REPLY • link 3.3 years ago by Ram 45k

Login before adding your answer.