Question

how to merge overlapping intervals in a bedgraph file

0

Entering edit mode

9.5 years ago

Fidel ★ 2.0k

I have a simple problem that I think could be resolved with current software; however I can't find a solution.

I am using a program that produces a bedgraph with overlapping intervals for example:

chr1 0 100 2
chr1 50 150 1
chr1 100 200 3

I would like to obtain a bedgraph that computes the maximum (or any other operation) for the overlapping regions. For example, using the maximum the output should look like:

chr1 0 50 2
chr1 50 100 2
chr1 100 150 3
chr 1 150 200 3

I tried using unionbg from bedtools which is expected to do something very similar to what I want but it requires more than one file and, if the same file is given twice, the output is not what I am looking for. merge, from bedtools is not useful because it will merge all regions into a single one.

I already wrote a simple script to solve the problem but only handles the case of two overlaps as in the previous example and I am under the impression that a faster and more powerful solution should be available.

sequence bedtools • 9.7k views

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 9.5 years ago by Fidel ★ 2.0k

2

Entering edit mode

the UCSC kent tools include a tool bedRemoveOverlap. It doesn't summarize, it just cleans up, which in most cases what I need, as the overlaps will be <0.01% of all features.

ADD REPLY • link 5.3 years ago by max ▴ 60

0

Entering edit mode

bedRemoveOverlap is a life saver!!!

ADD REPLY • link 5.2 years ago by rls_08 ▴ 40

0

Entering edit mode

I am trying to do same thing for analyinz chipseq bed formatted data. If you can transform your bed graph to bed maybe you can get them that merge peaks function. though i am not %100 sure.

http://homer.salk.edu/homer/ngs/mergePeaks.html

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.5 years ago by morovatunc ▴ 560

Ram · Answer 1 · 2015-10-30

If it's okay to use BEDOPS, you could use a one-liner of bedops --partition, bedmap and awk:

$ bedops --partition regions.bed |\
  bedmap --echo --echo-map-id-uniq --delim '\t' - regions.bed |\
  awk '{
          n = split($4, a, ";");
          max = a[1];
          for(i = 2; i <= n; i++)
          {
              if (a[i] > max)
              {
                  max = a[i];
              }
          }
          print $1"\t"$2"\t"$3"\t"max;
      }' - > answer.bed

Make sure the input regions are tab-delimited and sorted with sort-bed.

This works by partitioning the input into disjoint regions however many overlaps there are (bedops), mapping the disjoint regions to the original input to get a string of overlapping ids (bedmap), and using awk to find the maximum value in that string. Then awk prints the disjoint region and its max id.

By using standard input/output streams and using a linear sweep for the line maximum, this should run fast and have low memory overhead.

Here's what things look like at each step of the pipeline.

First, we partition the input regions into disjoint elements:

$ bedops --partition regions.bed
chr1    0      50
chr1    50     100
chr1    100    150
chr1    150    200

You can throw as many overlaps into --partition as you like.

Then we map the partitioned set back to the original input regions, and get the list of overlapping ids:

$ bedops --partition regions.bed |\
  bedmap --echo --echo-map-id-uniq --delim '\t' - regions.bed
chr1    0      50     2
chr1    50     100    1;2
chr1    100    150    1;3
chr1    150    200    3

Then awk parses each of the id-strings in the fourth column to get the maximum value, printing the result:

$ bedops --partition regions.bed |\
  bedmap --echo --echo-map-id-uniq --delim '\t' - regions.bed |\
  awk '{ 
          n = split($4, a, ";");
          max = a[1];
          for(i = 2; i <= n; i++)
          {
              if (a[i] > max)
              {
                  max = a[i];
              }
          }
          print $1"\t"$2"\t"$3"\t"max;
       }' -
chr1    0      50     2
chr1    50     100    2
chr1    100    150    3
chr1    150    200    3

Ram · Answer 2 · 2015-10-30

1

Entering edit mode

9.5 years ago

Kamil ★ 2.3k

I believe you cannot solve this without writing your own algorithm. Here's my first attempt:

cat a.bed

chr1    0       100     2
chr1    50      150     1
chr1    100     200     3

(cut -f2 a.bed; cut -f3 a.bed) |\
    sort -nu |\
    perl -ne '
        BEGIN{$i=0} chomp;
        push @F, $_;
        if($i++){print "chr1\t$F[$i-2]\t$F[$i-1]\n"}
        ' > b.bed

bedtools intersect -a a.bed -b b.bed | sort -k2n -k3n -k4nr | perl -lane 'print unless $h{$F[0,1,2]}++'

Output:

chr1    0       50      2
chr1    50      100     2
chr1    100     150     3
chr1    150     200     3

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 9.5 years ago by Kamil ★ 2.3k

0

Entering edit mode

Thanks for your answer. But, precisely I want to avoid a custom solution. As I said I already have a custom solution in python that handles also different chromosome names:

import fileinput

prev = None
for line in fileinput.input():
    fields = line.strip().split('\t')
    if prev is None or prev[0] != fields[0]:
        prev = fields
        continue
    print "{}\t{}\t{}\t{}".format(fields[0],
                                  fields[1],
                                  prev[2],
                                  max(float(prev[3]),
                                      float(fields[3]))
                                  )
    prev = fields

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.5 years ago by Fidel ★ 2.0k

Ram · Answer 3 · 2015-10-30

0

Entering edit mode

9.5 years ago

Damian Kao 16k

You can try doing this:

Use clusterBed to cluster overlapping intervals

For each cluster get the elementary intervals. Basically just put all start,end coordinates into a list, make the list unique, then sort the list. So something like this:

for cluster in clusters:
 coords = set()
 for line in cluster:
     cols = line.strip().split()
     ref = cols[0]
     start, end, cov = map(int, cols[1:])
     coords.add(start)
     coords.add(end)
 coords = list(coords)
 coords.sort()
 elementary_intervals = []
 for i in range(len(coords) - 1):
     elementary_intervals.append((coords[i],coords[i+1]))

For each elementary interval, find overlaps with cluster members and find the maximum coverage (or whatever other operations). So something like this:

for e_interval in e_intervals:
 covs = []
 for member in cluster:
     if overlap(e_interval,member):
         covs.append(memberCov)
 max(covs) or sum(covs) or whatever

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 9.5 years ago by Damian Kao 16k

0

Entering edit mode

I don't think clusterBed is useful here. All bins are same size same distant and end up in the same cluster.