Question

How Can I Merge Intervals ?

0

Entering edit mode

12.3 years ago

Mchimich ▴ 320

Hello everybody, I should be grateful if you would kindly help me de fix my problem. I have a table like that :

Chromosome   start   end    info1    info2   
chr01        1       100    15       35         
chr01        150     300    15       39      
chr01        299     750    16       39

I would like to merge the intervals that overlap ( line 2 and 3) and those that are closest (line 1 and 2) in addition to perform some operation basing in the other column ! for example I would like to merge the tree line above into one interval (chr1 1-750), sum basing on the info1 (15+15+16) and finally did the mean basing on the info line to (35+39+39)/3 the output I'd like will be as this :

chr1 1-750  46  37.66

I know that Bedtools can merge interval ( galaxy tool ! too )but accept only BED format that contain only 3 coloumn chr start and end !
Thanks in advance for your help !

merge bedtools • 8.0k views

ADD COMMENT • link updated 12.3 years ago by JC 13k • written 12.3 years ago by Mchimich ▴ 320

0

Entering edit mode

How big is your file with this table? You want to merge overlaps first and then merge closest intervals? How do you define closest?

ADD REPLY • link 12.3 years ago by Damian Kao 16k

0

Entering edit mode

Hello DK thanks for responding ! my table contain thousands of lines containing different chromosmes and severals intervals by chromosomes. What I mean by closest intervals is intervals that are separated by a given distance (D). In this case D is equal to 50 but It can be more imporant value (100 or 200). cheers

ADD REPLY • link 12.3 years ago by Mchimich ▴ 320

0

Entering edit mode

It can be some value makes it ill-posed.

ADD REPLY • link 12.3 years ago by Arun 2.4k

0

Entering edit mode

Yeah, without the score it would be easier, e.g. Using IRanges, it is still possible using normalize to join the ranges and then get the scores.

ADD REPLY • link 12.3 years ago by Michael 55k

0

Entering edit mode

I don't get why all ranges in your example should be merged into one, 2,3 ok, but there's a 50 bp gap between 1 and 2.

ADD REPLY • link 12.3 years ago by Michael 55k

0

Entering edit mode

Hello Michael. It will take a lot of time if I want to explain the general context of my problem. In short I want to merge line 1 and line 2 because the gap is due to the fact that 1 and 2 are tow fragment of the same sequence (but diverge from my query sequence) that I want to consider as one !

ADD REPLY • link 12.3 years ago by Mchimich ▴ 320

0

Entering edit mode

Ok, but if we want to do the merging automatically, we need to have concrete criteria, overlap is one possibility, annotation could be another, or a combination. So you know a priori which regions to combine?

ADD REPLY • link 12.3 years ago by Michael 55k

0

Entering edit mode

If I misspoke, I apologize, but my intention was to merge the tree line together ! this is possible with bedtools using Merge nearby (within a given distance bp) but I lose information of the other columns and I can not do aditional operation

ADD REPLY • link 12.3 years ago by Mchimich ▴ 320

0

Entering edit mode

What Michael is trying to tell is that no software will be able to find "nearby" regions by itself. So, you need a concrete criteria. For example, say your file is

chromosome,start,end,info1,info2
chr1, 1, 500,  20, 10
chr1, 501, 555, 10, 20
chr1, 600, 699, 25, 10
chr1, 680, 800, 45, 55

Then, we can merge 600-800 and then you'd expect that to be merged with 501-555, right? So, would it be possible to then "always" merge with the immediate entry above the overlapping region (assuming that the data is coordinate sorted in chr, start and end fashion?

Also, in the above example if the second line was 450,555 instead of 501,555, then how would you want to merge?

ADD REPLY • link 12.3 years ago by Arun 2.4k

score 2 · Answer 1 · 2012-08-07

So basically you just want to merge features that are within X distance of each other. I assume you will predetermine X somehow. I guess an easy way is to just artificially inflate your features by X base pairs on either side and check for overlaps.

I actually just posted something today on finding consensus/overlap sequences on my blog using lexical parsing: http://blog.nextgenetics.net/?e=41

Here is a modification of the code in my blog post. It will return overlapping by X distance groups of features:

#define your distance here
padding = 50

def sortCoord(d):
    coords = []
    i = 0
    for coord in d:
        coords.append(('s',coord[0] - padding,i))
        coords.append(('e',coord[1] + padding))
        i += 1
    coords.sort(key = lambda x : x[0], reverse = True)
    coords.sort(key = lambda x : x[1])
    return coords

def clusterByOverlap(c):
    count = 0
    posA = 0
    out = []
    currentData = []
    for pos in c:
        if count == 0:
            posA = pos[1]
        if pos[0] == 's':
            count += 1
            currentData.append(pos[2])
        if pos[0] == 'e':
            count -=1
        if count == 0:
            out.append((posA + padding, pos[1] - padding, currentData))
            currentData = []

    return out

import sys
inFile = open(sys.argv[1],'r')
inFile.next()
refs = {}
for line in inFile:
    data = line.strip().split('\t')
    if not data[0] in refs.keys():
        refs[data[0]] = []
    refs[data[0]].append([int(x) for x in data[1:]])

for ref, data in refs.items():
    sortedCoords = sortCoord(data)
    clusters = clusterByOverlap(sortedCoords)
    for cluster in clusters:
        info1 = []
        info2 = []
        for i in cluster[2]:
            info1.append(data[i][2])
            info2.append(data[i][3])
        print ref + "\t" + str(cluster[0]) + "\t" + str(cluster[1]) + "\t" + str(sum(info1)) + "\t" + str(sum(info2) / float(len(info2)))

I am assuming you have a tab-delimited input table. Save as yourName.py. Use by: python yourName.py yourFile

score 1 · Answer 2 · 2012-08-07

Assuming that you have your data sorted per chromosome then by initial position and you can define a fixed window overlap to combine regions separated by gaps (region 1 and 2 in your example have 50bp separation), you can use some code like this:

#!/usr/bin/perl

use strict;
use warnings;

my ($chr, $ini, $end, $val_1, $val_2, $sum, $ave);
my $last_chr = 'na';
my $last_ini = 'na';
my $last_end = 'na';
my $ext      =   50; # extension of the overlap
my @val_1;
my @val_2;

while (<>) {
    next if(m/^Chromsome/); # skip header
    chomp;
    ($chr, $ini, $end, $val_1, $val_2) = split (/\s+/, $_);
    if ($last_chr ne $chr) { # first iteration and each chromosome change
        unless ($last_chr eq 'na') {
            $sum = doSum(@val_1);
            $ave = doAve(@val_2);
            print "$chr\t$last_ini-$last_end\t$sum\t$ave\n";
        }
        $last_chr = $chr;
        $last_ini = $ini;
        $last_end = $end;
        @val_1    = ( $val_1 );
        @val_2    = ( $val_2 );
    }
    elsif ($last_ini - $ext <= $ini and $last_end + $ext <= $end and $last_end + $ext >= $ini) { # overlapped segments
        $last_end = $end;
        push @val_1, $val_1;
        push @val_2, $val_2;
    }
    else { # no overlap case
        $sum = doSum(@val_1);
        $ave = doAve(@val_2);
        print "$chr\t$last_ini-$last_end\t$sum\t$ave\n";
        # reset variables
        $last_ini = $ini;
        $last_end = $end;
        @val_1    = ( $val_1 );
        @val_2    = ( $val_2 );
    }
}
# last one
$sum = doSum(@val_1);
$ave = doAve(@val_2);
print "$chr\t$last_ini-$last_end\t$sum\t$ave\n";

sub doSum {
    my $sum = 0;
    foreach my $x (@_) { 
        $sum += $x; 
    }
    return $sum;
}

sub doAve {
    my $sum = 0;
    my $n   = 0;
    foreach my $x (@_) { 
        $sum += $x;
        $n++;
    }
    return $sum / $n;
}

Example output with your sample ranges:

perl mergeData.pl < sample 
chr01    1-750    46    37.6666666666667

score 1 · Answer 3 · 2014-04-15

1

Entering edit mode

10.6 years ago

Alex Reynolds 36k

Here's a one-liner using core BEDOPS tools, taking advantage of standard input and output streams:

$ sort-bed inputs.unsorted.bed > inputs.bed
$ bedops --range 51 --merge inputs.bed \
    | bedmap --echo --sum --mean --delim '\t' - inputs.bed \
    | awk '{ print $1"\t"$2"-"$3"\t"$4"\t"$5 }' -

ADD COMMENT • link 10.6 years ago by Alex Reynolds 36k