I would like to merge the intervals that overlap ( line 2 and 3) and those that are closest (line 1 and 2) in addition to perform some operation basing in the other column ! for example I would like to merge the tree line above into one interval (chr1 1-750), sum basing on the info1 (15+15+16) and finally did the mean basing on the info line to (35+39+39)/3 the output I'd like will be as this :
chr1 1-750 46 37.66
I know that Bedtools can merge interval ( galaxy tool ! too )but accept only BED format that contain only 3 coloumn chr start and end !
Thanks in advance for your help !
Hello DK thanks for responding ! my table contain thousands of lines containing different chromosmes and severals intervals by chromosomes. What I mean by closest intervals is intervals that are separated by a given distance (D). In this case D is equal to 50 but It can be more imporant value (100 or 200). cheers
Hello Michael. It will take a lot of time if I want to explain the general context of my problem. In short I want to merge line 1 and line 2 because the gap is due to the fact that 1 and 2 are tow fragment of the same sequence (but diverge from my query sequence) that I want to consider as one !
Ok, but if we want to do the merging automatically, we need to have concrete criteria, overlap is one possibility, annotation could be another, or a combination. So you know a priori which regions to combine?
If I misspoke, I apologize, but my intention was to merge the tree line together ! this is possible with bedtools using Merge nearby (within a given distance bp) but I lose information of the other columns and I can not do aditional operation
What Michael is trying to tell is that no software will be able to find "nearby" regions by itself. So, you need a concrete criteria. For example, say your file is
Then, we can merge 600-800 and then you'd expect that to be merged with 501-555, right? So, would it be possible to then "always" merge with the immediate entry above the overlapping region (assuming that the data is coordinate sorted in chr, start and end fashion?
Also, in the above example if the second line was 450,555 instead of 501,555, then how would you want to merge?
So basically you just want to merge features that are within X distance of each other. I assume you will predetermine X somehow. I guess an easy way is to just artificially inflate your features by X base pairs on either side and check for overlaps.
I actually just posted something today on finding consensus/overlap sequences on my blog using lexical parsing: http://blog.nextgenetics.net/?e=41
Here is a modification of the code in my blog post. It will return overlapping by X distance groups of features:
#define your distance here
padding = 50
def sortCoord(d):
coords = []
i = 0
for coord in d:
coords.append(('s',coord[0] - padding,i))
coords.append(('e',coord[1] + padding))
i += 1
coords.sort(key = lambda x : x[0], reverse = True)
coords.sort(key = lambda x : x[1])
return coords
def clusterByOverlap(c):
count = 0
posA = 0
out = []
currentData = []
for pos in c:
if count == 0:
posA = pos[1]
if pos[0] == 's':
count += 1
currentData.append(pos[2])
if pos[0] == 'e':
count -=1
if count == 0:
out.append((posA + padding, pos[1] - padding, currentData))
currentData = []
return out
import sys
inFile = open(sys.argv[1],'r')
inFile.next()
refs = {}
for line in inFile:
data = line.strip().split('\t')
if not data[0] in refs.keys():
refs[data[0]] = []
refs[data[0]].append([int(x) for x in data[1:]])
for ref, data in refs.items():
sortedCoords = sortCoord(data)
clusters = clusterByOverlap(sortedCoords)
for cluster in clusters:
info1 = []
info2 = []
for i in cluster[2]:
info1.append(data[i][2])
info2.append(data[i][3])
print ref + "\t" + str(cluster[0]) + "\t" + str(cluster[1]) + "\t" + str(sum(info1)) + "\t" + str(sum(info2) / float(len(info2)))
I am assuming you have a tab-delimited input table. Save as yourName.py. Use by: python yourName.py yourFile
Assuming that you have your data sorted per chromosome then by initial position and you can define a fixed window overlap to combine regions separated by gaps (region 1 and 2 in your example have 50bp separation), you can use some code like this:
#!/usr/bin/perl
use strict;
use warnings;
my ($chr, $ini, $end, $val_1, $val_2, $sum, $ave);
my $last_chr = 'na';
my $last_ini = 'na';
my $last_end = 'na';
my $ext = 50; # extension of the overlap
my @val_1;
my @val_2;
while (<>) {
next if(m/^Chromsome/); # skip header
chomp;
($chr, $ini, $end, $val_1, $val_2) = split (/\s+/, $_);
if ($last_chr ne $chr) { # first iteration and each chromosome change
unless ($last_chr eq 'na') {
$sum = doSum(@val_1);
$ave = doAve(@val_2);
print "$chr\t$last_ini-$last_end\t$sum\t$ave\n";
}
$last_chr = $chr;
$last_ini = $ini;
$last_end = $end;
@val_1 = ( $val_1 );
@val_2 = ( $val_2 );
}
elsif ($last_ini - $ext <= $ini and $last_end + $ext <= $end and $last_end + $ext >= $ini) { # overlapped segments
$last_end = $end;
push @val_1, $val_1;
push @val_2, $val_2;
}
else { # no overlap case
$sum = doSum(@val_1);
$ave = doAve(@val_2);
print "$chr\t$last_ini-$last_end\t$sum\t$ave\n";
# reset variables
$last_ini = $ini;
$last_end = $end;
@val_1 = ( $val_1 );
@val_2 = ( $val_2 );
}
}
# last one
$sum = doSum(@val_1);
$ave = doAve(@val_2);
print "$chr\t$last_ini-$last_end\t$sum\t$ave\n";
sub doSum {
my $sum = 0;
foreach my $x (@_) {
$sum += $x;
}
return $sum;
}
sub doAve {
my $sum = 0;
my $n = 0;
foreach my $x (@_) {
$sum += $x;
$n++;
}
return $sum / $n;
}
How big is your file with this table? You want to merge overlaps first and then merge closest intervals? How do you define closest?
Hello DK thanks for responding ! my table contain thousands of lines containing different chromosmes and severals intervals by chromosomes. What I mean by closest intervals is intervals that are separated by a given distance (D). In this case D is equal to 50 but It can be more imporant value (100 or 200). cheers
It can be some value makes it ill-posed.
Yeah, without the score it would be easier, e.g. Using IRanges, it is still possible using normalize to join the ranges and then get the scores.
I don't get why all ranges in your example should be merged into one, 2,3 ok, but there's a 50 bp gap between 1 and 2.
Hello Michael. It will take a lot of time if I want to explain the general context of my problem. In short I want to merge line 1 and line 2 because the gap is due to the fact that 1 and 2 are tow fragment of the same sequence (but diverge from my query sequence) that I want to consider as one !
Ok, but if we want to do the merging automatically, we need to have concrete criteria, overlap is one possibility, annotation could be another, or a combination. So you know a priori which regions to combine?
If I misspoke, I apologize, but my intention was to merge the tree line together ! this is possible with bedtools using Merge nearby (within a given distance bp) but I lose information of the other columns and I can not do aditional operation
What Michael is trying to tell is that no software will be able to find "nearby" regions by itself. So, you need a concrete criteria. For example, say your file is
Then, we can merge 600-800 and then you'd expect that to be merged with 501-555, right? So, would it be possible to then "always" merge with the immediate entry above the overlapping region (assuming that the data is coordinate sorted in chr, start and end fashion?
Also, in the above example if the second line was 450,555 instead of 501,555, then how would you want to merge?