I have a couple of bed files with regions (peaks of ChIPseq), and I would like to create a unique file with merge regions, where overlapping regions are fused IF the overlap cover more than xx % (in my case 60%) of one of the 2 regions (not necessary both).
I know how to do that with 2 files, however I have more (at least 4). I could do recursive and pipes to do it pairwise multiple times (with bedops or bedtools), but I was wondering if there would not be a cleaner way. I can use bedops to merge everything that overlap by 1bp for my bed files at once, but I do not find a way to control on the overlap fraction...
It might look a bit overwhelming, but it is really quite straightforward if we walk through this line by line.
Take the multiset union of all elements from all N input files with bedops --everything. Note that this is not the same as a --merge operation, which instead makes contiguous regions from all overlapping input regions. Pipe this unioned set to bedmap.
For each element from all files, use bedmap --echo-map to print any other elements from the multiset union that overlap that first element by 60% or more, using --fraction-both 0.6. The result will be a semi-colon delimited string that contains one or more overlapping elements. Note that this result will always include the original element, because it overlaps itself by 100%. Pipe this result to the awk statement.
The awk command splits the string on the semi-colon delimiter and prints out the string if there are two or more overlapping elements. We do this filter step, because we are not interested in any element overlapping itself and only itself, but only multiple (2+) elements that meet the 60% or greater overlap criterion. We pipe this string to sed.
The sed statement replaces all semi-colons with newline characters. This creates BED-formatted output that we pipe to sort-bed.
The sort-bed tool takes the overlapping elements and puts them in lexicographically-sorted order. We pipe this to uniq to remove duplicate elements.
The file answer.bed is a sorted BED file that contains all unique elements from file1.bed through fileN.bed, where there is mutual 60% or greater overlap.
Note that this is a highly efficient solution, compared with O(2^N) pairwise comparisons required by other approaches. Further, we make use of Unix standard input and output streaming (note the hyphen at the end of each command, denoting standard input) to avoid the cost of making unnecessary intermediate files; writing to disk is expensive. The only thing written to disk is the final result.
I actually did something like that previously, but:
here I will get all the regions that overlap an other region by more than 60% (actually I do not want to require 60% for both regions, but using --fraction-ref should do the job).
Now if I have A overlap B with >60%, C overlap D with >60%, but B overlap C with <60% for both. This method should still give me back A, B ,C and D, because for all of them one overlap of >60% can be found. Now if I want to merge my regions if they significantly overlap, if I do a merge of the results, I will merge them all, because all overlap, whereas I only want to merge A with B and C with D (and to keep 2 regions AB and CD that overlap a Little bit). That is what is where I am stuck.
There might be the problem of A overlap B significantly, B overlap C significantly, but A and C do not. In this case I would actually want to merge them all together (personal choice).
I guess what I want is, after the bedmap, take for each like the start of the first element, and the end of the last one (then I will have some overlapping feature, like in the example with A, B and C, a line with AB, a line with BAC, and a line with BC ; but I think their should always be one feature that overlap completely the others, so I can do another bedmap --fraction-ref 1). There can be more than 2, so it won't always be the same column number. I am not very good with awk but I guess I can do that.
and to keep 2 regions AB and CD that overlap a Little bit
I guess I don't understand how you intend to keep AB and CD separate after a merge operation, when they overlap a little bit. Can you be more explicit about your "inclusion" and "exclusion" criteria for all elements, if that makes sense? Maybe draw out what you are trying to do, and include numbers for overlap criteria.
If your criteria is simply 60% minimum overlap for membership in the output file, then what I provided should work. However, if you have two different minimum overlap criteria, you might do two runs of my script with different fraction parameters. Given the two resulting answer.bed files, you might again use bedops on both files to determine which elements are not common to both, using that result to apply custom merge logic with awk or similar.
If you are looking at 3 separate bed files representing 3 replicates for instance, how would you go about limiting the printed result with this method to only elements present in all 3 of the replicates (at some % overlap)? Thanks.
I think you would just modify the awk statement to filter explicitly for those elements wherein there are exactly three elements that meet the --fraction-both 0.6 criterion:
I input 6 bed files for an overlap that will give me intervals where at least 6 overlap at .1, however sometimes there is the instance where there are two adjoining intervals within one of the files that overlap the other files getting the total number to 6 when really only 5 of the files showed overlaps. Wondering how I can adjust it so it shows only 6/6 overlaps across the files. Thanks.
I actually did something like that previously, but:
here I will get all the regions that overlap an other region by more than 60% (actually I do not want to require 60% for both regions, but using
--fraction-ref
should do the job).Now if I have A overlap B with >60%, C overlap D with >60%, but B overlap C with <60% for both. This method should still give me back A, B ,C and D, because for all of them one overlap of >60% can be found. Now if I want to merge my regions if they significantly overlap, if I do a merge of the results, I will merge them all, because all overlap, whereas I only want to merge A with B and C with D (and to keep 2 regions AB and CD that overlap a Little bit). That is what is where I am stuck.
There might be the problem of A overlap B significantly, B overlap C significantly, but A and C do not. In this case I would actually want to merge them all together (personal choice).
I guess what I want is, after the bedmap, take for each like the start of the first element, and the end of the last one (then I will have some overlapping feature, like in the example with A, B and C, a line with AB, a line with BAC, and a line with BC ; but I think their should always be one feature that overlap completely the others, so I can do another
bedmap --fraction-ref 1
). There can be more than 2, so it won't always be the same column number. I am not very good with awk but I guess I can do that.I guess I don't understand how you intend to keep AB and CD separate after a merge operation, when they overlap a little bit. Can you be more explicit about your "inclusion" and "exclusion" criteria for all elements, if that makes sense? Maybe draw out what you are trying to do, and include numbers for overlap criteria.
If your criteria is simply 60% minimum overlap for membership in the output file, then what I provided should work. However, if you have two different minimum overlap criteria, you might do two runs of my script with different fraction parameters. Given the two resulting
answer.bed
files, you might again usebedops
on both files to determine which elements are not common to both, using that result to apply custom merge logic withawk
or similar.If you are looking at 3 separate bed files representing 3 replicates for instance, how would you go about limiting the printed result with this method to only elements present in all 3 of the replicates (at some % overlap)? Thanks.
I think you would just modify the
awk
statement to filter explicitly for those elements wherein there are exactly three elements that meet the--fraction-both 0.6
criterion:Does that make sense?
Absolutely, exactly what I needed, thanks!
So I have the following code:
I input 6 bed files for an overlap that will give me intervals where at least 6 overlap at .1, however sometimes there is the instance where there are two adjoining intervals within one of the files that overlap the other files getting the total number to 6 when really only 5 of the files showed overlaps. Wondering how I can adjust it so it shows only 6/6 overlaps across the files. Thanks.
I'm trying to understand the problem. Can you give me a test input that fails? Thanks!