I have a BED file with some elements that are located in overlapped intervals. I added unique IDs, and length in the 4th and 5th columns, respectively.
cat sorted_intervals.bed
[....]
NC_007117.7 52869911 52875049 NC_007117.7.27 5138
NC_007117.7 52869911 52870819 NC_007117.7.28 908
NC_007117.7 52869929 52870807 NC_007117.7.29 878
NC_007117.7 52869932 52870798 NC_007117.7.30 866
NC_007117.7 52869932 52870795 NC_007117.7.31 863
NC_007117.7 52869932 52870780 NC_007117.7.32 848
NC_007117.7 52869938 52870804 NC_007117.7.33 866
NC_007117.7 52869956 52870795 NC_007117.7.34 839
NC_007117.7 52874159 52875088 NC_007117.7.35 929
NC_007117.7 52874159 52875088 NC_007117.7.36 929
NC_007117.7 52874159 52875088 NC_007117.7.37 929
NC_007117.7 52874159 52875088 NC_007117.7.38 929
NC_007117.7 52874159 52875082 NC_007117.7.39 923
NC_007117.7 52874159 52875088 NC_007117.7.40 929
NC_007117.7 52874159 52875079 NC_007117.7.41 920
NC_007117.7 52874159 52875088 NC_007117.7.42 929
NC_007117.7 52874159 52875088 NC_007117.7.43 929
NC_007117.7 52874162 52875085 NC_007117.7.44 923
NC_007117.7 52874192 52875052 NC_007117.7.45 860
NC_007117.7 52874192 52875079 NC_007117.7.46 887
NC_007117.7 52874192 52875052 NC_007117.7.47 860
[....]
My goal is to find which elements are located in overlapping genomic regions and then select the one that spans the longest length. I also want to keep those elements that do not overlap with any others. To do so, in a previous analysis, I used the following command:
while read LINE ; do grep -wE "$LINE" sorted_intervals.bed | sort -k5,5nr | head -1 ; done < <(bedtools merge -i sorted_intervals.bed -c 4 -o count,collapse | awk '{print $5}' | sed 's/,/|/g')
which successfully outputs the element NC_007117.7.27 for the above example.
Now I want to repeat the same analysis and set up a cutoff value (let's say 10%) for the overlapping regions. That is, if two elements have overlap >=10%, keep the longest one; but if two elements have overlap <10%, keep both of them. Same as before, I also would like to keep elements that do not overlap with any other elements.
I have found similar questions, such as:
Bed file: merging intervals that overlap a certain percentage
multiple bed - merge regions IF overlapping more than xx percent of size
However, I am still not very sure how to do it, because when I run the following bedops command it gives me three different overlaps for the above example instead of just one as in bedtools merge. I also tried the option --fraction-either and it produced the same results. I think I might be misunderstanding something.
bedmap --count --echo-map-range --echo-map-id --fraction-both 0.1 --delim '\t' sorted_intervals.bed | uniq | grep "NC_007117.7.27"
21 NC_007117.7 52869911 52875088 NC_007117.7.28;NC_007117.7.27;NC_007117.7.29;NC_007117.7.32;NC_007117.7.31;NC_007117.7.30;NC_007117.7.33;NC_007117.7.34;NC_007117.7.41;NC_007117.7.39;NC_007117.7.35;NC_007117.7.36;NC_007117.7.37;NC_007117.7.38;NC_007117.7.40;NC_007117.7.42;NC_007117.7.43;NC_007117.7.44;NC_007117.7.45;NC_007117.7.47;NC_007117.7.46
8 NC_007117.7 52869911 52875049 NC_007117.7.28;NC_007117.7.27;NC_007117.7.29;NC_007117.7.32;NC_007117.7.31;NC_007117.7.30;NC_007117.7.33;NC_007117.7.34
14 NC_007117.7 52869911 52875088 NC_007117.7.27;NC_007117.7.41;NC_007117.7.39;NC_007117.7.35;NC_007117.7.36;NC_007117.7.37;NC_007117.7.38;NC_007117.7.40;NC_007117.7.42;NC_007117.7.43;NC_007117.7.44;NC_007117.7.45;NC_007117.7.47;NC_007117.7.46
Does anyone know what would be the best way to proceed?
Many thanks,
EDIT: I don't think I need the two elements to have a reciprocal overlapping percentage of 10%, just the smallest one among the two being compared. That is, if the smallest of the two elements has an overlap that is less than 10%, then both elements should be kept.
Many thanks for all your effort, Alex. I really appreciate it and I think your script would be extremely useful for my work. However, I am not sure if I understand why element "A" (
chr1:120-200
) is not picked when the threshold is set to0.5
. Based on my understanding, I think that in this situation, element "A" should be picked regardless of the threshold because it constitutes the longest element (the "parent") in the overlap region (**but see note below), and the threshold should be applied to children overlaps of element "A" (but not to element "A" itself). Am I understanding it correctly? I am so sorry for the trouble.**Note: It would be very convenient if a column with ranks could be used to solve ties when choosing the parent, as previously discussed. That is, element "A" would be the one picked unless there are two or more elements with the same length as "A" in the overlap region, in which case the one with the lowest rank number should be picked (or one at random should be picked in the case that two or more tied parent elements have the same ranks).