Hello,
I have a set of 8 .bed files and I want to extract those (sub-)intervals that are present in a certain fraction of these files, e.g., intervals that occur in 6 of these 8 files. Does anyone know an efficient way/tool to do this?
Markus
Hello,
I have a set of 8 .bed files and I want to extract those (sub-)intervals that are present in a certain fraction of these files, e.g., intervals that occur in 6 of these 8 files. Does anyone know an efficient way/tool to do this?
Markus
Here is an efficient solution with BEDOPS, which avoids approaches with other tools that require M^N pairwise comparisons, for M elements and N sets.
Say you have N files (your specific example uses 8 files, but this shows a generic solution for N files). You want all elements from N files, which overlap at least F (6) elements from each of all N sets.
Label each BED file to note where it is coming from:
$ for inFn in `find . -name "*.bed"`; do echo $inFn; outFn=${inFn%%.*}.labeled.bed; prefix=${inFn%%.*}; echo ${prefix} | cut -f1-3 ${inFn} - | sort-bed - > $outFn; done
Set shell variables with the number of sets and your overlap threshold, e.g.:
$ export N=8
$ export F=6
Then do the set operations on the labeled files with BEDOPS bedops
and bedmap
:
$ bedops --everything file1.labeled.bed file2.labeled.bed ... fileN.labeled.bed \
| bedmap --echo --echo-map-id-uniq --fraction-map 1 --delim '\t' - \
| cut -f1-3,5 - \
| awk -vF=${F} '{ n = split($4, a, ";"); if (n >= F) { print $0; } }' \
> intervals_overlapping_at_least_F_inputs.bed
The file intervals_overlapping_at_least_F_inputs.bed
contains a list of intervals that are contained entirely within at least F sets, with an ID value indicating its origin set.
uniq -c
collect the base positions occuring more than 6 times for F in *.bed ; do awk '{B=int($2);E=int($3);for(i=B;i<E;++i) {printf("%s\t%d\t%d\n",$1,i,i+1);}}' $F | sort | uniq ; done | sort | uniq -c | awk '{if($1>=6) printf("%s\t%s\t%s\n",$2,$3,$4);}' | bedtools merge
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Hello Pierre, your idea sounds good, I will try it. And Alex: bedops seem to be an interesting toolbox, I will have a closer look at it. Thank you!