Question

get intervals of a fraction of bed-files

0

Entering edit mode

7.4 years ago

markus.glass ▴ 40

Hello,

I have a set of 8 .bed files and I want to extract those (sub-)intervals that are present in a certain fraction of these files, e.g., intervals that occur in 6 of these 8 files. Does anyone know an efficient way/tool to do this?

Markus

ChIP-Seq • 2.0k views

ADD COMMENT • link updated 7.4 years ago by Alex Reynolds 36k • written 7.4 years ago by markus.glass ▴ 40

score 1 · Answer 1 · 2017-06-13

Here is an efficient solution with BEDOPS, which avoids approaches with other tools that require M^N pairwise comparisons, for M elements and N sets.

Say you have N files (your specific example uses 8 files, but this shows a generic solution for N files). You want all elements from N files, which overlap at least F (6) elements from each of all N sets.

Label each BED file to note where it is coming from:

$ for inFn in `find . -name "*.bed"`; do echo $inFn; outFn=${inFn%%.*}.labeled.bed; prefix=${inFn%%.*}; echo ${prefix} | cut -f1-3 ${inFn} - | sort-bed - > $outFn; done

Set shell variables with the number of sets and your overlap threshold, e.g.:

$ export N=8
$ export F=6

Then do the set operations on the labeled files with BEDOPS bedops and bedmap:

$ bedops --everything file1.labeled.bed file2.labeled.bed ... fileN.labeled.bed \
    | bedmap --echo --echo-map-id-uniq --fraction-map 1 --delim '\t' - \
    | cut -f1-3,5 - \
    | awk -vF=${F} '{ n = split($4, a, ";"); if (n >= F) { print $0; } }' \
    > intervals_overlapping_at_least_F_inputs.bed

The file intervals_overlapping_at_least_F_inputs.bed contains a list of intervals that are contained entirely within at least F sets, with an ID value indicating its origin set.

score 0 · Answer 2 · 2017-06-13

for each bed : convert your bed to uniq-base-bed and sort | uniq
uniq -ccollect the base positions occuring more than 6 times
convert back to bed
merge with bedtools

 for F in *.bed ; do awk '{B=int($2);E=int($3);for(i=B;i<E;++i) {printf("%s\t%d\t%d\n",$1,i,i+1);}}' $F | sort | uniq ; done | sort | uniq -c | awk '{if($1>=6) printf("%s\t%s\t%s\n",$2,$3,$4);}' | bedtools merge