Here is an efficient solution with BEDOPS, which avoids approaches with other tools that require M^N pairwise comparisons, for M elements and N sets.
Say you have N files (your specific example uses 6 files, but this shows the generic solution for N files).
You appear to want all elements from the N files, which overlap at least one element from each of all other N-1 sets.
First, convert any non-BED files to sorted BED with the appropriate conversion tools, e.g.: bam2bed
from BEDOPS:
$ for inFn in `find . -name "*.bam"`; do echo $inFn; outFn=$inFn.bed; bam2bed < $infn > $outFn; done
Second, label each BED file to note where it is coming from:
$ for inFn in `find . -name "*.bam.bed"`; do echo $inFn; outFn=${inFn%%.*}.labeled.bed; prefix=${inFn%%.*}; echo ${prefix} | cut -f1-3 ${inFn} - > $outFn; done
Set a shell variable with the number N
set to your overlap threshold, e.g.:
$ export N=6
Then do the set operations on the labeled files with BEDOPS bedops
and bedmap
:
$ bedops --everything file1.labeled.bed file2.labeled.bed ... fileN.labeled.bed \
| bedmap --echo --echo-map-id-uniq --delim '\t' - \
| cut -f1-3,5 - \
| awk -vN=${N} '{ n = split($4, a, ";"); if (n==N) { print $0; } }' \
> intervals_overlapping_all_N_inputs.bed
The file intervals_overlapping_all_N_inputs.bed
contains a list of intervals that overlap all N sets.
This may be enough, but perhaps you want more detail. Now you can go back and filter all the converted BAM files for those elements:
$ for inFn in `find . -name "*.bam.bed"`; do echo $inFn; $outFn=${inFn%%.*}.filtered.bed; bedops --element-of -1 $inFn intervals_overlapping_all_N_inputs.bed > $outFn; done
Each *.filtered.bed
will contain BED-formatted reads specific to the parent BAM file, which overlap all N sets.
This looks like a lot of work, but this will run much faster than a pairwise approach with other tools.
Unique would be a set element present in only one of N inputs. Non-unique would mean an element present or overlapping in two or more of N inputs. Which do you mean?
An element overlapping 6 out of 6 Inputs.
The simplest solution would be multiple iterations of BEDTools 'intersect' with each of your six BAMs (intersect bam1 and bam2, intersect that output with bam3, etc).
BTW, you should use the 'Add Comment' button instead of 'Answer' when responding.
Consider the following example:
A intersects B, and B intersects C. However, A does not intersect C.
If you want intervals overlapping across all sets, you can do pairwise comparison tests between all the input sets, which does not scale well and requires storing intermediate files.
Or you can use the faster approach I describe, which uses sorted BED files with
bedops
andbedmap
, which does not require pairwise comparisons or intermediate files.