The bedops
tool in the BEDOPS suite will find overlaps between multiple (two or more) BED files. You can specify custom overlap criteria or use the default, which is one base of overlap. Please see documentation for the --intersect
and --element-of
operators for more detail. You could invoke a set operation from R with a system
call.
1) To find overlaps with your criteria, first add your sample number to the ID field of the BED file. For example, for sample 5 of 18, the BED file Sample5.bed
might look like:
chrN 1234 5678 peak-8724-sample-5
...
You could do this with awk
, Perl, whatever.
Note that if your elements already have IDs that are unique to the sample, then you don't need to do this step. This step just ensures that you can retrieve the element information later on.
Then collate the samples to build a "master list":
$ bedops --everything Sample1.bed Sample2.bed ... Sample18.bed > MasterList.bed
2) Next, map the master list against itself with bedmap
. Use the --count
operator to give the number of peaks that the BED element overlaps, and the --echo-map-id
operator to show the IDs of the peaks that overlap:
$ bedmap --echo --count --echo-map-id MasterList.bed MasterList.bed
The output will show you the number of "hits" or overlaps for each element, along with the IDs associated with those hits, e.g.:
chrN 1234 5678 peak-8724-sample-5 | 2 | peak-8724-sample-5;peak-8724-sample-12
chrN 2563 3125 peak-6258-sample-12 | 2 | peak-8724-sample-5;peak-8724-sample-12
chrN 7746 8913 peak-5923-sample-1 | 1 | peak-5923-sample-1
...
In other words, element chrN:2563-3125
from sample 12 overlaps with itself and one other element in your master list, which comes from sample 5.
3) To speed up step 4, filter this result for elements which have hit counts greater than or equal to 14 with a Perl or awk
script. Because you pre-processed the ID field, you know which element comes from which sample, which satisfies your criterion. Note that this gives you peaks that may overlap within the same sample, so this isn't a guarantee that you have an overlap with 14 or more samples.
4) Therefore, finally, you would want to parse the list of overlapping elements with, for example, a Perl or awk
script to get a unique number of samples for that element. For example, parsing the mapped ID results for the chrN:2563-3125
element shows itself (of course) and a peak from sample 5, so that peak gets a "true" count of 2 samples, and so on. Filter this final list of sample counts for elements with a score of 14 or greater, and there's your answer.
This is definitely more work than other solutions, but I think this may give you more useful information, like the IDs of specific peaks as well as their overlap characteristics within a sample and across samples. Using bedops --merge
discards ID information (not sure what mergeBed
does here). I'd probably want to first make sure that overlaps between peaks in the same sample are taken into account when counting. Hopefully this is useful to you, in any case.
Hi Alex, thanks for your reply. I'm not sure though if BEDOPS tools allows me to identify regiions that overlap in "at least 14 out of 18" sample? Maybe I'm missiong something? Looks like a great tool but not sure if there is any way to specify that? Thanks again.
If you want to do this outside of R (see my answer), I think your best tool would be BEDTools, as two people above have suggested.
This will give you a count of how many merged overlapping peaks there are.
There is no direct way to do this with
bedops
only, but I'll edit my answer to suggest a mix ofbedops
andbedmap
that might work for you.Hi, Alex:
Thanks for the post. I just followed your post to find hit counts great or equal to 2 out of 3 samples. Can you please also post the awk script for the step 3 and step 4? Thanks very much.
best
HY