It's not 100% clear to me what you are doing, but let's start with two sorted BED files:
The operation that matches the wording of your question #1 has the following structure:
$ bedops --element-of -1 ChipA.bed ChipB.bed > AnswerQ1A.bed
The file AnswerQ1A.bed
is made up of elements from ChipA.bed
that overlap elements in ChipB.bed
by one or more bases. This preserves columnar data in ChipA.bed
.
Likewise, the following command gives different output:
$ bedops --element-of -1 ChipB.bed ChipA.bed > AnswerQ1B.bed
The file AnswerQ1B.bed
is made up of elements from ChipB.bed
that overlap elements in ChipA.bed
by one or more bases. This, too, preserves columnar data in ChipB.bed
.
What you might be interested is the set of overlapping elements common to both input sets:
$ bedops --element-of -1 ChipA.bed ChipB.bed > ChipAOverlappingB.bed
$ bedops --element-of -1 ChipB.bed ChipA.bed > ChipBOverlappingA.bed
$ bedops --everything ChipAOverlappingB.bed ChipBOverlappingA.bed > ChipAIfAndOnlyIfOverlappingB.bed
Basically, use --element-of
if you need more than the input sets' genomic regions, i.e. to preserve columnar data (IDs, scores, etc.). You might want to further filter your element sets by other criteria contained in these extra columns.
The following operation that matches your question #2 gives a different answer:
$ bedops --merge ChipA.bed ChipB.bed > AnswerQ2.bed
The file AnswerQ2.bed
is made up of just genomic regions that are calculated from the genomic regions in both ChipA.bed
and ChipB.bed
. This second answer does not preserve columnar data. The output will not tell you where the region originally came from (though simple set operations back on ChipA.bed
and ChipB.bed
answer that question).
You might want this instead, if you are looking to discover genes in a set of genomic coordinates, and if the columnar data in ChipA.bed
and ChipB.bed
is not important or relevant. This will depend on your experiment.
For instance, if you want to look at the merged regions only within +/- 20 kbase of a set of TSSs, then (assuming you have a sorted BED file containing TSSs) you could do the following:
$ bedops --merge --range 20000 TSSs.bed \
| bedops --element-of -1 AnswerQ2.bed - \
> mergedChipABRegionsThatOverlap20kbWindowsAroundTSSs.bed
This last file is a constrained set of merged regions from AnswerQ2.bed
- you might only be interested in focusing downstream searches in this subset of regions.
Hi Alex,
Thank you so much for your detailed answer! Basically, I understand your explanation. However, I have another question not clearly described in my first post.
Say, I want to make a Venn diagram based on three ChIP.bed files. Each area within the diagram with distinctive or overlapping area needs a number to fill in to describing how all the regions in all 3 bed files distribute. How can I achieve this? One problem I have is: I did the operation exactly the same as you explained in your first part of answer. The number of overlapping regions based on Chip A or on Chip B files are different (50 difference). How can I assign the overlapping number between ChipA and ChipB in the Venn diagram?
Thanks,
Xiaoyong
Given two sets A and B, you can form three disjoint sets:
To get the first set via operations with BEDOPS, assuming a one-base overlap criterion for set inclusion or exclusion:
To get the third set:
To get the second set:
Then do
wc -l
on each result to get element counts.You can use
diff
to verify that the multiset union of these three disjoint sets reconstructs the entire set of elements:This is very helpful! However, if the number of elements belonging to A in "UniqueAB" is different to the number of elements belonging to B in "UniqueAB", how can I assign the number to overlapping region in the Venn Diagram of A x B?
Thanks!
Unless I misunderstand your question, that would be the line count of the second set
UniqueAB
from my three example subsets.