I have a directory with 50 bed files. I want to get all regions in common in at least 50% of my files. How would I get that? Using `bedopts` I can intersect all, but I want at least those existing regions in 25 files.
I have a directory with 50 bed files. I want to get all regions in common in at least 50% of my files. How would I get that? Using `bedopts` I can intersect all, but I want at least those existing regions in 25 files.
This isn't counting the fraction of overlap between intervals, so you can't directly use available options in bedops or bedmap, but you can use some ID tricks to uniquely associate an interval with the BED file it comes from, and use that ID information to do counting and thresholding.
Here is one way to do so:
Label your input BED files so that their IDs uniquely identify their intervals, e.g., for BED files called 1.bed
, 2.bed
, etc.:
$ for idx in `seq 1 50`; do cut -f1-3 $idx.bed | awk -vidx=$idx '{ print $0"\t"idx; }' > $idx.id.bed; done
You would modify this loop depending on how you name your files. Basically, however you do this, the goal is to print the interval in the first three columns, and use the number of files you have as a unique identifier in the fourth column. You would print 1
in the fourth column of the first file 1.id.bed
, and 5
in the fourth column of the fifth file 5.id.bed
, and so on.
I use numbers here, but you can use any unique identifier you want, instead. Cell type. Experiment or sample name, etc. The key part to making this work is that the chosen identifier is unique to that input file and that file only.
Take the union of all these ID-tagged files with BEDOPS bedops --everything and pipe the result to BEDOPS bedmap with the --echo-map-id-uniq operation:
$ bedops --everything *.id.bed | bedmap --echo --echo-map-id-uniq --delim '\t' - > all.bed
What you have at this point, for every interval from every input BED file, is a file called all.bed
, which shows that interval in the first three columns, the file it comes from in the fourth column, and a fifth column that is a semi-colon delimited string of every unique ID value that overlaps that interval, e.g.:
...
chrN 123 456 5 2;5;6;7;11;12;...;46;48
chrN 234 567 5 5;6;7;8
...
Because we use --echo-map-id-uniq, this string contains unique identifiers - no duplicates. This uniqueness property is how we can count how many of the original input files overlap.
To get an interval that has regions in common with at least 50% of your 50 input files - 25 files other than the original file the interval comes from - you can use awk to split
this ID string in the fifth column by the semi-colon delimiter,* *counting how many unique IDs you find and printing the interval, if the threshold of 26 ID values is met (25 IDs other than the ID representing the original interval's source):
$ awk -vthreshold=26 '{ \
n_ids = split($5, ids, ";"); \
if (n_ids >= threshold) { \
print $0; \
} \
}' all.bed > thresholded.bed
A shorter, cleaner awk statement with the same functionality would be:
$ awk -vthreshold=26 '(split($5, ids, ";") >= threshold)' all.bed > thresholded.bed
Bonus: If you need to get back any additional columns of the original interval, you can just do another bedmap operation on thresholded.bed
on a union of the 50 input BED files.
$ bedops --everything 1.bed 2.bed ... 50.bed > union.bed
$ bedmap --echo --exact union.bed thresholded.bed > union-thresholded.bed
Note: Input files should be sorted before use with BEDOPS tools, if the sort state is unknown or different. BEDOPS includes a tool called sort-bed that sorts BED files faster than GNU sort or alternatives.
$ sort-bed < foo.unsorted.bed > foo.sorted.bed
You can use whatever you like here, so long as the BEDOPS sort criteria are met before passing data to bedops and bedmap.
Have a look at this post: Bedtools Compare Multiple Bed Files?
Since this thread is active and I work with almost the same problem now here is my question:
How can I visualize those common intervals from many bed files?
I would imagine something like
file1) -------------- ------------------- ------
file2) ------ ---------- ----------- ----------
file3) ---------- ------------- -- --- ---
Where - indicates present in file, and empty space, as no coverage at all
A general approach can be done with BEDOPS bedmap --count, which generalizes to N input files:
$ N=`ls *.bed | wc -l`
$ bedops --everything A.bed B.bed C.bed ... N.bed \
| bedmap --count --echo --delim '\t' - \
| uniq \
| awk -vN=${N} '$1==N' \
| cut -f2- \
> common.bed
By changing the test in the awk statement, this approach can be modified to return other subsets of the input's power set, e.g., all elements common to N-1 inputs, N-2 inputs, etc.
This approach is linear and doesn't create intermediate files, so it should be generally preferable to alternative approaches that make N^2 or more passes, if your time is valuable to you.
yea, I'm just asking of the last part: generating the plot. I did some googling, and maybe plotgff, and some iranges tricks to put those reduced ranges on one plot.
On a side note however, I resolved my common intervals problem with:
'convert2bed' from bamutils, then 'bedops intersect' and passed the same file twice to get reduced intervals in the file. Last step 'bedtools multiinter -header -empty -g genome file' on all those reduced files to find common intervals, how long they are and to which files exactly they belong
hi,
Using (recent versions) bedtools intersect, give one BED as 1st arg. and the rest 49 as 2nd arg. And use the -c
parameter. That will give you the count of overlaps across the files. Then you can use that to apply your threshold.
Ofcourse the above works assuming you dont get multiple overalps from a singel BED. So you can first create unique intervals of each BED and then do the above.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
So useful, thanks a lot!!
You're welcome!
I have the path of the multiple bed files in a txt file I want bedops to process. Is it possible to pass this txt file into the bedops command? The txt file has one path on each line.
Here is one way, via bash:
Replace
bedops -u
with your command and options, etc.