BEDOPS bedops
and bedmap
are efficient in memory and time on whole-genome scale files. These tools also work with standard input and output streams, which allows chaining of operations and integration of set operations with standard Unix tools. These features make work on hundreds of files tractable.
Method 1
If you can get away without the NA
, given sorted single-base BED files A.bed
, B.bed
through N.bed
(N=100 or whatever), you could take their union and their merge (these are all stored under the directory files
):
$ bedops -u files/A.bed files/B.bed ... files/N.bed > union.bed
$ bedops -m files/A.bed files/B.bed ... files/N.bed > merge.bed
Or if you want to work on all the files in this directory, more simply:
$ bedops -u files/*.bed > union.bed
$ bedops -m files/*.bed > merge.bed
Then map the IDs in the union file to the intervals in the merge file:
$ bedmap --echo --echo-map-id --delim '\t' merge.bed union.bed > answer.bed
The file answer.bed
would look like:
chr1 1 2 1;10;2
chr1 10 11 3;8
chr1 50 51 2;4
chr10 2 3 8;9
The --multidelim '\t'
option could be added, but using the default semi-colon delimiter may make it easier to demonstrate the result. In this approach, there are no NA values and ID values are presented in lexicographical order, not the order of original input files.
Method 2
If you need an NA
value positioned in order from columns 4 through N+3, where ID values are missing, then one option is to loop over all N BED files to generate N per-file maps:
$ for fn in `ls files/*.bed`; do echo $fn; bedops -n 1 merge.bed $fn | awk '{ print $0"\tNA" }' | bedops -u - $fn | cut -f4 > $fn.map; done
The pipeline of bedops
commands here uses merge.bed
to fill in gaps where intervals are missing, adding NA
as the ID value for missing intervals before doing the final mapping step with cut
. At the end, each of the per-file maps is a column in the result you ultimately want.
Finally, use paste
to glue all these columns together into an N+3 column matrix:
$ paste merge.bed files/*.bed.map > answer.mtx
The file answer.mtx
will contain the intervals in the first three columns, and ID values from columns 4 through N+3, for input files A.bed
through N.bed
. It'll look like this:
chr1 1 2 2 10 1
chr1 10 11 3 8 NA
chr1 50 51 4 NA 2
chr10 2 3 NA 8 9
Basically, this is the result you want, but without the chrm ... fileN
header at the top.
Unix pipelines demo'ed in this answer will make all of this work go faster, by avoiding generating intermediate files as much as possible.
If you have access to a compute cluster and each file is very large, or you have many hundreds of files, and you need to automate this process, each iteration or step in the serial per-map loop can be parallelized as a separate job on the cluster. Then you have a final, "dependent" job that does the paste
step at the end, in map-reduce fashion.
I'm trying to wrap my head around the difference between the Union (
-u
) and Merge (-m
). But even looking at the docs it seems both do the same in practical terms. Why do you need both?And also, how do you keep the file names at the top of the columns, as the OP asked?
Thanks a lot in advance
The documentation includes images to visually explain the difference between different operations:
• https://bedops.readthedocs.io/en/latest/content/reference/set-operations/bedops.html • https://bedops.readthedocs.io/en/latest/content/reference/set-operations/bedops.html#everything-u-everything • https://bedops.readthedocs.io/en/latest/content/reference/set-operations/bedops.html#merge-m-merge
Basically, the difference is that
-u
puts input elements into the output set, without changing them. The-m
operation creates new elements by merging the input intervals.Thanks so much! Do you see a straightforward way to keep the file names as feature names? (i.e. column names for columns 4+). Asking this for the case with many input files. For now, I'm processing everything and at the end I'm adding a new line with file names, but still need to manually make sure the columns match
I don't know. It will depend on your inputs and I don't really understand what you are doing. Maybe read through the docs I linked to, in order to see which operators best meet your needs, and then explore?