Ah, a BEDOPS question. Cool. Here's one way to do it; perhaps there are others.
Let's say you have X sort-bed-sorted BED3+1 ("bedgraph") files, named like so:
A.bg B.bg C.bg ... X.bg
You can map them with the following two lines of code:
$ bedops --merge A.bg B.bg ... X.bg | bedops --chop 1 - > chop.bed
$ paste chop.bed <(bedmap --echo-map-id --unmapped-val 0 chop.bed A.bg) <(bedmap --echo-map-id --unmapped-val 0 chop.bed B.bg) ... <(bedmap --echo-map-id --unmapped-val 0 chop.bed X.bg) > answer.bed
That repeated <(bedmap ...) bit in the middle of the paste command is something a simple script would use a loop to create, for however many files you're working with. I'll leave the exact form as an exercise for the reader.
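That said, here's a rough sketch of such a loop, assuming all your bedgraphs end in .bg and sit in the current directory (chop.bed and answer.bed are just placeholder names):

files=( *.bg )

# Union of all intervals, chopped into single-base elements
bedops --merge "${files[@]}" | bedops --chop 1 - > chop.bed

# Build one <(bedmap ...) process substitution per input file
cmd="paste chop.bed"
for f in "${files[@]}"; do
    cmd+=" <(bedmap --echo-map-id --unmapped-val 0 chop.bed ${f})"
done

# eval is needed so the shell interprets the process substitutions
eval "${cmd} > answer.bed"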
(Note: The limit on X is determined by the Unix system you're using, specifically its per-process open-file limit. Without adjusting the operating system's kernel or ulimit settings, this generally means that, in practice, you can run BEDOPS tools on up to 1021 files simultaneously.)
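You can check that limit on your host with:

$ ulimit -n

A typical default of 1024, minus the three standard streams, is where the 1021 figure comes from.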
Multibase elements in all your files will be split into single-base elements; those single-base elements are what end up in chop.bed.
If a single-base position in chop.bed does not overlap anything in a given file, the --unmapped-val 0 option sets that cell's value to zero. This value can be a string, too, so nan or NaN or whatever else you'd prefer.
The file answer.bed will be a BED3+N file containing the single-base elements from chop.bed. The remaining N columns will be zeroes, or whatever value maps from each bedgraph at that position.
The order in which you specify each <(bedmap ...) process will determine the column order of the results. In other words, if you use A.bg, B.bg, ..., X.bg, then the columns will reflect that order.
Because the inputs are sorted, it is all processed in small chunks of memory. It will still take time to run, but I'll wager this will run faster than any other hand-wavy command-line option you're given here that works with text-based data.
I do recommend first trying this out on a handful of files, so that you understand how this works.
My guess is that putting your data into a database or binary format and storing it all in memory would be better. For the number and size of files you are working with, perhaps that's not a realistic option. You might look at bigWig files, but I don't know of any way to do this kind of tabular analysis with a lot of them. That's not to say it can't be done (maybe look at pyBigWig and Pandas if you have a huge amount of memory), but bigWigs were designed for visualization.
Another option for speeding things up is to use bedextract to efficiently split all your BED (or BED3+1) files by chromosome, and then throw 24 or so jobs onto your local computational cluster, each doing the map step on its own chromosome. That will typically reduce your runtime to however long it takes chr1 to complete, since it is the longest chromosome and so usually the largest input.
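Here's a hedged sketch of that split, again assuming *.bg inputs; the qsub call and the map_one_chromosome.sh job script are placeholders for whatever your scheduler and per-chromosome map step look like:

files=( *.bg )

# List chromosomes from one sorted input (assumes all files cover the same
# chromosomes), split every file by chromosome, then submit one job per chromosome
for chr in $(bedextract --list-chr "${files[0]}"); do
    mkdir -p "${chr}"
    for f in "${files[@]}"; do
        bedextract "${chr}" "${f}" > "${chr}/${f}"
    done
    qsub map_one_chromosome.sh "${chr}"   # or sbatch, or a backgrounded shell
done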
Good luck!
Do you need to work in bedgraph space? The file size is likely smaller with bigWig files and you could use something like wiggletools.

If you really have 200 files with such tiny intervals, I see why it takes that long. I think mapping should be faster, so bedtools map should be an alternative (mapping bigWigs as Trivas suggested is a very good idea!). To create the intervals to map to, you could use bedtools makewindows for uniform intervals, or bedtools multiinter if you want to retain exactly the original precision.
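For example (the hg38.genome chromosome-sizes file and the 100 bp window size are just placeholders):

# uniform, fixed-size windows
$ bedtools makewindows -g hg38.genome -w 100 > windows.bed

# or keep every breakpoint present across the bedgraphs
$ bedtools multiinter -i *.bg | cut -f1-3 > intervals.bed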
Thanks for the bigWig suggestion. So, after I convert my .bed to .bw, how do I use bedtools map to calculate coverage over a reference interval .bed file for all 200 .bw files? I can create the interval .bed file using makewindows. I need an output that looks like this:
Here, no. of columns = 203 (200 for the samples, and 3 for chr, start and stop). I am confused about which options to use in the bedtools map function.
Sorry, I did not make myself clear: if you have bedGraphs, then you could use bedtools map; with bigWigs, rather use bigWigAverageOverBed (also available from Bioconda) or wiggletools.
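Something along these lines might get you to the 203-column table, assuming bigWigs named *.bw, a chromosome-sizes file hg38.genome and 100 bp windows (all placeholders). bigWigAverageOverBed needs a unique name in column 4, and its output order may not match the input order, hence the sortable window names and the sort step:

# windows with sortable, unique names in column 4
bedtools makewindows -g hg38.genome -w 100 \
    | awk -v OFS='\t' '{ print $1, $2, $3, sprintf("w%09d", NR) }' > windows.bed

cols=( )
for bw in *.bw; do
    # output columns of the .tab file: name size covered sum mean0 mean
    bigWigAverageOverBed "${bw}" windows.bed "${bw%.bw}.tab"
    sort -k1,1 "${bw%.bw}.tab" | cut -f6 > "${bw%.bw}.mean"
    cols+=( "${bw%.bw}.mean" )
done

# chr/start/stop plus one mean-signal column per sample
cut -f1-3 windows.bed | paste - "${cols[@]}" > matrix.tsv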