Is there a memory-efficient way of sorting and merging a large number of BED files, each containing millions of entries, into a single BED file in which duplicated or partially overlapping entries are merged so that every interval is unique?
I have tried the following, but it blows up in memory beyond the 32G I have available here:
find /my/path -name '*.bed.gz' | xargs gunzip -c | ~/src/bedtools-2.17.0/bin/bedtools sort | ~/src/bedtools-2.17.0/bin/bedtools merge | gzip -c > bed.all.gz
Any suggestions?
A minor addition to this: there is a -m option for sort that takes files that are already individually sorted and merges them into one. I think this would work if I weren't gunzip'ing the files.
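For illustration, here is a minimal sketch of how sort -m could be combined with on-the-fly decompression using process substitution; the file names are placeholders, and each input is assumed to already be coordinate-sorted:

# Sketch only: assumes each .bed.gz is already sorted by chrom then start
# (sort -k1,1 -k2,2n order); the sample file names are hypothetical.
sort -m -k1,1 -k2,2n \
    <(gunzip -c sample1.bed.gz) \
    <(gunzip -c sample2.bed.gz) \
    <(gunzip -c sample3.bed.gz) \
  | ~/src/bedtools-2.17.0/bin/bedtools merge -i stdin \
  | gzip -c > bed.all.gz

Because sort -m only merges already-ordered streams, it never needs to buffer whole files, so memory use stays small regardless of how many inputs there are.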
I tried this version and it uses a very small amount of memory. It is slower than the equivalent bedtools sort, but it solves my problem.
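The exact command isn't shown here, but a plausible reconstruction, assuming GNU sort (which spills to temporary files on disk instead of holding everything in memory) is used in place of bedtools sort, would look something like:

# Hypothetical reconstruction: GNU sort performs an external, disk-backed
# merge sort, so peak memory is bounded by -S; adjust the buffer size and
# temp directory (-T) to suit the machine.
find /my/path -name '*.bed.gz' | xargs gunzip -c \
  | sort -k1,1 -k2,2n -S 2G -T /tmp \
  | ~/src/bedtools-2.17.0/bin/bedtools merge -i stdin \
  | gzip -c > bed.all.gz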