I have a long list of genomic coordinates in the format chromosome:position. Most or all of these could be expressed as intervals, i.e. chr:start-end, because the bases are consecutive. I can think of a bunch of approaches in R, but each list is 50 million lines long and I have a lot of these lists. Is there a fast way to do this?
If the input data looks like this:
1:501
1:502
1:503
1:634
1:635
1:636
8:9982
8:9983
8:9984
8:9985
etc
I would like the output to look like this:
1:501-503
1:634-636
8:9982-9985
etc
The input data is in order, and each line is unique. Any ideas? I'm open to R/data.table/Bioconductor, command-line tools like BEDtools, or Unix utilities like awk. I would prefer to avoid Python, as it's not present anywhere else in this workflow.
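For illustration, one way the collapsing described above could be done in a single streaming pass is with awk. This is only a minimal sketch, assuming every line is exactly `chr:pos`, the file is sorted, and no line is repeated. The function name `collapse_runs` is made up for this example.

```shell
# collapse_runs (hypothetical name): merge consecutive chr:pos lines
# into chr:start-end intervals. Reads the named files, or stdin if none.
# Assumes sorted, unique input with one "chr:pos" per line.
collapse_runs() {
  awk -F: '
    # Same chromosome and the next base: extend the current run.
    NR > 1 && $1 == chr && $2 == end + 1 { end = $2; next }

    # Otherwise close the previous run (if any) and start a new one.
    {
      if (NR > 1) print chr ":" start "-" end
      chr = $1 ""          # append "" so chromosome names compare as strings
      start = end = $2
    }

    # Emit the final run once the input is exhausted.
    END { if (NR > 0) print chr ":" start "-" end }
  ' "$@"
}

# Example on the sample data from the question:
printf '1:501\n1:502\n1:503\n1:634\n1:635\n1:636\n8:9982\n8:9983\n8:9984\n8:9985\n' | collapse_runs
# 1:501-503
# 1:634-636
# 8:9982-9985
```

Only the current run is held in memory, so memory use should not grow with file size. Note that an isolated position such as 1:700 would come out as 1:700-700 in this sketch. If single bases should be printed differently, the two print statements are where to change it.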
Thank you, this is very clean.