Change list of consecutive genomic coordinates to intervals?
3
1
Entering edit mode
7.1 years ago

I have a long list of genomic coordinates in the format chromosome:position. Most or all of these could be expressed as intervals, ie chr:start-end, because the bases are consecutive. I can think of a bunch of approaches in R, but my list is 50 million lines long and I have a lot of them. Is there a fast way to do this?

If the input data looks like this:

1:501
1:502
1:503
1:634
1:635
1:636
8:9982
8:9983
8:9984
8:9985
etc

I would like the output to look like this:

1:501-503
1:634-636
8:9982-9985
etc

The input data is in order, and each line is unique. Any ideas? I'm open to R/data.table/bioconductor, command line tools like BEDtools etc, unix utilities like awk or whatever. I would prefer to avoid python as it's not present anywhere else in this workflow.

sequence R intervals • 1.8k views
ADD COMMENT
3
Entering edit mode
7.1 years ago

using bedtools merge http://bedtools.readthedocs.io/en/latest/content/tools/merge.html

awk -F ':' '{printf("%s\t%d\t%d\n",$1,$2,1+$2);}'  input.txt|\
sort -t $'\t' -k1,1 -k2,2n |\
bedtools merge -i - |\
awk '{printf("%s:%d-%d\n",$1,$2,$3 -1);}'


1:501-503
1:634-636
8:9982-9985
ADD COMMENT
0
Entering edit mode

Thank you, this is very clean.

ADD REPLY
1
Entering edit mode
7.1 years ago

Hi Patrick, assuming your data is in a file called MyData.intervals you can try these commands (mixture of sed, paste, and awk; saves results into a new file called MyNewData.intervals):

sed 's/:/\t/g' MyData.intervals | awk '{chr[NR]+=$1; bp[NR]+=$2} END {for (i=1; i<=NR; i++) if(chr[i]==chr[i+1] && ((bp[i]-1)!=bp[i-1])) print chr[i]":"bp[i]; else if (chr[i]==chr[i-1] && ((bp[i]+1)!=bp[i+1])) {print bp[i]}}' | paste -s -d'-\n' > MyNewData.intervals
cat MyNewData.intervals
1:501-503
1:634-636
8:9982-9985

Kevin

ADD COMMENT
1
Entering edit mode
7.1 years ago
Charles Plessy ★ 2.9k

With Bioconductor:

> suppressPackageStartupMessages(library(GenomicRanges))
> reduce(
    GRanges(
      seqnames = c(1,1,1,1,1,1,8,8,8,8),
      ranges   = IRanges(
                   start = c(501,502,503,634,635,636,9982,9983,9984,9985),
                   width = 1)))
GRanges object with 3 ranges and 0 metadata columns:
      seqnames       ranges strand
         <Rle>    <IRanges>  <Rle>
  [1]        1 [ 501,  503]      *
  [2]        1 [ 634,  636]      *
  [3]        8 [9982, 9985]      *
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
ADD COMMENT

Login before adding your answer.

Traffic: 1759 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6