Question

Merge duplicate GRanges ranges

3

Entering edit mode

10.2 years ago

ifreecell ▴ 220

Hi there, I have a GRanges object that I wanna merge duplicate ranges and append counts as a metadata columns. I know similar thing can be easily done using uniq -c, but I want do it in R/Bioconductor. Is there any suggestion?

> originalGRanges

GRanges with 6 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     Chr1   [2, 24]      +
  [2]     Chr1   [2, 25]      +
  [3]     Chr1   [2, 25]      +
  [4]     Chr1   [4, 25]      +
  [5]     Chr1   [5, 25]      +
  [6]     Chr1   [6, 21]      +
  ---
  seqlengths:
           Chr1
             NA

> expectedGRanges

GRanges with 5 ranges and 1 metadata column:
      seqnames    ranges strand |      hits
         <Rle> <IRanges>  <Rle> | <numeric>
  [1]     Chr1   [2, 24]      + |         1
  [2]     Chr1   [2, 25]      + |         2
  [3]     Chr1   [4, 25]      + |         1
  [4]     Chr1   [5, 25]      + |         1
  [5]     Chr1   [6, 21]      + |         1
  ---
  seqlengths:
           Chr1
             NA

merge R GRanges • 9.1k views

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by ifreecell ▴ 220

Ram · Answer 1 · 2014-09-04

3

Entering edit mode

10.2 years ago

Devon Ryan 104k

You're looking for the countOverlaps() function:

g <- GRanges(seqnames=c(rep("Chr1",6)), strand=rep("+",6), ranges=IRanges(start=c(2,2,2,4,5,6), end=c(24,25,25,25,25,21)))
g2 <- unique(g)
g2$hits<- countOverlaps(g2, g, type="equal")

You could also run countOverlaps(g,g,type="equal") first and then use unique(). You'll get the same result either way.

ADD COMMENT • link 10.2 years ago by Devon Ryan 104k

0

Entering edit mode

I am testing your approach on a sorted 5 million ranges object, it took hours before Rstudio encountered a fetal error. BTW, countOverlaps is really RAM consuming, it ate up all RAM (64G) on my computer.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by ifreecell ▴ 220

0

Entering edit mode

Yeah, I imagine that that could prove pretty memory heavy, since I don't think it assumes that things are sorted. You might split() by chromosome and then use lapply(). Perhaps that'll be a bit more reasonable.

ADD REPLY • link 10.2 years ago by Devon Ryan 104k

0

Entering edit mode

There is only one chromosome in this GRanges, I don't think split() will help. I wrote the GRanges to a bed file, and it took one second for uniq -c to compute. I really appreciate it if the entire workflow can be done in Bioconductor, is there other workaround?

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by ifreecell ▴ 220

0

Entering edit mode

Is it the unique() step that takes forever or just countOverlaps (or both)? I would guess that it's just the countOverlaps() function. Since you only ever care about exact overlaps/matches, you might be best off writing a custom function to do this. At least if you have to do this more than once or twice.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by Devon Ryan 104k

0

Entering edit mode

You are right, it's just the countOverlaps() function. Thank you anyway, I will try to write a custom function.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by ifreecell ▴ 220