Merge duplicate GRanges ranges
1
3
Entering edit mode
10.2 years ago
ifreecell ▴ 220

Hi there, I have a GRanges object that I wanna merge duplicate ranges and append counts as a metadata columns. I know similar thing can be easily done using uniq -c, but I want do it in R/Bioconductor. Is there any suggestion?

> originalGRanges

GRanges with 6 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     Chr1   [2, 24]      +
  [2]     Chr1   [2, 25]      +
  [3]     Chr1   [2, 25]      +
  [4]     Chr1   [4, 25]      +
  [5]     Chr1   [5, 25]      +
  [6]     Chr1   [6, 21]      +
  ---
  seqlengths:
           Chr1
             NA
> expectedGRanges

GRanges with 5 ranges and 1 metadata column:
      seqnames    ranges strand |      hits
         <Rle> <IRanges>  <Rle> | <numeric>
  [1]     Chr1   [2, 24]      + |         1
  [2]     Chr1   [2, 25]      + |         2
  [3]     Chr1   [4, 25]      + |         1
  [4]     Chr1   [5, 25]      + |         1
  [5]     Chr1   [6, 21]      + |         1
  ---
  seqlengths:
           Chr1
             NA
merge R GRanges • 9.1k views
ADD COMMENT
3
Entering edit mode
10.2 years ago

You're looking for the countOverlaps() function:

g <- GRanges(seqnames=c(rep("Chr1",6)), strand=rep("+",6), ranges=IRanges(start=c(2,2,2,4,5,6), end=c(24,25,25,25,25,21)))
g2 <- unique(g)
g2$hits<- countOverlaps(g2, g, type="equal")

You could also run countOverlaps(g,g,type="equal") first and then use unique(). You'll get the same result either way.

ADD COMMENT
0
Entering edit mode

I am testing your approach on a sorted 5 million ranges object, it took hours before Rstudio encountered a fetal error. BTW, countOverlaps is really RAM consuming, it ate up all RAM (64G) on my computer.

ADD REPLY
0
Entering edit mode

Yeah, I imagine that that could prove pretty memory heavy, since I don't think it assumes that things are sorted. You might split() by chromosome and then use lapply(). Perhaps that'll be a bit more reasonable.

ADD REPLY
0
Entering edit mode

There is only one chromosome in this GRanges, I don't think split() will help. I wrote the GRanges to a bed file, and it took one second for uniq -c to compute. I really appreciate it if the entire workflow can be done in Bioconductor, is there other workaround?

ADD REPLY
0
Entering edit mode

Is it the unique() step that takes forever or just countOverlaps (or both)? I would guess that it's just the countOverlaps() function. Since you only ever care about exact overlaps/matches, you might be best off writing a custom function to do this. At least if you have to do this more than once or twice.

ADD REPLY
0
Entering edit mode

You are right, it's just the countOverlaps() function. Thank you anyway, I will try to write a custom function.

ADD REPLY

Login before adding your answer.

Traffic: 2960 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6