Question

How to prevent reduce from being performed in GRanges during setdiff

2

Entering edit mode

3.8 years ago

Bosberg ▴ 50

In the genomic ranges package for R, is there an option in setdiff that prevents the collapsing of adjacent ranges? for example, If I have the following:

gr1 = GRanges object 
  [1]     chr1      1-10      *
  [2]     chr1     11-20      *
  [3]     chr1     21-30      *

gr2 = GRanges object
  [1]     chr1     18-25      *

and I take setdiff:

What I get:

setdiff( gr1, gr2 )
  [1]     chr1      1-17      *   # <-- 1-10 and 11-20 are reduced together automatically; I don't want this.
  [2]     chr1     26-30      *

The output stops at 17 and picks up again at 26 to avoid gr2 from 18-25, so at least that much is correct, but unfortunately, my output now has a single continuous range from 1 to 17 instead of 1-10, and a different GRange directly adjacent from 11-17. There's a reduce operation being done automatically that I want to suppress.

What I want:

setdiff( gr1, gr2, <Some_option_to_suppress_reduce> )
  [1]     chr1      1-10      *
  [2]     chr1     11-17      *
  [3]     chr1     26-30      *

I don’t want these first two regions to be reduced into one.

What I've tried:

The best solution I've come up with so far is to convert to a list and then back to GRange, like this:

unlist( GRangesList( lapply( 1:length(gr1), function(i) setdiff ( gr1[i], gr2) ) ))

..which does what I want but with all the converting between data types it's really slow and inefficient. Is there an option to turn off reduce directly (or some other more elegant solution)?

blunt-end genomicranges setdiff • 2.8k views

ADD COMMENT • link updated 18 months ago by hossiny • 0 • written 3.8 years ago by Bosberg ▴ 50

0

Entering edit mode

Apparently PyRanges does this by default with the function subtract() --which, I think is a much more logical default setting for something called 'subtract' (but for a function called 'setdiff()' I can understand this choice of default behavior; it would be nice if there were an equivalent 'subtract' function for GRanges though).

https://pyranges.readthedocs.io/en/latest/autoapi/pyranges/index.html#pyranges.PyRanges.subtract

Solution=switch to PyRanges, or use function below.

ADD REPLY • link 3.7 years ago by Bosberg ▴ 50

0

Entering edit mode

if you use GenomicRanges::subtract(gr1, gr2) %>% unlist() .. you'll get the diff without reduction

ADD REPLY • link 18 months ago by hossiny • 0

score 2 · Answer 1 · 2021-02-24

Thanks to Rongxin for putting me on the right track here. I wrote the following function in case anyone encounters this in the future (editted March 22 to account for the fact that bedtools assumes base-0 start position; if anyone encounters problems with this, please reply and ping me.):

library(GenomicRanges)
library(bedtoolsr)
# Written by bosberg on Biostars, March 22, 2021. Free to use for scholarly purposes.

GRanges_subtract <- function( gr1, gr2 )
{
  # Subtract GRange object gr2 from gr1, but unlike setdiff, preserve individual
  # ranges in gr1
  df_1 = data.frame( seqnames=seqnames(gr1), start=start(gr1)-1, end=end(gr1), strand=strand(gr1), mcols( gr1 ) )
  df_2 = data.frame( seqnames=seqnames(gr2), start=start(gr2)-1, end=end(gr2), strand=strand(gr2), mcols( gr2 ) )
  #                                                          ^ -1 --> convert to base-0 start for bedtools
  result = bedtoolsr::bt.subtract(df_1, df_2)

  if ( length(result)==0 ){
    # subtraction has left nothing remaining. Return empty GRanges obj.
    return( GRanges() )
  } else {

    colnames( result ) = colnames( df_1 )
    result$start=result$start+1
    #                        ^ reset to base-1 notation consistent with GRanges

    return ( GRanges( result ) )
  }
}

score 1 · Answer 2 · 2021-02-19

Hello Bosberg,

I understand your question. I am sorry that I have no idea about using setdiff to solve this problem.

However, I would like to recommend you an alternative R package (bedtoolsr) which may solve this problem.

Example code

install.packages("devtools")
devtools::install_github("PhanstielLab/bedtoolsr")

library(GenomicRanges)
library(bedtoolsr)

A <- GRanges("chr1", IRanges(c(1, 11, 21), c(10, 20, 30)))
B <- GRanges("chr1", IRanges(18, 25))

A.2 <- data.frame(seqnames=seqnames(A), starts=start(A), ends=end(A))
B.2 <- data.frame(seqnames=seqnames(B), starts=start(B) - 1, ends=end(B) + 1) # boundary should be handled carefully
print(bedtoolsr::bt.subtract(A.2, B.2))
# result
#    V1 V2 V3
#1 chr1  1 10
#2 chr1 11 17
#3 chr1 26 30
# then convert data.frame to GRanges

score 0 · Answer 3 · 2021-10-28

Here is a solution based on R/GenomicRanges only:

subtract.GRanges <- function(x, y){
  stopifnot(isDisjoint(x))
  z <- disjoin(c(x, y))
  return(z[!overlapsAny(z, y)])
}

Lets try it on your input:

x <- GRanges(c('chr1:1-10', 'chr1:11-20', 'chr1:21-30'))
y <- GRanges('chr1:18-25')
subtract.GRanges(x, y)

The output is:

GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      1-10      *
  [2]     chr1     11-17      *
  [3]     chr1     26-30      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Here a graphical representation of how it works:

Graphical representation of the subtract.GRanges()

First the two sets of GRanges x and y are combined and disjoined so that we get new GRanges z with all regions and breakpoints present in x and y. Then, all ranges in z that overlap y are removed to get the final result.

One caveat is that the method only works if the ranges in x are disjoint. Disjoint means that there must not be any overlapping ranges within a GRanges object. If this method would be applied on x with overlapping ranges the result might not be as expected, so the function above checks this condition.