Evaluate overlap of regions
2
1
Entering edit mode
6.1 years ago
gamma.jian ▴ 40

Dear all,

I am struggling with what should probably be a trivial problem. I have 2 seg files with the following structure:

chr    start   end  feature.1  feature.2 
1       1000  1200   0.5          0.001


chr    start   end  feature.1  feature.2  
1       1100  1400   0.7          0.004

Let's call them A and B. I want to obtain a unique file like this:

chr    start   end     feature.1.a      feature.1.b
1       1000    1100     0.5             NA
1       1100    1200       0.5            0.7
1       1200    1400     NA             0.7

If a region is covered by only A or B, then it would contain only the corresponding values and it would have NA in the other fields. Otherwise, I want to retrieve both of them. This looks close to what bedtools -intersect does, and indeed I am trying to use the package Bedr to solve this task, but it seems to leave out regions covered only in one file. Maybe GenomicRanges is another possibility. If you have any hint on this it would be very helpful to me. Thank you

genome R • 1.5k views
ADD COMMENT
1
Entering edit mode

See if this example in this answer helps:

A: Struggling to filter data with R, tidyverse

ADD REPLY
0
Entering edit mode

Thank you, even though the operation is not symmetrical it definitely helped out with my problem. I can add it as a solution myself or you can add it and I will close the post. Thank you again.

ADD REPLY
0
Entering edit mode

Feel free to answer your own question, and accept as answer.

ADD REPLY
1
Entering edit mode

There's a --partition option in BEDOPS bedops that solves this pretty easily. Some people don't seem to like using this kit for some reason, but if you're interested, let me know and I'll put together an answer.

ADD REPLY
2
Entering edit mode
6.1 years ago
gamma.jian ▴ 40

I managed to deal with the issue using data.table::foverlaps. I report an example from another thread.

A: Struggling to filter data with R, tidyverse

The solution needs to be adapted but essentially solves the problem.

ADD COMMENT
1
Entering edit mode
6.1 years ago

Given two GenomicRanges objects a and b.

library(GenomicRanges)
a <- GRanges("chr1:1000-1200")
b <- GRanges("chr1:1100-1400")
a$feature.1 <- 0.5
b$feature.1 <- 0.7

get.overlap.info <- function(a,b){
   overlap.r <- intersect(a,b)
   different.r1 <- setdiff(a,b)
   different.r2 <- setdiff(b,a)
   different.r <- c(different.r1,different.r2)
   full.r <- c(different.r,overlap.r)
   ol.a <- findOverlaps(full.r,a)
   ol.b <- findOverlaps(full.r,b)
   full.r$feature.1.a <- NA
   full.r$feature.1.b <- NA
   full.r$feature.1.a[queryHits(ol.a)] <- a$feature.1[subjectHits(ol.a)]
   full.r$feature.1.b[queryHits(ol.b)] <- b$feature.1[subjectHits(ol.b)]
   return(full.r)
}

new.ranges <- get.overlap.info(a,b)
new.ranges
GRanges object with 3 ranges and 2 metadata columns:
      seqnames    ranges strand | feature.1.a feature.1.b
         <Rle> <IRanges>  <Rle> |   <numeric>   <numeric>
  [1]     chr1 1000-1099      * |         0.5        <NA>
  [2]     chr1 1201-1400      * |        <NA>         0.7
  [3]     chr1 1100-1200      * |         0.5         0.7
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
ADD COMMENT
0
Entering edit mode

you could also use sort() on that final object so that it is sorted

ADD REPLY

Login before adding your answer.

Traffic: 2511 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6