findOverlaps function in R
2
8
Entering edit mode
8.9 years ago
SGMS ▴ 130

Hi everyone,

I have used the "findOverlaps" function in R to find which positions of my two datasets overlap. I have also used "countOverlaps" to see how many overlaps I have.

What I want to do now, is to find the exact coordinates which overlapped. I was looking how to do it using findOverlaps but it has no such options and by doing a google search, I couldn't find much help.

A little help on this would be greatly appreciated.

Thank you

findOverlaps R • 20k views
ADD COMMENT
0
Entering edit mode

I have switched to foverlaps (data.table package) whenever I working with genomic regions.

A fast binary-search based overlap join of two data.tables. This is very much inspired by findOverlaps function from the bioconductor package IRanges...

Simply intersect two data tables and extract overlapping region. Please post example data and I will show how it may be done.

ADD REPLY
0
Entering edit mode

Thank you for your reply Pgibas. Example data as required:

1st dataset:

1   668630  850204
1   713044  755966
1   738570  742020
1   766600  769112

2nd dataset:

1    149854870    149909434
1    17325224     17400665
1    19971779     20026055
1    24180459     24259817

Thanks again!

ADD REPLY
1
Entering edit mode

You datasets don't overlap.

ADD REPLY
0
Entering edit mode

What is the expected output? Do you want a new set of regions that represents the overlaps between dataset 1 and 2?

ADD REPLY
0
Entering edit mode

yes! exactly that!

ADD REPLY
0
Entering edit mode

I personally use mergeByOverlaps and subsetByOverlaps for these cases.

ADD REPLY
10
Entering edit mode
8.9 years ago
PoGibas 5.1k

I have done this with dummy data.

Lets say we have we have regions A (groupA) and regions B (groupB).

groupA <- data.table(
    chr = rep("chr1", 5),
    start = seq(0, 1000, 100),
    end = seq(50, 1050, 100))
setkey(groupA, chr, start, end)

groupB <- data.table(
    chr = rep("chr1", 5),
    start = seq(25, 1025, 100),
    end = seq(75, 1075, 100))
setkey(groupB, chr, start, end)

Check:

  1. If your datasets are data.table class(groupA), if not do setDT(groupA)
  2. If keys are chr start end, if not do setkey(groupA, chr, start, end)

# Find overlaps
over <- foverlaps(groupA, groupB, nomatch = 0)
# Extract exact regions
over2 <- data.table(
    chr = over$chr,
    start = over[, ifelse(start > i.start, start, i.start)],
    end = over[, ifelse(end < i.end, end, i.end)])
ADD COMMENT
1
Entering edit mode

Thanks so much for this. My coordinates actually overlap -my datasets are much larger than the above-. I used your code on my data and it works perfectly. I have double-checked the number of hits with findOverlaps and the number is the same, but now I also have the overlapped coordinates! Thanks again :)

ADD REPLY
0
Entering edit mode

If it helped to solve your problem mark the answer :)

ADD REPLY
0
Entering edit mode

@Pgibas If I want to label the values, for example in a new column I assign two values, "Exact" or "Partial" based on if the value in groupA exactly falls within the range of region in groupB and if there is a partial overlap, respectively. How can I do that?

ADD REPLY
9
Entering edit mode
8.9 years ago

The findOverlaps() method is not the right tool for the job. Instead, you want the set intersection which can be obtained with intersect() on two GRanges objects.

# create two GRanges objects for testing
gr1 = GRanges(seqnames='chr1',ranges=IRanges(start=seq(1,1001,100),width=50))
gr2 = GRanges(seqnames='chr1',ranges=IRanges(start=seq(25,1025,100),width=50))

# and "intersect" them, finding the set of regions
# formed by the overlaps of gr1 and gr2
gr3 = intersect(gr1,gr2)

# and the result
gr3

GRanges object with 11 ranges and 0 metadata columns:
       seqnames       ranges strand
          <Rle>    <IRanges>  <Rle>
   [1]     chr1 [  25,   50]      *
   [2]     chr1 [ 125,  150]      *
   [3]     chr1 [ 225,  250]      *
   [4]     chr1 [ 325,  350]      *
   [5]     chr1 [ 425,  450]      *
   [6]     chr1 [ 525,  550]      *
   [7]     chr1 [ 625,  650]      *
   [8]     chr1 [ 725,  750]      *
   [9]     chr1 [ 825,  850]      *
  [10]     chr1 [ 925,  950]      *
  [11]     chr1 [1025, 1050]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
ADD COMMENT
0
Entering edit mode

Thank you! A much easier solution! I'm wondering: I get 137 ranges, meaning we have 137 overlapped coordinates right?

When I was using countOverlaps though, it gave me 195. Shouldn't the number of these two be the same -despite the fact that "intersect" also gives us the coordinates-?

Thanks again!

ADD REPLY
1
Entering edit mode

If regions in dataset1 overlap with other regions in dataset1, they will end up being "collapsed" into a single region in the output of intersect(). This is the discrepancy with countOverlaps, which will count the overlapping regions from dataset1 twice.

ADD REPLY

Login before adding your answer.

Traffic: 2003 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6