Question

How to speed up averaging peak signal overlapping with domains

0

Entering edit mode

2.8 years ago

gdeniz ▴ 20

Hi,

Always on the quest of learning, I would like to know how to speed up the following task.

I have a reference bed file containing regions and a query bed file that contains peak coordinates and a single signal value.

The goal is to average peak signal within each domain.

Here is my attempt in R on a linux cluster which takes a ~20min to compute (using the full data, example data below):

ref <- data.frame("Chr"=c("chr1","chr1","chr1"),"Start"=c(3000000,3050000,3100000),"End"=c(3050000,3100000,3150000))
ref.gr <- GRanges(ref)

peaks <- data.frame("Chr"=c("chr1","chr2","chr1"),"Start"=c(3010000,3050000,3015000),"End"=c(3011000,3100000,3016000),"Signal1"=c(1,2,3),"Signal2"=c(4,5,6))
peaks.gr <- GRanges(peaks)

x <- sapply(1:length(ref.gr), function(y){
  ovlp <- findOverlaps(ref.gr[y],peaks.gr,ignore.strand=T)
  ind1 <- subjectHits(ovlp)
  if (length(ind1)==0) {
    rep(NA,2)
  } else {
    colMeans(data.frame(peaks.gr)[ind1,6:7])
  }
})

result <- cbind(ref,t(x))
result

   Chr   Start     End Signal1 Signal2
1 chr1 3000000 3050000       2       5
2 chr1 3050000 3100000      NA      NA
3 chr1 3100000 3150000      NA      NA

Thanks for your feedback.

R • 1.3k views

ADD COMMENT • link updated 2.8 years ago by seidel 11k • written 2.8 years ago by gdeniz ▴ 20

1

Entering edit mode

It probably would be fatster to use bedtools.

ADD REPLY • link 2.8 years ago by 4galaxy77 2.9k

0

Entering edit mode

Not reproducible, please provide some example data and expected output.

ADD REPLY • link 2.8 years ago by ATpoint 85k

0

Entering edit mode

Added, sorry.

ADD REPLY • link 2.8 years ago by gdeniz ▴ 20

score 3 · Accepted Answer · 2022-02-16

3

Entering edit mode

2.8 years ago

seidel 11k

One thing that might help is to be a little more straightforward about it, and use the built in facilities for accessing the things you want, rather than converting to other data structures unnecessarily. In your case, you call findOverlaps() but then use the result inefficiently, and I don't understand why you have commas in the accessor for your GRanges objects.

If you have two GRanges objects, one of the regions to be queried: reference, and one of the peaks from which you want to extract and average the signal values, you can do the following:

library(GenomicRanges)
# for GRanges objects of Query Regions: reference
# and Peaks: peaks

ol <- findOverlaps(reference, peaks)
# average the signal for all peaks overlapping each reference element
averagedValues  <- tapply(peaks$signal[subjectHits(ol)], factor(queryHits(ol)), mean)

The result of findOverlaps() is a Hits object, and any reference regions with no overlap will not be returned, so you don't have worry about checking for a result. Then you can simply use the query hits as a factor to average the signal for all the matching peak elements using tapply(), which has the following form:

tapply(dataToBeProcessed, factor(dataGroupIDs), functionToApply)

ADD COMMENT • link 2.8 years ago by seidel 11k

1

Entering edit mode

One easy way to get all the results together would be to add the averaged values to the reference regions:

ref <- GRanges(ranges=IRanges(start=c(3,7,45,80,65), end=c(50,30,63,91,72)),
               seqnames="chr1", strand="*")

peaks <- GRanges(ranges=IRanges(start=c(7,23,50,93,21), end=c(15,45,75,97,33)),
               seqnames="chr1", strand="*", signal=c(5,2,7,5,8))

ol <- findOverlaps(ref, peaks)
averagedValues  <- tapply(peaks$signal[subjectHits(ol)], factor(queryHits(ol)), mean)

# Add the information to the reference object
ref$avg <- NA
ref$avg[as.numeric(names(averagedValues))] <- averagedValues

GRanges object with 5 ranges and 1 metadata column:
      seqnames    ranges strand |       avg
         <Rle> <IRanges>  <Rle> | <numeric>
  [1]     chr1      3-50      * |       5.5
  [2]     chr1      7-30      * |       5.0
  [3]     chr1     45-63      * |       4.5
  [4]     chr1     80-91      * |        NA
  [5]     chr1     65-72      * |       7.0

The reference peaks that were non-overlapping can also be found as follows:

# the ones that were found are the queryHits
# so get everything not them
ref[! 1:length(ref) %in% queryHits(ol)]

GRanges object with 1 range and 1 metadata column:
      seqnames    ranges strand |       avg
         <Rle> <IRanges>  <Rle> | <numeric>
  [1]     chr1     80-91      * |        NA

ADD REPLY • link 2.8 years ago by seidel 11k

0

Entering edit mode

Hi, thanks a lot. tapply() is incredibly useful to add to my arsenal. I updated post with example data, little improvements and my output. How would I make sure to get the NAs for non overlapping reference regions and keeping the chromosome locations?

This is the only way I could think of:

ovlp <- findOverlaps(ref.gr[y],peaks.gr,ignore.strand=T)
y <- apply(peaks[subjectHits(ovlp),4:5],2,function(x) tapply(x, factor(queryHits(ovlp)), mean))
y <- cbind(ref[unique(queryHits(ovlp)),],t(data.frame(y)))
y2 <- ref[-unique(queryHits(ovlp)),]
y2 <- cbind(y2,NA,NA)
colnames(y2) <- colnames(y)
rbind(y,y2)

ADD REPLY • link 2.8 years ago by gdeniz ▴ 20