Hi,
Always on the quest of learning, I would like to know how to speed up the following task.
I have a reference bed file containing regions and a query bed file that contains peak coordinates and a single signal value.
The goal is to average peak signal within each domain.
Here is my attempt in R on a linux cluster which takes a ~20min to compute (using the full data, example data below):
ref <- data.frame("Chr"=c("chr1","chr1","chr1"),"Start"=c(3000000,3050000,3100000),"End"=c(3050000,3100000,3150000))
ref.gr <- GRanges(ref)
peaks <- data.frame("Chr"=c("chr1","chr2","chr1"),"Start"=c(3010000,3050000,3015000),"End"=c(3011000,3100000,3016000),"Signal1"=c(1,2,3),"Signal2"=c(4,5,6))
peaks.gr <- GRanges(peaks)
x <- sapply(1:length(ref.gr), function(y){
ovlp <- findOverlaps(ref.gr[y],peaks.gr,ignore.strand=T)
ind1 <- subjectHits(ovlp)
if (length(ind1)==0) {
rep(NA,2)
} else {
colMeans(data.frame(peaks.gr)[ind1,6:7])
}
})
result <- cbind(ref,t(x))
result
Chr Start End Signal1 Signal2
1 chr1 3000000 3050000 2 5
2 chr1 3050000 3100000 NA NA
3 chr1 3100000 3150000 NA NA
Thanks for your feedback.
It probably would be fatster to use bedtools.
Not reproducible, please provide some example data and expected output.
Added, sorry.