I have a file of ATAC-seq narrow peaks and want to remove overlapping peaks the way the PEPATAC pipeline does. Here is the algorithm: first, the most significant peak is kept and any peak that directly overlaps with that significant peak is removed. Then this process iterates to the next most significant peak, and so on, until all peaks have either been kept or removed due to direct overlap with a more significant peak.
Here is a sample of my data:
data <- structure(list(V1 = c("chr3", "chrUn_KI270467v1", "chr8", "chrUn_KI270467v1",
"chr21", "chr7"), V2 = c(93470281L, 1668L, 109333171L, 1668L,
14382415L, 12686167L), V3 = c(93470873L, 3946L, 109335050L, 3946L,
14384230L, 12688127L), V4 = c("Bcell-BM4983-150120-RepA.pe.q10.sort.rmdup_peak_60893",
"Bcell-BM4983-150120-RepA.pe.q10.sort.rmdup_peak_97388b", "Bcell-BM4983-150120-RepA.pe.q10.sort.rmdup_peak_92241d",
"Bcell-BM4983-150120-RepA.pe.q10.sort.rmdup_peak_97388c", "Bcell-BM4983-150120-RepA.pe.q10.sort.rmdup_peak_55158c",
"Bcell-BM4983-150120-RepA.pe.q10.sort.rmdup_peak_83188b"), V5 = c(91371L,
24480L, 19002L, 17131L, 17084L, 16639L), V6 = c(".", ".", ".",
".", ".", "."), V7 = c(726.76721, 240.65007, 195.49055, 179.63454,
179.23312, 175.41965), V8 = c(9144.74609, 2454.88721, 1906.9408,
1719.66797, 1714.96497, 1670.38489), V9 = c(9137.11816, 2448.07471,
1900.24915, 1713.15186, 1708.45618, 1663.93958), V10 = c(272L,
666L, 1082L, 1445L, 898L, 525L)), row.names = c(88715L, 141209L,
133771L, 141210L, 80584L, 120831L), class = "data.frame")
I sorted my peaks by significance. So far I have written this loop:
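library(plyranges)  # as_granges(); also attaches GenomicRanges/IRanges for overlapsAny()
# B1.1 must already be sorted by decreasing significance
i <- 1
while (i <= nrow(B1.1)) {  # nrow(B1.1) shrinks as peaks are dropped, so re-check it each time
  que <- as_granges(B1.1, seqnames = V1, start = V2, end = V3)
  # rows that overlap the i-th peak (the most significant peak not yet processed)
  overlaps <- which(overlapsAny(que, que[i]))
  overlaps <- overlaps[overlaps != i]  # never drop peak i itself
  if (length(overlaps) > 0) B1.1 <- B1.1[-overlaps, ]
  i <- i + 1
}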
Obviously this code is not very efficient and takes a lot of time.
I would appreciate your suggestions
Yes, exactly. Can you please explain your code? Thank you.
reduce_ranges merges all overlapping ranges into consensus ranges. group_by_overlaps then groups each peak in the original data by the merged peak (from the previous step) that it overlaps. slice then selects the first peak in each of these groups.
Well, actually this is not exactly what I'm looking for. Here is the algorithm: first, the most significant peak is kept and any peak that directly overlaps with that significant peak is removed. Then this process iterates to the next most significant peak, and so on, until all peaks have either been kept or removed due to direct overlap with a more significant peak.
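Separately from the reduce_ranges approach discussed here, the algorithm as stated can be sketched by computing all overlapping pairs once and then walking down the ranking. This is only a minimal sketch under assumptions: the helper name keep_top_peaks and its score_col argument are made up for illustration, V8 is assumed to be the significance column, and V2 is used as the start coordinate unchanged, as in the loop from the question.

```r
library(GenomicRanges)

# Hypothetical helper: greedy overlap filter for a data frame with columns
# V1 (chromosome), V2 (start), V3 (end) and a significance column.
keep_top_peaks <- function(df, score_col = "V8") {
  # Most significant peak first.
  df <- df[order(df[[score_col]], decreasing = TRUE), ]
  # Assumption: V2 is used as-is. narrowPeak starts are 0-based, so use
  # df$V2 + 1L if peaks that merely touch should not count as overlapping.
  gr <- GRanges(df$V1, IRanges(df$V2, df$V3))

  # All overlapping pairs, computed once instead of once per iteration.
  hits <- findOverlaps(gr, gr)
  q <- queryHits(hits)
  s <- subjectHits(hits)
  # For each peak, the lower-ranked peaks that it directly overlaps.
  down <- q < s
  lower <- split(s[down], factor(q[down], levels = seq_along(gr)))

  removed <- logical(length(gr))
  for (i in seq_along(gr)) {
    # A peak that was already removed does not remove anything else.
    if (!removed[i]) removed[lower[[i]]] <- TRUE
  }
  df[!removed, ]
}
```

With the sample data from the question, keep_top_peaks(data) should drop only the second chrUn_KI270467v1 peak, since it is the only one that overlaps a more significant peak.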
I modified the code so that it keeps only the peak with the largest value in the V8 column, which is what I believe your code was pointing to.
I'm not sure how reduce_ranges works, but I assume that if I have 3 ranges "1", "2", "3", and "1" and "2" overlap and "2" and "3" overlap but "1" and "3" don't, it still gives one consensus region merging all 3 ranges. What I want to do is different. In this example, I first order my ranges by their score (decreasing=TRUE). In the first iteration I want to delete only range "2" and keep range "3". Next I want to search for all ranges overlapping the next most significant range, which is range "3", and delete only the ranges that directly overlap range "3". I'm sorry if I couldn't explain this more clearly.
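To make the three-range case concrete, here is a small base R sketch with made-up coordinates and scores (all values and variable names are assumptions for illustration). It contrasts the greedy filter described above with a merge-based grouping that keeps one peak per chain of overlapping ranges.

```r
# Three toy ranges on one chromosome, already ordered by decreasing score:
# "1" overlaps "2", "2" overlaps "3", but "1" and "3" do not overlap.
peaks <- data.frame(
  name  = c("1", "2", "3"),
  start = c(100, 180, 260),
  end   = c(200, 280, 360),
  score = c(30, 20, 10),
  stringsAsFactors = FALSE
)

# Greedy filter: walk down the ranking and drop only the direct overlaps
# of each peak that is kept.
removed <- logical(nrow(peaks))
for (i in seq_len(nrow(peaks))) {
  if (removed[i]) next
  later <- seq_len(nrow(peaks)) > i
  hit <- later & peaks$start <= peaks$end[i] & peaks$end >= peaks$start[i]
  removed[hit] <- TRUE
}
greedy <- peaks$name[!removed]
# greedy is c("1", "3")

# Merge-based grouping: chain overlapping ranges into one cluster,
# then keep the highest-scoring peak per cluster.
by_start <- peaks[order(peaks$start), ]
running_end <- cummax(by_start$end)
new_cluster <- c(TRUE, by_start$start[-1] > head(running_end, -1))
cluster <- cumsum(new_cluster)
merged <- unname(vapply(
  split(by_start, cluster),
  function(g) g$name[which.max(g$score)],
  character(1)
))
# merged is "1"
```

The greedy version keeps range "3" because the only range it overlaps, "2", was itself removed, whereas the merge-based version treats all three ranges as one group and keeps only "1".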