Hi,
I have a dataset with two sets of ranges per row (.df1 and .df2), and I am trying to find a common set of ranges across both simultaneously. That is, I would like to merge rows only if both their .df1 ranges and their .df2 ranges overlap.
df = data.frame(seqnames.df1 = c("chr1", "chr1", "chr1"), start.df1 = c(1,3,3), end.df1 = c(4,8,14),
seqnames.df2 = c("chr1", "chr1", "chr1"), start.df2 = c(2,8,4), end.df2 = c(5,11,17),
score1=c(0,1,2), score2=c(2,3,4),score3=c(3,4,5), score4=c(5,6,7))
Desired output:
out = data.frame(seqnames.df1 = c("chr1", "chr1"), start.df1 = c(1,3), end.df1 = c(14,8), seqnames.df2 = c("chr1", "chr1"), start.df2 = c(2,8), end.df2 = c(17,11),
score1=c(1,2), score2=c(3,4),score3=c(4,5), score4=c(6,7))
Rows 1 and 3 are reduced to the union of their ranges, because both the .df1 ranges and the .df2 ranges overlap. Row 2 is not reduced because, although its .df1 range overlaps other ranges, its .df2 range does not. When rows are merged, the max score is kept.
Is there a clever way of doing this with GenomicRanges or other packages? I am struggling to find a good approach. For now I am creating the reduced ranges for each set and then looking for overlaps. Is this the right direction?
This is what I have so far:
library(GenomicRanges)
df1.gr = makeGRangesFromDataFrame(data.frame(chrom=df$seqnames.df1, start=df$start.df1, end=df$end.df1))
df2.gr = makeGRangesFromDataFrame(data.frame(chrom=df$seqnames.df2, start=df$start.df2, end=df$end.df2))
df1.main.gr = reduce(df1.gr)
df2.main.gr = reduce(df2.gr)
hits1 = findOverlaps(df1.gr, df1.main.gr)
hits2 = findOverlaps(df2.gr, df2.main.gr)
Thank you for any suggestions!
Can you set up your inputs as BED files (BED5)? I'd like to try to help, but I loaded your data frames into R and still don't really understand what you're trying to do. Some concrete input and output would help.
Yes, I was also looking at this in R last night, but I realised that it was not clear what the OP wanted. It also looks like it would be easier outside of R.
Hi, thank you for offering to help.
I think R is the most appropriate tool because of the "GenomicRanges" package. Setting up the problem would be just as hard with bedtools, unless it has a specific function for this situation. What I am trying to do is merge overlapping rows only if both sets of ranges (.df1 and .df2) overlap.
So, taking for example row 1 and row 2, merge the rows only if
the .df1 ranges from row 1 and row 2 overlap, AND
the .df2 ranges from row 1 and row 2 overlap.
If only one of these overlaps, for example
the .df1 ranges overlap, AND
the .df2 ranges do not overlap,
then we don't reduce the ranges in these rows and leave the rows as independent, as they are.
In short, rows should be merged only when both of their ranges overlap.
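The rule described above can be sketched in base R as follows. This is only a rough sketch under stated assumptions, not a GenomicRanges feature: `merge_if_both_overlap` is a hypothetical helper name, ranges are treated as closed intervals, "overlap" is tested pairwise between rows, and links are followed transitively (if row A links to B and B links to C, all three are merged). Scores are combined with `max`, as in the question. For larger data, the pairwise tests done here with `outer()` could presumably be replaced by self-overlaps from `findOverlaps()`.

```r
# Hypothetical helper (not part of GenomicRanges): merge rows whose .df1
# ranges AND .df2 ranges both overlap, following links transitively.
merge_if_both_overlap <- function(df) {
  n <- nrow(df)
  # n x n matrix: TRUE where closed intervals of rows i and j overlap
  # on the same sequence
  ov <- function(chr, s, e) {
    chr <- as.character(chr)
    outer(chr, chr, "==") & outer(s, e, "<=") & outer(e, s, ">=")
  }
  # rows are linked only if BOTH sets of ranges overlap
  link <- ov(df$seqnames.df1, df$start.df1, df$end.df1) &
          ov(df$seqnames.df2, df$start.df2, df$end.df2)
  # connected components by label propagation: each row repeatedly takes
  # the smallest label among the rows it is linked to (including itself)
  comp <- seq_len(n)
  repeat {
    new <- vapply(seq_len(n), function(i) min(comp[link[i, ]]), integer(1))
    if (all(new == comp)) break
    comp <- new
  }
  # collapse each component: union of the ranges, max of each score
  out <- do.call(rbind, lapply(split(seq_len(n), comp), function(idx) {
    g <- df[idx, , drop = FALSE]
    data.frame(
      seqnames.df1 = g$seqnames.df1[1],
      start.df1 = min(g$start.df1), end.df1 = max(g$end.df1),
      seqnames.df2 = g$seqnames.df2[1],
      start.df2 = min(g$start.df2), end.df2 = max(g$end.df2),
      score1 = max(g$score1), score2 = max(g$score2),
      score3 = max(g$score3), score4 = max(g$score4)
    )
  }))
  rownames(out) <- NULL
  out
}

# The data frame from the question
df <- data.frame(
  seqnames.df1 = c("chr1", "chr1", "chr1"),
  start.df1 = c(1, 3, 3), end.df1 = c(4, 8, 14),
  seqnames.df2 = c("chr1", "chr1", "chr1"),
  start.df2 = c(2, 8, 4), end.df2 = c(5, 11, 17),
  score1 = c(0, 1, 2), score2 = c(2, 3, 4),
  score3 = c(3, 4, 5), score4 = c(5, 6, 7)
)
res <- merge_if_both_overlap(df)

# Illustrative variant (made-up values): move row 2's .df2 range to 20-25
# so that it no longer overlaps any other .df2 range
df_alt <- df
df_alt$start.df2[2] <- 20
df_alt$end.df2[2] <- 25
res_alt <- merge_if_both_overlap(df_alt)
```

Note that under this pairwise reading, the posted example collapses to a single row: row 2's .df2 range (8-11) lies inside row 3's .df2 range (4-17), and their .df1 ranges also overlap, so rows 2 and 3 are linked as well. That differs from the desired output shown in the question, so the intended rule may need to be stated more precisely. In the made-up variant `df_alt`, rows 1 and 3 merge and row 2 stays separate.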