Question

R reduce across multiple ranges by row

1

6.2 years ago

User 7754 ▴ 270

Hi,

I have a data frame in which each row carries two ranges (the .df1 and .df2 columns), and I am trying to find a common set of ranges across both simultaneously. That is, I would like to reduce rows only if both their .df1 ranges and their .df2 ranges overlap.

 df = data.frame(seqnames.df1 = c("chr1", "chr1", "chr1"), start.df1 = c(1,3,3), end.df1 = c(4,8,14), 
 seqnames.df2 = c("chr1", "chr1", "chr1"), start.df2 = c(2,8,4), end.df2 = c(5,11,17), 
 score1=c(0,1,2), score2=c(2,3,4),score3=c(3,4,5), score4=c(5,6,7))

Desired output:

out = data.frame(seqnames.df1 = c("chr1", "chr1"), start.df1 = c(1,3), end.df1 = c(14,8), seqnames.df2 = c("chr1", "chr1"), start.df2 = c(2,8), end.df2 = c(17,11), 
score1=c(1,2), score2=c(3,4),score3=c(4,5), score4=c(6,7))

Rows 1 and 3 get reduced to the union of their ranges because both the .df1 and the .df2 ranges overlap. Row 2 does not get reduced because, although its .df1 range overlaps other ranges, its .df2 range does not (the max score is kept).

Is there a clever way of doing this with GenomicRanges or other packages? I am struggling to find a good approach. For now I am creating the reduced ranges for each set and then looking for overlaps. Is this the right direction?

This is what I have so far:

library(GenomicRanges)

df1.gr = makeGRangesFromDataFrame(data.frame(chrom=df$seqnames.df1, start=df$start.df1, end=df$end.df1))
df2.gr = makeGRangesFromDataFrame(data.frame(chrom=df$seqnames.df2, start=df$start.df2, end=df$end.df2))

# reduce each set of ranges on its own
df1.main.gr = reduce(df1.gr)
df2.main.gr = reduce(df2.gr)

# map each original range onto the reduced range that contains it
hits1 = findOverlaps(df1.gr, df1.main.gr)
hits2 = findOverlaps(df2.gr, df2.main.gr)

Thank you for any suggestions!

genomicranges R ranges reduce • 2.5k views

ADD COMMENT • link updated 6.2 years ago by Ram 44k • written 6.2 years ago by User 7754 ▴ 270

0

Can you set up your inputs as BED files (BED5)? I'd like to try to help, but I loaded your data frames into R and still don't really understand what you're trying to do. Some concrete input and output would help.

ADD REPLY • link 6.2 years ago by Alex Reynolds 36k

0

Yes, I was also looking at this in R last night, but I realised that it was not clear what the OP wanted. It also looks like it would be easier outside of R.

ADD REPLY • link 6.2 years ago by Kevin Blighe 88k

0

Hi, Thank you for looking into helping.

I think R is the most appropriate because of the "GenomicRanges" package. Setting up the problem would be similarly hard with bedtools, unless there is a specific function for this situation. What I am trying to do is merge the overlaps across rows only if the ranges in both sets of columns overlap.

So, taking for example row 1 and row 2, merge the rows from the two ranges only if

[ df$seqnames.df1, start=df$start.df1, end=df$end.df1 ]

from row 1 and row 2 overlap, AND

[ df$seqnames.df2, start=df$start.df2, end=df$end.df2 ]

from row 1 and row 2 overlap. If only one of these overlaps, then we don't reduce the ranges in these rows. That is, if

[ df$seqnames.df1, start=df$start.df1, end=df$end.df1 ]

overlap, AND

[ df$seqnames.df2, start=df$start.df2, end=df$end.df2 ]

do not overlap, leave the rows independent, as they are.

Meaning, I would like to reduce the ranges only if they overlap both sequences in a row.

ADD REPLY • link updated 6.2 years ago by Ram 44k • written 6.2 years ago by User 7754 ▴ 270

score 1 · Answer 1 · 2018-09-10

1

6.2 years ago

Alex Reynolds 36k

If you just want ranges (no metadata), here is one approach using Unix streams and BEDOPS tools.

Put your files into sorted BED format:

$ sort-bed A.unsorted.bed > A.bed
$ sort-bed B.unsorted.bed > B.bed

Separate out elements which overlap by at least one base in both sets, and merge their genomic space:

$ bedops --merge <(bedops --element-of 1 A.bed B.bed) <(bedops --element-of 1 B.bed A.bed) > merge.bed

Take the union of the set of elements which do not overlap, and cut out everything but genomic space:

$ bedops --everything <(bedops --not-element-of 1 A.bed B.bed) <(bedops --not-element-of 1 B.bed A.bed) | cut -f1-3 > disjoint.bed

Take the union of the merged and disjoint space:

$ bedops --everything merge.bed disjoint.bed > answer.bed

The file answer.bed should have your ranges.

ADD COMMENT • link 6.2 years ago by Alex Reynolds 36k

0

Thank you Alex.

A.bed                      B.bed

chr1    1       4          chr1    2       5
chr1    3       8          chr1    8       11
chr1    3       14         chr1    4       17

But then this gives me only the overall merged range ("chr1:1:17")? I was thinking the solution to my problem could be to use findOverlaps by row, to check whether both overlaps are true. I am still not sure how best to approach this.

ADD REPLY • link 6.2 years ago by User 7754 ▴ 270

0

Every element in those two sets overlaps.
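
As a rough illustration of why the merge collapses to a single region, here is a base R check (not BEDOPS itself) on the A.bed and B.bed rows posted above. It assumes BED-style half-open coordinates, and `hitsOtherSet` is a made-up helper name, not a function from any package.

```r
# The A.bed and B.bed rows from the comment above.
A <- data.frame(chrom = "chr1", start = c(1, 3, 3), end = c(4, 8, 14))
B <- data.frame(chrom = "chr1", start = c(2, 8, 4), end = c(5, 11, 17))

# Hypothetical helper: for each interval in x, TRUE if it shares at least one
# base with some interval in y. Assumes half-open [start, end) coordinates.
hitsOtherSet <- function(x, y) {
    vapply(seq_len(nrow(x)), function(i) {
        any(y$chrom == x$chrom[i] & y$start < x$end[i] & x$start[i] < y$end)
    }, logical(1))
}

hitsOtherSet(A, B)  # does each A element overlap something in B?
hitsOtherSet(B, A)  # does each B element overlap something in A?
```

If both calls return all TRUE, the --not-element-of step has nothing to keep and everything ends up in the merged space.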

ADD REPLY • link 6.2 years ago by Alex Reynolds 36k

score 1 · Answer 2 · 2018-09-10

I see. Perhaps this will help you on your way:

library(GenomicRanges)

# build one GRanges object per set of range columns, keeping the other columns as metadata
df1_GR <- makeGRangesFromDataFrame(
    df,
    seqnames.field="seqnames.df1",
    start.field="start.df1",
    end.field="end.df1",
    keep.extra.columns=TRUE)

df2_GR <- makeGRangesFromDataFrame(
    df,
    seqnames.field="seqnames.df2",
    start.field="start.df2",
    end.field="end.df2",
    keep.extra.columns=TRUE)

# find row indices where the df1_GR and df2_GR ranges of the same row overlap by at least 2 base positions
indicesOverlapping <- integer(0)
for (i in seq_len(nrow(df)))
{
    if (length(queryHits(findOverlaps(df1_GR[i,], df2_GR[i,], type="any", minoverlap=2))) > 0) {
        indicesOverlapping <- c(indicesOverlapping, i)
    }
}

# reduce / collapse segments where rows have matched
final <- data.frame(
    reduce(df1_GR[indicesOverlapping,]),
    reduce(df2_GR[indicesOverlapping,]))

colnames(final) <- c(
    "seqnames.df1", "start.df1", "end.df1", "width.df1", "strand.df1",
    "seqnames.df2", "start.df2", "end.df2", "width.df2", "strand.df2")

final <- final[,-which(colnames(final) %in% c("width.df1", "strand.df1", "width.df2", "strand.df2"))]

# final result is reduced segments + those rows that did not originally match
# (setdiff also behaves correctly when no rows matched at all)
final <- rbind(
    final,
    df[setdiff(seq_len(nrow(df)), indicesOverlapping), c("seqnames.df1", "start.df1", "end.df1", "seqnames.df2", "start.df2", "end.df2")]
)

final

  seqnames.df1 start.df1 end.df1 seqnames.df2 start.df2 end.df2
1         chr1         1      14         chr1         2      17
2         chr1         3       8         chr1         8      11

This obviously doesn't include the scores, but the general structure is there [I believe] for doing what you need.
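
One possible way to carry the scores along is sketched below in base R, so it runs without Bioconductor. It repeats the same row-wise test (the .df1 and .df2 ranges of a row must share at least 2 bases) and keeps the maximum of each score column for the collapsed rows. It assumes closed, 1-based coordinates as in GenomicRanges, and it assumes that all matched rows form a single overlapping cluster on one chromosome, as they do in the example; with several separate clusters the matched rows would need to be grouped first. `collapseRows` and `withScores` are names made up for this illustration.

```r
# Example data from the question.
df <- data.frame(
    seqnames.df1 = c("chr1", "chr1", "chr1"), start.df1 = c(1, 3, 3), end.df1 = c(4, 8, 14),
    seqnames.df2 = c("chr1", "chr1", "chr1"), start.df2 = c(2, 8, 4), end.df2 = c(5, 11, 17),
    score1 = c(0, 1, 2), score2 = c(2, 3, 4), score3 = c(3, 4, 5), score4 = c(5, 6, 7))

# Row-wise test: same sequence name and at least 2 shared bases
# between the .df1 and .df2 range of the same row (closed intervals).
sharedBases <- pmin(df$end.df1, df$end.df2) - pmax(df$start.df1, df$start.df2) + 1
matched <- as.character(df$seqnames.df1) == as.character(df$seqnames.df2) & sharedBases >= 2

scoreCols <- grep("^score", colnames(df), value = TRUE)

# Hypothetical helper: collapse a set of rows into one row spanning the union
# of their ranges, keeping the maximum of every score column.
# ASSUMPTION: the rows in x form one overlapping cluster on one chromosome.
collapseRows <- function(x) {
    data.frame(
        seqnames.df1 = x$seqnames.df1[1], start.df1 = min(x$start.df1), end.df1 = max(x$end.df1),
        seqnames.df2 = x$seqnames.df2[1], start.df2 = min(x$start.df2), end.df2 = max(x$end.df2),
        lapply(x[scoreCols], max))
}

# Collapsed row first, then the rows that did not match (unchanged).
withScores <- if (any(matched)) {
    rbind(collapseRows(df[matched, ]), df[!matched, ])
} else {
    df
}
rownames(withScores) <- NULL
withScores
```

On the example data, rows 1 and 3 are collapsed into one row whose scores are the column-wise maxima of those two rows, and row 2 is kept with its own scores.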