Question

Granges manipulation

8 months ago

Bastien Hervé 5.9k

Hello,

I would rather not reinvent the wheel. I have a sorted GRanges object of overlapping transcript positions, each with a name. My goal is to aggregate the names over the positions where the transcripts overlap, while keeping the positions that are unique to a single transcript.

For example:

gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr1", "chr1"),
  ranges = IRanges(start = c(50, 75, 80, 85),
                   end = c(110, 90, 110, 120)),
  names = c("id1", "id2", "id3", "id4")
)

The expected output would be something like:

gr_output <- GRanges(
  seqnames = c("chr1", "chr1", "chr1", "chr1","chr1", "chr1"),
  ranges = IRanges(start = c(50, 75, 80, 85, 91, 111),
                   end = c(74, 79, 84, 90, 110, 120)),
  names = c("id1", "id1;id2", "id1;id2;id3", "id1;id2;id3;id4", "id1;id3;id4", "id4")
)

Maybe something with findOverlaps, reduce and aggregate, or summarise? Or perhaps another tool such as bedtools?


score 5 · Accepted Answer · 2024-02-28

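You're looking for disjoin with with.revmap=TRUE (r below is the GRanges object from the question). The revmap column lists, for each disjoint range, the indices of the input ranges that cover it: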

> d <- disjoin(r, with.revmap = TRUE)
> d
GRanges object with 6 ranges and 1 metadata column:
      seqnames    ranges strand |        revmap
         <Rle> <IRanges>  <Rle> | <IntegerList>
  [1]     chr1     50-74      * |             1
  [2]     chr1     75-79      * |           1,2
  [3]     chr1     80-84      * |         1,2,3
  [4]     chr1     85-90      * |     1,2,3,...
  [5]     chr1    91-110      * |         1,3,4
  [6]     chr1   111-120      * |             4
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> d$names <- unlist(lapply(d$revmap, \(i) paste(r$names[i], collapse = ';')))
> d
GRanges object with 6 ranges and 2 metadata columns:
      seqnames    ranges strand |        revmap           names
         <Rle> <IRanges>  <Rle> | <IntegerList>     <character>
  [1]     chr1     50-74      * |             1             id1
  [2]     chr1     75-79      * |           1,2         id1;id2
  [3]     chr1     80-84      * |         1,2,3     id1;id2;id3
  [4]     chr1     85-90      * |     1,2,3,... id1;id2;id3;id4
  [5]     chr1    91-110      * |         1,3,4     id1;id3;id4
  [6]     chr1   111-120      * |             4             id4
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
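
For reference, the same steps can be written as a standalone script. This is only a sketch under a few assumptions: the object is called gr as in the question, names is a metadata column (which is how the GRanges constructor stores it here), and the lapply/paste step is swapped for extractList plus unstrsplit, a vectorized alternative that is not part of the answer above. The revmap column is dropped at the end to match the expected output.

```r
library(GenomicRanges)  # also attaches IRanges and S4Vectors

# Input from the question; `names` ends up as a metadata column.
gr <- GRanges(
  seqnames = rep("chr1", 4),
  ranges = IRanges(start = c(50, 75, 80, 85),
                   end = c(110, 90, 110, 120)),
  names = c("id1", "id2", "id3", "id4")
)

# Split the overlapping ranges into non-overlapping pieces.
# revmap holds, per piece, the indices of the input ranges covering it.
d <- disjoin(gr, with.revmap = TRUE)

# Group the input names by revmap (a CharacterList),
# then paste each group into a single ";"-separated string.
d$names <- unstrsplit(extractList(gr$names, d$revmap), sep = ";")

# Drop the helper column so only the aggregated names remain.
mcols(d)$revmap <- NULL

d
```

Note that disjoin is strand-aware by default. All ranges in this example are unstranded, so it makes no difference here, but with stranded transcripts you would need ignore.strand = TRUE to merge names across strands.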