Question

Understanding findoverlaps function in GenomicRanges

1

Entering edit mode

3.5 years ago

alexmondaini ▴ 20

I would like to understand better the meaning of the indexes for both Granges and Grangeslist objects. I will post here what I think it's going on in this function and please do correct me if I'm wrong.

Let's assume two grange objects:

> g
GRanges object with 4 ranges and 2 metadata columns:
    seqnames    ranges strand |     score        GC
       <Rle> <IRanges>  <Rle> | <integer> <numeric>
  a     chr1   101-113      - |         1  1.000000
  b     chr2   102-112      + |         2  0.888889
  c     chr2   103-113      + |         3  0.777778
  d     chr2   150-200      * |         4  0.666667
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

and

> g2
GRanges object with 2 ranges and 2 metadata columns:
    seqnames    ranges strand |     score        GC
       <Rle> <IRanges>  <Rle> | <integer> <numeric>
  a     chr1   111-120      - |         1  1.000000
  b     chr2   102-112      + |         2  0.888889
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

If I use findoverlaps like that:

> findOverlaps(g,g2)
Hits object with 3 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  [2]         2           2
  [3]         3           2
  -------
  queryLength: 4 / subjectLength: 2

Here query is g and subject is g2 and the indexes(numbers) of queryhits from this Hits object which are 1 2 3 represent the index in the g object. Is that correct ? So 1 2 3 indexes in this case will be the first 3 rows of g, which are:

> g
GRanges object with 4 ranges and 2 metadata columns:
    seqnames    ranges strand |     score        GC
       <Rle> <IRanges>  <Rle> | <integer> <numeric>
  a     chr1   101-113      - |         1  1.000000
  b     chr2   102-112      + |         2  0.888889
  c     chr2   103-113      + |         3  0.777778

For the subjectHits these are the indexes of g2 which the query is hitting to, is that correct ? So the first index of g hits the first index of g2, the second hits the second and the third hits the second again.

If those statements are correct I think I got the idea of those indices in both queryHits and subjectHits columns.

Now things start to get confusing when using a grange object and grangelist, for example this grangelist:

> grl
GRangesList object of length 2:
$txA
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |     score        GC
         <Rle> <IRanges>  <Rle> | <integer> <numeric>
  [1]     chr2   103-106      + |         5      0.45
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

$txB
GRanges object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand |     score        GC
         <Rle> <IRanges>  <Rle> | <integer> <numeric>
  [1]     chr1   107-109      + |         3       0.3
  [2]     chr1   113-115      - |         4       0.5
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

If I use findoverlaps like that:

> findOverlaps(g,grl)
Hits object with 3 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           2
  [2]         2           1
  [3]         3           1
  -------
  queryLength: 4 / subjectLength: 2

I think the queryHits are following the same logic as before, but subjectHits is representing not the row indices of the different objects in the list but rather the list indices, is that correct ? So in this case 2 1 1 means that the first query is hitting a genomic region somewhere in the 2 index of this GrangeList object. This 2 is equal to $txB in this case.

I'm thinking about the possibilities of building a matrix of hits and missing hits using the genomicRanges package but I'm not sure if this function is appropriate for my use case:

I would like to have something like that when using findoverlaps(g,grl):

      queryHits     txa             txb
      <integer>   <logical>   <logical>
  [1]         1           0               1
  [2]         2           1               0
  [3]         3           1               0
  [4]         4           0               0

So here in this hypothetical case we have a logical vectors for both txa and txb objects , there queryhits correspond to the indexes of g and even if there are no hits the index should be included, for example index 4 doesn't hit any regions in txa and txb and so we have 0 0. For the first index it hits txb but not txa.

Any help is much appreciated!!! thanks

GenomicRanges • 1.4k views

ADD COMMENT • link 3.5 years ago by alexmondaini ▴ 20