Question

[R] Granges Findoverlaps Constrain By Id

2

Entering edit mode

12.1 years ago

bbio ▴ 90

I have two GRanges and I want to find overlaps between them. However, I only want to find overlaps for ranges that have the same parent_id. My current approach is to use the parent_id as the seqnames (instead of the chromosome) under the assumption that findOverlaps will not find matches between different seqnames and that ranges with the same parent_id are on the same chromosome anyway. Is this a valid approach? Are there any potential problems with this? Is there a better way?

r bioconductor • 3.5k views

ADD COMMENT • link updated 11.3 years ago by Biostar 20 • written 12.1 years ago by bbio ▴ 90

Ram · Answer 1 · 2013-04-04

Yes, there are some potential problems. I assume that you are importing the GRanges from an annotation file in GFF format using the rtracklayer package. If you use something different, the concern is most likely also important. From http://www.sequenceontology.org/gff3.shtml:

Parent Indicates the parent of the feature. A parent ID can be used to group exons into transcripts, transcripts into genes, an so forth. A feature may have multiple parents....

You can try this out by using the example code in the rtracklayer package, function import.gff3:

test_path <- system.file("tests", package = "rtracklayer")
test_gff3 <- file.path(test_path, "genes.gff3")

## basic import
test <- import(test_gff3)
test

When you look at the output, you will find that it is an RangedData object which can be coerced into a GRanges object by: as (test, "GRanges") if needed. The parent information is in the "Parent" column, it is of type CompressedCharacterList to accomodate multiple parent entries:

> test[,'Parent']
 RangedData with 31 rows and 1 value column across 2 spaces
          space         ranges   |                    Parent
    <factor>      <IRanges>   | <CompressedCharacterList>
1      chr10 [92828, 95504]   |                          
2      chr10 [92828, 95178]   |             GeneID:347688
3      chr10 [92828, 95504]   |             GeneID:347688
4      chr10 [92828, 94054]   |                   872,873
5      chr10 [92997, 94054]   |                   872,873

To copy these entries into the geneName column which is of type character is possibly neither necessary nor useful.

To restrict the ranges by a certain Parent ID you can do the following:

l1 = drop( test[,'Parent'])[[1]]
ind = sapply(l1,function(x, parent){parent %in% x}, 872) # filterfunction, in this example Parent := 872

test[ind,]
RangedData with 6 rows and 10 value columns across 2 spaces
     space         ranges |      source     type     score   strand     phase                     Alias    geneName          ID
  <factor>      <IRanges> |    <factor> <factor> <numeric> <factor> <integer> <CompressedCharacterList> <character> <character>
1    chr10 [92828, 94054] | rtracklayer     exon        NA        -        NA                                    NA          NA
2    chr10 [92997, 94054] | rtracklayer      CDS        NA        -        NA                                    NA          NA
3    chr10 [94555, 94665] | rtracklayer     exon        NA        -        NA                                    NA          NA
4    chr10 [94555, 94615] | rtracklayer      CDS        NA        -        NA                                    NA          NA
5    chr10 [94744, 94852] | rtracklayer     exon        NA        -        NA                                    NA          NA
6    chr10 [95348, 95504] | rtracklayer     exon        NA        -        NA                                    NA          NA
         Name                    Parent
  <character> <CompressedCharacterList>
1          NA                   872,873
2          NA                   872,873
3          NA                       872
4          NA                       872
5          NA                       872
6          NA                       872