Yes, there are some potential problems. I assume that you are importing the GRanges from an annotation file in GFF format using the rtracklayer package. If you use something different, the concern is most likely also important. From http://www.sequenceontology.org/gff3.shtml:
Parent Indicates the parent of the feature. A parent ID can be used to group exons into transcripts, transcripts into genes, an so forth. A feature may have multiple parents....
You can try this out by using the example code in the rtracklayer
package, function import.gff3
:
test_path <- system.file("tests", package = "rtracklayer")
test_gff3 <- file.path(test_path, "genes.gff3")
## basic import
test <- import(test_gff3)
test
When you look at the output, you will find that it is an RangedData object which can be coerced into a GRanges object by: as (test, "GRanges")
if needed. The parent information is in the
"Parent" column, it is of type CompressedCharacterList to accomodate multiple parent entries:
> test[,'Parent']
RangedData with 31 rows and 1 value column across 2 spaces
space ranges | Parent
<factor> <IRanges> | <CompressedCharacterList>
1 chr10 [92828, 95504] |
2 chr10 [92828, 95178] | GeneID:347688
3 chr10 [92828, 95504] | GeneID:347688
4 chr10 [92828, 94054] | 872,873
5 chr10 [92997, 94054] | 872,873
To copy these entries into the geneName column which is of type character is possibly neither necessary nor useful.
To restrict the ranges by a certain Parent ID you can do the following:
l1 = drop( test[,'Parent'])[[1]]
ind = sapply(l1,function(x, parent){parent %in% x}, 872) # filterfunction, in this example Parent := 872
test[ind,]
RangedData with 6 rows and 10 value columns across 2 spaces
space ranges | source type score strand phase Alias geneName ID
<factor> <IRanges> | <factor> <factor> <numeric> <factor> <integer> <CompressedCharacterList> <character> <character>
1 chr10 [92828, 94054] | rtracklayer exon NA - NA NA NA
2 chr10 [92997, 94054] | rtracklayer CDS NA - NA NA NA
3 chr10 [94555, 94665] | rtracklayer exon NA - NA NA NA
4 chr10 [94555, 94615] | rtracklayer CDS NA - NA NA NA
5 chr10 [94744, 94852] | rtracklayer exon NA - NA NA NA
6 chr10 [95348, 95504] | rtracklayer exon NA - NA NA NA
Name Parent
<character> <CompressedCharacterList>
1 NA 872,873
2 NA 872,873
3 NA 872
4 NA 872
5 NA 872
6 NA 872
Thank you for your response. However, I already do some filtering on my data such that each range has exactly zero or one parent. Under these circumstances, do you think my method would be appropriate?
I don't see a major problem then, except the handling of multiplicity or missing parent_ids, and that you need to keep track of the sequence name elsewhere. Just theoretically, it should work and the application case, e.g. finding genes where exons or transcripts overlap, might be useful. To be on the safe side, I would like to refer you to the bioconductor mailing list as well: http://bioconductor.org/help/mailing-list/