Hi,
I have a GRanges object representing intervals for all genes in the genome. Many of these intervals overlap. I would like to use the reduce() function from the GenomicRanges package to produce a non-overlapping set of intervals. However, I would like to do this for each gene separately: intervals belonging to the same gene should not overlap, but intervals from different genes may. One solution would be to split the GRanges by gene and apply reduce() to each subset, but I'm wondering whether there is a more efficient way?
Thanks
Example data:
chrom start end hgnc
1 100 200 MYC
1 150 300 MYC
1 400 500 MYC
1 150 230 TP53
1 200 350 TP53
1 420 550 TP53
Expected result:
chrom start end hgnc
1 100 300 MYC
1 400 500 MYC
1 150 350 TP53
1 420 550 TP53
My current solution:
# gene is the data frame used to create the initial GRanges
library(GenomicRanges)
do.call(rbind, lapply(
  split(gene, gene$hgnc),
  function(x) {
    # merge the overlapping intervals of a single gene
    as.data.frame(reduce(GRanges(x$chrom, IRanges(x$start, x$end))))
  }
))
Do you expect 2 rows for this example data? If yes, then group by hgnc and take min(start) and max(end)?
The example is maybe not the best, indeed. In this case I expect two lines, yes. But I can end up with more than one line per gene (if there are multiple non-overlapping intervals; that's why I use the reduce() function).
Please provide better data.
As long as the genes do not overlap, wouldn't simply doing this work, too?
reduce(GRanges(x$chrom, IRanges(x$start, x$end)))
I just replaced the example with dummy data better suited to the question, and added the expected result.
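For what it's worth, here is a minimal sketch of the split-then-reduce idea from the question that avoids the lapply() loop. It relies on reduce() also accepting a GRangesList, in which case each list element (here, each gene) is reduced independently. The data frame below just re-types the dummy data from the question, and the variable names (gr, by_gene, res) are made up for illustration.

```r
library(GenomicRanges)

# Dummy data copied from the question
gene <- data.frame(
  chrom = "1",
  start = c(100, 150, 400, 150, 200, 420),
  end   = c(200, 300, 500, 230, 350, 550),
  hgnc  = c("MYC", "MYC", "MYC", "TP53", "TP53", "TP53")
)

# Build the GRanges, keeping the gene symbol as a metadata column
gr <- GRanges(gene$chrom, IRanges(gene$start, gene$end), hgnc = gene$hgnc)

# Split into a GRangesList with one element per gene,
# then reduce each element on its own
by_gene <- split(gr, gr$hgnc)
reduced <- reduce(by_gene)

# Flatten back to a single GRanges; the list names (gene symbols)
# become the names of the ranges, so copy them into a column
res <- unlist(reduced)
res$hgnc <- names(res)
names(res) <- NULL

as.data.frame(res)
```

On the dummy data this should give the four intervals listed under the expected result, with MYC and TP53 still allowed to overlap each other. Whether it is noticeably faster than the lapply() version on a whole genome is something I have not measured.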