Entering edit mode
10.2 years ago
ifreecell
▴
220
Hi there, I have a GRanges object that I wanna merge duplicate ranges and append counts as a metadata columns. I know similar thing can be easily done using uniq -c
, but I want do it in R/Bioconductor. Is there any suggestion?
> originalGRanges
GRanges with 6 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] Chr1 [2, 24] +
[2] Chr1 [2, 25] +
[3] Chr1 [2, 25] +
[4] Chr1 [4, 25] +
[5] Chr1 [5, 25] +
[6] Chr1 [6, 21] +
---
seqlengths:
Chr1
NA
> expectedGRanges
GRanges with 5 ranges and 1 metadata column:
seqnames ranges strand | hits
<Rle> <IRanges> <Rle> | <numeric>
[1] Chr1 [2, 24] + | 1
[2] Chr1 [2, 25] + | 2
[3] Chr1 [4, 25] + | 1
[4] Chr1 [5, 25] + | 1
[5] Chr1 [6, 21] + | 1
---
seqlengths:
Chr1
NA
I am testing your approach on a sorted 5 million ranges object, it took hours before Rstudio encountered a fetal error. BTW,
countOverlaps
is really RAM consuming, it ate up all RAM (64G) on my computer.Yeah, I imagine that that could prove pretty memory heavy, since I don't think it assumes that things are sorted. You might split() by chromosome and then use lapply(). Perhaps that'll be a bit more reasonable.
There is only one chromosome in this GRanges, I don't think
split()
will help. I wrote the GRanges to a bed file, and it took one second foruniq -c
to compute. I really appreciate it if the entire workflow can be done in Bioconductor, is there other workaround?Is it the
unique()
step that takes forever or justcountOverlaps
(or both)? I would guess that it's just thecountOverlaps()
function. Since you only ever care about exact overlaps/matches, you might be best off writing a custom function to do this. At least if you have to do this more than once or twice.You are right, it's just the
countOverlaps()
function. Thank you anyway, I will try to write a custom function.