8.4 years ago
floris.barthel
I'm trying to overlap several hundred thousand breakpoint locations with cytobands. Because I want to account for breakpoints that span a cytoband junction, I'm using findOverlaps and then want to use split to group the resulting hits by breakpoint.
However, while searching for overlaps is extremely fast, splitting them up is tediously slow. Is there any way to make this faster?
> hits1 = findOverlaps(gr1, cbr, ignore.strand=TRUE)
> hits1
> Hits object with 871572 hits and 0 metadata columns:
>            queryHits subjectHits
>            <integer>   <integer>
>        [1]         1           1
>        [2]         2           1
>        [3]         3           1
>        [4]         4           1
>        [5]         5           2
>        ...       ...         ...
>   [871568]    871562         419
>   [871569]    871563         421
>   [871570]    871564         421
>   [871571]    871565         461
>   [871572]    871566         475
>   -------
>   queryLength: 871566 subjectLength: 862
This runs quickly enough. The next step, however, takes forever to finish:
> hits1 = split(hits1,queryHits(hits1)) ## This is extremely slow
Can I optimize this?
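For reference, here is a minimal sketch of the grouping I'm after, assuming that only the cytoband indices per breakpoint are needed rather than a list of Hits objects. It splits plain integer vectors with base R's split. The toy vectors `query_idx` and `subject_idx` are hypothetical stand-ins for queryHits(hits1) and subjectHits(hits1).

```r
# Toy stand-ins for queryHits(hits1) and subjectHits(hits1) (hypothetical values)
query_idx   <- c(1L, 2L, 3L, 3L, 4L)
subject_idx <- c(1L, 1L, 1L, 2L, 2L)

# Base R split() on plain integer vectors: one list element per breakpoint,
# holding the index of every cytoband that breakpoint overlaps
bands_by_breakpoint <- split(subject_idx, query_idx)

# Breakpoints spanning a cytoband junction overlap more than one band
spanning <- names(bands_by_breakpoint)[lengths(bands_by_breakpoint) > 1L]
```

On the real object, the equivalent call would presumably be split(subjectHits(hits1), queryHits(hits1)). I have not timed it on the full 871572 hits.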