What Is The Quickest Algorithm For Range Overlap?
15
28
Entering edit mode
14.2 years ago
User 1586 ▴ 280

A very common bioinformatics query is a range overlap, looking for features/annotations/etc that overlap a certain 'slice' of sequence.

There are many different implementations (database like search trees, nested list, binning etc.).

My question is, have people profiled these algorithms? If so, what is the best algorithm for a 'dense' dataset, for example NGS data against a genome?

If this has not been worked out perhaps it could be a code challenge?

python alignment next-gen sequencing database • 40k views
ADD COMMENT
3
Entering edit mode

I would try to differentiate between algorithms and their complexity on the one hand, and specific implementations and their profiling on the other. An algorithm has space and time complexity, which is a universally valid property of the algorithm, while an implementation has run time and memory usage, which depend on the algorithm, programming language, machine type, CPU frequency, etc.

ADD REPLY
0
Entering edit mode

It's worth noting that when doing bioinformatics, I'm usually more interested in the results than the process. So the question for me isn't what's fastest, but what's easiest and fast enough.

ADD REPLY
13
Entering edit mode
14.2 years ago

The intersectBed tool in my BEDTools suite allows one to compare NGS reads in BAM format to large annotation files in BED, GFF, or VCF (v4.0) format. The algorithm uses the UCSC hierarchical binning approach (an R-tree). I am also implementing an approach based on the fjoin algorithm that assumes both input files are sorted by chromosome, then by start position. I use intersectBed quite frequently for large BAM files, and it can be piped with samtools as well. For example:

$ samtools view -u -f 0x2 aln.bam | intersectBed -abam stdin -b exons.bed > aln.inExons.bam

I also note that, as Michael mentions regarding the complexity of tree-based approaches, it is best to load the smaller file into the tree (that is, into memory). For BEDTools, this is the -b file.
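For those working from Python rather than the shell, a roughly equivalent call through the pybedtools wrapper might look like the sketch below. This is my illustration rather than part of BEDTools itself; the file names are placeholders and the samtools proper-pair filter from the pipeline above is omitted.

import pybedtools

# reads in BAM format as the "A" file, the (smaller) exon annotation as the "B" file
aln = pybedtools.BedTool("aln.bam")
in_exons = aln.intersect("exons.bed", u=True)  # -u: report each read once if it overlaps any exon
print(in_exons.fn)                             # path to the result file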

ADD COMMENT
10
Entering edit mode
14.2 years ago
Michael 55k

As you were asking about the 'quickest', I assume you refer to the complexity of the algorithm, and in particular its time complexity. Complexity analysis is a standard tool in computer science, and using it to analyse an algorithm is normally preferred over profiling a specific implementation. However, you might instead ask which implementation is fastest. There are other aspects, like convenience of use or implementation complexity (such as lines of code, and hence the probability of coding errors), that come into play.

I agree with Aaron that the IRanges implementation is, in my humble opinion, the most convenient, especially for use with R. There are mainly two kinds of algorithms in addition to the naive implementation:

  • Interval tree based algorithms, which use a tree structure to find overlaps
  • Sorting based algorithms, which sort the intervals first

IRanges uses the first kind of algorithm.

Regarding the complexity analysis, I would like to quote the IRanges documentation, because it pretty much fully answers your question:

Notes on Time Complexity: The cost of constructing an instance of the interval tree is O(n*lg(n)), which makes it about as fast as other types of overlap query algorithms based on sorting. The good news is that the tree need only be built once per subject; this is useful in situations of frequent querying. [...]

For the query operation, the running time is based on the query size m and the average number of hits per query k. The output size is then max(mk,m), but we abbreviate this as mk. Note that when the multiple parameter is set to FALSE, k is fixed to 1 and drops out of this analysis. We also assume here that the query is sorted by start position (the findOverlaps function sorts the query if it is unsorted).

An upper bound for finding overlaps is O(min(mk*lg(n), n+mk)). The fastest interval tree algorithm known is bounded by O(min(m*lg(n), n) + mk) but is a lot more complicated and involves two auxiliary trees. The lower bound is Omega(lg(n)+mk), which is almost the same as for returning the answer, Omega(mk). The average is of course somewhere in between. This analysis informs the choice of which set of ranges to process into a tree, i.e. assigning one to be the subject and the other to be the query. Note that if m > n, then the running time is O(m), and the total operation of complexity O(n*lg(n) + m) is better than if m and n were exchanged. Thus, for once-off operations, it is often most efficient to choose the smaller set to become the tree (but k also affects this). [...]
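To make the 'choose the smaller set to become the tree' advice concrete, here is a rough back-of-the-envelope comparison in Python. The numbers are invented and the formulas simply plug into the bounds quoted above, so treat it purely as an illustration.

from math import log2

# hypothetical sizes: n annotated features, m reads, k average hits per query
n, m, k = 30_000, 10_000_000, 2

# build the tree on the n features, then query it with the m reads
tree_on_features = n * log2(n) + min(m * k * log2(n), n + m * k)

# roles swapped: build the tree on the m reads, query it with the n features
tree_on_reads = m * log2(m) + min(n * k * log2(m), m + n * k)

print(f"tree on features: ~{tree_on_features:.3g} operations")
print(f"tree on reads:    ~{tree_on_reads:.3g} operations")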

ADD COMMENT
9
Entering edit mode
14.2 years ago

The bx-python library is excellent for doing overlaps.

Specifically, check out the intersection and cluster interfaces.

For algorithm implementation discussions and timings, see: http://hackmap.blogspot.com/2008/11/python-interval-tree.html

This previous related question has example code by Istvan.
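A minimal usage sketch is below; the Intersecter/Interval names and arguments are quoted from memory, so treat them as assumptions and check the bx-python documentation and the linked post for the exact API.

from bx.intervals.intersection import Intersecter, Interval

tree = Intersecter()
tree.add_interval(Interval(3000, 4000, value="exon1"))
tree.add_interval(Interval(3500, 5000, value="exon2"))

# all stored intervals overlapping the query range 3900-4500
hits = tree.find(3900, 4500)
print([iv.value for iv in hits])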

ADD COMMENT
8
Entering edit mode
14.2 years ago
Aaron Statham ★ 1.1k

The findOverlaps implementation in the Bioconductor IRanges package is plenty fast enough for me. I regularly use it with tens of millions of reads and tens of thousands of features, and I'm sure it can scale further.

ADD COMMENT
1
Entering edit mode

I've used IRanges to intersect 1e7 positions against 2.5e6 positions with no problems, and it probably only uses 2 gigs or so of memory. So yeah, it scales well.

ADD REPLY
0
Entering edit mode

I've not used IRanges --- how does it scale when at least one of the two datasets is large (e.g. 10E6 to 10E10 ranges), as is the case for NGS alignments? I am asking out of sheer curiosity.

ADD REPLY
6
Entering edit mode
14.2 years ago
Mitch Skinner ▴ 660

To some extent, the answer depends on the data; I've been using nested containment lists a lot, and they have good typical-case algorithmic complexity, and a pretty good constant factor compared to R-Tree based approaches (going by the benchmarks in the original paper). They have bad worst-case complexity, though: if each of your intervals is contained within all the preceding intervals, the NCList degrades into a linked list, and queries are O(n).

Given the processes that generate most biological data, I think it's okay to choose based on the typical case. In computer science generally, you have to worry about things like adversaries who are trying to attack you, who will purposely choose worst-case data. But in bioinformatics that's less of a concern in my experience.

If you have a large number of queries that you want to run in a batch, like the intersection use case that Aaron Quinlan and the IRanges documentation quoted by Michael Dondrup are talking about (say, you have a set of genes and you want to find all the SNPs within a certain range upstream of each gene), then you can sort the list of genes and the list of SNPs and compare them by streaming through the two sorted lists simultaneously. This is rather like a database merge join, although you can't actually use a merge join because overlap isn't transitive the way most comparison operators are. I think this is the "fjoin" approach Aaron mentioned.

If you do the sort-both-sides-of-the-comparison thing, supposing you have m genes and n snps, with k snps overlapping each gene on average, then the complexity of the whole thing is O(m * log(m) + n * log(n)) to set up, and O(m * k + n) to run. That may be better than doing the queries separately against a tree-based query mechanism, which will be O(n * log(n)) to set up, and O(m * (log(n) + k)) to run.

Assuming I've got all that right, then which approach is faster depends on m, n, and k, and it also depends on whether or not you can amortize the setup cost over many intersection runs.
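To make the streaming comparison concrete, here is a small Python sketch of the idea (my own illustration, not Mitch's or Aaron's code). Intervals are half-open (start, end) tuples on a single chromosome, and both lists must be sorted by start.

def sweep_overlaps(genes, snps):
    """Yield (gene, snp) pairs that overlap; both lists are sorted by start."""
    active = []                      # SNPs that may still overlap upcoming genes
    i = 0                            # index of the next unseen SNP
    for g_start, g_end in genes:
        # pull in every SNP that starts before this gene ends
        while i < len(snps) and snps[i][0] < g_end:
            active.append(snps[i])
            i += 1
        # drop SNPs that end at or before this gene starts: later genes start
        # at or after g_start, so those SNPs cannot overlap them either
        still_active = []
        for s_start, s_end in active:
            if s_end <= g_start:
                continue
            still_active.append((s_start, s_end))
            if s_start < g_end:
                yield (g_start, g_end), (s_start, s_end)
        active = still_active

genes = [(100, 500), (400, 900)]
snps = [(90, 91), (450, 451), (800, 801)]
print(list(sweep_overlaps(genes, snps)))   # the (450, 451) SNP hits both genes

Setup is the two sorts, and the scan itself touches each gene and each SNP once plus the reported overlaps, matching the O(m*k + n) running cost given above.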

ADD COMMENT
0
Entering edit mode

Hi, I am curious about how the nested containment list scales with respect to read depth for NGS type data. If you had, say 20X coverage would this be a bad case scenario? Thanks for your thoughts!

ADD REPLY
0
Entering edit mode

I'm genuinely interested to see a real use-case where a nested containment list outperforms an interval tree. Also, what implementation of NCList are you using?

ADD REPLY
0
Entering edit mode

@unknown: as far as I know, NCLists will typically scale just fine with deep NGS data. The only bad case for nclists is data with lots of containment, and if your reads are all the same length, then cases where one read is contained within another should be rare. I guess another not-so-good case for NCLists is where the data gets updated a lot--I think something decent could be done there, but I don't think anything like that has been described yet.

ADD REPLY
0
Entering edit mode

@Brent: Hi! I was only going by the benchmarks in the original NCList paper that compared their NCList implementation with postgres R-Trees. It's not a perfect comparison, though, so it might be interesting to do some more benchmarking. And I haven't really done the comparison myself. If you have a python interval tree, you could benchmark it against the pygr NCList implementation. I think the pygr one is a python wrapper around a C implementation, though.

I've been using my own NClist implementation; I've been doing the construction in Perl and the querying in javascript.

ADD REPLY
0
Entering edit mode

@unknown: the scalability testing that I've done has been with simulated 30X unpaired reads on the first few human chromosomes. And my code doesn't handle that volume of data very well, but I think that's more the fault of my code than of the algorithm itself (I only have a batch-mode version of NCList construction implemented so far, and not a streaming-mode version, yet).

Maybe a code golf (as you suggested) and/or some additional public benchmarking would be a nice way of getting some more clarity about the applicability of NCLists relative to other approaches.

ADD REPLY
4
Entering edit mode
14.2 years ago
brentp 24k

Just for completeness, I'll mention the recently published BigBed/BigWig format and implementation which

"allow the high-performance display of next-generation sequencing"next-generation sequencing"

They have a mixed B-Tree / 1-D R-Tree implementation described in the paper, whose size "is typically less than 1% of the size of the data itself." They also briefly mention advantages over a nested containment list.

ADD COMMENT
3
Entering edit mode
14.2 years ago

You may want to investigate NCList as an alternative to the more commonly used binning approach. Python libraries are available from pygr.

ADD COMMENT
2
Entering edit mode
14.2 years ago

Try the overlapSelect utility from the UCSC Source Tree

I'm not familiar with the details of the algorithm this utility implements, but I believe it inherits fast intersection capabilities from being built around the MySQL libraries.

ADD COMMENT
2
Entering edit mode
9.8 years ago

One of the functions of the bedextract tool, which has been part of the BEDOPS suite for the last couple of years, is to perform a binary search through sorted BED data to retrieve elements that overlap a specified input range.

If we know the start and end byte offsets of the input file, we can jump to the middle byte offset with fseek() (or similar calls) and take a look around. We can then jump to the middle of the left or right half, depending on where we need to go next, and repeat as needed, given the lexicographical ordering of sort-bed-processed BED data.
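Here is a rough Python sketch of that byte-offset bisection (my own illustration, not the bedextract source). It finds the first line whose (chromosome, start) key is at or beyond a target key in a sort-bed-sorted file; a real tool also has to handle elements that start before the queried range but extend into it, which this sketch ignores.

import os

def first_line_at_or_after(path, chrom, pos):
    """Return the first line of a sorted BED file whose (chrom, start) is
    >= (chrom, pos), or b"" if there is none."""
    def next_line(f, offset):
        # land anywhere in the file, then realign to the start of the next full line
        f.seek(offset)
        if offset > 0:
            f.readline()                       # discard the (possibly partial) current line
        return f.readline()

    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:                         # bisect on byte offsets, not on line numbers
            mid = (lo + hi) // 2
            line = next_line(f, mid)
            if not line:                       # fell off the end of the file
                hi = mid
                continue
            c, start, _rest = line.split(b"\t", 2)
            if (c.decode(), int(start)) < (chrom, pos):
                lo = mid + 1
            else:
                hi = mid
        return next_line(f, lo)

# e.g. first_line_at_or_after("foo.bed", "chr7", 27000000)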

If the BED data doesn't change and is queried frequently, sorting is a one-time cost, since BEDOPS operations generally output sorted data that can go back into other tools that also accept sorted data. The sort-bed tool runs faster than GNU sort and sortBed, which minimizes the sort time expense.

To give a more concrete idea of the benefits of a binary search through sorted data, in local testing, a naïve O(n)-walkthrough of a 36 GB sorted BED input to get a list of chromosomes (using UNIX cut and uniq utilities) took approximately 20 minutes to complete on a typical Core 2 Duo-based Linux workstation. Retrieval of the same chromosome listing with bedextract --list-chr took roughly 2 seconds - a 600-fold improvement.

BEDOPS can improve performance even further by facilitating parallelized approaches, using bedextract to quickly split a BED file by chromosome (or to extract a chromosome's worth of data from an efficient Starch archive). We can then run bedextract again on each per-chromosome file to query for elements. Parallelization is accomplished by assigning per-chromosome bedextract queries to separate computational nodes, for instance, or even per-chromosome bedmap operations:

for chrName in `bedextract --list-chr foo.bed`; do \
    qsub <options> bedmap --chrom ${chrName} --operations foo.bed ... > answer.${chrName}.bed; \
done

This kind of approach is perhaps a bit more forward-looking, in that it facilitates use of modern distributed computing resources that genomics analyses have been transitioning towards.

ADD COMMENT
2
Entering edit mode
9.8 years ago
lh3 33k

There are three interval query patterns:

  1. Pattern 1: a genome browser/alignment viewer retrieving all features/alignments overlapping a 1 kb window. If the sizes of the features/alignments vary a lot, binning, interval trees, and nested containment lists are still the best data structures. The BAM/tabix indices and UCSC binning are designed for this use case. If feature sizes are ALL small (e.g. genomic short-read alignments), it is possible to use a much simpler linear index, which is practically faster in this special case.
  2. Pattern 2: retrieve all features overlapping regions in another large BED file (say with 100k lines), where we don't care which lines in the BED file overlap the features. For this use case, it is better to build a simple linear index and scan through the feature file (see the sketch after this list). An interval tree for the second BED file is overkill and potentially hurts performance.
    Sometimes the second BED file could be sparse in some regions but dense in others. The ideal method would be to index both the first feature file and the second BED file, though this rarely gets implemented.
  3. Pattern 3: identify features in file 1 overlapping features in file 2. This is similar to pattern 2, except that we care about which lines in file 2 overlap a feature in file 1. This is essentially BED intersection. We could process two sorted BED streams (the bedops strategy), or index one file and stream through the other (the classical bedtools strategy). The bedextract method is related, but not quite the same. In the end, there is no universally optimal method. It all depends on the distribution of regions.
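A toy version of the linear index mentioned in patterns 1 and 2 might look like the following (my own Python sketch for a single chromosome; features are half-open (start, end) tuples sorted by start, and the 16 kb window size mirrors the tabix/BAM linear index but is otherwise arbitrary).

WINDOW = 16384

def build_linear_index(features, chrom_len):
    """For each window, record the index of the earliest feature that a query
    starting in that window could possibly overlap."""
    n_bins = chrom_len // WINDOW + 1
    index = [len(features)] * n_bins
    for i, (start, end) in enumerate(features):
        for b in range(start // WINDOW, (end - 1) // WINDOW + 1):
            index[b] = min(index[b], i)
    # carry the earliest index backwards: a query starting in an empty window
    # may still be satisfied by features that only touch later windows
    for b in range(n_bins - 2, -1, -1):
        index[b] = min(index[b], index[b + 1])
    return index

def linear_query(features, index, q_start, q_end):
    """Yield every feature overlapping [q_start, q_end)."""
    b = min(q_start // WINDOW, len(index) - 1)
    for start, end in features[index[b]:]:
        if start >= q_end:                     # start-sorted, so nothing later can overlap
            break
        if end > q_start:
            yield (start, end)

feats = [(0, 100), (50_000, 50_010), (50_005, 80_000)]
idx = build_linear_index(feats, chrom_len=100_000)
print(list(linear_query(feats, idx, 49_000, 50_007)))

A single very long feature forces every later query's scan to start much earlier, which is the robustness problem with this kind of index that is mentioned further down the thread.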
ADD COMMENT
0
Entering edit mode

As for Pattern 3, one note is that bedtools has had the ability to process two (or N) sorted input streams for a couple of years now and, as expected, it is much faster than the index one/stream the other method. Moreover, it uses far less memory.

ADD REPLY
1
Entering edit mode
13.0 years ago

I needed to take a load of regions (queries) and find those that do not overlap annotated genomic features. I did it very badly a few times, but came up with this alternative that has reduced the cost so much that I'm now bottlenecked by fetching the regions and rendering the output.

import scala.collection.immutable.SortedSet

// the database queries below use an ORM query DSL (Squeryl-style from/where/select)
val chromId = ... // the chromosome ID

// order (position, ID set) pairs by their first element, the position
val by_1 = Ordering.by((_:(Int, Set[Long]))._1)

// we're going to build a sorted set from range start to the set of all
// range IDs that start before or at that position
val builder_1 = SortedSet.newBuilder(by_1)
var idsSoFar: Set[Long] = Set.empty
for((start, id) <- from(regions)(r =>
  where(r.genomic_sequence_id === chromId)
    select(r.start, r.id) orderBy(r.start)))
{
  idsSoFar += id
  builder_1 += (start -> idsSoFar)
}
val ranges_1 = builder_1.result()

// now we're going to build a sorted set from range end to the set of all
// range IDs that end after that position
val builder_2 = SortedSet.newBuilder(by_1)
idsSoFar = Set.empty // reset the accumulator declared above for the second pass
for((stop, id) <- from(regions)(r =>
  where(r.genomic_sequence_id === chromId)
    select(r.stop, r.id) orderBy(r.stop).desc))
{
  idsSoFar += id
  builder_2 += (stop -> idsSoFar)
}
val ranges_2 = builder_2.result()

// now we can compare a bazillion items to this to see if they overlap
for(query <- queryRegions) {

  // slice the sorted set to get all ranges that start before the query stops
  val startsBefore = ranges_1.rangeImpl(None, Some(query.stop -> Set.empty))

  // slice the sorted set to get all ranges that stop at or after the query starts
  val stopsAfter = ranges_2.rangeImpl(Some(query.start -> Set.empty), None)

  // if one or the other set is empty, we don't have any overlaps
  val nonOverlap = if(startsBefore.isEmpty || stopsAfter.isEmpty) true
  else {
    val (_, lbIds) = startsBefore.lastKey // IDs of all ranges starting before the query stops
    val (_, faIds) = stopsAfter.firstKey  // IDs of all ranges stopping at or after the query starts

    val (smaller, larger) = if(lbIds.size < faIds.size) (lbIds, faIds) else (faIds, lbIds)

    val anyShared_? = smaller exists larger // a Set[Long] is a Long => Boolean, so this tests each element of the smaller set for membership in the larger

    ! anyShared_?
  }

  doSomethingUseful(query, nonOverlap)
}

To get any space efficiency, you must use a structure-sharing immutable set implementation for the sets of range IDs. That way, the total cost of all those sets of range IDs will be approximately linear on the cost of the largest set. If each is a fully independent set, the cost will be quadratic on the number of ranges, which may lead to crying or the killing of kittens.

The intuition is that, since a overlaps b iff a.start < b.end and a.end > b.start, we can store a pre-computed map of the IDs of all ranges with start < x and, separately, of all ranges with end > y, and then simply perform a set intersection to combine these two restrictions. If we're only interested in the existence or otherwise of an overlap (not the overlapping regions themselves), then we just need to see if any single item in the smaller set is in the bigger one, and stop as soon as we find one.

This could be sped up further by taking advantage of a structured, sorted representation of the ID sets to get performance close to O(log(smaller) x log(larger)). However, I haven't bothered with this yet, as it's fast enough :)

ADD COMMENT
1
Entering edit mode
9.8 years ago

Since typically the data structure being searched is constant, I like to use binary search of a sorted array of range objects, or ideally structs, each containing a start and stop, sorted by start then stop. For fast lookup, it's best to use an array of objects/structs rather than an array of pointers; if the elements are large enough that moving them is expensive, you can always sort an array of pointers first and then use that to create the array of objects/structs. This can be greatly accelerated by indexing; for example, if you want to see what intersects position 13934827:

13934827 & 0xFFFFF000 = 13934592 (0xD4A000)
13934592 (array lookup) -> {27924, 28710} (for example)

...so in one AND and one cached read, you reduce the search space to a very small interval for the binary search, minimizing random accesses.

I find explicit trees to be inefficient in time and memory when all you need is a sorted set of elements, so I mainly use them for data structures that are constantly being mutated.
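As a rough Python sketch of this scheme (my own illustration, not the author's implementation): closed-coordinate (start, stop) intervals sorted by start, with 4 kb blocks to match the 0xFFFFF000-style masking above.

import bisect

MASK_BITS = 12   # 4 kb blocks: pos >> 12 picks the block, like pos & 0xFFFFF000

def build_block_index(intervals, genome_len):
    """intervals: list of (start, stop) in closed coordinates, sorted by (start, stop).
    For every block, precompute the slice of the array that any position falling
    in that block needs to search."""
    n = len(intervals)
    n_blocks = (genome_len >> MASK_BITS) + 1
    lo = [n] * n_blocks          # will hold the lowest element whose stop reaches the block
    hi = [0] * n_blocks          # will hold one past the last element starting at or before the block max
    for i, (start, stop) in enumerate(intervals):
        b = min(stop >> MASK_BITS, n_blocks - 1)
        lo[b] = min(lo[b], i)
    # an element whose stop lies in a later block also reaches this block's minimum,
    # so carry the earliest index backwards
    for b in range(n_blocks - 2, -1, -1):
        lo[b] = min(lo[b], lo[b + 1])
    starts = [s for s, _ in intervals]
    for b in range(n_blocks):
        block_max = ((b + 1) << MASK_BITS) - 1
        hi[b] = bisect.bisect_right(starts, block_max)
    return lo, hi

def intersecting(intervals, lo, hi, pos):
    """Return the intervals containing pos, searching only the indexed slice."""
    b = pos >> MASK_BITS
    return [(s, e) for s, e in intervals[lo[b]:hi[b]] if s <= pos <= e]

ivs = sorted([(13_900_000, 13_950_000), (13_934_000, 13_934_100), (20_000_000, 20_000_500)])
lo, hi = build_block_index(ivs, genome_len=250_000_000)
print(intersecting(ivs, lo, hi, 13_934_827))

Within the precomputed slice a plain scan is shown for simplicity; with many elements per block, you would binary-search inside the slice as described above.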

ADD COMMENT
0
Entering edit mode

Say we have a feature spanning [12000000,16000000]. It overlaps 13934827. How does your method work? Do you split the interval into smaller ones, or add additional data structures?

ADD REPLY
0
Entering edit mode

It still works, but is less efficient when there are giant ranges spanning each other, which is fortunately rarely the case in genomics. When you create the index, every index entry must yield the lower and upper addresses of all elements spanning the area into which the entry was condensed. So in this case, if you index 13934592-13938687 (D4A000-D4AFFF) into a single entry, that entry must contain the address of the lowest element with a stop of at least the range min (13934592), and the address of the greatest element with a start of at most the range max (13938687). Indexing this is still O(N) in time where N is the number of elements (aka ranges or addresses), and O((genome size)/(2^(masked bits))) in memory.

ADD REPLY
1
Entering edit mode

This is essentially the linear index. Long features are rare, but a single long feature (e.g. a long intron, a large deletion or a bug in the mapper) will ruin such an index. It is a nice index when data fit, but not robust enough to be a generic solution.

ADD REPLY
0
Entering edit mode
11.2 years ago
always_learning ★ 1.1k

For a similar task I have used interval tree modules in Perl, which are basically written in C++; there are also some Python implementations of this data structure available.

ADD COMMENT
0
Entering edit mode
9.8 years ago

I typically use the interval tree data structure for range overlap queries.

ADD COMMENT
0
Entering edit mode
6.5 years ago
endrebak ▴ 970

The fastest implementation in Python I have found is the Nested Containment List. It is several times faster than the ultrafast interval trees found in the kernel. pygr has died from bitrot, but I have revived the underlying data structure here: https://github.com/hunt-genes/ncls

This data structure is used in pyranges, if you want a more convenient interface to it, similar to Bioconductor's GenomicRanges: https://github.com/endrebak/pyranges

Note that pybedtools is faster for overlaps and the like in Python, but it does not keep the data in memory, so you need to load/save it if you want to do exploratory analyses and computations on the data.
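A minimal query sketch with the ncls package is below; the constructor and method names are from memory, so treat them as assumptions and check the README linked above for the actual API.

import numpy as np
from ncls import NCLS

starts = np.array([5, 100, 2_000], dtype=np.int64)
ends   = np.array([50, 500, 2_500], dtype=np.int64)
ids    = np.array([0, 1, 2], dtype=np.int64)

ncls = NCLS(starts, ends, ids)            # build the nested containment list once
print(list(ncls.find_overlap(40, 120)))   # intervals overlapping [40, 120)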

ADD COMMENT
