Fast Interval Intersection Methodologies
4
17
Entering edit mode
14.9 years ago
Biostar User ★ 1.0k

Most genomic annotations are specified as intervals along the genome.

Provide code examples in your programming language that demonstrate the use of fast interval querying.

genomics • 21k views
ADD COMMENT
0
Entering edit mode

anyone benchmarked a nested containment list against an interval tree? other than pygr are there any NCL's in the wild?

ADD REPLY
16
Entering edit mode
14.9 years ago

This code example generates 10,000 intervals then queries them for overlapping regions. Requires only the presence of Python.

The code below requires the either the installation of the bx python package or alternatively you may just download the quicksect.py module and place it next to the script itself:

from random import randint, seed

# if you can install bx python then uncomment the line below
#
# from bx.intervals.operations.quicksect import IntervalNode

# otherwise just download the quickset module as shown above 
# and place it in next to your program
#
from quicksect import IntervalNode

# the span of the generated intervals
SPAN = 10

# the size of the genome
SIZE = 5*10**4

# the number of intervals
N = 10**4

def generate(x):
    "Generates random interval over a size and span"
    lo = randint(10000, SIZE)
    hi = lo + randint(1, SPAN)
    return (lo, hi)

def find(start, end, tree):
    "Returns a list with the overlapping intervals"
    out = []
    tree.intersect( start, end, lambda x: out.append(x) )
    return [ (x.start, x.end) for x in out ]

# use this to force both examples to generate the same data
seed(10)

# generate 10 thousand random intervals
data = map(generate, xrange(N))

# generate the intervals to query over
query = map(generate, xrange(10))

# start the root at the first element
start, end = data[0]
tree = IntervalNode( start, end )

# build an interval tree from the rest of the data
for start, end in data[1:]:
    tree = tree.insert( start, end )

for start, end in query:
    overlap = find(start, end , tree)
    print '(%s, %s) -> %s' % (start, end, overlap)

Produces the output:

(41901, 41903) -> [(41894, 41902)]
(36981, 36987) -> [(36981, 36984), (36973, 36982), (36978, 36987)]
(36338, 36339) -> [(36337, 36347)]
(32741, 32748) -> [(32738, 32742)]
(49864, 49872) -> [(49859, 49865)]
(21475, 21477) -> []
(29425, 29428) -> [(29418, 29426), (29419, 29426)]
(29590, 29599) -> [(29586, 29595), (29596, 29598)]
(12804, 12811) -> [(12806, 12811), (12799, 12806), (12809, 12819)]
(30339, 30343) -> [(30336, 30346), (30335, 30345), (30340, 30341)]
ADD COMMENT
10
Entering edit mode
14.9 years ago

This code example generates 10,000 intervals then queries them for overlapping regions. The installation requires a C compiler and Python.

This is a faster solution than the other example, one that now requires the full installation of the bx-python module. The data structure is implemented in C and is at least one order of magnitude faster than the quicksect.py module presented in another example.

from random import randint, seed
from bx.intervals.intersection import Intersecter, Interval

# the span of the generated intervals
SPAN = 10

# the size of the genome
SIZE = 5*10**4

# the number of intervals
N = 10**4

def generate(x):
    "Generates random interval over a size and span"
    lo = randint(10000, SIZE)
    hi = lo + randint(1, SPAN)
    return (lo, hi)

# use this to force both examples to generate the same data
seed(10)

# generate 10 thousand random intervals
data = map(generate, xrange(N))

# generate the intervals to query over
query = map(generate, xrange(10))

# create the interval tree
tree = Intersecter()

# build an interval tree from the rest of the data
for start, end in data:
    tree.add_interval( Interval(start, end) )

# perform the query
for start, end in query:
    overlap = tree.find(start, end)
    out = [ (x.start, x.end) for x in overlap ]
    print '(%s, %s) -> %s' % (start, end, out)

Produces the output:

(41901, 41903) -> [(41894, 41902)]
(36981, 36987) -> [(36973, 36982), (36978, 36987), (36981, 36984)]
(36338, 36339) -> [(36337, 36347)]
(32741, 32748) -> [(32738, 32742)]
(49864, 49872) -> [(49859, 49865)]
(21475, 21477) -> []
(29425, 29428) -> [(29418, 29426), (29419, 29426)]
(29590, 29599) -> [(29586, 29595), (29596, 29598)]
(12804, 12811) -> [(12799, 12806), (12806, 12811), (12809, 12819)]
(30339, 30343) -> [(30335, 30345), (30336, 30346), (30340, 30341)]
ADD COMMENT
0
Entering edit mode

This is awesome! Thanks for sharing that, Istvan.

ADD REPLY
7
Entering edit mode
14.9 years ago

An alternative is to use BED files to store data and BEDTools to calculate the intersection.

The BED format is a format used by UCSC and many other projects to generically store annotations on genomes. Sometimes, it is easier to use just plain BED files to store annotations, instead of complex databases or HDF5, and BEDTools make it also faster to access them. Moreover, with BEDFiles you have other advantages, as you can use custom tracks on ucsc or gbrowse and there are other tools that use BED as input.

Anyway, to solve the problem of intersection with BEDTools, you would do:

intersectBed -a segdups.bed -b exons.bed
ADD COMMENT
0
Entering edit mode
7.2 years ago
daniel ▴ 30

I've recently developed a C++ library, Mappable, which efficiently handles intervals. It allows you to write stuff like this:

std::vector<MyMappableType> mappables {};
/* insert data */
const auto overlapped = overlap_range(mappables, some_interval); // returns an iterator range
for (const auto& mappable : overlapped) {
    std::cout << mappable << std::endl;
}
std::cout << count_contained(mappables, some_interval) << std::endl;

One really nice thing about the library is that the algorithms are independent of the container used to store the data (like the C++ STL).

I've benchmarked against interval tree based methods and the performance advantage can be quite striking - both in memory usage and query time (see benchmark code on Github repo).

ADD COMMENT
1
Entering edit mode

I would recommend providing a Python interface to it (as an extension library) as that would significantly raise its utility to many more people.

ADD REPLY
1
Entering edit mode

I would suggest usage of pybind11 to write such a python extension; this amazing library is really a good choice to expose fast C++ algorithms implementations in python in order to make them more popular as Istvan said.

ADD REPLY

Login before adding your answer.

Traffic: 1734 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6