Couchdb (Or Similar) For Annotations, Alignments, Etc.
5
11
Entering edit mode
12.9 years ago
Marvin ▴ 900

I'd like to store genomic annotations (think bed or bam files, lots of bam files in fact) in some sort of database, preferably something that is distributed and facilitates parallel processing (something like MapReduce). CouchDB, MongoDB, Cassandra seem to fit the bill more or less.

Such a scheme only makes sense if it's possible to run queries like "give me all annotations overlapping this region", or something like 'samtools view'. Which database supports this kind of interval index or where would it be possible to retrofit one?

To clarify: I'm aware of the UCSC binning scheme, which looks like an ugly hack to me. The BAM format adds another hack to that, apparently because the binning scheme underperforms for bigger data sets. What I'm actually looking for is something like the priority search queue described here: http://www.soi.city.ac.uk/~ross/papers/FingerTree.html (It works for every search tree, not just finger trees.)

database annotation • 5.0k views
ADD COMMENT
0
Entering edit mode

Well, IMO the UCSC binning is a very clever hack for a database engine that does not support the R-tree index. You do not have a much better way to do that with MySQL. Also, correct me if I am wrong: finger tree seems an alternative to AVL and red-black tree. If this is the case, it is not efficient for interval queries. K-d tree and R-tree are the appropriate data structures for interval queries. UCSC binning is essentially an R-tree.

ADD REPLY
0
Entering edit mode

About MongoDB. I heard that MongoDB is lightening fast when the data fits memory, but otherwise it becomes quite slow. A Google result: http://www.colinhowe.co.uk/2011/02/23/mongodb-performance-for-data-bigger-than-memor/

ADD REPLY
0
Entering edit mode

The point isn't really about Finger Trees, any balancing scheme works. The point is about the annotation, which might actually be equivalent to the (1-dimensional) R-tree.

ADD REPLY
0
Entering edit mode

I'll add my +1 to MongoDB.

ADD REPLY
0
Entering edit mode

No, it is wrong that "any balancing scheme works". Interval query is a special type of query which makes many tree structures inefficient. You cannot use a B-tree, AVL or red-black tree to achieve the efficiency of the binning index. This is exactly why UCSC came up with this binning index when B-tree is widely used in all RDBMS. If your point is not about finger tree, then remove it to avoid confusion.

ADD REPLY
0
Entering edit mode

Hinze's finger tree is equivalent to an annotated (2,3)-B-tree, which is again equivalent to ab annotated red-black-tree. I'm asking if any database can annotate its B-tree index.

ADD REPLY
1
Entering edit mode

You see not understand the problem. You keep mentioning annotation, but what you really want to solve is an interval query problem. B-tree, together with finger, red-black and AVL trees, is a wrong data structure for interval queries. No matter how much you can annotate these kind of trees, you cannot achieve efficient interval query. UCSC binning is the only method I know that works efficiently on B-tree.

ADD REPLY
2
Entering edit mode

What the (marvelous) Hinze-Paterson paper says (section 4.8) is that you can implement (1-dim) interval trees via finger trees, just by choosing the right monoid; you get O(log(n)) for finding some overlapping interval, and getting all m of them in O(m log(m/n)). Not an expert in this area, but AFAIK this is as efficient as you can get.

In general, finger trees are best thought of as a "machine" producing data structures, given a monoid, not as a single data structure. See this blog post for example Apfelmus - monoids and finger trees

ADD REPLY
0
Entering edit mode

It seems that I was wrong about finger tree. Probably I was just making a fuss over the sentence "[the UCSC binning scheme] looks like an ugly hack". Without the last paragraph, which is certainly inaccurate, this is actually a pretty good question.

ADD REPLY
2
Entering edit mode

Well, actually comparing interval trees as implemented with finger trees (let's call this finger interval trees) and traditional interval trees it's kind of an apples to oranges comparison: finger interval trees (as any finger tree data structure) are persistent data structures. From a performance standpoint, query times for finger interval trees are amortized, require a language with support for lazy evaluation, and are a bit worse than that of traditional interval trees; you get however the benefits of a persistent data structure.

ADD REPLY
0
Entering edit mode

Thanks for the info. Now I understand why few implemented fingertree - because without lazy evaluation, implementing a fingertree is challenging. I wish someone could compared the performance of fingertree to other traditional data structures (better theoretical complexity may not always guarantee practical efficiency), but I could not find it.

ADD REPLY
10
Entering edit mode
12.9 years ago

You can also store such files with sqlite using R-Trees.

I was actually doing that for a while -- series of tables that would hold ts-start and ts-end coords for "genes" might look something like this:

CREATE TABLE gene(
  id INTEGER PRIMARY KEY, -- ID to link with the RTree Index
  entrez_id VARCHAR,      -- entrez id
  chromosome_id INTEGER,  -- Fkey to chromosome.id
  strand INTEGER,         -- 1 : plus, 2 : minus
  FOREIGN KEY (chromosome_id) REFERENCES chromosome(id)
);
CREATE UNIQUE INDEX geneId on gene(entrez_id);
CREATE INDEX geneChromosome on gene(chromosome_id);
CREATE INDEX geneStrand on gene(strand);

-- For quick retrievel of genes that lie between pts x and y on a chromosome
-- build a query around this RTree
CREATE VIRTUAL TABLE gene_tree USING rtree_i32(
  id,                     -- primary & foreign key to gene.id
  start,                  -- transcription start
  end,                    -- transcription end
  chr1,                   -- bound on the chromosome (fkey to chromosome.id)
  chr2                    -- same
);

You can do overlap queries on your rtree tables by querying start/stop and chromosome then join them by their id to the appropriate table for more meta information about that "gene". I reckon you can add another "strand" dimension to the rtree if you want do the entire query in one shot.

This being sqlite and all, it doesn't really qualify for your "distributed" requirement, but there are other databases, like PostgreSQL, that support rtree's, too.

ADD COMMENT
3
Entering edit mode

+1. Didn't know sqlite now supports R-tree. Nice!

ADD REPLY
0
Entering edit mode

You're right, sqlite is probably not the right tool if you want to store billions of short reads. But at least MongoDB has geospatial indexing, it might be similar enough to be useful.

ADD REPLY
0
Entering edit mode

Geospatial index is the right one for interval query.

ADD REPLY
2
Entering edit mode
12.5 years ago

If you want to use a db, I'd use neo4j, in particular neo4j-spatial; you have R-trees and all that, and the graph db model is well-suited for this kind of thing (shameless plug: check bio4j).

If what you want is a data structure for stabbing queries (give me all of the intervals containing this point/interval x), as all your values are from a discrete range, there are data structures with O(k) query times where k is the number of interval containing x. The important thing here is that this is independent of the number of non-matching intervals. This could make a huge difference in a setting like yours. See Interval Stabbing Problems in Small Integer Ranges

ADD COMMENT
1
Entering edit mode
12.9 years ago

"give me all annotations overlapping this region"

a RDBMS is enough: the UCSC database uses a "bin indexing system" to quickly retrieve this kind of information: http://genomewiki.ucsc.edu/index.php/Bin_indexing_system . Of course, you could use this strategy to store your data in any NOSQL system.

ADD COMMENT
1
Entering edit mode
12.9 years ago

File based interval databases/datastructures are strangely uncommon, two that I heard of are both python based:

ADD COMMENT
3
Entering edit mode
ADD REPLY
0
Entering edit mode

can you explain how this works in conjunction with a database?

ADD REPLY
1
Entering edit mode
12.5 years ago

For this sort of problem, if you're willing to use PostgreSQL, the BIOSEG type might help.

From the introduction: "BIOSEG is an implementation of a biological sequence interval/range type for PostgreSQL. A PostgreSQL GiST index is used to improve the speed of overlap, contains and contained-in queries"

Some examples of usage: http://www.bioinformatics.org/bioseg/wiki/Main/BiosegUsage

The code is quite stable and is used in production by the InterMine project.

(Disclaimer: I maintain BIOSEG)

ADD COMMENT
0
Entering edit mode

It looks like PostgreSQL v9.2 will have a standard range type, which I'm sure will be more stable and better optimised than BIOSEG: http://www.postgresql.org/docs/devel/static/rangetypes.html

ADD REPLY

Login before adding your answer.

Traffic: 2501 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6