I'd like to store genomic annotations (think bed or bam files, lots of bam files in fact) in some sort of database, preferably something that is distributed and facilitates parallel processing (something like MapReduce). CouchDB, MongoDB, Cassandra seem to fit the bill more or less.
Such a scheme only makes sense if it's possible to run queries like "give me all annotations overlapping this region", or something like 'samtools view'. Which database supports this kind of interval index or where would it be possible to retrofit one?
To clarify: I'm aware of the UCSC binning scheme, which looks like an ugly hack to me. The BAM format adds another hack to that, apparently because the binning scheme underperforms for bigger data sets. What I'm actually looking for is something like the priority search queue described here: http://www.soi.city.ac.uk/~ross/papers/FingerTree.html (It works for every search tree, not just finger trees.)
Well, IMO the UCSC binning is a very clever hack for a database engine that does not support the R-tree index. You do not have a much better way to do that with MySQL. Also, correct me if I am wrong: finger tree seems an alternative to AVL and red-black tree. If this is the case, it is not efficient for interval queries. K-d tree and R-tree are the appropriate data structures for interval queries. UCSC binning is essentially an R-tree.
About MongoDB. I heard that MongoDB is lightening fast when the data fits memory, but otherwise it becomes quite slow. A Google result: http://www.colinhowe.co.uk/2011/02/23/mongodb-performance-for-data-bigger-than-memor/
The point isn't really about Finger Trees, any balancing scheme works. The point is about the annotation, which might actually be equivalent to the (1-dimensional) R-tree.
I'll add my +1 to MongoDB.
No, it is wrong that "any balancing scheme works". Interval query is a special type of query which makes many tree structures inefficient. You cannot use a B-tree, AVL or red-black tree to achieve the efficiency of the binning index. This is exactly why UCSC came up with this binning index when B-tree is widely used in all RDBMS. If your point is not about finger tree, then remove it to avoid confusion.
Hinze's finger tree is equivalent to an annotated (2,3)-B-tree, which is again equivalent to ab annotated red-black-tree. I'm asking if any database can annotate its B-tree index.
You see not understand the problem. You keep mentioning annotation, but what you really want to solve is an interval query problem. B-tree, together with finger, red-black and AVL trees, is a wrong data structure for interval queries. No matter how much you can annotate these kind of trees, you cannot achieve efficient interval query. UCSC binning is the only method I know that works efficiently on B-tree.
What the (marvelous) Hinze-Paterson paper says (section 4.8) is that you can implement (1-dim) interval trees via finger trees, just by choosing the right monoid; you get O(log(n)) for finding some overlapping interval, and getting all m of them in O(m log(m/n)). Not an expert in this area, but AFAIK this is as efficient as you can get.
In general, finger trees are best thought of as a "machine" producing data structures, given a monoid, not as a single data structure. See this blog post for example Apfelmus - monoids and finger trees
It seems that I was wrong about finger tree. Probably I was just making a fuss over the sentence "[the UCSC binning scheme] looks like an ugly hack". Without the last paragraph, which is certainly inaccurate, this is actually a pretty good question.
Well, actually comparing interval trees as implemented with finger trees (let's call this finger interval trees) and traditional interval trees it's kind of an apples to oranges comparison: finger interval trees (as any finger tree data structure) are persistent data structures. From a performance standpoint, query times for finger interval trees are amortized, require a language with support for lazy evaluation, and are a bit worse than that of traditional interval trees; you get however the benefits of a persistent data structure.
Thanks for the info. Now I understand why few implemented fingertree - because without lazy evaluation, implementing a fingertree is challenging. I wish someone could compared the performance of fingertree to other traditional data structures (better theoretical complexity may not always guarantee practical efficiency), but I could not find it.