I have been developing a database-driven web application for analyzing and reporting a variety of public and private genomic data sets for the past two years. As this project has gained traction and a following, it has grown significantly, doubling in size every 6 months or so. With another doubling of data schedule for the next quarter, I am starting to question whether the application has outgrown its infrastructure.
Currently, the application uses a highly-normalized Oracle database (~1TB) for processed data, networked storage (~30TB) for raw data, with the software layer developed using Pipeline Pilot and Spotfire, served from a VM that hosts multiple widely-used applications. User queries are starting to drag and data processing is becoming a chore without multi-CPU support.
Does there exist any standard or popular emerging strategies for tackling such applications? What sort of database, web development, and application server technologies do people use successfully? Additionally, is there a significant difference between free (open source) and proprietary solutions?
If you use a similar system, is it proprietary or home-grown? What technology is it built upon? How do you like it?
Update: Some smart people over at StackExchange suggested that this might be a good candidate for an OLAP system. Are there any OLAP systems particularly well suited to genomic data?
I would love to know some answers to this as well. But it may be that you are one the few that knows first hand what actually works when data grows really large.
This is not a unique problem, by a long shot. While every organization's needs may be unique, the overall data model and data variety should be very similar across organizations involved in bioinformatics. I have seen software companies cook up all kinds of weird, proprietary solutions. I know this is an open-ended question and there is no real answer, but I have a feeling that there could be an ideal generic solution.
What kind of request latencies and traffic are you seeing? Would the database benefit from a caching layer? Have you considered partitioning the database at all? You aren't going to be able to scale it easily without partitioning and/or caching. Is the raw data sitting in object storage or on a file system? How often is that data accessed? Scaling is less about the technology and more about how you build it.
Traffic is relatively light, so that is not an issue. Caching and partitioning is something we have regularly tossed around as a measure to combat poor database performance. The biggest architectural problem we have is that the database data is accessed and used in many different ways, so creating a partitioning scheme has been difficult, and caching would result in absurd resource usage. I wish I could narrow user focus, but pressure from above forces the system to be flexible. Raw data is located on a file system, and so far has not been a performance bottleneck.
As I remember, the total size of 1000g phase3 BAMs is about 50TB or something roughly in that scale. Your project is over half of 1000g. The UCSC hg19 gzip'd dump takes only 70GB. Your database is competing that. These make me a little curious about what data you are keeping for a web application. Just curious...
The ~30TB of networked storage I speak of is archived raw data, nothing that is used in the web application, which uses the ~1TB Oracle database of summarized data. My concern for this project is primarily how to work with that database data, though effective ways of tying in raw data are always worth considering.