IT Architecture for a Large, Genomic Data-Backed Application
11.8 years ago
woemler ▴ 170

I have been developing a database-driven web application for analyzing and reporting on a variety of public and private genomic data sets for the past two years. As the project has gained traction and a following, it has grown significantly, doubling in size every 6 months or so. With another doubling of data scheduled for the next quarter, I am starting to question whether the application has outgrown its infrastructure.

Currently, the application uses a highly normalized Oracle database (~1TB) for processed data and networked storage (~30TB) for raw data, with the software layer developed in Pipeline Pilot and Spotfire and served from a VM that hosts multiple widely used applications. User queries are starting to drag, and data processing is becoming a chore without multi-CPU support.

Are there any standard or popular emerging strategies for tackling such applications? What sorts of database, web development, and application server technologies do people use successfully? Additionally, is there a significant difference between free (open-source) and proprietary solutions?

If you use a similar system, is it proprietary or home-grown? What technology is it built upon? How do you like it?

Update: Some smart people over at StackExchange suggested that this might be a good candidate for an OLAP system. Are there any OLAP systems particularly well suited to genomic data?

genomics database • 4.2k views

I would love to know some answers to this as well. But it may be that you are one of the few who knows first-hand what actually works when data grows really large.


This is not a unique problem, by a long shot. While every organization's needs may be unique, the overall data model and data variety should be very similar across organizations involved in bioinformatics. I have seen software companies cook up all kinds of weird, proprietary solutions. I know this is an open-ended question and there is no real answer, but I have a feeling that there could be an ideal generic solution.


What kind of request latencies and traffic are you seeing? Would the database benefit from a caching layer? Have you considered partitioning the database at all? You aren't going to be able to scale it easily without partitioning and/or caching. Is the raw data sitting in object storage or on a file system? How often is that data accessed? Scaling is less about the technology and more about how you build it.
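To make the caching suggestion concrete, here is a minimal sketch of an in-process query cache with time-based expiry. A real deployment would more likely put memcached or Redis in front of the database; the `query_db` function and the query string here are hypothetical stand-ins, not part of the original system.

```python
import time

# Minimal in-process query cache with time-based expiry (a sketch only;
# a production system would use memcached, Redis, or similar).
class QueryCache:
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.store = {}  # query -> (timestamp, result)

    def get(self, query, compute):
        now = time.time()
        hit = self.store.get(query)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]          # fresh cached result, database untouched
        result = compute(query)    # cache miss: fall through to the database
        self.store[query] = (now, result)
        return result

cache = QueryCache(ttl_seconds=60)
calls = []

def query_db(q):
    calls.append(q)                # stands in for an expensive Oracle query
    return f"rows for {q}"

cache.get("SELECT * FROM genes", query_db)
cache.get("SELECT * FROM genes", query_db)  # second call served from cache
```

Whether this pays off depends on how repetitive the query workload is; with many distinct ad-hoc queries, the hit rate (and the benefit) drops quickly.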


Traffic is relatively light, so that is not an issue. Caching and partitioning are measures we have regularly tossed around to combat poor database performance. The biggest architectural problem we have is that the database is accessed and used in many different ways, so creating a partitioning scheme has been difficult, and caching would result in absurd resource usage. I wish I could narrow the user focus, but pressure from above forces the system to stay flexible. The raw data sits on a file system and so far has not been a performance bottleneck.


As I remember, the total size of the 1000 Genomes phase 3 BAMs is about 50TB, or something roughly on that scale, so your project is over half the size of 1000g. The UCSC hg19 gzip'd dump takes only 70GB, and your database rivals that. This makes me a little curious about what data you are keeping for a web application. Just curious...


The ~30TB of networked storage I mentioned is archived raw data, none of which is used in the web application; the application uses the ~1TB Oracle database of summarized data. My concern for this project is primarily how to work with that database, though effective ways of tying in the raw data are always worth considering.

11.8 years ago

You mention that your database is highly normalized. As a first step, you might consider building a de-normalized version of the data in a standard star or snowflake schema to reduce joins on the database. You could also stage some of your data in a non-relational database that allows document-based or key/value storage. Some of these systems (MongoDB, Cassandra, etc.) have advantages in scalability (sharding) as well as in schema flexibility. I am not suggesting getting rid of your relational database; rather, you might consider offloading some of the joins by de-normalizing or by using non-relational databases.

All that said, as your database grows, it will need more resources. Machines with as much as 256GB or 512GB of RAM are pretty cheap these days, and a database held largely in RAM may be all that is needed to keep you going.
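As a sketch of what de-normalization buys you, the following uses an in-memory SQLite database (table and column names are invented for illustration, not taken from the original schema): a fact table joined to a dimension table is flattened into a single wide table, so a common reporting query no longer needs a join at all.

```python
import sqlite3

# Sketch of flattening a star schema so common queries avoid joins.
# All table/column names here are hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized form: a fact table referencing a dimension table.
cur.execute("CREATE TABLE samples (sample_id INTEGER PRIMARY KEY, tissue TEXT)")
cur.execute("CREATE TABLE expression (sample_id INTEGER, gene TEXT, value REAL)")
cur.execute("INSERT INTO samples VALUES (1, 'liver'), (2, 'brain')")
cur.executemany("INSERT INTO expression VALUES (?, ?, ?)",
                [(1, 'TP53', 5.2), (2, 'TP53', 3.1)])

# De-normalized form: the dimension attribute is copied into the fact rows,
# trading storage and update cost for cheaper reads.
cur.execute("""
    CREATE TABLE expression_flat AS
    SELECT e.gene, e.value, s.tissue
    FROM expression e JOIN samples s ON e.sample_id = s.sample_id
""")

# Reporting query is now a single-table scan, no join required.
rows = cur.execute(
    "SELECT tissue, value FROM expression_flat WHERE gene = 'TP53' ORDER BY tissue"
).fetchall()
```

The trade-off is the usual one: the flat table duplicates dimension data and must be refreshed when the source tables change, which is why this pattern is typically applied to read-heavy reporting data rather than to the system of record.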


That is a very interesting answer. I have considered offloading some data from Oracle to a NoSQL database, but I have never worked with one before. Moving a lot of my data into memory is an appealing option as well. I see a cost-benefit analysis in my future.


By "a database largely in RAM", I meant simply running Oracle on a machine with very large memory, where the indexes and tables would be largely cached. I have not used Oracle in a long time, but I assume it benefits from more memory in much the same way as other RDBMSs. As for NoSQL databases, they are not a panacea, but they can be useful for storing complex data without joins (most do not even support joins) and for scaling horizontally across multiple machines (sharding).
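To illustrate the "complex data without joins" point, here is a minimal sketch of the document model, using a plain JSON string as a stand-in for a store like MongoDB; the gene and field names are invented for the example.

```python
import json

# Document model sketch: related records that would span several relational
# tables (gene, transcripts, exon counts) are nested in one document, so a
# read returns the whole aggregate without any join. A real system would
# store this in MongoDB or similar; JSON text stands in here.
gene_doc = {
    "symbol": "BRCA1",
    "chromosome": "17",
    "transcripts": [
        {"id": "NM_007294", "exons": 23},
        {"id": "NM_007300", "exons": 24},
    ],
}

stored = json.dumps(gene_doc)    # one write persists one logical record
loaded = json.loads(stored)      # one read returns the entire aggregate
exon_counts = [t["exons"] for t in loaded["transcripts"]]
```

The flip side is that queries cutting across documents (e.g. "all transcripts with more than 20 exons, regardless of gene") need their own indexes or aggregation machinery, since there is no join to lean on.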


More memory will definitely help, especially with Oracle. Perhaps step one should be some denormalization, followed over time by a migration to a more distributed system. My guess is that the time and resources available here are limited.

11.8 years ago

Not a full answer per se, but there may be a few ideas in the posts that come up when searching for the term "infrastructure":

http://www.biostars.org/search/?q=infrastructure&t=Question&submit=


Thanks for the link; there is some good reading in those Q&As.

11.8 years ago

Have you considered using InterMine? It's open source, designed with biological data in mind, offers flexible querying, and doesn't require you to de-normalize your data. It's already in use by some pretty big databases, too. Pubmed link, InterMine website

11.8 years ago

This is an extremely young project and is closed source, but perhaps you can get some inspiration from their database design philosophy:

http://www.datomic.com/

There are some pretty cool ideas in it. It's designed by Rich Hickey, who in my opinion has some of the most original thoughts on programming paradigms and fundamentals.

