Question

How to design an SQL relational database for gene expression data?

1

Entering edit mode

10.4 years ago

enricoferrero ▴ 920

Hi,

I have to design a relational database to store the results of gene expression experiments (both microarrays and RNA-seq), using MySQL.

I have a number of experiments, and for each experiment several contrasts/comparisons. For each contrast, I have thousands of genes with fold changes and p-values that I want to store. The database will be queried to retrieve genes showing differential expression across comparisons and across experiments.

What's the best design for such a scenario? I'm thinking to create one table per experiment but I'm not sure this is the best way to go about it. I'd be curious to hear from people who have experience in designing relational databases or who use SQL to query databases containing gene expression data.

Thank you!

mysql relational-database • 8.8k views

ADD COMMENT • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by enricoferrero ▴ 920

1

Entering edit mode

Using a database system to store genes with just p-val and fold-change seems a bit overkill. What's the size of your data? If it is less than few hundred megabytes, you could probably just load the entire thing into memory to serve to the user via a simple webserver framework like bottle.py or node.js.

ADD REPLY • link 10.4 years ago by Damian Kao 16k

0

Entering edit mode

Thanks for the suggestion Damian.

That's not what I need though. I'm oversimplifying the situation in my question. In reality I'll have several columns to store for each comparison and experiment. Besides, this will be part of a larger projects and has to be a relational database.

ADD REPLY • link 10.4 years ago by enricoferrero ▴ 920

Ram · Answer 1 · 2015-01-23

2

Entering edit mode

10.4 years ago

ilya.shamovsky ▴ 20

You can have a look at the table organization for CummeRbund. It's for one experiment only, but you can easily add experiment on top of that, something similar to this.

ADD COMMENT • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by ilya.shamovsky ▴ 20

score 1 · Answer 2 · 2015-01-23

When designing databases, I try to follow a business object model. This usually means having a table per entity/class (e.g. a gene table, an experiment table...)and mapping their attributes to columns. I then apply normalization constraints, typically trying to achieve third normal form. I denormalize when it makes sense (e.g. faster queries). I would start simple with just core entities and their relationships. I also use MySQL Workbench to help with the design.

Ram · Answer 3 · 2015-01-23

1

Entering edit mode

10.4 years ago

pld 5.1k

If this is for a single large experiment where the points of comparison are fixed the column approach could work, otherwise I would suggest adding an 'experiment' component.

What I've done in the past is have experiment, condition and fold change tables. Each experiment has several conditions, then you make two keys for the fold change table which refer to conditions. In the fold change table I made keys for the experiment and the two conditions. The exact implementation is a bit more complex, it was for microarray data so I built in the array's probes and so on into this. That way it was possible to pull fold change by probe id as well as the raw intensity values for each instance of that probe spotted on the array.

This made it relatively easy to just dump the fold changes from limma/etc into a single table using a python script to insert the relevant rows into the experiment and condition tables.

The idea was similar to what you are saying, the plan was to be able to pull fold changes for a given gene across individual experiments to make meta-analysis easier.

There is probably a better way, but you want to avoid adding tables for each experiment or comparison.

ADD COMMENT • link 10.4 years ago by pld 5.1k

0

Entering edit mode

Thanks Joe, this is exactly the sort of answer I was looking for!

I'm not sure I completely understand what you mean with the two keys for the fold change table. Do you care to explain with an example? Or, even better, can you show me what the tables would look like? Cheers!

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by enricoferrero ▴ 920

1

Entering edit mode

You have one key per condition, a fold change is derived from two conditions. Maybe key isn't the right term.

Say in my comparison table I have two rows:

Experiment        Condition        Description
foo               foo.1            <text description of condition>
foo               foo.2            <text description of condition>

A row in the fold change column may look like

Condition1 Condition2 Marker FC    log2FC pval   fdr.pval
foo.1      foo.2      geneX  0.25  -2     0.0001 0.001

It isn't the best approach, I had to join the condition table twice. It isn't properly normalized but I was trying to keep a happy mixture.

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by pld 5.1k