Question

What's The Best Way To Visualize Multiple-Treatment, Unreplicated Gene Expression Data

1

13.1 years ago

Ted ▴ 70

Suppose I have data from a pilot gene expression study, i.e. normalized gene counts from an RNA-seq experiment. However, there are no replicates and multiple treatment groups (control, knockout1, knockout2, etc). I can easily cluster the treatments to see what is similar and what is different, but what's the best way to find biomarker genes that separate certain treatments from each other?

I've tried PCA, but the results are not sparse enough (i.e. I could mine the loadings, but there is some indication that this is not effective, as in Zou et al. 2006). Sparse PCA seems like it could work, but I can't find much on how it behaves with small sample sizes (no replicates). Would sparse PCA as implemented in the elasticnet R package work?

All of these results are just hypothesis-generating. No p-value calculations or inference are needed. The primary goal is to figure out which genes' expression makes these treatments different.

-Ted

gene • 3.3k views

updated 13.1 years ago by Damian Kao 16k • written 13.1 years ago by Ted ▴ 70

score 1 · Answer 1 · 2011-12-10

In my experience, it's very rare to get clear-cut visualizations where you can easily see distinct clusters.

In terms of clustering algorithms, I've tried PCA, k-means, SOMs, and neural networks on my expression data. They all give similar results, and the visualizations still aren't that great.

These are a couple of things I would perform for my samples to form a biological hypothesis:

Do hierarchical clustering on your samples. Seeing how your treatments correlate with each other can be extremely informative. Does RNAi of gene X look more similar to RNAi of gene Y than to RNAi of gene Z?
It can also be informative to look at overlaps between lists of differentially expressed genes. For example, I have a list of genes that are downregulated from control to irradiation. We hypothesize that these genes are involved in cell proliferation. We also found that 80% of the genes downregulated from control to treatment X overlap with the irradiation list. We can then form the hypothesis that treatment X is related to cell proliferation.
The most common way of visualizing global expression data is just a standard heatmap. I actually consider heatmaps to be the most tried and true way of looking at global expression. Z-score heatmaps are more useful than log-transformed heatmaps, in my opinion.
Don't just use expression data. Do you have gene ontology annotations? If not, run some HMM scans to get GO terms for your genes. Visualizing your expression data with a heatmap or PCA plot along with GO terms on top can give you a lot of interesting information.
A less common way to visualize expression data with GO classification is a RadViz (or PolyViz) plot. You can read about it [?]here[?]. Also, [?]here[?] is an implementation of it. I find RadViz plots to be pretty useful, albeit with a lot of caveats. How you arrange your samples around the circle makes a huge difference.
Every data set has its own patterns that need to be teased out with custom visualization solutions. After you see some kind of cluster or pattern in the heatmaps or PCA plots, you can extract those genes and dig deeper by visualizing the subset in another way. [?]Here[?] is a plot I generated a few days ago after picking out some interesting GO terms and seeing some kind of pattern in their expression.
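
To make the clustering, overlap, and z-score suggestions above a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the gene names, the expression values, the two gene lists, and the helper functions `row_zscores`, `pearson`, and `overlap_fraction` are made up and are not taken from any real data set or package. It only prepares the numbers; you would pass the z-score matrix to whatever heatmap tool you prefer and the sample correlations to a hierarchical clustering routine.

```python
from math import sqrt
from statistics import mean, pstdev

# Toy normalized expression values (made up for illustration).
# One column per unreplicated treatment, one row per gene.
samples = ["control", "knockout1", "knockout2"]
expr = {
    "geneA": [10.0, 2.0, 3.0],
    "geneB": [5.0, 5.0, 5.0],
    "geneC": [1.0, 8.0, 9.0],
    "geneD": [4.0, 6.0, 8.0],
}


def row_zscores(values):
    """Centre and scale one gene across samples; flat genes become all zeros."""
    mu, sd = mean(values), pstdev(values)
    return [0.0 if sd == 0 else (v - mu) / sd for v in values]


def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def overlap_fraction(list_a, list_b):
    """Fraction of the genes in list_a that also appear in list_b."""
    a, b = set(list_a), set(list_b)
    return len(a & b) / len(a) if a else 0.0


# Per-gene z-scores: the matrix you would hand to a heatmap function.
z = {gene: row_zscores(vals) for gene, vals in expr.items()}

# Sample-to-sample correlations: the similarity input for hierarchical clustering.
columns = {s: [expr[g][i] for g in expr] for i, s in enumerate(samples)}
corr = {(s, t): pearson(columns[s], columns[t]) for s in samples for t in samples}

# Overlap between two hypothetical lists of down-regulated genes.
down_irradiation = ["geneA", "geneB", "geneE", "geneF", "geneG"]
down_treatment_x = ["geneA", "geneB", "geneE", "geneF", "geneH"]
shared = overlap_fraction(down_treatment_x, down_irradiation)

for (s, t), r in corr.items():
    if s < t:
        print(f"{s} vs {t}: r = {r:.2f}")
print(f"overlap of treatment X list with irradiation list: {shared:.0%}")
```

With no replicates, the "down-regulated" lists here would have to come from something like a fold-change cutoff rather than a statistical test, so treat the overlap as a rough guide for forming hypotheses, not as evidence.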