Why Is Correlation In Gene Expression Usually Done In Log-Space And Not Linear Intensity-Space?
13.2 years ago
Brian Tsai ▴ 100

I'm trying to compute the correlation between two genes across multiple samples from a microarray analysis (Affy). The Pearson correlation changes depending on whether I compute it on the original intensities or on the log2-transformed values. I'm told that I should be doing this in log2 space, but I'm not sure what the reasoning is.


I'm not an expert on statistics, but you usually get better statistical properties after a log transformation. For example, in some cases the distribution is closer to a normal distribution after taking the log, which is more tractable statistically.
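As a rough illustration of the "closer to normal" point, here is a minimal Python sketch. It assumes simulated, log-normally distributed values as a stand-in for intensities (this is not real microarray data), and `skewness` is a hypothetical helper defined only for this example.

```python
import math
import random
import statistics


def skewness(values):
    """Sample skewness: mean cubed z-score (about 0 for symmetric data)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


random.seed(0)

# Assumption: simulated "intensities" drawn from a log-normal distribution.
raw = [math.exp(random.gauss(7.0, 1.0)) for _ in range(5000)]
logged = [math.log2(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)

print(f"skewness on raw scale:  {skew_raw:.2f}")
print(f"skewness on log2 scale: {skew_log:.2f}")
```

Under this assumption the raw values are strongly right-skewed, while the log2 values are roughly symmetric. Real data may only approximately behave this way.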

13.2 years ago

You shouldn't really be doing any "downstream analysis" using original intensity data, but rather use some normalized version of the data.

You might find this useful:

There is no silver bullet -- a guide to Low-level Data Transforms and Normalization Methods for Microarray Data (PDF)

In particular, look at the "Explicit error models" section where it mentions that "... a log-transform decouples a random multiplicative error (e^n) from true signal intensity...", and the related reference (Brown et al.)
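To make the "decoupling" concrete: if the observed intensity is modelled as true signal times e^n, then its log is the log of the signal plus a term that depends only on n. The sketch below is a toy simulation under that assumed model; the noise level, the signal values, and the `spread` helper are all made up for illustration and do not come from the linked guide.

```python
import math
import random
import statistics

random.seed(0)

# Assumption: multiplicative error e^n with n ~ Normal(0, 0.2).
# The same noise draws are reused at every signal level so that only
# the true signal changes between runs.
noise = [random.gauss(0.0, 0.2) for _ in range(1000)]


def spread(true_signal):
    """Return (SD on the raw scale, SD on the log2 scale) of the observations."""
    observed = [true_signal * math.exp(n) for n in noise]
    raw_sd = statistics.stdev(observed)
    log_sd = statistics.stdev([math.log2(y) for y in observed])
    return raw_sd, log_sd


raw_sd_low, log_sd_low = spread(100.0)
raw_sd_high, log_sd_high = spread(1000.0)

print(f"raw SD:  {raw_sd_low:.2f} vs {raw_sd_high:.2f}")
print(f"log2 SD: {log_sd_low:.4f} vs {log_sd_high:.4f}")
```

On the raw scale the spread grows in proportion to the signal, whereas on the log2 scale it is the same at both signal levels, which is what it means for the error to no longer depend on the intensity.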

13.2 years ago

Correlation is usually computed on log2 data because regardless of the normalization method (e.g. RMA), that's the scale typically used for microarray analysis. The reason for this is nicely stated in the manual to Cluster 3.0:

The results of many DNA microarray experiments are fluorescent ratios. Ratio measurements are most naturally processed in log space. Consider an experiment where you are looking at gene expression over time, and the results are relative expression levels compared to time 0. Assume at timepoint 1, a gene is unchanged, at timepoint 2 it is up 2-fold and at timepoint three is down 2-fold relative to time 0. The raw ratio values are 1.0, 2.0 and 0.5. In most applications, you want to think of 2-fold up and 2-fold down as being the same magnitude of change, but in an opposite direction. In raw ratio space, however, the difference between timepoint 1 and 2 is +1.0, while between timepoint 1 and 3 is -0.5. Thus mathematical operations that use the difference between values would think that the 2-fold up change was twice as significant as the 2-fold down change. Usually, you do not want this. In log space (we use log base 2 for simplicity) the data points become 0, 1.0, -1.0. With these values, 2-fold up and 2-fold down are symmetric about 0. For most applications, we recommend you work in log space.
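The arithmetic in that passage can be checked with a few lines of Python. This is only a sketch of the quoted example, using the three ratio values it gives; the variable names are mine.

```python
import math

# Ratios relative to time 0 at timepoints 1, 2 and 3 (from the quoted example).
ratios = [1.0, 2.0, 0.5]

# Differences from timepoint 1 on the raw ratio scale.
raw_diffs = [r - ratios[0] for r in ratios[1:]]

# The same values and differences on the log2 scale.
log_ratios = [math.log2(r) for r in ratios]
log_diffs = [lr - log_ratios[0] for lr in log_ratios[1:]]

print(raw_diffs)   # [1.0, -0.5]
print(log_ratios)  # [0.0, 1.0, -1.0]
print(log_diffs)   # [1.0, -1.0]
```

Any method built on differences between values (distances, correlations) sees the asymmetric raw differences unless the data are logged first.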


Although if the poster is using Affy arrays then there's no 'ratio' of two samples.


It's true that one-color arrays don't present data as sample/reference. However, the larger point holds for Affy data too: log2-transforming the intensities makes fold changes symmetric and comparable across the intensity range (50 vs 100 and 100 vs 200 are both a fold change of 2, i.e. a difference of 1 on the log2 scale).
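The same scale effect carries over to the Pearson correlation the original question asks about. Below is a minimal sketch with made-up one-color intensities for two hypothetical genes across six samples; `pearson` is a small helper written for this example, not a library function. One sample is much brighter than the rest, which is the kind of situation where the two scales can disagree.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Assumption: invented intensities; the last sample is far brighter than the others.
gene_a = [50, 100, 200, 100, 50, 3200]
gene_b = [200, 50, 100, 100, 200, 3200]

r_raw = pearson(gene_a, gene_b)
r_log = pearson([math.log2(v) for v in gene_a],
                [math.log2(v) for v in gene_b])

print(f"Pearson r on raw intensities: {r_raw:.3f}")
print(f"Pearson r on log2 intensities: {r_log:.3f}")
```

With these numbers the raw-scale correlation comes out close to 1 because the single bright sample dominates the sums, while the log2-scale value is noticeably lower since every 2-fold step counts equally. How large the gap is on real data depends on the data.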
