Question

Maths implications: log2 transformation before or after normalisation

1

5.2 years ago

gablenord ▴ 10

Hi guys,

A (hopefully) quite straightforward question:

What are the different implications of log2 transforming variables before or after performing normalisation (for example quantile norm) on a dataset?

I have in mind microarray gene expression data but I guess the question would stand for any type of data as well.

I found contradictory sources around, from normalisation functions (in R packages) that expect log2-transformed input to people stating that log2 transformation MUST be done after normalisation. I would like to understand the implications better.

Thanks,

log2 normalisation gene-expression • 8.4k views

ADD COMMENT • link updated 11 months ago by Ram 44k • written 5.2 years ago by gablenord ▴ 10

1

Well, it depends on the normalization. If it's quantile normalization, it should not make a difference whether you log transform before or after, provided you don't have negative numbers.

ADD REPLY • link 5.2 years ago by Martombo ★ 3.2k

1

See Normalisation before log2 transformation or after in Microarray Gene expression data?

ADD REPLY • link 5.2 years ago by ATpoint 86k

0

I had already read that, but I found it more prescriptive than descriptive. I was more interested in the why than in the how :)

ADD REPLY • link 5.2 years ago by gablenord ▴ 10

1

Given the diversity of microarray designs and detection systems for each, I'm not surprised that you have come across seemingly contradictory material online.

As an example, for two-colour arrays, the 'raw' signal intensities are log (base 2) ratios between the cDNA in the test and reference samples - these are then further normalised and kept on the log (base 2) scale. Agilent produces most if not all of these two-colour arrays, I believe.

For the Affymetrix and Illumina arrays, the raw data is just the fluorescent signal intensity from whatever detection system they use, so it is not yet logged.

ADD REPLY • link 5.2 years ago by Kevin Blighe 88k

score 5 · Answer 1 · 2019-11-06

5

5.2 years ago

jomo018 ▴ 730

Log2 is a monotonic but non-linear transformation: the ranking of elements in a sample is kept, but the ratios between them are not. Once it has been applied, a downstream linear operation such as depth normalization is less appropriate. In such cases, it makes more sense to begin with the linear operations and end with the non-linear ones. Quantile normalization is also non-linear. In that case, both workflow orderings are reasonable, noting that from the start the ranking is kept but not the original ratios.
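A minimal sketch of the monotonic-but-non-linear point, using a made-up vector `x` (an assumption for illustration, not data from this thread). It suggests that log2 keeps the ordering of the values, while a ratio on the raw scale shows up as a difference on the log2 scale, so ratios between logged values no longer match the original ones.

```r
# Hypothetical intensities for one sample (illustrative values only)
x  <- c(2, 4, 8, 32)
lx <- log2(x)

# Monotonic: the ranking of the values is unchanged
same_ranks <- identical(rank(x), rank(lx))

# Non-linear: the ratio between two elements is not kept
ratio_raw <- x[4] / x[1]     # ratio on the raw scale
ratio_log <- lx[4] / lx[1]   # ratio of the logged values

# A raw-scale ratio corresponds to a difference on the log2 scale
diff_log <- lx[4] - lx[1]

print(same_ranks)
print(c(ratio_raw = ratio_raw, ratio_log = ratio_log, diff_log = diff_log))
```

This is why a step that relies on ratios (a linear scaling, for instance) behaves differently depending on whether it is applied before or after the log.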

ADD COMMENT • link 5.2 years ago by jomo018 ▴ 730

0

Thank you Jomo, that's what I wanted to know!

ADD REPLY • link 5.2 years ago by gablenord ▴ 10

0

I moved this to an answer. Just to 'shore it up', the two approaches do produce different end results:

mat <- matrix(c(5,2,3,4,4,1,4,2,3,4,6,8), ncol = 3)

log2(preprocessCore::normalize.quantiles(mat))
         [,1]     [,2]     [,3]
[1,] 2.502500 2.369234 1.000000
[2,] 1.000000 1.000000 1.584963
[3,] 1.584963 2.369234 2.222392
[4,] 2.222392 1.584963 2.502500

preprocessCore::normalize.quantiles(log2(mat))
          [,1]      [,2]      [,3]
[1,] 2.4406427 2.3178151 0.8616542
[2,] 0.8616542 0.8616542 1.5283208
[3,] 1.5283208 2.3178151 2.1949875
[4,] 2.1949875 1.5283208 2.4406427
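One way to read the difference between the two outputs: quantile normalisation replaces each value with the mean of the values that share its rank across columns. Logging after normalisation therefore gives the log of an arithmetic mean, while normalising logged data gives a mean of logs (the log of a geometric mean), which is never larger. The sketch below uses a hypothetical base-R helper, `quantile_normalize()`, which is not the preprocessCore implementation (it breaks ties by order of appearance instead of averaging them), and a tie-free matrix invented for illustration.

```r
# Hypothetical helper: plain quantile normalisation in base R.
# Assumptions: no missing values; ties broken by order of appearance.
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank within each column
  target <- rowMeans(apply(x, 2, sort))               # mean of the k-th smallest values
  out    <- apply(ranks, 2, function(r) target[r])    # replace each value by its rank's mean
  dimnames(out) <- dimnames(x)
  out
}

# Illustrative tie-free matrix (rows = probes, columns = samples)
m <- matrix(c(5, 2, 3, 4,
              7, 1, 6, 2,
              3, 4, 6, 8), ncol = 3)

norm_then_log <- log2(quantile_normalize(m))  # log2 of arithmetic means
log_then_norm <- quantile_normalize(log2(m))  # means of logs

# The values differ; logging first never gives the larger value
print(round(norm_then_log - log_then_norm, 4))

# The within-column ordering is the same either way
same_order <- identical(apply(norm_then_log, 2, rank),
                        apply(log_then_norm, 2, rank))
print(same_order)
```

Since only the values shift and the ordering is unchanged, it is plausible that rank-driven or correlation-driven downstream results stay very similar between the two orders.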

ADD REPLY • link 5.2 years ago by Kevin Blighe 88k

0

Thanks Kevin. I did some tests on my own data and saw that, while the actual values are indeed slightly different, as you showed, the results of downstream analyses (unsupervised clustering, PCA, paired t-tests, ...) are essentially comparable, if not the same. I was therefore interested in understanding whether, beyond empirical experience, there is a "mathematical" reason why one order should be preferred over the other.

From @jomo018's answer, I gathered that the choice should be based (mostly?) on the linear or non-linear nature of the transformations involved: there would be nothing "wrong" with either order for log2 and quantile normalisation (both being non-linear), whereas the order would matter for linear normalisations, for example.
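To make the linear case concrete, here is a small sketch with invented counts and an invented scaling factor (both assumptions for illustration). Dividing by the factor and then taking log2 is not the same as taking log2 and then dividing by the same factor; on the log2 scale, the counterpart of that division is subtracting log2 of the factor.

```r
# Hypothetical raw values for one sample and a made-up scaling factor
counts      <- c(10, 200, 4000)
size_factor <- 4

# Linear step first, then the log
scale_then_log <- log2(counts / size_factor)

# Same linear step applied after the log: a different result
log_then_scale <- log2(counts) / size_factor

# On the log2 scale, the division becomes a subtraction
log_then_shift <- log2(counts) - log2(size_factor)

print(rbind(scale_then_log, log_then_scale, log_then_shift))
print(isTRUE(all.equal(scale_then_log, log_then_shift)))
print(isTRUE(all.equal(scale_then_log, log_then_scale)))
```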

Did I get it right?

Thanks

ADD REPLY • link 5.2 years ago by gablenord ▴ 10