Question

Limma: is it necessary to log-transform the data?

0

2.1 years ago

eodmacmd • 0

I am trying to apply limma to a TMT-labelled proteomic dataset. I wish to keep the data in non-log transformed format (fold change). I noticed that limma asks for log-transformed data, but I couldn't find the information on why that is and what assumptions it is trying to satisfy by doing that. If I already know that my non-log transformed data is normally distributed, is log transformation necessary? I'd greatly appreciate any insight.

proteomics limma • 4.2k views

ADD COMMENT • link updated 6 months ago by Gordon Smyth ★ 7.9k • written 2.1 years ago by eodmacmd • 0

score 3 · Answer 1 · 2023-04-06

3

2.1 years ago

Gordon Smyth ★ 7.9k

If I already know that my non-log transformed data is normally distributed

Well, if you really have normally distributed data then you could input it directly into limma without transformation. However, I think it is impossible for you to know that, because unlogged fold-changes cannot possibly be normally distributed.

limma is not designed to analyse proteomic fold-changes, whether log-transformed or not. limma is rather designed to analyse log-expression or log-intensity values.

I am not an expert in low-level processing of mass spectrometry proteomic data, but it is almost universal in the literature to undertake differential expression analyses of mass spectrometry data on a log scale. So it seems that you are processing or interpreting your proteomics data in an unusual way. I suggest revisiting the format of your data and considering pre-processing it in a more standard way.

ADD COMMENT • link 6 months ago by Gordon Smyth ★ 7.9k

0

Thank you very much for the helpful advice. This is going to be a naïve question and probably outside the scope of this site, but why would it be impossible for unlogged fold-changes to be normally distributed?

ADD REPLY • link 2.1 years ago by eodmacmd • 0

2

For so many reasons. For one, unlogged values have a strong mean-variance relationship (large values have large standard deviations, small values have small standard deviations), contrary to the constant variance assumption. For another, unlogged values have right-skewed distributions. For another, unlogged changes are nonlinear and asymmetric because doubling is a larger absolute change than halving.
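The three points above can be checked with a small simulation. This is only a sketch in plain Python, not part of limma, and it rests on one assumption that is not stated in the thread: that measurement noise is multiplicative (log-normal). The function name `simulate_intensities`, the abundances and the noise level `sigma` are all made up for illustration.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible


def simulate_intensities(true_abundance, n=2000, sigma=0.5):
    """Draw n intensities with multiplicative (log-normal) noise.

    Assumption: intensity = true_abundance * exp(e), with e ~ N(0, sigma^2).
    """
    return [true_abundance * math.exp(random.gauss(0.0, sigma)) for _ in range(n)]


low = simulate_intensities(1_000.0)      # hypothetical low-abundance protein
high = simulate_intensities(100_000.0)   # hypothetical high-abundance protein

# 1. Mean-variance relationship: the raw SD grows with the mean,
#    while the SD of the log2 values stays about the same.
sd_raw_ratio = statistics.stdev(high) / statistics.stdev(low)
sd_log_ratio = (statistics.stdev([math.log2(v) for v in high])
                / statistics.stdev([math.log2(v) for v in low]))

# 2. Right skew: on the raw scale the mean is pulled above the median.
raw_mean = statistics.mean(high)
raw_median = statistics.median(high)

# 3. Asymmetry: starting from 1.0, doubling and halving are unequal
#    absolute changes but equal and opposite changes on the log2 scale.
raw_up, raw_down = 2.0 - 1.0, 0.5 - 1.0              # +1.0 and -0.5
log_up, log_down = math.log2(2.0), math.log2(0.5)    # +1.0 and -1.0

print(f"raw SD ratio (high/low):  {sd_raw_ratio:.1f}")
print(f"log2 SD ratio (high/low): {sd_log_ratio:.2f}")
print(f"raw mean > raw median:    {raw_mean > raw_median}")
print(f"raw changes: {raw_up:+}, {raw_down:+}; log2 changes: {log_up:+}, {log_down:+}")
```

Under these assumptions the raw standard deviations differ by roughly the same factor as the abundances, whereas the log2 standard deviations are about equal, which is the constant-variance behaviour a linear model wants.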

This is a rule across all of science, whenever we measure concentration quantities that vary by orders of magnitude. Why for example is acidity measured by pH instead of by unlogged hydrogen concentration?

Why do you wish to keep the data non-log transformed anyway? How could that be an advantage? And why are your raw data fold-changes instead of the usual mass spec intensities?

ADD REPLY • link 2.1 years ago by Gordon Smyth ★ 7.9k

0

Thank you very much for the thorough explanation.

For one, unlogged values have a strong mean-variance relationship (large values have large standard deviations, small values have small standard deviations), contrary to the constant variance assumption. For another, unlogged values have right-skewed distributions.

When you say "value" here, you are referring to fold change values of intensities and not raw mass spec intensities in general, right? It is clear to me why paired fold changes (e.g., if treated = t1, t2 and control = c1, c2, then fold change1 = t1/c1 and fold change2 = t2/c2) wouldn't be normally distributed. I shouldn't have phrased it as fold change for my case. In my experiment, intensities across samples/columns (e.g., n=10) are divided by constants (e.g., n=10) that represent the relative abundance of the original sample input. Other times, sample intensities are divided row-wise by the row mean intensity or the mean intensity of the controls (e.g., t1/mean(ctrls) or t1/rowmean). When I have multiple experiments joined by a bridge channel, I divide the intensity of all samples row-wise by that of the bridge channel. In these cases, I am expecting that the distribution wouldn't change. I'm also assuming that mass spec TMT intensities approximately follow a Gaussian distribution. Please correct me if I am mistaken. In such cases, would you say that I can log transform the data and use limma?

I really appreciate your comments. A bunch of us in the lab have had lively discussions because of them, so thank you.

ADD REPLY • link 2.1 years ago by eodmacmd • 0

2

When you say "value" here, you are referring to fold change values of intensities and not raw mass spec intensities in general, right?

No, I very much mean raw mass spec intensities as well as fold-changes. Raw mass spec intensities cannot possibly be normally distributed.

In my experiment, intensities across samples/columns are divided by constants that represent the relative abundance of the original sample input.

I am far from clear what you are doing, but ad hoc standardization like this is usually undesirable. It throws away valuable information on the variability of the control samples and it also screws up the global mean-variance relationship (that limma estimates) by putting every protein on a different measurement scale. It is far better to implement a standard limma approach using the complete data, in which samples are compared to controls as part of the linear model.
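To make "compared to controls as part of the linear model" concrete, here is a minimal sketch for a single protein in plain Python. It is ordinary least squares only, not limma: limma fits such a model to every protein and then moderates the variances across proteins, which this sketch does not attempt. The intensities are invented, and the two-group design (three controls, three treated) is an assumption for illustration.

```python
import math
import statistics

# Hypothetical raw intensities for one protein (made-up numbers).
control = [1000.0, 1200.0, 900.0]
treated = [2100.0, 1900.0, 2400.0]

# Work on the log2 scale and keep every sample as its own observation,
# rather than dividing the treated samples by the control mean beforehand.
y = [math.log2(v) for v in control + treated]
# Design column: 0 = control, 1 = treated (the intercept is implicit).
x = [0.0] * len(control) + [1.0] * len(treated)

# Ordinary least squares for y = intercept + log_fc * x.
x_bar = statistics.fmean(x)
y_bar = statistics.fmean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
log_fc = sxy / sxx                   # treated-vs-control log2 fold change
intercept = y_bar - log_fc * x_bar   # mean log2 intensity of the controls

# The residual variance pools the spread of BOTH groups, so the
# variability of the controls contributes to the standard error.
residuals = [yi - (intercept + log_fc * xi) for xi, yi in zip(x, y)]
df = len(y) - 2
s2 = sum(r ** 2 for r in residuals) / df
t_stat = log_fc / math.sqrt(s2 / sxx)  # ordinary (unmoderated) t-statistic

print(f"log2 fold change: {log_fc:.3f}, t = {t_stat:.2f} on {df} df")
```

For this two-group design the fitted coefficient is simply the difference between the mean log2 intensities of the two groups, but the standard error still uses the control replicates, which is the information lost when samples are divided by a control mean up front.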

I'm also assuming that mass spec TMT intensities approximately follow a Gaussian distribution.

This is pretty frustrating. You asked me why unlogged values can't be normally distributed and I gave you a number of reasons. You said you appreciate my comments, but you go back to your original assumption, which seems to dismiss everything that I've said.

Note that normality of log-intensities and normality of log-fold-changes are essentially the same thing. If the log-intensities are approximately normal (as we usually assume) then so are any log-fold-changes that you compute from them.
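The reason is that a log-fold-change is just a difference of log-intensities, and a linear combination of jointly normal values is again normal. A tiny sketch in plain Python, with made-up intensities and a hypothetical bridge channel value:

```python
import math

# Hypothetical intensities for one protein (made-up numbers).
treated = 2400.0
control = 1200.0
bridge = 800.0  # hypothetical bridge/reference channel intensity

# A log-fold-change is the difference of two log-intensities.
log_fc_from_ratio = math.log2(treated / control)
log_fc_from_logs = math.log2(treated) - math.log2(control)

# Dividing every sample by the same bridge channel is a subtraction on
# the log2 scale, so it cancels out of the treated-vs-control comparison.
log_fc_bridge = math.log2(treated / bridge) - math.log2(control / bridge)

print(log_fc_from_ratio, log_fc_from_logs, log_fc_bridge)
```

All three quantities agree up to floating-point rounding, which is why ratios to a reference become simple shifts once the data are logged.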

There is a growing literature on statistical analysis of mass spec data. I suggest that you have a look at what others have done and follow an analysis method that is supported by the literature.

ADD REPLY • link 2.0 years ago by Gordon Smyth ★ 7.9k