I am trying to apply limma to a TMT-labelled proteomic dataset. I wish to keep the data in non-log transformed format (fold change). I noticed that limma asks for log-transformed data, but I couldn't find the information on why that is and what assumptions it is trying to satisfy by doing that. If I already know that my non-log transformed data is normally distributed, is log transformation necessary? I'd greatly appreciate any insight.
Thank you very much for the helpful advice. This is going to be a naïve question and probably outside the scope of this site, but why would it be impossible for unlogged fold-changes to be normally distributed?
For so many reasons. For one, unlogged values have a strong mean-variance relationship (large values have large standard deviations, small values have small standard deviations), contrary to the constant-variance assumption. For another, unlogged values have right-skewed distributions. For another, unlogged changes are nonlinear and asymmetric because doubling is a larger absolute change than halving.
This is a rule across all of science, whenever we measure concentration quantities that vary by orders of magnitude. Why for example is acidity measured by pH instead of by unlogged hydrogen concentration?
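The three properties listed above can be checked with a small simulation. This is only a sketch under a stated assumption: the intensities are simulated so that their log2 values are normal, with the same log-scale standard deviation for a low-abundance and a high-abundance protein. The function names and all numbers are made up for illustration and are not taken from any real TMT dataset.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible


def skewness(xs):
    """Sample skewness: mean of cubed standardized values."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


def simulate(log2_mean, n=5000, log2_sd=0.5):
    """Simulated intensities whose log2 values are normal (assumption of this sketch)."""
    return [2 ** random.gauss(log2_mean, log2_sd) for _ in range(n)]


low = simulate(log2_mean=8)    # hypothetical low-abundance protein
high = simulate(log2_mean=16)  # hypothetical high-abundance protein

log_low = [math.log2(x) for x in low]
log_high = [math.log2(x) for x in high]

# 1. Mean-variance relationship: raw SD grows with the mean, log2 SD does not.
raw_sd_low, raw_sd_high = statistics.stdev(low), statistics.stdev(high)
log_sd_low, log_sd_high = statistics.stdev(log_low), statistics.stdev(log_high)

# 2. Skewness: raw values are right-skewed, log2 values are roughly symmetric.
raw_skew, log_skew = skewness(high), skewness(log_high)

# 3. Asymmetry: from 100, doubling adds 100 but halving removes only 50,
#    whereas on the log2 scale the two changes are +1 and -1.
up_raw, down_raw = 200 - 100, 50 - 100
up_log, down_log = math.log2(200 / 100), math.log2(50 / 100)

print(f"raw SD  low/high: {raw_sd_low:.1f} / {raw_sd_high:.1f}")
print(f"log2 SD low/high: {log_sd_low:.3f} / {log_sd_high:.3f}")
print(f"skewness raw/log2: {raw_skew:.2f} / {log_skew:.2f}")
print(f"raw change up/down: {up_raw} / {down_raw}; log2 change: {up_log} / {down_log}")
```

Under this assumption the raw standard deviations differ by orders of magnitude between the two proteins while the log2 standard deviations are about equal, which is the constant-variance behaviour a linear model needs.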
Why do you wish to keep the data non-log transformed anyway? How could that be an advantage? And why are your raw data fold-changes instead of the usual mass spec intensities?
Thank you very much for the thorough explanation.
When you say "value" here, you are referring to fold change values of intensities and not raw mass spec intensities in general, right? It is clear to me why paired fold changes (e.g., if treated = t1, t2 and control = c1, c2, fold change1 = t1/c1, fold change2 = t2/c2) wouldn't be normally distributed. I shouldn't have phrased it as fold change for my case. In my experiment, intensities across samples/columns (e.g., n=10) are divided by constants (e.g., n=10) that represent the relative abundance of the original sample input. Other times, sample intensities are divided row-wise by the row mean intensity or the mean intensity of the controls (e.g., t1/mean(ctrls) or t1/rowmean, etc.). When I have multiple experiments joined by a bridge channel, I divide the intensity of all samples row-wise by that of the bridge channel. In these cases, I am expecting that the distribution wouldn't change. I'm also assuming that mass spec TMT intensities approximately follow a Gaussian distribution. Please correct me if I am mistaken. In such cases, would you say that I can log transform the data and use limma?
I really appreciate your comments. A bunch of us in the lab have had lively discussions because of them, so thank you.
No, I very much mean raw mass spec intensities as well as fold-changes. Raw mass spec intensities cannot possibly be normally distributed.
I am far from clear what you are doing, but ad hoc standardization like this is usually undesirable. It throws away valuable information on the variability of the control samples and it also screws up the global mean-variance relationship (that limma estimates) by putting every protein on a different measurement scale. It is far better to implement a standard limma approach using the complete data, in which samples are compared to controls as part of the linear model.
This is pretty frustrating. You asked me why unlogged values can't be normally distributed and I gave you a number of reasons. You said you appreciate my comments, but you go back to your original assumption, which seems to dismiss everything that I've said.
Note that normality of log-intensities and normality of log-fold-changes are essentially the same thing. If the log-intensities are approximately normal (as we usually assume), then so are any log-fold-changes that you compute from them.
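A minimal sketch of that last point, assuming independent, normally distributed log2 intensities for one treated and one control channel (the means, standard deviation and variable names are hypothetical). The log-fold-change is just the difference of the two log-intensities, so it stays approximately normal, while the unlogged ratio does not.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible
n = 5000


def skewness(xs):
    """Sample skewness: mean of cubed standardized values."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


# Assumption of this sketch: log2 intensities are normal and independent.
log_t = [random.gauss(12.0, 0.4) for _ in range(n)]  # hypothetical treated channel
log_c = [random.gauss(11.0, 0.4) for _ in range(n)]  # hypothetical control channel

# Back-transform to the intensity scale and form the unlogged ratio.
t = [2 ** v for v in log_t]
c = [2 ** v for v in log_c]
fc = [ti / ci for ti, ci in zip(t, c)]

# log2(t / c) equals log2(t) - log2(c) up to floating-point rounding.
log_fc = [math.log2(r) for r in fc]
diff = [a - b for a, b in zip(log_t, log_c)]
max_gap = max(abs(x - y) for x, y in zip(log_fc, diff))

# A difference of independent normals is normal, with variances adding.
expected_sd = math.sqrt(0.4 ** 2 + 0.4 ** 2)

print(f"max |log2(t/c) - (log2 t - log2 c)|: {max_gap:.2e}")
print(f"log-FC mean: {statistics.fmean(log_fc):.3f}")
print(f"log-FC SD: {statistics.stdev(log_fc):.3f} (theory {expected_sd:.3f})")
print(f"skewness of log-FC: {skewness(log_fc):.2f}")
print(f"skewness of unlogged FC: {skewness(fc):.2f}")
```

The same identity covers division by a constant: on the log scale it subtracts a constant, which shifts the values without changing their spread or shape.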
There is a growing literature on statistical analysis of mass spec data. I suggest that you have a look at what others have done and follow an analysis method that is supported by the literature.