Sometimes, in an experiment, I want to use a linear model to describe RNA expression as a function of some continuous variable such as age, dose, or time after treatment. Fitting the model is easy enough, but, as the name suggests, it treats log expression as a linear function of the covariate in question. How do I know that a linear relationship is the correct one? What if the covariate needs to be log-transformed or square-root-transformed, and how would I figure that out? Obviously I could try a bunch of common functions and see which one works "best", but that constitutes data snooping. Simply plotting expression against the covariate of interest might work if there is only one covariate, but it becomes less effective when there are several such covariates.
So, is there a statistically principled way to determine the appropriate transformation for a continuous covariate in a linear model?
Are we ruling out a pilot experiment, or splitting the data so that the snooping and the testing are done on different subsets? I strongly suspect that those are the only really reliable ways to avoid data snooping (assuming no a priori knowledge of what the covariate relationship might reasonably look like).
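To make the subsetting idea concrete, here is a minimal sketch, not a definitive recipe. It assumes a samples x genes matrix of log expression, a hypothetical list of candidate transformations, and total residual sum of squares as the selection score; none of these choices come from the thread, and the data are simulated.

```python
import numpy as np

# Candidate transformations: a hypothetical list chosen for illustration.
CANDIDATES = {"identity": lambda x: x, "log": np.log, "sqrt": np.sqrt}


def total_rss(expr, x):
    """Fit expr[:, g] ~ intercept + x for every gene; return the summed RSS."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    resid = expr - design @ coef
    return float((resid ** 2).sum())


def choose_on_exploration_half(expr, covariate, seed=0):
    """Pick a transformation on half the samples; return it plus the untouched half."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(covariate))
    half = len(order) // 2
    explore, holdout = order[:half], order[half:]
    # All the "snooping" happens on the exploration samples only.
    scores = {
        name: total_rss(expr[explore], f(covariate[explore]))
        for name, f in CANDIDATES.items()
    }
    best = min(scores, key=scores.get)
    return best, holdout


# Simulated data (an assumption): 200 samples x 50 genes, truth is log(dose).
rng = np.random.default_rng(1)
dose = rng.uniform(1, 100, size=200)
slopes = rng.normal(0, 1, size=50)
log_expr = np.outer(np.log(dose), slopes) + rng.normal(0, 0.3, size=(200, 50))

best, holdout = choose_on_exploration_half(log_expr, dose)
print(best, len(holdout))
```

The formal per-gene tests would then be run only on `log_expr[holdout]` against `CANDIDATES[best](dose[holdout])`, at the cost of losing half the samples for testing.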
Yes, I'm asking if there's a way to determine the appropriate transformation from the data itself. Perhaps by discovering the globally optimal transformation across all genes, so that any one gene only contributes a tiny fraction and data snooping is minimized?
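One way to picture the pooled idea is the sketch below. It assumes the same kind of samples x genes log-expression matrix, a hypothetical candidate list, and summed residual sum of squares as the score. The leave-one-gene-out step is my own guess at how to keep each gene out of the choice applied to it, and because genes are correlated it reduces the snooping rather than removing it.

```python
import numpy as np

# Candidate transformations: a hypothetical list chosen for illustration.
CANDIDATES = {"identity": lambda x: x, "log": np.log, "sqrt": np.sqrt}


def per_gene_rss(expr, x):
    """Fit expr[:, g] ~ intercept + x for every gene; return one RSS per gene."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    resid = expr - design @ coef
    return (resid ** 2).sum(axis=0)


def pooled_choice(expr, covariate):
    """Choose one transformation for all genes, plus a leave-one-gene-out choice."""
    names = list(CANDIDATES)
    # rss has shape (n_transformations, n_genes).
    rss = np.vstack([per_gene_rss(expr, CANDIDATES[n](covariate)) for n in names])
    pooled = rss.sum(axis=1)
    overall = names[int(np.argmin(pooled))]
    # For each gene, drop its own RSS from the pooled score before choosing.
    loo = pooled[:, None] - rss
    loo_choice = [names[i] for i in np.argmin(loo, axis=0)]
    return overall, loo_choice


# Simulated data (an assumption): 60 samples x 200 genes, truth is sqrt(age).
rng = np.random.default_rng(2)
age = rng.uniform(1, 100, size=60)
slopes = rng.normal(0, 1, size=200)
log_expr = np.outer(np.sqrt(age), slopes) + rng.normal(0, 0.3, size=(60, 200))

overall, loo_choice = pooled_choice(log_expr, age)
print(overall, set(loo_choice))
```

If the overall and leave-one-gene-out choices agree for every gene, no single gene is driving the selection, which is the property asked about here.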
I suspect the answer will be that the closest you can get is to interpret a PCA plot; the data snooping there is about as low as you're going to get. You might want to post this on Cross Validated and see what the statistics folks think. Hopefully they know of a better option.
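For anyone who wants to try the PCA route, a rough sketch follows. It assumes a samples x genes log-expression matrix and simulated data. The suggestion above is to inspect the plot by eye, so the correlation printed at the end is only a crude numerical stand-in that I have added, and it still looks at the data.

```python
import numpy as np


def pca_scores(expr, n_components=2):
    """PCA via SVD on a samples x genes matrix; returns scores and variance fractions."""
    centered = expr - expr.mean(axis=0)  # center each gene
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    var_frac = (s ** 2) / (s ** 2).sum()
    return scores, var_frac[:n_components]


# Simulated data (an assumption): 150 samples x 100 genes, truth is log(dose).
rng = np.random.default_rng(3)
dose = rng.uniform(1, 100, size=150)
slopes = rng.normal(0, 1, size=100)
log_expr = np.outer(np.log(dose), slopes) + rng.normal(0, 0.3, size=(150, 100))

scores, var_frac = pca_scores(log_expr)

# Plot scores[:, 0] against dose, log(dose), and sqrt(dose) and look for the
# scale on which the trend is straightest; the correlations summarise that.
for name, x in {"identity": dose, "log": np.log(dose), "sqrt": np.sqrt(dose)}.items():
    r = abs(np.corrcoef(scores[:, 0], x)[0, 1])
    print(name, round(float(r), 3))
```

This only helps when the covariate dominates a leading component; if batch effects or other covariates drive PC1, the plot will say little about the transformation.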