I am working with a proteomics dataset and would like to use LIMMA for differential expression analysis. As it was recommended to perform a VOOM transformation I read into the topic and found a way to asses for heteroscedasticity before actually transforming my data without need. I found the lmtest package and applied it on my data (I extracted the expression matrix from my expressionSet and calculated the mean and variance for each protein using dplyr. The resulting df I fed into lm(variance ~ mean, data=before_voom)
Questions:
1) It seems that according to Residual vs fitted I do not have heteroscedasticity, so no voom is necessary before LIMMA - correct?
2) Few proteinIDs are shown as outliers in the plot. Should I remove them from the following expression analysis?
3) The normal Q-Q plot is skewed with an upraising tail. It seems this means that my data are not normally distributed. Do I need to/can I correct for this before using LIMMA?
Thanks a lot for your suggestions! Sebastian
Just a recommendation to ask this on Bioconductor forum and link back here. Gordon Smyth, Limma developer, is more active there (although there is much cross-talk between both forums).
Following on this comment, I see the author of limma has commented about VOOM for spectral count data on Bioconductor here. For TMT data, related Bioconductor discussion here. Sebastian, worth a read if you've not already reviewed.
Always interesting discussions with Aaron, Wolfgang, et al!
Agree with Kevin's suggestion - curious, re (1) are the before-voom values log2FC from SILAC, TMT intensities, Maxquant LFQs, or MS/MS counts? For something that's a linear-count scale, like MS/MS counts I'd be tempted to log2 transform and look at the residuals after that, just for comparison. In these very dense plots with large symbols and compressed scales due to a few outliers, sometimes difficult to see trends clearly. In your variance ~ mean model, did you compute variances including data for each protein that went beyond technical replicates? If so, the variances in your model will include other design-driven variance. I'd offer that what's important for checking heteroscedasticity is the variance of the residuals in your experimental design-fitted model, so if you have experimental factors in your design, you may want to look at lm(y ~ <factors> + error) and look at the variances of residuals in that model vs. fitted values. On (2), I probably wouldn't, but perhaps depends on answer to above question.
Thank you all for your inputs! I made more graphs/calculations and moved the question here