Using LIMMA's removeBatchEffect and PVCA to check for batch influences before and after removal I just noticed that if I don't use log2 values the batch effect removal does not work in the sense that PVCA shows retained batch effects after removal. With log2 values all runs fine.
Is this a general feature of removeBatchEffect?
Its a bit strange as later on, eg for VOOM, non-log2 values are required.
(though to be fair not for the corrected data as those won't go into the diff expression analysis).
There are multiple tools live SVA, COMBAT, edgeR function, limma function, RUVSEq, etc.
PVCA is doing PCA + Variance component analysis to extract factors that influence your variability using linear mixed model
Steps :
So normalized data + PVCA = confounders of batch effects (Assessing and identifying confounders)
Perform correction or adjustments of the confounders extracted from 1 using any standard batcheffect adjustment methods and perform again
PVCA to visualize on normalized-adjusted- log transformed data to see if confounders are assessed and what you see as primary gnee variability is due to biological phenotypic variation.( Viewing post-batch adjustments on all the genes expressed in all samples)
However, for DE analysis it is the model matrix and the model
effects/covariates that will be pulled out from 1(batch effect confounders) . These effects
should be modeled around your count data for any linear model
fitting. Any results of DE should be then viewed in with log transformed/corrected (from 2) data for visualizations like heatmaps, expression box plots, etc. (using confounders as covariates in model design for linear fit with limma on counts data to perform DEA)
My only two cents are, batch effect removal is not the key, one needs to adjust for it rather not deduct it. You are trying to understand what are the confounders in your data and how they mess around. You do it from counts data and any normalization that entails should be used while performing the effect analysis. What you see for plots later can be viewed via log transformation. However, for downstream differential analysis you don't use log2 transformed batch corrected data. One uses the counts data, pulls out the effect information, adds it either as covariates in model design for differential expression or use it for adjustments. I suggest to you take a look at the below links to understand how it is done, what is the underlying statistics and variance associated , and that you do not make any over-fitting.
I agree with all your points and actually follow them, only the one about doing the diff ex analysis with counts instead of log2 values is puzzling me. I am working with DIA derived quantitative proteome data. So far I always put log2 values into LIMMA for diff ex analysis as also voom transformation gives out log2 values to be handed over to LIMMA. (though after testing my data showed no heteroscadacity so I did not use VOOM, however I will post here another question about it today).
Anyhow, so far I always gave LIMMA log2 values, do I understand you correctly that I should use raw (no log2) values instead?
You should be using log2-transformed values with limma.
The reason why ivivek mentioned raw counts is that limma is most often used for RNA-seq data these days and RNA-seq data typically generates integer read counts per gene. limma was originally developed to handle microarray data, which comes in log2 space. In order to make limma work with the count data from RNA-seq data, the authors came up with the voom transformation, which will turn the counts into log2-space data. The paper introducing the voom transformation will give you more details on this if you're interested.
Thanks for chiming in. I agree. I did not realize it was preoteomics data. There are some options other than voom but also preteomics data needs to be defined, and also one should be aware what kind of quantification it is based on. Preprocessing needs to be handled accordingly. As long as you are using voom I am sure the scaling is taken care of at least with my experience in RNASeq. However, my experience in proteomics analysis is very limited. I have mostly worked with normalized preteomics data.
Please find the link that gives some insight. I believe this will shed more light.
As VOOM gave strange results that still showed heteroscedasticity I solved it now using log2 values and the LIMMA option eBayes(fit2_object, trend = F, robust = T)
Thanks a lot for your answer ivivek!
I agree with all your points and actually follow them, only the one about doing the diff ex analysis with counts instead of log2 values is puzzling me. I am working with DIA derived quantitative proteome data. So far I always put log2 values into LIMMA for diff ex analysis as also voom transformation gives out log2 values to be handed over to LIMMA. (though after testing my data showed no heteroscadacity so I did not use VOOM, however I will post here another question about it today).
Anyhow, so far I always gave LIMMA log2 values, do I understand you correctly that I should use raw (no log2) values instead?
You should be using log2-transformed values with limma.
The reason why ivivek mentioned raw counts is that limma is most often used for RNA-seq data these days and RNA-seq data typically generates integer read counts per gene. limma was originally developed to handle microarray data, which comes in log2 space. In order to make limma work with the count data from RNA-seq data, the authors came up with the voom transformation, which will turn the counts into log2-space data. The paper introducing the voom transformation will give you more details on this if you're interested.
Thanks for chiming in. I agree. I did not realize it was preoteomics data. There are some options other than voom but also preteomics data needs to be defined, and also one should be aware what kind of quantification it is based on. Preprocessing needs to be handled accordingly. As long as you are using voom I am sure the scaling is taken care of at least with my experience in RNASeq. However, my experience in proteomics analysis is very limited. I have mostly worked with normalized preteomics data.
Please find the link that gives some insight. I believe this will shed more light.