I want to apply ComBat function in the sva package to an RNA-Seq dataset containing FPKM values. I first added 1 to all counts and then log-transformed the data followed by calling the ComBat function. However, I have no actual zero counts in the cleaned data while there were many zeros in the original data. This is expected since ComBat standardizes the data. All zeros are mapped to values between -0.36 and 4.45 (after exp-transformation and subtracting 1), and there are no exact zeros. However, it is kind of weird to have negative values and also no zero counts in the RNASeq data. So, my question is "what is the best way to use ComBat on RNA-Seq data?". Thanks.
Its natural to have negative values after using Combat.
Some of the posts that could be helpful : https://support.bioconductor.org/p/88522/ Remove Batch Effect From RNAseq with SVAseq and Combat Combat normalization returns negative values
Can you specifify whether you are using counts or FPKM ? You did not mention FPKM after first sentence.
Yes, I am using FPKM. So the values are continuous, but there are also actual zeros in the original data containing FPKMs.
Here is a paper here using log2(FPKM+1) as input for Combat with some discussion. I think there is no need to worry about below zero values, and I will recommend keep using log-transformed values. The real problem with Combat may be whether the prior distribution fits RNA-seq data well (check the plot). Also make sure you only use Combat corrected value for exploratory purpose (PCA, clustering), not for differential expression analysis.
One does have to worry about the negative values from ComBat. Negative values make no sense in RNA-seq (but they make sense in microarray studies).
In the manuscript that you mention, the following is stated:
They then go on to normalise by TMM, as per EdgeR.
In the reviewers' comments, it becomes even more interesting.
Lior Pachter writes:
The text in bold appears to see Pachter admitting that FPKM is neither optimal for differential expression analysis nor clustering.
Mick Watson writes:
Rafael Irizarry writes: