Question

Using ComBat on RNASeq FPKM counts

8.5 years ago

ebrudermanver ▴ 100

I want to apply the ComBat function from the sva package to an RNA-Seq dataset containing FPKM values. I first added 1 to all values, log-transformed the data, and then called ComBat. However, the cleaned data contain no exact zeros, while the original data contained many. This is expected, since ComBat standardizes the data: after back-transformation (exponentiating and subtracting 1), the original zeros are mapped to values between -0.36 and 4.45. Still, it seems odd to have negative values and no zeros in RNA-Seq data. So my question is: what is the best way to use ComBat on RNA-Seq data? Thanks.
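To see why the zeros disappear, here is a minimal Python sketch of a per-batch location/scale adjustment on log2(FPKM + 1) values for a single gene. It is a simplification, not ComBat itself: there is no empirical Bayes shrinkage and no covariate model, and the function name `adjust_gene` and the toy FPKM values are made up for illustration.

```python
import math
import statistics

def adjust_gene(fpkm, batches):
    """Toy per-batch location/scale adjustment for ONE gene.

    Assumption: a simplified stand-in for ComBat (no empirical Bayes
    shrinkage, no covariates). Returns values on the log2 scale.
    Every batch must have non-zero spread, otherwise the division fails.
    """
    logged = [math.log2(v + 1.0) for v in fpkm]
    grand_mean = statistics.fmean(logged)

    # Per-batch mean, population standard deviation and size (log scale).
    stats = {}
    for b in set(batches):
        in_batch = [x for x, bb in zip(logged, batches) if bb == b]
        stats[b] = (statistics.fmean(in_batch),
                    statistics.pstdev(in_batch),
                    len(in_batch))

    # Common target spread: size-weighted average of within-batch variances.
    n = len(logged)
    pooled_sd = math.sqrt(sum(sd ** 2 * k for _, sd, k in stats.values()) / n)

    # Centre and scale each batch, then put it back on the common scale.
    return [(x - stats[b][0]) / stats[b][1] * pooled_sd + grand_mean
            for x, b in zip(logged, batches)]

# Hypothetical FPKM values for one gene in two batches of four samples.
fpkm = [0.0, 0.0, 2.0, 5.0, 0.0, 10.0, 20.0, 40.0]
batches = ["A", "A", "A", "A", "B", "B", "B", "B"]

adjusted_log = adjust_gene(fpkm, batches)
back = [2.0 ** a - 1.0 for a in adjusted_log]            # undo log2(x + 1)
zeros_after = [b for v, b in zip(fpkm, back) if v == 0.0]
print([round(z, 3) for z in zeros_after])
```

In this toy data none of the original zeros comes back as exactly zero: the zeros in the low-expression batch are shifted up, and the zero in the high-expression batch is shifted below zero. Because the back-transform is 2^x - 1, the result can never fall below -1, whatever the adjustment does on the log scale.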

RNA-Seq ComBat • 9.0k views

updated 6.8 years ago by Kevin Blighe 89k • written 8.5 years ago by ebrudermanver ▴ 100

It's natural to have negative values after using ComBat.

Some posts that could be helpful: https://support.bioconductor.org/p/88522/; "Remove Batch Effect From RNAseq with SVAseq and Combat"; "Combat normalization returns negative values".

8.5 years ago by Ron ★ 1.2k

Can you specify whether you are using counts or FPKM? You did not mention FPKM after the first sentence.

8.5 years ago by GZ1995 ▴ 410

Yes, I am using FPKM. So the values are continuous, but there are also actual zeros in the original data containing FPKMs.

8.5 years ago by ebrudermanver ▴ 100

Here is a paper using log2(FPKM+1) as input for ComBat, with some discussion. I think there is no need to worry about below-zero values, and I would recommend continuing to use log-transformed values. The real problem with ComBat may be whether its prior distribution fits RNA-seq data well (check the plot). Also, make sure you only use ComBat-corrected values for exploratory purposes (PCA, clustering), not for differential expression analysis.

8.5 years ago by GZ1995 ▴ 410

One does have to worry about the negative values from ComBat. Negative values make no sense in RNA-seq (but they make sense in microarray studies).

In the manuscript that you mention, the following is stated:

A previous evaluation of normalization methods for RNA-Seq data suggested that FPKM values were not optimal for clustering analysis.

They then go on to normalise by TMM, as per edgeR.

In the reviewers' comments, it becomes even more interesting.

Lior Pachter writes:

For example, one fundamental analysis choice is whether to quantify abundances of genes by summing raw "fragment counts" from alignments to gene regions, or via the summing of abundances as quantified by probabilistic assignment of ambiguously mapped reads. Gilad--Mizrahi-Man cite a paper by Dillies et al. (and the French StatOmique Consortium) suggesting that "FPKM values were not optimal for clustering analysis" to argue for using "fragment counts". I strongly disagree with this choice because transcript abundances are necessary to accurately estimate gene-level abundances, a point that Dillies et al. fail to realize. As pointed out in my own paper on Cufflinks 2 (Trapnell et al. 2012) wrong does not cancel wrong for differential analysis, nor does it for the purpose of clustering

In the text in bold, Pachter appears to admit that FPKM is optimal for neither differential expression analysis nor clustering.

Mick Watson writes:

The authors may also wish to discuss use of FPKM, which may not be the most useful measure of gene expression in this study, as the human and mouse orthologues have different lengths.

Rafael Irizarry writes:

I am not sure what it means in biology when points are log (FPKM + 1) values for thousands of genes

6.8 years ago by Kevin Blighe 89k

score 2 · Answer 1 · 2018-08-12

Log-transforming FPKM values does not make things better. Moreover, combining ComBat with FPKM data is akin to throwing your data in the trash and testing noise.

You should abandon RPKM / FPKM. They are not ideal where cross-sample differential expression analysis is your aim; indeed, they render samples incomparable for differential expression analysis.

Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis

The Total Count and RPKM [FPKM] normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.

Also, by Harold Pimentel: What the FPKM? A review of RNA-Seq expression units

The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.
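The "relative measurement" point can be illustrated with a small Python sketch. It assumes a toy model in which reads are drawn in proportion to molecule count times transcript length at a fixed sequencing depth; the gene names, molecule counts and the helper names `simulate_reads` and `fpkm` are hypothetical. FPKM is computed with the usual definition, fragments × 10^9 / (length in bp × total mapped fragments).

```python
def simulate_reads(molecules, lengths, depth):
    """Toy model (assumption): expected reads per gene are proportional
    to molecules * length, scaled so that they sum to `depth`."""
    weights = {g: molecules[g] * lengths[g] for g in molecules}
    total = sum(weights.values())
    return {g: depth * weights[g] / total for g in molecules}

def fpkm(reads, lengths):
    """FPKM = fragments * 1e9 / (gene length in bp * total fragments)."""
    total = sum(reads.values())
    return {g: reads[g] * 1e9 / (lengths[g] * total) for g in reads}

lengths = {"A": 1000, "B": 1000, "C": 1000}

# Gene A has the same number of molecules in both samples;
# only gene C changes (10-fold higher in sample 2).
sample1 = {"A": 100, "B": 100, "C": 100}
sample2 = {"A": 100, "B": 100, "C": 1000}

f1 = fpkm(simulate_reads(sample1, lengths, depth=1_000_000), lengths)
f2 = fpkm(simulate_reads(sample2, lengths, depth=1_000_000), lengths)

print(f1["A"], f2["A"])
```

Under these assumptions gene A's FPKM is four times lower in sample 2 even though its molecule count is unchanged, because gene C takes a larger share of the same sequencing depth. Between-sample normalization methods such as TMM are designed to address this kind of composition effect.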