Question

which matrix should be used to draw heatmap in RNAseq?

3

Entering edit mode

6.0 years ago

pbigbig ▴ 250

Hi everyone,

I have some confusion about which type of expression matrices should I use for heatmap visualization of RNAseq data. There are 3 options listed below: (raw count matrix was obtained from featureCounts, TMM cross-sample normalization is performed by edgeR)

TMM-normalized raw count
TPM value calculated from raw count
TPM value calculated from raw count, then TMM-normalized

Additionally, should I use log2(x+1) transformation before row-scaling when drawing heatmap? Because in some cases, I saw that row-scaling was enough to signify the difference. I am new to this, so any detail explanation is highly appreciated.

Thank you very much in advance!

RNAseq heatmap normalization • 6.9k views

ADD COMMENT • link updated 3.2 years ago by Thind amarinder ▴ 340 • written 6.0 years ago by pbigbig ▴ 250

score 8 · Answer 1 · 2019-05-31

In what follows, I'm assuming that you want to have genes in rows, and samples in columns. The answer might be different if this is not the case, hopefully for reasons that will make sense after the following....

Both TMM and TPM proceedures include a step to normalise for the difference between samples in an attempt to make the measurement for a given gene comparable between samples. However, TMM does a better job of this.

TPM also includes steps that attempt to normalize expression values such that they are comparable between two different genes WITHIN one sample (e.g. is Gene A or Gene B more highly expressed).

Normally the recommendation, if you have to choose between counts and TPM is to choose TPM (or TPM caluculated from TMM-normalised counts). But if you plan to do row normalisation, then this will undo the TPM transformation anyway.

However, as hinted at in your final question, there is another transformation that needs to be considered: variance stabilisation. Log2 is often used as variance stabilising transform in many fields, but because we deal with a lot of zeros, it is often not suitable. One solution is to add a pseudo-count - this both further stabilises the variance, and deals with the zeros problem, but the choice of + 1 is pretty arbitrary. Luckily, there are more sophisticated alternatives, the most common being regularized log and vst both provided by DESeq2. These transforms will also deal with normalising raw counts in a manner similar to the TMM normalization of edgeR.

A final alternative, if you wish to stay in the edgeR universe, is limma.voom which will take an edgeR object and apply transforms so that its variance is somewhat stabilised, but I know less about that.

score 1 · Answer 2 · 2019-05-31

Depends on why do you need the visualization, right? Visualization can be done to explore the data, or to make a point in the publication etc

If you want to explore the data, you can also try and specify for yourself, what exactly you want to find out - specific genes? Certain pathways? all this matters a lot

Various normalizations and transformations can be quite useful, but they also distort the original data.