which matrix should be used to draw heatmap in RNAseq?
5.6 years ago
pbigbig ▴ 250

Hi everyone,

I am confused about which type of expression matrix I should use for heatmap visualization of RNA-seq data. There are 3 options, listed below. (The raw count matrix was obtained from featureCounts; TMM cross-sample normalization was performed with edgeR.)

  1. TMM-normalized counts

  2. TPM values calculated from the raw counts

  3. TPM values calculated from the raw counts, then TMM-normalized

Additionally, should I apply a log2(x+1) transformation before row-scaling when drawing the heatmap? In some cases I have seen that row-scaling alone was enough to bring out the differences. I am new to this, so any detailed explanation is highly appreciated.

Thank you very much in advance!

RNAseq heatmap normalization • 6.4k views
5.6 years ago

In what follows, I'm assuming that you want to have genes in rows and samples in columns. The answer might be different if this is not the case, hopefully for reasons that will make sense after reading the rest of this answer.

Both the TMM and TPM procedures include a step that normalises for differences between samples, in an attempt to make the measurement for a given gene comparable between samples. However, TMM does a better job of this.

TPM also includes steps that attempt to normalize expression values such that they are comparable between two different genes WITHIN one sample (e.g. is Gene A or Gene B more highly expressed).
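To make the within-sample part concrete, here is a minimal sketch of the usual TPM calculation for a single sample. The counts and gene lengths are made up, and the function name `tpm` is just for illustration (it is not from edgeR or any other package); it also assumes at least one gene has a non-zero count.

```python
def tpm(counts, lengths_kb):
    """TPM for ONE sample.

    counts     -- read counts per gene (made-up values below)
    lengths_kb -- gene lengths in kilobases, same order as counts
    """
    # Step 1: reads per kilobase, which removes the gene-length effect
    rates = [c / length for c, length in zip(counts, lengths_kb)]
    # Step 2: rescale so the sample sums to one million
    total = sum(rates)  # assumes total > 0
    return [r / total * 1e6 for r in rates]


counts = [100, 100, 800]       # hypothetical genes A, B, C
lengths_kb = [1.0, 2.0, 4.0]   # hypothetical gene lengths
values = tpm(counts, lengths_kb)
```

Genes A and B have the same count, but A is half as long, so its TPM comes out twice as high. That is the "comparable between genes within one sample" property.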

Normally the recommendation, if you have to choose between counts and TPM, is to choose TPM (or TPM calculated from TMM-normalised counts). But if you plan to do row normalisation, this will undo the gene-length part of the TPM transformation anyway.
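The point about row normalisation can be checked numerically. The sketch below uses made-up numbers and assumes "row normalisation" means a per-gene z-score, which is what pheatmap's scale="row" does. Dividing a gene's row by its length is a constant factor within that row, so it cancels out after z-scoring. Note that this only covers the gene-length factor; the per-sample scaling factor in TPM is a column effect and is not removed by row scaling.

```python
from statistics import mean, stdev


def row_zscore(row):
    """Centre a gene's values on 0 and scale to unit sample SD."""
    m, s = mean(row), stdev(row)
    return [(x - m) / s for x in row]


counts_row = [10.0, 20.0, 60.0]  # one hypothetical gene across three samples
gene_length_kb = 2.5             # hypothetical gene length

# Length-corrected values: the same row divided by a constant
length_scaled = [x / gene_length_kb for x in counts_row]

z_raw = row_zscore(counts_row)
z_len = row_zscore(length_scaled)
# z_raw and z_len agree up to floating-point error
```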

However, as hinted at in your final question, there is another transformation to consider: variance stabilisation. Log2 is often used as a variance-stabilising transform in many fields, but because RNA-seq data contain a lot of zeros, it is often not suitable on its own. One solution is to add a pseudo-count - this both further stabilises the variance and deals with the zeros problem, but the choice of +1 is pretty arbitrary. Luckily, there are more sophisticated alternatives, the most common being the regularized log (rlog) and the variance-stabilising transformation (vst), both provided by DESeq2. These transforms also normalise the raw counts for library size, in a manner similar to the TMM normalisation of edgeR.
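As a rough illustration of why the pseudo-count is arbitrary, the sketch below applies log2(x + c) to one made-up gene with two different values of c. The helper `log2_pseudo` is hypothetical, not a function from any package. The difference between the two low-count samples changes noticeably with c, while the difference between the two high-count samples barely moves.

```python
import math


def log2_pseudo(counts, pseudo=1.0):
    """log2 transform after adding a pseudo-count to avoid log2(0)."""
    return [math.log2(c + pseudo) for c in counts]


gene = [0, 4, 1000, 2000]  # made-up counts for one gene in four samples

with_one = log2_pseudo(gene, pseudo=1.0)
with_half = log2_pseudo(gene, pseudo=0.5)

# Low counts: 0 vs 4
low_diff_one = with_one[1] - with_one[0]     # log2(5 / 1), about 2.32
low_diff_half = with_half[1] - with_half[0]  # log2(4.5 / 0.5), about 3.17

# High counts: 1000 vs 2000, close to 1 for either pseudo-count
high_diff_one = with_one[3] - with_one[2]
high_diff_half = with_half[3] - with_half[2]
```

So the pseudo-count mostly decides how much weight low-count genes get in the heatmap, which is part of why rlog or vst are often preferred.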

A final alternative, if you wish to stay in the edgeR universe, is voom from the limma package, which will take an edgeR DGEList object and apply transforms so that its variance is somewhat stabilised, but I know less about that.

Thank you very much for your comprehensive answer!

So as I understand it, in terms of the expression matrix, TPM is meant for comparisons within a column (between genes in one sample) and TMM for comparisons within a row (one gene across samples). I think scaling by row will only benefit those who are interested in clusters of genes that are highly or lowly expressed in a relative sense (high or low compared to the same gene in other samples). Clustering on a matrix without row scaling may give more informative clusters, I suppose.

Depends on your distance metric. Euclidean distance on a row-scaled matrix is roughly equivalent to Pearson distance on a non-scaled matrix.
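A small check of that equivalence, with made-up values for two genes. It assumes row scaling is a z-score using the sample standard deviation (n - 1 in the denominator) and that "Pearson distance" means 1 - r. Under those assumptions the squared Euclidean distance between two scaled rows is exactly 2(n - 1)(1 - r), so both distances rank gene pairs in the same order.

```python
import math
from statistics import mean, stdev


def zscore(row):
    """Row scaling: centre on 0, scale to unit sample SD."""
    m, s = mean(row), stdev(row)
    return [(x - m) / s for x in row]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length rows."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


gene_a = [1.0, 2.0, 4.0, 8.0]   # hypothetical expression across 4 samples
gene_b = [3.0, 3.5, 9.0, 12.0]
n = len(gene_a)

d = euclidean(zscore(gene_a), zscore(gene_b))  # distance on scaled rows
r = pearson(gene_a, gene_b)                    # correlation on unscaled rows

# Identity being illustrated: d**2 == 2 * (n - 1) * (1 - r)
lhs, rhs = d ** 2, 2 * (n - 1) * (1 - r)
```

This only holds for distances between rows (genes). Distances between columns (samples) on a row-scaled matrix are a different matter.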

How about DESeq2-normalized counts with 2 different scaling methods? I obtained the counts with: normalized_counts <- counts(dds, normalized=TRUE)

The heatmaps are shown below. Row scaling through pheatmap does not seem to work well on the normalized counts. I compared CPM(normalized_counts) with normalized_counts, each with and without log2 transformation and pheatmap row scaling.

[Figures: heatmaps of (1) normalized data; (2) CPM on normalized counts with pheatmap row scaling; (3) CPM on normalized counts with log2 scaling; (4) log2 on normalized counts (no CPM); (5) normalized counts with pheatmap row scaling; (6) CPM on normalized counts without scaling]

5.6 years ago
predeus ★ 2.1k

It depends on why you need the visualization, right? Visualization can be done to explore the data, or to make a point in a publication, etc.

If you want to explore the data, you can also try to specify for yourself what exactly you want to find out: specific genes? Certain pathways? All of this matters a lot.

Various normalizations and transformations can be quite useful, but they also distort the original data.
