Question

Scaling RNA-Seq data before clustering?

3

Entering edit mode

4.5 years ago

im ▴ 30

Hi, should RNA-Seq data be scaled before creating a heatmap using hierarchical clustering? Should I use the scale argument in the heatmap function itself? And if scaling, should I scale by row or by column? I can't wrap my head around what each of those would mean.

Thank you!

RNA-Seq Clustering Hierarchical clustering Scaling • 7.8k views

ADD COMMENT • link updated 3.1 years ago by ATpoint 87k • written 4.5 years ago by im ▴ 30

score 12 · Answer 1 · 2020-10-10

tl;dr => Clustering based on logcounts emphasizes grouping by expression level while clustering based on scaled logcounts focuses on clustering by differences which is what is often desired.

In detail:

Scaling commonly means transforming data to the Z-scale. It means that for every gene we do not report e.g. the read count but its deviation from the mean of all samples for that gene. That would be a scaling per row. I could not immediately what a scaling by column would be good for. Anyway, you can scale prior to heatmap/clustering but you don't have to, it depends on what you want to show. For showing relative expression differences you typically do it as then the heatmap is focused on the relative difference of each sample from all other samples for every gene. It also has the advantage that all genes are more or less on the same scale while read counts have large span. Genes that are long and/or highly-expressed would dominate the heatmap. If you want to show absolute differences, so whether a group of genes is more highly-expressed than another, then don't scale, use e.g. log2-transformed values. In any case you typically log2-transform the normalized read counts prior to do anything with them, be it heatmapping/clustering or Z-scaling.

You can use in-build functions of tools or simple do t(scale(t(your.log2matrix))) to transform a matrix to the Z-scale. Does that make sense to you?

Example using the data from the airway package:

library(airway)
library(edgeR)
library(ComplexHeatmap)
data(airway)

# standard edgeR diff analysis, see its manual
y <- DGEList(counts = assay(airway, "counts"),
             group  = airway$dex)
y <- y[filterByExpr(y),,keep.lib.size=FALSE]
y <- calcNormFactors(y)
design <- model.matrix(~group, data = y$samples)
y <- estimateDisp(y, design = design)
tt <- topTags(glmQLFTest(glmQLFit(y,design=design), coef=2), n = Inf)
DEGs <- rownames(tt[tt$table$FDR<0.05,])

# extract the log read counts of the DEGs (here the top 100 so the heatmap does not get too big):
cts <- head(cpm(y, log = TRUE)[rownames(y) %in% DEGs,], 100)

#/ Heatmap with logcounts (unscaled)
hm1 <- Heatmap(matrix = cts,
               cluster_rows = TRUE,
               clustering_method_rows = "ward.D2",
               cluster_columns = TRUE,
               clustering_method_columns = "ward.D2",
               show_row_names = FALSE, 
               show_column_names = FALSE,
               column_title = "logcounts unscaled")

hm2 <- Heatmap(matrix = t(scale(t(cts))),
               cluster_rows = TRUE,
               clustering_method_rows = "ward.D2",
               cluster_columns = TRUE,
               clustering_method_columns = "ward.D2",
               show_row_names = FALSE,
               show_column_names = FALSE,
               column_title = "logcounts scaled")

See the heatmaps below. The unscaled one is not very informative as the large range of the logcpms does not really allow to draw any meaningful conclusions. One can barely distinguish the upregulated from the downregulated genes. In contrast, the scaled data clearly show a pattern.

enter image description here