Scaling RNA-Seq data before clustering?
1
2
Entering edit mode
4.1 years ago
im ▴ 20

Hi, should RNA-Seq data be scaled before creating a heatmap using hierarchical clustering? Should I use the scale argument in the heatmap function itself? And if scaling, should I scale by row or by column? I can't wrap my head around what each of those would mean.

Thank you!

RNA-Seq Clustering Hierarchical clustering Scaling • 6.9k views
ADD COMMENT
12
Entering edit mode
4.1 years ago
ATpoint 85k

tl;dr => Clustering based on logcounts emphasizes grouping by expression level while clustering based on scaled logcounts focuses on clustering by differences which is what is often desired.

In detail:

Scaling commonly means transforming data to the Z-scale. It means that for every gene we do not report e.g. the read count but its deviation from the mean of all samples for that gene. That would be a scaling per row. I could not immediately what a scaling by column would be good for. Anyway, you can scale prior to heatmap/clustering but you don't have to, it depends on what you want to show. For showing relative expression differences you typically do it as then the heatmap is focused on the relative difference of each sample from all other samples for every gene. It also has the advantage that all genes are more or less on the same scale while read counts have large span. Genes that are long and/or highly-expressed would dominate the heatmap. If you want to show absolute differences, so whether a group of genes is more highly-expressed than another, then don't scale, use e.g. log2-transformed values. In any case you typically log2-transform the normalized read counts prior to do anything with them, be it heatmapping/clustering or Z-scaling.

You can use in-build functions of tools or simple do t(scale(t(your.log2matrix))) to transform a matrix to the Z-scale. Does that make sense to you?

Example using the data from the airway package:

library(airway)
library(edgeR)
library(ComplexHeatmap)
data(airway)

# standard edgeR diff analysis, see its manual
y <- DGEList(counts = assay(airway, "counts"),
             group  = airway$dex)
y <- y[filterByExpr(y),,keep.lib.size=FALSE]
y <- calcNormFactors(y)
design <- model.matrix(~group, data = y$samples)
y <- estimateDisp(y, design = design)
tt <- topTags(glmQLFTest(glmQLFit(y,design=design), coef=2), n = Inf)
DEGs <- rownames(tt[tt$table$FDR<0.05,])

# extract the log read counts of the DEGs (here the top 100 so the heatmap does not get too big):
cts <- head(cpm(y, log = TRUE)[rownames(y) %in% DEGs,], 100)

#/ Heatmap with logcounts (unscaled)
hm1 <- Heatmap(matrix = cts,
               cluster_rows = TRUE,
               clustering_method_rows = "ward.D2",
               cluster_columns = TRUE,
               clustering_method_columns = "ward.D2",
               show_row_names = FALSE, 
               show_column_names = FALSE,
               column_title = "logcounts unscaled")

hm2 <- Heatmap(matrix = t(scale(t(cts))),
               cluster_rows = TRUE,
               clustering_method_rows = "ward.D2",
               cluster_columns = TRUE,
               clustering_method_columns = "ward.D2",
               show_row_names = FALSE,
               show_column_names = FALSE,
               column_title = "logcounts scaled")

See the heatmaps below. The unscaled one is not very informative as the large range of the logcpms does not really allow to draw any meaningful conclusions. One can barely distinguish the upregulated from the downregulated genes. In contrast, the scaled data clearly show a pattern.

enter image description here

ADD COMMENT

Login before adding your answer.

Traffic: 1704 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6