Question

Gene-based clustering of RNA-seq data

2

Entering edit mode

7.1 years ago

IrK ▴ 100

Hey everyone,

I am trying to understand the gene-based clustering, and what strategy can be applied to see what genes fall into the same cluster based on their expression profile.

I have RNA-seq data that has been analyzed with DESeq2 package for differentially expressed genes. I selected rlog normalization as advised at "the RNA-seq workflow", (from 11 April 2018), due to a small dataset size (3KOs vs 3WTs). I'd like to explore my data more and to use gene-based clustering to see what genes are fall into the same cluster. Section 6.3 of "the RNA-seq workflow", suggest visualization of 20 genes with the highest variance across.

I'd like to cluster a list of 4000 genes, so I am wondering if it is acceptable to use the same strategy
Also, I am trying to understand how a dramatic difference in signal between WT and KO experiments (pls see Fig.dummy example, row two) would affect clustering process? Basically, can I use matrix with KO and WT samples as an input for clustering, to see what genes fall into the same cluster group? KO and WT are different type of data that gives us an idea if there is change in KO experiment or not. Therefore, I am bit confused how would it effect the clustering.
Or, alternatively, could logFC values (results from DESeq2) be used as input for clustering, so we see what genes have similar expression profile?

Fig. dummy example:

KO1 KO2 KO3 WT1 WT2 WT3
2.2 3.2 3.8 4   4.5 4.6
1.7 1.8 1.4 10.3    11.2    10.9
2.2 3.2 3.8 4   4.5 4.6

I guess I have a confusion of using both KO and WT as an input for clustering, and how reliable are results from this kind of matrix.

Would appreciate any help in making these questions more clear. Thank you!

DESeq2 gene-based clustering clustering RNA-Seq • 7.2k views

ADD COMMENT • link updated 4.0 years ago by Kevin Blighe 89k • written 7.1 years ago by IrK ▴ 100

score 7 · Answer 1 · 2018-06-20

Hello and good morning Down Under,

You must be referring to this workflow: RNA-seq workflow: gene-level exploratory analysis and differential expression. The part on gene clustering in that workflow is only a very simple example and would only be the tip of the iceberg in terms of what you can do with clustering. Clustering the top variable genes is not something that I regularly do - if I need to examine the variance of genes, I usually leave that to principal component analysis (PCA), which is fundamentally based on variance / covariance. Most people use PCA for the detection of outliers, but one can go a lot further into it and, for example, examine which genes are contributing most variance along a particular axis / principal component (PC).

The typical procedure is to perform differential expression analysis, set a cut-of for statistical significance, and then perform hierarchical clustering (and generate a heatmap) with the genes that pass your set thresholds for statistical significance. One usually uses both cases and controls (or wild-types and knock-outs) for this purpose such that one can see how the genes segregate these groups. I have posted plenty of code on various Biostars threads about how you can generate these heatmaps and perform hierarchical clustering:

Typical packages that people use are:

heatmap.2
ComplexHeatmap
pheatmap

As you are using rlog (regularised log) expression levels, you should be using either Euclidean Distance or 1 minus Pearson / Spearman correlation distance (given your low sample number, you should use Spearman if you are going to use any type of correlation distance). For the linkage method, you could use any of complete, average, Ward's ('ward.D2'), etc.

Depending on how your samples are segregated in the dendrogram, you can then go back and re-set your thresholds for statistical significance and then repeat the hierarchical clustering.

Some clarification because it always comes up:

in RNA-seq, the statistical inferences are made on the normalised counts, which follow the negative binomial distribution. The plotting and other downstream functions are then performed on the transformed expression levels, usually rlog, variance-stabilised, logged (natural or log2), or Z-scaled. The reason for this is that most plotting and downstream functions expect data to follow the binomial / normal distribution, which negative binomial counts do not follow.

From what I understand about plotting log2 fold changes is that the hierarchical clustering is still performed on the transformed expression levels (e.g. rlog) but, in the heatmap, the colour shading is based on log2 fold changes. To do this, you would have to create dendrogram objects outside of the heatmap function and then supply these independently to the heatmap function itself. The log2 fold changes themselves would be passed as the x object to the heatmap function. When doing this, people typically set custom 'breaks' at certain log2 fold change cut-offs.

There are likely many variations on how to plot log2 fold changes in a heatmap, though.

Hope that this helps.

Kevin