Question

Get the gene functions specific for clusters (RNA seq)

15 months ago

Diana ▴ 10

I am trying to reproduce the transcriptome clustering from this article: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4580370/

The result I mean is panel A of this figure from the article: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4580370/figure/F5/

I can't understand at all, how they got the annotations for the gene functions? As I understand it, there is no differential expression analysis: the input is RSEM data that was log2 transformed and median centered, restricted to the 1,500 most variable genes as defined by MAD (median absolute deviation).

I did the clustering with ConsensusClusterPlus, if that matters...

rnaseq heatmap clustering • 859 views

updated 15 months ago by jv ★ 1.8k • written 15 months ago by Diana ▴ 10

score 2 · Answer 1 · 2024-01-29

"The 1,500 most variably expressed"

My interpretation of the manuscript is that these 1,500 genes were selected by ranking genes by their expression variation across samples, which is not the same as differential expression analysis. For example, the most variable gene shows the widest spread in expression across all samples, whereas a non-variable gene shows little difference between samples. This is a common way to select genes for unsupervised clustering, because it focuses on the genes that carry the most information about differences between samples.
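To make the ranking step concrete, here is a minimal sketch in plain Python of selecting the most variable genes by MAD after a log2 transform and per-gene median centering, as described in the question. It is not the paper's pipeline: the `+ 1` pseudocount, the function names, and the toy gene names and values are all assumptions for illustration.

```python
import math
import statistics


def log2_median_center(expr):
    """expr maps gene name -> list of expression values (one per sample).

    Returns log2(x + 1) values with each gene's median subtracted.
    The +1 pseudocount is an assumption, used here to avoid log2(0).
    """
    centered = {}
    for gene, values in expr.items():
        logged = [math.log2(v + 1.0) for v in values]
        med = statistics.median(logged)
        centered[gene] = [x - med for x in logged]
    return centered


def mad(values):
    """Median absolute deviation of a list of numbers."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def top_variable_genes(expr, n):
    """Return the n gene names with the largest MAD across samples."""
    ranked = sorted(expr, key=lambda gene: mad(expr[gene]), reverse=True)
    return ranked[:n]


# Toy data with hypothetical gene names (not taken from the paper).
toy = {
    "GENE_A": [0, 3, 15, 63, 255],  # varies a lot across samples
    "GENE_B": [7, 7, 7, 7, 7],      # constant across samples
    "GENE_C": [1, 3, 3, 7, 7],      # varies a little
}
centered = log2_median_center(toy)
selected = top_variable_genes(centered, 2)
print(selected)
```

No group labels or p-values are involved: the ranking only uses the spread of each gene across all samples. With real data you would call `top_variable_genes(centered, 1500)`.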

"I can't understand at all, how they got the annotations for the gene functions"

This is essentially an over-representation analysis: given the genes in cluster x, is there a predefined pathway, gene set, or custom list of genes that is over-represented (i.e., enriched) among them? This is a very common analysis method. According to the paper, the gene lists defining the MITF-low, Keratin, and Immune clusters are provided in Tables S4A and S4B.
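As an illustration of what an over-representation test computes, here is a minimal sketch using a one-sided hypergeometric test in plain Python. Which test, gene-set database, and background universe the authors actually used is not stated in this thread, so the test choice, the function names, and the toy gene sets below are assumptions.

```python
from math import comb


def hypergeom_pvalue(k, cluster_size, set_size, universe_size):
    """P(X >= k): chance of seeing at least k gene-set members when drawing
    cluster_size genes without replacement from a universe that contains
    set_size gene-set members."""
    upper = min(cluster_size, set_size)
    total = comb(universe_size, cluster_size)
    tail = sum(
        comb(set_size, i) * comb(universe_size - set_size, cluster_size - i)
        for i in range(k, upper + 1)
    )
    return tail / total


def enrichment(cluster_genes, gene_sets, universe):
    """Test every gene set against one cluster; smallest p-value first."""
    universe = set(universe)
    cluster = set(cluster_genes) & universe
    results = []
    for name, members in gene_sets.items():
        members = set(members) & universe
        overlap = cluster & members
        p = hypergeom_pvalue(
            len(overlap), len(cluster), len(members), len(universe)
        )
        results.append((name, len(overlap), p))
    return sorted(results, key=lambda r: r[2])


# Toy example: 20 hypothetical genes, one cluster, two made-up gene sets.
universe = [f"g{i}" for i in range(20)]
cluster = ["g0", "g1", "g2", "g3", "g4"]
gene_sets = {
    "immune_like": ["g0", "g1", "g2", "g3", "g10"],  # 4 of 5 in the cluster
    "keratin_like": ["g5", "g6", "g7", "g8", "g9"],  # none in the cluster
}
results = enrichment(cluster, gene_sets, universe)
for name, overlap, p in results:
    print(name, overlap, p)
```

In practice the p-values would also be corrected for multiple testing across gene sets, and the choice of universe (all measured genes versus only the genes used for clustering) changes the result, so it should be reported.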

score 1 · Answer 2 · 2024-01-28

I may be completely wrong here, but this is my guess:

It looks like they may have done a differential expression analysis. Supplemental-2, which I think you were reading as well, says: "The 1,500 most variably expressed genes were selected and used for consensus average linkage hierarchical clustering (GDAC Firehose AWG, http://gdac.broadinstitute.org/runs/awg_skcm__2014_02_23)"

I am guessing they did a differential gene expression analysis using some tool (that they didn't mention), sorted the genes by p-value from smallest to largest, and then took the top 1,500. They called these "variably" expressed genes, by which I imagine they meant "differentially" expressed genes.

Then, I am guessing, they did hierarchical clustering using the RSEM count data (log transformed and median centered) for just these 1,500 variably/differentially expressed genes. They probably "sliced" the dendrogram at some level that let them "capture the genes" in the distinct-looking clusters on the heatmap, and then passed the hierarchical clustering to a heatmap from ComplexHeatmap to make the initial heatmap (see this StackOverflow question, which does this: https://stackoverflow.com/questions/77085300/getting-different-hierarchical-clustering-in-complexheatmap-for-the-same-method )
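To show what "slicing" a hierarchical clustering into groups of genes could look like, here is a minimal sketch of average-linkage agglomerative clustering in plain Python, stopped once k clusters remain. This is a single clustering pass, not the consensus (resampled) clustering that ConsensusClusterPlus performs; the Euclidean distance, the function name, and the toy profiles are assumptions.

```python
import math
from itertools import combinations


def average_linkage_clusters(profiles, k):
    """Agglomerative clustering with average linkage, stopped at k clusters.

    profiles maps a name -> list of values (e.g. one gene's centered
    expression across samples). Returns a sorted list of sorted name lists.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    names = list(profiles)
    # Pairwise Euclidean distances between individual profiles.
    dist = {
        frozenset((a, b)): math.dist(profiles[a], profiles[b])
        for a, b in combinations(names, 2)
    }
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Average linkage: mean distance over all cross-cluster pairs.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Find and merge the two closest clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Toy profiles with hypothetical gene names: two tight pairs and an outlier.
toy_profiles = {
    "G1": [0.0, 0.0],
    "G2": [0.0, 1.0],
    "G3": [10.0, 10.0],
    "G4": [10.0, 11.0],
    "G5": [20.0, 0.0],
}
groups = average_linkage_clusters(toy_profiles, 3)
print(groups)
```

Each resulting group of genes is what you would then pass, one group at a time, to an enrichment tool. For 1,500 genes a library implementation would be much faster than this naive loop.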

Then, I am guessing, they ran those genes, group by group, through some sort of gene enrichment database/tool to get the annotations for gene functions, and finally labeled the heatmap using maybe a variation of this: https://jokergoo.github.io/ComplexHeatmap-reference/book/heatmap-annotations.html#draw-textbox-annotation