This is certainly a tricky problem. My favourite method for exploring what's happening with individual samples is GSVA. GSVA takes your expression matrix (genes x samples) and a set of gene sets (pathways, functions, whatever you like), and calculates a score that summarises the geneset for a given sample. You end up with a matrix genesets x samples.
This approach has benefits over gene x sample heatmaps by associating functional relevance to your genesets, and possibly offering more insightful views into sample population heterogeneity.
Resources
Use Case I - Interested in a few Pathways and Genesets
Sometimes there's a critical pathway(s) that investigators are interested that yielded the generation of this experiment.
- Make sure you've got your expression matrix (I'd recommend Salmon quantifications from a Gencode reference, imported using tximport, and processed using the Limma-Voom workflow)
- Grab a your genesets of interest and format them as a named list object where the elements are a vector containing the same identifier schema as your expression matrix. So on the example above make sure they're using Ensembl IDs. Subset this for the genesets you're interested in.
- Run the GSVA function with the above inputs
- Make this into a heat map using your favourite library (ggplot2 for example) and tack on some phenotypic annotation to interpret
Use Case II - I've no idea what I'm interested in
Often you'd like to see what functional elements are statistically relevant against a large reference set
- Make sure you've got your expression matrix (see above)
- Prepare your geneset list object (see above, just keep all genesets)
- Run the GSVA function with the above inputs
- Fit a model to the GSVA matrix (see example below)
- Grab statistically relevant genesets, and subset GSVA matrix
- Make heat map
Coded Example
Taken from the GSVA function manual
library(limma)
p <- 10 ## number of genes
n <- 30 ## number of samples
nGrp1 <- 15 ## number of samples in group 1
nGrp2 <- n - nGrp1 ## number of samples in group 2
## consider three disjoint gene sets
geneSets <- list(set1=paste("g", 1:3, sep=""),
set2=paste("g", 4:6, sep=""),
set3=paste("g", 7:10, sep=""))
## sample data from a normal distribution with mean 0 and st.dev. 1
y <- matrix(rnorm(n*p), nrow=p, ncol=n,
dimnames=list(paste("g", 1:p, sep="") , paste("s", 1:n, sep="")))
## genes in set1 are expressed at higher levels in the last 10 samples
y[geneSets$set1, (nGrp1+1):n] <- y[geneSets$set1, (nGrp1+1):n] + 2
## build design matrix
design <- cbind(sampleGroup1=1, sampleGroup2vs1=c(rep(0, nGrp1), rep(1, nGrp2)))
## fit linear model
fit <- lmFit(y, design)
## estimate moderated t-statistics
fit <- eBayes(fit)
## genes in set1 are differentially expressed
topTable(fit, coef="sampleGroup2vs1")
## estimate GSVA enrichment scores for the three sets
gsva_es <- gsva(y, geneSets, mx.diff=1)$es.obs
## fit the same linear model now to the GSVA enrichment scores
fit <- lmFit(gsva_es, design)
## estimate moderated t-statistics
fit <- eBayes(fit)
## set1 is differentially expressed
topTable(fit, coef="sampleGroup2vs1")