Hey,
The DESeq2 PCA function will [I believe] convert a continuous variable to categorical, which is likely not what you want. This is fine, as DESeq2 is primarily about normalising RNA-seq data and conducting differential expression.
In any case, you can do this with my own package, PCAtools, which should fit nicely with your DESeq2 variance-stabilised expression values and metadata. Here is a reproducible example, taken from my vignette: PCAtools: everything Principal Component Analysis:
1, load breast cancer data from GEO
library(Biobase)
library(GEOquery)
# load series and platform data from GEO
gset <- getGEO('GSE2990', GSEMatrix = TRUE, getGPL = FALSE)
x <- exprs(gset[[1]])
# remove Affymetrix control probes
x <- x[-grep('^AFFX', rownames(x)),]
# extract information of interest from the phenotype data (pdata)
idx <- which(colnames(pData(gset[[1]])) %in%
c('age:ch1', 'distant rfs:ch1', 'er:ch1',
'ggi:ch1', 'grade:ch1', 'size:ch1',
'time rfs:ch1'))
metadata <- data.frame(pData(gset[[1]])[,idx],
row.names = rownames(pData(gset[[1]])))
# tidy column names
colnames(metadata) <- c('Age', 'Distant.RFS', 'ER', 'GGI', 'Grade',
'Size', 'Time.RFS')
# prepare certain phenotypes
metadata$Age <- as.numeric(gsub('^KJ', NA, metadata$Age))
metadata$Distant.RFS <- factor(metadata$Distant.RFS, levels=c(0,1))
metadata$ER <- factor(gsub('\\?', NA, metadata$ER), levels=c(0,1))
metadata$ER <- factor(ifelse(metadata$ER == 1, 'ER+', 'ER-'), levels = c('ER-', 'ER+'))
metadata$GGI <- as.numeric(metadata$GGI)
metadata$Grade <- factor(gsub('\\?', NA, metadata$Grade), levels=c(1,2,3))
metadata$Grade <- gsub(1, 'Grade 1', gsub(2, 'Grade 2', gsub(3, 'Grade 3', metadata$Grade)))
metadata$Grade <- factor(metadata$Grade, levels = c('Grade 1', 'Grade 2', 'Grade 3'))
metadata$Size <- as.numeric(metadata$Size)
metadata$Time.RFS <- as.numeric(gsub('^KJX|^KJ', NA, metadata$Time.RFS))
# remove samples from the pdata that have any NA value
discard <- apply(metadata, 1, function(x) anyis.na(x)))
metadata <- metadata[!discard,]
# filter the expression data to match the samples in our pdata
x <- x[,which(colnames(x) %in% rownames(metadata))]
# check that sample names match exactly between pdata and expression data
all(colnames(x) == rownames(metadata))
2, add ESR1 gene expression to the metadata
metadata$ESR1 <- x["205225_at",]
head(metadata)
Age Distant.RFS ER GGI Grade Size Time.RFS ESR1
GSM65752 40 0 ER- 2.480050 Grade 3 1.2 2280 5.893152
GSM65753 46 0 ER+ -0.633592 Grade 1 1.3 2675 11.419733
GSM65755 41 1 ER+ 1.043950 Grade 3 3.3 182 10.922990
GSM65757 34 0 ER+ 1.059190 Grade 2 1.6 3952 9.629512
GSM65758 46 1 ER+ -1.233060 Grade 2 2.1 1824 11.367321
GSM65760 57 1 ER+ 0.679034 Grade 3 2.2 699 10.161146
3, run PCA
library(PCAtools)
p <- pca(x, metadata = metadata, removeVar = 0.1)
Here, your x
object would be assay(vsd)
4, generate bi-plot with continuous scale
biplot(p,
lab = NULL,
colby = 'ESR1',
shape = 'ER',
hline = 0, vline = 0,
legendPosition = 'right') +
scale_colour_gradient(low = "gold", high = "red2")
Good validation in its own right, as those with higher ESR1 gene expression are also ER-positive by IHC / histology...
Kevin
Hi Kevin, in your github vignette for PCAtools you also show the
pairsplot
function to look at multiple PC's instead of only PC1 and 2 that you show above. And also there you can plot a continuous variable like you did here, viacolby =
. However, it seems to me the+ scale_colour_gradient(low = "gold", high = "red2")
doesn't work forpairsplot
. Is this correct? And was this meant to be like this?That may very well be the case, as the pairs plot is ultimately many bi-plots pieced together, but the internal coding is therefore different. The pairs plot is definitely the 'weak point' of the package, and I believe ggplot2 now has specific functionality for generating such plots.