Hi All,
I've run DESeq2 on my counts comparing two conditions, treated vs untreated. I have my dds object and have generated my vsd object using the command:
vsd <- vst(dds, blind = FALSE)
then plotted my PCA using:
plotPCA(vsd, intgroup = "condition", ntop = 500)
However, I'd like to produce another plot using just a subset, which is those genes I identified as DE (padj < 0.05). My problem is, when I try to extract a subset from the vsd data using:
de <- rownames(res[res$padj<0.05 & !is.na(res$padj), ])
de_mat <- assay(vsd)[de,]
Then try to use this object to plot the PCA:
plotPCA(de_mat, intgroup = "condition", ntop = 500)
I receive the error message:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘plotPCA’ for signature ‘"matrix"’
Which makes sense because de_mat is a matrix and vsd is not. Can anyone help me with another way to extract my subset of genes from the dds/ vsd object so I can use the subsequent object to plot a PCA?
I need to confirm, via PCA, that this subset does indeed discriminate between the two groups sufficient to justify my downstream analyses. My PCA using the full gene set shows mixing of the groups and it would be great to be able to justify using a great visual like a PCA for just the DE genes at this point, so that I can show clear separation of the groups.
I greatly appreciate all help in advance. If anyone has ideas on other visuals, I'm open to those as well. I've already covered heatmaps and volcano plots, though; it's just that my PI really likes the idea of the DE genes only PCA.
Thanks, Kevin,
Subsetting vsd worked great for me. I've also looked at PCATools, which looks really comprehensive but a little intimidating to me. My counts data is in a text file, with genes as rows and columns as samples:
My phenotypic data is also a text file, with samples as rows and each column representing a different phenotypic trait:
How do I import this data into PCATools?
Cheers, Cece
Hey, you would just have to do:
However, I have a strict check in place whereby the rownames of
MyMetadata
have to be the same as the colnames ofMyData
(same names and in same order).For this, as per DESeq2's function, the input data should be the transformed counts.
Don't worry if you just want to generate a bi-plot - that is easy with DESeq2's function.
Hey @Kevin, what is the difference between top 500 in vsd and top 500 of the vsd based on the padj values for de?
I thought when you run the first one, it was plotting top 500 genes most variable based on p value? if not, what those it mean ntop=500 in vsd?
ntop = 500
is choosing the top 500 genes based on variance (across all samples). On the other hand, the top DE (differentially expressed) genes are related to your groups that are being compared and are chosen based on p-values and/or fold changesok thanks, I think I understood. However, my groups being compared I said to be between "samples" so I do not understand how this is generating two different PCAs? if both are based on pvalues?
What do you mean by "different PCAs"? - bi-plots?
After I have run the above dds and done the vsd transformation, I can run two PCAs and it generates very different visualisations. This can only be because the ntop 500 on each of them is not the same subset of genes. Therefore my question of why? Those ntop most variable genes of the first one, arent they based on the same pvalue I set on (de)?
No,
ntop
is, in no way, based on any p-value.ntop
selects the top genes based on variance across all samples.Ok, sorry, then I might have been misunderstanding the statistics behind it. Then the most variable gene across samples is NOT the one that presents the smallest p-value.
So the top gene would be the one that presents the highest variance to the mean averaged across samples? Whereas the top of (de) is the one which present the most significant variation across samples based on p value? Is this right?
It's just the gene that has the highest variance (in R, via
var()
) across, in this case, the transformed variance-stabilised expression levels.It's the one that has the lowest p-value after comparing expression levels between your two groups being compared. Granted, this may indirectly be affected by dispersion and, therefore, also, variance. Note, that, in DESeq2, the statistical inferences are made on the normalised counts, not the transformed variance-stabilised or regularised log expression levels.
Thanks Kevin, that was a clear anwer.