There are different ways to gauge [graphically] how effective a normalisation has been. Looking at your second plot, it would appear in this case that normalisation has been successful.
Apart from box-and-whisker plots, one can also do:
Violin plot
Using regularised log or variance stabilised counts:
require(reshape2)
# loggedCounts should be a plain numeric matrix, e.g. assay(rlog(dds)) or assay(vst(dds))
violinMatrix <- reshape2::melt(loggedCounts)
colnames(violinMatrix) <- c("Gene", "Sample", "Expression")
library(ggplot2)
ggplot(violinMatrix, aes(x=Sample, y=Expression)) + geom_violin() + theme(axis.text.x = element_text(angle=45, hjust=1))
Pairwise sample scatter plots
Using regularised log or variance stabilised counts:
require(car)
# each off-diagonal panel is a sample-vs-sample scatter plot;
# with many samples this grows quickly, so consider subsetting the columns
scatterplotMatrix(loggedCounts, diagonal="boxplot", pch=".")
Dispersion plot
Looking at the unlogged, normalised counts, a dispersion plot gives a good idea of how well the dispersion has been modelled as a function of the mean of normalised counts.
options(scipen=999)  # disable scientific notation on the axes
plotDispEsts(dds, genecol="black", fitcol="red", finalcol="dodgerblue", legend=TRUE, log="xy", cex.axis=0.8, cex=0.3, cex.main=0.8, xlab="Mean of normalised counts", ylab="Dispersion")
options(scipen=0)
------------------------------
More for outlier detection:
Bootstrapped hierarchical clustering (unsupervised - i.e. entire dataset)
Using regularised log or variance stabilised counts:
require(pvclust)
# pvclust clusters the columns (samples) and attaches bootstrap support
# values (AU / BP) to each node of the dendrogram
pv <- pvclust(loggedCounts, method.dist="euclidean", method.hclust="ward.D2", nboot=100)
plot(pv)
Principal components analysis
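Using regularised log or variance stabilised counts. DESeq2 ships a convenience function for this; a minimal sketch, assuming rld is your rlog- (or vst-) transformed object and that colData contains a condition column:

```r
library(DESeq2)
library(ggplot2)

# quick version: first two principal components, coloured by condition
plotPCA(rld, intgroup = "condition")

# for more control, return the data and build the plot manually
pcaData <- plotPCA(rld, intgroup = "condition", returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, colour = condition)) +
  geom_point(size = 3) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance"))
```

By default plotPCA uses the top 500 most variable genes; outlier samples tend to separate clearly on PC1/PC2.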
Symmetrical sample heatmap
Using regularised log or variance stabilised counts:
require(gplots)
require(RColorBrewer)
distsRL <- dist(t(loggedCounts))  # Euclidean distances between samples (columns)
mat <- as.matrix(distsRL)
# assumes colData(dds) contains IDlist and condition columns
rownames(mat) <- colnames(mat) <- with(colData(dds), paste(IDlist, condition, sep=", "))
hc <- hclust(distsRL)
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)  # heatmap colour palette
heatmap.2(mat, Rowv=as.dendrogram(hc), symm=TRUE, trace="none", col=rev(hmcol), cexRow=1.0, cexCol=1.0, margin=c(13, 13), key=FALSE)
Kevin
Thank you very much. This is really going to help me.
One more thing I need to clarify. For box/violin plots, should the first/third quartiles and the median be at the same level across all samples? For the same dataset I tried a different low-count filter, and afterwards the third quartile and the median were level across samples, but while the first quartile of group A stayed the same, that of group B was much lower.
It is normal to exclude low counts, but which cut-off did you use?
I was experimenting with several, like:
The plots I provided above are from #4, which is from the DESeq2 manual (if I remember correctly). Initially it was good to see the number of genes being reduced, but after DE analysis I realised I had lost some important genes just because they had a zero count in 1 out of 45 samples.
The results of #1,2 & 3 were similar.
Hey, well, you should definitely exclude anything with zero counts across all samples (#1). I can see why #1, #2 and #3 give similar results. #4 may be too stringent, as it should be expected that some samples will return a zero count (#4 requires that all samples have counts > 0, correct?).
So what's your opinion on using this? Is there something else I should try? Apart from this, if I read correctly, DESeq performs further filtering itself.
I think that any of your first three are valid - I have seen them used in various studies. #4 is too stringent - you would need a good reason for using such a threshold, for example if you had to use some statistical test where zero values are not permitted.
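To make the difference in stringency concrete, here is a small base-R sketch of filters #1 and #4 as described above (#2 and #3 were not shown in the thread); the count matrix is made up purely for illustration:

```r
# Hypothetical count matrix for illustration: 6 genes x 4 samples
counts <- matrix(c( 0,  0, 0, 0,
                    5,  0, 3, 2,
                   10, 12, 8, 9,
                    0,  1, 0, 0,
                    7,  7, 7, 7,
                    2,  0, 0, 1),
                 nrow = 6, byrow = TRUE)

# Filter #1: drop genes with zero counts in ALL samples
keep1 <- rowSums(counts) > 0

# Filter #4: require a non-zero count in EVERY sample -
# stringent, since a single zero in one sample removes the gene
keep4 <- rowSums(counts > 0) == ncol(counts)

sum(keep1)  # 5 genes survive filter #1
sum(keep4)  # only 2 genes survive filter #4
```

Genes 2 and 6 illustrate the point made above: they are expressed in most samples, yet filter #4 discards them because of a single zero.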
I guess I need a bit more help regarding this. Going through the graphs generated by the first three methods, it feels like something is wrong, especially in the box plot of normalised counts and the density plot. Please see the attached files. The plots are similar for all of them, so I am showing only the ones generated by method #3. In the normalised-count boxplot the median and upper quartile are in range, but the lower quartile is not. And for the density plot, I would have expected the curves to overlap.