Hi there,
I have a RNA-seq experiment with four samples.
control, treatment1, treatment2, treatment3. each with duplicates, I used DEseq to determined the differentially expressed genes for: treatment1 vs control, treatment2 vs control and treatment3 vs control.
treatment1 over-expresses a wild type protein while treatment2 and 3 express mutant proteins which have partial function of the WT protein.
treatment1 will cause some genes to be upregulated and downregulated. treatment2 and 3 will have some overlapping genes that are also upregulated and downregulated as treatment1 but will have some genes do not change (those mutant proteins only have partial function.
Now, I want to make a heatmap (using log2 fold change) with genes that are differentially changed with treatment1.
res1 <- nbinomTest( cds, 'control', 'treatment1' )
res2 <- nbinomTest( cds, 'control, 'treatment2' )
res3 <- nbinomTest( cds, 'control, 'treatment3' )
df<- data.frame(id=res1$id, logFC1=res1$log2FoldChange,padj1=res1$padj,
logFC2=res2$log2FoldChange, padj2=res2$padj,logFC3=res3$log2FoldChange, padj3=res3$padj)
df<- df[abs(df$logFC1)>1 & df$padj1<0.01,] # all the genes significantly changed with treatment1
df <- na.omit(df)
df<- df[is.finite(df$logFC1) & is.finite(df$logFC2) & is.finite(df$logFC3),]
df <- na.omit(df) #remove NA
df<- df[is.finite(df$logFC1) & is.finite(df$logFC2) & is.finite(df$logFC3),] #remove Inf
dim(df)
[1] 1669 7
if I plot a scatter plot of logFC1:treatment1
vs control
AND logFC2:treatment2
vs control
by
library(ggplot2)
ggplot(df) + geom_point(aes(x=df$logFC1, y=df$logFC2)) + geom_hline(y=c(1,-1), color='red',linetype=2)
it shows me:
It makes sense, because I pre-choosed for logFC1 with abs(logFC1)>1
.
Now, if I want make a heatmap with heatmap.2
:
m<- as.matrix(df[,c(2,4,6)])
> head(m)
logFC1 logFC2 logFC3
12 -1.799917 -5.0813424 -2.6516590
23 -2.843064 -0.2089547 -0.9493254
33 -1.474999 -0.2902581 -0.1040179
51 -1.447214 -0.5684074 -0.4131914
61 1.999213 1.8802237 1.2279896
65 1.022975 0.6953835 0.8228315
colnames(m)<- c("treatment1_VS_control","treatment2_VS_control","treatment3_VS_control")
hmcols<- colorRampPalette(c("green4","green","white", "red","red4"))(256)
# cluster with the pearson distance according to this post https://www.biostars.org/p/91978/#120843
library(gplots)
heatmap.2(m, col=hmcols, scale="row", hclust=function(x) hclust(x, method='complete'), distfun=function(x) as.dist(1-cor(t(x))), trace="none",margin=c(17,15), density.info="none", labRow=NA)
I can also change the method to "single", scale to "none", but I still did not get what I expected. I should see group of genes are both up-regulated with treatment1 and treatment2, treatment1 and treatment3, treatment1 2 and 3, genes only upregulated in treatment1 but not 2 and 3.
if I use the default euclidean distance which also have problems http://liorpachter.wordpress.com/2014/01/19/why-do-you-look-at-the-speck-in-your-sisters-quilt-plot-and-pay-no-attention-to-the-plank-in-your-own-heat-map/
heatmap.2(m, col=hmcols, scale="none", hclust=function(x) hclust(x, method='complete'), distfun=function(x) dist(x,method='euclidean'), trace="none",margin=c(17,15), density.info="none", labRow=NA)
It looks even worse:
How can I cluster the genes as what I want?
Thank you so much
Why don't you make a heatmap of RPKM/FPKM or normalized count data for your set of genes.
I can also make a heatmap with variance stabilized counts, or log2 transformed library size normalized counts, but the problem will be same for me.
I guess I don't understand, why do you think they look bad? I see everything you want in the first heat map. I mean you could always order them yourself and turn off reordering via
ROWV=FALSE
, otherwise you're saying that you just don't like the way it looks. OK, but so what? It's an unsupervised method.