I'm hoping someone here can help me with a frustrating result I'm getting from setting multiple breaks with a heatmap in R. I'm plotting log2 Fold-Change of differential gene expression.
When I run my code in RStudio I get a nice looking result in the plots panel which shows what I would like to see: All values less than -20 are a solid darkgreen, those between -19.9 and -1 are a continuous color ramp, between -1 and 1 is a solid white color, those between 1 and 20 are a continuous color ramp, and all values greater than 20 are a solid dark blue color. http://i64.tinypic.com/2nr09oi.jpg
However, when I export a pdf file I'm seeing something much different - a very weird result. http://i66.tinypic.com/10di2a8.jpg
Here was how I specified the breaks and colors:
# Specify color breaks
bk1 <- c(seq(-70,-20.1, by=0.01))
bk2 <- c(seq(-20,-1,by=1))
bk3 <- c(seq(-0.99,0.99, by=0.01))
bk4 <- c(seq(1, 20, by=1))
bk5 <- c(seq(20.1,70, by=0.01))
bk <- c(bk1,bk2,bk3,bk4,bk5)
# Specify the colors for each break
colors <- c(colorRampPalette(colors = c("darkgreen", "darkgreen"))(n = length(bk1)-1), c(colorRampPalette(colors = c("seagreen4", "aquamarine"))(n = length(bk2)-1),c(colorRampPalette(colors = c("white", "white"))(n = length(bk3)-1),c(colorRampPalette(colors = c("deepskyblue", "blue2"))(n = length(bk4)-1),c(colorRampPalette(colors=c("darkblue","darkblue"))(n=length(bk5)-1))))))
What was even more surprising is when I switch the "darkblue" to "black" you can see there is black from ~+18-20 and >20 is a grey color.
This makes me thinks that this may be an error in my code to specify breaks/colors. Still somewhat odd though that the initial figure "looked" like what I wanted from my code. Help greatly appreciated!
> sessionInfo()
R version 3.4.1 (2017-06-30) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252
[3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C
[5] LC_TIME=English_Canada.1252
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] pheatmap_1.0.8
loaded via a namespace (and not attached):
[1] colorspace_1.3-2 scales_0.5.0 compiler_3.4.1 plyr_1.8.4
[5] tools_3.4.1 gtable_0.2.0 RColorBrewer_1.1-2 Rcpp_0.12.13
[9] grid_3.4.1 knitr_1.17 munsell_0.4.3
UPDATE
I figured out how to properly set multiple breaks and colors. I'm not sure if this is the optimal way to set a "solid" color (e.g. color1. color5) but it seems to work.
bk = seq(-3, max(heat), length.out = 100)
color1 <- colorpanel(sum(bk[-1]<=-2), "darkblue","darkblue")
color2 <- colorpanel(sum(bk[-1]>-2 & bk[-1]<=-0.2), "blue3", "cornflowerblue")
color3 <- colorpanel(sum(bk[-1]>-0.2 & bk[-1]<=0.2), "white","white")
color4 <- colorpanel(sum(bk[-1]>0.2 & bk[-1]<=2), "orangered", "red3")
color5 <- colorpanel(sum(bk[-1]>2), "darkred","darkred")
colors <- c(color1,color2,color3,color4,color5)
Hey moldach, when I output anything as PDF, the colours always slightly change from when the figure is produced in the R plotting window. Why don't you try to output the PDF as CMYK colour instead of RGB? - just add
colormodel="cmyk"
as a parameter to thepdf()
function.Edit: I had a feeling that I had encountered this issue before and it just came back to me. The issue was that the number of breaks has to be 1 more than your colour vector. I usually set breaks and colours as follows (I also normally convert my data into the Z-scale manually and then switch off scaling in the heatmap function):
Thank you for your suggestions Kevin! I have a couple more questions for you:
1) I'm wondering why you choose to convert data into the z-scale (and switch off scaling). What benefit does this have for displaying the data rather than just plotting log2FC?
2) Many manuscripts have a log2FC scale from (-2, 2). Please excuse my rather naive question but why are we not interested in very large log2 fold changes? I did some quick napkin math and maybe I've thought this through. Say Treatment1 FPKM was 0.1 and Control DMSO was 0.00000000001. 0.1/0.00000000001 = 10 million and log2(10million) would be 33.219. I'm guessing we aren't interested in such changes because that's not a very big/interesting change between conditions?
I've chosen to set any lof2FC above 2 (or smaller than -2) to a solid color, as seen here:
Would this be ideal to show or would it be better to show the true range of my data? For example, I filtered cuffdiff's isoform_exp.diff output by FDR corrected p-value (q-value) <0.05. When I sort by log2FC column I see the range is from (-15, 11)?
As always your help has been greatly appreciated.
Hey Moldach, for me, I choose whichever metrics result in the best presentation of my results. It could be converting the data to Z-scores and providing a more uniform colour-gradient in the heatmap, or it could be leaving the data as log2 fold-changes and fixing the colour-gradient at specific cut-offs of log2 fold-change.
The one thing that I'll say is that various people do not anymore trust RPKM/FPKM normalisation for RNA-seq data. Take a look at this thread: the problem with rpkm (and tpm) Programs like DESeq2 perform a geometric normalisation and then perform a regularised log transformation, which deals very well with low and high counts, and gives more credible fold-change differences.
You may also want to check exactly how your heatmap function scales your data. I think that you are using pheatmap?
heatmap.2
, for example, scales the data as follows:...i.e. divides by the mean and then by the standard deviation.
Hope that this helps. We may have drifted fro the original question though!
Kevin
Yes I totally understand that the majority of researchers do not trust RPKM/FPKM normalization anymore. When I did my graduate studies in 2015 my PI gave me a copy of "RNA-seq Data Analysis: A Practical Approach" by Korpelainen et al., and at that time cufflinks was still being used; however, I chose to use DESeq2 instead because of my complex experimental design (time-course).
I had heard at that time that cufflinks was pretty much going the way of the Dodo and it's strange to see it still persist even today. To be 100% honest why I'm using this is because I started a new job and the sequencing facility STILL uses cufflinks. It turns out that many of the researchers who get data from them are M.D.'s (probably with little computational experience) and love the EDA graphs generated by cummeRbund.
I was given the cufflinks/cuffdiff output (not even given FASTQ files just BAM's) and my PI and a postdoc want to have a paper out by the end of next week for fear of getting scooped. So in the interest of time I was just working with what I had rather than going back to FASTQ's. I probably should have but there was other problems like the .pkl file from the sequencing facility being read-protected (don't want people knowing there secret sauce - essentially a black box) - I didn't even know what annotation (ensembl, Refseq, UCSC, etc.) was used for alignment a non-trivial point (See: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4339237/)
In fact Cufflinks was created by Lior Pachter and Steve Salzburg but it's now being maintained by Cole Trapnell. I heard there was a more recent Tuxedo suite (ref 1) so i contacted Cole to ask him why they don't even mention that on the website. His response strongly makes me believe he is trying to let Cufflinks die a slow and quite death. This is rather unfortunate and i believe one of the reasons bad practices in bioinformatics persist.
In addition to your point about FPKM normalization being inappropriate Cufflinks should only be used to compare a healthy state to a disease state - it cannot handle more complex experimental designs multiple factors (explanatory variables) - e.g. experiments involving multiple treatments and time-points. However, the authors of the "new Tuxedo" workflow claim it can handle more complex designs FWIW.
Other papers I've read recently (ref 2&3) suggest using DESeq2 or ultra-fast alignment-free methods such as Sailfish, Salmon and Kallisto which report transcripts per million (TPM). TPM is recommended to replace FPKM; however, in some cases FPKM > TPM (Zhang et al., 2017).