Hello all, I have been thinking about RNAseq heatmaps for a while now and would appreciate feedback from others. I am working with non-model organisms from messy microbiome datasets that don't work well with tools like DESeq2/edgeR. It's difficult to determine what is "differentially expressed" in this case because I am not running formal statistical tests. Rather, I am looking for qualitative differences that are consistent across many samples (sort of "replicates"), and for this, clustered heatmaps are helpful. I have been normalizing the data for library size and plotting after log2 transformation.
I noticed that sorting by variance on log2-transformed data identifies weakly/moderately expressed genes that are highly variable across samples, whereas sorting by variance on the untransformed data picks out genes with higher baseline expression that are less variable in relative terms (but still variable). I believe these latter genes are missed otherwise because calculating variance on log-transformed large numbers yields small variances (see the useful write-up by Friederike Dündar here: https://github.com/friedue/Notes/blob/master/RNA_heteroskedasticity.md).
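To illustrate the effect described above, here is a minimal Python sketch using NumPy. The function name `top_variable`, the pseudocount of 1, and the toy matrix are all made up for illustration and are not taken from any real dataset or pipeline.

```python
import numpy as np

def top_variable(matrix, n, log_first, pseudocount=1.0):
    """Return row indices of the n most variable genes.

    matrix: genes x samples array of library-normalized expression values.
    log_first: if True, rank by variance of log2(x + pseudocount);
               otherwise rank by variance of the untransformed values.
    """
    values = np.log2(matrix + pseudocount) if log_first else matrix
    variances = values.var(axis=1, ddof=1)  # sample variance per gene
    # argsort is ascending, so reverse to put the most variable genes first
    return np.argsort(variances)[::-1][:n]

# Toy matrix (hypothetical numbers): 3 genes x 4 samples
expr = np.array([
    [1.0, 15.0, 1.0, 15.0],            # gene 0: low expression, large fold changes
    [1000.0, 2000.0, 1000.0, 2000.0],  # gene 1: high expression, 2-fold changes
    [50.0, 50.0, 50.0, 50.0],          # gene 2: flat
])

top_log = top_variable(expr, 1, log_first=True)   # picks gene 0
top_raw = top_variable(expr, 1, log_first=False)  # picks gene 1
```

In this toy case the log2 ranking favors the weakly expressed gene with large fold changes, while the untransformed ranking favors the abundant gene with large absolute changes. Either set of genes can still be plotted on the log2 scale in the heatmap itself.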
I can't find much on the discussion boards or in tutorials where people have actually used non-log-transformed data for the purpose of measuring variance. What are your thoughts on presenting data this way, if biologically it provides interesting results? I think the downside might be that you end up interpreting genes that are not really variable, but merely abundant, though I think your heatmap would tell you if that was the case (i.e., no clustering across samples, just noise).
Thank you for the comment. Do you know what tools I can use to normalize for this? As I understand it, the variance stabilizing transformation (vst) within DESeq2 might help, but I am not using DESeq2 for this analysis. Is there any harm in showing both log-normalized and non-log-normalized heatmaps to account for either bias?
Please tell us a bit more about what you are using. If you use DESeq2 or edgeR you do not need to normalize, because their methods account for this. There is no physical harm in making both heatmaps, log-normalized and not log-normalized; just do not show the "not log-normalized" one to me, you can show it to your mother. Ivo.
I am not using either of those tools. I am normalizing using RPKM, which accounts for differences in library size across samples. Why do you think the "not log-normalized" variance-sorted approach is worthless, if log2 transforming is understood to bias against abundant genes? The heatmap would still be showing log2-transformed data... just highlighting a mostly different (but partly overlapping) subset of genes as the "top X variable".
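For readers unfamiliar with it, here is a minimal Python sketch of how RPKM is typically computed (reads per kilobase of gene length per million reads in the library). This is illustrative only and not necessarily the exact pipeline used here: the function name `rpkm` and the toy numbers are hypothetical, and library size is assumed to be the column sum of the count matrix, whereas many pipelines use total mapped reads instead.

```python
import numpy as np

def rpkm(counts, gene_lengths_bp):
    """Reads per kilobase per million for a genes x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(gene_lengths_bp, dtype=float) / 1e3
    # Assumption: library size = sum of counts per sample, in millions
    per_million = counts.sum(axis=0) / 1e6
    return counts / lengths_kb[:, None] / per_million[None, :]

# Toy example (hypothetical numbers): sample 2 is sequenced twice as deeply
counts = np.array([
    [10, 20],
    [30, 60],
])
lengths = np.array([1000, 2000])  # gene lengths in base pairs

values = rpkm(counts, lengths)
```

Because the second sample differs from the first only in sequencing depth, both columns come out identical after normalization, which is the library-size correction mentioned above.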
Hello Ivo, any other thoughts or recommendations would be appreciated. Thank you very much for your time.