Hi,
I have computed pairwise correlations between every pair of genes in my RNA-seq data.
I have done this with non-log-transformed normalised counts generated with estimateSizeFactors
in DESeq2, as well as with VST counts and log2-transformed counts.
All pairwise correlations make sense and I am inclined to simply use non-log transformed normalised counts.
Is there a specific reason why using non-log-transformed normalised counts from estimateSizeFactors
would be detrimental compared to using log2/VST-transformed counts? The only thing I can think of is that, for a gene that is extremely highly expressed in one sample but much more lowly expressed in the others, that one sample will shift the mean, and this variance would be less pronounced after variance stabilisation...
Any thoughts are much appreciated.
Thanks
IMHO, if you're using Pearson correlation in particular, it may be more sensitive to outliers when run on the non-log-transformed counts. I personally tend to use non-parametric correlations (e.g. Spearman) with these counts (or even with log counts). Is there any reason why you prefer to use non-rlog/VST counts? Nonetheless, as you say, this will depend very much on your data. You could try different strategies and compare the results.
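To make the outlier point concrete, here is a minimal sketch in plain Python (standard library only). The two genes and their counts are invented for illustration, and log2(x + 1) is only a stand-in for a proper transformation, not DESeq2's VST. In five of the six samples the genes show no real relationship; in the sixth both are very high.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical normalised counts for two made-up genes across six samples.
# The last sample is an extreme value for both genes.
gene_a = [10, 12, 9, 11, 10, 1000]
gene_b = [20, 15, 22, 18, 25, 2000]

pearson_raw = pearson(gene_a, gene_b)
pearson_log = pearson([math.log2(v + 1) for v in gene_a],
                      [math.log2(v + 1) for v in gene_b])
spearman_raw = spearman(gene_a, gene_b)

print("Pearson, raw counts:   %.3f" % pearson_raw)
print("Pearson, log2(x + 1):  %.3f" % pearson_log)
print("Spearman, raw counts:  %.3f" % spearman_raw)
```

In this toy case the single extreme sample drives the Pearson coefficient on raw counts; the log transform only dampens that, while the rank-based coefficient largely ignores it. Real data will rarely be this clear-cut, so treat it as an illustration rather than a rule.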
Hi @Papyrus,
thank you for your response. I thought this would be the case, although I haven't had results that convince me not to use non-log-transformed counts. Spearman is also a good choice for this.
My reason for this choice is that I don't lose any meaningful biology when I run Pearson on either non-log-transformed or VST counts. However, with VST counts fewer genes meet the threshold I use (say, a coefficient of 0.8) before I do the clustering. The additional genes that meet the threshold in the non-log-transformed counts make biological sense when clustered. For example:
a cluster of 10 genes that are correlated in the non-log-transformed data tends to become 5 genes (say) in a Pearson run on VST counts. These 5 genes are all present in the list of 10 from the non-log-transformed data, but the additional 5 genes cannot be noise, as together they form bona fide biological functions. It seems that accepting the outlier risk is a worthwhile trade-off for not losing those genes, so I lean towards non-log-transformed data (unless I lower the threshold when clustering with VST counts).
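One way to look at the gene pairs that pass the 0.8 threshold only on non-log-transformed counts is to list them and check whether a single sample is driving each one. The sketch below is plain Python with invented gene names and counts; log2(x + 1) is used as a stand-in and is not DESeq2's VST, and the leave-one-out check is just one possible diagnostic, not something prescribed by DESeq2.

```python
import itertools
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical normalised counts: four made-up genes across six samples.
raw = {
    "geneA": [10, 20, 30, 40, 50, 60],
    "geneB": [12, 19, 33, 41, 48, 63],
    "geneC": [5, 9, 4, 10, 6, 40],
    "geneD": [9, 4, 10, 5, 8, 45],
}
# Stand-in transformation (assumption): log2(x + 1), NOT DESeq2's vst().
logged = {g: [math.log2(v + 1) for v in vals] for g, vals in raw.items()}

def pairs_above(expr, threshold=0.8):
    """Gene pairs whose Pearson coefficient meets the threshold."""
    return {
        (g1, g2)
        for g1, g2 in itertools.combinations(sorted(expr), 2)
        if pearson(expr[g1], expr[g2]) >= threshold
    }

raw_pairs = pairs_above(raw)
log_pairs = pairs_above(logged)
raw_only = raw_pairs - log_pairs  # pairs lost after transformation

def leave_one_out(x, y):
    """Pearson coefficient recomputed with each sample dropped in turn."""
    return [pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
            for i in range(len(x))]

print("shared pairs:  ", sorted(raw_pairs & log_pairs))
print("raw-only pairs:", sorted(raw_only))
for g1, g2 in sorted(raw_only):
    loo = leave_one_out(raw[g1], raw[g2])
    print(g1, g2,
          "full r = %.2f" % pearson(raw[g1], raw[g2]),
          "min leave-one-out r = %.2f" % min(loo))
```

In this toy data the geneC/geneD pair passes the threshold only on raw counts, and removing the sixth sample flips the sign of its correlation. If the extra genes in a real cluster keep a high coefficient whichever sample is dropped, that supports keeping them; if one sample carries the whole correlation, the pair deserves a closer look before clustering.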