I have reads from 16 conditions, 3 replicates each. I used RSEM for alignment and quantification, so I have TPMs, but I've imported the counts into DESeq2 with tximport so I can normalize the counts and extract DE genes for specific contrasts. I have also used DESeq2 to produce a batch-corrected variance-stabilizing transformation (vst) of the dataset, which I used for some nice hierarchical-clustering heatmaps, PCA plots, and k-means clustering. Now, if I want to produce plots that examine the expression of individual genes or clusters, using gene lists derived from the DESeq2 results, should I plot the DESeq2 normalized counts or the TPMs? Is it "okay" to define clusters with the vst data but then show the TPMs? Not sure what standard practice is. Thanks!
In an ideal case (any case?), your differentially expressed genes from DESeq2 should show the same pattern when you plot them as box/violin plots of TPMs. That is, upregulated genes should show higher TPMs in the relevant condition, and vice versa. But I would check once whether the clustering heatmaps produced from the VST values and from the TPMs agree. If the clusters are the same, I wouldn't worry too much.
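One way to make "the clusters are the same" concrete is to compare the two sets of cluster labels with a Rand index, which is the fraction of gene pairs that both clusterings treat the same way (together in both, or apart in both). The following is only a minimal sketch in plain Python, not part of the DESeq2 workflow; the function name and the label vectors are made up for illustration.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree.

    A pair "agrees" if both clusterings put the two items in the same
    cluster, or both put them in different clusters. Cluster names do
    not need to match between the two labelings.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    if len(labels_a) < 2:
        raise ValueError("need at least two items to form a pair")
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical k-means labels for the same six genes, in the same order.
vst_clusters = [1, 1, 2, 2, 3, 3]       # clustering on VST values
tpm_clusters = [2, 2, 1, 1, 3, 3]       # same partition, different cluster names
tpm_clusters_alt = [1, 1, 2, 3, 3, 3]   # one gene moved to another cluster

same = rand_index(vst_clusters, tpm_clusters)
shifted = rand_index(vst_clusters, tpm_clusters_alt)
print(same, shifted)
```

A value of 1 means the two partitions are identical up to renaming the clusters; lower values mean more gene pairs are grouped differently. How low is "too low" is a judgment call and depends on the dataset.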
Ok. Even if the results are similar, does defining clusters with VST information but then presenting TPM information per gene raise any eyebrows? In a publication, would people expect normalized counts instead if everything upstream was done with DESeq too?
Edit: I just made some TPM plots and compared them to plots of normalized counts, and they are almost the same. So it looks like just personal preference at this point? The only thing I can think of that would argue for one over the other is that the normalized counts would be normalized across all my samples, but TPMs are only normalized within sample, correct?
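To illustrate the within-sample versus across-sample distinction, here is a small plain-Python sketch of the arithmetic. It is not the DESeq2 or RSEM API: the toy counts, gene lengths, and function names are invented, and the size factors follow the median-of-ratios idea in simplified form.

```python
import math

# Hypothetical toy data: raw counts for 3 genes in 2 samples.
# sampleB has the same composition as sampleA at twice the depth.
counts = {
    "sampleA": [100, 200, 300],
    "sampleB": [200, 400, 600],
}
lengths_kb = [1.0, 2.0, 3.0]  # assumed gene lengths in kilobases

def tpm(sample_counts, lengths_kb):
    """Within-sample: divide by gene length, then rescale to sum to 1e6."""
    rates = [c / l for c, l in zip(sample_counts, lengths_kb)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

def size_factors(counts):
    """Across-sample: median-of-ratios size factors (simplified sketch)."""
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Log geometric mean of each gene across all samples;
    # genes with a zero count in any sample are skipped.
    log_geo_means = []
    for g in range(n_genes):
        vals = [counts[s][g] for s in samples]
        if all(v > 0 for v in vals):
            log_geo_means.append(sum(math.log(v) for v in vals) / len(vals))
        else:
            log_geo_means.append(None)
    factors = {}
    for s in samples:
        log_ratios = sorted(
            math.log(counts[s][g]) - m
            for g, m in enumerate(log_geo_means)
            if m is not None
        )
        mid = len(log_ratios) // 2
        if len(log_ratios) % 2:
            median = log_ratios[mid]
        else:
            median = (log_ratios[mid - 1] + log_ratios[mid]) / 2
        factors[s] = math.exp(median)
    return factors

sf = size_factors(counts)
normalized = {s: [c / sf[s] for c in counts[s]] for s in counts}
tpm_values = {s: tpm(counts[s], lengths_kb) for s in counts}
print(sf)
print(normalized)
print(tpm_values)
```

The TPM of a gene depends only on the other genes in its own sample, while each size factor depends on a reference built from every sample in the dataset. In this toy case both approaches remove the depth difference, which fits the observation that the two kinds of plots look almost the same.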
I would guess that 99% of reviewers would not even notice, as long as the results make biological sense. Still, I would keep things simple and show the values that the results are based on, so `vst` or `rlog`, depending on what you used. On the y-axis, just label it as `normalized counts`. That is at least what I mostly do.