Hi! I am having trouble analyzing the outputs of a salmon-tximport-DESeq2 pipeline. Naturally, I used counts to perform the differential expression analysis, and then I used mean TPM values to analyze various other aspects, such as the median expression level of some subset of genes. One peculiar thing: when I plot log2 TPM treated vs. log2 TPM untreated and color the dots by whether they were identified as differentially expressed (log2 fold change > 1 or < -1, and adjusted p < 0.05 in the DESeq2 output), the up- and down-regulated genes are asymmetric relative to the x=y line. Can someone please explain why this happens? Here is the resulting plot. I can attach some of the R code I used too.
This is interesting. According to the plot, a lot of the highly expressed genes are down-regulated. If you think it's not biology, then the DESeq normalization was off. Did you use the default one?
No, we think this can actually occur because of biology
Then I think what you see is the effect of the different normalizations. Try plotting the normalized counts from DESeq (basically an MA plot) and see if the picture is more balanced. My guess is that it should be.
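To make the "different normalization" point concrete, here is a minimal base-R sketch on simulated data. It assumes equal gene lengths (so TPM reduces to counts per million) and uses a simplified median-of-ratios calculation as a stand-in for the idea behind DESeq2's default size factors; it is not DESeq2's actual code. All names (`base`, `top`, `tpm_shift`, `mor_shift`) are made up for illustration.

```r
# Toy data: 1000 genes, two libraries, equal gene lengths (assumption).
set.seed(1)
n_genes <- 1000
base <- round(2^runif(n_genes, 2, 12))        # untreated abundance, all >= 4
untreated <- base
treated <- base
top <- order(base, decreasing = TRUE)[1:50]   # the 50 most highly expressed genes
treated[top] <- round(base[top] / 4)          # these drop 4-fold on treatment

counts <- cbind(untreated = untreated, treated = treated)

# TPM with equal gene lengths: scale every library to sum to one million.
tpm <- sweep(counts, 2, colSums(counts), "/") * 1e6

# Simplified median-of-ratios size factors: per library, the median ratio
# of each gene's count to that gene's geometric mean across libraries.
log_geo_mean <- rowMeans(log(counts))
size_factors <- apply(counts, 2, function(x) exp(median(log(x) - log_geo_mean)))
norm_counts <- sweep(counts, 2, size_factors, "/")

# Median log2 ratio (treated / untreated) of the genes that did NOT change.
unchanged <- setdiff(seq_len(n_genes), top)
tpm_shift <- median(log2(tpm[unchanged, "treated"] / tpm[unchanged, "untreated"]))
mor_shift <- median(log2(norm_counts[unchanged, "treated"] /
                         norm_counts[unchanged, "untreated"]))

print(c(tpm_shift = tpm_shift, mor_shift = mor_shift))
```

In this toy setup the unchanged genes sit above the x=y line in TPM space (positive `tpm_shift`), because the treated library total shrinks when the top genes drop, while the median-of-ratios scaling leaves them on the line. That would make up- and down-regulated genes look asymmetric on a TPM scatter plot even when the fold changes themselves are fine.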
Yeah, I did an MA plot and it looks OK, but the thing is that I am using TPMs for subsequent analysis and now I'm not sure whether they're total garbage. DESeq doesn't produce normalized counts for each of the conditions though, so I can't use those either. I'm also kind of curious what normalization they used for this logFC. Based on their tutorials and article, I figured that only genes with low read counts should be affected by their normalization algorithm.
You're confusing two things: normalization and dispersion estimation. Normalization brings all libraries to a comparable level, which is done by scaling the read counts by a normalization factor that is different for each library and can be determined using several methods. I guess the method DESeq used and the one used for computing TPM were different. You can get the normalized counts from DESeq2 using
counts(dds, normalized=TRUE)
and you can use the rlog function if you really want to work with log values.
Thank you, Asaf! Although you explained normalization, which I already knew, you said nothing about the dispersion estimate. Could you please elaborate a little? I am really confused. Can I use these normalized counts to compare expression instead of TPMs?
You can use the normalized expression, but it's best to use the DESeq results directly. You can read about dispersion estimation in the DESeq2 manual and paper; in short, it's their way of estimating the "noise" of each gene.
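If per-condition values are still wanted for plotting, one option is to average the normalized counts within each condition. The sketch below uses a small hand-made matrix so that it runs without DESeq2 installed; in a real analysis `norm` would presumably come from counts(dds, normalized=TRUE), and `condition` from the sample table (a column named `condition` is an assumption here, as are the gene and sample names).

```r
# Stand-in for counts(dds, normalized = TRUE): genes in rows, samples in columns.
norm <- matrix(c( 10, 12, 30, 34,
                 100, 90, 50, 60),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("geneA", "geneB"),
                               c("u1", "u2", "t1", "t2")))

# Stand-in for the sample annotation, e.g. colData(dds)$condition (assumed name).
condition <- factor(c("untreated", "untreated", "treated", "treated"))

# Mean normalized count per gene within each condition.
cond_means <- sapply(levels(condition), function(lv) {
  rowMeans(norm[, condition == lv, drop = FALSE])
})

# Log scale with a pseudocount of 1 so that zero counts stay finite.
log2_means <- log2(cond_means + 1)

print(cond_means)
```

Plotting `log2_means[, "treated"]` against `log2_means[, "untreated"]` gives a scatter plot on the same scale that the fold changes were computed on, which is not the case for TPMs.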
Thanks! I will
Yeah, maybe, but don't you want to include all the other genes when you analyze, let's say, ChIP-seq and RNA-seq together, and not only the 1000 that are differentially expressed? Or do you mean that I should use only log2fc and baseMean (I still don't understand what this is for) from the DESeq output to test hypotheses?
Basically, baseMean and log2fc (with its SE) give you all you need to know.
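As a rough sketch of working from the results table for all genes rather than only the significant ones, the example below labels every gene using the thresholds from the original question (|log2 fold change| > 1, adjusted p < 0.05). The data frame is hand-made for illustration; the column names baseMean, log2FoldChange, lfcSE and padj follow the DESeq2 results table, while the gene names and the `status` column are made up.

```r
# Hand-made stand-in for as.data.frame(results(dds)); values are invented.
res <- data.frame(
  row.names      = c("g1", "g2", "g3", "g4"),
  baseMean       = c(1500, 20, 300, 5),     # mean normalized count over all samples
  log2FoldChange = c(-2.1, 0.3, 1.4, 3.0),
  lfcSE          = c(0.2, 0.5, 0.3, 2.5),   # standard error of the log2 fold change
  padj           = c(1e-10, 0.8, 0.001, NA) # padj can be NA, e.g. for low-count genes
)

# Treat NA adjusted p-values as "not significant" instead of dropping the gene.
sig <- !is.na(res$padj) & res$padj < 0.05

res$status <- ifelse(sig & res$log2FoldChange > 1, "up",
              ifelse(sig & res$log2FoldChange < -1, "down", "ns"))

print(res)
```

Every gene keeps its baseMean and log2FoldChange, so the whole table can be joined to another data set (ChIP-seq signal, for example), with `status` used only for coloring or stratifying.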