Hi, I'm doing a differential expression analysis to RNA-seq data with limma - voom. Unfortunately, I do not have acceess to the raw counts, just normalized TPM data. I know that all libraries, including DESeq2 and limma, expect raw counts and they don't perform very good when receiving nromalized data. This is my first time using limma ( I usually prefare DESeq2) so please bear with me.
Nontheless, I did the analysis according to the bioconductor guide, but without the CPM or logCPM part cause that wont work with TPM.
Despite all this, my results still make little sense, I'm lost here and cant actually point at what is wrong. The TPM data has 28000 genes, the analysis result says that 26000 of these are differentialy expressed !! which is way too much I guess, this never happened with me before and I suspect it's due to the normallized data. Or am I using the data in a false way?
Here is my code, hope it helps:
## Remove all zero rows and noise genes
TPM <- TPM[rowSums(TPM[])>0,]
thresh <- TPM > 0.5
keep <- rowSums(thresh) >= 2
table(keep)
TPM <- TPM[keep,]
## Design and Contrast
design <- model.matrix(~Response, Metadata)
contrast <- matrix(c(1 ,0), ncol = 1)
dimnames(contrast) <- list(c('Response', 'No Response'), 'Diff')
## Voom - Make RNA-seq follow normal distribution
Voom <- voom(TPM, design, plot = TRUE)
vfit <- lmFit(Voom, design)
vfit <- contrasts.fit(vfit, contrasts = contrast)
efit <- eBayes(vfit)
plotSA(efit, main = 'final model: Mean-Variance trend')
summary(decideTests(efit))
deg <- topTable(efit, coef = 'Diff', p.value = 0.05, adjust.method = 'fdr',
number = Inf)
And checkout this volcano plot, it is not showing the down regulated genes ( there are 9581 of those), only upregulated LFC
EnhancedVolcano(deg,
lab = rownames(deg),
x = 'logFC',
y = 'adj.P.Val',
labSize=4,
FCcutoff=2 )
They didn't talk much about TPM, however I see that the person who asked the question used voom on normalized data, same mistake as I did.
So what should I do ? maybe something like this ? (without voom)
Yes, TPM/FPKM is the same in this context.
I tried it, it's still the same. Now the summary tells me there are zero significant downregulated genes , and 28000 significant up regulated genes (almost all dataset), yeah it's still rubbish results.
Maybe I should filter more genes? cause when I use
filterByExpr()
to filter the TPM, only 6000 genes are left.I do not have the data at hand, so I cannot tell. That filter works on raw counts as well, so it makes limited sense here. Are these data even properly normalized, can you make an MA-plot to explore? One of the many reasons FPKM/TPM is bad is that it only cares about sequencing depth and not about library composition, see e.g. TMM-Normalization
MA-plots is AveExpr vs logFC. Normalization is probably way off here.
I see.. well this is how I normalized the data:
So yeah I have the raw data I just can't use it for the analysis for some stupid reasons. Anyway, this is an SA Plot, after applying
eBayes(vfit, trend = TRUE)
using
trend = TRUE
or not, same result.So you do have access to the raw counts?! This is not an MAplot.
I do but I wasn't allowed to use it for some research purposes. I did a little rebellion and now I'm allowed to use it finally. Thank you so much for the help and sorry for asking lots of questions.