I'm conducting a DE analysis over normalized TPM, RPKM, or FPKM data. Raw counts are not available. The data is about the clinical outcome of a certain drug. I know this is not the ideal way to do this kind of analysis, but I have no choice. This complicates things for me, I'm not getting results and I just want to make sure my code is right. Here is my code, from one of the datasets, that has TPM data:
TPM = TPM[rowSums(TPM[])>0,]
thresh = TPM > 10
keep = rowSums(thresh) >= 12
table(keep)
TPM = TPM[keep,]
# Design and Contrast
design = model.matrix(~ 0 + outcome, metadata)
colnames(design)[c(1,2)] = c('NoResponse','Response')
contrast = makeContrasts(NoResp_vs_Resp = NoResponse - Response, levels = colnames(design))
log_TPM = scale(log2(TPM + 0.1))
vfit = lmFit(log_TPM, design)
vfit = contrasts.fit(vfit, contrasts = contrast)
efit = eBayes(vfit, robust = TRUE, trend = TRUE)
plotSA(efit, main = 'final model with TPM ~ Mean-Variance trend')
summary(decideTests(efit))
DEG = topTable(efit, coef = 'NoResp_vs_Resp', p.value = 0.05, adjust.method = 'fdr', number = Inf)
I'm trying to get figures that are as close to the Bioconductors guide for limma.
Can anybody please just confirm wheather my code is alright, or is there any other way I could do this? regarding filtering and stuff like that, cause I cant use edgeR::filterByExpr
. Thanks!
can you run normality test on log_TPM object?
I don't think any data of this type (RNA-seq) is normally distributed.. or is it ?
RNAseq data is mostly modeled after NB distribution, for biological replicates. Since data is normalized (TPM) and subsequently log transformed, it is no more original data. As TPM data is log transformed, my understanding is that values in log_TPM object are going to be normally distributed. However, I was not sure of your data. Hence, I asked you to test for normality of the data. However, as several times mentioned on biostars, raw counts are better in analysing the RNA seq data.
The data is not normally distributed. I tried
ggqqplot(log_TPM$sample)
, on many samples that I chose randomley, and got something like this:You think you could help me or is it a gone cause ?
The plot you show here is irrelevant for a limma analysis. limma does make a normality assumption, but in a much more subtle sense than you think. limma makes no assumptions at all about qqplots or histograms of expression values.
The SA plot that your plotSA code above makes is far more relevant for checking that everything is ok.
Isn't this duplicated operation?
Basically you're right. The first line is supposed to delete all zero rows. And the next lines are supposed to delete genes based on the threshold. I do both no reason at all behind it.