Hello everyone, I have a question as to EdgeR's differential expression analysis and the use of log2 transformation as part of the normalization process. I want to test for DE using TMM normalized log2 transformed counts data (log2 transformation serving as a means to reduce skew and the effects of outliers). I know it is possible to output normalized, log2 transformed values by using cpm() on the output of EdgeR's cpm() function:
#Use counts and phenotype data to create a dgelist object:
group <- phenotype_data_matrix$test_group
df.counts <- as.data.frame(counts_matrix)
dgelist <- DGEList(counts=df.counts, group=group)
# conduct normalization. The normalization factors will be stored in the object with an unaltered copy of the counts matrix:
calc.normfact.out <- calcNormFactors(dgelist, method = "TMM")
#extract normalized, log2 transformed expression data using cpm() note: a default prior count of .25, scaled to effective library size will be added by EdgeR to prevent taking log of 0 counts:
cpm.tmm.log2 <- cpm(calc.normfact.out, log=TRUE, normalized.lib.sizes = TRUE)
My concern is this: does EdgeR use log2 transformation as part of the normalization prior to estimating differential expression? If it is not a default, is it possible to secify this as part of the normalization process, or would such a transformation of counts into log2 transformed values actually violate the assumptions EdgeR makes about the RNA-seq data's distribution? Below is a copy of the code I'm using to test for DE between group A and group B,
#additional covariates you want in your glm:
batch <- phenotype_data_matrix$Batch
sex <- phenotype_data_matrix$sex
#set up model with additional relevant covariates:
design <- model.matrix(~group + batch + sex)
# carry out DE analysis to identify DE genes between the two categories in 'group' (this is specified by 'coef=2') adjusting for Batch and Sex using glm():
dge.estDisp.out <- estimateDisp(calc.normfact.out, design)
fit <- glmFit(dge.estDisp.out, design)
lrt <- glmLRT(fit, coef = 2)
I have included my code here so that A )in case there is there is a simple way to modify it to use log2 transformed-normalized values provided that is not the default and B) Allow you to point out any gross mistakes or misconceptions on my part. Many thanks for any help you can provide!
Ok, so if it adjusts library size (creating an effective library size) and then you extract values using cpm() aren't the measurements you are getting for each gene a TMM normalized (and optionally log2 transformed) measurement of that particular genes expression? If you extract in this way what you are using for visualization is normalized gene-expression data correct? Even if it is not 'count' data any more. Am I correct in that?
Additionally, is there a way to use normalized log2 transformed data for DE analysis in the DEseq package?
You can use TMM to normalize the library sizes. And then the normalized library sizes can be used to compute logCPM values. You can call the results TMM normalized counts if you like, but I do not do so because I think that terminology is misleading. In fact it causes the sort of misunderstandings that have lead to your question here.
No you cannot use log2 transformed counts in DESeq. DESeq is a negative binomial package, just like edgeR. log transformed counts are not negative binomial.
I am not clear what is worrying you. As far as I can see, the DE analysis you've done is perfectly ok.
OK Thank you, I was just trying to clarify things to make sure I fully understood. Thank you for answer my questions!