Hello Biostars Community,
I would like to compare the expression levels of specific genes between sample groups by making box plots like these:
The figure legend for the paper says:
Plotted values are quantile-normalized log2-cpm. For each group, all samples are plotted in addition to box-plots summarizing the group. * indicates adj. p < 0.05
I am starting with count matrices of different sample groups with technical replicates.
This was good help: How do you generate TMM normalized counts using EdgeR?
I am kind of convinced using TMM is the best method for this task based on the recommendations of this (HBC Training) source. Please correct me if I am wrong.
My main question, granted the above is correct: could someone help explain when the log transformation should be done? Is it necessary?
Should I do it here:
#/ make the DGEList:
y <- DGEList(...)
#/ calculate TMM normalization factors:
y <- calcNormFactors(y)
#/ get the normalized counts:
cpms <- cpm(y, log=TRUE)
or instead replace the last line, like so:
log2cpms <- log2(cpm(y, log=FALSE))
Or should I not do it at all?
cpms <- cpm(y, log=FALSE)
Thank you very much in advance!
log2(cpm(y))
would produce a lot of -Inf values, since there are many 0 counts. You could add 1 to the values before taking the log:
log2(cpm(y) + 1)
or set
cpm(y, log=TRUE)
which adds a small prior count to each count before taking the log to avoid this problem. Since you aren't doing stats on these values, either method is fine.
Thank you rpolicastro! Swooping in once again!
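For reference, the three log options discussed above can be compared side by side on a toy matrix. This is only a minimal sketch, assuming edgeR is installed; the simulated `counts` matrix, the gene names, and the group labels are made up for illustration and are not from the question.

```r
library(edgeR)

# Hypothetical toy data: 200 genes x 4 samples, with one all-zero gene.
set.seed(1)
counts <- matrix(rnbinom(800, mu = 50, size = 5), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("s", 1:4)))
counts["gene1", ] <- 0

y <- DGEList(counts = counts, group = c("A", "A", "B", "B"))
y <- calcNormFactors(y)  # TMM is the default method

raw_log   <- log2(cpm(y))        # zero counts become -Inf
plus_one  <- log2(cpm(y) + 1)    # pseudocount of 1 CPM; zero counts become 0
prior_log <- cpm(y, log = TRUE)  # uses prior.count = 2 by default; always finite

raw_log["gene1", ]
plus_one["gene1", ]
prior_log["gene1", ]
```

For plotting, the last two behave similarly at moderate to high expression and differ mainly in how they treat very low counts.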
Hmmm. Looks like I am missing a fundamental understanding here. My goal is to do stats on the values/box plots. I do remember reading a discussion about "only being able to do differential gene expression analysis using raw counts" on the Biostars Slack a while back. Devon Ryan suggested using normalized counts in the StackExchange link below:
https://bioinformatics.stackexchange.com/questions/5545/why-the-t-test-for-a-specific-gene-shows-different-value-compared-to-differentia
This one was good for a refresher for statistical tests to use:
https://www.researchgate.net/post/What-statistical-test-should-I-use-to-analyze-mRNASeq-data-for-differential-gene-expression
Guess reading the edgeR and limma papers is on my to-do list now... Stats papers are dreadful sometimes, especially coming from a pretty much completely bio background. Gotta start somewhere.
If you want to calculate p-values you should set up contrasts in edgeR or DESeq2 instead of performing a stats test on the normalized values. Those programs are already designed to take into consideration some of the pitfalls of DEG analysis.
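As a rough illustration of what setting up a contrast in edgeR can look like, here is a minimal sketch using the quasi-likelihood pipeline. The simulated counts, the two groups `A` and `B`, and the choice of `glmQLFit`/`glmQLFTest` are assumptions for the example, not something specified in this thread; a real design should follow the actual experiment.

```r
library(edgeR)

# Hypothetical toy data: 500 genes x 6 samples in two groups.
set.seed(1)
counts <- matrix(rnbinom(3000, mu = 100, size = 10), nrow = 500,
                 dimnames = list(paste0("gene", 1:500), paste0("s", 1:6)))
group <- factor(rep(c("A", "B"), each = 3))

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)  # TMM normalization factors

# One coefficient per group, so contrasts can be written by group name.
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

y   <- estimateDisp(y, design)
fit <- glmQLFit(y, design)

# Contrast: group B versus group A.
con <- makeContrasts(B - A, levels = design)
res <- glmQLFTest(fit, contrast = con)

# Full results table; the FDR column holds the adjusted p-values.
tab <- topTags(res, n = Inf)$table
head(tab)
```

The adjusted p-values for the genes of interest can then be taken from `tab` and used to annotate the plot, rather than running a separate test on the normalized values.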
Why do you think you are even using EdgeR if not to have it do the math for you? Did you read that bioinformatics post? EdgeR exists because simple statistics are not appropriate.
Thank you for making it more clear rpolicastro and swbarnes2.
I think much of my confusion/uncertainty is derived from a lack of understanding of how EdgeR and DESeq2 work (and of course statistics...). I think I will benefit from reading through the papers and trying to tease apart all the statistics terms.
This is just a picture. You don't have to follow a mathematically perfect algorithm. If you have TMM, plot that. Or log that if the values are too spread out.
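A box plot along those lines could be sketched as follows, using base R graphics. The data are simulated, `gene1` is a placeholder for a gene of interest, and the values are TMM-normalized log2-CPM rather than the quantile-normalized values described in the paper's legend.

```r
library(edgeR)

# Hypothetical toy data: 500 genes x 6 samples in two groups.
set.seed(1)
counts <- matrix(rnbinom(3000, mu = 100, size = 10), nrow = 500,
                 dimnames = list(paste0("gene", 1:500), paste0("s", 1:6)))
group <- factor(rep(c("A", "B"), each = 3))

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)
logcpm <- cpm(y, log = TRUE)  # TMM-normalized log2-CPM

gene <- "gene1"  # placeholder gene name
expr <- logcpm[gene, ]

# Box per group, with every sample overlaid as a jittered point.
boxplot(expr ~ group, outline = FALSE,
        xlab = "Group", ylab = "log2-CPM (TMM)", main = gene)
stripchart(expr ~ group, vertical = TRUE, method = "jitter",
           pch = 16, add = TRUE)
```

Setting `outline = FALSE` keeps outliers from being drawn twice, since all samples are already shown as points.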