Applying a log2 transformation to bulk RNA-seq data can achieve a more uniform distribution across the samples. This transformation is beneficial because it helps stabilize the variance and compress the range of the data points. By doing so, we can reduce the impact of extreme values or outliers, making the data more suitable for various analytical techniques and giving a clearer representation of underlying patterns.
These are my PI's words, and when I checked them with ChatGPT I found that using a log2 transformation for this purpose is indeed a common and valid practice. However, I need a credible academic paper to cite as a reference in my work, and despite my efforts I have not been able to locate a suitable one. Could anyone help me find a reliable publication that I can use as a reference? I would greatly appreciate any help. Thank you!
My personal opinion is that such basic stats knowledge does not require a citation, just as you don't need a citation for the fact that DNA consists of four nucleotides. It's the same level of basic knowledge. If it's absolutely necessary, why not take any textbook on basic data analysis, check the section on common data transformation methods, and cite that?
The Earth is flat. Prove me wrong.
Yeah, people have been using log transformations to stabilise variance in heteroskedastic variables for about as long as we've known how to calculate logs. That probably goes back to before the current system of scholarly publishing.
When it comes to bulk RNA-seq, the papers usually cited as the first RNA-seq papers are:
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008 Jul;5(7):621-8. doi: 10.1038/nmeth.1226. Epub 2008 May 30. PMID: 18516045.
and
Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008 Jun 6;320(5881):1344-9. doi: 10.1126/science.1158441. Epub 2008 May 1. PMID: 18451266; PMCID: PMC2951732.
Both use log-transformed counts without giving it a second thought.
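For anyone who wants to see the variance-stabilising effect rather than cite it, here is a minimal sketch. It assumes a toy model in which each hypothetical gene's counts carry multiplicative (log-normal) noise; real RNA-seq counts are usually modelled as negative binomial, and at very low counts the log does not stabilise the variance as cleanly. The function names, the noise level and the pseudocount of 1 are illustrative choices, not anything taken from the papers above.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

def simulate_counts(mean, n=2000, sigma=0.3):
    """Hypothetical gene: integer counts with multiplicative (log-normal) noise."""
    return [round(mean * math.exp(random.gauss(0.0, sigma))) for _ in range(n)]

def log2p1(counts):
    """log2 with a pseudocount of 1 so that zero counts stay defined."""
    return [math.log2(c + 1) for c in counts]

raw_sd = {}
log_sd = {}
for mean in (10, 100, 1000, 10000):
    counts = simulate_counts(mean)
    raw_sd[mean] = statistics.stdev(counts)          # grows with the mean
    log_sd[mean] = statistics.stdev(log2p1(counts))  # stays roughly constant
    print(f"mean={mean:>6}  sd(raw)={raw_sd[mean]:10.1f}  sd(log2)={log_sd[mean]:.3f}")
```

Under this toy model the raw standard deviation scales with expression level, while the standard deviation on the log2 scale is about the same for lowly and highly expressed genes, which is the heteroskedasticity point made above.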
Agreed that you don't need a citation unless you're writing a math paper about a new transformation that outperforms log2. Here are two you can use though:
https://www.jstor.org/stable/3001536?seq=14
https://www.jstor.org/stable/2673623
A Box-Cox transformation generally outperforms log2 on skewed datasets, though the differences are often small enough that there is no significant effect on downstream applications. As in many other areas, the convenience of the tools we use often comes before their objective quality.
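To make the comparison concrete, here is a small sketch of the standard one-parameter Box-Cox family, (x^lambda - 1) / lambda for lambda != 0 and ln(x) for lambda = 0. The function name box_cox is just an illustrative helper, and the inputs are assumed to be strictly positive. It shows that the log transform is the lambda = 0 member of the family, and that log2 differs from it only by the constant factor 1 / ln(2).

```python
import math

def box_cox(x, lam):
    """One-parameter Box-Cox transform for x > 0 (illustrative helper)."""
    if lam == 0:
        return math.log(x)            # limiting case of the family
    return (x ** lam - 1.0) / lam

for x in (1.0, 10.0, 1000.0):
    near_zero = box_cox(x, 1e-6)                 # lambda close to 0
    natural_log = box_cox(x, 0)                  # exact lambda = 0 case
    rescaled_log2 = math.log2(x) * math.log(2)   # log2 rescaled to natural log
    print(f"x={x:>7}  lambda~0: {near_zero:.5f}  ln: {natural_log:.5f}  log2*ln2: {rescaled_log2:.5f}")
```

So choosing log2 amounts to fixing lambda at 0 instead of estimating it from the data, which is one reason the two tend to give similar results when the estimated lambda is close to 0.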
Is there ever a good reason to use a log transformation instead of Box-Cox?