I have seen some platforms, such as GEPIA, offering ANOVA for differential gene expression analysis. However, as far as I understand, ANOVA compares group means and assumes equal distribution and variance among samples, which, as far as I have been led to believe, is uncommon for any kind of RNA-seq derived data, especially considering the thousands of possibly expressed genes in the human genome. Is ANOVA really appropriate for differential expression?
Nope. The distribution of RNA-seq data is also not normal (which ANOVA assumes as well). You should use purpose-built tools such as edgeR, DESeq2 or limma.
Actually, limma is based on linear models/ANOVA under the hood. While raw RNA-seq data is never normal, limma works on log-transformed CPM values, which are normal enough for ANOVA. So yes, it is possible to analyze RNA-seq data with ANOVA, but I agree that it is rather sub-optimal compared to more modern methods such as DESeq2 and edgeR (which use negative binomial modeling of the raw counts).
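For anyone wondering what "log-transformed CPM" looks like in practice, here is a minimal Python sketch (standard library only). The offsets of 0.5 on the count and 1 on the library size are, as far as I know, what voom uses, but treat them as an assumption; `log2_cpm` and the toy counts are made up for illustration.

```python
import math

def log2_cpm(counts):
    """Log2 counts-per-million for one sample (list of raw counts per gene).

    The 0.5 and 1.0 offsets are assumptions of this sketch; they keep
    zero counts finite after taking the log.
    """
    lib_size = sum(counts)
    return [math.log2((c + 0.5) / (lib_size + 1.0) * 1e6) for c in counts]

# Hypothetical sample with four genes, one of them unobserved.
sample = [0, 10, 90, 900]
print(log2_cpm(sample))
```

The transformed values are what the linear model is fitted to; the raw counts themselves are not used directly.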
The issue is not so much "normality" (limma-voom and sleuth don't do negative binomial modeling, and they work super well). The issue lies with variance estimation, which is why limma does not use t-tests/ANOVA in the traditional sense: it uses those tests but regularizes the per-gene variance estimates. That is necessary in almost all cases, because variances estimated from a handful of replicates are unstable.
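To make the variance point concrete, here is a rough Python sketch of what "regularizing the variance" means. The per-gene variance is shrunk toward a prior value before the t-statistic is computed, as a weighted average of the two. In limma the prior variance and prior degrees of freedom are estimated from all genes; here `prior_var` and `prior_df` are simply hypothetical numbers I picked, and the function names are mine.

```python
import math
import statistics

def ordinary_t(a, b):
    """Classical two-sample t (pooled variance). Returns (t, pooled_var, df)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / df
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se, pooled, df

def moderated_t(a, b, prior_var, prior_df):
    """t-statistic using a variance shrunk toward prior_var (assumed given)."""
    _, s2, df = ordinary_t(a, b)
    # Weighted average of prior variance and this gene's own variance.
    s2_post = (prior_df * prior_var + df * s2) / (prior_df + df)
    se = math.sqrt(s2_post * (1 / len(a) + 1 / len(b)))
    return (statistics.mean(a) - statistics.mean(b)) / se

# Hypothetical gene: tiny difference, but an accidentally tiny variance.
a = [5.00, 5.01, 4.99]
b = [5.10, 5.11, 5.09]
print("ordinary t: ", ordinary_t(a, b)[0])
print("moderated t:", moderated_t(a, b, prior_var=0.05, prior_df=4))
```

With three replicates per group, a gene can get a huge ordinary t just because its variance happened to come out small. The shrunken variance pulls that statistic back toward something more believable.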
Most differential gene expression packages support ANOVA-like comparisons, so just stick with those.
Yes, I agree that limma is not 'classical' ANOVA but rather an extension of ANOVA. Still, I wanted to add some nuance to the clear-cut answer above stating that ANOVA cannot be used for RNA-seq analysis because the counts are not normal.
Also, my understanding is that, without the log transformation of the count data, normality would be an issue for linear model/ANOVA-based methods such as limma, but not for edgeR or DESeq2, since they make different assumptions about the data.
Hi Carlo,
Sorry to pop in - I have sort of a non-classical problem. I am not an informatics person, and I am using a tool called Partek Genomics Suite to give a collaborator an extremely rough view of his scRNA-seq data. Partek does not (as far as I can tell) include edgeR or DESeq2. The informatics guys might get to this data next week, but my collaborator needs to show his PI just a peek this weekend. I have RNA-seq CPM values, which are, obviously, often zero. Is the following reasonable:
1) make the zero values non-zero (a very small number)
2) log2 transform
3) Run ANOVA
I know this is definitely not kosher, but could I at least stack-rank the genes by p-value or fold-change to give a fuzzy picture of the biology?
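For what it's worth, the three steps above can be sketched in a few lines of Python (standard library only). This is only a rough illustration, not a recommended pipeline. Two things in it are my own assumptions: the names and toy CPM values are hypothetical, and I add a pseudocount of 1 to every value rather than replacing only the zeros with a very small number, since log2 of a tiny number is a large negative value that can dominate the variance. Genes are ranked by the F statistic, which gives the same order as ranking by p-value as long as every gene has the same group sizes.

```python
import math
import statistics

def one_way_f(groups):
    """Classical one-way ANOVA F statistic for a list of groups of values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups)
    if ss_within == 0:
        # No within-group spread: F is undefined, so return a sentinel value.
        return math.inf if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def rank_genes(cpm, pseudocount=1.0):
    """Rank genes by F on log2(CPM + pseudocount), largest F first."""
    scores = {}
    for gene, groups in cpm.items():
        logged = [[math.log2(x + pseudocount) for x in g] for g in groups]
        scores[gene] = one_way_f(logged)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical CPM values: gene -> one list of replicates per condition.
cpm = {
    "geneA": [[0.0, 1.0, 0.0], [50.0, 60.0, 40.0]],
    "geneB": [[10.0, 12.0, 11.0], [11.0, 10.0, 12.0]],
    "geneC": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
}
print(rank_genes(cpm))
```

If actual p-values are needed, `scipy.stats.f_oneway` computes the same classical test. Either way this is plain per-gene ANOVA with no variance moderation, so the resulting order is best read as a rough first look.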
According to A Beginner’s Guide to Analysis of RNA Sequencing Data (https://www.atsjournals.org/doi/10.1165/rcmb.2017-0430TR) ANOVA is an appropriate analysis for RNA-seq data. However, the review doesn't specify a tool/package to do this analysis. Searching for how to do ANOVA on RNA-seq data brought me to this page.
You should ask a separate question. scRNA-seq is not my specialty, so others might provide better answers.