Hello, my name is Lucas. I am a master's student in Brazil, studying Bioinformatics and Molecular Biology at UFG.
I recently started analysing count data and need to evaluate differences in counts among three groups, but I have only 3 samples in each group.
I performed a normality test in GraphPad Prism, and all groups passed, but are 3 samples per group too few for a one-way ANOVA? How much confidence can I have in this test?
There are 3 groups of 3 samples each (n = 9). The small sample size is due to the high cost of RNA sequencing of cell cultures.
Hi Lucas.
I think you should analyse the raw RNAseq counts with the DESeq2 package (in the R programming language). Running a normality test or an ordinary t-test is not a suitable approach for RNAseq data. That said, three samples per group is generally acceptable for an RNAseq experiment.
I performed a normality test in GraphPad Prism, and all groups passed
Keep in mind that most (all?) tests in classical statistics assess the compatibility of the observed data with a preset null hypothesis. Equivalently, they express how surprised you should be to observe that data in a world where the null hypothesis holds: a small p-value means "very surprising" while a large p-value means "compatible". Note that nothing here says the null hypothesis is probably true or false. As you suspected, a small sample is likely to be compatible with almost any null hypothesis, because it does not carry enough information to detect incompatibility. 3 heads in a row are still compatible with the hypothesis "coin is fair".
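To put a number on the coin example, here is a minimal sketch of an exact two-sided binomial test using only the Python standard library. The function name `binom_two_sided_p` is made up for this illustration, and the 30-flip case is added purely for contrast.

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value (hypothetical helper).

    Sums the probability of every outcome that is no more likely
    than the observed one under the null hypothesis.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Small tolerance so ties are not lost to floating-point rounding
    return sum(q for q in pmf if q <= observed * (1 + 1e-9))

p3 = binom_two_sided_p(3, 3)     # 3 heads out of 3 flips
p30 = binom_two_sided_p(30, 30)  # 30 heads out of 30 flips

print(p3)   # 0.25: not surprising under "coin is fair"
print(p30)  # tiny: very surprising under "coin is fair"
```

The same all-heads pattern is unremarkable with 3 flips and overwhelming with 30, which is the sense in which a small sample "passes" almost any test.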
For RNAseq you can still get meaningful results from n=3 using appropriate methods (see MRezaei's answer), but also because (implicitly) you/we are making these assumptions:
The estimate of variance for one gene can be improved by looking at the variance from other genes in the same dataset (this is what edgeR, DESeq, limma do, among other things). This is much better than analysing each gene in isolation from the others.
You are interested in large changes (i.e. |logFC| > 1 or even > 2).
Variation within groups is expected to be small. N=3 may be good enough with cell cultures or other controlled experimental setups. It's probably hopeless for samples from the wild, since you may have confounders associated with the variable of interest or otherwise large within-group variation. I don't think the edgeR/DESeq machinery can help here. In fact, I think N=3 for RNAseq and N=5 for microarrays were regarded as the minimum number of replicates needed to control technical variation, but you probably have some biological variation on top of that, even within groups.
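The first assumption above (borrowing variance information across genes) can be illustrated with a toy sketch. It shrinks each gene's variance toward the average variance of all genes, using a weighted average of the same form as limma's moderated variance. This is not what edgeR, DESeq2 or limma actually run: those packages estimate the prior from the data, whereas here `prior_df` is an arbitrary assumed value, the prior variance is just the mean of the per-gene variances, and the expression values are invented.

```python
from statistics import mean, variance

# Invented log-expression values: 4 genes, 3 replicates in one group
genes = {
    "geneA": [5.1, 5.0, 5.2],
    "geneB": [7.9, 8.4, 8.1],
    "geneC": [3.0, 3.0, 3.1],
    "geneD": [6.2, 5.5, 6.9],
}

def moderated_variances(data, prior_df=4.0):
    """Shrink per-gene variances toward the across-gene mean variance.

    prior_df is an assumption here; real packages estimate it.
    """
    raw = {g: variance(v) for g, v in data.items()}
    prior_var = mean(list(raw.values()))
    shrunk = {}
    for g, values in data.items():
        d = len(values) - 1  # residual degrees of freedom for this gene
        shrunk[g] = (prior_df * prior_var + d * raw[g]) / (prior_df + d)
    return raw, shrunk

raw, shrunk = moderated_variances(genes)
for g in genes:
    print(g, round(raw[g], 4), "->", round(shrunk[g], 4))
```

With only 2 degrees of freedom per gene, the raw variances are very noisy (geneC looks almost constant by chance); pulling them toward a common value is what makes tests on n=3 more stable than analysing each gene in isolation.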
Granted, I don't claim any authority and I'm just expressing my own confused understanding. Perhaps you are referring to this sentence:
I don't think the edgeR/DESeq machinery can help here.
What I meant is that large variation can be remedied only by increasing sample size. Confounders can be remedied by increasing sample size only if the association between the confounder and the variable of interest occurs by chance (think of randomization as a strategy to break unwanted links). In Bayesian statistics you can plug in prior information in the form of a distribution describing what you think the reasonable range of parameter values should be. But that opens another can of worms in terms of why and how to choose such priors. So it seems to me there is no statistical way around it other than presenting the results as fairly and as cleanly as you can. After all, statistics is a tool to communicate results more than to generate results (in my opinion).
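As a rough illustration of why only sample size helps against large variation, the sketch below computes the standard error of a difference between two group means. It assumes equal group sizes `n` and the same within-group standard deviation `sd` in both groups; the function name and the numbers are made up for the example.

```python
from math import sqrt

def se_of_difference(sd, n):
    """Standard error of (mean of group 1 - mean of group 2).

    Assumes two independent groups of size n with common standard
    deviation sd (a simplifying assumption for illustration).
    """
    return sd * sqrt(2.0 / n)

for n in (3, 6, 12):
    print(n, round(se_of_difference(1.0, n), 3))
# 3 0.816
# 6 0.577
# 12 0.408
```

Halving the uncertainty takes four times as many replicates, so with n=3 and noisy samples only large differences stand out from the noise.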
So from a statistical point of view, for typical RNAseq experiments interested in differential expression it is unlikely you can do better than DESeq or similar packages. In fact, the limma package, initially developed for gene expression microarrays, performs very well whenever you have few samples and many features sharing the same characteristics (genes, proteins, etc...).
Okay, but won't using DESeq2 be useful? What other statistical tests can I use to test my hypothesis? Simpler tests like ANOVA?