According to the central limit theorem, a t-test can be used on non-normally distributed samples. Besides, RNA-Seq counts are better fit by a negative binomial distribution, which doesn't differ much from a normal distribution. So why can't we just use a t-test for DE estimation?
My understanding from a presentation I saw (I am not a statistician) is that you could use a t-test IF you have a large number of samples (think tens or more). I recall the n being something like 20.
By "if you have a large number of samples", do you mean that gene expression follows a non-normal distribution (although a similar one), so only with a large number of samples would the expression statistics converge to a normal distribution (by the central limit theorem)? Am I understanding it correctly?
Counts themselves will never approach a normal distribution, since they're integer and bounded at 0. They can be transformed to be "close enough", though, which is part of what voom() does.
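To make the "close enough" transformation concrete, here is a minimal Python sketch of a log2 counts-per-million transform with small offsets, which is the kind of transformation voom() starts from. It leaves out the mean-variance trend and the precision weights that voom() also estimates. The function name `log_cpm` and the example counts are hypothetical, chosen only for illustration.

```python
import math

def log_cpm(counts, lib_size, prior=0.5):
    """log2 counts-per-million; the small offsets keep zero counts finite."""
    return [math.log2((c + prior) / (lib_size + 1.0) * 1e6) for c in counts]

# Hypothetical counts for four genes in one library of 1,000,000 reads.
counts = [0, 10, 100, 1000]
values = log_cpm(counts, lib_size=1_000_000)

for c, v in zip(counts, values):
    print(f"count={c:5d}  log2-CPM={v:.3f}")
```

The transformed values are continuous and no longer bounded at 0, which is what makes normal-theory methods more plausible on this scale than on raw counts.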
Even though counts don't approach a normal distribution, the central limit theorem still allows me to use a t-test on normalized counts if we have a sufficient sample size (although most likely we don't), right?
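The sample-size question can be explored with a rough simulation. The sketch below draws negative binomial counts as a gamma-Poisson mixture and measures how skewed the per-group sample mean is for a small and a larger number of replicates. All parameters (mean 20, dispersion 0.5, n = 3 versus n = 30, 2000 repetitions) are assumptions picked for illustration, not values from any study, and the skewness of the mean is only a crude stand-in for how well a t-test would behave.

```python
import math
import random

def poisson(lam, rng):
    # Knuth's multiplication method; adequate for the modest means used here.
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def negative_binomial(mean, dispersion, rng):
    # Gamma-Poisson mixture: variance = mean + dispersion * mean**2.
    lam = rng.gammavariate(1.0 / dispersion, mean * dispersion)
    return poisson(lam, rng)

def skewness(xs):
    m = sum(xs) / len(xs)
    m2 = sum((x - m) ** 2 for x in xs) / len(xs)
    m3 = sum((x - m) ** 3 for x in xs) / len(xs)
    return m3 / m2 ** 1.5

def skew_of_sample_mean(n, reps, rng, mean=20.0, dispersion=0.5):
    # Skewness of the sample mean over many simulated groups of size n.
    means = [
        sum(negative_binomial(mean, dispersion, rng) for _ in range(n)) / n
        for _ in range(reps)
    ]
    return skewness(means)

rng = random.Random(0)
skew_small = skew_of_sample_mean(n=3, reps=2000, rng=rng)
skew_large = skew_of_sample_mean(n=30, reps=2000, rng=rng)
print(f"skewness of the mean, n=3:  {skew_small:.2f}")
print(f"skewness of the mean, n=30: {skew_large:.2f}")
```

A normal distribution has zero skewness, so the closer the value is to 0, the more reasonable the normal approximation for the mean. With only a few replicates the mean is still visibly skewed under these assumptions.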
Do you have any reference for this n > 20?
Have a look at the studies by Gierlinski et al., particularly this, this, and this
Out of interest, is this purely academic interest, or do you have data that do not behave as expected with standard tools and you are now trying to tweak parameters?
It is purely academic interest. I want to get a rough picture of how DE is done.