Question

Biostatistics with few samples in RNA-seq counts

8 months ago

Lucas ▴ 10

Hello, my name is Lucas. I am a master's student in Bioinformatics and Molecular Biology at UFG, in Brazil.

I recently started my analyses of count data and need to evaluate differences in counts among three groups, but I have only 3 samples in each group.

I performed a normality test in GraphPad Prism, and all groups passed, but are 3 samples too few for a one-way ANOVA? What is the confidence level of this test?

There are 3 groups containing 3 samples each (n=9 in total). The low number of samples is due to the high cost of cell culture RNA sequencing.

biostatistics • 745 views

updated 8 months ago by dariober 15k • written 8 months ago by Lucas ▴ 10

score 2 · Answer 1 · 2024-09-12

Hi Lucas. I think you should analyse the raw RNAseq count data with the DESeq2 package (in the R programming language). A normality test or an ordinary t-test is not a suitable method for RNAseq count data. That said, three samples per group is generally regarded as statistically acceptable for an RNAseq experiment.

Cheers

score 1 · Answer 2 · 2024-09-12

More of a general comment than an answer...

"I performed a normality test in GraphPad Prism, and all groups passed"

Keep in mind that most (all?) tests in classical statistics assess the compatibility of the observed data with a preset null hypothesis. Equivalently, they express how surprised you should be to observe that data in a world where the null hypothesis holds: a small p-value means "very surprising", while a large p-value means "compatible". Note that this says nothing about whether the null hypothesis is probably true or false. As you noticed, a small sample is likely to be compatible with almost any null hypothesis, since it does not carry enough information to detect incompatibility. So passing a normality test with n=3 is only weak evidence that the data are actually normal. 3 heads in a row are still compatible with the hypothesis "coin is fair".
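To put numbers on the coin example, here is a minimal Python sketch (standard library only). The function names are made up for illustration, and it uses an exact two-sided binomial test against "coin is fair", which is just one reasonable choice of test. It shows that with 3 flips even the most extreme outcome cannot give a small p-value.

```python
from math import comb


def two_sided_p_value(heads, flips):
    """Exact two-sided binomial p-value under the null 'coin is fair'.

    Sums the probabilities of all outcomes that are at most as likely
    as the observed one (hypothetical helper, for illustration only).
    """
    probs = [comb(flips, k) * 0.5 ** flips for k in range(flips + 1)]
    observed = probs[heads]
    # Small tolerance guards against floating-point ties.
    return sum(p for p in probs if p <= observed + 1e-12)


def smallest_possible_p(flips):
    """Smallest p-value any outcome of `flips` tosses can produce."""
    return min(two_sided_p_value(k, flips) for k in range(flips + 1))


p_three_heads = two_sided_p_value(3, 3)   # 3 heads out of 3 flips
min_p_n3 = smallest_possible_p(3)         # best case with 3 flips
min_p_n10 = smallest_possible_p(10)       # best case with 10 flips

print(p_three_heads, min_p_n3, min_p_n10)
```

With 3 flips the smallest attainable p-value is 0.25, so the test can never reject "coin is fair" at the usual 0.05 level, however biased the coin really is. With 10 flips the smallest attainable p-value is about 0.002, so a strong bias can at least be detected.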

For RNAseq you can still get meaningful results from n=3 using appropriate methods (see MRezaei's answer), but also because (implicitly) you/we are making these assumptions:

The estimate of variance for one gene can be improved by looking at the variance from other genes in the same dataset (this is what edgeR, DESeq, limma do, among other things). This is much better than analysing each gene in isolation from the others.
You are interested in large changes (i.e. logFC > 1 or even > 2).
Variation within groups is expected to be small. N=3 may be good enough with cell cultures or other experimental setups. It's probably hopeless for samples from the wild, since you may have confounders associated with the variable of interest or otherwise large within-group variation. I don't think the edgeR/DESeq machinery can help here. In fact, I think N=3 for RNAseq and N=5 for microarrays were regarded as the minimum number of replicates needed to control the technical variation, but you probably have some biological variation on top of that, even within groups.
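The first assumption above (borrowing information about variance across genes) can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not what edgeR, DESeq or limma actually run. The data are simulated normal values rather than counts, the prior variance is simply the mean of the per-gene variances, and the prior degrees of freedom (`PRIOR_DF`) is an arbitrary illustrative value, whereas the real packages estimate such quantities from the data. The shrinkage formula is a weighted average of the prior variance and the per-gene variance, weighted by their degrees of freedom.

```python
import random
from statistics import mean, variance

random.seed(0)

N_GENES = 1000   # simulated genes (illustrative)
N_REPS = 3       # replicates per gene, as in the question
PRIOR_DF = 4.0   # assumed prior degrees of freedom (arbitrary choice)

# Each gene gets its own true variance, drawn from a modest range.
true_var = [random.uniform(0.5, 1.5) for _ in range(N_GENES)]
samples = [[random.gauss(0.0, v ** 0.5) for _ in range(N_REPS)]
           for v in true_var]

# Per-gene sample variance: very noisy with only 3 replicates.
raw_var = [variance(s) for s in samples]

# Shrink each gene's variance toward the average across all genes.
prior_var = mean(raw_var)
resid_df = N_REPS - 1
shrunk_var = [(PRIOR_DF * prior_var + resid_df * s2) / (PRIOR_DF + resid_df)
              for s2 in raw_var]


def mse(estimates):
    """Mean squared error of the estimates against the true variances."""
    return mean([(e - t) ** 2 for e, t in zip(estimates, true_var)])


mse_raw = mse(raw_var)
mse_shrunk = mse(shrunk_var)
print(f"per-gene MSE: {mse_raw:.3f}, shrunken MSE: {mse_shrunk:.3f}")
```

In this setup the shrunken estimates land much closer to the true variances than the per-gene estimates do, which is the intuition for why n=3 can still be workable when thousands of genes are analysed together. The benefit gets smaller when true variances differ wildly between genes.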