Question

Statistical test in the case of low replicate number in RNAseq analysis

3


2.3 years ago

zbidav ▴ 30

Hello, I hope this is all right to ask here.

I frequently see that different published experiments use three or even two samples in each group (duplicates/triplicates). In many cases, there is simply no other choice - as the price of generating another sample is relatively high.

However, I am not sure how much statistical power can be obtained with so few replicates. Specific tools such as DESeq can be run on a small number of samples (if the counts follow a negative binomial distribution, unless I am mistaken). But what happens in the general case?

Specifically, I want to check RNA modification events (specific sites) in data containing triplicates and multiple groups, and am not sure what to do except t.test/ANOVA.

Thanks in advance!

replicate Statistics RNAseq low • 877 views

updated 2.3 years ago by 4galaxy77 2.9k • written 2.3 years ago by zbidav ▴ 30

1


To clarify: it's not the negative binomial model that fixes the problems with a low number of replicates.

The problem with a low number of replicates (e.g. n=2) is that estimating the variance is difficult. Therefore, DESeq and other methods (limma/sleuth/cuffdiff/edgeR/etc.) all apply an empirical Bayes technique called shrinkage to get better estimates of the variance. This technique has been used since the microarray days and is designed to handle exactly this issue.

2.3 years ago by dsull ★ 6.9k

1


The trick here is to "share information across genes". In a 2 vs 2 setup you do not have four but 4*n_genes datapoints. By modelling the trend of the variance between samples of the same group across the mean (so for every possible average expression level) you get a fairly decent estimate of the expected variance. With that expected variance, one can then judge whether an observed difference qualifies a gene as DE or is likely just a reflection of experimental noise. The distribution itself does not matter: limma, for example, does not use the NB. It just happened that people realized RNA-seq counts can be decently modelled with the NB.

2.3 years ago by ATpoint 85k

score 2 · Answer 1 · 2022-08-22

To expand on what ATpoint says: DESeq/edgeR/limma work by using empirical Bayes to share information between genes/probes/regions to compensate for the low number of samples. An approach inspired by this can be used in many problems where you have a small number of samples. Indeed, while limma is documented as a system for doing microarray/RNA-seq analysis, the core DE engine in limma is a system for doing moderated t-tests that should be applicable to any situation where you wish to do a large number of t-tests/linear models, each with a small number of replicates, where you have reason to believe that the tests are informative of each other.
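As a rough illustration of what "moderated" means here, the following Python sketch shrinks a single gene's pooled variance toward a prior variance before computing a two-group t-statistic. The function name `moderated_t` and the example values for `prior_var` and `prior_df` are made up for illustration. limma estimates the prior from all genes together rather than taking it as an input, and this is not limma's actual code.

```python
import math
from statistics import mean, variance


def moderated_t(group_a, group_b, prior_var, prior_df):
    """Two-group t-statistic with the gene's variance shrunk toward a prior.

    ASSUMPTION: prior_var and prior_df are supplied by the caller. In a real
    analysis they would be estimated from the variances of all genes.
    """
    n_a, n_b = len(group_a), len(group_b)
    resid_df = n_a + n_b - 2
    # Pooled within-group sample variance for this one gene.
    pooled = ((n_a - 1) * variance(group_a)
              + (n_b - 1) * variance(group_b)) / resid_df
    # Shrinkage: degrees-of-freedom-weighted average of prior and observed variance.
    post_var = (prior_df * prior_var + resid_df * pooled) / (prior_df + resid_df)
    se = math.sqrt(post_var * (1 / n_a + 1 / n_b))
    t_stat = (mean(group_b) - mean(group_a)) / se
    # The moderated statistic is referred to a t-distribution with the summed df.
    return t_stat, post_var, prior_df + resid_df


# Hypothetical log-expression values for one gene, two replicates per group.
control = [5.0, 5.1]
treated = [6.0, 6.1]
t_stat, post_var, total_df = moderated_t(control, treated, prior_var=0.05, prior_df=4)
print(t_stat, post_var, total_df)
```

With only two replicates per group the gene's own variance is unreliable, so in this example the posterior variance is pulled toward the prior and the resulting t-statistic is less extreme than the ordinary one would be.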

You will often find that for things like ChIP-seq, CLIP-seq, methylation and editing, results are reported at the level of individual sites, but conclusions are generally drawn from averages across many sites. You might find results saying that such-and-such a TF binds to this type of site, or that binding sites upstream of this category of gene change. When you do this sort of analysis, the accuracy of calls at individual sites/genes is less important, as long as the error is unbiased.

For RNA editing, you could try limma. How successful this is will probably depend on the resolution you are aiming for. You could look at individual bases, but RNA editing also tends to come in clusters, so you could instead look at windows and try to find regions of differentially edited bases.
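If you go the window route, one simple way to build the per-window data is to sum edited and total read counts over fixed-width bins. This is only a sketch: the input layout `(chrom, pos, edited_reads, total_reads)`, the function name `window_counts` and the 100-base width are all assumptions for illustration, not the output format of any particular tool.

```python
from collections import defaultdict


def window_counts(sites, width=100):
    """Sum edited and total read counts per fixed-width window.

    sites: iterable of (chrom, pos, edited_reads, total_reads) tuples
           (a hypothetical layout chosen for this example).
    Returns {(chrom, window_start): (edited_sum, total_sum)}.
    """
    acc = defaultdict(lambda: [0, 0])
    for chrom, pos, edited, total in sites:
        # Window start is the position rounded down to a multiple of width.
        key = (chrom, (pos // width) * width)
        acc[key][0] += edited
        acc[key][1] += total
    return {key: tuple(counts) for key, counts in acc.items()}


# Made-up sites for one sample.
example_sites = [
    ("chr1", 105, 3, 20),
    ("chr1", 150, 5, 25),
    ("chr1", 420, 1, 30),
]
print(window_counts(example_sites))
```

The per-window proportions from each sample could then be fed into whatever per-feature test you choose, in place of per-base values.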

If you want to look at individual bases, then you might want to investigate empirical Bayes with the beta-binomial model, as explained (with baseball statistics) here: http://varianceexplained.org/r/empirical_bayes_baseball/
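To give a flavour of that approach applied to editing sites, here is a minimal Python sketch: fit a Beta prior to the editing proportions of all sites, then shrink each site's proportion toward the prior mean, more strongly when coverage is low. The prior is fitted with a crude method of moments on the raw proportions, which is a simplification chosen for brevity and not necessarily how the linked post fits it. The function names and the counts are made up.

```python
from statistics import mean, variance


def fit_beta_prior(edited, total):
    """Method-of-moments fit of Beta(alpha, beta) to per-site proportions.

    ASSUMPTION: a crude fit. It ignores binomial sampling noise and is only
    valid when 0 < variance < mean * (1 - mean).
    """
    props = [k / n for k, n in zip(edited, total)]
    m, v = mean(props), variance(props)
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common


def shrink(edited, total, alpha, beta):
    """Posterior mean editing level per site under a Beta(alpha, beta) prior."""
    return [(k + alpha) / (n + alpha + beta) for k, n in zip(edited, total)]


# Made-up edited-read and total-read counts at five sites.
edited = [2, 30, 5, 45, 1]
total = [10, 100, 50, 150, 4]

alpha, beta = fit_beta_prior(edited, total)
for k, n, est in zip(edited, total, shrink(edited, total, alpha, beta)):
    print(f"{k}/{n}: raw={k / n:.3f} shrunk={est:.3f}")
```

A site with 1 edited read out of 4 is moved much closer to the overall average than a site with 45 out of 150, which is the behaviour you want when coverage varies a lot between sites.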