
Question

Pooling animal samples to a lower number of replicates vs. sequencing a subgroup of the animal


20 months ago

Jingyue ▴ 70

Hi, community,

A collaborator has n=10 in their animal experiment in 5 conditions and wants to do some sequencing projects. It will be expensive to sequence all 50 animals.

Would you randomly choose a subgroup of n=3 out of the 10, or pool the 10 animals into 3 pooled replicates? The sequencing price will be the same for these 15 samples either way, but which approach is better?

Thanks!!

experimental-design RNA-seq RRBS • 1.2k views

ADD COMMENT • link updated 20 months ago by i.sudbery 20k • written 20 months ago by Jingyue ▴ 70


My intuition is telling me that pooling your 10 (9) samples per group into 3 groups of 3 will provide better results for RNA-seq and RRBS. Generally speaking, each pooled sample will be akin to a mean rather than a single value. This is generally good because, assuming good sampling, that mean will tend to be closer to the true population mean than any individual point. I'm curious to see what others say.

ADD REPLY • link 20 months ago by rpolicastro 13k

score 13 · Accepted Answer · 2023-04-06

I found this question quite interesting, so I seem to have gone off the deep-end a bit investigating (perhaps I just didn't want to do my real work today!).

My initial thought was that pooling samples would artificially reduce the estimate of biological variance, and would therefore cause an increase in the false positive rate. While I was pretty much 100% sure that the biological variance estimate would be reduced, I wasn't 100% sure how this would affect statistical tests between two groups, so I set out to find out.

I used the classic approach of simulating two sets of samples that have no real difference, and testing them. Repeat this many times, and you get an idea of the false positive rate. For a correctly calibrated test, the p-value should be less than 0.05 in 5% of cases where there is no real difference.

T-tests

I started out with t-tests. I simulated 9 samples each from two conditions, pooled them into 3 samples per condition by taking the mean of each set of 3 samples, and performed a t-test. I compared this to directly simulating 3 samples per condition. In each case the mean of condition 1 was 1, and the standard deviation was 1. I did 10,000 simulations, and calculated the proportion of p-values < 0.05 in each case.

| mu-condition 2 | fraction p < 0.05  |
|                | pooled  | unpooled |
+----------------+---------+----------+
|              1 |  0.049  |    0.047 |
|            1.5 |  0.191  |    0.077 |
|            2.0 |  0.371  |    0.155 |
|            3.0 |  0.879  |    0.457 |

By looking at the case where the mean of condition 2 is 1, we can see the false positive rate, which is pretty close to 5%, as expected, and is not different between pooled and unpooled.

By looking at cases where there is a difference, we can estimate power. In an ideal world, each of these would be 1, and the higher the better. We can see that in all cases, the pooled power is higher than the unpooled.

So in the case of a simple t-test, the answer is that pooling does improve things.
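
For anyone who wants to try this, here is a minimal Python sketch of the t-test simulation described above. It is not the code used to produce the table; the function name `rejection_rates`, the seed, and the choice of an equal-variance Student's t-test (SciPy's default) are assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def rejection_rates(mu2, n_sims=10_000, pool_size=3, n_pools=3, alpha=0.05, seed=0):
    """Return the fraction of simulations with p < alpha as (pooled, unpooled).

    Condition 1 has mean 1, condition 2 has mean mu2, and both have SD 1.
    """
    rng = np.random.default_rng(seed)

    # Pooled: simulate pool_size * n_pools animals per condition and
    # average each set of pool_size animals into one pooled sample.
    c1 = rng.normal(1.0, 1.0, size=(n_sims, n_pools, pool_size)).mean(axis=2)
    c2 = rng.normal(mu2, 1.0, size=(n_sims, n_pools, pool_size)).mean(axis=2)
    p_pooled = stats.ttest_ind(c1, c2, axis=1).pvalue

    # Unpooled: simulate n_pools individual animals per condition directly.
    u1 = rng.normal(1.0, 1.0, size=(n_sims, n_pools))
    u2 = rng.normal(mu2, 1.0, size=(n_sims, n_pools))
    p_unpooled = stats.ttest_ind(u1, u2, axis=1).pvalue

    return float(np.mean(p_pooled < alpha)), float(np.mean(p_unpooled < alpha))


for mu2 in (1.0, 1.5, 2.0, 3.0):
    pooled, unpooled = rejection_rates(mu2)
    print(f"mu2={mu2}: pooled={pooled:.3f} unpooled={unpooled:.3f}")
```

With `mu2=1.0` both fractions estimate the false positive rate; for the other values they estimate power. Averaging 3 animals divides the variance of each pooled sample by 3, which is why the same difference in means is easier to detect after pooling.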

Negative Binomial and DESeq2

I wondered whether the fact that the t-tests were well behaved was connected to the additive nature of the variances involved. To simulate an RNA-seq experiment, I took advantage of the structure of the standard model of RNA-seq: the biological variance in abundance of a transcript in a cell is modelled by a gamma distribution, and the sequencing process is modelled by a Poisson; compounding these two processes (a gamma-Poisson mixture) results in a negative binomial.

I modelled the abundance in individual samples as draws from a gamma distribution with shape parameter 5 (taken from the edgeR vignette). I simulated pooling by averaging several such draws. I then simulated sequencing with a Poisson draw on the mean of the gamma draws. Each gene had a grand mean expression that was drawn from a log-normal distribution.

I simulated 10,000 genes with a log2 fold change (LFC) of 0, and 5,000 genes each with LFCs of -1, -0.5, -0.1, 0.1, 0.5 and 1. I did this both for 9 samples per condition pooled into 3 pools of 3, and for 3 unpooled samples per condition, and then ran a standard DESeq2 analysis.
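
The count-generation step described above can be sketched in Python as follows. This is an illustrative reconstruction, not the original simulation code, and it does not reproduce the DESeq2 step. The function names, the seeds and the log-normal parameters are hypothetical; only the gamma shape of 5 comes from the description.

```python
import numpy as np


def simulate_counts(gene_means, n_samples, pool_size=1, shape=5.0, seed=0):
    """Simulate a genes x samples count matrix from a gamma-Poisson model.

    Each animal's abundance is a gamma draw with the given shape and a mean
    equal to the gene's grand mean. Pooling averages pool_size animals
    before the Poisson "sequencing" draw.
    """
    rng = np.random.default_rng(seed)
    gene_means = np.asarray(gene_means, dtype=float)
    # scale = mean / shape gives a gamma distribution with the requested mean.
    scale = gene_means[:, None, None] / shape
    abundance = rng.gamma(shape, scale, size=(gene_means.size, n_samples, pool_size))
    return rng.poisson(abundance.mean(axis=2))


def moment_dispersion(counts):
    """Method-of-moments dispersion per gene: (variance - mean) / mean**2."""
    m = counts.mean(axis=1)
    v = counts.var(axis=1, ddof=1)
    return (v - m) / m**2


# Hypothetical log-normal parameters for the per-gene grand means.
rng = np.random.default_rng(1)
gene_means = rng.lognormal(mean=5.0, sigma=1.0, size=1_000)
unpooled = simulate_counts(gene_means, n_samples=3, pool_size=1, seed=2)
pooled = simulate_counts(gene_means, n_samples=3, pool_size=3, seed=3)
print(unpooled.shape, pooled.shape)

# Sanity check on a single gene with many samples, so the moments are stable.
d_unpooled = moment_dispersion(simulate_counts([200.0], 50_000, pool_size=1, seed=4))
d_pooled = moment_dispersion(simulate_counts([200.0], 50_000, pool_size=3, seed=5))
print(d_unpooled, d_pooled)
```

The mean of 3 independent gamma draws with shape 5 is itself gamma distributed with shape 15, so pooling changes the negative binomial dispersion from 1/5 to 1/15. The sanity check at the end estimates both values from simulated counts.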

Unpooled:

| lfc | pvalue < 0.05 | padj < 0.05 |
+-----+---------------+-------------+
|   0 |        0.0661 |      0.0057 |
| 0.1 |        0.0696 |      0.0067 |
| 0.5 |        0.1740 |      0.0289 |
|   1 |        0.4846 |      0.1554 |

And for the pooled:

| lfc | pvalue < 0.05 | padj < 0.05 |
+-----+---------------+-------------+
|   0 |        0.0563 |      0.0164 |
| 0.1 |        0.0752 |      0.0255 |
| 0.5 |        0.3802 |      0.2109 |
|   1 |        0.9017 |      0.7872 |

The test is anti-conservative in both cases: the fraction of genes with an unadjusted p-value < 0.05 when the LFC is 0 is above 5%. As this is true for both pooled and unpooled samples, it probably points to some problem with my simulations (perhaps @MileLove can advise). It is worth noting that while the proportion of null genes with p-value < 0.05 is similar, the proportion of genes with padj < 0.05 when the LFC is 0 is somewhat higher in the pooled than in the unpooled samples (0.0164 vs. 0.0057). However, the fraction of genes discovered when the LFC is not 0 is much higher for the pooled samples, suggesting that the pooled approach is more powerful, if with a slightly higher false positive rate.

Final thoughts

It's been pointed out that some empirical research exists on this question (thanks @h.mon), but that it is not consistent in its conclusions:

https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-6721-y

https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1767-y

It tends to compare pooling to sequencing all the samples separately, which is not the suggestion here. At least one of these papers assumes that the results from the unpooled analysis are the ground truth, but this might not be true if the RNA was of sufficiently poor quality that the sequencing results from single samples are not good.

Pooling does mean that it is not possible to control for factors other than condition in the analysis. It would therefore be important to design the pools with care. For example, either all the animals in a pool should be from the same litter, or there should be one animal from each litter in each pool. Ditto for sex, etc.
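
As a rough illustration of the "one animal from each litter in each pool" design, here is a small Python sketch. The function name `litter_balanced_pools`, the animal IDs and the litter labels are all hypothetical, and it assumes every litter contributes exactly as many animals as there are pools.

```python
import random


def litter_balanced_pools(litters, n_pools, seed=0):
    """Deal one randomly chosen animal from each litter into each pool.

    litters maps a litter label to its list of animal IDs; every litter
    must contain exactly n_pools animals.
    """
    rng = random.Random(seed)
    pools = [[] for _ in range(n_pools)]
    for label, animals in litters.items():
        if len(animals) != n_pools:
            raise ValueError(
                f"litter {label!r} has {len(animals)} animals, expected {n_pools}"
            )
        # Shuffle within the litter so pool membership is random.
        shuffled = list(animals)
        rng.shuffle(shuffled)
        for pool, animal in zip(pools, shuffled):
            pool.append(animal)
    return pools


# Hypothetical example: 3 litters of 3 animals in one condition.
litters = {
    "A": ["A1", "A2", "A3"],
    "B": ["B1", "B2", "B3"],
    "C": ["C1", "C2", "C3"],
}
pools = litter_balanced_pools(litters, n_pools=3)
for i, pool in enumerate(pools, start=1):
    print(f"pool {i}: {pool}")
```

The same idea extends to sex or any other known factor: stratify by the factor first, then randomise within each stratum so that every pool gets the same mix.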