Question

Stratified sample assignment for analysis

19 months ago

fr ▴ 220

I have a general question about choosing an experimental design for sample analysis.

I have a dataset of about 100 samples with several metadata categories: time (t1 - t4), treatment (high, low), and patient (about 15 patients with repeated measures, each represented across time / treatment). I'm aiming to split the samples into about 6 batches. I'm wondering what the best strategy would be to minimize between-batch variability and get a good spread of all categories across the batches. Importantly, besides the patient effect (due to repeated measures), I'm not sure whether, and which of, the other categories may bias the data, which is another reason to want a balanced distribution across batches.

How would you approach this question in a general sense? Would you put all samples from a patient in the same batch so that you can compare against itself? Or spread them across batches?

experimental-design stratification • 837 views

updated 19 months ago by ATpoint 86k • written 19 months ago by fr ▴ 220

Batches for what? Processing/sequencing, what data are we talking about? If this is indeed paired then you basically must keep samples of a patient together, or it would not be paired anymore. Please elaborate.
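
If the design is indeed paired, keeping each patient's samples together while evening out batch sizes could be sketched roughly as below. This is only an illustration under assumptions: the sample layout (10 patients with all 8 time/treatment combinations, 5 patients with only t1 and t2, 100 samples in total), the function name `assign_patients_to_batches`, and the greedy "largest patient into the smallest batch" rule are all made up here, not something stated in the question.

```python
from collections import defaultdict
from itertools import product

def assign_patients_to_batches(samples, n_batches):
    """Greedy sketch: every sample of a patient lands in the same batch.

    samples: list of (patient, time, treatment) tuples (hypothetical layout).
    Returns (patient -> batch index, list of batch sizes).
    """
    per_patient = defaultdict(int)
    for patient, _time, _treatment in samples:
        per_patient[patient] += 1

    batch_sizes = [0] * n_batches
    assignment = {}
    # Largest patients first, each into the currently smallest batch.
    for patient, count in sorted(per_patient.items(), key=lambda kv: (-kv[1], kv[0])):
        target = min(range(n_batches), key=lambda b: batch_sizes[b])
        assignment[patient] = target
        batch_sizes[target] += count
    return assignment, batch_sizes

# Made-up design: P01-P10 have t1-t4, P11-P15 only t1-t2; both treatments each.
samples = []
for i in range(1, 16):
    times = ["t1", "t2", "t3", "t4"] if i <= 10 else ["t1", "t2"]
    for time, treatment in product(times, ["high", "low"]):
        samples.append((f"P{i:02d}", time, treatment))

assignment, batch_sizes = assign_patients_to_batches(samples, n_batches=6)

# Which treatments ended up in each batch (both, if patients are complete).
treatments_per_batch = defaultdict(set)
for patient, _time, treatment in samples:
    treatments_per_batch[assignment[patient]].add(treatment)

print(batch_sizes)
print({b: sorted(t) for b, t in sorted(treatments_per_batch.items())})
```

Because each patient here carries both treatment levels, whole-patient assignment keeps treatment balanced within every batch automatically. If patients differ in which time points they have, the per-batch composition is worth checking before committing to a layout.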

19 months ago by ATpoint 86k

score 0 · Answer 1 · 2023-06-06

First, you didn't tell us what kind of data you are working with. For argument's sake, I'll assume bulk RNA-seq.

Generally, your options are:

1) stratified analysis
2) combined analysis, controlling for batch (or a surrogate for batch) as a covariate
3) a generalized linear mixed model (GLMM, not GLM) with a blocking variable
4) repeated measures

Selecting among them takes experience more than anything else. Personally, I always cluster samples via a distance matrix, and additionally use an agnostic descriptive technique such as principal component analysis.

Either/both of these techniques will let you view the relationship(s) between the samples in your data, including how similar/different they are to one another.

This, in turn, can give you crucial information regarding how best to structure the downstream analyses so as to maximize statistical power.
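
As a rough sketch of these two views, the snippet below builds a sample-to-sample Euclidean distance matrix and a PCA (via SVD) on simulated data. Everything in it is assumed for illustration: the 12 x 200 matrix, the two batches, and the size of the batch shift. Real RNA-seq counts would first need normalization and a variance-stabilizing or log transform.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated log-expression: 12 samples x 200 genes, two batches of 6 (made-up data).
batch = np.array([0] * 6 + [1] * 6)
expr = rng.normal(size=(12, 200))
expr[batch == 1, :50] += 4.0  # assumed batch effect on the first 50 genes

# Pairwise Euclidean distances between samples.
diff = expr[:, None, :] - expr[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))

# PCA via SVD of the gene-centered matrix.
centered = expr - expr.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = u * s                            # sample coordinates on each PC
var_explained = s ** 2 / (s ** 2).sum()   # fraction of variance per PC

# Compare distances within a batch against distances between batches.
same_batch = batch[:, None] == batch[None, :]
off_diagonal = ~np.eye(len(batch), dtype=bool)
within = dist[same_batch & off_diagonal].mean()
between = dist[~same_batch].mean()

print(f"mean distance within batch: {within:.1f}, between batches: {between:.1f}")
print(f"PC1 explains {var_explained[0]:.0%} of the variance")
print("PC1 scores:", np.round(scores[:, 0], 1))
```

When samples separate along PC1 by batch (or by patient) rather than by the condition of interest, that is a hint that the downstream model needs to account for that variable.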

Stratified Analysis

Generally, a stratified analysis is best when A. the differences between samples in a group are large enough to make direct comparison difficult, AND B. there is no good way to group the samples to partial out variance. For example, consider 50 paired tumor-normal samples. If each pair was processed on a different platform, by a different experimenter at a different institution, probably your best bet is to conduct a stratified analysis, because you couldn't put more than 6 or 7 samples together anyway - they are all on different platforms.

Control for batch as a covariate in a GLM framework

Generally, this approach performs well when all of your samples can be annotated with multiple covariates but aren't separated into highly disjunct groups like above. Consider 50 RNA-seq samples, all of which can be labelled as animal 1, 2, 3, or 4, all having the same time points, all done by the same experimenter on the same one or two platforms.
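
To make the covariate idea concrete, here is a minimal sketch for a single gene using ordinary least squares in NumPy. The data, effect sizes, and batch shifts are invented, and a plain linear model stands in for whatever GLM a dedicated RNA-seq package would fit with a design along the lines of `~ batch + condition`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up design: 4 batches x 12 samples, treatment balanced within each batch.
n_batches, per_batch = 4, 12
n = n_batches * per_batch
batch = np.repeat(np.arange(n_batches), per_batch)
treatment = np.tile([0, 1], n // 2)

true_effect = 2.0                              # assumed treatment effect
batch_shift = np.array([0.0, 1.5, -1.0, 3.0])  # assumed batch offsets
y = 5.0 + true_effect * treatment + batch_shift[batch] + rng.normal(scale=0.3, size=n)

def fit(design, response):
    """Least-squares fit; returns coefficients and residual standard deviation."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    resid = response - design @ coef
    resid_sd = np.sqrt((resid ** 2).sum() / (len(response) - design.shape[1]))
    return coef, resid_sd

intercept = np.ones(n)
batch_dummies = (batch[:, None] == np.arange(1, n_batches)).astype(float)

# Model without batch versus model with batch as a covariate.
coef_naive, sd_naive = fit(np.column_stack([intercept, treatment]), y)
coef_adj, sd_adj = fit(np.column_stack([intercept, treatment, batch_dummies]), y)

print(f"treatment estimate (with batch): {coef_adj[1]:.2f}")
print(f"residual SD without batch: {sd_naive:.2f}, with batch: {sd_adj:.2f}")
```

In a balanced layout like this one, the treatment estimate barely changes when batch is added; what changes is the residual variance, and therefore the power to detect the effect. In an unbalanced layout the estimate itself can be biased if batch is left out.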

Use of a generalized linear mixed model

A GLMM might be best in a situation between these two. Say you have a few different batches, maybe 4, with 12 samples in each, and you have both cases and controls and all time points in every batch. Here you could use a GLMM to control for variation within each batch. Because mixed models treat the batch effect as a random variable, they are sometimes better suited to controlling for variation across samples grouped into batches.
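
One common way to write such a model down, with a random intercept per batch, is sketched here. The notation is an assumption made for illustration: $y_{ij}$ is the response of sample $i$ in batch $j$, $x_{ij}$ its case/control or time covariate, $g$ the link function, and $u_j$ the batch-level random effect.

```latex
g\bigl(\mathbb{E}[\,y_{ij} \mid u_j\,]\bigr) = \beta_0 + \beta_1 x_{ij} + u_j,
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2)
```

With an identity link and Gaussian errors this reduces to an ordinary linear mixed model. The same structure can be used with patient instead of batch as the grouping factor when the repeated measures are the main source of correlation.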

Repeated measures

Finally, there are statistical tests devised specifically for repeated measures within an individual, which are worth looking into here given your description of several observations per individual. They bear resemblance to some of the frameworks above.
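
The simplest member of that family is the paired t-test, sketched below for one gene measured under low and high treatment in the same patients. The numbers are simulated under assumed values (15 patients, large between-patient spread, a treatment effect of 1.0); a full design with four time points would need something like a repeated-measures ANOVA or a mixed model instead.

```python
import math
import random
import statistics

random.seed(0)

n_patients = 15
# Made-up data: each patient has their own baseline level.
baseline = [random.gauss(0, 3) for _ in range(n_patients)]
low = [b + random.gauss(0, 0.5) for b in baseline]
high = [b + 1.0 + random.gauss(0, 0.5) for b in baseline]  # assumed effect of 1.0

# Paired analysis: work on within-patient differences.
diffs = [h - l for h, l in zip(high, low)]
t_paired = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n_patients))

# Unpaired (Welch-style) analysis: ignores which samples share a patient.
se_unpaired = math.sqrt(
    statistics.variance(high) / n_patients + statistics.variance(low) / n_patients
)
t_unpaired = (statistics.mean(high) - statistics.mean(low)) / se_unpaired

print(f"paired t = {t_paired:.2f}, unpaired t = {t_unpaired:.2f}")
```

The paired statistic is larger because the between-patient variation cancels out in the differences, which is the main reason to respect the pairing in both the batch layout and the model.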

You can also use principal components or another agnostically derived metric as a surrogate for batch, so long as the two tightly correlate. For more information on that, see this answer.
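
A quick way to check whether a principal component tracks batch closely enough is to correlate the two, as in this sketch. The simulated data and the 0.9 cutoff are arbitrary choices for illustration, not a recommended threshold.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up data: 16 samples x 100 genes, two batches of 8.
batch = np.array([0] * 8 + [1] * 8)
expr = rng.normal(size=(16, 100))
expr[batch == 1, :30] += 4.0  # assumed batch effect on 30 genes

# PC1 scores from the SVD of the gene-centered matrix.
centered = expr - expr.mean(axis=0)
u, s, _ = np.linalg.svd(centered, full_matrices=False)
pc1 = u[:, 0] * s[0]

# Correlation between PC1 and the batch label (the sign of a PC is arbitrary).
r = np.corrcoef(pc1, batch)[0, 1]
use_pc1_as_surrogate = abs(r) > 0.9  # illustrative cutoff only

print(f"|correlation| between PC1 and batch: {abs(r):.2f}")
print("use PC1 as batch surrogate:", use_pc1_as_surrogate)
```

If the correlation is weak, the component is capturing something other than batch, and using it as a covariate could remove biological signal instead.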