Question

Identifying statistically significant clusters between two populations of variable sizes


3 months ago

Stephen • 0

Hi Biostars,

I am a relatively new bioinformatician working on a project comparing the immune response in two different diseases using flow cytometry data. We have 4 time points per patient, with multiple panels measuring different immune markers, and we are using clustering as an unsupervised method of identifying cellular populations associated with one disease or the other.

One issue I am running into is that the population sizes of each disease can vary from one time point to the next. For example, at some time points the ratio of cells for each disease is close to 50/50, but at others the ratio can be as skewed as 80/20.

Our approach is to combine the flow panels for each disease, cluster, and then identify any clusters in which membership of one disease is statistically significant compared to the other. Currently we look at the standard deviation of the percentage of each disease across all clusters, and classify clusters that are more than 1 or 2 times the standard deviation as significant and highly significant, respectively. Is this a valid approach, or are there other suggestions? Below are two examples of sample data to illustrate my point (red is >1.96xSD and yellow is >1xSD):

Example 1

Example 2

immune-response flow-cytometry • 374 views

updated 3 months ago by LChart 4.7k • written 3 months ago by Stephen • 0

score 0 · Answer 1 · 2024-09-12

"One issue that I am running into is that for each time point there may be variability in the population sizes of each disease. For example at some time points the ratio of cells for each disease is close to 50/50 but some points the ratio can be as large as 80/20"

Do you mean that at some time points there could be 50 or 60 cells, and at other time points there could be 500 or 600? And when you average across time points and patients, do you average the ratios, or do you take the ratio of the total counts?
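To make the distinction concrete, here is a minimal Python sketch showing that the two ways of averaging can disagree when sample sizes differ. The counts and variable names are hypothetical and not taken from your data.

```python
# Hypothetical (cells_in_cluster, total_cells) pairs for one disease;
# the counts are invented purely for illustration.
samples = [(30, 60), (100, 500), (5, 50)]

# Option 1: average the per-sample ratios (every sample weighted equally).
mean_of_ratios = sum(n / total for n, total in samples) / len(samples)

# Option 2: ratio of the pooled counts (every cell weighted equally).
ratio_of_totals = sum(n for n, _ in samples) / sum(total for _, total in samples)

print(f"mean of ratios:  {mean_of_ratios:.3f}")
print(f"ratio of totals: {ratio_of_totals:.3f}")
```

In this made-up case the small 60-cell sample pulls the mean of ratios well above the pooled ratio, so which one you report matters.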

As you point out, the uncertainty of the ratio shrinks as the number of cells grows, and there is certainly patient-to-patient variability that is not properly partitioned using this approach.
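As a rough illustration of the first point, the sketch below computes the plain binomial standard error of a proportion at two sample sizes. It assumes cells are independent draws and deliberately ignores patient-to-patient variability, so treat it as a lower bound on the real uncertainty; the counts and the `proportion_se` helper are hypothetical.

```python
import math

def proportion_se(n_in_cluster, n_total):
    """Binomial standard error of the observed proportion n_in_cluster / n_total."""
    p = n_in_cluster / n_total
    return math.sqrt(p * (1 - p) / n_total)

# Same observed proportion (50%), ten times as many cells (hypothetical counts).
se_small = proportion_se(30, 60)
se_large = proportion_se(300, 600)

print(f"SE with  60 cells: {se_small:.3f}")
print(f"SE with 600 cells: {se_large:.3f}")
```

Ten times as many cells shrinks the standard error by a factor of sqrt(10), which is why a single SD threshold applied across samples of very different sizes is hard to interpret.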

I'd strongly suggest involving a statistician at your institution. One place to start would be a Bayesian binomial model, where the number of observed cells looks like

n_{ij} ~ Binom(N_{ij}, p_{ij})

logit(p_{ij}) = patient_i + timepoint_j + disease_i + covariates + err_{ij}
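To show what this notation means in practice, here is a small Python sketch that simulates counts from the binomial model, with the linear terms living on the log-odds scale and mapped to a probability through the inverse logit. It is not a fitting procedure, it omits the covariates and err_{ij} terms, and every effect size, patient label, and helper name is an invented assumption.

```python
import math
import random

def inv_logit(x):
    """Map a value on the log-odds scale back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def simulate_count(n_total, patient, timepoint, disease, rng):
    """Draw n ~ Binom(n_total, p) with logit(p) = patient + timepoint + disease."""
    p = inv_logit(patient + timepoint + disease)
    # A binomial draw is the sum of n_total independent Bernoulli(p) trials.
    return sum(rng.random() < p for _ in range(n_total))

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical effects on the log-odds scale (all values invented).
patient_effects = {"P1": -0.3, "P2": 0.2}
timepoint_effects = {"T1": 0.0, "T2": 0.4}
disease_of = {"P1": "A", "P2": "B"}  # each patient has exactly one disease
disease_effects = {"A": 0.0, "B": 0.8}

for patient, a in patient_effects.items():
    for tp, b in timepoint_effects.items():
        n_total = 500
        d = disease_effects[disease_of[patient]]
        n = simulate_count(n_total, a, b, d, rng)
        print(patient, tp, disease_of[patient], f"{n}/{n_total}")
```

Fitting the model runs this logic in reverse: given the observed n_{ij} and N_{ij}, estimate the effects, with the disease term being the quantity of interest.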

or a mixed model where

log(ratio_{ij}) ~ patient_i + timepoint_j + disease_i + (1|patient_i) + (1|1/N_{ij})
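One way to read the 1/N_{ij} term is that observations based on fewer cells should count for less. The sketch below prepares such a response and weight per patient and time point. It assumes that ratio_{ij} is the odds of a cell falling in the cluster, n / (N - n), which the formula above does not define, and it uses the common empirical-logit approximation with a 0.5 continuity correction. The rows and the `empirical_log_odds` helper are hypothetical.

```python
import math

def empirical_log_odds(n_in_cluster, n_total):
    """Return (log-odds, approximate variance) with a 0.5 continuity correction."""
    a = n_in_cluster + 0.5
    b = (n_total - n_in_cluster) + 0.5
    return math.log(a / b), 1.0 / a + 1.0 / b

# Hypothetical (patient, timepoint, cells_in_cluster, total_cells) rows.
rows = [("P1", "T1", 30, 60), ("P1", "T2", 300, 600), ("P2", "T1", 0, 80)]

for patient, timepoint, n, total in rows:
    y, var = empirical_log_odds(n, total)
    # The inverse variance grows with the cell count, so it can act as a weight.
    print(patient, timepoint, f"log-odds={y:+.3f}", f"weight={1.0 / var:.1f}")
```

The resulting log-odds and weights could then be passed to whatever mixed-model software you and your statistician settle on.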

Again, these are places to start a statistical discussion, not immediate solutions to your problem.