Hi Biostars,
I am a relatively new bioinformatician working on a project comparing the immune response between two diseases using flow cytometry data. We have 4 time points per patient, with multiple panels measuring different immune markers, and we are using clustering as an unsupervised method of identifying cell populations associated with one disease or the other.

One issue I am running into is that the number of cells from each disease varies between time points. For example, at some time points the ratio of cells between the two diseases is close to 50/50, but at others it can be as skewed as 80/20.

Our approach is to combine the flow panels for each disease, cluster, and then check whether there are any clusters in which one disease is significantly over-represented compared to the other. To call significance, we currently take the standard deviation of the percentage of each disease across all clusters and classify clusters that exceed 1× SD as significant and those that exceed roughly 2× SD (1.96× in the examples below) as highly significant. Is this a valid approach, or are there other methods you would suggest? Below are two examples of sample data to illustrate my point (red is >1.96×SD and yellow is >1×SD):
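Before the examples, in case it helps, here is a minimal Python sketch of the rule as described above. It is an illustration rather than our actual pipeline: the function name `flag_clusters`, the cluster counts, and the reading of the rule as "distance of each cluster's disease-A percentage from the mean across clusters, measured in SD units" are all assumptions made for this example.

```python
from statistics import mean, stdev


def flag_clusters(counts, mild=1.0, strong=1.96):
    """Flag clusters by how far their disease-A percentage sits from the mean.

    counts maps a cluster id to (cells from disease A, cells from disease B).
    Returns (percentages, flags), where flags are "red", "yellow" or "none".
    The thresholds mirror the colour coding in the post (hypothetical helper).
    """
    # Percentage of disease-A cells in each cluster.
    pct_a = {c: 100.0 * a / (a + b) for c, (a, b) in counts.items()}

    values = list(pct_a.values())
    mu = mean(values)
    sd = stdev(values)  # sample SD across clusters; needs at least 2 clusters

    flags = {}
    for cluster, pct in pct_a.items():
        # If every cluster has the same percentage, nothing can be flagged.
        z = abs(pct - mu) / sd if sd > 0 else 0.0
        if z > strong:
            flags[cluster] = "red"
        elif z > mild:
            flags[cluster] = "yellow"
        else:
            flags[cluster] = "none"
    return pct_a, flags


# Made-up counts for one time point, purely for illustration.
example_counts = {
    "c1": (50, 50),
    "c2": (52, 48),
    "c3": (48, 52),
    "c4": (50, 50),
    "c5": (90, 10),
}
percentages, flags = flag_clusters(example_counts)
for cluster in example_counts:
    print(cluster, round(percentages[cluster], 1), flags[cluster])
```

Note that, as written, this rule only looks at percentages, so it uses neither the number of cells in each cluster nor the overall disease ratio at that time point.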