Question

scRNASeq equal number of cells across conditions

4.1 years ago

TriS ★ 4.8k

Hi all,

I am working with scRNAseq data from several conditions, and each condition has a very different number of cells, e.g. 1,000 vs. 2,000 vs. 5,000. Otherwise, it's the same experiment.

I am wondering whether it would be more prudent to downsample every condition to the lowest cell number, in this case 1,000, and then run the various analyses with Seurat. The reason is that in the first group there simply weren't many cells after collection, while in the other conditions we just had more. On a separate occasion we had the same issue, but it was caused by problems with flow sorting the single cells.

Here, when I visualize the three groups via UMAP, the last one appears to have far more cells in every cluster, but that is just due to the different number of cells we started with.

What are your thoughts?

Thank you :)

-- EDIT: added extra info to address ATpoint's comment

In the image below you can see what I mean:

[image: the three groups before subsampling]

The green group is the one with ~5k cells, the red has ~2,500, and the blue has 1k. If I look at the abundance of immune cells in the various groups, the green group will show the highest abundance simply because it started with a (much) higher number of cells.

If you now look at the next image:

[image: the three groups after subsampling to the same number of cells per group]

The differences in abundance that you now see are not driven by the initial number of cells. Subsampling was done by randomly selecting the same number of cells from each group.
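A minimal sketch of that kind of random subsampling, in plain Python rather than Seurat. The function name `downsample_per_group`, the group labels, and the cell counts are made up for illustration; this is not the code actually used for the plots above.

```python
import random
from collections import Counter, defaultdict


def downsample_per_group(cell_groups, n_per_group=None, seed=0):
    """Randomly keep the same number of cells from every group.

    cell_groups: dict mapping a cell id to its group label (hypothetical input).
    n_per_group: cells to keep per group; defaults to the smallest group size.
    """
    by_group = defaultdict(list)
    for cell, group in cell_groups.items():
        by_group[group].append(cell)

    if n_per_group is None:
        n_per_group = min(len(cells) for cells in by_group.values())

    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    kept = []
    for group in sorted(by_group):
        # random.sample draws without replacement; it raises ValueError
        # if n_per_group is larger than the group itself.
        kept.extend(rng.sample(sorted(by_group[group]), n_per_group))
    return kept


# Toy data with invented group sizes (1,000 / 2,500 / 5,000 cells).
cell_groups = {}
for group, n_cells in [("blue", 1000), ("red", 2500), ("green", 5000)]:
    for i in range(n_cells):
        cell_groups[f"{group}_{i}"] = group

kept = downsample_per_group(cell_groups)
per_group = Counter(cell_groups[cell] for cell in kept)
print(per_group)
```

Fixing the seed only makes one particular subsample reproducible; a different seed keeps a different set of cells, so it can be worth checking that the picture is stable across a few seeds.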

As far as gene expression goes, it will not be affected by the higher/lower number of cells (in my experiment), so I agree that downsampling would not be a good idea for that. However, I work with a lot of immunologists who want to know the frequency of cells positive for x, y, z across samples, so wouldn't downsampling to the same number be helpful in this case?

scRNAseq seurat • 2.5k views

ADD COMMENT • link updated 4.1 years ago by ATpoint 88k • written 4.1 years ago by TriS ★ 4.8k

Subsampling always comes with a loss of information. I would only do it if there is evidence that unequal cell numbers actually cause problems. Be sure to perform a proper integration, and then check whether the clusters show evidence of being driven by the cell number differences. If they don't, don't subsample. This is very general gut-feeling advice, so take it with a grain of salt. Can you add some plots to illustrate the problem that you think you see, and give some idea of which code you ran?

ADD REPLY • link 4.1 years ago by ATpoint 88k

Thank you, yes, I added some extra info above to help. The main point is that immunologists always want to know the abundance of cell type x, y, z across groups, and when the groups start with different numbers of cells, the raw counts will not reflect the correct proportions across samples. Maybe something like a cumulative distribution (ECDF) could also help by showing the relative frequency rather than the overall abundance?
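To make the "relative frequency rather than overall abundance" idea concrete, here is a rough sketch in plain Python (not Seurat). The helper `cell_type_fractions`, the cell type names, and all counts are invented for illustration. Each cell type count is divided by the total number of cells in its own group, so groups of different size become comparable without discarding any cells.

```python
from collections import Counter


def cell_type_fractions(cells):
    """Return {(group, cell_type): fraction of that group's cells}.

    cells: iterable of (group, cell_type) pairs, one pair per cell
    (a hypothetical input format chosen for this sketch).
    """
    counts = Counter()
    totals = Counter()
    for group, cell_type in cells:
        counts[(group, cell_type)] += 1
        totals[group] += 1
    return {key: n / totals[key[0]] for key, n in counts.items()}


# Invented example: the green group has 5x more cells than the blue group,
# but the same cell type composition.
cells = (
    [("blue", "T cell")] * 300
    + [("blue", "B cell")] * 700
    + [("green", "T cell")] * 1500
    + [("green", "B cell")] * 3500
)

fractions = cell_type_fractions(cells)
for key in sorted(fractions):
    print(key, round(fractions[key], 3))
```

In this toy case the raw T cell counts differ fivefold (300 vs. 1,500), while the per-group fractions are the same in both groups.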

ADD REPLY • link 4.1 years ago by TriS ★ 4.8k