Hi,
I have epithelial cells subsetted for different samples. When I aggregate the counts for this cell type for each sample to create a pseudobulk counts matrix, should I be dividing by the total number of cells to account for the differences in cell count between samples? The reason I ask is that if one sample had a higher cell number, then some genes, when aggregated, would have higher counts relative to another sample that had a lower total cell number. This would then suggest that the gene is differentially expressed between those two samples when that is just an artifact of the total cell count each sample had. I was following this tutorial and they do not do this, but it crossed my mind (https://hbctraining.github.io/scRNA-seq_online/lessons/pseudobulk_DESeq2_scrnaseq.html). I would appreciate any guidance. Thank you
Hi,
Thank you for the clarification. So you are saying I do not need to divide by the total cell count before aggregating, because when I take the aggregated raw counts through the DESeq2 pipeline, it will normalize the counts with respect to library size (or sequencing depth) as well. Also, what is the cutoff for very few cells or for differences in cell numbers? For example, I have some samples with cell counts in the 500s, 600s, and 700s, while others are in the 2,000s or 3,000s. I would think these cell counts should not be a problem. I appreciate your help, and overall this makes much more sense.
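To make the library-size point concrete, here is a minimal Python sketch (not DESeq2 itself, only the median-of-ratios idea behind its default size factors). The helper names `pseudobulk` and `size_factors` and the toy counts are assumptions made up for illustration. Two samples have identical per-cell expression but 500 versus 2000 cells; after dividing the summed raw counts by the estimated size factors, the two profiles match, so the cell-number difference alone does not look like differential expression.

```python
import math
from statistics import median

def pseudobulk(cells):
    """Sum raw counts gene-wise over all cells of one sample."""
    return [sum(gene_counts) for gene_counts in zip(*cells)]

def size_factors(samples):
    """Median-of-ratios size factors, one per sample (illustrative only)."""
    n_genes = len(samples[0])
    # Only genes with a nonzero count in every sample have a defined log ratio.
    usable = [g for g in range(n_genes) if all(s[g] > 0 for s in samples)]
    # Log of the per-gene geometric mean across samples.
    log_geo_mean = {
        g: sum(math.log(s[g]) for s in samples) / len(samples) for g in usable
    }
    # Each sample's factor is the median ratio of its counts to that reference.
    return [
        math.exp(median(math.log(s[g]) - log_geo_mean[g] for g in usable))
        for s in samples
    ]

# Toy data (hypothetical): every cell has the same counts for three genes.
cell = [2, 4, 1]
sample_a = pseudobulk([cell] * 500)    # 500 cells
sample_b = pseudobulk([cell] * 2000)   # 2000 cells

sf = size_factors([sample_a, sample_b])
normalized = [[c / f for c in s] for s, f in zip((sample_a, sample_b), sf)]

print(sample_a, sample_b)  # raw pseudobulk counts differ fourfold
print(sf)                  # size factors absorb that difference
print(normalized)          # normalized profiles agree
```

Real data are noisier than this toy case, and very small cell numbers still leave genes undetected, which normalization cannot fix.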
There are no hard-and-fast numbers, as it depends on the technology and sequencing depth you have, but generally once you have at least a few dozen cells per sample, it tends to be okay in my experience.
Thank you! I am checking all my samples, but I just wanted your opinion, which is very much appreciated.
Agreed. In my head the cutoff is always 50 cells (no basis for this, it's just my head). The issue in single-cell is that per cell you often miss many expressed genes. Hence, you want a good number of cells in the pseudobulk so that genes that are expressed aren't missed because of dropouts. I agree with Jared that usually a few dozen cells give reasonable results. If the discrepancy is big, like 50 vs 2000 cells, I sometimes simply subsample the big one down to the size of the small one, do the DE analysis, repeat that 1000 times, and take the average of the logFCs and p-values. Not sure whether statisticians like that, but it gives me a feel for how robust it all is.
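The subsampling idea could be sketched roughly as follows in Python. This is only an illustration under loud assumptions: the function names are hypothetical, the "DE analysis" is replaced by a plain log2 fold change on counts-per-million with a pseudocount (not a DESeq2 test, and no p-values are computed), and the counts are random toy data. The default of 100 repeats is also just a choice to keep the example fast.

```python
import math
import random

def pseudobulk(cells):
    """Sum raw counts gene-wise over all cells of one sample."""
    return [sum(gene_counts) for gene_counts in zip(*cells)]

def log2_fold_change(bulk_a, bulk_b, pseudocount=1.0):
    """Per-gene log2(A/B) on counts-per-million; a stand-in for a real DE test."""
    total_a, total_b = sum(bulk_a), sum(bulk_b)
    return [
        math.log2((a / total_a * 1e6 + pseudocount) /
                  (b / total_b * 1e6 + pseudocount))
        for a, b in zip(bulk_a, bulk_b)
    ]

def subsampled_logfc(small, large, n_repeats=100, seed=0):
    """Average log2FC (large vs small) over repeated subsamples of `large`.

    Each repeat draws len(small) cells from `large` without replacement,
    so both pseudobulks are built from the same number of cells.
    """
    rng = random.Random(seed)
    bulk_small = pseudobulk(small)
    sums = [0.0] * len(bulk_small)
    for _ in range(n_repeats):
        subset = rng.sample(large, len(small))
        lfc = log2_fold_change(pseudobulk(subset), bulk_small)
        sums = [s + x for s, x in zip(sums, lfc)]
    return [s / n_repeats for s in sums]

# Toy data (hypothetical): 50 cells vs 2000 cells, three genes each.
data_rng = random.Random(42)
small_sample = [[data_rng.randint(0, 3) for _ in range(3)] for _ in range(50)]
large_sample = [[data_rng.randint(0, 3) for _ in range(3)] for _ in range(2000)]

mean_lfc = subsampled_logfc(small_sample, large_sample)
print(mean_lfc)
```

Looking at the spread of the per-repeat values, not only their mean, would give a more direct sense of robustness.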