Question

How to account for total number of cells when aggregating for pseudobulk?

0

7 months ago

mropri ▴ 160

Hi,

I have epithelial cells subset from several samples. When I aggregate the counts for this cell type within each sample to create a pseudobulk count matrix, should I divide by the total number of cells to account for differences in cell count between samples? I ask because if one sample has more cells, some genes would have higher aggregated counts than in a sample with fewer cells. That could make a gene appear differentially expressed between the two samples when the difference is just an artifact of how many cells each sample had. The tutorial I was following does not do this, but the question crossed my mind (https://hbctraining.github.io/scRNA-seq_online/lessons/pseudobulk_DESeq2_scrnaseq.html). I would appreciate any guidance. Thank you.

Pseudobulk • 996 views

ADD COMMENT • link updated 7 months ago by ATpoint 87k • written 7 months ago by mropri ▴ 160

score 1 · Answer 1 · 2024-07-03

1

7 months ago

ATpoint 87k

That is not a problem. You would still normalize the pseudobulk counts to correct for library size, as you would for a "normal" bulk sample. Remember that you aggregate raw counts to get pseudobulk raw counts, and from there you run the usual DESeq2/edgeR/other normalization.
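To make the point about library-size normalization concrete, here is a minimal Python sketch. It is not DESeq2 itself: `aggregate_pseudobulk` and `median_of_ratios_size_factors` are hypothetical helper names, and the second one is a plain NumPy re-implementation in the style of the median-of-ratios size factors that DESeq2 uses. The toy data are made up so that sample B has three times as many cells as sample A but the same per-cell profile.

```python
import numpy as np

def aggregate_pseudobulk(counts, sample_ids):
    """Sum raw per-cell counts (cells x genes) into one row per sample."""
    counts = np.asarray(counts)
    sample_ids = np.asarray(sample_ids)
    samples = np.unique(sample_ids)  # sorted unique sample labels
    pseudobulk = np.vstack([counts[sample_ids == s].sum(axis=0) for s in samples])
    return samples, pseudobulk

def median_of_ratios_size_factors(pseudobulk):
    """Per-sample size factors, median-of-ratios style (illustrative sketch)."""
    pb = np.asarray(pseudobulk, dtype=float)
    expressed = (pb > 0).all(axis=0)           # genes with nonzero counts in every sample
    log_counts = np.log(pb[:, expressed])
    log_geo_means = log_counts.mean(axis=0)    # log geometric mean of each gene
    return np.exp(np.median(log_counts - log_geo_means, axis=1))

# Toy data (assumption): identical per-cell profile, 10 cells in A, 30 cells in B.
profile = np.array([2, 5, 1, 10])
counts = np.tile(profile, (40, 1))
sample_ids = np.array(["A"] * 10 + ["B"] * 30)

samples, pb = aggregate_pseudobulk(counts, sample_ids)
sf = median_of_ratios_size_factors(pb)
normalized = pb / sf[:, None]                  # raw pseudobulk counts / size factor
```

In this toy case the raw pseudobulk counts for B are three times those for A, the size factors differ by the same factor of three, and the normalized counts are identical, so the extra cells alone do not create a fold change.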

Only if you have very few cells for some pseudobulk groups, for example 10 while all the others are in the hundreds, might you want to exclude those groups from the analysis outright because of their low information content.
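A check like this can be sketched in a few lines of Python. The function name `groups_passing_min_cells` and the default threshold of 30 cells are assumptions chosen only for illustration; the right cutoff depends on the data.

```python
from collections import Counter

def groups_passing_min_cells(sample_ids, min_cells=30):
    """Split pseudobulk groups into kept/dropped by number of contributing cells.

    min_cells is a hypothetical threshold, not a recommended value.
    """
    n_cells = Counter(sample_ids)
    kept = {s: n for s, n in n_cells.items() if n >= min_cells}
    dropped = {s: n for s, n in n_cells.items() if n < min_cells}
    return kept, dropped

# Toy labels (assumption): one group with 10 cells, the others with hundreds or more.
sample_ids = ["s1"] * 10 + ["s2"] * 500 + ["s3"] * 2000
kept, dropped = groups_passing_min_cells(sample_ids, min_cells=30)
```

Here `dropped` contains only `s1`, which would then be left out before aggregating and running the DE analysis.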

ADD COMMENT • link 7 months ago by ATpoint 87k

0

Hi,

Thank you for the clarification. So you are saying I do not need to divide by the total cell count before aggregating, because when I take the aggregated raw counts through the DESeq2 pipeline it will normalize the counts with respect to library size (or sequencing depth) as well. Also, what is the cutoff for "very few cells", or for a difference in cell numbers? For example, I have some samples with cell counts in the 500s, 600s, and 700s, while others are in the two or three thousands. I would think these cell counts should not be a problem. I appreciate your help; overall this makes much more sense.

ADD REPLY • link 7 months ago by mropri ▴ 160

2

There are no hard-and-fast numbers, as it depends on the technology and sequencing depth you have, but in my experience it tends to be okay once you get into the few-dozen-cells range.

ADD REPLY • link 7 months ago by jared.andrews07 ★ 18k

0

Thank you! I am checking all my samples, but I just wanted your opinion, which is very much appreciated.

ADD REPLY • link 7 months ago by mropri ▴ 160

0

Agreed. In my head the cutoff is always 50 cells (there is no basis for this, it's just in my head). The issue in single-cell data is that you often miss many expressed genes in any given cell. Hence, you want a good number of cells in the pseudobulk so that expressed genes aren't missed due to dropouts. I agree with Jared that a few dozen cells usually give reasonable results. If the discrepancy is big, like 50 vs 2000, I sometimes simply subsample the big one, run the DE analysis, repeat that 1000 times, and take the average of the logFCs and p-values. I'm not sure whether statisticians like that, but it gives me a feel for how robust it all is.
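The subsample-and-repeat idea can be sketched roughly as follows in Python. This is a simplified illustration, not the actual workflow described: `log2fc_cpm` is a hypothetical stand-in for a real DE test (it only computes a log2 fold change of counts-per-million and returns no p-values), the comparison is one small group against one big group, and the Poisson toy data and rates are invented.

```python
import numpy as np

def log2fc_cpm(pb_a, pb_b, pseudocount=1.0):
    """Stand-in for a real DE test: log2 fold change of CPM (B vs A)."""
    cpm_a = pb_a / pb_a.sum() * 1e6
    cpm_b = pb_b / pb_b.sum() * 1e6
    return np.log2((cpm_b + pseudocount) / (cpm_a + pseudocount))

def subsampled_log2fc(counts_small, counts_big, n_repeats=1000, seed=0):
    """Downsample the big group to the small group's cell number, repeat, summarise."""
    rng = np.random.default_rng(seed)
    n_small = counts_small.shape[0]
    pb_small = counts_small.sum(axis=0)        # pseudobulk of the small group
    results = np.empty((n_repeats, counts_small.shape[1]))
    for i in range(n_repeats):
        # Draw as many cells from the big group as the small group has.
        idx = rng.choice(counts_big.shape[0], size=n_small, replace=False)
        results[i] = log2fc_cpm(pb_small, counts_big[idx].sum(axis=0))
    return results.mean(axis=0), results.std(axis=0)

# Toy data (assumption): 50 vs 2000 cells drawn from the same per-cell rates.
rng = np.random.default_rng(1)
rates = np.array([1.0, 4.0, 0.5, 8.0])
small = rng.poisson(rates, size=(50, 4))
big = rng.poisson(rates, size=(2000, 4))
mean_lfc, sd_lfc = subsampled_log2fc(small, big, n_repeats=200)
```

The spread across repeats (`sd_lfc`) is what gives the "feel" for robustness: genes whose fold change swings widely between subsamples are the ones to treat with caution. In a real analysis the stand-in would be replaced by a full DESeq2 or edgeR run on each subsample.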

ADD REPLY • link 7 months ago by ATpoint 87k