Question

pseudobulk differential expression design matrix

14 months ago

nhaus ▴ 420

Hi all,

I have the following situation and I just want to make sure that I understand everything correctly from a statistics point of view...

I am running a pseudobulk differential expression analysis with a treatment group and a control group. Each group has two replicates (i.e. Ctrl_1, Ctrl_2, Treat_1 and Treat_2). The replicates were processed in batches: replicate 1 in batch 1 and replicate 2 in batch 2. After summing the counts for one cell population of interest, we end up with metadata that essentially looks like this:

sample_id	group_id	batch
Ctrl_1	Ctrl	1
Ctrl_2	Ctrl	2
Treat_1	Treat	1
Treat_2	Treat	2

I am interested in comparing Treat vs Ctrl while adjusting for batch, so we build the model matrix with mm <- model.matrix(~ batch + group_id, data = mdata), which looks like this:

(Intercept)	batch2	group_idTreat
1	0	0
1	1	0
1	0	1
1	1	1
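
For reference, a minimal sketch that reproduces this model matrix from the metadata table above. The object names mdata and mm follow the post; design_rank and resid_df are illustrative names that are not part of the original question.

```r
# Sample-level metadata, copied from the table in the post
mdata <- data.frame(
  sample_id = c("Ctrl_1", "Ctrl_2", "Treat_1", "Treat_2"),
  group_id  = factor(c("Ctrl", "Ctrl", "Treat", "Treat"),
                     levels = c("Ctrl", "Treat")),
  batch     = factor(c(1, 2, 1, 2))
)

# Additive design: batch effect plus treatment effect
mm <- model.matrix(~ batch + group_id, data = mdata)
print(mm)

# Rank equals the number of columns, so every coefficient is estimable
design_rank <- qr(mm)$rank

# Four samples minus three coefficients leaves one residual degree of freedom
resid_df <- nrow(mm) - design_rank
```

The design is full rank, although with four samples and three coefficients only a single residual degree of freedom remains for estimating variability.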

This is all very straightforward.

Here is the part that confuses me slightly. We are using a method that classifies some cells from the Treat group as controls (because the experimental perturbation did not work properly). This means that we end up with new group_ids, namely Ctrl_like and Treat_like. I am still interested in comparing the expression of Treat_like vs Ctrl_like. Is my assumption correct that a standard pseudobulk differential expression analysis is now impossible, because one sample (e.g. Treat_1) can belong to two groups (Ctrl_like and Treat_like) simultaneously, and that it is therefore no longer possible to adjust for batch effects? This is what the metadata would look like:

sample_id	group_id	batch
Ctrl_1	Ctrl_like	1
Ctrl_1	Treat_like	1
Ctrl_2	Ctrl_like	2
Ctrl_2	Treat_like	2
Treat_1	Treat_like	1
Treat_1	Ctrl_like	1
Treat_2	Treat_like	2
Treat_2	Ctrl_like	2
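
As a sketch of what this table implies, the code below builds the eight-row metadata, with one pseudobulk profile per sample_id and group_id combination, and checks the rank of the original additive design on it. The names mdata2, pb_id, profiles_per_combo, mm2 and rank2 are hypothetical and do not appear in the post.

```r
# One row per (sample_id, group_id) pseudobulk profile, as in the table above
mdata2 <- data.frame(
  sample_id = rep(c("Ctrl_1", "Ctrl_2", "Treat_1", "Treat_2"), each = 2),
  group_id  = factor(c("Ctrl_like", "Treat_like", "Ctrl_like", "Treat_like",
                       "Treat_like", "Ctrl_like", "Treat_like", "Ctrl_like"),
                     levels = c("Ctrl_like", "Treat_like")),
  batch     = factor(c(1, 1, 2, 2, 1, 1, 2, 2))
)

# sample_id alone no longer identifies a profile, so build a unique label
mdata2$pb_id <- paste(mdata2$sample_id, mdata2$group_id, sep = ".")

# Every sample contributes exactly one profile to each group
profiles_per_combo <- table(mdata2$sample_id, mdata2$group_id)
print(profiles_per_combo)

# The additive design from before is still full rank on these eight profiles
mm2 <- model.matrix(~ batch + group_id, data = mdata2)
rank2 <- qr(mm2)$rank
```

This only shows that the batch and group coefficients remain algebraically estimable. It does not settle whether two profiles derived from the same sample may be treated as independent observations.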

Any insights on that matter are greatly appreciated!

pseudobulk single-cell scRNA-seq • 756 views

score 0 · Answer 1 · 2024-02-20

14 months ago

rpolicastro 13k

You should be fine because every sample_id and batch appears in every group_id. You would use the formula ~ group_id + sample_id + batch.

Will this account for the fact that some cells come from the same original sample? This seems like relevant information for a correct analysis.

ADD REPLY • link 14 months ago by nhaus ▴ 420

Also, I just tried a formula like this and got the following error: Design matrix not of full rank.

I assume that is because the design matrix has columns that are linearly dependent, i.e. the sample_id column already encodes the batch column. Is that correct?
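
The linear dependence can be checked directly. The sketch below rebuilds the eight-profile metadata (the object names pb, full, dep and reduced are hypothetical) and compares the number of columns with the rank of the design. The reduced formula at the end is an assumption about one common remedy, not something proposed in this thread.

```r
# Eight pseudobulk profiles: each sample appears once per group,
# and each sample belongs to exactly one batch
pb <- data.frame(
  sample_id = factor(rep(c("Ctrl_1", "Ctrl_2", "Treat_1", "Treat_2"), each = 2)),
  group_id  = factor(rep(c("Ctrl_like", "Treat_like"), times = 4)),
  batch     = factor(rep(c(1, 2, 1, 2), each = 2))
)

# The suggested formula: six columns but rank five
full <- model.matrix(~ group_id + sample_id + batch, data = pb)
print(c(columns = ncol(full), rank = qr(full)$rank))

# batch2 is exactly the sum of the indicators for the two batch-2 samples
dep <- all(full[, "batch2"] ==
           full[, "sample_idCtrl_2"] + full[, "sample_idTreat_2"])

# Assumption: dropping batch and keeping sample_id as a blocking factor
# restores full rank, since sample_id already absorbs batch differences
reduced <- model.matrix(~ group_id + sample_id, data = pb)
print(c(columns = ncol(reduced), rank = qr(reduced)$rank))
```

Because every sample sits in a single batch, batch is nested within sample_id. Any model that includes sample_id therefore cannot estimate a separate batch coefficient.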

ADD REPLY • link 14 months ago by nhaus ▴ 420