I am analyzing a single-cell RNA-seq dataset with:
- 4 patients
- 2 conditions per patient
- 2 cell types of interest per patient-condition combination
I want to perform pseudobulk differential expression analysis to examine:
- Condition effects (differences due to treatment/condition).
- Cell type effects (differences between the two cell types).
Approach in question
I plan to create 16 pseudobulk samples (4 patients × 2 conditions × 2 cell types) and run DESeq2 with these models:
- Condition effects
~ Patient_CellType_combination + Condition
- Cell type effects
~ Patient_Condition_combination + CellType
Each formula accounts for patient-specific effects by grouping the data accordingly:
- There are 8 Patient_CellType_combinations (4 patients × 2 cell types).
- There are 8 Patient_Condition_combinations (4 patients × 2 conditions).
Concerns
While this approach maximizes the use of single-cell data, I worry that pseudobulk samples within each patient-treatment-cell type combination are not true biological replicates. This could lead to inflated false positives because samples from the same patient and treatment may be more similar than expected in traditional bulk RNA-seq.
For example, in the first comparison (across conditions), Patient1-Cluster1-Condition1 may be more similar to Patient1-Cluster2-Condition1 than independent biological samples would be.
Question
Is this the correct approach for pseudobulk analysis?
You have paired data since it's all within each patient so make good use of that, it's statistically powerful. I would aggregate indeed by patient-celltype-condition and for the design I would make a new group
Celltype_Condition
and then~Celltype_Condition + Patient
. You can then contrast viaCelltype_Condition
as needed, both for between-celltype, or between-condition effects. Since a paired analysis you do not need to worry that the cells within patient are more correlated than between patient.If I use Celltype_Condition as a contrast, there would be four levels : CelltypeA_Control, CelltypeA_Treated, CelltypeB_Control, CelltypeB_Treated.
However, there is no comparison within these that enables one to see the overall effect of Celltype (while accounting for treatment), nor the overall effect of treatment (across both Celltypes).
Instead, all comparisons would be only between specific Celltype_Condition combinations.
I would test the celltype and treatment effect here via the averages, for example for the celltype effect as below. I think that's one way of doing this others might correct me if wrong: