In a project from a sample they use flow cytometry to split the population in 4 fractions (CD8, CD4, DN, CD19). Later each fraction was bulk sequenced (RNA-seq).
Now I need to analyze this data and I was wondering which is the best approach to calculate the differential expressed genes (DEG).
I am doubting between two strategies to compare between time points.
The lab has been splitting the samples for each category and the compare it. In pseudocode:
# SE SummarizedExperiment
SE_CD4 <- SE[ , SE$cell_type == "CD4"]
model_CD4 <- model.matrix(~ timepoint, data = colData(SE_CD4))
# Compute DEG for CD4 using SE_CD4
SE_CD8 <- SE[ , SE$cell_type == "CD8"]
model_CD8 <- model.matrix(~ timepoint, data = colData(SE_CD8))
# Compute DEG for CD8 using SE_CD8
# The same for the other populations
But I was reading the guidelines for design matrix and I was considering if it is better to use something like:
SE$cell_time <- paste0(SE$cell_type, "_", SE$timepoint)
model <- model.matrix(~ cell_time, data = colData(SE))
contr <- limma::makeContrasts(CD4_T1_vs_T2 = CD4_T2 - CD4_T1,
CD8_T1_vs_T2 = CD8_T2 - CD4_T1,
levels = model)
# Compute the DEG for all the contrasts using SE and contr
I noticed some differences between these approaches. In both approaches I use the standard limma-voom approach using duplicateCorrelation
to block per patient.
Is there a recommended approach?
Keeping data together is most powerful and most convenient as you have a single analysis object and a single count matrix. I would always do that unless one group has notably more intra-group variability compared to the rest so it could hurt comparisons between other groups. This has been asked many times before. See e.g. https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#if-i-have-multiple-groups-should-i-run-all-together-or-split-into-pairs-of-groups (deseq2, but same principle for limma/edgeR)
Many thanks for the helpful comment. Indeed, I read some of them, but I was not fully convinced and I missed the link FAQ of DESeq2. The reason provided there is very helpful: "Having a single dispersion parameter for each gene is usually sufficient for analyzing multi-group data, as the final dispersion value will incorporate the within-group variability across all groups."