Question

Longitudinal analysis of subpopulations: which approach is better?

0

Entering edit mode

15 months ago

Lluís R. ★ 1.2k

In a project from a sample they use flow cytometry to split the population in 4 fractions (CD8, CD4, DN, CD19). Later each fraction was bulk sequenced (RNA-seq).

Now I need to analyze this data and I was wondering which is the best approach to calculate the differential expressed genes (DEG).
I am doubting between two strategies to compare between time points.

The lab has been splitting the samples for each category and the compare it. In pseudocode:

# SE SummarizedExperiment
SE_CD4 <- SE[ , SE$cell_type == "CD4"]
model_CD4 <- model.matrix(~ timepoint, data = colData(SE_CD4))
# Compute DEG for CD4 using SE_CD4

SE_CD8 <- SE[ , SE$cell_type == "CD8"]
model_CD8 <- model.matrix(~ timepoint, data = colData(SE_CD8))
# Compute DEG for CD8 using SE_CD8

# The same for the other populations

But I was reading the guidelines for design matrix and I was considering if it is better to use something like:

SE$cell_time <- paste0(SE$cell_type, "_", SE$timepoint)
model <- model.matrix(~ cell_time, data = colData(SE))
contr <- limma::makeContrasts(CD4_T1_vs_T2 = CD4_T2 - CD4_T1, 
                                     CD8_T1_vs_T2 = CD8_T2 - CD4_T1, 
                                     levels = model)
# Compute the DEG for all the contrasts using SE and contr

I noticed some differences between these approaches. In both approaches I use the standard limma-voom approach using duplicateCorrelation to block per patient.

Is there a recommended approach?

differential-expression DEG model • 729 views

ADD COMMENT • link 15 months ago by Lluís R. ★ 1.2k

0

Entering edit mode

Keeping data together is most powerful and most convenient as you have a single analysis object and a single count matrix. I would always do that unless one group has notably more intra-group variability compared to the rest so it could hurt comparisons between other groups. This has been asked many times before. See e.g. https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#if-i-have-multiple-groups-should-i-run-all-together-or-split-into-pairs-of-groups (deseq2, but same principle for limma/edgeR)

ADD REPLY • link 15 months ago by ATpoint 88k

0

Entering edit mode

Many thanks for the helpful comment. Indeed, I read some of them, but I was not fully convinced and I missed the link FAQ of DESeq2. The reason provided there is very helpful: "Having a single dispersion parameter for each gene is usually sufficient for analyzing multi-group data, as the final dispersion value will incorporate the within-group variability across all groups."

ADD REPLY • link 15 months ago by Lluís R. ★ 1.2k