Longitudinal analysis of subpopulations: which approach is better?
0
0
Entering edit mode
7 months ago
Lluís R. ★ 1.2k

In a project from a sample they use flow cytometry to split the population in 4 fractions (CD8, CD4, DN, CD19). Later each fraction was bulk sequenced (RNA-seq).

Now I need to analyze this data and I was wondering which is the best approach to calculate the differential expressed genes (DEG).
I am doubting between two strategies to compare between time points.

The lab has been splitting the samples for each category and the compare it. In pseudocode:

# SE SummarizedExperiment
SE_CD4 <- SE[ , SE$cell_type == "CD4"]
model_CD4 <- model.matrix(~ timepoint, data = colData(SE_CD4))
# Compute DEG for CD4 using SE_CD4

SE_CD8 <- SE[ , SE$cell_type == "CD8"]
model_CD8 <- model.matrix(~ timepoint, data = colData(SE_CD8))
# Compute DEG for CD8 using SE_CD8

# The same for the other populations

But I was reading the guidelines for design matrix and I was considering if it is better to use something like:

SE$cell_time <- paste0(SE$cell_type, "_", SE$timepoint)
model <- model.matrix(~ cell_time, data = colData(SE))
contr <- limma::makeContrasts(CD4_T1_vs_T2 = CD4_T2 - CD4_T1, 
                                     CD8_T1_vs_T2 = CD8_T2 - CD4_T1, 
                                     levels = model)
# Compute the DEG for all the contrasts using SE and contr

I noticed some differences between these approaches. In both approaches I use the standard limma-voom approach using duplicateCorrelation to block per patient.

Is there a recommended approach?

differential-expression DEG model • 356 views
ADD COMMENT
0
Entering edit mode

Keeping data together is most powerful and most convenient as you have a single analysis object and a single count matrix. I would always do that unless one group has notably more intra-group variability compared to the rest so it could hurt comparisons between other groups. This has been asked many times before. See e.g. https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#if-i-have-multiple-groups-should-i-run-all-together-or-split-into-pairs-of-groups (deseq2, but same principle for limma/edgeR)

ADD REPLY
0
Entering edit mode

Many thanks for the helpful comment. Indeed, I read some of them, but I was not fully convinced and I missed the link FAQ of DESeq2. The reason provided there is very helpful: "Having a single dispersion parameter for each gene is usually sufficient for analyzing multi-group data, as the final dispersion value will incorporate the within-group variability across all groups."

ADD REPLY

Login before adding your answer.

Traffic: 1367 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6