Hello everyone, I'm working with RNA-seq gene expression data derived from multiple tissue samples collected from different patients. My primary goal is to identify differentially expressed genes (DEGs) while minimizing the confounding effects of tissue of origin on the results.
Here’s a brief overview of my approach so far:
- Data Preparation with edgeR:
- I converted the raw counts into CPM values using edgeR to normalize between samples, accounting for library size.
- Filtering was applied using filterByExpr to retain relevant genes.
- Normalization was conducted using the TMM method:
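library(edgeR)
# counts is assumed to be a DGEList (keep.lib.sizes only applies to DGEList subsetting)
keep <- filterByExpr(counts)
counts <- counts[keep, , keep.lib.sizes = FALSE]
counts <- normLibSizes(counts, method = "TMM")
counts_cpm <- cpm(counts, log = TRUE)  # log2-CPM using the TMM-adjusted library sizes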
- Modeling with variancePartition:
To account for patient and tissue variability, I applied a linear mixed model using the variancePartition package. My formula models the contribution of both patient and tissue to the gene expression variation:
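library(variancePartition)
# df is assumed to be the sample metadata, with Tissue and Patient columns and rows matching the columns of counts_cpm
form <- ~ (1 | Tissue) + (1 | Patient)
vp_modelFit <- fitVarPartModel(counts_cpm, form, df)
vp_modelFit_res <- residuals(vp_modelFit)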
My understanding is that the residuals from this model should, in theory, represent gene expression values devoid of tissue- and patient-specific effects, potentially revealing the intrinsic cancer-related signals.
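Written out, the formula above corresponds roughly to the following per-gene model. This is only a sketch: the symbols are notation introduced here, not taken from the variancePartition documentation, and the residual expression assumes residuals() returns conditional residuals (observed minus fitted, with the estimated random effects included in the fit), as it does for lme4 fits.

```latex
% g indexes genes, i indexes samples; t(i) and p(i) are the tissue and patient of sample i
\begin{align}
y_{gi} &= \mu_g + a_{g,t(i)} + b_{g,p(i)} + \varepsilon_{gi}, \\
a_{g,t} &\sim \mathcal{N}\!\left(0, \sigma^2_{g,\mathrm{Tissue}}\right), \quad
b_{g,p} \sim \mathcal{N}\!\left(0, \sigma^2_{g,\mathrm{Patient}}\right), \quad
\varepsilon_{gi} \sim \mathcal{N}\!\left(0, \sigma^2_{g,\varepsilon}\right), \\
r_{gi} &= y_{gi} - \hat{\mu}_g - \hat{a}_{g,t(i)} - \hat{b}_{g,p(i)}.
\end{align}
```

Here y is the log-CPM value and r is the residual that would be carried forward. The model contains no term for the condition of interest, and the per-gene intercept is subtracted along with the tissue and patient effects.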
Questions:
- Is this approach statistically sound for achieving my aim? Specifically, does this methodology appropriately remove the unwanted variation from tissue and patient sources while preserving biologically relevant signals?
- Any recommendations for improving the robustness of this approach, especially in terms of ensuring that intrinsic cancer-related signals are not inadvertently removed?
I appreciate any insights or suggestions you might have.
Can you show the metadata, to get an idea of which patients have which samples, tissues, etc.? Basically, the relevant y$samples, where y would be the DGEList.
Related: Any particular reason why you're using a random effect for both tissue and patient? I would certainly have treated tissue as a fixed effect, and probably patient as well.
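For illustration, treating tissue and patient as fixed effects could look roughly like the sketch below. It runs on simulated toy data, not the poster's data, and the Condition column (the comparison of interest) is a hypothetical name, as are the patient and tissue labels. Whether such a design is estimable depends on how patients, tissues and conditions are crossed, which is why the metadata matters.

```r
library(edgeR)

set.seed(1)
# Simulated toy metadata and counts, purely illustrative (assumption: 4 patients,
# each sampled in 2 tissues and 2 conditions, so the design below is full rank).
meta <- data.frame(
  Patient   = factor(rep(paste0("P", 1:4), each = 4)),
  Tissue    = factor(rep(rep(c("T1", "T2"), each = 2), times = 4)),
  Condition = factor(rep(c("normal", "tumor"), times = 8))  # hypothetical column
)
counts <- matrix(rnbinom(500 * 16, mu = 50, size = 5), nrow = 500,
                 dimnames = list(paste0("g", 1:500), paste0("s", 1:16)))
y <- DGEList(counts = counts, samples = meta)

# Patient and Tissue enter as fixed-effect blocking factors; Condition is tested.
design <- model.matrix(~ Patient + Tissue + Condition, data = y$samples)

keep <- filterByExpr(y, design)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- normLibSizes(y, method = "TMM")
y <- estimateDisp(y, design)

fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = "Conditiontumor")  # tumor vs normal, adjusted for the blocks
topTags(qlf)
```

With this setup the DE test is run on the counts directly, and tissue and patient are adjusted for inside the model rather than removed beforehand by taking residuals.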
Cross-posted to Bioconductor: https://support.bioconductor.org/p/9159654/