Question

Seeking Advice on Using `edgeR` and `variancePartition` for RNA-seq Data with Multiple Tissues and Patients

4 months ago

h.moosavi57 • 0

Hello everyone, I'm working with RNA-seq gene expression data derived from multiple tissue samples collected from different patients. My primary goal is to identify differentially expressed genes (DEGs) while minimizing the confounding effects of tissue of origin on the results.

Here’s a brief overview of my approach so far:

Data Preparation with edgeR:

I've converted the raw count expression data into CPM values using edgeR to normalize between samples, accounting for library size.
Filtering was applied using filterByExpr to retain relevant genes.
Normalization was conducted using the TMM method.

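# counts is assumed to be a DGEList; keep.lib.sizes only applies to DGEList subsetting
keep <- filterByExpr(counts)

# Drop low-expressed genes and recompute the library sizes
counts <- counts[keep, , keep.lib.sizes=FALSE]

counts <- normLibSizes(counts, method = "TMM")

# log2-CPM based on the TMM-adjusted (effective) library sizes
counts_cpm <- cpm(counts, log = TRUE)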
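For anyone who wants to try these steps without the original data, here is a minimal self-contained sketch on simulated counts. The matrix `raw`, its dimensions, and the negative binomial parameters are made up for illustration; only the edgeR calls mirror the steps described above, and the DGEList is built explicitly because the subsetting with `keep.lib.sizes` needs one.

```r
library(edgeR)

set.seed(1)

# Simulated raw counts (hypothetical): 200 genes x 12 samples,
# with low, medium and high expression genes interleaved
n_genes   <- 200
n_samples <- 12
mu  <- rep(c(2, 50, 500), length.out = n_genes)
raw <- matrix(rnbinom(n_genes * n_samples, mu = mu, size = 5),
              nrow = n_genes,
              dimnames = list(paste0("gene", seq_len(n_genes)),
                              paste0("s", seq_len(n_samples))))

# Wrap the count matrix in a DGEList before filtering
y <- DGEList(counts = raw)

# Keep genes with enough counts, and recompute library sizes after filtering
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes = FALSE]

# TMM normalization factors, then log2-CPM on the effective library sizes
y <- normLibSizes(y, method = "TMM")
logcpm <- cpm(y, log = TRUE)

dim(logcpm)
```

With no group or design supplied, `filterByExpr` treats all samples as one group, so with real data it is probably worth passing the design or grouping factor.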

Modeling with variancePartition:

To account for patient and tissue variability, I applied a mixed linear model using the variancePartition package. My formula models the contribution of both patient and tissue to the gene expression variation:

form <- ~ (1 | Tissue) + (1 | Patient)

vp_modelFit <- fitVarPartModel(counts_cpm, form, df)

vp_modelFit_res <- residuals(vp_modelFit)

My understanding is that the residuals from this model should, in theory, represent gene expression values devoid of tissue- and patient-specific effects, potentially revealing the intrinsic cancer-related signals.
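One way to probe what such residuals retain is a toy simulation. The sketch below is not the variancePartition fit: it swaps in an ordinary fixed-effects `lm()` on a single simulated gene, and the sample layout, effect sizes, and the patient-level `Status` column are all hypothetical. With fixed effects the removal is exact; with random effects the estimates are shrunk, so the removal would only be partial.

```r
set.seed(42)

# Hypothetical layout: 6 patients, each sampled in 3 tissues
df <- expand.grid(Patient = paste0("P", 1:6),
                  Tissue  = c("A", "B", "C"),
                  stringsAsFactors = TRUE)

# Hypothetical label that is constant within a patient
df$Status <- factor(ifelse(df$Patient %in% c("P1", "P2", "P3"),
                           "case", "control"))

# One simulated gene (log-CPM scale) with tissue, patient and status effects
tissue_eff  <- c(A = 0, B = 2, C = -1)
patient_eff <- setNames(rnorm(6, sd = 0.5), levels(df$Patient))
expr <- unname(5 +
  tissue_eff[as.character(df$Tissue)] +
  patient_eff[as.character(df$Patient)] +
  1.5 * (df$Status == "case") +
  rnorm(nrow(df), sd = 0.3))

# Regress out tissue and patient, keep the residuals
fit <- lm(expr ~ Tissue + Patient, data = df)
res <- residuals(fit)

# Residual means within each tissue and each patient
tapply(res, df$Tissue, mean)
tapply(res, df$Patient, mean)

# Status difference before and after residualizing
tapply(expr, df$Status, mean)
tapply(res, df$Status, mean)
```

In this setup the residual means are numerically zero within every tissue and every patient, and therefore also within each `Status` group, because a patient-level label is a linear combination of the patient indicators. Whether that applies to the real data depends on whether the signal of interest varies within patients.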

Question:

Is this approach statistically sound for achieving my aim? Specifically, does this methodology appropriately remove the unwanted variation from tissue and patient sources while preserving biologically relevant signals?
Any recommendations for improving the robustness of this approach, especially in terms of ensuring that intrinsic cancer-related signals are not inadvertently removed?

I appreciate any insights or suggestions you might have.

linear-mixed edgeR modeling variancePartition • 792 views

updated 4 months ago by Gordon Smyth ★ 7.8k • written 4 months ago by h.moosavi57 • 0

Can you show the metadata, so we can see which patients contributed which samples and tissues? Basically, the relevant columns of y$samples, where y is the DGEList.

4 months ago by ATpoint 86k

Related: Any particular reason why you're using a random effect for both tissue and patient? I would have certainly treated tissue as a fixed effect, and probably patient as well.
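To make the fixed-effect suggestion concrete, here is a rough sketch of how one could check whether a variable of interest is still estimable once tissue and patient are in the design. The sample sheet is invented: `Status` (constant within a patient) and `Condition` (varying within a patient) are hypothetical columns, not taken from the original post.

```r
# Hypothetical sample sheet: 6 patients x 3 tissues, one sample each
df <- expand.grid(Patient = paste0("P", 1:6),
                  Tissue  = c("A", "B", "C"),
                  stringsAsFactors = TRUE)

# Hypothetical patient-level variable (same value for all samples of a patient)
df$Status <- factor(ifelse(df$Patient %in% c("P1", "P2", "P3"),
                           "case", "control"))

# Hypothetical within-patient variable: one tissue per patient is "tumor"
assigned <- setNames(rep(c("A", "B", "C"), 2), paste0("P", 1:6))
df$Condition <- factor(ifelse(
  as.character(df$Tissue) == assigned[as.character(df$Patient)],
  "tumor", "normal"))

# Fixed effects for tissue and patient plus the variable of interest
design_nested <- model.matrix(~ Tissue + Patient + Status, data = df)
design_within <- model.matrix(~ Tissue + Patient + Condition, data = df)

# A rank below the number of columns means some coefficient is not estimable
c(rank = qr(design_nested)$rank, columns = ncol(design_nested))
c(rank = qr(design_within)$rank, columns = ncol(design_within))
```

Here the design with `Status` is rank deficient, since a patient-level variable is confounded with the patient terms, while the design with `Condition` is full rank. This is part of why the sample metadata matters before deciding between fixed and random effects.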