Question

DESeq2: error in checkFullRank(modelMatrix)

0

Entering edit mode

8 months ago

tilia • 0

Hi everyone,

I got this well known error when using DESeq2:

Error in checkFullRank(modelMatrix):
the model matrix is not full rank, so the model cannot be fit as specified. One or more variables or interaction terms in the design formula are linear combinations of the others and must be removed.

This is my colData:

SampleID    Phenotype   Batch   Condition
Sample01    LLF LLFbatch1   A
Sample02    LLF LLFbatch1   A
Sample03    LLF LLFbatch1   A
Sample04    LLF LLFbatch1   B
Sample05    LLF LLFbatch1   B
Sample06    LLF LLFbatch1   B
Sample07    LLF LLFbatch2   A
Sample08    LLF LLFbatch2   A
Sample09    LLF LLFbatch2   A
Sample10    LLF LLFbatch2   B
Sample11    LLF LLFbatch2   B
Sample12    LLF LLFbatch2   B
Sample13    WT  WTbatch1    A
Sample14    WT  WTbatch1    A
Sample15    WT  WTbatch1    A
Sample16    WT  WTbatch1    B
Sample17    WT  WTbatch1    B
Sample18    WT  WTbatch1    B
Sample19    WT  WTbatch2    A
Sample20    WT  WTbatch2    A
Sample21    WT  WTbatch2    A
Sample22    WT  WTbatch2    B
Sample23    WT  WTbatch2    B
Sample24    WT  WTbatch2    B

I was asked to explore DEGs for Condition B vs. A.

I already analyzed the phenotypes seperately, for example "B vs. A" in WT by taking into account the batch effect (~ Batch + Condition). I also tried ~Phenotype+Condition, which works, but lacks the batch information.

Now I want to consider all sources of variation, which are phenotype and the two batches for each phenotype. There are always three biological replicates for each combination of them, e.g. three times Condition 'A' for phenotype 'LLF' with batch 'LLFbatch1' (Sample 1-3).

This is my design formula:

design= ~ Phenotype + Batch + Condition

I have read the section “Model matrix not full rank” in the DESeq2 vignette and I think this might be an example of "Group-specific condition effects, individuals nested within groups". Even though I understand this example, I still don't understand why my design also produces this error. A solution was given by introducing an additional "ind.n" column. However, I am not sure how to apply this to my situation.

What can I do? Can anyone help please?

DESeqDataSet design DESeq2 RNA-Seq • 1.8k views

ADD COMMENT • link updated 4 months ago by Konrad • 0 • written 8 months ago by tilia • 0

1

Entering edit mode

Were the samples actually sequenced/generated in (a) two total batches (batch1 and batch2), or (b) four total batches (two per phenotype: LLFbatch1, LLFbatch2, WTbatch1, WTbatch2)? If it was (a), you should reformulate the batch column to be batch1 and batch2 (i.e., remove the phenotype prefix) and then the model will be full rank. If (b), then the model matrix isn't full rank because of perfect collinearity: batch is nested within phenotype.

ADD REPLY • link 8 months ago by cfos4698 ★ 1.1k

score 3 · Answer 1 · 2024-11-21

3

Entering edit mode

8 months ago

i.sudbery 22k

The reason that fitting this model doesn't work is that your Phenotype is a linear combination of your batches. That is LLF=LLFbatch1+LLFbatch2 and WT=WTbatch1 + WTbatch2. Because of this, under a standard design the linear model is uable to distinguish what part of an effect is from phenotype, and what part of an effect is a batch effect that happens to be share by, for example, LLFbatch1 and LLFbatch2.

ADD COMMENT • link 8 months ago by i.sudbery 22k

0

Entering edit mode

Thank you very much for your answers! I realised that I have not described the situation clearly enough. I'm sorry for that. Especially the column title "Batch" was misleading. New attempt..

This is the renamed table:

SampleID    Phenotype   Genotype   Treatment
Sample01    LLF LLF1   untreated
Sample02    LLF LLF1   untreated
Sample03    LLF LLF1   untreated
Sample04    LLF LLF1   treated
Sample05    LLF LLF1   treated
Sample06    LLF LLF1   treated
Sample07    LLF LLF2   untreated
Sample08    LLF LLF2   untreated
Sample09    LLF LLF2   untreated
Sample10    LLF LLF2   treated
Sample11    LLF LLF2   treated
Sample12    LLF LLF2   treated
Sample13    WT  WT1    untreated
Sample14    WT  WT1    untreated
Sample15    WT  WT1    untreated
Sample16    WT  WT1    treated
Sample17    WT  WT1    treated
Sample18    WT  WT1    treated
Sample19    WT  WT2    untreated
Sample20    WT  WT2    untreated
Sample21    WT  WT2    untreated
Sample22    WT  WT2    treated
Sample23    WT  WT2    treated
Sample24    WT  WT2    treated

There are wildtype samples and samples with a disease phenotype (Phenotype). For each Phenotype, there are samples from two patients each (Genotype). These samples were treated or not (Treatment).

I already did comparisons "treated vs untreated" for each genotype (WT1, WT2, LLF1, LLF2), design: ~Treatment. I also analyzed "treated vs untreated" for each phenotype (WT, LLF), design: ~Genotype+Treatment.

Now I want to analyze "LLF vs WT" for the untreated and the treated samples each. For both cases, the respective table would look like this:

SampleID    Phenotype   Genotype
Sample01    LLF    LLF1
Sample02    LLF    LLF1
Sample03    LLF    LLF1
Sample07    LLF    LLF2
Sample08    LLF    LLF2
Sample09    LLF    LLF2
Sample13    WT    WT1
Sample14    WT    WT1
Sample15    WT    WT1
Sample19    WT    WT2
Sample20    WT    WT2
Sample21    WT    WT2

(A) I could of course use the design ~Phenotype, but this would lack the variation information (e.g. WT is composed of WT1 and WT2) and might not be the correct approach.

(B) If I use the design ~Genotype+Phenotype to provide the variation information, I get "model matrix is not full rank" due to the linear combination of Genotypes.

(C) The third option I see is to make four comparisons (LLF1 vs. WT1, LLF2 vs. WT1, LLF1 vs. WT2, LLF2 vs. WT2) and look at the individual results. This would make analysing the experiment cumbersome. As the next step is gene set enrichment analysis (GSEA), four different results would have to be compared.

What procedure would you recommend? Is there another option?

ADD REPLY • link 6 months ago by tilia • 0

0

Entering edit mode

What good is the batch/genotype information if you are performing LLFvsWT? For example LLF/LLF1 doesn't compare with WT/WT1 or WT/WT2 directly. So you only need to compare LLF vs WT and ignore genotypes since there is no matching genotype between your phenotypes. You may want to treat as 6 reps each, although from my understanding, in this comparison, it is more like 2 biological reps with 3 technical repeats per biological rep.

ADD REPLY • link 6 months ago by rfran010 ★ 1.6k

0

Entering edit mode

Yes, that's it. There are 2 biological reps with 3 technical reps each. I was not sure if it would be correct to treat them as 6 reps. According to this post (https://support.bioconductor.org/p/9155748/) every source of variation should be included, otherwise the model will be mis-specified. PCA also does not support this:

enter image description here

ADD REPLY • link 6 months ago by tilia • 0

1

Entering edit mode

This is a nested design. I think the best way to analyse this would be to treat it as a mixed effects model, with the patient being a fixed effect and the replicate within a patient being a random effect. The easiest way to do this would be to do the analysis with limma/voom rather than DESeq2, and use the replicateCorrelation function to aggregate the "technical" replicates.

ADD REPLY • link 6 months ago by i.sudbery 22k

score 0 · Answer 2 · 2025-03-03

Wouldn't it be possible to model the nested structure by an interaction term (Phenotype:Genotype) as a workaround:

~ Phenotype + Phenotype:Genotype + Treatment

or

~ Phenotype + Genotype + Phenotype:Genotype + Treatment

Then Genotype is not modeled as random effect (needing a mixed model) but fixed effect and he can stay with DeSeq2. The interpretation is a bit different. As random effects: Genotype comes from a big unknown population (Genotype = individual patients of a huge population?). As fixed effect: there are only the measured Genotypes (LLF1, LLF2, WT1, ...) and we are just interested in their differences/contributions. And depending on whether these assumptions/interpretations are justified, the result wouldn't differ a lot to a mixed model.