Question

How to solve DESeq2 Error in checkFullRank(modelMatrix) cause by biological replicates information?

0

Entering edit mode

3 months ago

ace • 0

I have 4 factors in the design formula in DESeq2

dds <- DESeqDataSetFromMatrix(countData = dfinput0,
                          colData = sampleTable,
                          design= ~ condition+age+sex+type,condition:type )

I got the error:

Error in checkFullRank(modelMatrix) : 
the model matrix is not full rank, so the model cannot be fit as specified.
One or more variables or interaction terms in the design formula are linear
combinations of the others and must be removed.

This is my coldata:

   condition type     age    sex
1     control    z      57 Female
2     control   af      72   Male
3     control   ac      22   Male
4     control   ab      71 Female
5     control   ab      71 Female
6     control   at      68   Male
7     control   ba      81   Male
8     control   ba      81   Male
9     control   bh      68   Male
10    control   bh      68   Male
11    control   am      81   Male
12    control   am      81   Male
13    control   an      79   Male
14    control   ao      83 Female
15    control   ap      71   Male
16    control   ap      71   Male
17    control   aq      77   Male
18    control   aq      77   Male
19    control   ar      59   Male
20    control   as      67 Female
21    control   au      80 Female
22    control   au      80 Female
23    control   av      77 Female
100   treated    e      48   Male
101   treated    e      48   Male
102   treated    e      48   Male
103   treated    b      32 Female
104   treated    b      32 Female
105   treated    f      45 Female
106   treated    f      45 Female
107   treated    a      68   Male
108   treated    a      68   Male
109   treated    j      64   Male
110   treated    j      64   Male
111   treated    i      69   Male
112   treated    i      69   Male
113   treated    h      45 Female
114   treated    c      59   Male
115   treated    c      59   Male
116   treated    g      75 Female
117   treated    g      75 Female

the type column which represent biological replicates within the condition cause this error. Each sample is coming from human samples and for some human samples there are 2 or 3 sample, for others there is only one sample per human. How can I include the biological replicate information without having this error to DESEq2 analysis?

DESEq2 DESEq • 1.2k views

ADD COMMENT • link updated 3 months ago by swbarnes2 14k • written 3 months ago by ace • 0

0

Entering edit mode

You'll want to use the collapseReplicates() function in DESeq2, see https://rdrr.io/bioc/DESeq2/man/collapseReplicates.html.

ADD REPLY • link 3 months ago by Trivas ★ 1.8k

0

Entering edit mode

That function is only intended for technical replicates, which is not the case here.

ADD REPLY • link 3 months ago by jared.andrews07 ★ 18k

score 1 · Answer 1 · 2024-08-08

1

Entering edit mode

3 months ago

jared.andrews07 ★ 18k

You can't, as they are not split between your condition of interest. You need them present in both groups to account for them properly. age is also going to be a problem, as it's assuming a linear increase in expression with time.

My advice would be to remove those terms from your design and look at the resulting PCA to see if they're actually confounding factors or not. Nothing you can do for type, but you could consider create age bins that would be usable in your design. Currently, sex is the only thing that can be accounted for properly and simply.

Here's a similar question with some responses from the DESeq2 dev.

ADD COMMENT • link 3 months ago by jared.andrews07 ★ 18k

0

Entering edit mode

Another option is to use something like RUVseq to estimate factors of hidden variation. Maybe this captures a combination of age, sex and other unwanted confounders in one or a few factors so you can adjust for this rather than sex, age individually. With human data it is likely that you have many confounders that you cannot explain by the metadata, for example genetic predisposition, drug intake, overall health status, diet, lifestyle etc.

ADD REPLY • link 3 months ago by ATpoint 85k

0

Entering edit mode

Yeah, RUVseq has its own caveats, but may be worth a shot. I'd prefer to see some indications that there is unwanted variation first before trying it though.

ADD REPLY • link 3 months ago by jared.andrews07 ★ 18k

0

Entering edit mode

I have yet to see human RNA-seq without unwanted variation...

ADD REPLY • link 3 months ago by ATpoint 85k

0

Entering edit mode

...fine, I guess. But I'd still prefer to see that such variation is obfuscating the variable of interest before complicating the model.

In particular because it's not clear how to set k for RUVseq, and I have yet to see the authors provide much guidance beyond "try a few and see which looks best", which feels bad. And in my experience, it changes things in a significant way downstream. I guess I am just conservative when it comes to model design and would rather sacrifice sensitivity for precision.

In short, if it doesn't seem like a problem, why try to address it? The RUVseq devs seem to also recommend this philosophy.

ADD REPLY • link 3 months ago by jared.andrews07 ★ 18k

0

Entering edit mode

I actaully don't think that age will be a problem, if treated as a continuous covariate, rather than a factor. However (as Mike notes in the linked thread), that would assume there was a log-linear relationship between age and expression, which is unlikely to be precisely true.

However, I do think that the problem with age is soluble, where as the problem with type is far more difficult structural problem.

ADD REPLY • link 3 months ago by i.sudbery 20k

score 1 · Answer 2 · 2024-08-09

The problem here is that DESeq2 can only deal with replication at one level, where as you have replication at multiple levels - you have multiple patients, each with multiple replicates. Traditional statistical practice here would be to use a mixed effects model, with the patient as a fixed effect and the replicates within the patient as a random effect.

You can't do this with DESeq2, so you have two/three options.

You could use DESeq2, but choose to only be interested in one of the levels of variation - You could assume there is a "true" level of expression in each patient, of which each sample is a noisey estimate (that is, in the perfect experiment, each replicate from the same patient would be identical), and model the variation between patients - this would involve collapsing the within patient replicates, by summing the counts (this is what limma's collapseReplicates). Or you could assume that there is a "true" level of expression for the condition, of which each patient is a noisy estimate of, but that thing truly vary over time/location within a patient. In this case, you would treat each sample as independent and ignore the patient effect.

Or you could use something other than DESeq2. limma has a duplicateCorrelation function, which, as I understand it, produces results similar to a mixed effects model. It was originally designed to deal with microarrays that have multiple probes for the same gene, but has since been re-purposed to deal with this sort of intra-subject variability.

score 0 · Answer 3 · 2024-08-11

As nice as it would be to be able to account for all those things, if all those things really do affect your end results significantly, you would have to have done a much more massive experiment in order to properly model that. You can't have one type z sample of one age and one treatment and expect to be able to model what that means across all the ages and treatments.

It's great that you kept all that metadata, but you realistically aren't going to be able to use it all. Use PCA to figure out what matters, and dump the metadata elements that don't correspond with the first few PCs. Type is almost certainly going to have to go, unless you have a lot more samples than you are showing here, and have multiple samples for each type for control and treatment.