Question

DESeq2 model matrix not full rank

0

Entering edit mode

3.8 years ago

Assa Yeroslaviz ★ 1.9k

I'm running a deseq2 analysis and when trying to create the model matrix I get the familiar error message

Error in checkFullRank(modelMatrix) : the model matrix is not full rank, ...

I know what the error means and I have tried to circumvent the problem, but I don't see what I can do next to solve it. I know I have a batch effect as one can see from the image below (before and after batch correction using limma::removeBatchEffect.)

batch correction

my coldata is attached here as well and below also the snippet to create the model matrix.

dds <- DESeqDataSetFromMatrix(countData = cts, #( the count data)
                              colData = coldata, (see below for the information)
                              design = ~ social_group + parents)

I'm not sure how to correct it so that i can compensate the data for the batch.

thanks in advance for the help

model matrix

name    index   group   social_group    social_subgroup condition   batches label   cohort  parents
10_3_lo_hi_a    10  3   lo  hi  3_lo_hi a   10_lo_hi    3_lo    parents1
11_3_hi_lo_a    11  3   hi  lo  3_hi_lo a   11_hi_lo    3_hi    parents1
12_3_hi_hi_a    12  3   hi  hi  3_hi_hi a   12_hi_hi    3_hi    parents1
13_3_lo_lo_b    13  3   lo  lo  3_lo_lo b   13_lo_lo    3_lo    parents1
14_3_lo_hi_b    14  3   lo  hi  3_lo_hi b   14_lo_hi    3_lo    parents1
15_3_hi_lo_b    15  3   hi  lo  3_hi_lo b   15_hi_lo    3_hi    parents1
16_3_hi_hi_b    16  3   hi  hi  3_hi_hi b   16_hi_hi    3_hi    parents1
17_3_lo_lo_c    17  3   lo  lo  3_lo_lo c   17_lo_lo    3_lo    parents2
18_3_lo_hi_c    18  3   lo  hi  3_lo_hi c   18_lo_hi    3_lo    parents2
19_3_hi_lo_c    19  3   hi  lo  3_hi_lo c   19_hi_lo    3_hi    parents2
1_mid_lo_1  1   mid mid lo  mid_lo  1   1_mid_lo    mid mid_1
20_3_hi_hi_c    20  3   hi  hi  3_hi_hi c   20_hi_hi    3_hi    parents2
21_3_lo_lo_d    21  3   lo  lo  3_lo_lo d   21_lo_lo    3_lo    parents2
22_3_lo_hi_d    22  3   lo  hi  3_lo_hi d   22_lo_hi    3_lo    parents2
23_3_hi_lo_d    23  3   hi  lo  3_hi_lo d   23_hi_lo    3_hi    parents2
24_3_hi_hi_d    24  3   hi  hi  3_hi_hi d   24_hi_hi    3_hi    parents2
25_1_lo_lo_a    25  1   lo  lo  1_lo_lo a   25_lo_lo    1_lo    parents3
26_1_lo_hi_a    26  1   lo  hi  1_lo_hi a   26_lo_hi    1_lo    parents3
27_1_hi_lo_a    27  1   hi  lo  1_hi_lo a   27_hi_lo    1_hi    parents3
28_1_hi_hi_a    28  1   hi  hi  1_hi_hi a   28_hi_hi    1_hi    parents3
29_1_lo_lo_b    29  1   lo  lo  1_lo_lo b   29_lo_lo    1_lo    parents3
2_mid_hi_1  2   mid mid hi  mid_hi  1   2_mid_hi    mid mid_1
30_1_lo_hi_b    30  1   lo  hi  1_lo_hi b   30_lo_hi    1_lo    parents3
31_1_hi_lo_b    31  1   hi  lo  1_hi_lo b   31_hi_lo    1_hi    parents3
32_1_hi_hi_b    32  1   hi  hi  1_hi_hi b   32_hi_hi    1_hi    parents3
33_1_lo_lo_c    33  1   lo  lo  1_lo_lo c   33_lo_lo    1_lo    parents4
34_1_lo_hi_c    34  1   lo  hi  1_lo_hi c   34_lo_hi    1_lo    parents4
35_1_hi_lo_c    35  1   hi  lo  1_hi_lo c   35_hi_lo    1_hi    parents4
36_1_hi_hi_c    36  1   hi  hi  1_hi_hi c   36_hi_hi    1_hi    parents4
37_1_lo_lo_d    37  1   lo  lo  1_lo_lo d   37_lo_lo    1_lo    parents4
38_1_lo_hi_d    38  1   lo  hi  1_lo_hi d   38_lo_hi    1_lo    parents4
39_1_hi_lo_d    39  1   hi  lo  1_hi_lo d   39_hi_lo    1_hi    parents4
3_mid_lo_2  3   mid mid lo  mid_lo  2   3_mid_lo    mid mid_2
40_1_hi_hi_d    40  1   hi  hi  1_hi_hi d   40_hi_hi    1_hi    parents4
4_mid_hi_2  4   mid mid hi  mid_hi  2   4_mid_hi    mid mid_2
5_mid_lo_3  5   mid mid lo  mid_lo  3   5_mid_lo    mid mid_3
6_mid_hi_3  6   mid mid hi  mid_hi  3   6_mid_hi    mid mid_3
7_mid_lo_4  7   mid mid lo  mid_lo  4   7_mid_lo    mid mid_4
8_mid_hi_4  8   mid mid hi  mid_hi  4   8_hi_4  mid mid_4
9_3_lo_lo_a 9   3   lo  lo  3_lo_lo a   9_lo_lo 3_lo    parents1

full-rank model-matrix deseq2 • 3.1k views

ADD COMMENT • link updated 23 months ago by GenoMax 150k • written 3.8 years ago by Assa Yeroslaviz ★ 1.9k

0

Entering edit mode

dds <- DESeqDataSetFromMatrix(countData = countData, colData = coldata, design = ~ Conditions + Treatment_time + Conditions:Treatment_time)
as coldata i used this 
Sample_ID       Conditions      Treatment_time
A1      Control         4h
A2      Control         4h
Z1      CK_Treatment    4h
Z2      CK_Treatment    4h
A7      Control         28h
A8      CK_Treatment    28h
Z7      Control         28h
Z8      Control         28h
A13     CK_Treatment    52h
A14     CK_Treatment    52h
Z13     Control         52h
Z14     CK_Treatment    52h

and got an error

Error in checkFullRank(modelMatrix) : 
  the model matrix is not full rank, so the model cannot be fit as specified.
  Levels or combinations of levels without any samples have resulted in
  column(s) of zeros in the model matrix.

  Please read the vignette section 'Model matrix not full rank':

  vignette('DESeq2')
In addition: Warning message:
In DESeqDataSet(se, design = design, ignoreRank) :
  some variables in design formula are characters, converting to factors

I am not able to resolve this issue please help.

ADD REPLY • link updated 23 months ago by GenoMax 150k • written 23 months ago by Simran • 0

0

Entering edit mode

Please do not post new questions as answers in existing threads. Create a new thread if existing threads do not help answer your question.

ADD REPLY • link 23 months ago by GenoMax 150k

score 9 · Accepted Answer · 2021-06-24

I'm afraid the answer is that you can't, not if you want to include the "mid" parents in this analysis.

Your plots are actaully a nice illustration of why not. When you remove batch effects in limma, the assumption that limma is making is that the average of each batch should be the same. Conceptually it calculates the average of all the samples in each batch and compares the differences between batches - if a gene is twice as highly expressed in batch1 as it is in batch2, then the values for all samples in batch1 are divided by two so that the average of batch1 is the same as the average in batch2.

The assumption here is that all variance in the means between batch1 and batch2 is caused by the difference in batch and not differences in biologically meaningful variables. But if all batch1 was condition 1, while batch2 was a mix of condition2 and 3, then that difference in means is just as likely to be due to the different condition as it is the different batch.

If you look at your PCA, you'll see that your "mid" batch has been placed exactly so that its average is 0 on PC1 and 0 on PC2. This is not coincidence, but rather it has been forced to that position by the mathematics.

It is mathematically, philosophically and practically impossible to disentangle the effect of parent social class from the effect of offspring social class using this dataset. Your only choice is to either accept that many of the effects are being mediated by parent of origin, or to drop the "mid" samples out of the analysis, and just use the hi and lo.

A further complication though is that the modeling framework in all of the mainstream RNA-seq analysis tools assumes a fixed-effects model where the independent variables (the predictors/factors) are statistically independent of each other (that is, in this case that the social class of the offspring is not affected by the social class of the parent). This is really only valid in the case of full randomization, and not for an observational study. I get the feeling that here a) at least some of these effects need to be random effects b) the predictors are not independent at all.

Personally I think you need to talk to an epidemiologist/biostatistician that specializes in these sorts of things.