DESeq2: "model matrix is not full rank"
3.0 years ago
bart ▴ 50

Hi,

When I use the DESeqDataSetFromMatrix function from the DESeq2 package in the following way:

DESeqDataSetFromMatrix(df, colData = metadata, design = ~ location + time + age + sex + group)

I get the following error:

Error in checkFullRank(modelMatrix): the model matrix is not full rank, so the model cannot be fit as specified.
  One or more variables or interaction terms in the design formula are linear
  combinations of the others and must be removed.

My metadata object looks like this:

x, (cancergroup or control) group, location (of sampling), (sampling) time, age, sex
Sample 1,0,first_location,<12h,4,M 
Sample 2,0,first_location,<12h,7,M 
Sample 3,0,first_location,<12h,6,F 
Sample 4,0,first_location,<12h,6,M 
Sample 5,0,first_location,<12h,2,M 
Sample 6,0,first_location,<12h,5,M 
Sample 7,0,first_location,<12h,2,M 
Sample 7,1,second_location,>24h,2,M
Sample 8,1,second_location,>24h,2,M
Sample 9,1,second_location,>24h,4,F
Sample 10,1,second_location,>24h,2,M
Sample 11,1,second_location,>24h,3,M
Sample 12,1,second_location,<12h,2,F
Sample 13,1,second_location,<12h,5,F

I converted the ages into factor levels by decade: a 70-year-old patient gets level 7, a 60-year-old gets level 6, and so on. I also encoded the groups as a factor: cancer is 0 and control is 1. All other columns are factors as well.

I think the problem is that some samples have identical values in their rows, such as samples 5 and 7. I have seen multiple posts similar to mine, but I still don't understand how to solve this problem.

Can anyone help?

DESeq2 • 1.7k views
3.0 years ago

I don't think a sample set this small will let you see the effects of five different experimental variables. Also, it looks like group and location carry the same information, so including both will trigger the "not full rank" error.
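To make the overlap between group and location concrete, here is a minimal sketch in base R. It assumes the metadata is rebuilt by hand from the 14 rows posted in the question (the column order and factor coding are my reconstruction, not the poster's actual object).

```r
# Metadata copied by hand from the table in the question (14 rows).
metadata <- data.frame(
  group    = factor(c(rep(0, 7), rep(1, 7))),
  location = factor(c(rep("first_location", 7), rep("second_location", 7))),
  time     = factor(c(rep("<12h", 7), rep(">24h", 5), rep("<12h", 2))),
  age      = factor(c(4, 7, 6, 6, 2, 5, 2, 2, 2, 4, 2, 3, 2, 5)),
  sex      = factor(c("M", "M", "F", "M", "M", "M", "M",
                      "M", "M", "F", "M", "M", "F", "F"))
)

# Cross-tabulate the two suspect variables: the off-diagonal cells are empty,
# so each group occurs at exactly one location.
tab <- table(group = metadata$group, location = metadata$location)
print(tab)

# The corresponding dummy columns in the model matrix are identical,
# which is what makes the matrix rank deficient.
mm <- model.matrix(~ location + time + age + sex + group, metadata)
same_column <- all(mm[, "locationsecond_location"] == mm[, "group1"])
print(same_column)
print(c(rank = qr(mm)$rank, columns = ncol(mm)))
```

Running the same cross-tabulation on the full metadata should show quickly whether the overlap is only a feature of the excerpt.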

3.0 years ago

In the sample matrix you posted, location and group are perfectly confounded, and time is nearly so: every group 0 sample was taken at first_location at <12h, and every group 1 sample at second_location, mostly at >24h. If these are all the samples in your study, it is not possible to disentangle which effects are due to group and which to location, and the time effect is hard to separate as well. One can only estimate the combined effect of these variables.
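Because group and location cannot be separated in the posted table, one option is to drop one of them and read the remaining coefficient as their combined effect. The sketch below assumes the same hand-rebuilt metadata as in the question's table and uses only base R. It shows that the reduced design is full rank for these 14 rows; that makes the model fittable, but it does not make the group effect separable from location.

```r
# Metadata copied by hand from the table in the question (14 rows).
metadata <- data.frame(
  group    = factor(c(rep(0, 7), rep(1, 7))),
  location = factor(c(rep("first_location", 7), rep("second_location", 7))),
  time     = factor(c(rep("<12h", 7), rep(">24h", 5), rep("<12h", 2))),
  age      = factor(c(4, 7, 6, 6, 2, 5, 2, 2, 2, 4, 2, 3, 2, 5)),
  sex      = factor(c("M", "M", "F", "M", "M", "M", "M",
                      "M", "M", "F", "M", "M", "F", "F"))
)

# Full design from the question: one column more than the rank supports.
mm_full <- model.matrix(~ location + time + age + sex + group, metadata)

# Reduced design without location: "group1" now stands for group AND location.
mm_reduced <- model.matrix(~ time + age + sex + group, metadata)

print(c(full_rank = qr(mm_full)$rank, full_columns = ncol(mm_full)))
print(c(reduced_rank = qr(mm_reduced)$rank, reduced_columns = ncol(mm_reduced)))
```

If the reduced matrix is full rank, the same reduced formula can be passed as the design argument of DESeqDataSetFromMatrix.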


Hi, thanks for responding. I actually have more samples (88 cancer and 88 control samples from different locations, with different ages, etc.). Do you have any idea how I should proceed? From what I understand from the vignette, I have to make a balanced design. But is it allowed for samples to share column values as long as not all column values are the same? For example, samples 1 and 2 are from patients with different ages but otherwise have the same column values. Is that allowed?


I don't think that should be a problem; samples can share values in some or even all columns. You may want to check, though, that the design matrix doesn't contain any columns that are all zero, for example:

design_matrix <- model.matrix(~ location + time + age + sex + group, metadata)
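As a rough illustration of that check, here is a sketch in base R. The helper name check_design and the toy data frame are made up for this example and are not part of DESeq2. All-zero columns typically appear when a factor still carries a level that no remaining sample uses, for instance after subsetting.

```r
# Hypothetical helper (not a DESeq2 function): report rank and all-zero columns.
check_design <- function(formula, data) {
  mm <- model.matrix(formula, data)
  list(
    n_columns = ncol(mm),
    rank      = qr(mm)$rank,
    zero_cols = colnames(mm)[colSums(mm != 0) == 0]
  )
}

# Toy metadata: level "c" is declared but no sample uses it.
toy <- data.frame(
  group = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c"))
)

res <- check_design(~ group, toy)
print(res)

# droplevels() removes unused factor levels, which removes the empty column.
fixed <- check_design(~ group, droplevels(toy))
print(fixed)
```

The same helper can be called with the real formula and metadata; a rank lower than the number of columns with no all-zero columns points to confounded variables rather than empty levels.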

3.0 years ago
Mensur Dlakic ★ 28k

You may want to look at how the variance inflation factor (VIF) can detect multicollinearity between features. It may help you decide which features to drop from the analysis. There is a practical example here and a Python implementation is here.
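For readers who prefer to stay in R, the idea can be sketched with base R alone. This is a simplified per-column VIF computed on the dummy-coded model matrix, assuming the metadata is rebuilt by hand from the question's table; the helper column_vif is hypothetical, and for multi-level factors a generalized VIF would be the more standard choice. Each column is regressed on all the others, and VIF = 1 / (1 - R^2), which equals TSS / RSS.

```r
# Metadata copied by hand from the table in the question (14 rows).
metadata <- data.frame(
  group    = factor(c(rep(0, 7), rep(1, 7))),
  location = factor(c(rep("first_location", 7), rep("second_location", 7))),
  time     = factor(c(rep("<12h", 7), rep(">24h", 5), rep("<12h", 2))),
  age      = factor(c(4, 7, 6, 6, 2, 5, 2, 2, 2, 4, 2, 3, 2, 5)),
  sex      = factor(c("M", "M", "F", "M", "M", "M", "M",
                      "M", "M", "F", "M", "M", "F", "F"))
)

mm <- model.matrix(~ location + time + age + sex + group, metadata)
X <- mm[, colnames(mm) != "(Intercept)", drop = FALSE]

# Hypothetical helper: VIF of column j from regressing it on the other columns.
column_vif <- function(X, j) {
  y <- X[, j]
  fit <- lm.fit(cbind(1, X[, -j, drop = FALSE]), y)
  rss <- sum(fit$residuals^2)      # residual sum of squares
  tss <- sum((y - mean(y))^2)      # total sum of squares
  tss / rss                        # same as 1 / (1 - R^2)
}

vif <- sapply(seq_len(ncol(X)), function(j) column_vif(X, j))
names(vif) <- colnames(X)
print(signif(vif, 3))
```

Columns that are exact linear combinations of the others come out as infinite or astronomically large, which flags them as candidates for removal.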
