Question

model matrix error message 'unequal columns/rows' when using a conditional variable

0

Entering edit mode

4.3 years ago

RNAseqer ▴ 280

Hello all,

I have an RNAseq dataset where some of the samples were run twice, and my counts for those samples are the result of combining a first and second run's output. Since not all of these samples had a second run, I want to set up the second run's lane as a variable in my design. The way I have done this is to create a variable categorical variable "secondrun" that has "1" or "NA" and a conditional variable 'lanesecondrun' that nests within secondrun. As I understand it this should mean lanesecondrun is only considered when secondrun exists (like the commonly given example of a model where "age of spouse" is only considered when the variable "spouse" is turned on). Unfortunately this is giving me an error in the 'estimateDisp() step of the edgerR pipeline.

design <- model.matrix(~0 + Disease + Batch + secondrun + (secondrun:lanesecondrun) + RIN + Sex + PMI + Smoking + Ethanol, bigdemos)

dge.estDisp.out <- estimateDisp(dge.tmm.normfact.out, design)

Error in glmFit.default(sely, design, offset = seloffset, dispersion = 0.05,  : 
  nrow(design) disagrees with ncol(y)

Looking at my design matrix, I see that 'secondrun' has been assigned a 1 for all 300 rows, despite the fact that some of the 300 subjects had "NA" assigned to that column of the demos file. I also only see two of the three levels from 'lanesecondrun' represented in the design matrix (the first non NA value in the column from the demos file, F6, has no column in the design matrix.) This is my first time setting up a nested design.

I feel like I'm missing something obvious here. Can anyone tell me what I have done wrong?

rna-seq categorical variables edger • 1.2k views

ADD COMMENT • link updated 4.2 years ago by Gordon Smyth ★ 8.1k • written 4.3 years ago by RNAseqer ▴ 280

score 0 · Answer 1 · 2021-03-20

NA values are not allowed in a covariate or factor, so you are not setting up the factor. The no NA requirement is not specific to edgeR but applies to any statistical that uses design matrices or model.matrix.

Generally you should simply add the counts for the two runs, and the sumTechReps function is provided for that purpose.