EDIT:
Cross-posted on Bioconductor: https://support.bioconductor.org/p/114506/
Hello,
I am trying to figure out how to handle biological replicates across batches in DESeq2's design matrix. They're not quite technical replicates, but trying to include the sample ID as a factor in the design matrix produces a 'Model matrix not full rank' error. What do you think I should do?
Experiment:
We have three variables of interest in our analysis: Timepoint (A and B), Sex (M or F), and Condition (cond1 or cond2).
Note that a given sample can only have one timepoint, one sex, and one condition - so it's not like we can split the same tissue and look at it at Timepoint A and B. Each possible Timepoint x Sex x Condition has its own unique samples.
In our first sequencing batch, we collected samples for each possible combination of conditions.
In our second batch, we took some of the same RNA samples from the first sequencing batch PLUS some new RNA samples, re-generated libraries from all of these, and then sequenced.
In the end, we have a sample table that looks like this:
SAMPLE | BATCH | CONDITION | SEX | TIMEPOINT
------ ------- --------- ---- ---------
samp1 | 1 | cond1 | F | A
samp1 | 2 | cond1 | F | A
samp2 | 1 | cond1 | M | A
samp2 | 2 | cond1 | M | A
samp3 | 1 | cond1 | F | A
samp4 | 2 | cond1 | F | A
...and so forth.
Problem
I tried to run DESeq with this:
design(dds) <- ~Genotype + Sex + Condition + SequencingBatch + Sample
and also with this:
design(dds) <- ~Genotype + Sex + Condition + SequencingBatch_Sample
...where SequencingBatch_Sample is a combination of SequencingBatch and Sample, like "samp1_1", samp1_2" for sample 1/batch1 and sample1/batch2.
These designs, of course, throw a 'Matrix not full rank' error, presumably because it is not possible to have the same sample have every possible combination of Condition/Sex/Timepoint, since each sample can only have one quality each of Condition, Sex, and Timepoint.
However, I'm not quite sure what to do instead.
Using collapseReplicates() doesn't feel appropriate because these aren't really technical replicates - a whole new library was produced for the second batch, just using the same RNA.
Doing nothing seems problematic because then the 'duplicated' signal from a subset of samples threatens to overwhelm the differential expression results.
All suggestions are very welcome - thank you!
I would ask this question at the Bioconductor support page. Mike Love (the developer of DESeq2) is outstandingly responsive, especially when dealing with well-written questions like yours. I asked something about shrinkage estimators a few hours ago and got an elaborate answer including code back just a few hours later.
Ah thank you! I haven't interacted with Bioconductor forum much, I'll give it a try.
Please share his answer here once you get one. I would be interested in his expertise on this one, too.
Hello - I posted his reply below, hope it helps
I'd start out by doing PCA on these samples. Maybe you will luck out, and observe that some of your categories don't make much difference. My guess is you will not be able to correct based on sample, since each sample is only in there once or twice.
Thanks for the suggestion! Unfortunately, it does seem like all of the aforementioned factors do separate within the first two PCs - except for Sex, which we know should absolutely have a large expression difference in all conditions at all timepoints. Replicate samples in non-batch-corrected data usually, but don't always, cluster together.