I am trying to build a DESeq2 model design to find expression differences between sets of patient samples who fall into three categories:
1.) No response to treatment (progression of disease) - 5 samples
2.) Partial response to treatment (stable disease) - 2 samples
3.) Complete response to treatment (improved symptoms) - 2 samples
I understand that we should have at least 3 biological replicates for each group and it is not statistically sound to compare 2 samples versus 5 samples using DESeq2. However we are trying to generate preliminary results with our existing data to show that it is worth it to collect more samples for the "partial response" and "complete response" groups.
In order to have at least three samples in each comparison group, I would like to combine the "partial response" and "complete response" groups into a unified set of 4 samples. This is working on the assumption that there are overlapping DE gene networks involved in the "partial response" and "complete response" groups. I understand that I may lose signal of DE genes which exist in only one of the two groups.
I would like to design a model to test for DE in these two groups:
1.) No response to treatment (progression of disease) - 5 samples
versus
2.) Partial response AND complete response to treatment (stable disease and improved symptoms grouped) - 4 samples
I've already tried running DESeq2 with a simple design:
DESeq.ds <- DESeqDataSetFromTximport(txi.salmon, sampleTable, ~ broader_patient_response)
Where broader_patient_response maps to either 1). No response to treatment (5 samples) and 2.) Some response to treatment (4 samples from two somewhat similar phenotypic groups).
However, I am wondering if it is necessary to correct for variance between the two combined sets of samples (Partial response AND complete response to treatment) before comparing them to the 5 No response to treatment samples. Is an interaction term appropriate here?
Can someone suggest an appropriate DESeq2 model design to use to make this comparison? Thank you in advance!
Why don't you simply encode 2) and 3) with the same name in
colData
, like calling themresponse
and the othersno_response
. If you call this columngroup
it comes down to~group
. That woud obviously treat the samples as replicates, so it could get you the base differences between non-responders and responders. Or simply do a 5 vs 2 comparison and see what comes out. It is true that n=2 is underpowered but given that it is preliminary data I do not see a major problem. I would compare what a 5 vs 2 rather than 5 vs 4 comparison results into, and then decide for the strategy that, at this moment of your project, serves your purpose better. A PCA is also always a good idea to explore how and whether groups even separate.Thank you very much for your reply @ATpoint. I started by running the 5 "no response" versus 4 "grouped partial response" comparison and received around 1000 DE genes within padj <0.005 and abs(FC) > 2 and the network enrichment results appear to make sense biologically. I edited my original post to clarify this comparison design. I posted because I was concerned that I could improve this model design by correcting for differences between the two sets of grouped "some response" condition samples in the model, but I get the feeling that this is done inherently by treating them as replicates. Is this true?
I ran PCA on all 9 samples and see the separation along PC2 that I would expect:
I'm not sure what is causing the separation seen in PC1. Do you feel it's necessary to identify the source and correct for it before trusting the DESeq results? I have already looked at mseqs and average GC_content per sample.
I will try running 5 versus 2 as you suggested. I didn't realize that it would be acceptable to include a group of 2 samples. The justification you posed makes sense and I'll share this with my team. Thank you very much!
With only 9 samples, you only have 9 PCs. And the strongest only represents 22% of your variance. That's just not a lot. I wouldn't worry about it.
If you are generating the PCA plot using PCAtools you can see the genes that most contribute to each PC by adding
showLoadings=TRUE
tobiplot
, or making a dedicated loadings plot usingplotloadings
.