correcting for a batch in DESeq2
2
11
Entering edit mode
9.2 years ago
gracechappell ▴ 110

Hello,

I have the following experimental design for an experiment in which we conducted RNA sequencing: 2 treatment groups and 2 batches, but one of the batches is exclusively one treatment.

sample  treatment  batch
3       control    A
4       control    A
5       control    B
6       control    B
7       control    B
13      control    B
14      control    B
15      BD         A
16      BD         A
17      BD         A
18      BD         A
18      BD         A

It is a poor experimental design, but unfortunately it is the data that I currently must work with. To account for my the potential batch effect of the B batch, I am using the following design formula:

dds <- DESeqDataSetFromMatrix(countData = countData,colData = colData,design = ~ batch + treatment)
dds$treatment <- factor(dds$treatment, levels=c("control", "BD"))
dds$batch <- factor(dds$batch, levels=c("A", "B"))
dds <- DESeq(dds, full=design(dds), reduced = ~ batch)

The results give me many less DE genes than if I simply ignored the batches and only use ~ treatment. This makes sense, because according to PCA and clustering, there is a batch effect in my samples. I've read the DESeq2 manual and many posts, but am not a statistician and would love to hear feedback if the design I'm using here makes sense, with the lack of representation of both treatment groups in the batch I am intended to correct for. Thank you!

RNA-Seq R • 12k views
ADD COMMENT
3
Entering edit mode
9.2 years ago

Your design is correct, it's not an issue that the batch is just within one treatment since that treatment is itself present in both batch A and B. You only have a big problem when a whole group is present in its own batch.

ADD COMMENT
0
Entering edit mode

I think this design would not really perform the kind of comparison one would expect. Have a look at my answer if you wish!

ADD REPLY
0
Entering edit mode

It produces the exact comparison that one expects, it's just an LRT for the effect of treatment. The only real caveat to any of this is that a batch-specific change in one group will completely throw off the results...but the only way to deal with such things is to either ignore the possibility (what's typically done) or just toss most of the data.

ADD REPLY
0
Entering edit mode

Sorry,

I have such design

    condition   batch
A1  treatment   1
A2  treatment   1
A3  treatment   1
A4  treatment   1
A5  treatment   1
A6  treatment   1
A7  control     1
A8  control     1
A9  control     1
A10 control     1
A11 control     1
A12 control     1
B1  treatment   2
B2  treatment   2
B3  treatment   2
B4  treatment   2
B5  treatment   2
B6  treatment   2
B7  control     2
B8  control     2
B9  control     2
B10 control     2
B11 control     2
B12 control     2

I am only interested in knowing genes related to batch (two experiments) ignoring any change due to treatment; Does this give me these genes?

dds <- DESeqDataSetFromMatrix(countData = countData,colData = colData,design = ~ batch + condition)
ADD REPLY
0
Entering edit mode

if you want to completely ignore the change due to treatment, then remove it:

dds <- DESeqDataSetFromMatrix(countData = countData,colData = colData,design = ~ batch)
ADD REPLY
0
Entering edit mode
9.2 years ago
Martombo ★ 3.1k

I recently had a similar problem and it is actually quite a non-optimal situation. The samples present in the B batch of your case are not going to be completely useful for your analysis. I produced some results with my data, first only comparing control and BD from what in your case would be the A batch (so two samples of the control vs BD) and then comparing all controls vs BD, including the batch variable as a covariate. Here's a peek at these results:

==> A_only <==
gene mean_base log2FC log2FC_MLE FDR
gene1 15001.4 -4.81 -4.89 1.3e-242
gene2 1889.02 1.62 1.64 2.4e-43
gene3 5020.71 0.92 0.92 3.5e-39
gene4 3568.32 0.91 0.92 1.0e-13

==> BA_batch <==
gene mean_base log2FC log2FC_MLE FDR
gene1 15001.47 -4.51 -4.89 6.0e-243
gene2 1889.02 1.56 1.64 1.2e-43
gene3 5020.71 0.85 0.92 2.0e-39
gene4 14800.52 0.86 0.92 1.6e-56

gene4 was modified on purpose, just to be sure of what I'm telling you. So don't consider it for the moment. Have a look instead at gene1, 2 and 3. You can see that the mean_base is the same between the two analyses and so is the unmoderated log2FC estimation (log2FC_MLE). Basically whether you include your control B samples or not, will not affect the log2FC that is being computed by DESeq2, that is because the contrast is performed between your control A samples and your BD samples. It will instead have an effect on the significance of the comparison and on the moderation of the log2FC. For gene4 I applied an additional check. I multiplied by 10 its counts in the samples control B. This means that if these samples were taken into account for the contrast, the log2FC should be very different. That is not the case! Which means that you can have 10-fold differences in your control B data and these will not influence your results.

ADD COMMENT
0
Entering edit mode

The Log2FC should change with increasing samples, since the ability to shrink with change accordingly. Yes, of course the computed value won't change, since the batch will make the sub-group means the same, but the resulting shrinkage will differ. This is why you want more than 1 sample in each batch, otherwise the results aren't very meaningful.

ADD REPLY
1
Entering edit mode

yes, I'm not saying that this is the wrong approach, it's indeed the best possible, probably. I'm just pointing at some of its implications. For example if somebody wanted to invest more money in adding samples from control B, that's probably not a good idea...

ADD REPLY

Login before adding your answer.

Traffic: 2342 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6