Hi guys,
I'm analysing some data with DESeq2
which involves 4 groups, so I do some comparisons between some specific groups. I'm finding that, when labelling genes as outliers (hence giving NA p-values) DESeq2 uses the information from all the samples in each row. This leads to some outlier genes for some comparison not being adequately (for me) labelled.
See this example: I compare group3 vs group4, and the top result is a very obvious outlier: it has 0 counts for all samples in g3 and g4, except one of them, which has 1000 counts. It was not flagged as an outlier because samples in g1 and g2 have high levels of counts:
#######################################
# Analysis with all samples
#######################################
# This is the design
design(foo) <- ~0 + group
model.matrix(design(foo), colData(foo))
groupg1 groupg2 groupg3 groupg4
s1_g1 1 0 0 0
s2_g1 1 0 0 0
s3_g1 1 0 0 0
s4_g2 0 1 0 0
s5_g2 0 1 0 0
s6_g2 0 1 0 0
s7_g3 0 0 1 0
s8_g3 0 0 1 0
s9_g3 0 0 1 0
s10_g4 0 0 0 1
s11_g4 0 0 0 1
s12_g4 0 0 0 1
attr(,"assign")
[1] 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"
# Fit model
foo <- DESeq(foo, test = "Wald", parallel = T)
# Extract results comparing group 4 vs group 3
res <- as.data.frame(DESeq2::results(foo,
+ contrast = list(c("groupg4"),
+ c("groupg3")),
+ listValues = c(1, -1),
+ pAdjustMethod = "BH",
+ alpha = 0.05))
head(dplyr::arrange(res,pvalue))
baseMean log2FoldChange lfcSE stat pvalue padj
ENSMUSG00000095937.1 1053.02717 -25.0892237 3.7236480 -6.737807 1.607944e-11 3.003318e-07
ENSMUSG00000102500.1 94.16247 8.5057336 1.2968182 6.558925 5.419696e-11 5.061454e-07
ENSMUSG00000030757.13 465.86889 -0.5388158 0.1335390 -4.034895 5.462674e-05 3.401061e-01
ENSMUSG00000029754.13 136.73970 0.9530311 0.2500406 3.811505 1.381234e-04 6.449670e-01
ENSMUSG00000034457.10 15.72399 -2.0677856 0.5545838 -3.728536 1.925956e-04 7.194603e-01
ENSMUSG00000096370.8 188.22008 -0.8566324 0.2397271 -3.573365 3.524235e-04 8.425913e-01
# Look at the counts for the top gene:
counts(foo,normalized = T)["ENSMUSG00000095937.1",]
s1_g1 s2_g1 s3_g1 s4_g2 s5_g2 s6_g2 s7_g3 s8_g3 s9_g3 s10_g4 s11_g4 s12_g4
4162.4721 3140.3940 1398.9893 872.8687 2000.8897 0.0000 0.0000 0.0000 1060.7122 0.0000 0.0000 0.0000
This behavior does not happen if we subset the DESeq object to only contain groups 3 and 4. Then, this gene is correctly flagged as an outlier:
#######################################
# Only using the compared groups
#######################################
foo2 <- foo[,7:12]
foo2$group <- factor(foo2$group)
design(foo2) <- ~0 + group
model.matrix(design(foo2), colData(foo2))
groupg3 groupg4
s7_g3 1 0
s8_g3 1 0
s9_g3 1 0
s10_g4 0 1
s11_g4 0 1
s12_g4 0 1
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"
# Fit model
foo2 <- DESeq(foo2, test = "Wald", parallel = T)
# Extract results comparing group 4 vs group 3
res2 <- as.data.frame(DESeq2::results(foo2,
+ contrast = list(c("groupg4"),
+ c("groupg3")),
+ listValues = c(1, -1),
+ pAdjustMethod = "BH",
+ alpha = 0.05))
head(dplyr::arrange(res2,pvalue))
baseMean log2FoldChange lfcSE stat pvalue padj
ENSMUSG00000102500.1 35.85269 8.5051312 1.3046299 6.519191 7.068767e-11 1.322284e-06
ENSMUSG00000051985.12 49.43580 3.1169236 0.6825849 4.566353 4.962824e-06 4.641729e-02
ENSMUSG00000029754.13 146.64453 0.9528564 0.2444252 3.898356 9.684778e-05 5.151895e-01
ENSMUSG00000094800.8 80.41222 -4.3109181 1.1147858 -3.867037 1.101656e-04 5.151895e-01
ENSMUSG00000075334.2 1482.64888 0.4557637 0.1311512 3.475101 5.106618e-04 9.999976e-01
ENSMUSG00000023079.14 561.07149 0.6503654 0.1898563 3.425567 6.135187e-04 9.999976e-01
# The previous top gene is now flagged as NA
dplyr::arrange(res2,pvalue)["ENSMUSG00000095937.1",]
baseMean log2FoldChange lfcSE stat pvalue padj
ENSMUSG00000095937.1 176.7854 -25.08636 3.907085 -6.420737 NA NA
So, my question is, is there any way to change DESeq2 filtering behaviour to flag outliers only for the groups being compared? Or do you think the best approach is just to subset the data for each comparison, separately? Have you ever encountered this issue?
thanks for your help
Thanks a lot for taking the time to answer. I'm sorry, I am indeed using
minReplicatesForReplace=Inf
, because I want to mark the outliers, but forgot to include it when typing the example.I also expected that the outlying sample would affect the group3 coefficient in the same way in both designs, but the thing is that, the Cook's distances change a lot with the design, that's why I initially thought maybe using the full model was somehow dampening the Cook's distance of the outlier:
As you can see, I would need a very low Cook's value filter to flag the gene as an outlier in the full model, and this would lead to flagging thousands of genes as outliers.
See here another example which would look like an outlier for the groups 3 and 4 (this one was not flagged by the filter probably because it didn't reach the 99% quantile of F(4,8), but it has a low enough value that it could maybe be flagged by softly changing the Cook's cut-off threshold). Nonetheless, again the Cook's distances differences are more clear in the subset of the data model:
I also imagine that part of the problem here is having only n=3 per group, although it was very striking that the most significant result of the analysis with the full model was the first outlier.
I guess you could check yourself by removing the offending sample and checking how the coefficients change.