EDIT: Also posted at https://support.bioconductor.org/p/p134222/
Hi friends,
I'm working with DNA methylation array data that has two groups (control vs treatment) of moderate-to-small size (n = 14 vs n = 11). I'm using limma
(i.e. linear models) for the differential testing, and I've been inspecting how the results change across models with different covariates.
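For reference, the baseline comparison looks roughly like this (a minimal sketch, assuming `mvals` is a probes-by-samples matrix of M-values and `group` is a factor with levels "control" and "treatment"; both names are placeholders):

```r
library(limma)

## two-group design: intercept + treatment effect
design <- model.matrix(~ group)

fit <- lmFit(mvals, design)
fit <- eBayes(fit)

## DMPs at FDR < 0.05 for the treatment-vs-control coefficient
dmps <- topTable(fit, coef = "grouptreatment", number = Inf, p.value = 0.05)
nrow(dmps)
```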
A priori, the two groups look clearly different: when I check the samples by PCA, the two clusters are visible. Consistent with that, when I test with linear models (no covariates), or even with a simple t-test or Wilcoxon test, I get around 2000-3000 DMPs (differentially methylated probes).
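This is how I check the separation (a sketch, same assumed objects as above):

```r
## PCA on the samples (columns of mvals), coloured by group
pca <- prcomp(t(mvals))
plot(pca$x[, 1:2], col = as.integer(group), pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(group),
       col = seq_along(levels(group)), pch = 19)
```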
After seeing these results, I ran sva,
which found up to 5 significant SVs, and I then refit the models including 1, 2, 3... SVs sequentially. Interestingly, with 1 or 2 SVs I get the same number of DMPs (2-3k), while with 3, 4 or 5 SVs the DMPs abruptly drop to 0. I also tried adding predicted cell type composition to the model (with or without the SVs); again, the DMPs drop to 0.
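Roughly what I'm doing for the SV models (a sketch, not my exact code; `mvals` and `group` as above):

```r
library(sva)
library(limma)

mod  <- model.matrix(~ group)                        # full model
mod0 <- model.matrix(~ 1, data = data.frame(group))  # null model
svobj <- sva(mvals, mod, mod0)                       # estimate the SVs

## refit the limma model with the first k SVs and count DMPs at FDR < 0.05
count_dmps <- function(k) {
  sv <- svobj$sv[, seq_len(k), drop = FALSE]
  colnames(sv) <- paste0("SV", seq_len(k))
  fit <- eBayes(lmFit(mvals, cbind(mod, sv)))
  nrow(topTable(fit, coef = "grouptreatment", number = Inf, p.value = 0.05))
}
sapply(seq_len(ncol(svobj$sv)), count_dmps)
```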
Given these observations, I then tried regressing these covariates (SVs or cell type composition) out of the data first and testing afterwards. In that case I recover the previous number of DMPs (2-3k) with any number of SVs, and also when correcting for cell type composition.
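Concretely, the regress-then-test version was along these lines (a sketch using `limma::removeBatchEffect` for the regression step; objects as above):

```r
## remove the SVs (or cell-type estimates) from the data while
## protecting the group effect, then refit the plain two-group model
mvals_adj <- removeBatchEffect(mvals,
                               covariates = svobj$sv,
                               design     = model.matrix(~ group))

fit_adj <- eBayes(lmFit(mvals_adj, model.matrix(~ group)))
nrow(topTable(fit_adj, coef = "grouptreatment", number = Inf, p.value = 0.05))
```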
My feeling is that I'm overfitting the models: with >2 SVs, or with the cell type compositions, I have too many coefficients for my small sample size (the cell types alone add 6 coefficients, for CD4, CD8, NK, etc.), which probably inflates the uncertainty of the coefficient estimates and therefore weakens the testing.
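A back-of-the-envelope count of residual degrees of freedom, just to make the concern concrete (illustrative numbers, assuming the usual six reference cell types):

```r
n_samples <- 14 + 11
p_group   <- 2   # intercept + group coefficient
p_sv      <- 5   # five surrogate variables
p_cells   <- 6   # e.g. CD4T, CD8T, NK, B cells, monocytes, granulocytes
n_samples - (p_group + p_sv + p_cells)   # residual df left for the error variance
```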
Have you ever encountered this issue? What do you think is causing it, and how have you handled it?
(I know the recommendation is not to modify the input data, but rather to incorporate the covariates in the models.)