Hi!
I am performing a validation study across a series of methylation datasets generated on the Illumina EPIC methylation platform. In short, we found some interesting targets in our own sample group, and now we want to evaluate their performance in external cohorts where the raw .idat files are available. My first approach was to assess those targets using the original data from each cohort, applying the same pre-filtering and normalization methods to all of them.
Additionally, I performed batch correction across all the cohorts to further investigate the performance of these targets. I don't know whether this is expected or not, but the batch-corrected samples did show a better outcome than the uncorrected data. That is where I am a bit cautious: does this really show that the targets are strong, or are the results now biased by the batch correction?
Now I am trying to tackle this, and my best rationale so far is: although batch correction adjusts the overall values, I would expect the variance within the same group (cohort) not to differ between pre- and post-batch correction. Does that make sense? After reading some threads here and there, my other question is:
Would Levene's test be a good post-hoc approach to evaluate batch-correction bias?
And here is how I have been trying to approach it so far:
Example data (4 different cohorts and only 3 targets):
set.seed(465)
mydf <- data.frame(group   = rep(paste0("cohort_", 1:4), each = 54),
                   batch   = rep(rep(c("pre", "post"), each = 27), 4),
                   target1 = rnorm(216, 10, 2),
                   target2 = rnorm(216, 10, 2),
                   target3 = rnorm(216, 10, 2))
To make things simple, I created a single categorical column by merging group and batch:
mydf <- data.frame(group=as.factor(paste(mydf$group, mydf$batch, sep="_")), mydf[,3:5])
And what I have tried so far:
Approach 1:
Applying Levene's test within each cohort (4), comparing pre- vs post-batch correction, for each target (3); that is, repeating this process 4 x 3 = 12 times:
leveneTest(mydf$target1[grepl("cohort_1", mydf$group)] ~ mydf$group[grepl("cohort_1", mydf$group)])
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  1  0.5507 0.4614
      52
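Rather than typing this out 12 times, the 4 cohorts x 3 targets can be looped over. A minimal base-R sketch, recreating the example data; here Levene's test (center = median, i.e. the Brown-Forsythe variant that leveneTest uses by default) is computed directly as a one-way ANOVA on absolute deviations from the group medians, so the car package is not required (the helper name levene_p is mine):

```r
set.seed(465)
mydf <- data.frame(group   = rep(paste0("cohort_", 1:4), each = 54),
                   batch   = rep(rep(c("pre", "post"), each = 27), 4),
                   target1 = rnorm(216, 10, 2),
                   target2 = rnorm(216, 10, 2),
                   target3 = rnorm(216, 10, 2))

# Brown-Forsythe / Levene test (center = median) in base R:
# one-way ANOVA on absolute deviations from each group's median
levene_p <- function(y, g) {
  g   <- factor(g)
  dev <- abs(y - ave(y, g, FUN = median))
  anova(lm(dev ~ g))[["Pr(>F)"]][1]
}

# 4 cohorts x 3 targets = 12 pre- vs post-correction tests
res <- sapply(paste0("target", 1:3), function(tg) {
  sapply(paste0("cohort_", 1:4), function(ch) {
    sub <- mydf[mydf$group == ch, ]
    levene_p(sub[[tg]], sub$batch)
  })
})
round(res, 4)
```

The cohort_1 / target1 entry should match the leveneTest output above (p = 0.4614), since leveneTest(center = median) computes exactly this ANOVA.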
Approach 2:
Applying Levene's test across all cohort/batch groups jointly (the merged factor), repeating this for each target:
leveneTest(target1 ~ group, data=mydf)
Levene's Test for Homogeneity of Variance (center = median)
       Df F value Pr(>F)
group   7  0.8296 0.5637
      208
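One caveat with this formula: group here is the merged 8-level factor, so the test asks whether variance is homogeneous across all cohort x batch combinations at once (hence Df = 7), not specifically pre vs post. If the intent really is pre- vs post-correction irrespective of cohort, you could test the batch factor alone. A minimal base-R sketch, recreating the example data:

```r
set.seed(465)
mydf <- data.frame(group   = rep(paste0("cohort_", 1:4), each = 54),
                   batch   = rep(rep(c("pre", "post"), each = 27), 4),
                   target1 = rnorm(216, 10, 2),
                   target2 = rnorm(216, 10, 2),
                   target3 = rnorm(216, 10, 2))

# Levene's test (center = median) on pre vs post, pooling all cohorts:
# one p-value per target, Df = 1 instead of 7
pvals <- sapply(paste0("target", 1:3), function(tg) {
  dev <- abs(mydf[[tg]] - ave(mydf[[tg]], mydf$batch, FUN = median))
  anova(lm(dev ~ mydf$batch))[["Pr(>F)"]][1]
})
round(pvals, 4)
```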
Approach 3:
Performing multiple comparisons among all groups and targets at the same time, adapting the Tukey HSD approach (trying to follow this thread):
A. First calculating the residuals:
for (i in 2:ncol(mydf)) {
  mydf <- data.frame(mydf, abs(resid(lm(mydf[, i] ~ group, data = mydf))))
  colnames(mydf)[ncol(mydf)] <- paste0("resid_", colnames(mydf)[i])
}
B. And now performing the Tukey HSD and arranging a table with all targets:
mydfTukey <- data.frame(matrix(NA, nrow = 4))
for (i in 5:7) {
  temp <- TukeyHSD(aov(mydf[, i] ~ group, data = mydf))
  # I'm just looking for the pre- vs post-batch-correction comparison within each cohort
  temp$group <- temp$group[c("cohort_1_pre-cohort_1_post", "cohort_2_pre-cohort_2_post",
                             "cohort_3_pre-cohort_3_post", "cohort_4_pre-cohort_4_post"), ]
  colnames(temp$group) <- paste(colnames(temp$group),
                                gsub("resid_", "", colnames(mydf)[i]), sep = "_")
  mydfTukey <- cbind(mydfTukey, temp$group)
}
mydfTukey[-1]
diff_target1 lwr_target1 upr_target1 p adj_target1 diff_target2 lwr_target2 upr_target2 p adj_target2
cohort_1_pre-cohort_1_post -0.2289024 -1.2229028 0.7650979 0.9967802 -0.29949107 -1.3095320 0.7105499 0.9850556
cohort_2_pre-cohort_2_post -0.2084944 -1.2024947 0.7855060 0.9982188 0.29747936 -0.7125616 1.3075203 0.9856324
cohort_3_pre-cohort_3_post 0.1732000 -0.8208004 1.1672003 0.9994659 -0.05661085 -1.0666518 0.9534301 0.9999998
cohort_4_pre-cohort_4_post -0.5945412 -1.5885416 0.3994592 0.5993827 -0.22884474 -1.2388857 0.7811962 0.9970932
diff_target3 lwr_target3 upr_target3 p adj_target3
cohort_1_pre-cohort_1_post 0.08096779 -0.9310778 1.0930134 0.9999973
cohort_2_pre-cohort_2_post 0.18977374 -0.8222718 1.2018193 0.9991363
cohort_3_pre-cohort_3_post 0.08224633 -0.9297992 1.0942919 0.9999970
cohort_4_pre-cohort_4_post -0.07725420 -1.0892998 0.9347914 0.9999980
And from this, I would use the adjusted p-value (p adj) for each target and each comparison (row names).
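For what it's worth, steps A and B can be condensed into one pass that returns just those adjusted p-values, without the intermediate resid_ columns. A sketch that recreates the example data (base R only; the names padj, d2, and rows are mine):

```r
set.seed(465)
mydf <- data.frame(group = as.factor(paste(rep(paste0("cohort_", 1:4), each = 54),
                                           rep(rep(c("pre", "post"), each = 27), 4),
                                           sep = "_")),
                   target1 = rnorm(216, 10, 2),
                   target2 = rnorm(216, 10, 2),
                   target3 = rnorm(216, 10, 2))

# rows of interest: pre vs post within each cohort
rows <- paste0("cohort_", 1:4, "_pre-cohort_", 1:4, "_post")

# per target: absolute residuals -> Tukey HSD -> keep only those rows' p adj
padj <- sapply(paste0("target", 1:3), function(tg) {
  d2 <- data.frame(dev   = abs(resid(lm(mydf[[tg]] ~ mydf$group))),
                   group = mydf$group)
  TukeyHSD(aov(dev ~ group, data = d2))$group[rows, "p adj"]
})
round(padj, 4)
```

With the same seed, this should reproduce the p adj columns of the table above.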
As you can see from these alternatives, I am not sure which approach is best, nor whether this is the right direction at all. If you have any input on this, I would truly appreciate it.
NOTE: I have also tried reaching out here and here, but haven't heard anything.
One reason you probably got no answers is that, although extensive code is generally appreciated, you make it hard for the reader to really get to the bottom of the question because it is somewhat tl;dr.
Try to ask a precise question, pointing out what exactly you are correcting: within the datasets, or across all datasets? Try to briefly explain the concepts and the data. The code is, I think, secondary in this case.