I'm trying to limit batch effects when combining two microarray experiments that were run on similar but perhaps slightly different versions of the Affymetrix U133A array. I want to use this combined dataset as a test dataset for a gene signature that I've found on an external dataset.
I've background corrected, log2 transformed, and then tried both RMA normalization and YuGene transformation separately on the combined dataset to mitigate batch effects. When I run a PCA using the genes from my signature, I do see separation by my experimental groups (disease vs. healthy) on PC1, but the two experiments separate on PC2. Is it okay to move forward and use this as test data for a classifier/further analysis? Or should I be doing something more to minimize the batch effects?
In the attached images, pink and aqua symbolize disease and healthy while blue and navy symbolize the two experiments.
Thanks for reading through.
Can you apply your "classifier/further analysis" to each batch separately? Otherwise, why the need to combine them?
There are batch correction methods like limma::removeBatchEffect.
Hi, thanks so much. I tried limma::removeBatchEffect and the result looks roughly the same. I wanted to combine the data because that would give a larger validation dataset, but perhaps that's not statistically sound based on these PCAs.
I have a hard time believing that. The PCA separation in PC2 is clearly the difference between datasets and both datasets contain samples of both groups, so standard regression approaches should take care of that. Can you share your code?
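To illustrate what a regression-based correction does, here is a minimal sketch for a single gene. It is plain Python on made-up numbers, not limma's implementation; the function names (solve_linear, remove_batch_effect) and the toy values are hypothetical. It fits expression = intercept + group effect + batch effect by least squares and subtracts only the fitted batch term, so the disease vs. healthy difference is kept.

```python
def solve_linear(a, b):
    """Solve the small dense system a x = b by Gaussian elimination
    with partial pivoting. Fails on a singular matrix, which is what
    happens when batch and group are completely confounded."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        partial = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - partial) / m[i][i]
    return x


def remove_batch_effect(y, group, batch):
    """Return y (one gene, log2 scale) with the fitted batch term removed.

    group and batch are lists of 0/1 labels, one per sample.
    The batch column is centered so the overall mean of y is unchanged.
    """
    n = len(y)
    mean_batch = sum(batch) / n
    centered = [b - mean_batch for b in batch]
    # Design matrix columns: intercept, group, centered batch.
    rows = [[1.0, float(g), c] for g, c in zip(group, centered)]
    k = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    _, _, batch_coef = solve_linear(xtx, xty)
    # Subtract only the batch term; the group effect stays in the data.
    return [v - batch_coef * c for v, c in zip(y, centered)]


if __name__ == "__main__":
    # Toy data (assumed): 2 batches x 2 groups, 2 samples per cell.
    group = [0, 0, 1, 1, 0, 0, 1, 1]
    batch = [0, 0, 0, 0, 1, 1, 1, 1]
    y = [5.0 + 2.0 * g + 1.5 * b for g, b in zip(group, batch)]
    print(remove_batch_effect(y, group, batch))
```

This only works because the design is not confounded: each batch contains samples from both groups, as in your two experiments.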