I've downloaded someone else's microarray data (Affymetrix HG-U133 Plus 2.0, normalized with GCRMA) and noticed that many unexpected genes were differentially expressed by patient sex (about 30 males, 30 females). Although a few genes (e.g. the Y-linked EIF1AY) will show obvious sex-linkage in any human sample, in my experience such effects are not usually this strong or pervasive. I checked the headers in the CEL files and found a very strong batch effect: files processed in years one and two were overwhelmingly male, while those from year three were all female. I concluded that the effect is due to technical variation, or at least that it cannot be distinguished from such bias.
Many tools such as SAM allow you to specify batches. However, I wish to do downstream analysis using my own methods. What is the best approach to transform the data set to reduce the batch effect? I am resigned to losing any ability to detect true sex-specific gene expression. If I were only performing linear modeling, I could include the batch as a factor in my model. However, I'd also like to (for example) analyze correlation using Spearman's rank correlation, and I don't know of an obvious way to account for batch there.
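To make the question concrete, here is a minimal sketch of the simplest transform I can think of: centering each gene within each batch (equivalent to keeping the residuals of a linear model with batch as the only factor, plus the gene's overall mean) and then computing Spearman's correlation on the adjusted values. This is plain per-batch mean centering, not the empirical Bayes method from the papers mentioned below. The function names, the batch labels, and the toy data are all made up for illustration, and the data are assumed to be on a log scale with one row per gene.

```python
import numpy as np

def center_by_batch(expr, batch):
    """Subtract each gene's per-batch mean, then restore the gene's overall mean.

    expr  : (n_genes, n_samples) array of log-scale expression values
    batch : length n_samples array of batch labels
    """
    expr = np.asarray(expr, dtype=float)
    batch = np.asarray(batch)
    adjusted = expr.copy()
    grand_mean = expr.mean(axis=1, keepdims=True)
    for b in np.unique(batch):
        cols = batch == b
        # Remove this batch's mean from every gene (location adjustment only)
        adjusted[:, cols] -= expr[:, cols].mean(axis=1, keepdims=True)
    return adjusted + grand_mean

def spearman(x, y):
    """Spearman's rho as the Pearson correlation of ranks (assumes no ties)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

# Toy data (hypothetical): two independent genes sharing an additive batch shift
rng = np.random.default_rng(0)
batch = np.array(["year1_2"] * 30 + ["year3"] * 30)
shift = np.where(batch == "year3", 3.0, 0.0)
expr = rng.normal(8.0, 1.0, size=(2, 60)) + shift

adjusted = center_by_batch(expr, batch)
rho_before = spearman(expr[0], expr[1])      # inflated by the shared batch shift
rho_after = spearman(adjusted[0], adjusted[1])
```

In this toy case the shared shift alone produces a sizeable rank correlation between two otherwise unrelated genes, and centering removes it. Because sex is almost completely confounded with batch in my data, the same step would also remove any real sex effect, which I have already accepted. It only corrects batch means, not batch-specific variances.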
A quick literature search turned up Johnson et al., Biostatistics 2007, "Adjusting batch effects in microarray expression data using empirical Bayes methods", which in turn references Benito et al., Bioinformatics 2003, "Adjustment of systematic microarray data biases". Before I dive in any further, would anyone with expertise in this area care to comment on best practices?