I am analyzing microarray data generated using Illumina Human HT-12 chips, and the samples were run in multiple batches. The data have been through the 'standard' GenomeStudio normalization steps but have not been adjusted for any batch effects.
In analyses testing an outcome of interest against the expression values, it is common to 'adjust' for batch effects by including batch as a factor (categorical) independent variable in the model.
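A minimal sketch of that factor-adjustment approach, using simulated data and Python's statsmodels rather than the R/limma tooling more usual for this chip (the column names, batch labels, and effect sizes here are all made up for illustration):

```python
# Per-gene linear model with batch included as a factor covariate.
# `expr` stands in for one probe's normalized log2 expression values.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "outcome": rng.normal(size=n),
    "batch": np.repeat(["b1", "b2", "b3"], n // 3),
})
# Simulate expression with additive batch shifts on top of a true
# outcome effect of 0.3 (all values hypothetical)
shift = df["batch"].map({"b1": 0.0, "b2": 0.8, "b3": -0.5})
df["expr"] = 0.3 * df["outcome"] + shift + rng.normal(scale=0.2, size=n)

# C(batch) expands the batch labels into indicator (dummy) variables,
# so each batch gets its own intercept offset
fit = smf.ols("expr ~ outcome + C(batch)", data=df).fit()
print(fit.params)
```

With the batch indicators in the design, the `outcome` coefficient is estimated after the per-batch mean shifts have been absorbed, which is exactly what 'adjusting for batch' means in this setting.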
I have also seen elsewhere that analysts may adjust for the relative log expression (RLE) mean to account for technical bias. RLE values are more commonly used to assess batch effects via per-sample boxplots; from the boxplots of my data I can see that a couple of the batches have noticeably higher RLE means, but not all of them.
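For concreteness, a sketch of how per-sample RLE means are computed: each gene's log expression minus its median across samples, averaged within each sample. The array shape, batch split, and 0.5 shift below are assumptions for illustration only:

```python
# Per-sample RLE means from a (genes x samples) log-expression matrix.
import numpy as np

rng = np.random.default_rng(1)
log_expr = rng.normal(loc=8.0, scale=1.0, size=(1000, 12))
log_expr[:, 6:] += 0.5  # simulate a technical shift in a second batch

# RLE: each gene's log expression minus its median across all samples
rle = log_expr - np.median(log_expr, axis=1, keepdims=True)

# One RLE mean per sample; the shifted samples stand out
rle_means = rle.mean(axis=0)
print(np.round(rle_means, 2))
```

These are the per-sample summaries that the boxplots visualize, and the values one would feed into a model as a continuous covariate.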
My question is which method most accurately accounts for the technical variability introduced by the batches?
My feeling is that using the RLE mean values is best because, not only is this a continuous variable, it is actually derived from the data itself! The batches may not necessarily have affected the expression, so including them as covariates regardless must introduce some noise into the model. By contrast, including the RLE means as a covariate, since they are based exclusively on the expression data, should only account for the observed technical variation. Is this rationale logical? Have I overlooked anything? Many thanks.