I am analyzing a dataset of ~1000 Illumina microarrays from a human population. There are no defined subgroups, as the data come from a healthy "normal" population. Nevertheless, after quantile normalization (using the normalizeBetweenArrays function of the limma R package) of the log2-transformed data, a PCA plot reveals two very distinct clusters (the smaller consisting of ~130 samples).
The only pattern I have found is that arrays with high raw expression (high average signal in the raw data) are overrepresented in the smaller cluster. However, such arrays are not unique to that cluster, so this does not fully explain the distinct separation.
I have failed to find any other explanation for this separation (which, by the way, is not visible in the raw data). There is no connection to age or gender, no particular genes drive the separation (judged by inspecting the PCA loadings), and there is no apparent influence from highly or lowly expressed genes or from genes with high or low variance.
It could perhaps be some sort of batch effect. However, we do not have access to information such as experiment date, operator, etc.
Has anyone encountered a similar situation, or does anyone have suggestions for other things to check that could explain the unexpected separation?
Thanks!
I suspect you are seeing a batch effect. It is not uncommon for differences between batches to explain more of the variance in your data than the biological effects of interest. The fact that total expression levels in the raw data also differ between the two groups supports that idea.
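If it helps, here is a rough sketch of how this could be checked. It is in Python with NumPy rather than limma, and it runs on simulated data, not on anything from your study. The function names (`quantile_normalize`, `pc1_scores`, `split_on_largest_gap`) and all simulation parameters are my own assumptions for illustration. The normalization here does not treat ties specially, so it should not be expected to reproduce normalizeBetweenArrays exactly. The idea is to simulate a hidden batch with probe-specific shifts and a higher overall raw signal, quantile-normalize, recover the two clusters from the first principal component, and then compare raw mean intensity between the clusters.

```python
import numpy as np


def quantile_normalize(x):
    """Quantile-normalize the columns (arrays) of a probes-by-arrays matrix.

    Simplified: ties are not handled specially.
    """
    order = np.argsort(x, axis=0)
    ranks = np.argsort(order, axis=0)                # rank of each probe within its array
    mean_quantiles = np.sort(x, axis=0).mean(axis=1)  # reference distribution
    return mean_quantiles[ranks]


def pc1_scores(x):
    """Score of each array on the first principal component."""
    centered = x - x.mean(axis=1, keepdims=True)      # center each probe across arrays
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    return s[0] * vt[0]


def split_on_largest_gap(scores):
    """Split arrays into two groups at the largest gap in sorted PC1 scores."""
    sorted_scores = np.sort(scores)
    gaps = np.diff(sorted_scores)
    i = np.argmax(gaps)
    cut = sorted_scores[i] + gaps[i] / 2
    return scores > cut


# --- Simulated data (all values below are made up for illustration) ---
rng = np.random.default_rng(0)
n_probes, n_arrays, n_batch = 500, 60, 15
batch = np.arange(n_arrays) < n_batch                 # hidden batch label

probe_level = rng.normal(8.0, 1.5, n_probes)          # baseline log2 signal per probe
probe_batch_shift = rng.normal(0.0, 0.8, n_probes)    # probe-specific batch effect
array_shift = rng.normal(0.0, 0.3, n_arrays) + 0.5 * batch  # brighter arrays in batch
noise = rng.normal(0.0, 0.3, (n_probes, n_arrays))

raw = (probe_level[:, None]
       + probe_batch_shift[:, None] * batch[None, :]
       + array_shift[None, :]
       + noise)

# --- Analysis: normalize, find the two clusters, compare raw intensity ---
norm = quantile_normalize(raw)
cluster = split_on_largest_gap(pc1_scores(norm))

# Label the smaller cluster as the "unexpected" one.
small = cluster if cluster.sum() < n_arrays / 2 else ~cluster
raw_mean = raw.mean(axis=0)
diff = raw_mean[small].mean() - raw_mean[~small].mean()

print("arrays in smaller cluster:", int(small.sum()))
print("raw mean log2 signal, smaller minus larger cluster:", round(float(diff), 3))
```

On real data, `raw` would be replaced by your probes-by-arrays log2 matrix. The resulting cluster labels could then be cross-tabulated against whatever array-level metadata you can still recover, to see whether anything lines up with the split.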