I am trying to analyze a microarray dataset from NCBI GEO (GSE92538; platform IDs GPL10526 and GPL17027). I want to perform a gene expression analysis comparing SCZ and control samples. Before doing that, I want to check whether there are any batch effects, since there are several potential covariates (such as age, race, post-mortem interval, and brain pH), and correct for them if required.
But looking at the metadata, I found that the 58 SCZ samples and 176 control samples differ in platform ID (mostly GPL10526 for SCZ samples and GPL17027 for control samples), cohort (schiz_cohort 1, 2), processing location (UCDavis, UCIrvine, UcMichigan), and QC batch (qc_batch 1-7).
I am confused about how to detect the batch effect.
Would it be reasonable to take one platform ID at a time (e.g., GPL10526), split the samples by cohort (e.g., schiz_cohort_1 for both SCZ and control samples, schiz_cohort_2 for both SCZ and control samples, etc.), and draw a PCA plot to look for the batch effect?
Then how do I define the batch effect? Or should I analyze the samples in separate batches (e.g., qc_batch 1 for both SCZ and controls, etc.)?
And then repeat the same steps for platform ID GPL17027?
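As a concrete version of the PCA check described above, here is a minimal sketch in Python with NumPy. It is not tied to the actual GSE92538 data: the expression matrix is simulated, and the function names `pca_scores` and `variance_explained_by` are hypothetical helpers introduced only for illustration. The idea is to compute the principal components of the samples and then ask how much of each component is explained by each metadata variable (platform, processing location, QC batch, and so on).

```python
import numpy as np

def pca_scores(expr, n_components=2):
    """PCA of samples. `expr` is a genes x samples matrix of log-expression.
    Returns (scores, explained): scores is samples x n_components."""
    # Center each gene across samples so PCs describe between-sample variation
    centered = expr - expr.mean(axis=1, keepdims=True)
    # SVD of the samples x genes matrix; columns of u * s are the PC scores
    u, s, _ = np.linalg.svd(centered.T, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]

def variance_explained_by(labels, values):
    """Fraction of the variance of `values` explained by the group means
    of a categorical variable `labels` (one-way ANOVA R^2)."""
    labels = np.asarray(labels)
    values = np.asarray(values, dtype=float)
    grand_mean = values.mean()
    total = np.sum((values - grand_mean) ** 2)
    between = sum(
        np.sum(labels == g) * (values[labels == g].mean() - grand_mean) ** 2
        for g in np.unique(labels)
    )
    return between / total

# Simulated toy data (assumption: NOT the real dataset)
rng = np.random.default_rng(0)
n_genes, n_samples = 200, 40
platform = np.array(["GPL10526"] * 20 + ["GPL17027"] * 20)
site = np.array(["UCDavis", "UCIrvine"] * 20)  # balanced across platforms
expr = rng.normal(size=(n_genes, n_samples))
# Add a gene-specific shift to every sample from one platform
expr[:, platform == "GPL17027"] += 2.0 * rng.normal(size=(n_genes, 1))

scores, explained = pca_scores(expr)
r2_platform = variance_explained_by(platform, scores[:, 0])
r2_site = variance_explained_by(site, scores[:, 0])
print(f"PC1 explained by platform: {r2_platform:.2f}, by site: {r2_site:.2f}")
```

In this toy example PC1 is almost entirely explained by platform and hardly at all by site, which is the kind of pattern that would suggest treating platform as a batch variable. On real data the same check could be repeated for each metadata column and for the first few components.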
I am confused.
Any help would be appreciated.
Thank you
Hi Malachi Griffith,
Thank you for your help. I used the GEO platform as the batch variable. I merged the two datasets (DATASET1-GSE92538-GPL10526, DATASET2-GSE92538-GPL17027) and removed batch effects using the removeBatchEffect function of the limma R package. But the result I got is confusing.
The first UMAP plot (before removing the batch effect) looked mostly fine, with only a small batch effect (a few samples from the same platform clustered together). But after removing the batch effect, the UMAP plot appeared to show an even stronger batch effect, with a larger number of samples from the same platform clustering together.
I am a bit confused. Did I make a mistake somewhere? If not, why am I seeing the opposite trend? Or am I misinterpreting the plots?
Photos are attached for your reference.
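For reference, the kind of correction described above can be sketched as a per-gene linear regression: fit expression on the biological group plus the batch, then subtract only the fitted batch term. The Python/NumPy sketch below illustrates that general idea on simulated data. It is not limma's code, and the function `regress_out_batch` and its `group` argument are hypothetical names used here only for illustration (in limma, the variables to protect are passed through the `design` argument of removeBatchEffect).

```python
import numpy as np

def regress_out_batch(expr, batch, group=None):
    """Subtract the fitted batch term from a genes x samples matrix.
    `group` (optional) is the biological condition to keep in the model
    so that its effect is not absorbed into the batch term."""
    batch = np.asarray(batch)
    n = expr.shape[1]
    # Sum-to-zero coding for batch: +1 for own level, -1 for the last level
    levels = np.unique(batch)
    ind = (batch[:, None] == levels[None, :]).astype(float)
    batch_cols = ind[:, :-1] - ind[:, [-1]]
    # Protected part of the model: intercept plus group indicators
    cols = [np.ones((n, 1))]
    if group is not None:
        group = np.asarray(group)
        g_levels = np.unique(group)
        cols.append((group[:, None] == g_levels[None, 1:]).astype(float))
    design = np.hstack(cols)
    # Least-squares fit of every gene on [design, batch]
    full = np.hstack([design, batch_cols])
    coef, *_ = np.linalg.lstsq(full, expr.T, rcond=None)
    batch_coef = coef[design.shape[1]:, :]
    # Remove only the batch contribution
    return expr - (batch_cols @ batch_coef).T

# Simulated toy data (assumption: NOT the real dataset). Group and platform
# are deliberately unbalanced: most SCZ on one platform, most controls on the other.
rng = np.random.default_rng(1)
batch = np.array(["GPL10526"] * 20 + ["GPL17027"] * 20)
group = np.array(["SCZ"] * 15 + ["Control"] * 5 + ["SCZ"] * 5 + ["Control"] * 15)
n_genes = 100
expr = rng.normal(scale=0.5, size=(n_genes, 40))
expr[:, batch == "GPL17027"] += rng.normal(scale=2.0, size=(n_genes, 1))  # platform shift
expr[:10, group == "SCZ"] += 1.0  # true group effect on the first 10 genes

def control_batch_gap(x):
    """Mean absolute difference between platforms, among controls only."""
    ctrl = group == "Control"
    a = x[:, ctrl & (batch == "GPL10526")].mean(axis=1)
    b = x[:, ctrl & (batch == "GPL17027")].mean(axis=1)
    return np.mean(np.abs(a - b))

corrected = regress_out_batch(expr, batch, group)
gap_raw = control_batch_gap(expr)
gap_corrected = control_batch_gap(corrected)
print(f"control platform gap: raw {gap_raw:.2f}, corrected {gap_corrected:.2f}")
```

When the group and the platform are as unbalanced as in this dataset (mostly GPL10526 for SCZ, mostly GPL17027 for controls), leaving the group out of the model means part of the SCZ/control difference can be attributed to the batch term and removed along with it, so it may be worth checking whether the diagnosis was supplied when the correction was run.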