I'm trying to combine data from 4 studies with my own in order to add context to my data. There are some variations in regional collection that I want to highlight, but study differences seem to dominate the first and second eigenvectors of my PCA, which makes me worry the batch effect could be beyond redemption. Of the studies I added, one is supposed to be an outgroup, one has similar methodology to my own, one has geographical overlap, and one has both similar methodology and similar geography. The two larger studies involve multiple geographic areas. Ultimately, there's enough common ground that I think it should be possible to tease out the batch effect.
So far I've tried being incredibly restrictive with SNP quality (I've tried both hard filtering and VQSR, as well as removing SNPs with different degrees of missingness, and even removing some samples), and I've tried using PLINK to remove SNPs that correlate strongly with one study or another.
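To make the "remove SNPs that correlate with study" step concrete, here is a minimal sketch of one way such a filter could work. It is not necessarily what PLINK does internally, and every function name here is hypothetical. It assumes genotypes are coded as 0/1/2 alternate-allele counts (None for missing) and that the two groups being compared come from different studies but the same geographic region, so that a large allele-frequency difference is more likely technical than biological.

```python
import math

def allele_counts(genotypes):
    """Return (alt, ref) allele counts from 0/1/2 calls; None means missing."""
    called = [g for g in genotypes if g is not None]
    alt = sum(called)
    ref = 2 * len(called) - alt
    return alt, ref

def batch_chi2_pvalue(geno_a, geno_b):
    """1-df chi-square test on the 2x2 table of allele counts for two studies."""
    a_alt, a_ref = allele_counts(geno_a)
    b_alt, b_ref = allele_counts(geno_b)
    row_a, row_b = a_alt + a_ref, b_alt + b_ref
    col_alt, col_ref = a_alt + b_alt, a_ref + b_ref
    if 0 in (row_a, row_b, col_alt, col_ref):
        return 1.0  # monomorphic site or no calls in one study: nothing to test
    n = row_a + row_b
    chi2 = n * (a_alt * b_ref - a_ref * b_alt) ** 2 / (row_a * row_b * col_alt * col_ref)
    # Survival function of a chi-square with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))

def flag_batch_snps(snps, alpha=1e-3):
    """Return IDs of SNPs whose allele frequencies differ between the two studies.

    `snps` maps a SNP ID to (genotypes_in_study_a, genotypes_in_study_b).
    `alpha` is an arbitrary illustrative threshold, not a recommendation.
    """
    return [snp_id for snp_id, (ga, gb) in snps.items()
            if batch_chi2_pvalue(ga, gb) < alpha]

# Toy data: one SNP with identical frequencies, one fixed for opposite alleles.
example = {
    "snp_same": ([0, 1, 2, 1] * 10, [1, 0, 2, 1] * 10),
    "snp_diff": ([0] * 40, [2] * 40),
}
flagged = flag_batch_snps(example)
```

The important design choice is which samples go into the comparison: testing across studies that also differ in geography would strip out exactly the regional signal I want to keep.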
I'm using SNPRelate for PCA (though I'm open to other suggestions) and I'm pruning variants that are in linkage disequilibrium (I've played with the LD threshold a bit, but it doesn't seem to make a difference, and with PCA being linear in nature I prefer to err on the side of caution and keep the LD pruning).
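One way to put a number on "study dominates the first PCs" is to compute, for each PC, the fraction of its variance explained by study label and compare it with the fraction explained by region. The sketch below is a generic eta-squared calculation, independent of SNPRelate; it assumes the per-sample PC scores have already been exported, and the function name is hypothetical.

```python
def variance_explained_by_group(scores, labels):
    """Fraction of variance in `scores` explained by group membership (eta-squared)."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    grand_mean = sum(scores) / len(scores)
    total_ss = sum((s - grand_mean) ** 2 for s in scores)
    if total_ss == 0:
        return 0.0  # constant scores: nothing to explain
    groups = {}
    for score, label in zip(scores, labels):
        groups.setdefault(label, []).append(score)
    # Between-group sum of squares: group size times squared offset of group mean.
    between_ss = sum(len(v) * (sum(v) / len(v) - grand_mean) ** 2
                     for v in groups.values())
    return between_ss / total_ss

# Toy PC1 scores that separate perfectly by study and not at all by region.
pc1 = [-1.0, -1.0, 1.0, 1.0]
by_study = variance_explained_by_group(pc1, ["A", "A", "B", "B"])
by_region = variance_explained_by_group(pc1, ["north", "south", "north", "south"])
```

Re-running this after each filtering step would show whether the study signal on PC1 and PC2 is actually shrinking relative to the regional signal.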
Has anyone dealt with this issue before? Is it a lost cause? Can I still use these data in genome scans for selection if the population structure isn't working out quite right, or do I need to abandon any part of the study that requires adding in other studies?
Thank you for your input!
Just to be sure... when you speak of 'batch effect', you are simply referring to the fact that you have samples from different geographical regions?