Hi, I have a GWAS dataset in which the cases were genotyped on one array and the controls on another. Before merging, I ran QC on each dataset separately, including PCA, exactly as I have done for many previous datasets. The problem is that the association tests produce a high genomic inflation factor (~1.30 and ~2.10 for two different control datasets). I assumed that, although I had already run PCA and removed population outliers, my case dataset still had some population substructure. So I ran PCA on the merged case-control dataset and then logistic regression with PC1-2 (and PC1-5) as covariates. Basically nothing changed: lambda became 1.29 and 2.09 instead of 1.30 and 2.10. There are some extreme SNP outliers (n ≈ 20) which, judging by the intensity plots, are caused by technical problems. Removing them doesn't help at all either. The QQ plot suggests that about half of the 1.5 million SNPs are slightly shifted from the expected line. Could you help with the following questions: 1) What is (are) the most probable reason(s) for this weird result? 2) Why does correcting for PCs not help at all? (The eigenvalue for PC1 is 10, and for PC2-5 it is 3.) 3) How can I perform a meaningful association analysis and get statistically valid results instead of artifacts?
Thanks!
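For anyone following along, the lambda quoted above is usually taken as the median observed 1-df chi-square statistic divided by its expected median under the null (about 0.455). Here is a minimal sketch of that calculation, assuming you already have a list of per-SNP p-values from a 1-df test. The function name `genomic_inflation` is made up for illustration, and this is not necessarily how any particular tool computes it.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 df, i.e. (0.75 normal quantile)^2, ~0.4549.
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def genomic_inflation(p_values):
    """Return lambda = median observed 1-df chi-square / expected null median.

    Assumes two-sided p-values from a 1-df test; values outside (0, 1] are skipped.
    """
    chi2_stats = []
    for p in p_values:
        if not 0.0 < p <= 1.0:
            continue  # skip missing or invalid p-values
        z = _STD_NORMAL.inv_cdf(p / 2.0)  # lower-tail quantile, stable for small p
        chi2_stats.append(z * z)          # 1-df chi-square statistic is z squared
    if not chi2_stats:
        raise ValueError("no valid p-values supplied")
    return median(chi2_stats) / CHI2_1DF_MEDIAN


if __name__ == "__main__":
    # Perfectly uniform p-values should give a lambda very close to 1.
    uniform = [i / 1000 for i in range(1, 1000)]
    print(round(genomic_inflation(uniform), 3))
```

Because lambda depends only on the median, removing a handful of extreme SNPs will barely move it, which matches what is described above.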
I'll try to answer some of your questions. I hope someone with more experience will step in as well.
1) One of the following: a) the two arrays performed differently, b) you have some population structure/relatedness that is not detected by the PCs, c) some other unaccounted-for bias.
2) No idea.
3) Maybe perform the genotyping with cases and controls mixed across the arrays, instead of having all the cases on one array and all the controls on the other? Do you have any samples that have been run on both arrays? If so, what is the concordance rate? Another idea would be to estimate kinship and population structure and to correct for them explicitly in the analysis.
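If you do have a sample typed on both arrays, the concordance rate mentioned above can be checked with something like the sketch below. It assumes the genotypes are restricted to SNPs present on both arrays, are coded as allele counts (0/1/2) against the same reference allele and strand, and use `None` for missing calls. The function name `concordance_rate` is hypothetical.

```python
def concordance_rate(calls_a, calls_b, missing=None):
    """Fraction of identical genotype calls among SNPs called on both arrays.

    calls_a, calls_b: equal-length sequences of genotype codes for the same
    sample and the same SNPs, in the same order (assumed already aligned).
    """
    if len(calls_a) != len(calls_b):
        raise ValueError("genotype vectors must cover the same SNPs")
    compared = 0
    matched = 0
    for a, b in zip(calls_a, calls_b):
        if a is missing or b is missing:
            continue  # ignore SNPs that failed on either array
        compared += 1
        if a == b:
            matched += 1
    if compared == 0:
        raise ValueError("no SNPs called on both arrays")
    return matched / compared


if __name__ == "__main__":
    array_1 = [0, 1, 2, None, 1]
    array_2 = [0, 1, 1, 2, 1]
    print(concordance_rate(array_1, array_2))  # 3 of 4 comparable calls agree: 0.75
```

Looking at this per SNP across several duplicated samples, rather than per sample, would also point to the specific SNPs that behave differently between the two arrays.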