I am conducting a candidate gene association analysis (60 SNPs across 5 genes located on different chromosomes). These SNPs were selected based on significant associations from previous studies. Sample size 250 (110 cases, 130 controls).
What is the correct way to test for population stratification in a case-control design with 60 SNPs? I have used PCA to control for population stratification in a GWAS with a larger dataset before, but I am not sure whether I can use the same approach with only 60 SNPs.
All the samples are supposedly from a European cohort (self-reported).
The [best] answer to this depends on what you have available.
For example, do you have other genotyping data specifically on those individuals? If you have those same patients and controls genotyped on another array, probably your best bet would be to run PCA on that data and enter the top principal components (the per-sample scores, not the SNP loadings) as covariates, as you describe.
If you do not have additional data on them, it is tougher. Ideally you would know this is an ancestry-matched comparison for sure. If you don't, though, what I would recommend is to wing it (i.e. generate the p-values without controlling for ancestry as a covariate), but then to double-check using allele frequency (AF). Here is what I mean:
If the samples are all thought to be European, but you want to confirm this, you could get allele frequencies from the most similar European reference population (e.g. from the Ensembl Variant Effect Predictor). You could then see whether your genotypes (in particular the controls) have fairly similar AF to the reference, ideally for all 60 SNPs. If your controls, as a group, have frequencies close to the reference population, at least those are highly likely to be from the right population.
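The AF check described above could be sketched roughly like this. It is only an illustration: the SNP IDs, dosages, reference frequencies and the 0.10 cutoff are made-up placeholders, and it assumes genotypes are coded as 0/1/2 copies of the same allele that the reference frequency refers to.

```python
import math

def alt_allele_frequency(dosages):
    """Allele frequency from 0/1/2 dosages, skipping missing (None) calls."""
    called = [d for d in dosages if d is not None]
    if not called:
        raise ValueError("no called genotypes")
    return sum(called) / (2 * len(called))

def compare_to_reference(control_dosages, reference_af, max_abs_diff=0.10):
    """Compare control AF to a reference AF for each SNP.

    max_abs_diff is an arbitrary illustrative cutoff, not an established standard.
    """
    rows = []
    for snp, dosages in control_dosages.items():
        n_called = sum(d is not None for d in dosages)
        af = alt_allele_frequency(dosages)
        ref = reference_af[snp]
        # Binomial standard error of the sample AF over 2N chromosomes,
        # if the true frequency were the reference value.
        se = math.sqrt(ref * (1 - ref) / (2 * n_called))
        z = (af - ref) / se if se > 0 else None
        rows.append({
            "snp": snp,
            "control_af": af,
            "reference_af": ref,
            "z": z,
            "flagged": abs(af - ref) > max_abs_diff,
        })
    return rows

# Hypothetical toy data: two SNPs, eight controls each.
controls = {
    "rs_example_1": [0, 1, 1, 2, 0, 1, None, 0],
    "rs_example_2": [2, 2, 2, 1, 2, 2, 2, 2],
}
reference = {"rs_example_1": 0.35, "rs_example_2": 0.40}

results = compare_to_reference(controls, reference)
for row in results:
    print(row)
```

A single flagged SNP is more likely a strand or allele-coding problem than stratification; many SNPs drifting in a consistent way would be more worrying.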
You can further increase confidence by seeing whether the odds ratios (OR) you generate for the 60 SNPs are in the same direction as prior reports (for the same effect allele). If the AF and OR are consistent and QC looks good, that might be about as well as you can do unless you have additional genotyping data on these individuals.
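For the direction check, a minimal sketch might look like the following. It uses a plain allelic 2x2 odds ratio rather than whatever model the original papers fitted, and the counts and reported ORs are invented for illustration. It also assumes your effect allele is the same one the prior report used.

```python
import math

def allelic_odds_ratio(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Allelic odds ratio from a 2x2 table of allele counts.

    Adds 0.5 to every cell (Haldane-Anscombe correction) if any cell is zero.
    """
    cells = [case_alt, case_ref, ctrl_alt, ctrl_ref]
    if any(c == 0 for c in cells):
        cells = [c + 0.5 for c in cells]
    a, b, c, d = cells
    return (a * d) / (b * c)

def same_direction(observed_or, reported_or):
    """True if both ORs lie on the same side of 1 (an OR of exactly 1 counts as no direction)."""
    return math.log(observed_or) * math.log(reported_or) > 0

# Hypothetical counts: 110 cases (220 alleles), 130 controls (260 alleles).
observed = allelic_odds_ratio(case_alt=60, case_ref=160, ctrl_alt=50, ctrl_ref=210)
print(round(observed, 3))
print(same_direction(observed, reported_or=1.3))
print(same_direction(observed, reported_or=0.8))
```

With 60 SNPs you could simply count how many agree in direction with the literature; roughly half agreeing is what you would expect by chance.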
Caveats:
If you are relying on AF to this degree, in particular in the absence of linkage disequilibrium information (which you will not have because these SNPs are unlikely to be near one another), you need to make doubly sure your alleles are reported on the same strand as the authors use, especially for A/T and C/G SNPs, where the two alleles are complements of each other and a strand flip cannot be detected from the allele codes alone.
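A small helper along these lines can flag the problem SNPs before you compare frequencies. This is just a sketch; the function names and return labels are my own, not from any particular tool.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_strand_ambiguous(a1, a2):
    """True for A/T and C/G SNPs, whose alleles are complements of each other."""
    return COMPLEMENT[a1.upper()] == a2.upper()

def harmonize(my_alleles, reported_alleles):
    """Classify how my allele pair relates to the reported allele pair.

    Returns 'ambiguous', 'match', 'flip' (same SNP on the opposite strand),
    or 'mismatch'.
    """
    mine = {a.upper() for a in my_alleles}
    theirs = {a.upper() for a in reported_alleles}
    if is_strand_ambiguous(*my_alleles):
        return "ambiguous"  # strand cannot be resolved from allele codes alone
    if mine == theirs:
        return "match"
    if {COMPLEMENT[a] for a in mine} == theirs:
        return "flip"
    return "mismatch"

print(harmonize(("A", "G"), ("G", "A")))
print(harmonize(("A", "G"), ("T", "C")))
print(harmonize(("A", "T"), ("A", "T")))
print(harmonize(("A", "G"), ("A", "C")))
```

For the 'ambiguous' ones you have to fall back on the array manifest or on AF itself, which is only informative when the frequency is well away from 0.5.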
For the cases it is tougher, because you expect the allele frequency to differ from the general population due to the disease (since you're validating recognized SNPs). Here what I would do is look at the manuscripts in which these associations were first described. Assuming the original study was done in a similar population, compare your results to the prior ones (for both cases and controls).
The above reasoning assumes that your population is similar to the population in which these associations were first reported. If that is not true, it will be a lot tougher...
Thank you so much; luckily I have additional genotyping data for both the cases and controls, which means I could use PCA. This data set is a subset of a larger dataset which I had used for GWAS, so I guess I can do PCA.
no problem man - consider accepting the answer if it is resolved so that it moves from Open to Closed - if you have additional (related) questions, shoot...
Thanks again, but how do I accept the answer? ;) (Sorry, I am new here; I tried searching the forum for how to accept a post, but with no success.)
Thank you