I would like to obtain an accurate estimate of the number of SNPs in my SNP set that occupy LD-independent loci.
I am using a SNP set culled from a large GWA data set (selection of gene-resident variants was based on an a priori hypothesis) to examine SNP-trait associations. We feel the 5E-8 alpha testing level is too stringent in this case, as we are examining association tests on ~13,000 SNPs. We'd like to get an idea of how many and which of these are in nearly complete LD (r2 >= 0.80 in CEU).
THE CHALLENGE
- I have access to a list of SNPs from a public source with values for marker ID & p-value.
- I DO NOT have access to genotype values for individuals (cannot compute linkage structure).
Essentially we want to select SNPs representative of LD-independent loci and compute Q-values on the SNP-trait tests. Is there a way to take my SNP set marker ID values and obtain information on LD structure in my sample using a proxy sample? The CEU or CEU+TSI would be appropriate as a reference.
I would suggest you look into haplotype analysis. LD structure is only present among SNPs that are relatively close to one another. You could phase your samples and then derive haplotypes from phased reference data such as 1000 Genomes. You will then be able to assign haplotypes to your samples and test for association between those haplotypes and your trait of interest. This approach would account for LD structure. Depending on the data, the number of haplotypes can be much smaller than the number of SNPs you are testing, which would decrease your multiple-testing penalty.
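If you would rather stay with the plan in the question (one representative SNP per LD block, then Q-values), here is a minimal sketch of that logic in Python. It assumes you have already obtained pairwise r2 values for your marker IDs from a reference panel such as CEU using some external tool; that lookup is not shown. The function names `clump` and `bh_qvalues`, the rsIDs, and all numbers are made up for illustration. Note also that the second function computes Benjamini-Hochberg adjusted p-values, which I am using as a simple, conservative stand-in for Storey-style Q-values.

```python
from typing import Dict, List, Tuple


def clump(pvalues: Dict[str, float],
          r2_pairs: Dict[Tuple[str, str], float],
          r2_threshold: float = 0.80) -> List[str]:
    """Greedy selection: walk SNPs from smallest to largest p-value, keep each
    SNP not yet removed, and remove every SNP in high LD with a kept SNP."""
    # Map each SNP to the set of SNPs whose reference-panel r2 meets the threshold.
    neighbours = {snp: set() for snp in pvalues}
    for (a, b), r2 in r2_pairs.items():
        if r2 >= r2_threshold and a in neighbours and b in neighbours:
            neighbours[a].add(b)
            neighbours[b].add(a)

    kept: List[str] = []
    removed = set()
    for snp in sorted(pvalues, key=pvalues.get):
        if snp in removed:
            continue
        kept.append(snp)
        removed |= neighbours[snp]
    return kept


def bh_qvalues(pvalues: Dict[str, float]) -> Dict[str, float]:
    """Benjamini-Hochberg adjusted p-values (a stand-in for Q-values)."""
    ordered = sorted(pvalues.items(), key=lambda kv: kv[1])
    m = len(ordered)
    adjusted: Dict[str, float] = {}
    running_min = 1.0
    # Step from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        snp, p = ordered[rank - 1]
        running_min = min(running_min, p * m / rank)
        adjusted[snp] = running_min
    return adjusted


# Hypothetical input: marker IDs with p-values, plus reference-panel r2 pairs.
example_p = {"rs1": 0.001, "rs2": 0.002, "rs3": 0.04, "rs4": 0.5}
example_r2 = {("rs1", "rs2"): 0.95, ("rs2", "rs3"): 0.30, ("rs3", "rs4"): 0.85}

kept = clump(example_p, example_r2)                 # representative SNPs
qvals = bh_qvalues({s: example_p[s] for s in kept})  # adjust only the kept tests
```

Here `len(kept)` is the estimate of the number of LD-independent loci, and the adjustment is applied only to the representative SNPs, so the multiple-testing burden reflects the reduced number of tests. Keeping the SNP with the smallest p-value per block is one common choice, not the only one.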