Hello all,
I have been looking at tagging SNPs based on r^2 values > 0.7, using a variety of PLINK window sizes (50-SNP, 225-SNP, 500-SNP windows, etc., all with increments of 10% of the window size). The .ped files I am scanning are chunks of whole-chromosome files from the 1000 Genomes Project.
Of course, bigger window sizes mean slower calculations. As such, I was wondering what the default, tried-and-true, standard PLINK window size is in the literature when looking for tagging SNPs, if there is one at all. It certainly seems as though increasing the window size helps up to a point, but beyond that it yields fewer and fewer new finds. Since I'm a bit pressed for time, I thought I'd ask this forum before launching an exhaustive survey of the literature.
In case it is of interest to anyone weighing bigger window sizes against slower analyses, here is a snippet of what I'm seeing:
Window (SNPs)  Chr4         Chr5         Chr6         Chr7         Chr8
50             0.211371226  0.20716799   0.181987946  0.219615699  0.175284882
225            0.17404574   0.17294384   0.143189064  0.182387517  0.144435708
500            0.170798296  0.170307534  0.136285545  0.178729012  0.140857786
5000           0.170303995  0.169839802  0.131607191  0.17781175   0.139813532
The numbers in the table, if you multiply by 100, are the % of SNPs in the dataset that have NO correlation > 0.7 with any other SNP. Essentially, these are the lone-wolf SNPs that will not require a tag. (I should mention that I also filtered on MAF prior to this scan, in case those percentages seem odd for unfiltered SNP data.)
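For anyone who wants to see the diminishing returns more directly, here is a rough Python sketch that turns the table into percentage-point drops between successive window sizes. The values are copied from the table above; the names `untagged` and `marginal_drops` are just illustrative, not part of PLINK.

```python
# Fractions of untagged ("lone-wolf") SNPs per window size, copied from the table.
windows = [50, 225, 500, 5000]
untagged = {
    "Chr4": [0.211371226, 0.17404574, 0.170798296, 0.170303995],
    "Chr5": [0.20716799, 0.17294384, 0.170307534, 0.169839802],
    "Chr6": [0.181987946, 0.143189064, 0.136285545, 0.131607191],
    "Chr7": [0.219615699, 0.182387517, 0.178729012, 0.17781175],
    "Chr8": [0.175284882, 0.144435708, 0.140857786, 0.139813532],
}


def marginal_drops(fractions):
    """Percentage-point drop in untagged SNPs for each step up in window size."""
    return [round((a - b) * 100, 2) for a, b in zip(fractions, fractions[1:])]


# Label each step, e.g. "50->225", and print the drop per chromosome.
steps = [f"{w0}->{w1}" for w0, w1 in zip(windows, windows[1:])]
for chrom, fracs in untagged.items():
    drops = marginal_drops(fracs)
    summary = ", ".join(f"{s}: {d:.2f}" for s, d in zip(steps, drops))
    print(f"{chrom}  {summary}")
```

On these numbers, most of the change happens between the 50-SNP and 225-SNP windows, and going from 500 to 5000 SNPs moves the figure by well under one percentage point on every chromosome shown.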
I'm using the --show-tags and --show-all commands to get tags and their target SNPs after all is said and done. I'm thinking of using the 225- or 500-SNP windows, with 23- and 50-SNP increments respectively, in my final analysis. Would that be sufficient, insufficient, or overkill? I'm not trying to find literally every correlation > 0.7; I just want to make sure there are enough unique markers left over in a list of SNPs of interest.