I have two datasets from the same type of array in Plink map/ped format. The smaller of the two has about 586,662 SNPs and 93 samples, and the larger of the two has about 620,000 SNPs and 934 samples.
I want to merge the two datasets such that I have an intersection of the two (i.e., all 1027 samples but only SNPs present in both data sets).
From an experience a few days ago, I know that there are about 7000 sites where the alleles in the smaller dataset are the complement of those in the larger dataset (e.g., the larger set has a C/A polymorphism and the smaller set has a G/T polymorphism). Happily, this array was designed to exclude symmetrical SNPs (A/T or C/G), so fixing this problem is a little less confusing; however, I do not have a list of the flipped SNPs.
I know I can flip genotypes in Plink using
plink --file data --flip list.txt --flip-subset mylist.txt --recode
I was wondering, how can I identify these sites and get them merged on the same strand? Thanks
To be clear, I would take that missnp file and flip one of the datasets and then re-merge, and then I would be good? (Also, the samples in each dataset are completely different)
Also, it appears that merge mode (http://pngu.mgh.harvard.edu/~purcell/plink/dataman.shtml#merge) produces a union instead of an intersection, but I don't want the SNPs that are present in only one of the two datasets. I only want SNPs present in both.
Yes, that's correct: you flip just one dataset and re-merge.
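In case it helps, here is a rough sketch of that flip-and-re-merge sequence. These are assumptions on my part, not something from your setup: the prefixes (small, large, merged) are placeholders, plink is on your PATH, and the conflict list is named either <out>.missnp (PLINK 1.07) or <out>-merge.missnp (PLINK 1.9). Please check which name your version actually writes.

```shell
# Assumption: the PLINK executable is called "plink"; override PLINK if not.
PLINK="${PLINK:-plink}"

# Hypothetical helper: trial-merge, flip the conflicting SNPs in the small
# dataset only, then merge again.
# Usage: merge_flip_remerge small large merged   (ped/map prefixes)
merge_flip_remerge() {
    small=$1
    large=$2
    out=$3

    # 1) Trial merge. This is expected to stop with an error when allele
    #    codes disagree, leaving a list of the offending SNPs behind.
    "$PLINK" --file "$large" --merge "$small.ped" "$small.map" \
        --recode --out "${out}_trial" || true

    # 2) Find the conflict list (the file name depends on the PLINK version).
    missnp=""
    for f in "${out}_trial-merge.missnp" "${out}_trial.missnp"; do
        if [ -s "$f" ]; then
            missnp=$f
            break
        fi
    done
    if [ -z "$missnp" ]; then
        echo "No .missnp file found; check the log of ${out}_trial." >&2
        return 0
    fi

    # 3) Flip strand for those SNPs in one dataset only (here: the small one).
    "$PLINK" --file "$small" --flip "$missnp" \
        --recode --out "${small}_flipped" || return 1

    # 4) Merge again using the flipped copy.
    "$PLINK" --file "$large" --merge "${small}_flipped.ped" "${small}_flipped.map" \
        --recode --out "$out"
}
```

If the second merge still reports conflicts, those remaining SNPs are probably not simple strand flips and are worth inspecting by hand or excluding.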
You will only get merge conflicts for SNPs present in both datasets.
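For the intersection part, one possible approach (a sketch, not the only way) is to build the list of SNP IDs shared by both .map files and restrict each dataset to that list with --extract before merging. This assumes both datasets use the same SNP identifiers in column 2 of the .map file, which seems likely for the same array type but is worth verifying. The function name and file prefixes below are made up for illustration.

```shell
# Hypothetical helper: print the SNP IDs (column 2 of a PLINK .map file)
# that occur in both files, in the order they appear in the second file.
# Usage: common_snps first.map second.map
common_snps() {
    awk 'NR == FNR { seen[$2] = 1; next }   # first file: remember each ID
         $2 in seen { print $2 }            # second file: keep shared IDs
        ' "$1" "$2"
}

# Example usage (prefixes are placeholders):
#   common_snps small.map large.map > common_snps.txt
#   plink --file small --extract common_snps.txt --recode --out small_common
#   plink --file large --extract common_snps.txt --recode --out large_common
# and then merge small_common with large_common as usual.
```

Because both inputs then contain the same SNPs, the union produced by the merge is also the intersection you are after.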