I have a data set of 608,038 autosomal SNPs from the Affymetrix Human Origins array, which has 620,744 autosomal SNPs total. These were filtered in Genotyping Console, such that they are all autosomal SNPs with 95% SNP call-rates from 93+2 samples (+2 = positive control DNA samples) (and exclude ~4000 possibly triallelic sites).
I have a downloaded a dataset with 594,924 autosomal SNPs generated using the same type of array (http://genetics.med.harvard.edu/reich/Reich_Lab/Datasets_files/EuropeFullyPublic.tar.gz).
When I merge the data sets, the total number of SNPs fall to 545,956.
This happens when I merge the data sets in Plink as well as when I look at the intersection of rs/Affx-#s from both data sets in R
first <- read.table("first.map", header=F)
second <- read.table("second.map", header=F)
intersect(first$V2,second$V2) -> consensus
I was wondering if anyone has any idea where these ~50,000 SNPs are going.
This is still plenty of SNPs for human evolutionary/population genetic inferences, but I don't want to be asked one day by an article reviewer why I lack this many sites from data merger.
Thanks!
I don't think that is quite it.
See page #16 of this. There are 629,443 SNPs properly on the array with 633,625 probes for those SNPs. The ~4000 SNPs with multiple probes were excluded from my SNP list as they are thought to be triallelic. I believe the remaining from the 693,587 are for Dish QC and other quality control things.
I have checked both my data and the downloaded data for duplicated Affx-#s as well as rs #s (the comparative data has Affx-#s only, so I ran a script on it to substitution Affx to rs when rs is available). With and without said script, there are no duplicates in either data set.
One question that has come to mind: do SNPs ever see their Affx-#s change? The latest annotation file is fairly recent (from March), and the data set was published well before.
Could the discrepancy be because the Affx-#s changed between the generation of their data and the generation of my data?
So on the advice of the labmate, instead of loading the Affx numbers into R, I loaded the Hg19 locations into R as two lists in the format of chrom# hyphen bp# (e.g., 16-24417536). This intersection yielded 585,413 sites (compared to 545,956 when intersecting based on Affx numbers).
Now I am wondering, how do I tell Plink to ignore the Affx numbers and merge based upon Hg19 locations?