Hello,
so I have SNPs (RSIDs) from imputation done in 2011 on http://csg.sph.umich.edu/abecasis/MACH/tour/ (call it 2011 data) and I did imputation on the same genotype files on Michigan Imputation Server, Genotype Imputation (Minimac4) 1.2.4 (call it 2020 data)
using the same QC steps I perfomed GWAS using plink.
In 2011 I have ~2.5 million SNPs and in 2020 I have ~2.7 million SNPs. The issue is that only ~900000 SNPs are matching between those two data sets. Can someone please explain me why? Did RS names changed in the meantime? I did put both genotype files on Build 37. Here I am presenting number of SNPs per chromosome for old (2011) data and new (2020) data. Also I am comparing snps_that_can_be_found_in_old_but_not_in_new and snps_that_can_be_found_in_new_but_not_in_old.
Can someone please explain me what might be the issue and why there is only ~900000 SNPs matching SNPs?
HI Kevin,
I did match based on RS name, not on position. Do you advise me matching based on position?
what about SNPs like this: rs2905035, rs2519016...which can be found in 2011 imputation bot not in 2020
I am not sure. If any program is of high quality, it should output what it filters out, and the reason why. You can check log files?
I don't have log files for 2011 data.
What about comparing F_A Frequency of this allele in cases and F_U Frequency of this allele in controls. What would this be indication of?
https://imgur.com/JoFtE6P
https://imgur.com/53yGfNh
I could only speculate, to be honest. As I am generally familiar with your experiment, my advice to you and your supervisor (were I to provide it) would be to not try to replicate perfectly the results from ~10 years ago. Try to see this as an opportunity to process old data with new, better methods, which will produce results with higher confidence.
I completely understand you. And I already confirmed analysis I did in 2020 in other known GWAS results (doing the most recent QC steps, methods...). But we need to know why 2011 was so different.
it seems that characteristics of overlapping SNPs is that there is not many of them who have very low minor allele frequency. Attached is the plot for frequencies for overlapping SNPs. Any idea why this might be?
https://imgur.com/XkCxpvz
https://imgur.com/cWfsHxu
I suppose that it indicates how disagreement is on rare alleles; so, the differences could be accounted for by a filter for MAF. What we view as 'rare' has changed over the years as progressively larger datasets have been produced.
There really should be some way to check all filtering steps for both the 2011 dataset and the Michigan Imputation Server (2020) dataset. If no code exists for the 2011 dataset, then that is a negative point on your supervisor's behalf.
Yes indeed. Unfortunately I don't have any record on what kind of MAF threshold was used prior and post imputation in 2011. What number for MAF would you suggest me to impose on the new dataset judging by the old data? Should I do MAF 0.1 on both genotypes prior to doing regression analysis in plink or?