Question

Difference in imputation between platforms and RS names

0

4.7 years ago

anamaria ▴ 220

Hello,

I have SNPs (rsIDs) from an imputation done in 2011 with MACH (http://csg.sph.umich.edu/abecasis/MACH/tour/); call it the 2011 data. I also ran imputation on the same genotype files on the Michigan Imputation Server, Genotype Imputation (Minimac4) 1.2.4; call it the 2020 data.

Using the same QC steps, I performed the GWAS using plink.

In the 2011 data I have ~2.5 million SNPs and in the 2020 data ~2.7 million SNPs. The issue is that only ~900,000 SNPs match between the two data sets. Can someone please explain why? Did the rs names change in the meantime? I put both genotype files on Build 37. Below I show the number of SNPs per chromosome for the old (2011) and new (2020) data. I also compare snps_that_can_be_found_in_old_but_not_in_new and snps_that_can_be_found_in_new_but_not_in_old.

Can someone please explain what the issue might be and why only ~900,000 SNPs match?

[Image: number of SNPs per chromosome for the 2011 and 2020 data]

minimac mach imputation rs snps • 2.0k views

ADD COMMENT • link updated 4.7 years ago by Kevin Blighe 89k • written 4.7 years ago by anamaria ▴ 220

score 0 · Answer 1 · 2020-09-18

0

4.7 years ago

Kevin Blighe 89k

How are you matching: based on rs ID or genomic position?

The imputation tools may apply different pre-filtering, e.g., for linkage disequilibrium. That gives different results at the level of individual SNPs but, globally, the same overall result in terms of the genomic loci affected.
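To see how much the matching key matters, here is a minimal sketch that counts the overlap once by rs ID and once by chromosome and position. It assumes both results are on the same build and are whitespace-delimited tables with `CHR`, `SNP` and `BP` columns, as in plink association output. The toy rows, the `1:2000`-style ID and the function names are made up for illustration.

```python
import io

def read_assoc(text):
    """Parse whitespace-delimited, plink .assoc-style text into a list of dicts."""
    lines = [ln.split() for ln in io.StringIO(text) if ln.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]

def overlap_counts(old_rows, new_rows):
    """Return (number shared by rs ID, number shared by chromosome:position)."""
    old_ids = {r["SNP"] for r in old_rows}
    new_ids = {r["SNP"] for r in new_rows}
    old_pos = {(r["CHR"], r["BP"]) for r in old_rows}
    new_pos = {(r["CHR"], r["BP"]) for r in new_rows}
    return len(old_ids & new_ids), len(old_pos & new_pos)

# Hypothetical toy data: the same positions can carry different IDs.
OLD = """CHR SNP BP A1
1 rsA 1000 A
1 rsB 2000 C
2 rsC 3000 G
"""
NEW = """CHR SNP BP A1
1 rsA 1000 A
1 1:2000 2000 C
2 rsD 3000 G
3 rsE 4000 T
"""

by_id, by_pos = overlap_counts(read_assoc(OLD), read_assoc(NEW))
print("shared by rs ID:", by_id)
print("shared by position:", by_pos)
```

If the count by position is much larger than the count by rs ID, the difference is largely a naming issue rather than a difference in which variants were imputed.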

Kevin

ADD COMMENT • link 4.7 years ago by Kevin Blighe 89k

0

Hi Kevin,

I matched based on rs name, not on position. Would you advise matching based on position instead?

ADD REPLY • link 4.7 years ago by anamaria ▴ 220

0

What about SNPs like rs2905035 and rs2519016, which can be found in the 2011 imputation but not in the 2020 one?

ADD REPLY • link 4.7 years ago by anamaria ▴ 220

0

I am not sure. Any high-quality program should output what it filters out, and the reason why. Can you check the log files?

ADD REPLY • link 4.7 years ago by Kevin Blighe 89k

0

I don't have log files for 2011 data.

What about comparing F_A (frequency of this allele in cases) and F_U (frequency of this allele in controls)? What would this be an indication of?

https://imgur.com/JoFtE6P

https://imgur.com/53yGfNh

ADD REPLY • link 4.7 years ago by anamaria ▴ 220

0

I could only speculate, to be honest. As I am generally familiar with your experiment, my advice to you and your supervisor (were I to provide it) would be to not try to replicate perfectly the results from ~10 years ago. Try to see this as an opportunity to process old data with new, better methods, which will produce results with higher confidence.

ADD REPLY • link 4.7 years ago by Kevin Blighe 89k

0

I completely understand. I have already confirmed the analysis I did in 2020 against other known GWAS results (using the most recent QC steps, methods...). But we need to know why 2011 was so different.

ADD REPLY • link 4.7 years ago by anamaria ▴ 220

0

It seems that a characteristic of the overlapping SNPs is that few of them have a very low minor allele frequency. Attached are plots of the frequencies for the overlapping SNPs. Any idea why this might be?

https://imgur.com/XkCxpvz

https://imgur.com/cWfsHxu

ADD REPLY • link 4.7 years ago by anamaria ▴ 220

1

I suppose it indicates that the disagreement is mostly on rare alleles; so, the differences could be accounted for by a MAF filter. What we view as 'rare' has changed over the years as progressively larger datasets have been produced.

There really should be some way to check all filtering steps for both the 2011 dataset and the Michigan Imputation Server (2020) dataset. If no code exists for the 2011 dataset, then that is a negative point on your supervisor's behalf.
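A rough way to test the MAF-filter idea without the 2011 logs is to compare, at a few candidate cutoffs, the fraction of rare SNPs among those shared by both datasets against those found only in the 2020 data. The sketch below assumes you have an allele frequency per SNP from the 2020 results and the set of shared IDs; all names and numbers are made up for illustration.

```python
def maf(freq):
    """Minor allele frequency from the frequency of either allele."""
    return min(freq, 1.0 - freq)

def fraction_below(freqs, threshold):
    """Fraction of SNPs whose MAF is below the threshold."""
    if not freqs:
        return 0.0
    return sum(maf(f) < threshold for f in freqs) / len(freqs)

# Hypothetical allele frequencies from the 2020 data, keyed by SNP ID.
new_freqs = {
    "rs1": 0.30, "rs2": 0.004, "rs3": 0.98,
    "rs4": 0.45, "rs5": 0.008, "rs6": 0.12,
}
# Hypothetical set of SNP IDs present in both the 2011 and 2020 data.
shared = {"rs1", "rs4", "rs6"}

shared_f = [f for snp, f in new_freqs.items() if snp in shared]
new_only_f = [f for snp, f in new_freqs.items() if snp not in shared]

for t in (0.01, 0.05):
    print(
        f"MAF < {t}: shared {fraction_below(shared_f, t):.2f}, "
        f"2020-only {fraction_below(new_only_f, t):.2f}"
    )
```

If almost all 2020-only SNPs fall below some cutoff while almost no shared SNPs do, a MAF filter in the 2011 pipeline is a plausible explanation, though not proof.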

ADD REPLY • link 4.7 years ago by Kevin Blighe 89k

0

Yes indeed. Unfortunately I don't have any record of what MAF threshold was used before and after imputation in 2011. What MAF threshold would you suggest I impose on the new dataset, judging by the old data? Should I apply MAF 0.1 to both genotype sets before running the regression analysis in plink?
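One way to get a number from the old data itself, as a rough sketch: take the smallest MAF observed among the 2011 SNPs as an empirical floor and keep only the 2020 SNPs at or above it. This assumes the 2011 cutoff was a simple MAF filter, which is not confirmed; the frequencies and names below are made up for illustration.

```python
def maf(freq):
    """Minor allele frequency from the frequency of either allele."""
    return min(freq, 1.0 - freq)

def empirical_maf_floor(freqs):
    """Smallest MAF seen in a dataset; a hint at the cutoff that was applied."""
    return min(maf(f) for f in freqs)

# Hypothetical allele frequencies from the 2011 results.
old_freqs = [0.21, 0.05, 0.93, 0.5, 0.011]
# Hypothetical allele frequencies from the 2020 results, keyed by SNP ID.
new_freqs = {"rs1": 0.30, "rs2": 0.004, "rs3": 0.98, "rs4": 0.008}

floor = empirical_maf_floor(old_freqs)

# Keep 2020 SNPs whose MAF is at or above the floor seen in 2011.
kept = sorted(snp for snp, f in new_freqs.items() if maf(f) >= floor)

print("empirical MAF floor in 2011 data:", floor)
print("2020 SNPs kept:", kept)
```

The floor could then be passed to plink's `--maf` option on both datasets, so that the comparison is made on SNPs filtered the same way.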

ADD REPLY • link 4.7 years ago by anamaria ▴ 220