Question

Inflated GWAS test statistics after merging multiple batches of genotype data imputed differently


21 months ago

Amy ▴ 20

I have four batches of genotype data. Three of these batches are cases, while the fourth is the control dataset. They were all genotyped on different Illumina platforms and were all imputed differently (because the number of SNPs common to the four batches is quite low). After imputation, QC was done on each genotype batch before merging the genotypes and performing a GWAS.

The Manhattan plot of the GWAS shows inflated test statistics for a lot of SNPs. Each batch was filtered at MAF > 0.01 and imputation info > 0.8. The plot looks slightly better when I use info > 0.95; however, I lose a lot of SNPs. I was wondering where the issue might come from?

[Image: Manhattan plot]

genotype imputation gwas • 2.7k views

Updated 20 months ago by LauferVA 4.7k • written 21 months ago by Amy ▴ 20

score 4 · Answer 1 · 2023-09-19

The Manhattan plot is inflated because these datasets, when they were imputed, were likely not imputed uniformly (same reference panel, same parameters, same chromosomal splits). The result will be small differences in genotype frequency between the case batches and the control batch that, in the context of tens of thousands of observations, generate spurious significance.
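To make the sample-size point concrete, here is a minimal sketch (not anyone's actual pipeline) of a 1-df allelic chi-square test on a 2x2 table of allele counts. The function name and the example counts are made up for illustration. With 10,000 cases and 10,000 controls (20,000 alleles each), an artefactual frequency shift from 0.200 to 0.215 gives a chi-square of roughly 13.7 and a p-value well below 0.001, while the same shift with 100 samples per group is nowhere near significant.

```python
import math

def allelic_chi2(case_alt, case_n, ctrl_alt, ctrl_n):
    """1-df chi-square test on allele counts (alt vs ref, cases vs controls).

    case_n and ctrl_n are total allele counts (2 x number of individuals).
    Returns (chi2, p_value).
    """
    a, b = case_alt, case_n - case_alt   # case alt / ref alleles
    c, d = ctrl_alt, ctrl_n - ctrl_alt   # control alt / ref alleles
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square with 1 df: P(X > chi2) = erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: same true frequency, small batch-driven shift in cases.
large = allelic_chi2(4300, 20000, 4000, 20000)   # 10,000 cases vs 10,000 controls
small = allelic_chi2(43, 200, 40, 200)           # 100 cases vs 100 controls
print("large study:", large)
print("small study:", small)
```

Multiply that by the millions of imputed SNPs where the batches disagree slightly, and you get a Manhattan plot that is elevated everywhere.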

The fix is (I'm sorry to say) to ensure the imputation is performed identically for all datasets and, if possible, to perform the imputation simultaneously (i.e., by combining the raw genotypes into a large array with lots of "missing" genotypes, and performing imputation on this merged data).

score 2 · Answer 2 · 2023-09-19

Hi Amy,

LChart and bk11 have discussed the issue of inflated test statistics (e.g., LChart's answer); however, I want to take a step back before jumping into that. Specifically, I'll frame my response in a slightly different way: through the lens of mega-analysis versus meta-analysis for GWA studies. What the others have described are best practices for mega-analysis. Mega-analysis refers to combining studies into one harmonized dataset, which is then imputed, tested for association, and so on, all together (jointly).

The alternative is to conduct a meta-analysis: here, you run your GWAS pipeline end to end for each study (each genotyping chip) separately, including imputation, post-imputation QC, and association testing. One then combines the data at the level of summary test statistics, not at the level of raw or processed data.
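For a sense of what "combining at the level of summary statistics" looks like for a single SNP, here is a minimal sketch of an inverse-variance-weighted fixed-effect meta-analysis. This is one common choice, not the only one, and the function name and input values are illustrative. It assumes each study supplies its own effect estimate (e.g. a log odds ratio) and standard error for the same allele, which in turn requires each study to contain both cases and controls.

```python
import math

def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted fixed-effect meta-analysis for one SNP.

    betas: per-study effect estimates (e.g. log odds ratios, same effect allele)
    ses:   per-study standard errors
    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p_value

# Hypothetical per-study results for one SNP.
print(fixed_effect_meta([0.10, 0.14, 0.08], [0.05, 0.07, 0.06]))
```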

The degree to which this is possible depends on the content of each study, e.g. does each of your chips have both cases and controls, are they ancestry matched, etc. Unfortunately, the study design you describe is nothing short of atrocious. I am not trying to be mean and I know you cannot change this, but you have what is essentially a problem of perfect separation here: because genotyping chip/study segregates perfectly with case-control status, it is hard to tease out whether the thorny data-merging steps have been done well; however, without additional data, a meta-analytic framework will also run into problems...
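A quick sanity check for this kind of confounding is simply to cross-tabulate batch against case/control status. The sketch below is illustrative only (the function names and batch labels are hypothetical): it flags a design in which every batch contains only cases or only controls, which is exactly the situation described in the question.

```python
from collections import defaultdict

def batch_status_table(samples):
    """Count controls and cases per batch.

    samples: iterable of (batch_id, is_case) pairs.
    Returns {batch_id: [n_controls, n_cases]}.
    """
    table = defaultdict(lambda: [0, 0])
    for batch, is_case in samples:
        table[batch][1 if is_case else 0] += 1
    return dict(table)

def fully_confounded(table):
    """True if every batch holds only cases or only controls."""
    return all(n_ctrl == 0 or n_case == 0 for n_ctrl, n_case in table.values())

# Hypothetical layout mirroring the question: three case batches, one control batch.
samples = (
    [("case_batch_1", True)] * 3
    + [("case_batch_2", True)] * 2
    + [("case_batch_3", True)] * 2
    + [("control_batch", False)] * 4
)
table = batch_status_table(samples)
print(table)
print("batch fully confounded with status:", fully_confounded(table))
```

When this returns True, any systematic difference between batches (chip, imputation run, QC) is indistinguishable from a case-control difference in the data alone.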

While it is an oversimplification, generally speaking mega-analysis outperforms meta-analysis when done very well. However, the increase in statistical power is not always appreciable, and there are many thorny issues with a mega-analytic approach (as you seem to have discovered and as LChart describes). Thus, while mega-analysis is in theory usually better, in practice it tends to 1) be labor intensive and 2) introduce the possibility that a kind of error unlikely to occur in meta-analysis could hamper results (through exactly the kind of imperfect data processing of the disparate studies that others describe).

In your study, if you do end up pooling any raw data, I'd also recommend controlling for study batch ID as a covariate during association testing over and above the best practices for imputation others have mentioned. There is a slim chance this would actually "fix" your problem without much other work, but of course you'd need to confirm that ...

--Personal opinion only--

For myself, if I am publishing a dedicated paper that is just a GWAS, I do the legwork for mega-analysis. However, if for instance this is one of many steps and data analyses that will be cross-indexed against other omics assays, functional studies, etc. anyway, then the modest bump in statistical power may not be worth the more rigorous data prep and QC steps.

score 0 · Answer 3 · 2023-09-19


21 months ago

bk11 ★ 3.1k

Genotyping the samples (cases or controls) is not the problem; people routinely take this kind of approach and even obtain population controls from public repositories such as dbGaP. As LChart already mentioned, you need to find the common SNPs between your batches of datasets (both cases and population controls) and merge them before imputation. After merging, you could perform imputation on publicly available servers such as the Michigan Imputation Server or the TOPMed Imputation Server.



"you need to find the common SNPs between your batches of datasets (both cases and population controls) and merge them before imputation"

I would caution against this. Imputing each array into the same reference file separately (but with the same parameters) would be preferred to losing many genotype calls by restricting only to sites shared across all arrays. Most of the software should successfully impute the complete "outer" merge without serious issue, unless sites get dropped by an aggressive missingness filter.
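To illustrate the trade-off, here is a small sketch using plain Python sets. The array names, rsIDs, and sample counts are invented. It compares the "inner" merge (sites present on every array) with the "outer" merge (sites present on any array), and computes the per-site missingness the outer merge would have, since that is what a missingness filter would act on.

```python
def merge_sites(arrays):
    """arrays: {array_name: set of site IDs}. Returns (inner, outer) site sets."""
    site_sets = [set(sites) for sites in arrays.values()]
    inner = set.intersection(*site_sets)   # typed on every array
    outer = set.union(*site_sets)          # typed on at least one array
    return inner, outer

def site_missingness(arrays, n_samples):
    """Fraction of samples with no genotype call per site in the outer merge.

    n_samples: {array_name: number of samples genotyped on that array}.
    """
    total = sum(n_samples.values())
    outer = set().union(*arrays.values())
    return {
        site: sum(n_samples[name] for name, sites in arrays.items()
                  if site not in sites) / total
        for site in outer
    }

# Hypothetical toy arrays.
arrays = {
    "array_A": {"rs1", "rs2", "rs3"},
    "array_B": {"rs2", "rs3", "rs4"},
    "array_C": {"rs3", "rs5"},
}
n_samples = {"array_A": 100, "array_B": 100, "array_C": 200}

inner, outer = merge_sites(arrays)
print(len(inner), "shared sites vs", len(outer), "sites in the outer merge")
print(site_missingness(arrays, n_samples))
```

Sites typed on only one array end up with high missingness in the merged data, which is why an aggressive pre-imputation missingness threshold can quietly undo the benefit of the outer merge.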

Reply • 21 months ago by LChart 5.0k


The OP does not mention gender information for the datasets. I hope these data are not gender biased. What would you think if the datasets are gender biased?

Reply • 20 months ago by bk11 ★ 3.1k


Sex bias would only inflate GWAS statistics if the disease prevalence differed by sex and sex was not included as a covariate in the model OR if sex chromosomes were inappropriately handled by the software. I doubt this is an issue in this case as the inflation is impacting the autosomes, and sex is typically included as a covariate by default.
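If you want to put a number on where the inflation sits, one common summary is the genomic inflation factor (lambda GC): the median observed 1-df chi-square divided by its expected median under the null (about 0.455). The sketch below is a minimal stdlib-only version with a hypothetical function name. It assumes p-values from 1-df tests, each strictly between 0 and 1 (p = 1 is also fine), and it could be run separately on autosomal and sex-chromosome SNPs.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def lambda_gc(p_values):
    """Genomic inflation factor from a list of 1-df association p-values."""
    normal = NormalDist()
    # Convert each two-sided p-value back to a chi-square: z = Phi^-1(p / 2), chi2 = z^2
    chi2 = [normal.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN

# Toy inputs: evenly spread p-values (no inflation) vs p-values skewed toward zero.
uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
skewed_p = [p ** 2 for p in uniform_p]
print("lambda, uniform p-values:", lambda_gc(uniform_p))
print("lambda, skewed p-values: ", lambda_gc(skewed_p))
```

A value near 1 indicates no systematic inflation; values well above 1 across the autosomes point to a genome-wide artefact rather than a sex-chromosome handling problem.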

Reply • 20 months ago by LChart 5.0k

score 0 · Answer 4 · 2023-09-20

Different platforms and different imputations might be causing some noise in your GWAS. Maybe you could try some kind of batch correction method? I've heard that can sometimes help when you're merging different datasets. Inflated test stats could be a real headache if you're looking for true associations. Just my two cents, but maybe consult with someone more experienced in the field?