Inflated GWAS test statistics after merging multiple batches of genotype data imputed differently
14 months ago
Amy ▴ 20

I have 4 batches of genotype data. 3 of these batches are cases while the 4th is the control dataset. They were all genotyped on different Illumina platforms and were all imputed separately (because the number of SNPs common to the 4 batches is quite low). After imputation, QC was done on each batch before merging the genotypes and performing a GWAS.

The Manhattan plot of the GWAS shows inflated test statistics for a lot of SNPs. Each of the batches was filtered at MAF > 0.01 and IMPUTE info > 0.8. The plot looks slightly better when I use info > 0.95, but then I lose a lot of SNPs. I was wondering where the issue could come from?

[Manhattan plot image]

Tags: genotype, imputation, gwas
14 months ago
LChart 4.6k

The Manhattan plot is inflated because these datasets were likely not imputed uniformly (same reference panel, same parameters, same chromosomal splits). The result is small differences in genotype frequency between the case batches and the control batch that, in the context of tens of thousands of observations, generate spurious significance.
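The effect is easy to demonstrate numerically. Here is a small sketch (not from the thread, and with made-up frequencies) of a 2x2 allele-count chi-square test, where a modest imputation artifact affecting only the case batches becomes "significant" purely because the sample size grows:

```python
# Illustration: Pearson chi-square for a 2x2 allele-count table,
# showing how a small batch-driven frequency difference becomes
# "significant" once sample sizes are large.

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

# True allele frequency 0.30 in both groups; imputation adds a +0.02
# artifact to the case batches only.
for n_alleles in (500, 5_000, 50_000):  # allele counts per group
    cases_alt = round(0.32 * n_alleles)
    ctrls_alt = round(0.30 * n_alleles)
    stat = chi_square_2x2(cases_alt, n_alleles - cases_alt,
                          ctrls_alt, n_alleles - ctrls_alt)
    # 3.84 is the 1-df chi-square threshold at alpha = 0.05
    print(n_alleles, round(stat, 2), stat > 3.84)
```

The same 2% artifact that is invisible at n = 500 alleles per group crosses the significance threshold by n = 5,000 and is extreme at n = 50,000, which is exactly the regime of a GWAS.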

The fix is (I'm sorry to say) to ensure the imputation is performed identically for all datasets and, if possible, to perform the imputation simultaneously (i.e., by combining the raw genotypes into a large array with lots of "missing" genotypes, and performing imputation on this merged data).
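To check whether re-imputation has actually helped, the genomic-control inflation factor (lambda_GC) can be estimated from the per-SNP test statistics. A minimal pure-Python sketch, using simulated z-scores in place of real GWAS output:

```python
import random
from statistics import median

random.seed(0)

# Genomic-control lambda from per-SNP z-scores:
# lambda = median(z^2) / 0.4549, where 0.4549 is the median of a
# 1-df chi-square distribution under the null.
CHI2_1DF_MEDIAN = 0.4549

def lambda_gc(z_scores):
    return median(z * z for z in z_scores) / CHI2_1DF_MEDIAN

# Well-calibrated statistics: z ~ N(0, 1); lambda should sit near 1.
null_z = [random.gauss(0.0, 1.0) for _ in range(100_000)]
# Batch-effect-inflated statistics: extra variance pushes lambda up.
inflated_z = [random.gauss(0.0, 1.2) for _ in range(100_000)]

print(round(lambda_gc(null_z), 2))      # near 1.0
print(round(lambda_gc(inflated_z), 2))  # well above 1.0
```

A lambda meaningfully above ~1.05 after re-imputation would suggest residual stratification or batch structure rather than polygenic signal alone.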


Even these measures will very often not fix the problem in the context of the perfect separation that OP describes.

14 months ago
LauferVA 4.5k

Hi Amy,

LChart and bk11 have discussed the issue of inflated test statistics (e.g., LChart's answer); however, I want to take a step back before jumping into that. Specifically, I'll frame my response in a slightly different way: through the lens of mega-analysis versus meta-analysis for GWA studies. What the others have described are best practices for mega-analysis. Mega-analysis refers to combining studies into one harmonized dataset, which is then imputed and tested for association together (jointly).

The alternative is to conduct a meta-analysis: here, you run the GWAS pipeline you have end to end for each study (each genotyping chip) separately, including imputation, post-imputation QC, and association testing. One then combines the data at the level of summary test statistics, not at the level of raw or processed data.
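For a single SNP, the standard fixed-effect combination of per-study summary statistics is inverse-variance weighting of the effect sizes. A small sketch, with hypothetical per-study (beta, SE) estimates for illustration only:

```python
import math

# Fixed-effect inverse-variance meta-analysis of one SNP:
# each study contributes (beta, standard_error) from its own GWAS.
def inverse_variance_meta(results):
    weights = [1.0 / (se * se) for _, se in results]
    beta_meta = sum(w * b for w, (b, _) in zip(weights, results)) / sum(weights)
    se_meta = math.sqrt(1.0 / sum(weights))
    return beta_meta, se_meta

# Hypothetical estimates from three independent studies.
per_study = [(0.12, 0.05), (0.10, 0.04), (0.15, 0.06)]
beta, se = inverse_variance_meta(per_study)
z = beta / se  # combined z-score for the SNP
print(round(beta, 4), round(se, 4), round(z, 2))
```

Note this presumes each study's beta is estimated against its own internal controls, which is exactly what the design here lacks (see below).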

The degree to which this is possible depends on the relative content of each study: do all of your chips have cases and controls, are they ancestry-matched, etc.? Unfortunately, the study design you propose is nothing short of atrocious. I am not trying to be mean, and I know you cannot change this, but you have what is essentially a problem of perfect separation here: because genotyping chip/study segregates perfectly with case-control status, teasing out whether the thorny data-merging steps have been done well is problematic; however, without additional data a meta-analytic framework will also run into problems...

While it is an oversimplification, generally speaking mega-analysis outperforms meta-analysis when done very well. However, the increase in statistical power is not always appreciable, and there are many thorny issues with a mega-analytic approach (as you seem to have discovered and as LChart describes). Thus, while mega-analysis is in theory usually better, in practice it tends 1) to be labor-intensive and 2) to introduce the possibility that a kind of error unlikely to occur in meta-analysis could hamper results (through exactly the kind of imperfect data processing of the disparate studies that others describe).

In your study, if you do end up pooling any raw data, I'd also recommend controlling for study batch ID as a covariate during association testing over and above the best practices for imputation others have mentioned. There is a slim chance this would actually "fix" your problem without much other work, but of course you'd need to confirm that ...

--Personal opinion only--

For myself, if I am publishing a dedicated paper that is just a GWAS study, I do the legwork for mega-analysis. However, if this is one of many analyses that will be cross-indexed against other omics assays, functional studies, etc. anyway, then the modest bump in statistical power may not be worth the more rigorous data prep and QC steps.


I can't do a meta-analysis because these batches are not individually case-control; rather, I have batches of cases and one batch of controls. I have tried to control for batch in the analysis but ended up with all the P-values being greater than 0.1. This is also because the batches are not case-control, so controlling for batch effectively takes out all the effect of case-control status.
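The confounding described here can be made concrete: with three all-case batches and one all-control batch, the case indicator is an exact function of batch membership, so a model containing batch dummies leaves no variation for case status to explain. A toy sketch (batch names are made up):

```python
# Toy illustration of why "batch" and "case/control" cannot be
# separated in this design: case status is an exact function of
# which batch a sample came from.
samples = (
    [("batch1", 1)] * 3 + [("batch2", 1)] * 3 +
    [("batch3", 1)] * 3 + [("controls", 0)] * 3
)

for batch, case in samples:
    # The batch label alone predicts case status with no error:
    # adding batch covariates absorbs the entire case/control contrast.
    is_case_batch = int(batch != "controls")
    assert case == is_case_batch

print("case status is perfectly determined by batch")
```

In regression terms, the case indicator is a linear combination of the batch dummy columns, so the design matrix is rank-deficient and the case-status coefficient is not identifiable once batch is included.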


Hi Amy, I understand; this is why I said:

"The degree to which this is possible depends on the relative content of each study - e.g. do each of your chips have cases and controls, are they ancestry matched, etc. etc. etc. Unfortunately, the study design you propose is nothing short of atrocious..."

The atrocious part is the problem of perfect separation that I refer to in the post. Having said that, it's important to remember that no problem is insoluble in all conceivable circumstances. Here, you actually CAN resurrect this study, e.g. by finding additional ancestry-matched controls genotyped on the same chip. In this day and age, that is usually not difficult unless you are dealing with a very poorly studied population.

At any rate, there are lots of reviews out there on how to perform this kind of task, though it's very possible none would cover a case exactly like this... My hunch is that you are going to be dealing with some pretty nasty analytic problems here, and that imputation may not even be the worst of it... It is unfortunate to say so, but more than anything else, the takeaway here was aptly summarized by RA Fisher nearly a century ago:

"To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. (S)He can perhaps say what the experiment died of."

14 months ago
bk11 ★ 3.0k

Genotyping the samples (cases or controls) separately is not in itself a problem; this type of approach is common, and people even get population controls from public repositories such as dbGaP. As LChart already mentioned, you need to find the common SNPs between your batches of datasets (both cases and population controls) and merge them before imputation. Post-merge, you could perform imputation on publicly available servers like the Michigan Imputation or TOPMed Imputation servers.


"you need to find the common SNPs between your batches of datasets (both cases and population controls) and merge them before imputation"

I would caution against this. Imputing each array into the same reference panel separately (but with the same parameters) would be preferable to losing many genotype calls by restricting to only the sites shared across all arrays. Most imputation software should successfully impute the complete "outer" merge without serious issue, unless sites get dropped by an aggressive missingness filter.


OP does not mention sex information for the datasets. I hope these data are not sex-biased. What would you think if the datasets were sex-biased?


Sex bias would only inflate GWAS statistics if the disease prevalence differed by sex and sex was not included as a covariate in the model OR if sex chromosomes were inappropriately handled by the software. I doubt this is an issue in this case as the inflation is impacting the autosomes, and sex is typically included as a covariate by default.

14 months ago
Fatima • 0

Different platforms and different imputations might be causing some noise in your GWAS. Maybe you could try some kind of batch-correction method? I've heard that can sometimes help when you're merging different datasets. Inflated test stats can be a real headache if you're looking for true associations. Just my two cents, but maybe consult with someone more experienced in the field?

