I have a number of population-level datasets where I QC'd the initial VCFs from multiple samples (c. 400-1000) using PLINK and then imputed them on the Michigan Imputation Server against the TOPMed reference panel (hg38).
However, these datasets now contain 40 million+ sites, and I was wondering what post-imputation QC, if any, I can apply to reduce that number, because calculating PRS for gene-phenotype associations (using PRSice) is taking far too long to run. Previously, with a smaller dataset of around 450k sites, I split the sites into smaller chunks (roughly one per gene) and ran them in an embarrassingly parallel way on a computing cluster, but I calculated that doing the same with the current dataset would mean maxing out the cluster for an extended period, which would make me very unpopular with the other users.
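For reference, this is a minimal sketch of how I did the chunking previously, assuming a bgzipped, tabix-indexed VCF and a BED-like file of gene regions (`gene_regions.bed` is my own file, with columns chrom/start/end/gene):

```bash
# Split the imputed VCF into per-gene chunks for parallel PRS runs.
# Assumes imputed.chr_all.vcf.gz is bgzipped and tabix-indexed.
mkdir -p chunks
while read -r chrom start end gene; do
    bcftools view -r "${chrom}:${start}-${end}" \
        -Oz -o "chunks/${gene}.vcf.gz" imputed.chr_all.vcf.gz
done < gene_regions.bed
```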
Is there a best practice for this kind of QC? I can't change the tools I'm using to run the PRS, as this will be an input into a meta-analysis with other data that uses the same protocol. Failing that, is there any other way I could reduce the number of sites I'm looking at to speed up the calculation?
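For concreteness, the kind of filter I had in mind is something like the below, dropping poorly imputed and rare sites with bcftools (TOPMed/Minimac4 output stores the imputation quality score in INFO/R2; the specific R2 and MAF thresholds here are guesses on my part, which is partly what I'm asking about):

```bash
# Keep only well-imputed (R2 > 0.8) and common (MAF >= 1%) sites.
# Thresholds are placeholders pending advice on best practice.
bcftools view -i 'INFO/R2 > 0.8' -q 0.01:minor \
    -Oz -o imputed.filtered.vcf.gz imputed.chr_all.vcf.gz
tabix -p vcf imputed.filtered.vcf.gz
```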