I have a number of population-level datasets where I QC'd the initial VCFs from multiple samples (c. 400-1000) using PLINK and then imputed them on the Michigan Imputation Server against the TOPMed reference panel (hg38).
However, these datasets now contain 40 million+ sites, and I was wondering what post-imputation QC, if any, I can apply to reduce that number, because calculating PRS for gene-phenotype associations (using PRSice) is taking far too long to run. Previously, with a smaller dataset of around 450k sites, I split the sites into smaller chunks (roughly one per gene) and ran them in an embarrassingly parallel way on a computing cluster, but I calculated that doing the same with the current dataset would mean maxing out the cluster for an extended period, which would make me very unpopular with the other users.
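For reference, this is a minimal sketch of how I did the chunking previously, assuming a bgzipped, tabix-indexed VCF and a BED-like file of gene regions (`gene_regions.bed` is my own file, with columns chrom/start/end/gene):

```bash
# Split the imputed VCF into per-gene chunks for parallel PRS runs.
# Assumes imputed.chr_all.vcf.gz is bgzipped and tabix-indexed.
mkdir -p chunks
while read -r chrom start end gene; do
    bcftools view -r "${chrom}:${start}-${end}" \
        -Oz -o "chunks/${gene}.vcf.gz" imputed.chr_all.vcf.gz
done < gene_regions.bed
```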
Is there a best practice for this kind of QC? I can't change the tools I'm using to run the PRS, as this will be an input into a meta-analysis with other data that uses the same protocol. Failing that, is there any other way I could reduce the number of sites I'm looking at to speed up the calculation?
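For concreteness, the kind of filter I had in mind is something like the below, dropping poorly imputed and rare sites with bcftools (TOPMed/Minimac4 output stores the imputation quality score in INFO/R2; the specific R2 and MAF thresholds here are guesses on my part, which is partly what I'm asking about):

```bash
# Keep only well-imputed (R2 > 0.8) and common (MAF >= 1%) sites.
# Thresholds are placeholders pending advice on best practice.
bcftools view -i 'INFO/R2 > 0.8' -q 0.01:minor \
    -Oz -o imputed.filtered.vcf.gz imputed.chr_all.vcf.gz
tabix -p vcf imputed.filtered.vcf.gz
```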