Question

IMPUTE2 minimal number of SNPs per chunk

3.4 years ago

nhaus ▴ 420

I am using an Illumina Omni 2.5 genotyping array and plan on using IMPUTE2 to perform imputation.

The IMPUTE2 documentation recommends processing whole chromosomes in chunks of ~5 Mb. However, I did not find any information regarding the minimum number of SNPs that should be present per chunk.

It feels "wrong" to impute tens of thousands of genotypes from the reference if there are only ~100 SNPs in the chunk that I am analyzing.

Is my gut feeling just wrong here or can you tell me any recommendations on how to deal with this?

Cheers!

impute gwas • 1.4k views

ADD COMMENT • link updated 3.4 years ago by LauferVA 4.5k • written 3.4 years ago by nhaus ▴ 420

Tip - there are many faster and more memory-efficient imputation algorithms than IMPUTE2 - check out Beagle 5 or IMPUTE5 - they will be much easier to use.

ADD REPLY • link 3.4 years ago by 4galaxy77 2.9k

Thanks! I will do that

ADD REPLY • link 3.4 years ago by nhaus ▴ 420

score 2 · Accepted Answer · 2021-07-02

IMPUTE2 is no longer considered state-of-the-art.
Regarding how many variants you can impute: if I were in your shoes, I would start by reading about linkage disequilibrium (LD) and how imputation algorithms actually work. Fact is, thousands of SNPs can be in strong LD with one another; the HLA region is a prime example. Thus, if you can impute rs12345 accurately, you can impute any number of its 'buddies' (other SNPs in perfect LD with it) just as accurately. For related reasons, it is more or less standard for around 90% of the SNPs in your final dataset to be imputed when starting with DNA microarray data.
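To make the 'buddies' point concrete, here is a minimal sketch of the usual r^2 measure of LD between two biallelic SNPs, computed from phased 0/1 haplotypes. The function name `r_squared` and the toy haplotypes are made up for illustration; this is not how any particular imputation tool computes LD internally.

```python
def r_squared(hap_a, hap_b):
    """Squared allelic correlation (r^2) between two biallelic SNPs.

    hap_a, hap_b: equal-length lists of 0/1 alleles, one entry per
    phased haplotype (toy input format, assumed for this sketch).
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype lists must be non-empty and equal length")
    p_a = sum(hap_a) / n                                   # freq of allele 1 at SNP A
    p_b = sum(hap_b) / n                                   # freq of allele 1 at SNP B
    p_ab = sum(a & b for a, b in zip(hap_a, hap_b)) / n    # freq of the 1-1 haplotype
    d = p_ab - p_a * p_b                                   # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")                                # a monomorphic SNP has no defined r^2
    return d * d / denom


# Toy data: 'buddy' copies the typed SNP exactly, 'unlinked' is independent of it.
typed = [0, 1, 1, 0, 1, 0, 0, 1]
buddy = [0, 1, 1, 0, 1, 0, 0, 1]
unlinked = [0, 0, 1, 1, 0, 0, 1, 1]

print(r_squared(typed, buddy))     # perfect LD
print(r_squared(typed, unlinked))  # no LD
```

A SNP with r^2 close to 1 to a typed SNP carries essentially the same information, which is why a handful of well-typed markers can support many imputed ones.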
The number of genotyped SNPs per block is in many contexts an important predictor of accuracy. If the LD estimates for a block are bad, you will find this out on the back end by looking at imputation accuracy. Google imputation accuracy and start reading. You'll want to exclude variants with poor imputation accuracy, as the estimates you are getting for them aren't reliable. Pretty much every protocol under the sun gives a cutoff for imputation accuracy. For example, see Anderson CA 2010 Nature Protocols. The way I would approach it is: start with their recommendation. If, after you're done, you scan your results and imputation accuracy for one 5 Mb block is really low, and that block also has few genotyped SNPs, then re-run that area with a larger window.
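The "scan your results, then re-run sparse low-accuracy blocks" step could be sketched roughly as follows. Everything here is an assumption for illustration: the function names, the per-chunk mean accuracy input, the `widen_bp` padding, and especially the `min_snps` and `min_info` thresholds, which should come from whatever protocol you follow rather than from this example.

```python
from collections import Counter


def snps_per_chunk(positions, chunk_bp=5_000_000):
    """Count genotyped SNPs per fixed-width chunk.

    Chunk i covers base pairs i*chunk_bp + 1 through (i + 1)*chunk_bp.
    """
    return Counter((pos - 1) // chunk_bp for pos in positions)


def chunks_to_rerun(positions, mean_info, min_snps, min_info,
                    chunk_bp=5_000_000, widen_bp=2_500_000):
    """Return widened (start, end) intervals for chunks that are both
    sparse (fewer than min_snps typed SNPs) and poorly imputed
    (mean accuracy below min_info).

    mean_info: dict mapping chunk index -> mean imputation accuracy,
    assumed to have been computed from your own imputation output.
    """
    counts = snps_per_chunk(positions, chunk_bp)
    rerun = []
    for chunk, info in sorted(mean_info.items()):
        if counts.get(chunk, 0) < min_snps and info < min_info:
            start = max(1, chunk * chunk_bp + 1 - widen_bp)  # pad left, clamp at 1
            end = (chunk + 1) * chunk_bp + widen_bp          # pad right
            rerun.append((start, end))
    return rerun


# Toy example: chunk 0 has 3 typed SNPs and low accuracy, chunk 1 has 5 and high accuracy.
typed_positions = [100, 2_000_000, 5_000_000,
                   5_000_001, 6_000_000, 7_000_000, 8_000_000, 9_000_000]
accuracy = {0: 0.55, 1: 0.95}

# Thresholds below are placeholders, not recommendations.
print(chunks_to_rerun(typed_positions, accuracy, min_snps=4, min_info=0.8))
```

With real array data the per-chunk counts will be in the hundreds or thousands; the toy numbers are small only to keep the example readable.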