I am trying to understand the various ways to perform imputation with the published tools available in the field. My understanding is that imputation is frequently used to fill in missing data when working with SNP arrays. However, what do you do if you have a large variant call file from whole-genome sequencing (WGS) data with less than 1% missing calls (still thousands) after filtering for quality and removing genomes with high missing rates? I also removed variant sites that were 40-77% missing in my cohort. I want to end up with no missing genotypes because I plan to try clustering on these calls, and some metrics cannot handle missing values.
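To make the filtering step concrete, here is a minimal sketch of the per-site missingness filter I mean. It assumes each site is a list of VCF-style GT strings where a missing call contains `.`; the function names and the 0.4 cutoff are hypothetical and only for illustration, not what any particular tool does.

```python
def site_missing_rate(genotypes):
    """Return the fraction of samples with a missing call at one site.

    `genotypes` is a list of VCF-style GT strings such as "0/1" or "./.".
    A call is treated as missing if any allele is "." (an assumption).
    """
    missing = sum(1 for gt in genotypes if "." in gt)
    return missing / len(genotypes)


def filter_sites(sites, max_missing=0.4):
    """Keep only sites whose missing rate is below `max_missing`.

    The 0.4 default is a hypothetical threshold chosen for illustration.
    """
    return [site for site in sites if site_missing_rate(site) < max_missing]


sites = [
    ["0/0", "0/1", "./.", "1/1"],  # 1 of 4 calls missing
    ["./.", "./.", "0/0", "./."],  # 3 of 4 calls missing
]
kept = filter_sites(sites)
print(len(kept))
```

In practice this would be done on the VCF itself with a dedicated tool; the sketch is only meant to show the logic.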
It looks like the simplest way to perform the imputation (short of just filling in the mean, mode, etc.) would be to use Beagle, as it doesn't require a reference panel. According to a recent comparison based on SNP chips, SHAPEIT/IMPUTE2 looks to be the best option when phased reference panels are used. What is the general approach when the missing-call rate is low and the data are WGS?
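For comparison, this is roughly what I mean by the naive mode approach: a minimal sketch, assuming genotypes at one site are already coded as alternate-allele dosages (0, 1, 2) with `None` for a missing call. The function name and the encoding are my own assumptions for illustration. Unlike Beagle or SHAPEIT/IMPUTE2, it ignores linkage between neighbouring sites entirely.

```python
from collections import Counter


def impute_mode(dosages):
    """Replace missing calls at one site with the most common observed dosage.

    `dosages` is a list of 0/1/2 alternate-allele counts, with None marking
    a missing call (a hypothetical encoding chosen for this sketch).
    """
    observed = [d for d in dosages if d is not None]
    if not observed:
        raise ValueError("site has no observed genotypes to impute from")
    # most_common(1) returns a list with the single (value, count) pair
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if d is None else d for d in dosages]


site = [0, None, 0, 1, None]
print(impute_mode(site))
```

With under 1% missing calls this fills every gap, but it biases rare variants toward the major genotype, which is part of why I am asking about the model-based tools.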