When a principal component analysis (PCA) is done on genome-wide SNP data, how should missing genotypes be handled?
Naively I can think of two approaches: i) Drop every marker with any missing data, but this loses too much data with a big cohort of samples and relatively random genotyping failure. ii) Set each missing genotype to the average of that marker over the samples where it is present (assuming each marker is coded as 0, 1, 2).
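To make approach (ii) concrete, here is a minimal sketch, assuming NumPy and a samples-by-markers matrix coded 0/1/2 with NaN marking a missing call. The function names `mean_impute` and `pca_scores` are illustrative, not from any genetics package, and the PCA is a plain SVD of the column-centered matrix with no allele-frequency scaling.

```python
import numpy as np

def mean_impute(genotypes):
    """Fill each missing call (NaN) with the mean of that marker's observed calls.

    genotypes: (n_samples, n_markers) array coded 0/1/2, NaN = missing.
    Assumes every marker has at least one observed call; a marker that is
    entirely missing would stay NaN and should be dropped beforehand.
    """
    g = np.array(genotypes, dtype=float)      # work on a copy
    marker_means = np.nanmean(g, axis=0)      # per-marker mean, ignoring NaN
    rows, cols = np.where(np.isnan(g))
    g[rows, cols] = marker_means[cols]
    return g

def pca_scores(genotypes, n_components=2):
    """Mean-impute, center each marker, and return the leading PC scores."""
    g = mean_impute(genotypes)
    g -= g.mean(axis=0)                        # imputed cells become exactly 0
    u, s, _ = np.linalg.svd(g, full_matrices=False)
    return u[:, :n_components] * s[:n_components]
```

Because the matrix is centered after imputation, every imputed cell ends up at zero, so a missing genotype contributes nothing to that sample's position along any component.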
Is approach (ii) reasonable? What would be better approaches?
Is KNN imputation considered appropriate for genotype data and its typical structure? There is much research effort going into genotype imputation, but I am looking for the simplest thing that could possibly work to get my data into a PCA for a first pass. It sounds like the danger with using the average is that it will be biased when data isn't missing at random. Provided I'm using a lot of markers (thousands or more) and each marker has only a small percentage of missing genotypes, do I risk much bias?
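The "small percentage of missing genotypes" condition can be checked before imputing anything. The sketch below assumes the same NaN-coded samples-by-markers NumPy matrix. The helper names and the 5% default cutoff are placeholders chosen for illustration, not recommended values.

```python
import numpy as np

def missing_rate_per_marker(genotypes):
    """Fraction of samples with a missing call (NaN) at each marker."""
    g = np.asarray(genotypes, dtype=float)
    return np.isnan(g).mean(axis=0)

def keep_markers(genotypes, max_missing=0.05):
    """Keep only markers whose missing rate is at most max_missing.

    Returns the filtered matrix and the boolean mask of kept markers.
    The 0.05 default is an arbitrary placeholder.
    """
    g = np.asarray(genotypes, dtype=float)
    keep = missing_rate_per_marker(g) <= max_missing
    return g[:, keep], keep
```

This sits between the two approaches: markers with heavy missingness are dropped as in (i), and the sparse gaps that remain can be filled as in (ii).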