How To Deal With Missing Genotypes In Population PCA Analysis
3
2
12.1 years ago
Alex Stoddard ▴ 190

When a principal component analysis is done on genome-wide SNP data, how should missing genotypes be handled?

Naively, I can think of two approaches: (i) Drop the markers with any missing data, but this loses too much data with a big cohort of samples and relatively random genotyping failure. (ii) Set each missing genotype to the average of the samples present for that marker (assuming each marker is coded as 0, 1, 2).

Is approach (ii) reasonable? What would be better approaches?
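For concreteness, approach (ii) could look roughly like the sketch below. It assumes a NumPy matrix with samples in rows, markers in columns, genotypes coded 0/1/2 and missing calls stored as NaN; the function names `mean_impute` and `pca_scores` are made up for illustration, and the PCA is a plain SVD of the column-centered matrix rather than any particular population-genetics tool.

```python
import numpy as np

def mean_impute(genotypes):
    """Replace NaN entries with the per-marker (column) mean of the observed calls."""
    g = np.asarray(genotypes, dtype=float)
    col_means = np.nanmean(g, axis=0)  # NaN (with a warning) if a marker has no calls at all
    return np.where(np.isnan(g), col_means, g)

def pca_scores(genotypes, n_components=2):
    """Sample scores on the leading principal components via SVD."""
    centered = genotypes - genotypes.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Toy data: 4 samples x 4 markers, NaN marks a failed genotype call.
toy = np.array([
    [0, 1, 2, np.nan],
    [1, np.nan, 2, 0],
    [2, 1, np.nan, 0],
    [0, 2, 1, 1],
])

filled = mean_impute(toy)
scores = pca_scores(filled)
print(filled)
print(scores)
```

One property worth noting: filling with the marker mean leaves that marker's mean unchanged, so after centering every imputed entry is exactly zero and contributes nothing to that marker's column.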

pca genomics population • 6.3k views
5
12.1 years ago

The process of substituting a reasonable guess for missing data is called imputation and is fairly common practice for large data sets. Packages for performing imputation (using a k-nearest neighbors approach, for example) are available in R. I haven't used any of them recently so I can't comment on which one you should pick.

4
12.1 years ago
brentp 24k

How many markers do you lose if you drop those with any missing data?
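A quick way to answer that on your own data might be something like the following. This is only a sketch under the same assumptions as the question (samples in rows, markers in columns, 0/1/2 coding) plus the assumption that missing calls are stored as NaN; `missingness_summary` and `drop_sparse_markers` are hypothetical helper names.

```python
import numpy as np

def missingness_summary(genotypes):
    """Return the per-marker missing rate and the number of fully genotyped markers."""
    g = np.asarray(genotypes, dtype=float)
    per_marker_rate = np.isnan(g).mean(axis=0)
    n_complete = int((per_marker_rate == 0).sum())
    return per_marker_rate, n_complete

def drop_sparse_markers(genotypes, max_missing_rate):
    """Keep only markers whose missing rate is at most max_missing_rate."""
    g = np.asarray(genotypes, dtype=float)
    rates = np.isnan(g).mean(axis=0)
    return g[:, rates <= max_missing_rate]

# Toy data: 4 samples x 5 markers, NaN marks a failed genotype call.
toy = np.array([
    [0, 1, 2, np.nan, 1],
    [1, np.nan, 2, 0, 0],
    [2, 1, 2, 0, np.nan],
    [0, 2, 1, 1, 2],
])

rates, n_complete = missingness_summary(toy)
print(f"{n_complete} of {toy.shape[1]} markers have no missing calls")
print("per-marker missing rate:", rates)

strict = drop_sparse_markers(toy, max_missing_rate=0.0)    # complete markers only
lenient = drop_sparse_markers(toy, max_missing_rate=0.3)   # tolerate some missingness
print(strict.shape, lenient.shape)
```

Comparing the strict and lenient shapes gives a feel for how much is lost by insisting on complete markers versus tolerating a small missing rate and imputing the rest.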

You can set the missing markers to some value, but you may run into problems if there is bias in the missing data. As @Eugen says, inferring a value from KNN would be better than an average.

There's a very simple-to-use R package that will do the imputation for you using KNN: http://www.bioconductor.org/packages/release/bioc/html/impute.html
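To illustrate the general idea only (this is not the algorithm used by the `impute` package, which is R and has its own conventions), here is a small NumPy sketch. It assumes samples in rows, markers in columns, NaN for missing calls, and fills each missing genotype with the average of that marker in the `k` most similar samples that do have a call. The function name `knn_impute` and the distance choice (mean squared difference over markers observed in both samples) are assumptions made for this example.

```python
import numpy as np

def knn_impute(genotypes, k=2):
    """Fill NaNs with the mean call of the k nearest samples that have the marker observed."""
    g = np.asarray(genotypes, dtype=float)
    filled = g.copy()
    for i in range(g.shape[0]):
        missing_cols = np.flatnonzero(np.isnan(g[i]))
        if missing_cols.size == 0:
            continue
        # Distance from sample i to every sample, using only markers observed in both.
        diff = g - g[i]
        shared = ~np.isnan(diff)
        n_shared = shared.sum(axis=1)
        sq = np.where(shared, diff, 0.0) ** 2
        with np.errstate(invalid="ignore", divide="ignore"):
            dist = sq.sum(axis=1) / n_shared
        dist[n_shared == 0] = np.inf  # no overlap, cannot compare
        dist[i] = np.inf              # never use the sample itself
        for j in missing_cols:
            donors = np.flatnonzero(~np.isnan(g[:, j]) & np.isfinite(dist))
            if donors.size == 0:
                continue  # no usable neighbour; leave the entry as NaN
            nearest = donors[np.argsort(dist[donors], kind="stable")[:k]]
            filled[i, j] = g[nearest, j].mean()
    return filled

# Toy data: two obvious groups of samples, one missing call in each group.
toy = np.array([
    [0, 0, 0, 0, np.nan],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [2, 2, 2, 2, 2],
    [2, 2, 2, np.nan, 2],
    [2, 1, 2, 2, 2],
])

filled = knn_impute(toy, k=2)
print(filled)
```

On this toy matrix the neighbour-based fill follows the group each sample belongs to, whereas a marker-wide average would pull both missing entries toward the middle. The averaged values are not rounded to 0/1/2, which is usually fine as input to a PCA.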

1

Is KNN considered appropriate for genotype data and its typical structure? There is much research effort in doing genotype imputation. I am looking for the simplest thing that could possibly work to get my data into a PCA for a first pass. It sounds like the danger with using the average is that it will be biased when data isn't missing at random. Provided I'm using a lot of markers (1000s+) and each marker has only a small percentage of missingness, do I risk much bias?
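One rough way to probe that on real data (a suggestion for a sanity check, not something established in this thread) is to mean-impute, run the PCA, and then see whether each sample's missing rate correlates with its position on the leading components. A strong correlation would hint that missingness itself is shaping the result. The sketch assumes samples in rows, markers in columns, NaN for missing calls; `missingness_pc_correlation` is a hypothetical helper and the toy data are simulated with missingness completely at random.

```python
import numpy as np

def missingness_pc_correlation(genotypes, n_components=2):
    """Correlate per-sample missing rate with scores on the leading PCs after mean imputation."""
    g = np.asarray(genotypes, dtype=float)
    sample_missing = np.isnan(g).mean(axis=1)

    # Mean-impute each marker, then PCA via SVD of the centered matrix.
    col_means = np.nanmean(g, axis=0)
    filled = np.where(np.isnan(g), col_means, g)
    centered = filled - filled.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]

    return np.array([
        np.corrcoef(sample_missing, scores[:, c])[0, 1]
        for c in range(n_components)
    ])

# Simulated data: 40 samples x 200 markers, about 2% of calls missing at random.
rng = np.random.default_rng(0)
toy = rng.integers(0, 3, size=(40, 200)).astype(float)
toy[rng.random(toy.shape) < 0.02] = np.nan

corr = missingness_pc_correlation(toy)
print("correlation of missing rate with PC1, PC2:", corr)
```

If every sample happens to have exactly the same missing rate the correlation is undefined (NaN), so this check is only informative when missingness varies between samples.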

1
12.1 years ago
zx8754 12k

AISNPs?

Analyses of a set of 128 ancestry informative single-nucleotide polymorphisms in a global set of 119 population samples http://www.investigativegenetics.com/content/2/1/1

