I have a very sparse SNP matrix (~90% missing genotypes per sample and ~90% missing samples per SNP) on which I would like to perform some sort of probabilistic PCA. I have been using the VariantAnnotation package to get my snpMatrix object, and I originally tried to mimic the method shown here (https://www.bioconductor.org/packages/release/bioc/vignettes/snpStats/inst/doc/pca-vignette.pdf ) with the snpStats package. However, I don't believe this package was intended to work with extremely sparse SNP matrices, and it struggles to correct for missing values within the SNP matrix.
I have tried the ppca function from the pcaMethods package, but have not had much success in finding any clusters of cells. Does anyone have experience working with very sparse matrices for PCA?
What's your goal, i.e. what insights do you hope to get from the probabilistic PCA?
Can you first filter out the sites that always have missing values?
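To make the filtering idea concrete, here is a minimal sketch in plain Python. It is not tied to VariantAnnotation, snpStats, or pcaMethods. The toy matrix, the 0/1/2 coding with None for a missing call, and the names call_rate_per_snp, filter_snps and min_call_rate are all made up for illustration.

```python
# Hypothetical toy genotype matrix: rows are samples, columns are SNPs.
# Genotypes are coded 0/1/2; None marks a missing call (assumed encoding).
genotypes = [
    [0,    None, None, 2],
    [None, None, 1,    None],
    [1,    None, None, None],
]


def call_rate_per_snp(matrix):
    """Return the fraction of non-missing calls for each column (SNP)."""
    n_samples = len(matrix)
    n_snps = len(matrix[0])
    return [
        sum(row[j] is not None for row in matrix) / n_samples
        for j in range(n_snps)
    ]


def filter_snps(matrix, min_call_rate=0.0):
    """Keep SNPs whose call rate is strictly above min_call_rate.

    With the default of 0.0, only SNPs missing in every sample are dropped.
    """
    rates = call_rate_per_snp(matrix)
    kept = [j for j, rate in enumerate(rates) if rate > min_call_rate]
    filtered = [[row[j] for j in kept] for row in matrix]
    return filtered, kept


filtered, kept_columns = filter_snps(genotypes)
print(kept_columns)  # indices of SNPs observed in at least one sample
```

Raising min_call_rate (for example to 0.05) would also drop sites seen in only a handful of samples. Whether that is appropriate depends on the data, and the same threshold could be applied to rows to remove samples with almost no calls.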