I have a VCF file containing genotype data for a few thousand SNPs across a few thousand samples. I would first like to convert this to a matrix (possibly using the VariantAnnotation package), then run a PCA on the samples, followed by some sort of clustering algorithm. I have very little experience with SNP matrix packages, PCA or clustering algorithms, so I was wondering if anyone knows of any good tutorials that could help me.
It is also worth noting that, due to the nature of the analysis I am running, the SNP matrix will be extremely sparse, with many missing genotypes. I would therefore also like to get the fraction of missing genotypes for each sample and the fraction of samples with a missing genotype for each SNP. Is this possible?
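
To make the second request concrete, here is a rough sketch in plain Python of the kind of summary I have in mind. It is only an illustration under some assumptions: the function names `read_genotype_matrix` and `missing_fractions` are made up for this post, the tiny VCF is invented demo data, every record is assumed to carry a `GT` field, and any non-reference allele is counted as "alternate". A dedicated package would presumably do this better on a real file.

```python
import io


def read_genotype_matrix(handle):
    """Parse a VCF stream into (samples, snp_ids, matrix).

    matrix[i][j] is the number of non-reference alleles of SNP i in
    sample j (0, 1 or 2 for diploid calls), or None if the call is missing.
    Hypothetical helper: assumes tab-separated records with a GT field.
    """
    samples, snp_ids, matrix = [], [], []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            continue
        snp_ids.append(f"{fields[0]}:{fields[1]}")  # label SNP as CHROM:POS
        gt_index = fields[8].split(":").index("GT")
        row = []
        for entry in fields[9:]:
            parts = entry.split(":")
            gt = parts[gt_index] if gt_index < len(parts) else "."
            alleles = gt.replace("|", "/").split("/")
            if any(a in (".", "") for a in alleles):
                row.append(None)  # treat partially missing calls as missing
            else:
                row.append(sum(a != "0" for a in alleles))
        matrix.append(row)
    return samples, snp_ids, matrix


def missing_fractions(matrix):
    """Return (per_sample, per_snp) fractions of missing genotypes.

    Assumes at least one SNP and one sample.
    """
    n_snps = len(matrix)
    n_samples = len(matrix[0])
    per_snp = [sum(g is None for g in row) / n_samples for row in matrix]
    per_sample = [
        sum(row[j] is None for row in matrix) / n_snps
        for j in range(n_samples)
    ]
    return per_sample, per_snp


# Invented demo data: 2 SNPs x 3 samples.
demo_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n"
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t./.\t1/1\n"
    "1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t./.\t./.\t0/0\n"
)

samples, snp_ids, matrix = read_genotype_matrix(io.StringIO(demo_vcf))
per_sample, per_snp = missing_fractions(matrix)
for name, frac in zip(samples, per_sample):
    print(f"sample {name}: {frac:.2f} of genotypes missing")
for snp, frac in zip(snp_ids, per_snp):
    print(f"SNP {snp}: {frac:.2f} of samples missing")
```

The resulting matrix (SNPs as rows, samples as columns, `None` for missing calls) is also roughly the input I would expect to need for the PCA step, after deciding how to handle the missing entries.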