Hi, I am trying to use the 1000 Genomes SNP data together with the common SNPs from my exome sequencing project to perform principal component analysis (PCA). I have generated a combined PLINK binary file containing my data and the 1000 Genomes SNP data, and I am using the R package SNPRelate to run the PCA. Unfortunately, regardless of the LD threshold I use to generate a pruned SNP set, my samples do not cluster with any of the 1000 Genomes population groups. In fact, they always cluster around the (0,0) point in the PCA plot. Does anyone know why this might be happening, and do you have any suggestions as to how this should be done? Sincere thanks in advance for any suggestions.
Yes. What allele frequency cutoff are you using? See the PCA plots in the 1000 Genomes (1kG) publications.
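
To make the allele frequency question concrete, here is a minimal sketch in plain Python (not SNPRelate code) of what a minor allele frequency (MAF) cutoff does to a SNP set. Everything in it is an assumption for illustration: the 0/1/2 genotype coding with `None` for missing calls, the function names, the toy SNP IDs, and the 0.05 cutoff, which is only an example value and not a recommendation from this thread.

```python
def minor_allele_frequency(genotypes):
    """Return the MAF of one SNP.

    Genotypes are assumed to be coded as the count of one allele
    (0, 1 or 2) per sample, with None marking a missing call.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        # No usable calls: treat the SNP as uninformative.
        return 0.0
    allele_freq = sum(called) / (2 * len(called))
    return min(allele_freq, 1 - allele_freq)


def filter_by_maf(snp_genotypes, min_maf=0.05):
    """Return the IDs of SNPs whose MAF is at least min_maf."""
    return [
        snp_id
        for snp_id, genotypes in snp_genotypes.items()
        if minor_allele_frequency(genotypes) >= min_maf
    ]


# Hypothetical toy data: one common, one rare and one monomorphic SNP.
toy_genotypes = {
    "rs_common": [0, 1, 2, 1, 0, 1, None, 2],
    "rs_rare": [0] * 19 + [1],
    "rs_mono": [2] * 8,
}

kept = filter_by_maf(toy_genotypes, min_maf=0.05)
print(kept)
```

With the toy data only the common SNP survives the cutoff. In SNPRelate itself, as far as I know, the corresponding filter is the `maf` argument of `snpgdsLDpruning` and `snpgdsPCA`, so it is worth checking what value you pass there (or whether you filter at all) before pruning.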