Hi all,
I have an exome VCF dataset and I am trying to produce a PCA plot together with data from the 1000 Genomes Project Phase III (1000G_2504_high_coverage - WGS). Using plink, I converted my dataset into .bed/.bim/.fam and filtered both datasets with --maf 0.1 and --indep 50 5 1.5. I then merged the two datasets on their common variants and ran a PCA with plink. However, when I plot the results, my samples do not overlap with any data points from the reference panel.
I am using variants from all chromosomes except chromosome X, and I have tried different filtering thresholds in plink, but I still get this batch effect.
Any suggestions on what I am doing wrong?
Thank you very much for your help!
My guess is strand flips. Did you check that the reference alleles are the same between the two datasets?
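One way to make that check concrete is to compare the allele columns of the two .bim files at shared positions. The following is only a minimal Python sketch, assuming the standard six-column PLINK .bim layout (chromosome, variant ID, cM position, bp position, allele 1, allele 2) and matching variants by chromosome and position. The function names `classify_alleles`, `read_bim` and `compare_bims` are made up for illustration and are not part of plink.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_alleles(a, b):
    """Classify allele pair b relative to allele pair a at the same site."""
    a = tuple(x.upper() for x in a)
    b = tuple(x.upper() for x in b)
    # Indels, multi-base alleles and missing alleles ('0') are not handled here.
    if any(x not in COMPLEMENT for x in a + b):
        return "non_snp"
    # A/T and C/G SNPs look identical on both strands, so a flip cannot be
    # detected from the alleles alone.
    if COMPLEMENT[a[0]] == a[1] and set(a) == set(b):
        return "ambiguous"
    if b == a:
        return "match"
    if b == (a[1], a[0]):
        return "swapped"
    flipped = (COMPLEMENT[b[0]], COMPLEMENT[b[1]])
    if flipped == a:
        return "strand_flip"
    if flipped == (a[1], a[0]):
        return "strand_flip_swapped"
    return "mismatch"


def read_bim(path):
    """Read a .bim file into {(chrom, pos): (allele1, allele2)}.

    Assumption: one record per position; later duplicates overwrite earlier ones.
    """
    sites = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 6:
                continue
            chrom, _variant_id, _cm, pos, a1, a2 = fields[:6]
            sites[(chrom, pos)] = (a1, a2)
    return sites


def compare_bims(path_a, path_b):
    """Count allele-concordance categories over positions shared by both files."""
    sites_a = read_bim(path_a)
    sites_b = read_bim(path_b)
    counts = Counter()
    for key in sites_a.keys() & sites_b.keys():
        counts[classify_alleles(sites_a[key], sites_b[key])] += 1
    return counts
```

Calling `compare_bims("mydata.bim", "1000G.bim")` (hypothetical file names) returns a count per category. Many `strand_flip` or `mismatch` entries would point to an allele or build problem, while a large `ambiguous` count suggests dropping A/T and C/G SNPs before merging.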
I checked for strand issues, but there were none in either dataset. I also checked and adjusted the reference build, but I still get the same isolated cluster for my dataset. One remark: before working with plink, the exome dataset was joint-genotyped with gl_nexus, and then normalized and decomposed. I also imputed it using BEAGLE. Is there a chance that I introduced a technical error before converting the VCF dataset into plink format?