Batch effect in population stratification with 1000 Genomes Project data

3.3 years ago

Alejandro Rojo ▴ 30

Hi all,

I have an exome VCF dataset from which I am trying to produce a PCA plot together with data from the 1000 Genomes Project Phase III (1000G_2504_high_coverage - WGS). Using plink, I converted my dataset into .bed/.bim/.fam and filtered both datasets with --maf 0.1 and --indep 50 5 1.5. I merged both datasets on their common variants and ran a PCA with plink. However, when I plot the results, my samples do not overlap with any data points from the reference panel.

I am using variants from all chromosomes except chromosome X, and I have tried different filtering thresholds in plink, but I still get this batch effect:

[Image: PCA plot, not preserved]

Any suggestions on what I am doing wrong?

Thank you very much for your help!

pca plink 1kg • 867 views

My guess is strand flips. Did you check that the reference alleles are the same in the two datasets?
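One way to run that check, as a rough sketch only: compare the allele pairs of the shared positions in the two .bim files and count how many match, look strand-flipped, or are strand-ambiguous (A/T and C/G SNPs cannot be resolved from alleles alone). The function names and the toy .bim lines below are made up for illustration; this assumes both files are on the same genome build and use the same chromosome naming, and it keys variants by chromosome and position rather than by ID.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def read_bim(lines):
    """Map (chrom, pos) -> (allele1, allele2) from PLINK .bim lines.

    If several variants share a position, the last one wins.
    """
    variants = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue
        chrom, _vid, _cm, pos, a1, a2 = fields[:6]
        variants[(chrom, int(pos))] = (a1.upper(), a2.upper())
    return variants

def classify(study, ref):
    """Classify one variant that is present in both datasets."""
    s, r = set(study), set(ref)
    if any(a not in COMPLEMENT for a in s | r):
        return "non_snp"       # indel, missing allele code ("0"), etc.
    flipped = {COMPLEMENT[a] for a in s}
    if s == flipped:
        return "ambiguous"     # A/T or C/G: strand cannot be told from alleles
    if s == r:
        return "match"
    if flipped == r:
        return "strand_flip"
    return "mismatch"          # different alleles at the same position

def compare(study_bim, ref_bim):
    """Count classifications over positions shared by both datasets."""
    shared = study_bim.keys() & ref_bim.keys()
    return Counter(classify(study_bim[k], ref_bim[k]) for k in shared)

# Toy data (hypothetical); with real files use read_bim(open("study.bim")).
study = read_bim([
    "1 rs1 0 100 A G",
    "1 rs2 0 200 T C",
    "1 rs3 0 300 A T",
    "1 rs4 0 400 A C",
    "2 rs5 0 500 G A",
])
ref = read_bim([
    "1 v1 0 100 G A",
    "1 v2 0 200 A G",
    "1 v3 0 300 A T",
    "1 v4 0 400 A G",
    "2 v6 0 999 G A",
])
print(compare(study, ref))
```

If a noticeable share of the shared positions ends up as strand_flip or mismatch, those variants would need flipping or dropping before the merge; the ambiguous ones are often simply excluded.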

Reply by 4galaxy77, 3.3 years ago

I checked for strand issues, but there were none in either dataset. I also checked and adjusted the reference build, but I still get the same isolated cluster for my dataset. One remark: before working with plink, the exome dataset was joint-genotyped with gl_nexus, then normalized and decomposed. I also imputed it using BEAGLE. Is there a chance that I introduced a technical error before converting the VCF dataset into plink format?
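To narrow down whether an upstream step is responsible, one possible diagnostic (a sketch, not a definitive test) is to compute allele frequencies separately for my samples and for the reference panel with plink --freq, then list the shared variants whose frequencies differ sharply. The function names, the toy .frq lines, and the 0.2 cutoff are assumptions for illustration; the parser assumes the PLINK 1.9 .frq column order (CHR SNP A1 A2 MAF NCHROBS). Real ancestry differences also shift frequencies, so flagged variants are candidates to inspect, not proof of a technical artifact.

```python
def read_frq(lines):
    """Map SNP ID -> (A1, A2, frequency of A1) from PLINK 1.9 .frq lines."""
    freqs = {}
    for line in lines:
        f = line.split()
        if len(f) < 5 or f[0] == "CHR":
            continue               # skip header and malformed lines
        try:
            maf = float(f[4])
        except ValueError:         # "NA" when no genotypes were called
            continue
        freqs[f[1]] = (f[2], f[3], maf)
    return freqs

def frequency_outliers(study, ref, max_diff=0.2):
    """Return (snp, study_freq, ref_freq) where A1 frequencies differ by > max_diff.

    max_diff is an arbitrary illustrative cutoff.
    """
    flagged = []
    for snp in study.keys() & ref.keys():
        s_a1, s_a2, s_f = study[snp]
        r_a1, r_a2, r_f = ref[snp]
        if (s_a1, s_a2) == (r_a2, r_a1):
            r_f = 1.0 - r_f        # A1/A2 swapped: express ref freq for study A1
        elif (s_a1, s_a2) != (r_a1, r_a2):
            continue               # alleles do not correspond; skip
        if abs(s_f - r_f) > max_diff:
            flagged.append((snp, s_f, r_f))
    return sorted(flagged)

# Toy data (hypothetical); with real files use read_frq(open("study.frq")).
header = "CHR SNP A1 A2 MAF NCHROBS"
study_frq = read_frq([
    header,
    "1 rs1 A G 0.10 200",
    "1 rs2 C T 0.30 200",
    "1 rs3 G A 0.45 200",
    "1 rs4 T C NA 0",
])
ref_frq = read_frq([
    header,
    "1 rs1 A G 0.12 5008",
    "1 rs2 T C 0.25 5008",
    "1 rs3 G A 0.40 5008",
    "1 rs4 T C 0.20 5008",
])
for snp, s_f, r_f in frequency_outliers(study_frq, ref_frq):
    print(snp, s_f, r_f)
```

If a large fraction of the variants used for the PCA shows up here, that would point toward something systematic in the calling, normalization, or imputation steps rather than toward ancestry.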

Reply by Alejandro Rojo, 3.3 years ago