Question

A Shift in PCA Plot for Population Stratification

9

10.7 years ago

User 1933 ▴ 360

I want to compare a cohort of patients with the 1000 Genomes (1KG) samples. First, I converted both VCF files to PED format:

vcftools --gzvcf 1000g/1000g_myvariants.vcf.gz --plink --out 1000g_myvariants
vcftools --vcf myvariants.vcf --plink --out myvariants

Then I extracted the variants that have an rs ID:

grep -o 'rs[0-9]*' 1000g_myvariants.map > rs.snplist.raw

sorted them and removed the duplicates:

sort rs.snplist.raw | uniq > rs.snplist.dedup

Then I excluded the variants whose allele codes did not match between the two datasets (listed in all.missnp):

plink --file myvariants --extract rs.snplist.dedup --exclude all.missnp --recode --out myvariants.subset
plink --file 1000g_myvariants --extract rs.snplist.dedup --exclude all.missnp --recode --out 1000g_myvariants.subset
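One thing worth checking at this step: the rs list above is built from the 1000 Genomes .map file only, so it does not by itself guarantee that each ID is also present in the cohort. A minimal sketch for listing the rs IDs found in both .map files, using only standard tools, could look like this. The helper name shared_rs_ids and the toy files are made up for illustration; in practice the two arguments would be 1000g_myvariants.map and myvariants.map.

```sh
# Print rs IDs that occur in BOTH PLINK .map files (column 2 = variant ID).
# shared_rs_ids is a hypothetical helper name, not a PLINK feature.
shared_rs_ids() {
    awk 'NR == FNR { if ($2 ~ /^rs[0-9]+$/) seen[$2] = 1; next }
         ($2 in seen) && !printed[$2]++ { print $2 }' "$1" "$2" | sort
}

# Toy .map files (chromosome, ID, genetic distance, bp position) for a dry run.
demo_dir=$(mktemp -d)
printf '1\trs111\t0\t1000\n1\trs222\t0\t2000\n1\t.\t0\t3000\n' > "$demo_dir/a.map"
printf '1\trs222\t0\t2000\n1\trs333\t0\t4000\n1\trs222\t0\t2000\n' > "$demo_dir/b.map"

shared_rs_ids "$demo_dir/a.map" "$demo_dir/b.map" > "$demo_dir/shared.snplist"
cat "$demo_dir/shared.snplist"
```

The resulting file could be passed to --extract in place of rs.snplist.dedup.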

Finally, I merged them:

plink --file 1000g_myvariants.subset --merge myvariants.subset.ped myvariants.subset.map --recode --out all

Then I created the MDS plot:

plink --file all --genome --out all
plink --file all --read-genome all.genome --cluster --mds-plot 2 --out all_mds_2

and plotted component 2 versus component 1 in R:

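# read the MDS coordinates written by the PLINK command above
tab = read.table("all_mds_2.mds", header = TRUE)
# assumes the first 1212 rows are 1KG samples and the remaining 285 are the cohort
tab$pop = factor(c(rep("1KG", 1212), rep("mycohort", 285)))
plot(tab$C1, tab$C2, col = as.integer(tab$pop), xlab = "component 1", ylab = "component 2")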
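The rep(...) labels above assume that the first 1212 rows of the MDS file are the 1KG samples. A more defensive alternative is to derive the label from the sample IDs in each .ped file, sketched below. label_mds is a hypothetical helper, the toy files stand in for the real ones, and the sketch assumes the IID column (column 2) is unique across both datasets.

```sh
# Append a "pop" column to a PLINK .mds table by looking up each IID
# in the .ped files of the two datasets (column 2 of a .ped is the IID).
# Usage: label_mds 1kg.ped cohort.ped all.mds
label_mds() {
    awk -v kg="$1" -v cohort="$2" '
        FILENAME == kg     { pop[$2] = "1KG"; next }
        FILENAME == cohort { pop[$2] = "mycohort"; next }
        FNR == 1           { print $0, "pop"; next }    # .mds header line
                           { print $0, (($2 in pop) ? pop[$2] : "unknown") }
    ' "$1" "$2" "$3"
}

# Toy inputs: one sample per dataset, listed in the .mds in "unexpected" order.
mds_demo=$(mktemp -d)
printf 'F1 NA0001 0 0 0 -9 A A\n' > "$mds_demo/kg.ped"
printf 'F2 P001 0 0 0 -9 A G\n' > "$mds_demo/cohort.ped"
printf ' FID IID SOL C1 C2\n F2 P001 0 0.02 0.01\n F1 NA0001 0 -0.01 0.03\n' > "$mds_demo/all.mds"

label_mds "$mds_demo/kg.ped" "$mds_demo/cohort.ped" "$mds_demo/all.mds" > "$mds_demo/all.labelled.mds"
cat "$mds_demo/all.labelled.mds"
```

In R, the labelled table could then be read with read.table(..., header = TRUE) and tab$pop used directly as the colour factor. On real .ped files awk has to split very long lines, so this is slow but should still work.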

Here is what the result looks like:

[Figure: MDS plot, component 2 versus component 1]

Basically, there is a shift between the two groups, and I am curious what the reason could be. Do I have to filter more SNPs to get a proper match? Are there tools other than PLINK for running PCA?

Are the 1000 Genomes variants somehow normalized while the other cohort's are not?

pca exome-sequencing • 6.8k views

ADD COMMENT • link updated 23 months ago by Ram 44k • written 10.7 years ago by User 1933 ▴ 360

0

Are the black dots individuals from 1000genomes? Which dataset are you using, exactly? Check if the separate groups are due to different sequencing technology.

ADD REPLY • link 10.7 years ago by Giovanni M Dall'Olio 28k

0

Yes, the 1212 individuals from 1KG are shown in black in the plot. The 1KG sequencing was done with both Illumina and ABI platforms; I feel I should have normalized the two cohorts separately somehow beforehand.

ADD REPLY • link 10.7 years ago by User 1933 ▴ 360

0

Have you tried doing factor analysis to see which SNPs are underlying this? You could also just look at the rotated data (if you were to do the PCA with the prcomp() function in R, this would be output$x). That's the next thing I would try.

ADD REPLY • link 10.7 years ago by Devon Ryan 104k

Ram · Answer 1 · 2015-06-12

2

9.5 years ago

Zhenyu Zhang ★ 1.2k

Hi. I am wondering whether you have figured out the reason. I recently ran a similar smartpca analysis and also saw this kind of shift between the populations.

ADD COMMENT • link updated 23 months ago by Ram 44k • written 9.5 years ago by Zhenyu Zhang ★ 1.2k

2

OK, let me answer myself. I filtered the data on HWE (Hardy-Weinberg equilibrium), and the plot looks fine now.
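For anyone wanting to try the same kind of filter on the PLINK files from the question, a minimal sketch follows. This reply was based on smartpca and does not say which threshold or options were used, so the 1e-6 cutoff, the output prefixes, and the helper names run and hwe_then_mds are all assumptions; --hwe, --genome, --read-genome, --cluster and --mds-plot are standard PLINK 1.x options. With DRY_RUN=1 the commands are only printed, not executed.

```sh
# Run a command, or just print it when DRY_RUN=1 (useful for checking the plan).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# hwe_then_mds PREFIX [THRESHOLD]
# Drops variants failing the HWE test, then recomputes IBS distances and MDS.
# The default of 1e-6 is an illustrative assumption, not a value from this thread.
hwe_then_mds() {
    prefix=$1
    threshold=${2:-1e-6}
    run plink --file "$prefix" --hwe "$threshold" --recode --out "$prefix.hwe" &&
    run plink --file "$prefix.hwe" --genome --out "$prefix.hwe" &&
    run plink --file "$prefix.hwe" --read-genome "$prefix.hwe.genome" \
        --cluster --mds-plot 2 --out "$prefix.hwe_mds_2"
}

# Print the three commands for the merged dataset "all" without running PLINK.
DRY_RUN=1 hwe_then_mds all
```

One caveat: an HWE test on a merged multi-population dataset can also flag variants whose frequencies genuinely differ between populations, so a very strict cutoff may remove real structure along with technical artefacts.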

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.5 years ago by Zhenyu Zhang ★ 1.2k

1

Also, you have to combine the 1KG data with yours and run a single PCA, not two separate ones.

ADD REPLY • link 9.3 years ago by Quak ▴ 520