Choice of SNPs for PCA analysis
6.5 years ago
wrab425 ▴ 50

Following advice given on another post, I have extracted principal components (PCs) from a VCF file containing the SNPs from 64 strains of our haploid organism. The VCF was made using the GATK best practices pipeline from a set of gVCFs, and all ran smoothly. I converted it into PLINK format files using VCFtools, extracted unlinked SNPs using the plink --indep-pairwise command, and then extracted the PCs using --pca in plink_9. All in all straightforward, and a recommended route, but when I extract different SNP sets according to different parameters in --indep-pairwise, I get very different PCs! Please can somebody provide a guide to parameter choice for selecting the unlinked SNPs? Otherwise it would seem that this standard procedure has a large arbitrary component. My choices have been 5000 10 0.25 and 1500 10 0.3. I note that some folk use r-squared criteria as high as 0.5, which would include weakly linked SNPs. Why, I wonder?

Plink PCA population genetics indep-pairwise • 2.0k views
6.5 years ago

In my tutorial (Produce PCA bi-plot for 1000 Genomes Phase III in VCF format), I use --indep 50 5 1.5.
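One thing worth keeping in mind when comparing this with --indep-pairwise settings: the third value given to --indep is a variance inflation factor (VIF) threshold, not an r-squared threshold. VIF is defined as 1/(1 - R^2), where R^2 is the multiple correlation of a SNP regressed on the other SNPs in the window, so it is not the same quantity as the pairwise r-squared. The small sketch below only converts between the two scales; the function names are my own and are not part of PLINK.

```python
def vif_to_r2(vif):
    """Multiple-correlation R^2 implied by a variance inflation factor."""
    if vif < 1.0:
        raise ValueError("VIF is at least 1 by definition")
    return 1.0 - 1.0 / vif


def r2_to_vif(r2):
    """Variance inflation factor implied by a multiple-correlation R^2."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r2)


# A VIF threshold of 1.5 corresponds to a multiple R^2 of about one third.
print(round(vif_to_r2(1.5), 3))
```

Because the multiple R^2 takes all SNPs in the window into account at once, treat this only as a rough guide to how the two commands relate.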

In a nutshell, the choice of these parameters hinges largely on the following factors:

  • sample size (n)
  • sample ethnicity / ethnicities, and the respective proportions of each sub-group
  • genotyping density
  • exact positions genotyped

The choice will also depend on what the goal of the particular study is.

For these reasons, you will not find any standard for these settings.
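To see why the retained SNP set moves with the settings, here is a minimal sketch in Python. It is a simplified greedy pruning on simulated haploid (0/1) genotypes, not PLINK's actual algorithm (PLINK slides the window by a step and decides which SNP of a linked pair to drop). The simulated data, the function names, and the parameter values are all assumptions made for illustration.

```python
import random


def r_squared(a, b):
    """Squared correlation between two haploid SNPs coded as 0/1 per strain."""
    n = len(a)
    pa = sum(a) / n
    pb = sum(b) / n
    pab = sum(x * y for x, y in zip(a, b)) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    d = pab - pa * pb
    return d * d / denom


def prune(genotypes, window, r2_max):
    """Greedy pruning: keep a SNP unless it exceeds r2_max with an already
    kept SNP that lies fewer than `window` SNPs before it."""
    kept = []
    for i, snp in enumerate(genotypes):
        linked = any(
            i - j < window and r_squared(snp, genotypes[j]) > r2_max
            for j in kept
        )
        if not linked:
            kept.append(i)
    return kept


def simulate(n_strains=64, n_blocks=20, block_size=10, flip=0.1, seed=1):
    """Hypothetical data: blocks of SNPs that are noisy copies of a founder."""
    rng = random.Random(seed)
    snps = []
    for _ in range(n_blocks):
        founder = [rng.randint(0, 1) for _ in range(n_strains)]
        for _ in range(block_size):
            # Flip each allele with probability `flip` to weaken the linkage.
            snps.append([g ^ (rng.random() < flip) for g in founder])
    return snps


snps = simulate()
for window, r2_max in [(50, 0.25), (50, 0.3), (50, 0.5), (5, 0.25)]:
    kept = prune(snps, window, r2_max)
    print(f"window={window} r2_max={r2_max}: {len(kept)} of {len(snps)} SNPs kept")
```

Running the same comparison on your own data, and checking whether the leading PCs are stable across a range of settings, is probably more informative than looking for a single recommended value.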

Kevin
