PCA on subset of patients, from 22 separate chromosome vcf files
1
0
Entering edit mode
3.8 years ago

Dear all,

I would like to perform a principal component analysis using plink, in order to use these PCs to adjust for genetic ancestry in a GWAS. I have 22 vcf files (per chromosome), with genotype data of ~6000 people. I would like to perform the following steps, but I struggle a bit with the correct order and commands.

  • I want to subset to ~4500 people of interest. I think I can use bcftools -S for this, using a .txt file with patient IDs that I want to include. Is this correct?
  • I am not sure whether to prune first or to merge the 22 files first. My guess is I prune the 22 chromosomes using plink first.
  • Then I would like to merge the files to yield one file with all genotype data. The options merge and concatenate confuse me a little bit, but I think that for the purpose of performing a PCA I need to go for (bcftools) merge, is this right?
  • I can then perform a pca in plink, and use the eigenvectors of the first n principal components as covariates in my GWAS. Is there a common number of PCs to be used or do I determine this based on the eigenvalues?

Hope you can help me out

Thanks in advance,
Vincent

principal-component-analysis vcf PCA • 2.7k views
ADD COMMENT
0
Entering edit mode
3.8 years ago
4galaxy77 2.9k

I'd do something like this

  1. Convert all your 22 vcfs to plink format - e.g. plink --vcf data.chr1.vcf --make-bed --out data.chr1
  2. Merge all chromosomes together e.g. plink --bfile data.chr1 --merge-list mylist.txt --make-bed --out data.allChr
  3. Subset your samples using plink - e.g. plink --bfile data.allChr --keep samples.txt --make-bed --out data.allChr.subset
  4. Then LD prune your merged in plink using --indep-pairwise command in plink
  5. Then run your PCA using plink on the merged/pruned file

other things you might want to consider are filtering for minor allele frequency --maf 0.01 and also doing some IBD checks using --genome. This is a useful tutorial

Hope that helps.

ADD COMMENT
0
Entering edit mode

Thank you very much! I'm almost there. At step 4, do I choose for --make-bed as well? This gives me three files pruned.bed, pruned.bim and pruned.fam. If I use these for pca, I get the impression that all variants are still used in pca, not only the ones that remained after pruning. But if I don't use --make-bed, and try to use the file .prune.in for pca, there is an error with not finding a .map file.

ADD REPLY
0
Entering edit mode

Step 4 will give you a list of SNPs which are in linkage equilibrium (i.e. the ones you want to keep) in a file called xxx.prune.in. You can then keep this snps by running something like plink --bfile data --extract xxx.prune.in --make-bed --out data.pruned, which will create a new set of plink files with the base-name data.prune. You should then use this for your PCA. The steps in the practical I linked should give some more detailed info.

ADD REPLY
0
Entering edit mode

Yes that's it. I tried it with one of my vcf files, as they do in the practical, which worked. For the plink files I missed this extra conversion step. Thank you!

ADD REPLY

Login before adding your answer.

Traffic: 1916 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6