Entering edit mode
3.8 years ago
Dear all,
I would like to perform a principal component analysis using plink, in order to use these PCs to adjust for genetic ancestry in a GWAS. I have 22 vcf files (per chromosome), with genotype data of ~6000 people. I would like to perform the following steps, but I struggle a bit with the correct order and commands.
- I want to subset to ~4500 people of interest. I think I can use bcftools -S for this, using a .txt file with patient IDs that I want to include. Is this correct?
- I am not sure whether to prune first or to merge the 22 files first. My guess is I prune the 22 chromosomes using plink first.
- Then I would like to merge the files to yield one file with all genotype data. The options merge and concatenate confuse me a little bit, but I think that for the purpose of performing a PCA I need to go for (bcftools) merge, is this right?
- I can then perform a pca in plink, and use the eigenvectors of the first n principal components as covariates in my GWAS. Is there a common number of PCs to be used or do I determine this based on the eigenvalues?
Hope you can help me out
Thanks in advance,
Vincent
Thank you very much! I'm almost there. At step 4, do I choose for --make-bed as well? This gives me three files pruned.bed, pruned.bim and pruned.fam. If I use these for pca, I get the impression that all variants are still used in pca, not only the ones that remained after pruning. But if I don't use --make-bed, and try to use the file .prune.in for pca, there is an error with not finding a .map file.
Step 4 will give you a list of SNPs which are in linkage equilibrium (i.e. the ones you want to keep) in a file called xxx.prune.in. You can then keep this snps by running something like
plink --bfile data --extract xxx.prune.in --make-bed --out data.pruned
, which will create a new set of plink files with the base-name data.prune. You should then use this for your PCA. The steps in the practical I linked should give some more detailed info.Yes that's it. I tried it with one of my vcf files, as they do in the practical, which worked. For the plink files I missed this extra conversion step. Thank you!