I am trying to convert the 1000G genotypes into plink format so that I can run a PCA.
I used Plink 1.9 to recode all of the vcf.gz files to binary bed files, and I am now using --merge-list
to merge the chromosomes into one file. Should I be worried about the warnings about variants with multiple positions? If that is an issue, why was it not mentioned during the vcf to plink conversion? And how can an rsID have more than one position, unless the warning means more than one base pair, as with a structural variant? I am also not sure what the "multiple chromosomes seen" warning means, unless it is an error.
I also assume that I should merge my case population with the 1kG dataset and then prune the merged data by LD. After that, can I use plink to make an MDS plot, or should I use GCTA?
Just saw this: https://groups.google.com/forum/#!topic/plink2-users/RNztDLWCfB8
I guess those SNPs in 1kG are multiallelic?
Actually, going back, I saw that the vcf to plink conversion already filters for biallelic loci only, so I don't understand how I would get multiallelic sites...
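One way to see what the merge warnings refer to is to scan the .bim files for variant IDs that appear at more than one location. The following is a minimal sketch, not part of plink. The function name `find_conflicting_ids` and the example rows are made up for illustration, and it assumes the usual six-column, whitespace-delimited .bim layout (chromosome, variant ID, cM position, bp coordinate, allele 1, allele 2).

```python
from collections import defaultdict


def find_conflicting_ids(bim_lines):
    """Map each variant ID seen at more than one (chromosome, bp) location
    to the sorted list of those locations.

    Hypothetical helper for illustration; assumes six-column .bim rows.
    """
    locations = defaultdict(set)
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        chrom, var_id, _cm, bp = fields[:4]
        locations[var_id].add((chrom, int(bp)))
    # Keep only IDs that were recorded at two or more distinct locations.
    return {vid: sorted(locs) for vid, locs in locations.items() if len(locs) > 1}


# Made-up rows: rs0001 is repeated at the same position (different alleles),
# rs0002 appears on two chromosomes, rs0003 at two positions on one chromosome.
example = [
    "1\trs0001\t0\t1000\tA\tG",
    "1\trs0001\t0\t1000\tA\tT",
    "1\trs0002\t0\t2000\tC\tT",
    "2\trs0002\t0\t500\tC\tT",
    "1\trs0003\t0\t3000\tG\tA",
    "1\trs0003\t0\t3050\tG\tA",
]
conflicts = find_conflicting_ids(example)
for var_id, locs in sorted(conflicts.items()):
    print(var_id, locs)
```

For real data, the same function can be given the lines of all per-chromosome .bim files together, since an ID that shows up on two chromosomes will only be visible when the files are scanned as one set.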
For multiallelic sites, Plink 1.9 defaults to keeping only the reference allele and the most common alternate allele; any call involving a lower-frequency alt allele is treated as missing data. If you want such sites to be skipped entirely, you need to add the --biallelic-only flag.
I see. So, if I understand you correctly, even though it says it is filtering for biallelic loci, it is really just setting calls with the third allele to missing? And if I use the --biallelic-only flag, it will skip that SNV entirely?
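To make the difference between the two behaviours concrete, here is a toy model in Python. It is only a sketch of the rule as described above, not plink's actual code. The function name `load_site`, the `biallelic_only` argument, and the example calls are all hypothetical, and genotypes are represented as pairs of allele indices (0 = reference, 1 and up = alternate alleles).

```python
from collections import Counter


def load_site(genotypes, biallelic_only=False):
    """Toy model of loading one site.

    genotypes: list of (a1, a2) allele-index pairs, 0 = REF, 1+ = ALT.
    Returns None if the whole site is skipped; otherwise a list in which
    calls involving a dropped alternate allele are replaced by None (missing).
    """
    # Count how often each alternate allele is actually observed.
    alt_counts = Counter(a for call in genotypes for a in call if a > 0)
    if len(alt_counts) > 1 and biallelic_only:
        return None  # skip the multiallelic site entirely
    if not alt_counts:
        return list(genotypes)  # only the reference allele is observed
    kept_alt = alt_counts.most_common(1)[0][0]
    kept = {0, kept_alt}
    # Calls that use any allele outside {REF, most common ALT} become missing.
    return [call if set(call) <= kept else None for call in genotypes]


# Made-up calls at a site with two alternate alleles (1 is more common than 2).
calls = [(0, 1), (1, 1), (0, 2), (0, 0), (1, 2)]
default_result = load_site(calls)
skipped_result = load_site(calls, biallelic_only=True)
print(default_result)
print(skipped_result)
```

In the default case the site stays in the dataset but the two calls carrying allele 2 are set to missing; with the skip option the site is not loaded at all.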
While browsing around Google and your threads, I found the genetics for fun blog, which has exactly what I needed. It was not easy to find despite the obvious title, so I'll post it here for future reference:
http://apol1.blogspot.com/2014/11/best-practice-for-converting-vcf-files.html
Also, what are your thoughts on using GATK's VariantsToBinaryPed for the vcf to plink conversion?