Can anyone recommend a good software for doing Principal Component Analysis from data in VCF file format, or the most straightforward format to convert the VCF into for doing PCA. I hear that Plink is quite suitable for this. I also have some experience using eigenstrat for SNP data but have no experience using eigenstrat with whole genome VCF encoded data. Any tips or experience appreciated.
I agree, it would be nice to have a tool that does this. Meanwhile, You could create your own file with values of 0, 1, or 2 for homozygous ref, het, homoz alt. Then it'd be simple to use a standard PCA library to do the reduction.
The original question was posted almost 8 years ago. You may consider creating a new question relating to your specific issue. Also, if you choose to do this, then provide a lot more details and show the code that you have already used.
Note that there are many other questions already online (on other forums) where it is explained how to colour vectors differently in a plot.
Experienced the same issue. It seem the problem is that by default, chromosome names are not in the form "chr1" etc., but just "1" etc. The solution is to use function snpgdsOption() to redefine your chromosome names to whatever form they are in your vcf file :
snpgdsVCF2GDS(vcf, "ccm.gds", method="copy.num.of.ref", option=snpgdsOption(chr1=1, chr2=2, chr3=3, chr4=4, chr5=5, chr6=6, chr7=7, chr8=8, chr9=9, chr10=10, chr11=11, chr12=12, chr13=13, chr14=14, chr15=15, chr16=16, chr17=17, chr18=18, chr19=19, chr20=20, chr21=21, chr22=22, chrX=23, chrY=24, chrM=25))
Another solution is to add autosome.only=FALSE in snpgdsPCA() - it then takes all your chromosomes whatever their names are.
Thank you - this was very helpful. After struggling with Eigenstrat, I managed to produce a nice graph with this R package. I am relatively new to R and this is my question: Is there a way of labelling the different individuals in order to be able to distinguish the outliers in the graph?
I was able to run the PCA without errors and I got a nice plot. But I want to specify my two populations. How can I create the population file? Or do I need to add this info in the GDS file?
Thanks!
Hi, I found this thread really helpful . My data is grouped in 4 different populations and I managed to get them labelled as such by doing something similar to the following. Following on from Zev's code above....
In a text file, list the group name of each sample in sample.id, one group per line with each line corresponding to the group of the corresponding sample. For example if Sample 1 and 2 in sample.id were in 'Group 1' and Sample 3 - 5 were in 'Group 2', you'd have:
Group1
Group1
Group2
Group2
Group2
Save the file as 'pops.txt' and then:
>pop_code <- scan("pops.txt", what=character())
>cbind( sample.id, pop_code)
>ccm_pca<-snpgdsPCA(genofile, autosome.only=FALSE, num.thread=4)
>tab <- data.framesample.id = ccm_pca$sample.id,
pop = factor(pop_code)[match(ccm_pca$sample.id, sample.id)],
EV1 = ccm_pca$eigenvect[,1], # the first eigenvector
EV2 = ccm_pca$eigenvect[,2], # the second eigenvector
stringsAsFactors = FALSE)
>plot(tab$EV2, tab$EV1, col=as.integer(tab$pop), xlab="eigenvector 2", ylab="eigenvector 1")
legend("bottomright", legend=levels(tab$pop), pch="o", col=1:nlevels(tab$pop))
SNPRelate is an R package that is able to read from VCF files directly and perform PCA and IBD/IBS. According to the documentation, it runs 10-45x faster than EIGENSTRAT (v3.0) and PLINK (v1.07) respectively.
Update (Oct 2014): The package seems to be moved to GitHub (link)
Neilfws I used a multi-individual Unified Genotyper VCF as an input. I can't quite remember what the error was, but I can remember how I got it to work. See revised post above.
Thanks for that. I had no joy with the VCF; I converted to PLINK bed format using vcftools then used snpgdsBED2GDS() in SNPRelate. That converted to GDS no problem.
which version if R can be ran into?
I got this error:
library("SNPRelate")
Error in library("SNPRelate") : there is no package called ‘SNPRelate’
install.packages("SNPRelate")
Installing package into ‘/Users/ib7/Library/R/3.4/library’
(as ‘lib’ is unspecified)
Warning in install.packages :
package ‘SNPRelate’ is not available (for R version 3.4.2)
I recently developed a brand new pca analysis software MingPCACluster that can go from vcf to pca and graph( (VCF2PCA and figture)). Very fast and low memory, accurate and very precise
### run without pop.info
# ./bin/MingPCACluster -InVCF Khuman.vcf.gz -OutPut OUT
### run with pop.info
./bin/MingPCACluster -InVCF Khuman.vcf.gz -OutPut OUT -InSampleGroup pop.info
Sure! - take a look here: Produce PCA bi-plot for 1000 Genomes Phase III in VCF format
I agree, it would be nice to have a tool that does this. Meanwhile, You could create your own file with values of 0, 1, or 2 for homozygous ref, het, homoz alt. Then it'd be simple to use a standard PCA library to do the reduction.