Question

Pca From Vcf Files

17

Entering edit mode

13.2 years ago

Rubal7 ▴ 850

Can anyone recommend a good software for doing Principal Component Analysis from data in VCF file format, or the most straightforward format to convert the VCF into for doing PCA. I hear that Plink is quite suitable for this. I also have some experience using eigenstrat for SNP data but have no experience using eigenstrat with whole genome VCF encoded data. Any tips or experience appreciated.

Many thanks,
Rubal

pca genome vcf • 45k views

ADD COMMENT • link updated 3.0 years ago by hewm2008 ▴ 50 • written 13.2 years ago by Rubal7 ▴ 850

1

Entering edit mode

Sure! - take a look here: Produce PCA bi-plot for 1000 Genomes Phase III in VCF format

ADD REPLY • link 7.2 years ago by Kevin Blighe 89k

0

Entering edit mode

I agree, it would be nice to have a tool that does this. Meanwhile, You could create your own file with values of 0, 1, or 2 for homozygous ref, het, homoz alt. Then it'd be simple to use a standard PCA library to do the reduction.

ADD REPLY • link 13.2 years ago by brentp 24k

Leszek · Answer 1 · 2012-05-11

20

Entering edit mode

13.2 years ago

Zev.Kronenberg 12k

You can use VCFtools to make a PED and MAP file from VCF. This is PLINK format. Many PCA programs take PLINK input or offer conversion scripts.

I ended up using SNPRelate. After some silly errors here is how I got it to work:

setwd("/xxx/pca")
library("SNPRelate")
vcf.fn<-"~/xxx/tmp.vcf"
snpgdsVCF2GDS(vcf.fn, "ccm.gds",  method="biallelic.only")
genofile <- openfn.gds("ccm.gds")
ccm_pca<-snpgdsPCA(genofile)
plot(ccm_pca$eigenvect[,1],ccm_pca$eigenvect[,2] ,col=as.numeric(substr(ccm_pca$sample, 1,3) == 'CCM')+3, pch=2)

ADD COMMENT • link 12.0 years ago by Zev.Kronenberg 12k

1

Entering edit mode

This works absolutely well. If anyone can share how to change the shape and color to distinguish between the samples taken would be very helpful.

ADD REPLY • link 5.4 years ago by tothepoint ▴ 940

0

Entering edit mode

The original question was posted almost 8 years ago. You may consider creating a new question relating to your specific issue. Also, if you choose to do this, then provide a lot more details and show the code that you have already used.

Note that there are many other questions already online (on other forums) where it is explained how to colour vectors differently in a plot.

ADD REPLY • link 5.4 years ago by Kevin Blighe 89k

0

Entering edit mode

I was wondering if you were able to find any assistance regarding this? I am not easily able to change the color vectors on my plot! Thank you!

Update: I believe I found a specific solution for the code posted above and will share here after testing out.

Update: To change the color palette after plotting you can define a color vectors as such:

jColors <- c('chartreuse3', 'cornflowerblue', 'darkgoldenrod1', 'peachpuff3',
         'mediumorchid2', 'turquoise3', 'wheat4', 'slategray2') ## choosing specific colors for your plot!

Then include your color palette in your plot. I assigned colors based on the 8 populations in my PCA.

plot(tab_popr$EV2, tab_popr$EV1, col=jColors[as.integer(tab_popr$pop)], xlab = "EV2", ylab = "EV1")
legend("topleft", legend = levels(tab_popr$pop), pch = "o", col = jColors[1:nlevels(tab_popr$pop)])

ADD REPLY • link 3.9 years ago by nataliagru1 ▴ 90

0

Entering edit mode

1000 Genomes also has a tool for producing plink http://www.1000genomes.org/vcf-ped-converter

ADD REPLY • link 13.2 years ago by Laura ★ 1.8k

0

Entering edit mode

tried it, and after ccm_pca<-snpgdsPCA(genofile) got: Removing 3604 non-autosomal SNPs. Error in snpgdsPCA(genofile) : There is no SNP!. Any idea?

ADD REPLY • link 11.5 years ago by Leszek 4.2k

2

Entering edit mode

Experienced the same issue. It seem the problem is that by default, chromosome names are not in the form "chr1" etc., but just "1" etc. The solution is to use function snpgdsOption() to redefine your chromosome names to whatever form they are in your vcf file : snpgdsVCF2GDS(vcf, "ccm.gds", method="copy.num.of.ref", option=snpgdsOption(chr1=1, chr2=2, chr3=3, chr4=4, chr5=5, chr6=6, chr7=7, chr8=8, chr9=9, chr10=10, chr11=11, chr12=12, chr13=13, chr14=14, chr15=15, chr16=16, chr17=17, chr18=18, chr19=19, chr20=20, chr21=21, chr22=22, chrX=23, chrY=24, chrM=25))

Another solution is to add autosome.only=FALSE in snpgdsPCA() - it then takes all your chromosomes whatever their names are.

ADD REPLY • link updated 11.3 years ago by Leszek 4.2k • written 11.4 years ago by jockbanan ▴ 440

2

Entering edit mode

you meant autosome.only=FALSE ? since TRUE returns the same error ("Removing 362090 non-autosomal SNPs. - there is no SNP")

ADD REPLY • link 11.3 years ago by User 1933 ▴ 360

0

Entering edit mode

that works. thx!

ADD REPLY • link 11.3 years ago by Leszek 4.2k

0

Entering edit mode

Hi All,

Can you please explain about this line:

plot(ccm_pca$eigenvect[,1],ccm_pca$eigenvect[,2] ,col=as.numeric(substr(ccm_pca$sample, 1,3) == 'CCM')+3, pch=2)

ADD REPLY • link updated 3.8 years ago by Ram 45k • written 11.1 years ago by always_learning ★ 1.2k

0

Entering edit mode

ccm_pca$eigenvect[,1],ccm_pca$eigenvect[,2] implies that you are plotting between eigen vectors 1 and eigen vectors 2 ... PC1 and PC2...

ADD REPLY • link 11.0 years ago by GouthamAtla 12k

0

Entering edit mode

Thank you - this was very helpful. After struggling with Eigenstrat, I managed to produce a nice graph with this R package. I am relatively new to R and this is my question: Is there a way of labelling the different individuals in order to be able to distinguish the outliers in the graph?

ADD REPLY • link 10.8 years ago by zinzin.steenkamp • 0

0

Entering edit mode

Have you find how to manage it ?

ADD REPLY • link 9.3 years ago by Picasa ▴ 690

0

Entering edit mode

I was able to run the PCA without errors and I got a nice plot. But I want to specify my two populations. How can I create the population file? Or do I need to add this info in the GDS file? Thanks!

ADD REPLY • link 9.1 years ago by CB ▴ 20

0

Entering edit mode

Hi, I found this thread really helpful . My data is grouped in 4 different populations and I managed to get them labelled as such by doing something similar to the following. Following on from Zev's code above....

>genofile <- snpgdsOpen("ccm.gds")
>sample.id <- read.gdsn(index.gdsn(genofile, "sample.id"))
>sample.id
[1] "Sample1"   
[2] "Sanple2"  
[3] "Sample3"  
[4] "Sample4" 
[5] "Sample5"

In a text file, list the group name of each sample in sample.id, one group per line with each line corresponding to the group of the corresponding sample. For example if Sample 1 and 2 in sample.id were in 'Group 1' and Sample 3 - 5 were in 'Group 2', you'd have:

Group1
Group1
Group2
Group2
Group2

Save the file as 'pops.txt' and then:

>pop_code <- scan("pops.txt", what=character())
>cbind( sample.id, pop_code)

>ccm_pca<-snpgdsPCA(genofile, autosome.only=FALSE, num.thread=4)

>tab <- data.framesample.id = ccm_pca$sample.id,
              pop = factor(pop_code)[match(ccm_pca$sample.id, sample.id)],
              EV1 = ccm_pca$eigenvect[,1],    # the first eigenvector
              EV2 = ccm_pca$eigenvect[,2],    # the second eigenvector
              stringsAsFactors = FALSE)

>plot(tab$EV2, tab$EV1, col=as.integer(tab$pop), xlab="eigenvector 2", ylab="eigenvector 1")
legend("bottomright", legend=levels(tab$pop), pch="o", col=1:nlevels(tab$pop))

And that should do it!

ADD REPLY • link 9.0 years ago by graham.etherington • 0

0

Entering edit mode

Just tried it but got some error...

tab <- data.framesample.id = ccm_pca$sample.id Error in (tab <- data.framesample.id) = ccm_pca$sample.id : object 'tab' not found

What is data.framesample.id exactly?

ADD REPLY • link 8.4 years ago by nwhza12_psk015 • 0

0

Entering edit mode

It seems that the above line

tab <- data.framesample.id = ccm_pca$sample.id,

misses a parenthesis. Here is the correct code:

tab <- data.frame(sample.id = ccm_pca$sample.id,pop = factor(pop_code)[match(ccm_pca$sample.id, sample.id)],EV1 = ccm_pca$eigenvect[,1], EV2 = ccm_pca$eigenvect[,2], stringsAsFactors = FALSE)

ADD REPLY • link 4.0 years ago by Manuel Corpas ▴ 270

score 12 · Answer 2 · 2012-09-12

12

Entering edit mode

12.9 years ago

sa9 ▴ 870

SNPRelate is an R package that is able to read from VCF files directly and perform PCA and IBD/IBS. According to the documentation, it runs 10-45x faster than EIGENSTRAT (v3.0) and PLINK (v1.07) respectively.

Update (Oct 2014): The package seems to be moved to GitHub (link)

ADD COMMENT • link 10.8 years ago by sa9 ▴ 870

0

Entering edit mode

thanks, that's a good find!

ADD REPLY • link 12.9 years ago by brentp 24k

0

Entering edit mode

Just tried it. It hates my Unified genotyper VCF??

ADD REPLY • link 12.1 years ago by Zev.Kronenberg 12k

1

Entering edit mode

What is the error message? Can you post part of the VCF?

ADD REPLY • link 12.1 years ago by sa9 ▴ 870

0

Entering edit mode

There was no problem with the function, it was user error :-).

ADD REPLY • link 12.1 years ago by Zev.Kronenberg 12k

0

Entering edit mode

Can you elaborate? I get "file has different number of columns" with a UnifiedGenotyper VCF. Is there a fix, or this type of VCF not suitable?

ADD REPLY • link 12.0 years ago by Neilfws 49k

0

Entering edit mode

Neilfws I used a multi-individual Unified Genotyper VCF as an input. I can't quite remember what the error was, but I can remember how I got it to work. See revised post above.

ADD REPLY • link 12.0 years ago by Zev.Kronenberg 12k

0

Entering edit mode

Thanks for that. I had no joy with the VCF; I converted to PLINK bed format using vcftools then used snpgdsBED2GDS() in SNPRelate. That converted to GDS no problem.

ADD REPLY • link 12.0 years ago by Neilfws 49k

0

Entering edit mode

This sounds great

ADD REPLY • link 11.3 years ago by Rubal7 ▴ 850

0

Entering edit mode

which version if R can be ran into? I got this error:

library("SNPRelate") Error in library("SNPRelate") : there is no package called ‘SNPRelate’ install.packages("SNPRelate") Installing package into ‘/Users/ib7/Library/R/3.4/library’ (as ‘lib’ is unspecified) Warning in install.packages : package ‘SNPRelate’ is not available (for R version 3.4.2)

any advice? thanks

ibseq

ADD REPLY • link 7.7 years ago by ibseq12 • 0

0

Entering edit mode

SNPRelate has been removed from CRAN. You need to install it from Bioconductor now.

ADD REPLY • link 7.7 years ago by Neilfws 49k

0

Entering edit mode

How can we add label in PCA plot generated by snprelate to see samples?

ADD REPLY • link 6.7 years ago by waqaskhokhar999 ▴ 160

score 5 · Answer 3 · 2016-06-08

5

Entering edit mode

9.1 years ago

mkulecka ▴ 380

While this thread in very old, I think it would be useful to add that PLINK now directly supports vcfs link to new (1.9) version of PLINK. .

ADD COMMENT • link 9.1 years ago by mkulecka ▴ 380

0

Entering edit mode

It makes things easier than ever. plink --pca --allow-extra-chr --vcf samples.vcf

ADD REPLY • link 8.6 years ago by Xiaowei • 0

0

Entering edit mode

How would you plot the resulting output?

ADD REPLY • link 7.4 years ago by BeetleCheers • 0

1

Entering edit mode

With R it's pretty easy:


library(ggplot2)
library("ggrepel")

df<-read.delim("plink.eigenvec") #read in eigenvectors

eigens<-read.delim("plink.eigenval",header=F) #read in eignevalues

sum_eig<-sum(eigens$V1)

sum_eigs<-lapply(eigens$V1,function(x){
  rt<-(x/sum_eig)*100
  rt<-round(rt)
  return(rt)
})

ggplot(df, aes(PC1, PC2, color=condition)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",sum_eigs[[1]],"% variance")) +
  ylab(paste0("PC2: ",sum_eigs[[2]],"% variance"))+geom_text_repel(aes(label=SampleID), size=3) #The part with geom_text_repel adds easy-readable labels.

I have slightly modified my eigenvectors file I have added SampleID and condition columns.

ADD REPLY • link 7.4 years ago by mkulecka ▴ 380

score 0 · Answer 4 · 2022-08-04

I recently developed a brand new pca analysis software MingPCACluster that can go from vcf to pca and graph( (VCF2PCA and figture)). Very fast and low memory, accurate and very precise

https://github.com/hewm2008/MingPCACluster

### run without pop.info
     #   ./bin/MingPCACluster   -InVCF  Khuman.vcf.gz   -OutPut OUT
### run with  pop.info
    ./bin/MingPCACluster    -InVCF  Khuman.vcf.gz   -OutPut OUT -InSampleGroup  pop.info